We hear a lot about big data these days, and its use to guide decisions by public agencies and private firms and even individuals. But most of the discussion concerns finding the central tendency in a data set or relationships among data sets. We seek to find the median family income in an area, or the background of a typical customer or client. We seek the most frequent or “normal” relationship between age and interest in our product or need for our services. We want to serve the bulk of our customers or citizens.
Anomalies and Averages
Yet often we would be well advised to devote attention to the anomalies—those measurements, items, or individuals far from the center of the normal pattern or relationship. One common example is crime prevention, where we seek to identify and stop behavior way worse than normal, often by spotting non-criminal behavior or characteristics that are themselves way beyond normal. One finds cybercrime and software glitches most often by looking for edge cases, places where the data is far from normal. Note that in the computer world, far from normal may be a tiny offset in a situation that should have no offset, as when a Berkeley computer scientist followed a tiny error that uncovered worldwide cybercriminal activity. Many anomalies show up due simply to errors in measurement, but finding these can at least lead to improved techniques for measurement.
Thankfully, many of the same statistics we use to find central tendencies and patterns can also help us find anomalies. To begin, however, we start with distributions.
Three Types of Distributions of Data
Many data sets themselves follow patterns that statisticians call distributions.
The Normal Distribution (The Bell Curve)
We have all heard of the normal distribution or bell curve, which is a shape for data we often find when the causes for variation are random or so numerous that they appear random.
Source: U.S. Census Bureau, U.S. Bureau of Labor Statistics, FFIEC. Values are adjusted for local cost of living and to the national inflation rate between 1990 and 2015. Statistical category values are depicted below.
The columns represent the number of Metropolitan Statistical Areas (MSAs) that fall within that range. Here are the values for each distribution category.
- Between minimum and Tukey low value (Tukey values are explained in detail below)
- Between Tukey low and two standard deviations below the mean
- Between two standard deviations below the mean and one standard deviation below the mean
- Between one standard deviation below the mean and the first quartile
- Between the first quartile and the median
- Between the median and the third quartile
- Between the third quartile and one standard deviation above the mean
- Between one standard deviation above the mean and two standard deviations above the mean
- Between two standard deviations above the mean and Tukey high
- Between Tukey high and the maximum
The height of adult humans is a common example. It also turns out that the deviations from a pattern (say, the relationship between population size and median family income) often fall into this shape.
The Exponential Distribution
Another common distribution is the exponential distribution. It looks like a ski jump from low to high or vice versa.
Source: U.S. Census Bureau.
Source: U.S. Census Bureau. Statistical category values are depicted above.
Population sizes often follow this pattern–many small cities, fewer middle-sized ones, and a few very large ones. The same distribution shows up in populations of counties and metro areas (which are collections of counties deemed economically connected). It also shows up in arrival times at places like toll booths and football stadiums, and in time to failure in computer hard drives.
The U-Shaped Distribution
A third distribution occurs much less frequently but is often important when it does. That is the U-shaped distribution, where the data points are fewest in the middle. The voting records compiled for Congressmen have moved more into this pattern in recent years—either high or low depending on who is doing the measuring, but fewer with medium scores than in past years.
Source: 2016 American Conservative Union data.
Source: Mike Martin analysis of 2007 American Civil Liberties Union data.
Another common case is the time to the first failure on many machines; they either have manufacturing flaws that show up early, or they last until their weakest parts wear out. Note that the techniques we describe below focus on finding anomalies at the high or low end; we would have to do something different when the anomalies might be in the middle. There are ways to do this, involving recalculating the data to make it more like a normal distribution, or splitting the data set into two, but they are beyond the scope of this article.
If we have data in a normal distribution, we have at least two ways of identifying anomalies.
The Standard Deviation Technique
The first relies on the fact that, in a normal distribution, approximately 68% of the data points lie within one standard deviation of the arithmetic mean. (The arithmetic mean is the “average” we learned in grade school: Add the measurements and divide by the number of measurements. The standard deviation is a well-established way of measuring how much the data points vary from the average. One finds the distance from the average for each data point, squares each distance, adds the squares together, divides that sum by the number of data points to get the variance, and then takes the square root of the variance.)
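The steps in that parenthetical can be sketched in a few lines of Python. This is only an illustration: the function name `mean_and_std` and the sample values are invented for the example. It divides by the number of points, as the description above does; many statistical tools divide by one less than that when working with a sample.

```python
import math

def mean_and_std(values):
    """Return (mean, standard deviation), following the steps described in the text."""
    n = len(values)
    mean = sum(values) / n                                  # add up, divide by the count
    squared_distances = [(v - mean) ** 2 for v in values]   # squared distance of each point from the mean
    variance = sum(squared_distances) / n                   # average of the squared distances
    return mean, math.sqrt(variance)                        # square root of the variance

# Hypothetical measurements, chosen so the arithmetic is easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, std = mean_and_std(sample)
print(mean, std)  # prints 5.0 2.0
```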
About 95% usually lie within two standard deviations of the mean. Points beyond either the one- or two-standard-deviation boundary, on the high or low side, are candidates for anomalies, which might be errors in measurement or might be something we should study carefully. Although these percentages do not hold for non-normal distributions, like population size, the same calculations have often given us useful results.
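As a rough sketch of how this screening might look in practice, the Python below flags every value more than `k` standard deviations from the mean. The function name `std_candidates`, the parameter `k`, and the readings are all hypothetical, and the sketch uses the standard library's `statistics.pstdev`, which divides by the number of points.

```python
import statistics

def std_candidates(values, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)            # population standard deviation
    low, high = mean - k * std, mean + k * std
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one suspicious value at the end.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
print(std_candidates(readings, k=2))  # prints [30]
```

Setting `k=1` gives the looser "clearly above or below average" screen; `k=2` gives the tighter one. Anything the function returns is a candidate to investigate, not a verdict.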
The Tukey Technique
John Tukey, who was a Professor of Statistics at Princeton University, defined an alternative and much stricter standard for anomalies. His standard flags any point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. A quartile is like a median, but instead of dividing the data points in half, it divides them into fourths. The first quartile has one-fourth of the data points below it and three-fourths above; the second quartile is the median, with one-half of the data points below it and one-half above; the third quartile has three-fourths below it and one-fourth above. The interquartile range is the difference between the third quartile and the first quartile.
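For readers who want to see the quartiles computed, here is a small Python sketch using the standard library's `statistics.quantiles`. The data are invented, and the choice of `method="inclusive"` is an assumption: there are several conventions for interpolating quartiles, and they can give slightly different answers on small data sets.

```python
import statistics

# Hypothetical data: the whole numbers 1 through 9.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into fourths.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: third quartile minus first quartile

print(q1, median, q3, iqr)  # prints 3.0 5.0 7.0 4.0
```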
That range is a favorite measure of spread for Tukey; he suggests we plot it for any distribution of interest. To apply his definition, multiply the range by 1.5, subtract the result from the first quartile, and add it to the third quartile; any data points beyond those two limits are labeled anomalies. Note that for most close-to-normal distributions, this standard is far stricter than one or two standard deviations from the mean: it will identify far fewer data points. For close-to-exponential distributions, however, it will rarely identify any points on the most populous side of the distribution (whether high or low), and on the long-tail side it can identify many more points than even the one standard deviation method, since a few extreme values inflate the standard deviation but barely move the quartiles.
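The Tukey limits can be sketched the same way. In the Python below, the function names `tukey_fences` and `tukey_candidates` and the readings are hypothetical, and the quartiles again use the "inclusive" interpolation convention, which is an assumption rather than anything the technique requires.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) limits: k times the interquartile range beyond Q1 and Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def tukey_candidates(values, k=1.5):
    """Return the values that fall outside the Tukey limits."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical readings with one suspicious value at the end.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
print(tukey_fences(readings))      # prints (8.5, 12.5)
print(tukey_candidates(readings))  # prints [30]
```

Because the limits come from the quartiles, the single extreme reading barely moves them; the same reading would noticeably widen limits based on the standard deviation.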
A Distribution from a Relationship Among Two or More Variables
So far, we have discussed using these techniques on a single data set. But note that deviation from a pattern (for example, a regression line calculated between two data sets) is itself a distribution, and often one closer to normal than either of the original data sets. Finding the pattern in these deviations is often useful. (If they are markedly bigger at one end of the range than the other, that calls the relationship itself into question.) But for our purposes, we can apply these techniques to the distribution of deviations to identify anomalies we might want to investigate further. For instance, why is a particular city so far above or below the expected relationship between population size and median income?
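To make the idea concrete, the following Python sketch fits an ordinary least-squares line to two variables, measures each point's deviation from the line, and then applies the standard deviation screen to those deviations. Everything here is illustrative: the function names, the cutoff `k`, and the made-up data are assumptions, and a least-squares line is only one of many ways to define the "pattern."

```python
import statistics

def residuals(xs, ys):
    """Deviation of each y from a least-squares line fitted through (xs, ys)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def residual_candidates(xs, ys, k=2.0):
    """Indexes of points whose deviation exceeds k standard deviations of all deviations."""
    res = residuals(xs, ys)
    std = statistics.pstdev(res)  # least-squares residuals average to zero
    return [i for i, r in enumerate(res) if abs(r) > k * std]

# Hypothetical data: y is exactly 2x, except one point pushed well above the line.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
ys[6] += 15

print(residual_candidates(xs, ys))  # prints [6]
```

The flagged index points back to a row in the original data, such as the particular city worth a closer look.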
We use either or both techniques, depending on the data involved and our purpose in looking for anomalies. The Tukey technique, for instance, is a useful first check for measurement error, and for situations where extreme deviation is of interest. The one standard deviation technique is a useful check for data points, often individuals or places, that are “clearly above (or below) average.” One example might be prospective employees who took a test where the scores were “normalized” (meaning recalculated to put them into a normal distribution). If you believe your test, drop consideration of those “clearly below normal” and start further scrutiny with those “clearly above normal.”
Business and Policy Implications
So what? How would a public or private organization use the concept and techniques of anomaly detection? We identify two major ways.
Abnormal Is More Important Than Normal
In many situations the anomalies are more important than the averages. First are those where we are selecting one or a few out of a pool—job applicants, service vendors, advertising approaches, photographs from a shoot. We always want the better than average or the best, not the middle of the pool. We want to make sure the anomaly is not simply a mismeasurement, but an actual attribute or performance that is outstanding.
Second are those where we are most concerned with preventing or containing “bad” anomalies—law enforcement, cybersecurity, product quality control, health inspection. We appreciate the better than average, but we are most interested in the way less than average, the abnormal in bad ways.
Other examples include monitoring for health, safety, or fraud. In elections, for instance, monitors devote most of their attention to where the turnout was unusually large or small, or the party split was unusual in either direction. Rainfall is of most concern when it is way above average or way below average. The same is true for temperature and for outdoor sounds. Hunters in the wild need to be most careful when normal sounds cease; that generally means a large predator is nearby. Those monitoring city noises are generally grateful for less than average noise and looking for times and places when the noise is unusually loud.
When Abnormal Adds Value to Normal
We focus on the average, but we can also gain considerable value by considering the extremes. A political campaign spends a lot of time identifying and working with its “average voter” (who, in a highly partisan contest, might differ greatly from an opponent’s average voter—remember the U-shaped distribution of Congressional voting). But for financial donors and campaign volunteers, the campaign wants to find and work with those who are extremely supportive. The campaign might also want to identify those least likely to vote for the candidate to avoid exciting them into voting or otherwise supporting the other side. To find more of these extremes, it makes sense to study the ones already found.
Similarly, a sales campaign will work with its average customer but will seek testimonials and referrals from its extremely favorable buyers and might seek to avoid bad reviews or other negative activity from its least favorable. Traffic designers need to understand the normal range of vehicles and their speeds, but also prepare for the extremes, both stopped and speeding vehicles and the high end number of vehicles. Indeed, anyone sizing anything, or designing it to withstand variable use or weather, needs to consider both the range of normal conditions and the extremes. A few months before we wrote this piece, floods around the Gulf of Mexico highlighted this point.
The point of this article is not to teach statistics at any level, but to get us all thinking about how we use data to guide our decisions. Beyond finding patterns and relationships, we can use data to help us detect anomalies.
- Exploratory Data Analysis, by John Wilder Tukey.
- Introduction to Anomaly Detection, by Pramit Choudhary.
- Anomaly Detection: A Survey, by Varun Chandola, Arindam Banerjee, and Vipin Kumar.
- Introduction to Anomaly Detection: Concepts and Techniques, by Srinath Perera.
- Math is Fun.