Tuesday, January 7, 2020
Determining Outliers in Statistics
Outliers are data values that differ greatly from the majority of a set of data. These values fall outside of an overall trend that is present in the data. Finding outliers by inspection alone is difficult. Although it is easy to see, possibly with a stemplot, that some values differ from the rest of the data, how different does a value have to be to be considered an outlier? We will look at a specific measurement that gives us an objective standard for what constitutes an outlier.

Interquartile Range

The interquartile range is what we can use to determine whether an extreme value is indeed an outlier. It is based upon part of the five-number summary of a data set, namely the first quartile and the third quartile. Calculating it takes a single arithmetic operation: subtract the first quartile from the third quartile. The resulting difference tells us how spread out the middle half of our data is.

Determining Outliers

Multiplying the interquartile range (IQR) by 1.5 gives us a way to determine whether a certain value is an outlier. If we subtract 1.5 x IQR from the first quartile, any data values that are less than this number are considered outliers. Similarly, if we add 1.5 x IQR to the third quartile, any data values that are greater than this number are considered outliers.

Strong Outliers

Some outliers show extreme deviation from the rest of a data set. In these cases we can take the steps from above, changing only the number that we multiply the IQR by, and define a certain type of outlier. If we subtract 3.0 x IQR from the first quartile, any point that is below this number is called a strong outlier. In the same way, if we add 3.0 x IQR to the third quartile, any point that is greater than this number is also a strong outlier.
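The cutoffs described so far can be sketched in a few lines of Python. This is only an illustration: the function name outlier_fences and the sample data are hypothetical, and the quartiles come from the standard library's statistics.quantiles with its default method, which is an assumption, since other quartile conventions can give slightly different values on some data sets.

```python
from statistics import quantiles


def outlier_fences(data):
    """Return (q1, q3, iqr, inner, outer); inner and outer are (low, high) cutoffs."""
    # quantiles(..., n=4) returns the three quartile cut points [Q1, median, Q3].
    # Assumption: Python's default "exclusive" quartile method is acceptable here.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1  # spread of the middle half of the data
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # values beyond these are outliers
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # values beyond these are strong outliers
    return q1, q3, iqr, inner, outer


# Hypothetical sample data, used only to show the call.
q1, q3, iqr, inner, outer = outlier_fences([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(q1, q3, iqr)  # 3.0 9.0 6.0
print(inner)        # (-6.0, 18.0)
print(outer)        # (-15.0, 27.0)
```

Returning both pairs of cutoffs keeps the 1.5 and 3.0 multipliers side by side, so a value can be compared against either one.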
Weak Outliers

Besides strong outliers, there is another category of outlier. If a data value is an outlier, but not a strong outlier, then we say that the value is a weak outlier. We will look at these concepts by exploring a few examples.

Example 1

First, suppose that we have the data set {1, 2, 2, 3, 3, 4, 5, 5, 9}. The number 9 certainly looks like it could be an outlier. It is much greater than any other value in the set. To objectively determine whether 9 is an outlier, we use the above methods. The first quartile is 2 and the third quartile is 5, which means that the interquartile range is 3. We multiply the interquartile range by 1.5, obtaining 4.5, and then add this number to the third quartile. The result, 9.5, is greater than any of our data values. Subtracting 4.5 from the first quartile gives -2.5, which is less than any of our data values. Therefore there are no outliers.

Example 2

Now we look at the same data set as before, with the exception that the largest value is 10 rather than 9: {1, 2, 2, 3, 3, 4, 5, 5, 10}. The first quartile, third quartile, and interquartile range are identical to Example 1. When we add 1.5 x IQR = 4.5 to the third quartile, the sum is 9.5. Since 10 is greater than 9.5, it is considered an outlier.

Is 10 a strong or weak outlier? For this, we need to look at 3 x IQR = 9. When we add 9 to the third quartile, we end up with a sum of 14. Since 10 is not greater than 14, it is not a strong outlier. Thus we conclude that 10 is a weak outlier.

Reasons for Identifying Outliers

We always need to be on the lookout for outliers. Sometimes they are caused by an error. Other times outliers indicate the presence of a previously unknown phenomenon. Another reason to be diligent about checking for outliers is that many descriptive statistics are sensitive to them. The mean, the standard deviation, and the correlation coefficient for paired data are just a few of these statistics.
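Examples 1 and 2 can be checked with a short Python sketch. The function name classify_outliers is hypothetical, and the quartiles are computed with the standard library's statistics.quantiles using its default method, which is an assumption; it happens to give a first quartile of 2 and a third quartile of 5 for both data sets, matching the values used in the examples.

```python
from statistics import quantiles


def classify_outliers(data):
    """Split outliers into (weak, strong) lists using the 1.5 x IQR and 3.0 x IQR cutoffs."""
    # Assumption: default "exclusive" quartile method from the standard library.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    weak, strong = [], []
    for x in data:
        if x < q1 - 3.0 * iqr or x > q3 + 3.0 * iqr:
            strong.append(x)  # beyond the 3.0 x IQR cutoffs
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            weak.append(x)    # an outlier, but not a strong one
    return weak, strong


example_1 = [1, 2, 2, 3, 3, 4, 5, 5, 9]
example_2 = [1, 2, 2, 3, 3, 4, 5, 5, 10]

print(classify_outliers(example_1))  # ([], [])
print(classify_outliers(example_2))  # ([10], [])
```

Checking the 3.0 x IQR cutoffs first means each value lands in exactly one list, which mirrors the definition of a weak outlier as an outlier that is not strong.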