
The standard deviation is a measure of dispersion, i.e. how much individual data points are spread out from the mean.

For example, consider the two data sets:

27 23 25 22 23 20 20 25 29 29


12 31 31 16 28 47 9 5 40 47

The two sets have similar means (24.3 and 26.6 respectively). However, the first data set has values clustered close to its mean, while the values in the second data set are far more spread out.

To be more precise, the (population) standard deviation of the first data set is 3.13, and of the second it is 14.68.

However, it’s not easy to wrap your head around numbers like 3.13 or 14.68 on their own. So far, all they tell us is that the second data set is more scattered than the first.
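The figures above can be reproduced with Python’s built-in `statistics` module (a quick sketch; the article does not show this computation itself):

```python
import statistics

set_1 = [27, 23, 25, 22, 23, 20, 20, 25, 29, 29]
set_2 = [12, 31, 31, 16, 28, 47, 9, 5, 40, 47]

# Means of the two data sets.
print(statistics.mean(set_1), statistics.mean(set_2))  # 24.3 26.6

# pstdev computes the population standard deviation.
print(round(statistics.pstdev(set_1), 2))  # 3.13
print(round(statistics.pstdev(set_2), 2))  # 14.68
```

Note that `statistics.pstdev` divides by n (population SD); `statistics.stdev` divides by n − 1 (sample SD) and would give slightly larger values.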

Let’s put this to more practical use.

When performing analyses, we often encounter data that follows a pattern: the values cluster around the mean, with roughly equal numbers of scores below and above it, e.g.

  • human heights
  • blood pressure values
  • test scores

Such values follow a normal distribution.

According to the Wikipedia article on the normal distribution, about 68% of values drawn from a normal distribution lie within one standard deviation σ of the mean; about 95% lie within two standard deviations; and about 99.7% lie within three standard deviations.

This fact is known as the 68-95-99.7 rule, the empirical rule, or the three-sigma rule.
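We can check the rule empirically with a quick simulation using Python’s `random` module (the mean of 25 and SD of 5 below are arbitrary illustrative choices):

```python
import random

# Draw a large sample from a normal distribution and measure how many
# points fall within 1, 2, and 3 standard deviations of the mean.
random.seed(42)
mu, sigma = 25, 5
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

def within(k):
    """Fraction of samples within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in samples) / len(samples)

for k, expected in [(1, 0.683), (2, 0.954), (3, 0.997)]:
    print(f"within {k} SD: {within(k):.3f} (theory: about {expected})")
```

The printed fractions land very close to 68%, 95%, and 99.7%, as the rule predicts.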

I applied this rule successfully when I needed to clean data from millions of IoT devices generating data for heating equipment. Each data point contained the electricity consumption at a given time.

However, sometimes the devices are not 100% accurate and give very high or very low values.

We had to remove these outliers because they made our graph scales unrealistic. The challenge was that the number of outliers was never fixed: sometimes we would get all valid values, and sometimes these bad readings made up as much as 10% of the data points.

Our approach was to remove outliers by eliminating every point above (mean + 2*SD) and every point below (mean – 2*SD) before plotting the frequencies.

You don’t have to use 2, though; you can tune this multiplier to get an outlier-detection formula that better fits your data.

Here is a usage example in Python. The data set is drawn from a classic normal distribution, but it also contains a few values like 10 and 20 that would disturb our analysis and spoil the scales of our graphs.
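A minimal sketch of this filtering step in pure Python (the input list here is illustrative, not the article’s original data; the values 10 and 20 stand in for the faulty readings mentioned above):

```python
import statistics

# Roughly normal readings plus two faulty ones (10 and 20).
readings = [386, 479, 627, 523, 482, 483, 542, 699, 535, 617,
            10, 20, 577, 471, 615, 583, 441, 562, 563, 527]

mean = statistics.mean(readings)
sd = statistics.pstdev(readings)  # population standard deviation

# Keep only points within mean +/- 2*SD, as described above.
cleaned = [x for x in readings if mean - 2 * sd <= x <= mean + 2 * sd]

print(cleaned)  # 10 and 20 are filtered out; all other readings survive
```

Replacing the `2` in the list comprehension with a different multiplier tightens or loosens the filter, as noted above.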

In this case the outliers have been removed; the cleaned data set is shown below, and if we plot it, our graph will look much better.

  [386, 479, 627, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562, 563,
   527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566,
   554, 472, 335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444,
   578, 405, 487, 490, 496, 398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565,
   415, 486, 668, 414, 665, 557, 304, 404, 454, 689, 610, 483, 441, 657, 590, 492, 476,
   437, 483, 529, 363, 711, 543]

As you can see, we were able to remove the outliers. I wouldn’t recommend this method for every statistical analysis, though: outliers play an important role in statistics, and they are there for a reason!

But in our case, the deviations were clearly due to errors in the data, and the data were normally distributed, so using the standard deviation made sense.

Punit Jajodia is an entrepreneur and software developer from Kathmandu, Nepal. Flexibility is his greatest strength, as he has worked on projects ranging from real-time 3D simulations in the browser and big data analytics to Windows application development. He is also the co-founder of Programiz.com, one of the largest Python and R tutorial websites.
