Detecting outliers on normally distributed data

Dataset: Met Office Heathrow Weather Station Historic Station Data

In [1]:
import pandas as pd

fi = pd.read_csv("heathrowdata.txt", delim_whitespace=True)
fi.head(10)
Out[1]:
yyyy mm tmax tmin af rain sun
0 1948 1 8.9 3.3 --- 85 ---
1 1948 2 7.9 2.2 --- 26 ---
2 1948 3 14.2 3.8 --- 14 ---
3 1948 4 15.4 5.1 --- 35 ---
4 1948 5 18.1 6.9 --- 57 ---
5 1948 6 19.1 10.3 --- 67 ---
6 1948 7 21.7 12.0 --- 21 ---
7 1948 8 20.8 11.7 --- 67 ---
8 1948 9 19.6 10.2 --- 35 ---
9 1948 10 14.9 6.0 --- 50 ---

Specify which column to operate on and generate DataFrame that contains only that column along with Date and Hour. Define constant r which specifies how many units of dispersion away from the center would be considered "too far away" and the value be considered as an outlier.

In [2]:
col = "rain"
df = fi[[col]]
r = 2

Using the mean and standard deviation

The most common metric for the "acceptable range" is "2 standard deviations from the mean".

In [3]:
mean = df[col].mean()
mean
Out[3]:
50.52948402948413
In [4]:
std_r = r * df[col].std()
std_r
Out[4]:
59.96564544897468
In [5]:
print df[(df[col] >= (mean + std_r)) | (df[col] <= (mean - std_r))]
      rain
21   139.6
37   121.4
46   132.6
102  130.5
153  155.5
190  118.8
225  122.1
248  131.4
274  151.2
281  127.7
320  140.0
322  142.8
332  111.2
355  150.3
477  174.8
503  119.3
549  123.0
564  113.7
619  124.8
633  155.4
658  151.0
659  119.0
670  113.8
714  115.2
742  148.0
773  110.8
792  162.4
802  128.4
811  116.8

Using the trimmed mean and trimmed standard deviation

Both the mean and standard deviation (which uses the mean) are not considered to be robust centers/dispersions as they are prone to the "masking" effect. Using robust centers/dispersions may be more appropriate for data with extreme values.

One such robust center is the trimmed mean. A k% trimmed mean discards the lowest and higest k% of the values.

In [6]:
k = 0.05                     # 5% trimmed mean
rec_len = len(df[col])
start_ix = int(rec_len * k)
fin_ix = rec_len - start_ix
dfs = df.sort_values(col)
trimmed_mean = dfs[col][start_ix:fin_ix].mean()
trimmed_mean
Out[6]:
48.81539509536784

A natural dispersion metric for the trimmed mean is the trimmed standard deviation, which is the standard deviation of the trimmed data.

In [7]:
trimmed_std_r = r * dfs[col][start_ix:fin_ix].std()
trimmed_std_r
Out[7]:
47.320608966236435
In [8]:
print df[(df[col] >= (trimmed_mean + trimmed_std_r)) | (df[col] <= (trimmed_mean - trimmed_std_r))]
      rain
21   139.6
34   108.2
37   121.4
46   132.6
88   103.2
102  130.5
103   99.5
125  104.3
153  155.5
190  118.8
197  110.1
212  109.4
219   98.6
225  122.1
232  102.1
237  103.8
248  131.4
274  151.2
281  127.7
320  140.0
322  142.8
332  111.2
344   96.6
355  150.3
371  108.6
389  106.2
393   96.5
417  109.1
423  101.5
456  105.6
..     ...
503  119.3
548   96.3
549  123.0
564  113.7
571    0.3
608  101.2
609  103.8
617   98.2
619  124.8
627   99.6
632  105.8
633  155.4
634   99.5
645  108.2
658  151.0
659  119.0
670  113.8
679  107.8
681   96.6
706   98.0
714  115.2
742  148.0
745  100.4
771   98.4
773  110.8
791   98.2
792  162.4
799   97.6
802  128.4
811  116.8

[62 rows x 1 columns]

Using the median and median absolute deviation

One other robust center is the median.

In [9]:
median = df[col].median()
median
Out[9]:
46.25

The Median Absolute Deviation (MAD) can be used to work along with the median. The MAD is the median of the absolute deviations from the data's median.

In [10]:
dfm = df.copy()
dfm[col] = dfm.apply(lambda x: abs(x - median))
mad_r = r * 1.4826 * dfm[col].median()              # MAD is approx. 1.4826 * s.d. (Hampel X84)
mad_r
Out[10]:
60.0453
In [11]:
print df[(df[col] >= (median + mad_r)) | (df[col] <= (median - mad_r))]
      rain
21   139.6
34   108.2
37   121.4
46   132.6
102  130.5
153  155.5
190  118.8
197  110.1
212  109.4
225  122.1
248  131.4
274  151.2
281  127.7
320  140.0
322  142.8
332  111.2
355  150.3
371  108.6
417  109.1
477  174.8
480  106.9
503  119.3
549  123.0
564  113.7
619  124.8
633  155.4
645  108.2
658  151.0
659  119.0
670  113.8
679  107.8
714  115.2
742  148.0
773  110.8
792  162.4
802  128.4
811  116.8