import pandas as pd
fi = pd.read_csv("heathrowdata.txt", sep=r"\s+")  # whitespace-delimited; delim_whitespace is deprecated in recent pandas
fi.head(10)
Specify which column to operate on and build a DataFrame that contains only that column. Define the constant r: the number of units of dispersion from the center beyond which a value is treated as an outlier.
col = "rain"
df = fi[[col]]
r = 2
A common choice for the "acceptable range" is within 2 standard deviations of the mean.
mean = df[col].mean()
mean
std_r = r * df[col].std()
std_r
print(df[(df[col] >= (mean + std_r)) | (df[col] <= (mean - std_r))])
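The same "center plus or minus r units of dispersion" test recurs for every center/dispersion pair in this notebook, so it can be wrapped in a small helper. The sketch below is illustrative only: the function name `outlier_mask` and the sample values are made up and are not part of the Heathrow data.

```python
import pandas as pd

def outlier_mask(values, center, spread):
    """Return True where a value lies at least `spread` away from `center`."""
    return (values >= center + spread) | (values <= center - spread)

# Made-up sample: nine ordinary values and one extreme one.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
r = 2
mask = outlier_mask(sample, sample.mean(), r * sample.std())
flagged = sample[mask].tolist()
print(flagged)
```

Any center and any already-scaled spread can be passed in, so the same helper would serve the mean/standard deviation rule as well as robust alternatives.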
Neither the mean nor the standard deviation (which is computed from the mean) is considered a robust center/dispersion, because both are prone to the "masking" effect: extreme values pull the mean towards themselves and inflate the standard deviation, which can hide other outliers. Robust centers/dispersions may be more appropriate for data with extreme values.
One such robust center is the trimmed mean. A k% trimmed mean discards the lowest and highest k% of the values.
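To make the masking effect concrete, here is a minimal sketch on made-up numbers (not the Heathrow data). The helper name `flag_mean_std` is hypothetical. With the value 500 present, the inflated standard deviation hides the value 60; once 500 is removed, 60 is flagged.

```python
import pandas as pd

def flag_mean_std(values, r=2):
    """List the values at least r standard deviations away from the mean."""
    center = values.mean()
    spread = r * values.std()
    hits = values[(values >= center + spread) | (values <= center - spread)]
    return hits.tolist()

# Ten ordinary values plus two suspicious ones (60 and 500).
demo = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 60, 500], dtype=float)

with_extreme = flag_mean_std(demo)                 # 500 inflates the standard deviation
without_extreme = flag_mean_std(demo[demo < 500])  # same rule with 500 removed

print(with_extreme)
print(without_extreme)
```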
k = 0.05 # 5% trimmed mean
rec_len = len(df[col])
start_ix = int(rec_len * k)
fin_ix = rec_len - start_ix
dfs = df.sort_values(col)
trimmed_mean = dfs[col].iloc[start_ix:fin_ix].mean()  # positional slice of the sorted values
trimmed_mean
A natural dispersion metric for the trimmed mean is the trimmed standard deviation, which is the standard deviation of the trimmed data.
trimmed_std_r = r * dfs[col].iloc[start_ix:fin_ix].std()  # positional slice of the sorted values
trimmed_std_r
print(df[(df[col] >= (trimmed_mean + trimmed_std_r)) | (df[col] <= (trimmed_mean - trimmed_std_r))])
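The trimming steps (sort, cut the same number of values from each end, then summarise what is left) can be collected into one function. This is a sketch under the same convention used above, where the number of values cut from each end is `int(n * k)`; the name `trimmed_stats` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def trimmed_stats(values, k):
    """Return (mean, std) after dropping the lowest and highest fraction k."""
    ordered = values.sort_values()
    n_cut = int(len(ordered) * k)  # number of values dropped at each end
    kept = ordered.iloc[n_cut:len(ordered) - n_cut]
    return kept.mean(), kept.std()

# Unsorted toy data with one extreme value.
toy = pd.Series([1000, 3, 7, 1, 9, 5, 2, 8, 4, 6], dtype=float)
t_mean, t_std = trimmed_stats(toy, k=0.1)

print(toy.mean(), t_mean)  # the untrimmed mean is dominated by 1000
print(t_std)
```

Because `int()` truncates, a small k on a short series can drop nothing at all, in which case the result equals the ordinary mean and standard deviation.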
Another robust center is the median.
median = df[col].median()
median
The Median Absolute Deviation (MAD) is the dispersion metric that pairs with the median. The MAD is the median of the absolute deviations from the data's median.
dfm = df.copy()
dfm[col] = (dfm[col] - median).abs()  # absolute deviation of each value from the median
mad_r = r * 1.4826 * dfm[col].median() # for normal data, s.d. is approx. 1.4826 * MAD (Hampel X84)
mad_r
print(df[(df[col] >= (median + mad_r)) | (df[col] <= (median - mad_r))])
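The factor 1.4826 comes from the normal distribution: half of normally distributed data lies within about 0.6745 standard deviations of the median, so the MAD is roughly 0.6745 times the standard deviation, and dividing by that quantile rescales the MAD to be comparable with a standard deviation. A minimal check using only the Python standard library (variable names are illustrative):

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745).
q75 = NormalDist().inv_cdf(0.75)

# Scale factor that makes the MAD comparable to a standard deviation
# when the data are normally distributed.
scale = 1 / q75
print(round(q75, 4), round(scale, 4))
```

For data that are far from normal, such as rainfall totals, the constant is only a convention, so the resulting threshold is better read as a rule of thumb.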