Cleaning datasets with missing values

Dataset: Past record of Air Quality Health Index (English) Jul 1999 Hong Kong

In [1]:
import pandas as pd
%pylab inline

fi = pd.read_csv("hr071999.csv", skiprows=8)  # First 8 rows contains notes and comments

fi.head(30)
Populating the interactive namespace from numpy and matplotlib
Out[1]:
Date Hour Causeway Bay Central Mong Kok Central/Western Eastern Kwai Chung Kwun Tong Sha Tin Sham Shui Po Tai Po Tap Mun Tsuen Wan Tung Chung Yuen Long
0 01/07/1999 0 67 44 31 14 13 29 32 19 21 21 10 21 14 20
1 NaN 1 67 43 31 14 13 29 32 19 21 21 9 21 15 20
2 NaN 2 67 43 31 14 13 30 32 19 21 21 9 21 13 20
3 NaN 3 67 42 31 14 13 30 32 19 21 20 9 21 11 20
4 NaN 4 67 43 31 14 12 29 32 19 20 20 9 21 11 20
5 NaN 5 67 44 31 14 12 29 32 19 20 20 9 21 12 20
6 NaN 6 68 44 31 14 12 30 32 20 20 21 9 22 11 20
7 NaN 7 67 43 30 13 12 29 31 19 20 22 9 21 13 20
8 NaN 8 67 41 30 13 12 27 31 19 19 21 9 21 13 20
9 NaN 9 67 38 30 12 12 26 30 20 19 20 10 20 13 20
10 NaN 10 67 36 30 12 12 25 30 19 18 21 12 20 15 20
11 NaN 11 67 35 29 12 12 25 30 19 17 21 15 20 13 19
12 NaN 12 67 35 29 12 12 24 30 18 17 21 13 20 13 20
13 NaN 13 66 35 28 11 13 24 29 18 17 22 17 19 13 19
14 NaN 14 66 36 28 12 13 23 29 16 16 21 19 20 18 18
15 NaN 15 67 37 27 12 13 22 28 16 15 22 16 20 18 18
16 NaN 16 67 38 27 12 14 22 29 16 15 22 19 20 17 18
17 NaN 17 66 37 27 14 14 21 29 16 14 22 16 19 15 18
18 NaN 18 66 36 27 12 14 21 28 16 14 22 16 19 12 18
19 NaN 19 66 36 27 12 14 21 28 16 14 22 15 19 12 18
20 NaN 20 66 36 27 12 14 21 28 16 14 23 12 19 10 18
21 NaN 21 66 36 27 12 14 21 28 16 14 23 10 19 9 18
22 NaN 22 66 36 27 12 14 21 28 17 14 23 11 18 11 18
23 NaN 23 66 35 27 12 14 20 28 17 14 23 10 19 11 18
24 02/07/1999 0 65 35 27 12 14 20 27 17 14 23 10 19 10 18
25 NaN 1 65 35 27 12 14 20 27 17 13 23 10 19 10 19
26 NaN 2 65 35 27 13 14 19 27 16 13 23 10 19 10 19
27 NaN 3 65 36 NaN 12 14 19 27 16 13 23 10 19 10 18
28 NaN 4 65 36 27 12 14 19 27 16 13 23 10 19 11 18
29 NaN 5 65 35 27 12 15 19 27 16 14 23 10 19 10 19
In [ ]:
 

There are a lot of NULL values in the Date column. Upon closer inspection, it is obvious that rows with missing Date value simply implies it's the same as above. Using the filldown method for replacing NaN values fixes this.

In [2]:
fi["Date"] = fi["Date"].fillna(method='pad')
fi.head(10)
Out[2]:
Date Hour Causeway Bay Central Mong Kok Central/Western Eastern Kwai Chung Kwun Tong Sha Tin Sham Shui Po Tai Po Tap Mun Tsuen Wan Tung Chung Yuen Long
0 01/07/1999 0 67 44 31 14 13 29 32 19 21 21 10 21 14 20
1 01/07/1999 1 67 43 31 14 13 29 32 19 21 21 9 21 15 20
2 01/07/1999 2 67 43 31 14 13 30 32 19 21 21 9 21 13 20
3 01/07/1999 3 67 42 31 14 13 30 32 19 21 20 9 21 11 20
4 01/07/1999 4 67 43 31 14 12 29 32 19 20 20 9 21 11 20
5 01/07/1999 5 67 44 31 14 12 29 32 19 20 20 9 21 12 20
6 01/07/1999 6 68 44 31 14 12 30 32 20 20 21 9 22 11 20
7 01/07/1999 7 67 43 30 13 12 29 31 19 20 22 9 21 13 20
8 01/07/1999 8 67 41 30 13 12 27 31 19 19 21 9 21 13 20
9 01/07/1999 9 67 38 30 12 12 26 30 20 19 20 10 20 13 20

It is not instantly obvious if any other columns have NULL values. A simple check shows that there are quite a lot of rows with at least one NULL value.

In [3]:
len(fi[fi.isnull().any(axis=1)].index)
Out[3]:
553

Plotting one of the columns reveals that there might potentially be a lot of missing data.

In [12]:
fi["Tai Po"].plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x101d959d0>

Clearly data are missing (ie. they should be there but aren't). One way to fix this is to fill these missing entries with the mean value of that column.

In [4]:
fim = fi.copy()
fim = fim.fillna(fim.mean()['Causeway Bay':])
fim["Tai Po"].plot()
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x108d20a50>
In [5]:
fim["Tai Po"].mean()
Out[5]:
31.208261617900085

Another way to fix this is by interpolating.

In [6]:
fii = fi.copy()
fii = fii.interpolate()
fii["Tai Po"].plot()
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1063a6190>
In [7]:
fii["Tai Po"].mean()
Out[7]:
30.816958277254376
In [8]:
fiic = fi.copy()
fiic = fiic.interpolate(method='cubic')
fiic["Tai Po"].plot()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a933bd0>
In [9]:
fiic["Tai Po"].mean()
Out[9]:
30.72415120179377
In [10]:
fiip = fi.copy()
fiip = fiip.interpolate(method='pchip')
fiip["Tai Po"].plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1098d7750>
In [11]:
fiip["Tai Po"].mean()
Out[11]:
30.81459036242166

Round all float values to the nearest integer.

In [29]:
for col in fiip:
    if fiip[col].dtype == np.float64:
        fiip[col] = fiip[col].round()
fiip["Tai Po"].plot()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c9ba390>