Cleaning datasets with missing values¶

Dataset: Past record of Air Quality Health Index (English) Jul 1999 Hong Kong¶

import pandas as pd
%pylab inline

fi = pd.read_csv("hr071999.csv", skiprows=8)  # First 8 rows contains notes and comments

fi.head(30)

Populating the interactive namespace from numpy and matplotlib

There are a lot of NULL values in the Date column. Upon closer inspection, it is obvious that rows with missing Date value simply implies it's the same as above. Using the filldown method for replacing NaN values fixes this.

fi["Date"] = fi["Date"].fillna(method='pad')
fi.head(10)

It is not instantly obvious if any other columns have NULL values. A simple check shows that there are quite a lot of rows with at least one NULL value.

len(fi[fi.isnull().any(axis=1)].index)

553

Plotting one of the columns reveals that there might potentially be a lot of missing data.

fi["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x101d959d0>

Clearly data are missing (ie. they should be there but aren't). One way to fix this is to fill these missing entries with the mean value of that column.

fim = fi.copy()
fim = fim.fillna(fim.mean()['Causeway Bay':])
fim["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x108d20a50>

fim["Tai Po"].mean()

31.208261617900085

Another way to fix this is by interpolating.

fii = fi.copy()
fii = fii.interpolate()
fii["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1063a6190>

fii["Tai Po"].mean()

30.816958277254376

fiic = fi.copy()
fiic = fiic.interpolate(method='cubic')
fiic["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x10a933bd0>

fiic["Tai Po"].mean()

30.72415120179377

fiip = fi.copy()
fiip = fiip.interpolate(method='pchip')
fiip["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x1098d7750>

fiip["Tai Po"].mean()

30.81459036242166

Round all float values to the nearest integer.

for col in fiip:
    if fiip[col].dtype == np.float64:
        fiip[col] = fiip[col].round()
fiip["Tai Po"].plot()

<matplotlib.axes._subplots.AxesSubplot at 0x10c9ba390>

	Date	Hour	Causeway Bay	Central	Mong Kok	Central/Western	Eastern	Kwai Chung	Kwun Tong	Sha Tin	Sham Shui Po	Tai Po	Tap Mun	Tsuen Wan	Tung Chung	Yuen Long
0	01/07/1999	0	67	44	31	14	13	29	32	19	21	21	10	21	14	20
1	NaN	1	67	43	31	14	13	29	32	19	21	21	9	21	15	20
2	NaN	2	67	43	31	14	13	30	32	19	21	21	9	21	13	20
3	NaN	3	67	42	31	14	13	30	32	19	21	20	9	21	11	20
4	NaN	4	67	43	31	14	12	29	32	19	20	20	9	21	11	20
5	NaN	5	67	44	31	14	12	29	32	19	20	20	9	21	12	20
6	NaN	6	68	44	31	14	12	30	32	20	20	21	9	22	11	20
7	NaN	7	67	43	30	13	12	29	31	19	20	22	9	21	13	20
8	NaN	8	67	41	30	13	12	27	31	19	19	21	9	21	13	20
9	NaN	9	67	38	30	12	12	26	30	20	19	20	10	20	13	20
10	NaN	10	67	36	30	12	12	25	30	19	18	21	12	20	15	20
11	NaN	11	67	35	29	12	12	25	30	19	17	21	15	20	13	19
12	NaN	12	67	35	29	12	12	24	30	18	17	21	13	20	13	20
13	NaN	13	66	35	28	11	13	24	29	18	17	22	17	19	13	19
14	NaN	14	66	36	28	12	13	23	29	16	16	21	19	20	18	18
15	NaN	15	67	37	27	12	13	22	28	16	15	22	16	20	18	18
16	NaN	16	67	38	27	12	14	22	29	16	15	22	19	20	17	18
17	NaN	17	66	37	27	14	14	21	29	16	14	22	16	19	15	18
18	NaN	18	66	36	27	12	14	21	28	16	14	22	16	19	12	18
19	NaN	19	66	36	27	12	14	21	28	16	14	22	15	19	12	18
20	NaN	20	66	36	27	12	14	21	28	16	14	23	12	19	10	18
21	NaN	21	66	36	27	12	14	21	28	16	14	23	10	19	9	18
22	NaN	22	66	36	27	12	14	21	28	17	14	23	11	18	11	18
23	NaN	23	66	35	27	12	14	20	28	17	14	23	10	19	11	18
24	02/07/1999	0	65	35	27	12	14	20	27	17	14	23	10	19	10	18
25	NaN	1	65	35	27	12	14	20	27	17	13	23	10	19	10	19
26	NaN	2	65	35	27	13	14	19	27	16	13	23	10	19	10	19
27	NaN	3	65	36	NaN	12	14	19	27	16	13	23	10	19	10	18
28	NaN	4	65	36	27	12	14	19	27	16	13	23	10	19	11	18
29	NaN	5	65	35	27	12	15	19	27	16	14	23	10	19	10	19