import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
fi = pd.read_csv("hr071999.csv", skiprows=8)  # The first 8 rows contain notes and comments
fi.head(30)
There are a lot of NULL values in the Date column. On closer inspection, a row with a missing Date simply shares the date of the row above it. Forward-filling the NaN values (carrying the last non-missing value down) fixes this.
fi["Date"] = fi["Date"].ffill()  # fillna(method='pad') is deprecated in recent pandas
fi.head(10)
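To see what forward fill does in isolation, here is a minimal sketch on a made-up frame. The column names and date strings are hypothetical and are not taken from `hr071999.csv`; the only assumption is that the date is written once per day and left blank on the following rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the date appears only on the first row of each day.
toy = pd.DataFrame({
    "Date": ["01/07/1999", np.nan, np.nan, "02/07/1999", np.nan],
    "Hour": [1, 2, 3, 1, 2],
})

# Forward fill: each NaN takes the last non-missing value above it.
toy["Date"] = toy["Date"].ffill()
print(toy)
```

Note that a NaN in the very first row would stay NaN, since there is nothing above it to copy.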
It is not immediately obvious whether any other columns have NULL values. A simple check shows that quite a lot of rows have at least one NULL value.
len(fi[fi.isnull().any(axis=1)].index)
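The row count alone does not say which columns are affected. A small sketch of both checks, using a hypothetical two-station frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical readings for two made-up stations.
toy = pd.DataFrame({
    "Station A": [10.0, np.nan, 12.0, 13.0],
    "Station B": [np.nan, np.nan, 7.0, 8.0],
})

# Number of rows with at least one missing value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

print(rows_with_null)
print(nulls_per_column)
```

Applied to `fi`, the per-column count would show whether the gaps are concentrated in a few stations or spread across all of them.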
Plotting one of the columns suggests that a lot of data may be missing.
fi["Tai Po"].plot()
Clearly, data are missing (i.e. values that should be there but aren't). One way to fix this is to fill the missing entries with the mean value of their column.
fim = fi.copy()
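# numeric_only=True skips the non-numeric Date column, which recent pandas refuses to average
fim = fim.fillna(fim.mean(numeric_only=True)['Causeway Bay':])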
fim["Tai Po"].plot()
fim["Tai Po"].mean()
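One property of mean imputation is worth keeping in mind: the column mean is unchanged, but the spread shrinks, because every filled value sits exactly at the mean. A minimal sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps.
s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

# Fill the gaps with the mean of the observed values.
filled = s.fillna(s.mean())

print(s.mean(), filled.mean())  # the mean is preserved
print(s.std(), filled.std())    # the standard deviation gets smaller
```

In a plot, the filled stretches also show up as flat lines at the mean level.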
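Another way to fix this is interpolation, which estimates each missing value from its neighbours. By default, `interpolate()` is linear; the cells below also try the cubic and PCHIP methods.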
fii = fi.copy()
fii = fii.interpolate()
fii["Tai Po"].plot()
fii["Tai Po"].mean()
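The default linear method draws a straight line between the nearest known values on either side of a gap. A short sketch on made-up values (not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps between known readings.
s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])
lin = s.interpolate()  # method='linear' is the default
print(lin.tolist())

# With the default settings, a gap at the very start is left unfilled,
# because there is no earlier value to interpolate from.
lead = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate()
print(lead.tolist())
```

Linear interpolation treats the rows as equally spaced, which seems reasonable here on the assumption that the readings are hourly.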
fiic = fi.copy()
# Cubic interpolation (delegates to SciPy, which must be installed)
fiic = fiic.interpolate(method='cubic')
fiic["Tai Po"].plot()
fiic["Tai Po"].mean()
fiip = fi.copy()
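# PCHIP: piecewise cubic Hermite interpolation, which does not overshoot the data (also requires SciPy)
fiip = fiip.interpolate(method='pchip')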
fiip["Tai Po"].plot()
fiip["Tai Po"].mean()
Round all float values to the nearest integer.
for col in fiip:
    # Only round the float columns; leave Date and any integer columns alone
    if fiip[col].dtype == np.float64:
        fiip[col] = fiip[col].round()
fiip["Tai Po"].plot()
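The same rounding can be written without an explicit loop by selecting the float columns first. This is a sketch on a hypothetical frame, offered as an alternative and not as the notebook's method. It also shows that pandas rounds exact halves to the nearest even integer.

```python
import pandas as pd

# Hypothetical frame mixing string, integer and float columns.
df = pd.DataFrame({
    "Date": ["a", "b", "c"],
    "Hour": [1, 2, 3],
    "Station A": [0.5, 1.5, 2.4],
})

# Pick out the float64 columns and round them in one step.
float_cols = df.select_dtypes(include="float64").columns
df[float_cols] = df[float_cols].round()

print(df)
```

Here 0.5 becomes 0.0 and 1.5 becomes 2.0. For interpolated readings this rarely matters, but it can be surprising if round-half-up was expected.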