Data Normalisation & Standardisation

Dataset: Past record of Air Quality Health Index (English) Jul 1999 Hong Kong

In [1]:
import pandas as pd
import numpy as np

fi = pd.read_csv("hr071999.csv", skiprows=8)                              # The first 8 rows contain notes and comments
fi["Date"] = fi["Date"].ffill()                                           # Fill down dates (replaces the deprecated fillna(method='pad'))
fi = fi.interpolate(method='pchip')                                       # Interpolate missing values (the 'pchip' method requires SciPy)
for col in fi:                                                            # Round values to nearest integer
    if fi[col].dtype == np.float64:
        fi[col] = fi[col].round()
fi.head(10)
Out[1]:
         Date  Hour  Causeway Bay  Central  Mong Kok  Central/Western  Eastern  Kwai Chung  Kwun Tong  Sha Tin  Sham Shui Po  Tai Po  Tap Mun  Tsuen Wan  Tung Chung  Yuen Long
0  01/07/1999     0            67       44        31               14       13          29         32       19            21      21       10         21          14         20
1  01/07/1999     1            67       43        31               14       13          29         32       19            21      21        9         21          15         20
2  01/07/1999     2            67       43        31               14       13          30         32       19            21      21        9         21          13         20
3  01/07/1999     3            67       42        31               14       13          30         32       19            21      20        9         21          11         20
4  01/07/1999     4            67       43        31               14       12          29         32       19            20      20        9         21          11         20
5  01/07/1999     5            67       44        31               14       12          29         32       19            20      20        9         21          12         20
6  01/07/1999     6            68       44        31               14       12          30         32       20            20      21        9         22          11         20
7  01/07/1999     7            67       43        30               13       12          29         31       19            20      22        9         21          13         20
8  01/07/1999     8            67       41        30               13       12          27         31       19            19      21        9         21          13         20
9  01/07/1999     9            67       38        30               12       12          26         30       20            19      20       10         20          13         20
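
The cleaning steps in the first cell (fill down dates, interpolate gaps, round) can be seen on a small toy frame. This is only a sketch: the column name Station and all values are made up, and it assumes the date appears only on the first row of each day. Linear interpolation is swapped in for 'pchip' so that the sketch runs without SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the date is given once per day, one reading is missing
toy = pd.DataFrame({
    "Date": ["01/07/1999", np.nan, np.nan, "02/07/1999"],
    "Hour": [0, 1, 2, 0],
    "Station": [10.0, np.nan, 14.0, 15.0],
})

toy["Date"] = toy["Date"].ffill()  # Fill down dates
# Linear interpolation used here as a stand-in for method='pchip'
toy["Station"] = toy["Station"].interpolate(method="linear").round()
print(toy)
```

The missing Station reading is replaced by a value between its neighbours, and every row ends up with a date.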

Specify columns to normalise/standardise.

In [2]:
cols = list(fi.loc[:,'Causeway Bay':])
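
The selection above relies on two pandas behaviours: a label slice with .loc includes every column from the start label to the end of the frame, and iterating over a DataFrame yields its column names. A minimal sketch on a made-up frame with only two station columns:

```python
import pandas as pd

# Hypothetical frame with the same leading columns as the dataset
toy = pd.DataFrame({
    "Date": ["01/07/1999"],
    "Hour": [0],
    "Causeway Bay": [67],
    "Central": [44],
})

# Everything from 'Causeway Bay' to the last column, as a list of names
cols = list(toy.loc[:, "Causeway Bay":])
print(cols)  # ['Causeway Bay', 'Central']
```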

Normalisation

Normalisation, otherwise known as feature scaling or min-max scaling, rescales the values to a specified range, usually 0 to 1.
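
For a single column, the rescaling can be written as follows. The symbols are introduced here for illustration: x is a value in the column, x_min and x_max are the column minimum and maximum, and (a, b) is the target range.

```latex
x' = a + \frac{(x - x_{\min})\,(b - a)}{x_{\max} - x_{\min}}
```

With a = 0 and b = 1 this reduces to (x - x_min) / (x_max - x_min).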

Define a normalise function that takes a DataFrame, applies normalisation to each column, and returns the result. The optional parameter range is a tuple giving the lower and upper bounds of the values after normalisation; the default is (0,1).

In [3]:
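def normalise(df, range=(0,1)):
    # Note: the parameter name `range` shadows the Python built-in inside this function
    lo, hi = range
    # Min-max scale each column to [lo, hi]; a constant column (max == min) becomes NaN
    df = lo + ((df - df.min()) * (hi - lo)) / (df.max() - df.min())
    return df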
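
As a quick check of the formula, the sketch below applies the same function to a made-up two-column frame. The column names and values are hypothetical; column b is constant on purpose to show what happens when the maximum equals the minimum.

```python
import pandas as pd

def normalise(df, range=(0, 1)):
    # Same formula as the cell above, repeated so this sketch runs on its own
    return range[0] + ((df - df.min()) * (range[1] - range[0])) / (df.max() - df.min())

# Hypothetical data: 'a' varies, 'b' is constant
toy = pd.DataFrame({"a": [10.0, 20.0, 30.0], "b": [5.0, 5.0, 5.0]})
scaled = normalise(toy, range=(0, 10))
print(scaled)
```

Column a maps to 0, 5 and 10, so the minimum and maximum land on the ends of the requested range. Column b becomes NaN because the denominator is zero, so constant columns may need to be handled separately.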

Apply normalisation with a specified range of 0 to 10.

In [4]:
fi_norm = fi.copy()
fi_norm[cols] = normalise(fi_norm[cols], range=(0,10))
fi_norm.head(10)
Out[4]:
         Date  Hour  Causeway Bay   Central  Mong Kok  Central/Western   Eastern  Kwai Chung  Kwun Tong   Sha Tin  Sham Shui Po    Tai Po  Tap Mun  Tsuen Wan  Tung Chung  Yuen Long
0  01/07/1999     0      5.172414  4.285714  0.952381         0.714286  0.357143    1.923077   1.320755  1.132075      1.818182  1.063830  0.31250   0.652174    0.675676   0.714286
1  01/07/1999     1      5.172414  4.047619  0.952381         0.714286  0.357143    1.923077   1.320755  1.132075      1.818182  1.063830  0.15625   0.652174    0.810811   0.714286
2  01/07/1999     2      5.172414  4.047619  0.952381         0.714286  0.357143    2.115385   1.320755  1.132075      1.818182  1.063830  0.15625   0.652174    0.540541   0.714286
3  01/07/1999     3      5.172414  3.809524  0.952381         0.714286  0.357143    2.115385   1.320755  1.132075      1.818182  0.851064  0.15625   0.652174    0.270270   0.714286
4  01/07/1999     4      5.172414  4.047619  0.952381         0.714286  0.000000    1.923077   1.320755  1.132075      1.590909  0.851064  0.15625   0.652174    0.270270   0.714286
5  01/07/1999     5      5.172414  4.285714  0.952381         0.714286  0.000000    1.923077   1.320755  1.132075      1.590909  0.851064  0.15625   0.652174    0.405405   0.714286
6  01/07/1999     6      5.517241  4.285714  0.952381         0.714286  0.000000    2.115385   1.320755  1.320755      1.590909  1.063830  0.15625   0.869565    0.270270   0.714286
7  01/07/1999     7      5.172414  4.047619  0.714286         0.476190  0.000000    1.923077   1.132075  1.132075      1.590909  1.276596  0.15625   0.652174    0.540541   0.714286
8  01/07/1999     8      5.172414  3.571429  0.714286         0.476190  0.000000    1.538462   1.132075  1.132075      1.363636  1.063830  0.15625   0.652174    0.540541   0.714286
9  01/07/1999     9      5.172414  2.857143  0.714286         0.238095  0.000000    1.346154   0.943396  1.320755      1.363636  0.851064  0.31250   0.434783    0.540541   0.714286

Standardisation

Standardisation, otherwise known as Z-score scaling, rescales the data so that each column has a mean of 0 and a standard deviation of 1.
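
For a single column, the Z-score of a value can be written as follows. The symbols are introduced here for illustration: x is a value in the column, x-bar is the column mean, and s is the column standard deviation.

```latex
z = \frac{x - \bar{x}}{s}
```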

Define a standardise function that takes a DataFrame, applies standardisation to each column, and returns the result.

In [5]:
def standardise(df):
    # Column-wise Z-score; DataFrame.std() uses the sample standard deviation (ddof=1) by default
    df = (df - df.mean()) / df.std()
    return df
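
A small check on made-up data shows that the result has mean 0 and standard deviation 1 in every column. The column names and values below are hypothetical. Note that pandas computes the sample standard deviation (ddof=1) by default, so the result is 1 under that definition rather than the population one.

```python
import pandas as pd

def standardise(df):
    # Same formula as the cell above, repeated so this sketch runs on its own
    return (df - df.mean()) / df.std()

# Hypothetical data
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [1.0, 2.0, 9.0]})
z = standardise(toy)

print(z)
print(z.mean())  # approximately 0 for every column
print(z.std())   # approximately 1 for every column (sample std, ddof=1)
```

For column a the mean is 4 and the sample standard deviation is 2, so the values map to -1, 0 and 1.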

Apply standardisation.

In [6]:
fi_stds = fi.copy()
fi_stds[cols] = standardise(fi_stds[cols])
fi_stds.head(10)
Out[6]:
         Date  Hour  Causeway Bay    Central   Mong Kok  Central/Western    Eastern  Kwai Chung  Kwun Tong    Sha Tin  Sham Shui Po     Tai Po    Tap Mun  Tsuen Wan  Tung Chung  Yuen Long
0  01/07/1999     0      0.483218  -0.667016  -1.411001        -1.498267  -1.527556   -1.094393  -0.431131  -0.916338     -1.283588  -1.018649  -1.235283  -1.161153   -0.708754  -0.930821
1  01/07/1999     1      0.483218  -0.795775  -1.411001        -1.498267  -1.527556   -1.094393  -0.431131  -0.916338     -1.283588  -1.018649  -1.348860  -1.161153   -0.628011  -0.930821
2  01/07/1999     2      0.483218  -0.795775  -1.411001        -1.498267  -1.527556   -0.998289  -0.431131  -0.916338     -1.283588  -1.018649  -1.348860  -1.161153   -0.789497  -0.930821
3  01/07/1999     3      0.483218  -0.924534  -1.411001        -1.498267  -1.527556   -0.998289  -0.431131  -0.916338     -1.283588  -1.122442  -1.348860  -1.161153   -0.950982  -0.930821
4  01/07/1999     4      0.483218  -0.795775  -1.411001        -1.498267  -1.680869   -1.094393  -0.431131  -0.916338     -1.386302  -1.122442  -1.348860  -1.161153   -0.950982  -0.930821
5  01/07/1999     5      0.483218  -0.667016  -1.411001        -1.498267  -1.680869   -1.094393  -0.431131  -0.916338     -1.386302  -1.122442  -1.348860  -1.161153   -0.870239  -0.930821
6  01/07/1999     6      0.617939  -0.667016  -1.411001        -1.498267  -1.680869   -0.998289  -0.431131  -0.790443     -1.386302  -1.018649  -1.348860  -1.037640   -0.950982  -0.930821
7  01/07/1999     7      0.483218  -0.795775  -1.517923        -1.619651  -1.680869   -1.094393  -0.550836  -0.916338     -1.386302  -0.914857  -1.348860  -1.161153   -0.789497  -0.930821
8  01/07/1999     8      0.483218  -1.053293  -1.517923        -1.619651  -1.680869   -1.286601  -0.550836  -0.916338     -1.489017  -1.018649  -1.348860  -1.161153   -0.789497  -0.930821
9  01/07/1999     9      0.483218  -1.439569  -1.517923        -1.741035  -1.680869   -1.382704  -0.670541  -0.790443     -1.489017  -1.122442  -1.235283  -1.284666   -0.789497  -0.930821