import pandas as pd
import numpy as np
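fi = pd.read_csv("hr071999.csv", skiprows=8)  # The first 8 rows contain notes and comments
fi["Date"] = fi["Date"].ffill()  # Fill dates down (fillna(method='pad') is deprecated in recent pandas)
num_cols = fi.select_dtypes(include="number").columns
fi[num_cols] = fi[num_cols].interpolate(method="pchip")  # Interpolate missing values (pchip requires SciPy)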
for col in fi:  # Round float columns to the nearest integer
    if fi[col].dtype == np.float64:
        fi[col] = fi[col].round()
fi.head(10)
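The cleaning steps above can be tried on a small made-up frame. This is only a sketch: the column names, date strings, and values are hypothetical and not taken from hr071999.csv, and linear interpolation is swapped in for pchip so the example does not need SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the date is written only on the first row of each day,
# and one reading is missing.
toy = pd.DataFrame({
    "Date": ["01/07/1999", np.nan, np.nan, "02/07/1999"],
    "Hour": [1, 2, 3, 1],
    "Station A": [10.0, np.nan, 14.6, 16.0],
})

toy["Date"] = toy["Date"].ffill()  # fill dates down

# Assumption: linear interpolation instead of pchip, to avoid the SciPy dependency.
toy["Station A"] = toy["Station A"].interpolate(method="linear").round()

print(toy)
```

The missing reading is replaced by a value between its neighbours and then rounded, which mirrors what the cell above does to every numeric column.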
Specify the columns to normalise/standardise: every column from 'Causeway Bay' onwards.
cols = list(fi.loc[:,'Causeway Bay':])
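As a quick illustration of how that selection works: label-based slicing with `.loc` includes the starting label, and calling `list()` on a DataFrame yields its column names. The frame below is hypothetical and only stands in for `fi`.

```python
import pandas as pd

# Hypothetical frame with two leading non-measurement columns.
demo = pd.DataFrame({
    "Date": ["d1", "d1"],
    "Hour": [1, 2],
    "Station A": [5.0, 6.0],
    "Station B": [7.0, 8.0],
})

# Slice from the label "Station A" to the last column, then take the column names.
demo_cols = list(demo.loc[:, "Station A":])
print(demo_cols)
```

Because the slice is label-based, it depends on the column order in the file rather than on the column contents.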
Normalisation rescales the values to a specified range, usually between 0 and 1. It is a form of feature scaling, also called min-max scaling.
Define a normalise function that takes a DataFrame, applies normalisation to each column, and returns the result. The optional parameter range, a tuple, sets the lower and upper bounds of the output; the default is (0, 1).
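def normalise(df, range=(0, 1)):
    # Note: the parameter name `range` shadows the built-in inside this function.
    lower, upper = range
    # Column-wise min-max scaling; a constant column yields NaN (0 / 0).
    return lower + (df - df.min()) * (upper - lower) / (df.max() - df.min())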
Apply normalisation with a specified range of 0 to 10.
fi_norm = fi.copy()
fi_norm[cols] = normalise(fi_norm[cols], range=(0,10))
fi_norm.head(10)
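To sanity-check the min-max formula, it can be applied to a tiny made-up frame where the expected result is easy to read off. The data and the names `lo`, `hi`, `toy`, and `flat` are illustrative assumptions; the formula is restated inline so the snippet runs on its own.

```python
import pandas as pd

lo, hi = 0, 10  # target range, matching the call above

# Hypothetical data: each column is scaled independently.
toy = pd.DataFrame({"a": [2.0, 4.0, 6.0], "b": [10.0, 10.0, 30.0]})
scaled = lo + (toy - toy.min()) * (hi - lo) / (toy.max() - toy.min())
print(scaled)

# Edge case: a constant column has max == min, so the division is 0 / 0.
flat = pd.DataFrame({"c": [5.0, 5.0, 5.0]})
flat_scaled = lo + (flat - flat.min()) * (hi - lo) / (flat.max() - flat.min())
print(flat_scaled)
```

Each column's minimum maps to 0 and its maximum to 10. The constant column comes out as NaN rather than raising an error, so it may be worth checking for such columns before scaling.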
Standardisation rescales the data so that each column has a mean of 0 and a standard deviation of 1. It is also known as Z-score scaling.
Define a standardise function that takes a DataFrame, applies standardisation to each column, and returns the result.
def standardise(df):
    # df.std() uses the sample standard deviation (ddof=1) by default.
    return (df - df.mean()) / df.std()
Apply standardisation.
fi_stds = fi.copy()
fi_stds[cols] = standardise(fi_stds[cols])
fi_stds.head(10)
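A small check of the Z-score formula on made-up data follows. It also shows one detail that is easy to miss: pandas computes the sample standard deviation (`ddof=1`) by default, so the result has unit sample standard deviation. Using `ddof=0` is an alternative, shown here purely for comparison; it is not what the function above does.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})

# Same formula as standardise(): sample standard deviation (ddof=1).
z = (toy - toy.mean()) / toy.std()
print(z.mean())  # approximately 0 for each column
print(z.std())   # approximately 1 for each column (sample std)

# Alternative (assumption, for comparison): population standard deviation.
z_pop = (toy - toy.mean()) / toy.std(ddof=0)
print(z_pop.std(ddof=0))  # approximately 1 for each column (population std)
```

The two variants differ only by a constant factor per column, which matters mostly for short series.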