Since our clients were unable to provide us with sample data for Open Banking, we manually generated a Microsoft Excel file containing four months' worth of realistic, varied financial transactions for three current accounts and three credit card accounts. I analysed the documentation and example datasets provided on the Open Banking page and manually produced a sample dataset containing a single transaction to serve as a format template. This template ensured that the data we generated would be directly usable in the web app.
Since each account had a large number of transactions, manually editing and generating a complete dataset would have been cumbersome, so I wrote a Python script to automate the process and produce a complete dataset for the others to work with more efficiently. The script reads the table from a particular named sheet and stores it in a pandas DataFrame, then loads the single-transaction format template. For each row in the DataFrame, it makes a deep copy of the template transaction, populates its fields from the row, and appends it to a list. Finally, the script replaces the single transaction in the sample dataset with this list of transactions and writes a JSON file for that account.
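The steps above can be sketched as follows. The template's field names are illustrative rather than the exact Open Banking schema, and an inline DataFrame stands in for the spreadsheet that the real script reads with `pandas.read_excel`:

```python
import copy
import json

import pandas as pd

# Hypothetical single-transaction template; field names are illustrative,
# not the exact Open Banking schema.
TEMPLATE = {
    "Data": {
        "Transaction": [
            {
                "AccountId": "PLACEHOLDER",
                "Amount": {"Amount": "0.00", "Currency": "GBP"},
                "BookingDateTime": "1970-01-01T00:00:00Z",
                "TransactionInformation": "",
            }
        ]
    }
}

def build_account_json(df: pd.DataFrame, account_id: str) -> dict:
    """Expand the one-transaction template into a full transaction list."""
    transactions = []
    for _, row in df.iterrows():
        # Deep copy so each transaction is an independent object.
        txn = copy.deepcopy(TEMPLATE["Data"]["Transaction"][0])
        txn["AccountId"] = account_id
        txn["Amount"]["Amount"] = f"{row['Amount']:.2f}"
        txn["BookingDateTime"] = row["Date"]
        txn["TransactionInformation"] = row["Reference"]
        transactions.append(txn)
    output = copy.deepcopy(TEMPLATE)
    output["Data"]["Transaction"] = transactions
    return output

# An inline frame stands in for pd.read_excel(path, sheet_name=...).
df = pd.DataFrame({
    "Date": ["2019-01-02T00:00:00Z", "2019-01-03T00:00:00Z"],
    "Reference": ["TESCO STORES", "TFL TRAVEL"],
    "Amount": [-12.5, -2.4],
})
doc = build_account_json(df, "acct-001")
json_text = json.dumps(doc, indent=2)  # the script writes this to a .json file
```

The deep copy matters: appending the same dictionary object repeatedly would make every list entry alias the final row's values.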
The generated dataset approximates what we would receive if we were licensed to use the Open Banking API, which allows us to develop as though we had access to it.
We were tasked with producing a tool that takes the user's transactional data and categorises it; Raghib took the initiative on this task.
From some initial research, we decided to try machine learning (ML) and Natural Language Processing (NLP). We started by writing our own Python code using TensorFlow. Later, we found AzureML to be one of the easiest tools to use, and it helped us build a very basic category-predicting API that could take a string of text (transactional data) and return a number from 1 to 12, each representing a category. This model, however, had very low accuracy, because we had close to no training data: the set we used to train the model contained fewer than 400 hand-written examples, which is vastly insufficient to build an ML model from. Another issue was that a lot of transactional data is simply the name of an organisation, which does not always carry (useful) hints about the category of its merchandise. The keyword "McDonald's" says nothing about food, making it impossible to tell whether this is a purchase at a fast-food restaurant or just someone's name. To solve this, one would have to teach the algorithm the name of every brand, which is inefficient to say the least.
Our next thought was to use pre-existing data analytics tools such as Wikidata and Wikifier. The plan was to use the transaction reference to find the Wikipedia page of the merchant associated with the purchase, and then use that page to extract some sort of category. The resources and help available for these APIs, however, were minimal, and we immediately ran into a problem with the Wikidata API for Python.
During this investigation, however, we realised that the Merchant Category Code (MCC) of each transaction provides essential information, and that the MCC is available in the data files generated from the Open Banking API (Michael, 2018). So, we went through the arduous task of manually categorising every MCC in a CSV file and creating a Python lookup (using JSON and Python dictionaries). With this method, the Django web app reads the MCC from the data and maps it to a number (representing a category) using some simple Python code.
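A minimal sketch of that lookup, using a small hypothetical excerpt of the mapping (the real table covers every MCC and is loaded from our CSV/JSON file, and the category numbers shown are illustrative):

```python
# Hypothetical excerpt of the manually categorised MCC table; the full
# mapping covers every code and is loaded from the project's CSV/JSON file.
MCC_TO_CATEGORY = {
    "5411": 3,   # grocery stores / supermarkets
    "5812": 4,   # eating places and restaurants
    "4111": 6,   # local commuter passenger transport
}

UNCATEGORISED = 0  # fallback for any code we have not classified

def categorise(mcc: str) -> int:
    """Map a Merchant Category Code to our internal category number."""
    return MCC_TO_CATEGORY.get(mcc, UNCATEGORISED)
```

Because this is a plain dictionary lookup, it is fast, deterministic, and needs no training data, unlike the ML approach above.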
Based on users' spending history, we hope to help users predict their account balance from now until the next statement day, i.e. the day users usually receive their statement from the bank. During this period, we warn the user if we predict that they will go into overdraft, so they can adjust their spending accordingly and avoid the overdraft fee. The current algorithm takes the user's spending history from the last month and excludes all fixed costs, which are direct debits such as rent and electricity bills. The remaining spending, such as groceries and transport, is more flexible, and the algorithm assumes the user will have the same average daily spending this month.
The algorithm then goes through all the direct debits on the account. The Open Banking API provides each direct debit's status and previous payment date, and we assume the next payment falls one month after the previous one. The algorithm checks whether the direct debit is active and whether its next payment date falls within the prediction period; if both conditions hold, the payment decreases the predicted balance.
Similarly, the algorithm examines salary payments over the past three months. If the monthly salary is due during the remainder of this month, it appears as a sharp increase in the predicted balance. Combining all these factors, the algorithm outputs a dictionary with dates as keys and predicted remaining balances as values.
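The prediction loop described above might be sketched as follows. The function signature, the tuple shapes for direct debits and salary, and all the figures are simplifications assumed for illustration, not the exact fields the Open Banking data provides:

```python
from datetime import date, timedelta

def predict_balance(balance, daily_flexible_spend, direct_debits,
                    expected_salary, start, statement_day):
    """Return a dict mapping each date up to the statement day to a
    predicted balance.  `direct_debits` is a list of
    (next_payment_date, amount, active) tuples and `expected_salary`
    an optional (date, amount) pair -- simplified stand-ins for the
    fields the Open Banking data provides."""
    predicted = {}
    day = start
    while day <= statement_day:
        balance -= daily_flexible_spend          # assumed constant daily spend
        for next_date, amount, active in direct_debits:
            if active and day == next_date:      # debit falls in the window
                balance -= amount
        if expected_salary and day == expected_salary[0]:
            balance += expected_salary[1]        # sharp increase on payday
        predicted[day] = round(balance, 2)
        day += timedelta(days=1)
    return predicted

forecast = predict_balance(
    balance=500.0,
    daily_flexible_spend=12.0,                   # last month's average, minus fixed costs
    direct_debits=[(date(2019, 3, 25), 400.0, True)],   # e.g. rent
    expected_salary=(date(2019, 3, 29), 1200.0),
    start=date(2019, 3, 20),
    statement_day=date(2019, 3, 31),
)
# Dates whose predicted balance is negative would trigger the overdraft warning.
overdrawn = [d for d, b in forecast.items() if b < 0]
```

In this example the rent payment on the 25th pushes the balance close to zero, the account dips into overdraft on the 28th, and the salary on the 29th restores it, so the app would warn the user about the 28th in advance.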
We also looked into Prophet, a time-series forecasting tool developed by Facebook. However, because our data is synthetically generated, it does not follow typical users' spending patterns and is not large enough; the forecasts it produced were inaccurate, so we chose not to use it.
Most credit card spending is irregular, which makes predictions based on spending history less meaningful. Instead, our algorithm takes into account factors such as the card's promotional period and its maximum interest-free purchase length in days. During the promotional period, users pay no interest; the maximum interest-free purchase length means that users do not pay interest on a purchase until that many days after making it.
If the user has money to pay off the credit card bill, they do not have to pay the portion that does not yet incur interest; instead, they can keep that money aside to earn interest elsewhere and make the best of it. Our algorithm calculates the amount the user should pay to avoid the high interest fee: the sum of all transactions made more than the maximum interest-free purchase length ago, plus any interest fees already incurred.
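That calculation can be sketched as follows; the data shapes and the 56-day interest-free window are assumptions for illustration (56 days is a common UK figure, not something our data dictates):

```python
from datetime import date, timedelta

def amount_to_pay(transactions, interest_fees, interest_free_days, today):
    """Amount to pay now to avoid the high interest fee: purchases older
    than the interest-free window, plus interest fees already charged.
    `transactions` is a list of (purchase_date, amount) pairs -- a
    simplified stand-in for the real transaction records."""
    cutoff = today - timedelta(days=interest_free_days)
    due = sum(amount for d, amount in transactions if d < cutoff)
    return round(due + sum(interest_fees), 2)

pay = amount_to_pay(
    transactions=[(date(2019, 2, 1), 80.0),    # outside the 56-day window: due
                  (date(2019, 4, 10), 30.0)],  # still interest-free: can wait
    interest_fees=[1.5],
    interest_free_days=56,                     # assumed window length
    today=date(2019, 4, 20),
)
```

Here only the February purchase has aged past the interest-free window, so the suggested payment is that purchase plus the interest fee already charged, while the April purchase can safely be deferred.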