Data Generation

Since our clients were unable to provide sample Open Banking data, we manually created a Microsoft Excel file containing four months' worth of plausible and varied financial transactions for three current accounts and three credit card accounts. I studied the documentation and example datasets on the Open Banking page, then hand-wrote a sample dataset containing a single transaction to serve as a format template. The template was written so that the data it provides would be usable in the web app.

Since each account had a large number of transactions, editing and generating a complete dataset by hand would have been cumbersome, so I wrote a Python script to produce a complete dataset for the rest of the team to work with. The script reads the table from a named sheet and stores it in a pandas dataframe. It then reads the format template and copies it into the structure that will become the JSON output file. Before writing the file, it changes specific fields and populates the transactions according to the rows stored in the dataframe. For each row, a deep copy of the template transaction is filled in and appended to a list of transactions. Finally, the script replaces the single transaction in the sample dataset with this list and writes a JSON file for that account.
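The following is a minimal sketch of the approach described above, not the actual script. The template fields are loosely modelled on the Open Banking transaction examples, and the row keys (date, description, amount), the account identifier and the function name are assumptions made for illustration. The rows are given inline here; in practice they could come from a dataframe, for example via pd.read_excel(path, sheet_name=...).to_dict("records").

```python
import copy
import json

# Hypothetical one-transaction template; the real template's fields may differ.
TEMPLATE = {
    "Data": {
        "Transaction": [
            {
                "AccountId": "",
                "TransactionId": "",
                "Amount": {"Amount": "0.00", "Currency": "GBP"},
                "CreditDebitIndicator": "Debit",
                "BookingDateTime": "",
                "TransactionInformation": "",
            }
        ]
    }
}


def build_account_dataset(template, rows, account_id):
    """Return a copy of `template` whose transaction list is built from `rows`."""
    dataset = copy.deepcopy(template)
    blank = dataset["Data"]["Transaction"][0]
    transactions = []
    for index, row in enumerate(rows, start=1):
        # Deep copy so the nested "Amount" dict is not shared between entries.
        txn = copy.deepcopy(blank)
        txn["AccountId"] = account_id
        txn["TransactionId"] = f"{account_id}-{index:04d}"
        txn["Amount"]["Amount"] = f"{abs(row['amount']):.2f}"
        txn["CreditDebitIndicator"] = "Credit" if row["amount"] > 0 else "Debit"
        txn["BookingDateTime"] = row["date"]
        txn["TransactionInformation"] = row["description"]
        transactions.append(txn)
    # Replace the single template transaction with the populated list.
    dataset["Data"]["Transaction"] = transactions
    return dataset


# Hypothetical spreadsheet rows (negative amounts are spending).
rows = [
    {"date": "2018-01-03T09:15:00+00:00", "description": "Supermarket", "amount": -23.5},
    {"date": "2018-01-25T00:00:00+00:00", "description": "Salary", "amount": 1800.0},
]
dataset = build_account_dataset(TEMPLATE, rows, "current-1")
print(json.dumps(dataset, indent=2))
```

A shallow copy would leave every transaction pointing at the same nested "Amount" dictionary, so changing one amount would change them all; this is why a deep copy is used for each entry.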

The generated dataset is presumably close to what we would receive if we were licensed to use the Open Banking API. This allows us to develop the rest of the project as though we had access to the API.

Data Categorisation

We were tasked with producing a tool that takes the user’s transactional data and categorises it. Raghib took the initiative. Following some initial research, we first tried machine learning (ML) and natural language processing (NLP).

We started with our own code in Python using TensorFlow. Later, we found AzureML to be one of the easiest tools to use, and it helped us build a very basic category-predicting API that takes a string of text (transactional data) and returns a number from 1 to 12, each representing a category. This model, however, has very low accuracy because we had almost no training data: the training set contained fewer than 400 hand-written samples, which is vastly insufficient for building an ML model. Another issue was that much transactional data is simply the name of an organisation, which does not always contain useful hints about the category of its merchandise. The keyword “McDonald’s” says nothing about food, making it impossible to predict whether this is a purchase at a fast food restaurant or just someone’s name. To solve this, one would have to teach the algorithm the name of every brand, which is inefficient to say the least.

Our next thought was to use pre-existing data analytics tools such as Wikidata and Wikifier. The plan was to use the transaction reference to find the Wikipedia page of the merchant associated with the purchase, and then extract some sort of category from that page. The resources and help available for these APIs, however, were minimal, and we immediately ran into a problem with the Wikidata API for Python.

During this investigation, however, we realised that the Merchant Category Code (MCC) of each transaction provides the essential information, and that the MCC is available in the data files generated from the Open Banking API (Michael, 2018). So we went through the arduous task of manually categorising every MCC in a CSV file and creating a Python lookup (using JSON and Python dictionaries). With this method, the Django web app reads the MCC from the data and maps it to a number (representing a category) using some simple Python code.
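The lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code used in the web app: the category numbers and names, the three-entry excerpt of the mapping, the "MerchantDetails"/"MerchantCategoryCode" field path and the function name are all hypothetical.

```python
import json

# Hypothetical excerpt of the hand-built MCC-to-category mapping, stored as JSON.
MCC_TO_CATEGORY_JSON = '{"5411": 1, "5814": 2, "4111": 3}'

# Hypothetical category names; 0 is used here for "no category found".
CATEGORY_NAMES = {0: "Uncategorised", 1: "Groceries", 2: "Eating out", 3: "Transport"}

lookup = json.loads(MCC_TO_CATEGORY_JSON)


def categorise(transaction, lookup, default=0):
    """Map a transaction to a category number via its MCC, if it has one."""
    mcc = transaction.get("MerchantDetails", {}).get("MerchantCategoryCode")
    # JSON object keys are strings, so normalise the MCC before the lookup.
    return lookup.get(str(mcc), default)


fast_food = {"MerchantDetails": {"MerchantCategoryCode": "5814"}}
no_merchant = {"TransactionInformation": "Transfer to savings"}

print(CATEGORY_NAMES[categorise(fast_food, lookup)])
print(CATEGORY_NAMES[categorise(no_merchant, lookup)])
```

Because the mapping is a plain dictionary, a transaction with an unknown or missing MCC simply falls back to the default category instead of raising an error.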

Current Account Balance Prediction

Based on users’ spending history, we aim to predict their account balance from today until the next statement day, which is the day users usually receive their statement from the bank. If we predict that the user will go into overdraft during this period, we warn them so that they can adjust their spending and avoid the overdraft fee.

The current algorithm takes the user’s spending history from the previous month. It first excludes all fixed costs, which are direct debits such as rent and electricity bills. The remaining spending, such as groceries and transport, is more flexible, and the algorithm assumes that the user will have the same daily spending this month. The algorithm then goes through all the direct debits on the account. The Open Banking API provides the status and previous payment date of each direct debit, and we assume that the next payment happens one month after the previous one. The algorithm checks whether the direct debit is active and whether the next payment date falls within the prediction period. If both conditions are true, the predicted balance decreases on that date.

Similarly, the algorithm looks at the salary payments of the past three months. If the monthly salary is expected during the rest of this month, it is reflected as a sharp increase in the predicted balance. Combining all these factors, the algorithm outputs a dictionary with dates as keys and predicted remaining balances as values.
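A simplified sketch of this prediction is given below. It is an illustration, not the project's implementation: the function and field names are hypothetical, the daily flexible spending is passed in as a precomputed number, and the expected salary date is assumed to have been worked out beforehand from the past three months.

```python
import calendar
from datetime import date, timedelta


def one_month_after(d):
    """Same day next month, clamped to the last day of shorter months."""
    year = d.year + d.month // 12
    month = d.month % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def predict_balance(balance, today, statement_day, daily_flexible_spend,
                    direct_debits, salary=None):
    """Return {date: predicted balance} for each day after today up to statement_day."""
    predictions = {}
    day = today + timedelta(days=1)
    while day <= statement_day:
        # Flexible spending is assumed to be the same every day.
        balance -= daily_flexible_spend
        for debit in direct_debits:
            # Assumption from the text: next payment is one month after the last.
            if debit["active"] and one_month_after(debit["previous_payment"]) == day:
                balance -= debit["amount"]
        if salary is not None and salary["expected_date"] == day:
            balance += salary["amount"]
        predictions[day] = round(balance, 2)
        day += timedelta(days=1)
    return predictions


def first_overdraft_day(predictions):
    """Earliest date with a negative predicted balance, or None."""
    return next((d for d, b in sorted(predictions.items()) if b < 0), None)


# Hypothetical example values.
debits = [{"active": True, "previous_payment": date(2018, 4, 12), "amount": 80.0}]
predictions = predict_balance(100.0, date(2018, 5, 10), date(2018, 5, 13), 10.0, debits)
warning_day = first_overdraft_day(predictions)
print(predictions)
print("Overdraft warning for:", warning_day)
```

Returning a date-keyed dictionary keeps the result easy to plot and makes the overdraft check a simple scan for the first negative value.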

We also looked into Prophet, a tool developed by Facebook for forecasting time series data. However, our data is generated: it does not follow a typical user’s spending pattern and the dataset is not large enough. The resulting forecasts were not accurate, so we chose not to use it.

Credit Card Prediction

Most credit card spending is irregular, which makes it less meaningful to predict from the spending history. Our algorithm instead takes into account factors such as the promotion period of a credit card and its maximum purchase interest-free length (in days). During the promotion period, users do not need to pay interest. The maximum purchase interest-free length is the period after a purchase made with the credit card during which users do not have to pay interest on it.

Even if the user has enough money to pay off the whole credit card bill, they do not have to pay the part that is not yet incurring interest. Instead, they could keep that money in savings to earn some interest and so make the best of their money. Our algorithm calculates the amount that users should pay to avoid the high interest fee. That amount is the sum of all transactions made more than the maximum purchase interest-free length (in days) ago, plus all the interest fees incurred.
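The calculation can be sketched as below. This is a rough illustration under assumptions, not the project's code: the function and field names are hypothetical, and the sketch assumes that while the promotion period is running no purchase needs to be paid off early, so only fees already charged are returned.

```python
from datetime import date, timedelta


def amount_to_pay(transactions, today, interest_free_days,
                  interest_fees=0.0, promotion_end=None):
    """Amount to pay now so that no purchase starts (or keeps) accruing interest."""
    # Assumption: during the promotion period purchases incur no interest.
    if promotion_end is not None and today <= promotion_end:
        return round(interest_fees, 2)
    # Purchases made before this date are past their interest-free period.
    cutoff = today - timedelta(days=interest_free_days)
    overdue = sum(t["amount"] for t in transactions if t["date"] < cutoff)
    return round(overdue + interest_fees, 2)


# Hypothetical example: a 56-day interest-free period puts the cutoff at 5 May 2018.
transactions = [
    {"date": date(2018, 4, 20), "amount": 120.0},  # older than the cutoff
    {"date": date(2018, 6, 10), "amount": 45.0},   # still interest free
]
due = amount_to_pay(transactions, date(2018, 6, 30), 56, interest_fees=3.5)
due_in_promotion = amount_to_pay(transactions, date(2018, 6, 30), 56,
                                 promotion_end=date(2018, 12, 31))
print(due, due_in_promotion)
```

In this example only the April purchase and the fee are counted, so the June purchase can stay unpaid for now without costing the user anything.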