The first step we took when starting this project was communicating with our clients, we asked as many questions of them as possible to get a sense of the project. This initial communication was marred by a few misunderstandings, most notably confusion between our teams and the Master project teams also working for the ANSSC, however we had the confusion resolved in good time and created a good understanding of our requirements. We also utilised the advice and guidance of Dean Mohamedally and Joseph Greener, who helped us understand the big picture of our project and begin to see a way forward.
In terms of data collection via forms, we looked at platforms such as Google forms and Microsoft forms, as well as other tools such as Excel. Different data collection methods suit different needs, so for more general data, we decided that Microsoft forms would be the most appropriate. For more complex financial information, a form in an excel spreadsheet provides the flexibility we need whilst still providing ample familiarity to users.
Our initial research focused heavily on few key aspects. First was how we were going to implement the PDF extractor tool. Following the advice of Joseph Greener, we looked at the LDA and t-SNE algorithms. This lead us to a project written by a previous UCL student, David Rudolph, which could analyse an uploaded PDF. The project consisted of a Python program and it's corresponding dissertation. A screenshot of the python program is shown below, having uploaded one of our PDF files.
Eventually we decided to move away from these algorithms, believing them to be overkill for our situation. We instead, with some assistance from Microsoft, moved to looking at the Azure cognitive services as a method for analysing these pdf files.
Yansong in particular read a lot of documents relating to using machine learning to analyse and extract text including:
After moving away from LDA and t-SNE, we began looking at services that could perform the tasks we needed, after looking at tools such as Google's OCR service, we came across Azure's Cognitive Services. This service can be queried via an API and seems to contain many of the features we are looking for, such as the Ink Recogniser and Text Analytics tools.