Research

Related project review

One of last year's projects, Review Analysis for Ocado [1], is very similar to ours. It took text as input and produced a numerical prediction, which is essentially what we are trying to do. However, the context of that problem is different from ours, so we could not draw directly on that project.


Citation count prediction has been a trending topic, as all publishing firms face this problem. Many researchers have written papers looking for the quickest and most efficient way to do it. One of the papers we used was "Citation Count Prediction: Learning to Estimate Future Citations for Literature" [2]. It gave us a starting point and helped us narrow down the factors we should use from our dataset to train our model. The paper also recommended some machine learning models, which we tried on our dataset. We decided to adopt three of them, as they gave the best performance on our dataset.

OUR SOLUTION

After a lot of research, we decided to start structuring our solution.
The programming language for the backend was Python, as this was one of the requirements stated by the client. All the team members were also comfortable working with Python.

We decided to first clean our dataset by removing any extra columns and checking the remaining columns for null values. This lets us look only at the important columns. To clean up the data we used a few Python libraries: NumPy, pandas and Matplotlib. We chose pandas and NumPy because together they let us work with multi-dimensional arrays and dataframes. We loaded the dataset into dataframes to clean it up. We also used Matplotlib to draw graphs so that we could see what the columns looked like.
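A minimal sketch of this cleaning step is given here, assuming a dataset with text columns and a citation count. The column names (`title`, `abstract`, `citations`, `internal_id`) and the helper `clean_dataset` are hypothetical and only illustrate the idea of dropping extra columns and handling null values.

```python
import numpy as np
import pandas as pd


def clean_dataset(df, keep_columns):
    """Keep only the columns of interest and drop rows with missing values."""
    # Discard every column that is not needed for training.
    cleaned = df[keep_columns].copy()
    # Count the nulls in each remaining column before removing them.
    null_counts = cleaned.isna().sum()
    # Remove rows that have a null in any of the kept columns.
    cleaned = cleaned.dropna().reset_index(drop=True)
    return cleaned, null_counts


# Hypothetical toy data; the real dataset's columns are not shown here.
raw = pd.DataFrame({
    "title": ["Paper A", "Paper B", None, "Paper D"],
    "abstract": ["Text a", None, "Text c", "Text d"],
    "citations": [12, 3, 7, np.nan],
    "internal_id": [101, 102, 103, 104],  # an "extra" column to discard
})

cleaned, null_counts = clean_dataset(raw, ["title", "abstract", "citations"])
print(null_counts)
print(cleaned)
```

A column's distribution could then be inspected with, for example, `cleaned["citations"].hist()`, which pandas draws through Matplotlib.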

We used AllenAI SPECTER vectors to transform our input data so that we could pass it through our models for prediction. SPECTER applies NLP to the text and returns it in vector form, which is then used by the prediction models.
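The report does not show how SPECTER was invoked, so the following is only a sketch of one common route: loading the `allenai/specter` checkpoint through the Hugging Face `transformers` library. The input fields (`title`, `abstract`) and the helper names are assumptions made for illustration.

```python
def build_specter_inputs(papers, sep_token):
    """Join each paper's title and abstract with the tokenizer's separator."""
    return [p["title"] + sep_token + (p.get("abstract") or "") for p in papers]


def embed_papers(papers, model_name="allenai/specter"):
    """Return one embedding vector per paper as a NumPy array."""
    # Imported here so the helper above can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    texts = build_specter_inputs(papers, tokenizer.sep_token)
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # The first token's hidden state is used as the document embedding.
    return output.last_hidden_state[:, 0, :].numpy()


# Hypothetical input record; only the text formatting is exercised here.
papers = [{"title": "Example title", "abstract": "Example abstract text."}]
print(build_specter_inputs(papers, "[SEP]"))
```

Calling `embed_papers(papers)` downloads the model on first use and returns one row per paper, which can be passed directly to a regression model as its feature matrix.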

There are many machine learning libraries in Python, for example TensorFlow, scikit-learn and PyTorch. We chose to work with scikit-learn because it is one of the most robust libraries and gives us a lot of models to choose from. Its models also take in vectors and give a numerical prediction. Moreover, it has a good selection of tools for checking how well a model performs, for example metrics for regression and classification.
There were a number of models we could choose from, but we chose three because they gave an acceptable R2 score (the coefficient of determination, a regression score). We chose linear regression, random forest and XGBoost. These models solve the same problem, but with different approaches and accuracy levels. We decided between the models by looking at R2: the higher the R2 value, the better the model. We tried a few other models, but with our small dataset we only got acceptable scores from these three.
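The R2-based comparison can be sketched as follows. This is an illustration on synthetic data, not the project's actual training code. XGBoost ships as a separate package, so scikit-learn's `GradientBoostingRegressor` is used here as a stand-in for it. The function name `compare_models`, the split ratio and the data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=0):
    """Fit each candidate on a training split and return its held-out R2."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    candidates = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100,
                                               random_state=seed),
        # Stand-in for XGBoost, which is a separate package.
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data: 200 "documents" with 8-dimensional vectors
# and a target that depends linearly on them, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.arange(1.0, 9.0) + rng.normal(scale=0.1, size=200)

scores = compare_models(X, y)
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: R2 = {score:.3f}")
print("best:", best)
```

Scoring on a held-out split rather than on the training data matters most for a small dataset, where tree-based models can otherwise look better than they are.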

We used the Flask library in Python to create the frontend web app. We chose Python because it is easier to connect the frontend and backend when they are written in the same language.
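To illustrate how a Python frontend can call the prediction code directly, here is a minimal Flask sketch. The route `/predict`, the JSON field `text` and the placeholder `predict_citations` are hypothetical; the real app's routes and model wiring are not described in this report.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_citations(text):
    """Hypothetical placeholder for the backend model; returns a dummy number."""
    return float(len(text.split()))


@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted text from the JSON body, tolerating a bad payload.
    payload = request.get_json(silent=True) or {}
    text = payload.get("text", "")
    if not text:
        return jsonify({"error": "missing 'text'"}), 400
    # The backend is plain Python, so it is called as an ordinary function.
    return jsonify({"prediction": predict_citations(text)})


# Exercise the route in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"text": "a short abstract"})
print(response.get_json())
```

Calling `app.run()` instead would start Flask's development server so the route can be reached from a browser or an HTTP client.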

Technical Decisions

Technology           Decision
Frontend             Flask (Python)
Backend              Python
Text transformation  AllenAI SPECTER model
Prediction model     scikit-learn
Data sanitisation    NumPy, pandas, Matplotlib (Python libraries)

References

[1] - http://students.cs.ucl.ac.uk/2018/group24/index.html
[2] - http://keg.cs.tsinghua.edu.cn/jietang/publications/CIKM11-Yan-Citation-Count-Prediction.pdf