System Architecture

File reader

The file reader is the only component that accepts input from the user. It parses the file supplied by the user, extracts its contents, and passes that data to the backend. Together with the display, it forms the main user interface.

Text transformation via API

The text transformation is performed by a pre-trained model, AllenAI's SPECTER. The model takes a paper's title and abstract as input and outputs a SPECTER vector of 768 dimensions. It is pre-trained on a powerful signal of document-level relatedness: the citation graph. Using this model for natural language processing (NLP) suits our system because building a comparable model from the ground up is not feasible for us, and SPECTER does not require task-specific fine-tuning.
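As an illustration of this step, the following is a minimal sketch of requesting SPECTER vectors over HTTP. It assumes the public SPECTER embedding endpoint and its JSON layout (a list of records with `paper_id`, `title`, and `abstract` in the request, and a `preds` list with an `embedding` per paper in the response); both should be checked against the service documentation before use. The function names `build_payload`, `http_post_json`, and `embed` are hypothetical and not taken from our source code.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple
from urllib import request

# Assumed endpoint of the public SPECTER embedding service; verify before use.
SPECTER_URL = "https://model-apis.semanticscholar.org/specter/v1/invoke"
EMBEDDING_DIM = 768  # vector size stated in the text

Paper = Tuple[str, str, str]  # (paper_id, title, abstract)


def build_payload(papers: Iterable[Paper]) -> List[Dict[str, str]]:
    """Turn (paper_id, title, abstract) triples into the JSON records to send."""
    return [
        {"paper_id": pid, "title": title, "abstract": abstract}
        for pid, title, abstract in papers
    ]


def http_post_json(url: str, payload: list) -> dict:
    """POST a JSON payload and decode the JSON response (standard library only)."""
    data = json.dumps(payload).encode("utf-8")
    req = request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embed(
    papers: Iterable[Paper],
    post: Callable[[str, list], dict] = http_post_json,
) -> Dict[str, List[float]]:
    """Return a mapping from paper_id to its SPECTER vector.

    The response layout ("preds" / "embedding") is an assumption about the API.
    """
    response = post(SPECTER_URL, build_payload(papers))
    vectors: Dict[str, List[float]] = {}
    for pred in response["preds"]:
        vector = pred["embedding"]
        if len(vector) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}"
            )
        vectors[pred["paper_id"]] = vector
    return vectors
```

Passing the transport function as the `post` argument keeps the network call replaceable, so the parsing logic can be exercised without contacting the service.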

Dataset

Our dataset originates from the Microsoft Academic Graph (MAG). The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. This graph is used to power experiences in Bing, Cortana, Word, and Microsoft Academic.

We run all the papers in this dataset through the same text transformation to obtain a set of SPECTER vectors and their corresponding citation counts. Having a dataset that pairs each vector with a citation count allows us to build a model that predicts the citation count from the vector.
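The pairing of vectors and citation counts can be sketched as follows. This is a minimal illustration under assumptions: the two dictionaries keyed by paper identifier and the function name `build_training_set` are hypothetical, and papers that lack a citation count are simply skipped here, which may differ from how our pipeline handles them.

```python
from typing import Dict, List, Sequence, Tuple


def build_training_set(
    vectors: Dict[str, Sequence[float]],
    citation_counts: Dict[str, int],
) -> Tuple[List[List[float]], List[int]]:
    """Align SPECTER vectors with citation counts by paper id.

    Returns (X, y): X holds one feature vector per paper and y holds the
    citation count of the same paper at the same index.
    """
    X: List[List[float]] = []
    y: List[int] = []
    # Sort the ids so the row order is reproducible between runs.
    for paper_id in sorted(vectors):
        if paper_id not in citation_counts:
            continue  # assumption: papers without a known count are dropped
        X.append(list(vectors[paper_id]))
        y.append(citation_counts[paper_id])
    return X, y
```

The resulting `X` and `y` have the row-per-sample shape that common regression libraries expect as training input.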

Prediction model

Our system contains three separate models: linear regression, random forest, and XGBoost. These models differ in performance, and the user can choose which one to use.

The models are trained, then stored using Python's pickle module. When the web application starts, the models are unpickled and loaded. Constantly updating the models is not feasible because training takes a considerable amount of time. The implementation is structured so that models can easily be updated, removed, or added in the source code.
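The store-then-load cycle can be sketched as below. This is an illustration rather than our actual code: the helper names `save_models` and `load_models`, the one-file-per-model layout, and the `.pkl` extension are assumptions. Any picklable trained model object could be stored this way.

```python
import pickle
from pathlib import Path
from typing import Any, Dict, Union


def save_models(models: Dict[str, Any], directory: Union[str, Path]) -> None:
    """Pickle each trained model to <directory>/<name>.pkl."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, model in models.items():
        with open(directory / f"{name}.pkl", "wb") as fh:
            pickle.dump(model, fh)


def load_models(directory: Union[str, Path]) -> Dict[str, Any]:
    """Load every pickled model in the directory, keyed by file name.

    Only unpickle files you created yourself: loading a pickle can execute
    arbitrary code.
    """
    models: Dict[str, Any] = {}
    for path in sorted(Path(directory).glob("*.pkl")):
        with open(path, "rb") as fh:
            models[path.stem] = pickle.load(fh)
    return models
```

Calling `load_models` once at application start yields a dictionary keyed by model name, from which the user's choice can be looked up. Under this layout, adding or removing a model amounts to adding or removing one file.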

Display

The display, as its name suggests, is responsible for presenting data and options to the user. It is the only channel through which output reaches the user.