
Posted on December 30, 2018

Existing Solutions

Our task consists of predicting the rating (from 1 star to 5 stars) and estimating the helpfulness percentage (any number between 0 and 100) from a comment text. The first is a classification problem, where you take an object and assign it to one of several categories. The second is closer to a regression problem, where you predict a continuous value, although the value must lie between 0% and 100%. Besides, most reviews don’t have many votes, so ratios such as 0, 1, 1/2 and 1/3 appear a lot, and it’s possible to treat this as a classification problem as well. Since training data (which contains text, rating and helpfulness) will be used, both tasks are supervised learning. Thus, our project focuses on supervised machine learning for classification and possibly regression, and of course natural language processing. [1]
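
To make the two framings of the helpfulness task concrete, here is a minimal sketch. The function names, the vote-count inputs and the choice of four equal-width bins are all assumptions made for illustration; the original data format is not specified here.

```python
def helpfulness_percent(helpful_votes, total_votes):
    """Regression target: share of voters who found the review helpful (0 to 100)."""
    if total_votes == 0:
        return None  # no votes, so the percentage is undefined
    return 100.0 * helpful_votes / total_votes


def helpfulness_class(percent, n_bins=4):
    """Classification target: index of the equal-width bin the percentage falls in.

    n_bins=4 is an arbitrary choice for this sketch.
    """
    bin_width = 100.0 / n_bins
    return min(int(percent // bin_width), n_bins - 1)  # 100% belongs to the top bin


# Made-up (helpful, total) vote counts, including the common ratios 1/2 and 1/3.
for helpful, total in [(0, 3), (1, 3), (1, 2), (4, 4)]:
    percent = helpfulness_percent(helpful, total)
    print(helpful, total, percent, helpfulness_class(percent))
```

With few votes per review, only a handful of distinct percentages occur, which is why binning them into classes is a plausible alternative to regression.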

Since the above questions are very typical in machine learning, a lot of research has been done on them and many great solutions exist.

Natural Language Processing (NLP)

Bag of Words

In terms of natural language processing, the most straightforward approach is “bag of words”, which simply assigns each unique word an index and converts a piece of text into a vector that records how often each word occurs. [2] This is the best way for beginners to get started, but its performance is limited by the complexity of human language. For instance, when we say “good”, “helpful” or “nice” we usually mean something positive, but if we say “not good” or “can’t say it’s helpful”, the meaning is reversed. More advanced techniques are needed to deal with this kind of situation.
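
A minimal from-scratch sketch of the idea is shown below. It assumes lowercase, whitespace-separated tokens, and the function names `build_vocabulary` and `vectorize` are made up for this example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign each unique lowercase word an index, in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words missing from the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


reviews = ["good product", "not good"]  # made-up example reviews
vocab = build_vocabulary(reviews)
vectors = [vectorize(review, vocab) for review in reviews]
print(vocab)    # {'good': 0, 'product': 1, 'not': 2}
print(vectors)  # [[1, 1, 0], [1, 0, 1]]
```

Both vectors have the same count for “good”, which illustrates why word counts alone cannot tell “good” from “not good”.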


A better solution is n-grams. Instead of splitting text into individual words, it splits the text into sequences of n consecutive words -- so “not good” will not become “not” and “good” and potentially confuse the algorithm. However, if we just use all the possible combinations of words, it will greatly slow down the training algorithm, and most of the resulting n-grams won’t be useful (like “would not” or “not say”). So an important part of the n-gram approach is to predict the occurrence of the next word and use the result to decide whether the actual next word should be included in an n-gram or not. [3]
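
As a rough illustration, the extraction step can be sketched as follows. This covers only the splitting into consecutive words; it does not attempt the prediction-based selection described above. The function name `ngrams` is an assumption for this example.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("this is not good", 2))  # ['this is', 'is not', 'not good']
```

Here “not good” survives as a single feature, while less useful pairs such as “this is” are produced as well, which is why some form of selection is needed.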

Term Frequency-Inverse Document Frequency (TF-IDF)

TF means term frequency, while TF-IDF means term frequency times inverse document frequency. In a large text corpus, some words will be very common (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms. In order to re-weight the count features into floating point values suitable for use by a classifier, it is very common to use the TF-IDF transform. [20]
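
The re-weighting can be sketched with the textbook formula tf × log(N / df), where N is the number of documents and df is the number of documents containing the word. This formula is an assumption for illustration; library implementations such as scikit-learn’s apply smoothing and normalisation by default, so their numbers will differ.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = {}
    for words in tokenized:
        for word in set(words):
            df[word] = df.get(word, 0) + 1

    weights = []
    for words in tokenized:
        doc_weights = {}
        for word in set(words):
            tf = words.count(word)  # term frequency within this document
            doc_weights[word] = tf * math.log(n_docs / df[word])
        weights.append(doc_weights)
    return weights


docs = ["the battery is great", "the screen is dim"]  # made-up reviews
tfidf_weights = tf_idf(docs)
for doc_weights in tfidf_weights:
    print(doc_weights)
```

Words that occur in every document (“the”, “is”) get a weight of zero, while words unique to one document keep a positive weight.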

Word Embedding

Word Embedding is a technique that is usually applied in Neural Networks. Briefly speaking, it converts a single word into a vector so that additional information can be represented. In the area of NLP, that information is usually how close the word is to other words in meaning. E.g., “great” and “good” will have similar vector representations. [15] This can help Machine Learning models to “understand” human language better and produce more accurate predictions.
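
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration; real embeddings are learned from data and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar, near -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors, NOT taken from any trained model.
embeddings = {
    "great": [0.9, 0.8, 0.1],
    "good": [0.8, 0.9, 0.2],
    "terrible": [-0.9, -0.7, 0.1],
}

print(cosine_similarity(embeddings["great"], embeddings["good"]))      # close to 1
print(cosine_similarity(embeddings["great"], embeddings["terrible"]))  # negative
```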

Machine Learning

NLP solutions are used to convert text into numeric representations that can be used by other code. It is the machine learning algorithm that handles the actual training and application of the model.

Linear Regression

Linear Regression is one of the most basic Machine Learning models. It works by assuming a linear relationship between input and output [4] and using gradient descent to minimize a cost function (i.e., to make the outputs of the model differ as little as possible overall from the expected values). [5][7]
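
A minimal sketch of this procedure for a single input feature is given below. It assumes mean squared error as the cost function; the learning rate, step count and data are arbitrary choices for the example.

```python
def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = fit_linear(xs, ys)
print(w, b)  # should approach 2 and 1
```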

Logistic Regression

Logistic Regression is very similar to Linear Regression. The main difference is that the output of Logistic Regression is limited to the range 0 to 1 by applying a logistic function, so it can be read as the probability of a class. [14] The cost function and gradient descent are also adapted to suit the new function, but they still work together to find the parameters that reduce the cost on the training samples to a minimum. [6]
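
The sketch below shows a binary version with a single feature, assuming cross-entropy as the cost function. Predicting five star ratings would need a multi-class extension, which is not shown here; the data and hyperparameters are made up.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, learning_rate=0.5, steps=1000):
    """Gradient descent on the mean cross-entropy loss for one feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


xs = [-2.0, -1.0, 1.0, 2.0]  # made-up feature values
labels = [0, 0, 1, 1]        # 0 = negative class, 1 = positive class
w, b = fit_logistic(xs, labels)
print(sigmoid(w * 1.5 + b))   # probability of class 1 for x = 1.5
print(sigmoid(w * -1.5 + b))  # probability of class 1 for x = -1.5
```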

Decision Tree and Random Forest

Decision Tree is one of the most basic non-linear models. As its name suggests, it creates a binary tree as the estimator.

The way it works is very simple: it starts at the root node with all samples at hand, then it scans through all features and finds the one that best separates the samples. Then it splits the data set using this feature and recursively repeats this process on each part. [16]
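
The search for a single split can be sketched as follows. Gini impurity is used as the measure of how well a split separates the samples; this is one common criterion and an assumption here, as are the feature meanings and data.

```python
def gini(labels):
    """Gini impurity: 0 when all labels agree, higher when they are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Try every feature and threshold; return (impurity, feature, threshold) of the best."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Made-up data: feature 0 = count of "good", feature 1 = review length in words.
rows = [[2, 10], [3, 40], [0, 12], [0, 35]]
labels = ["positive", "positive", "negative", "negative"]
print(best_split(rows, labels))  # (0.0, 0, 0)
```

A full tree would apply `best_split` again to the left and right subsets until a stopping rule is met.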

However, a single Decision Tree on its own can easily overfit, since it doesn’t know when to stop growing. Random Forest solves this issue by training multiple Decision Trees and combining their results to make predictions. Since the many trees keep each other from overfitting, Random Forest is known as a very easy-to-use model that doesn’t need much hyperparameter tuning. [17]
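
The combining step can be illustrated with a majority vote. In the sketch below the three “trees” are hand-written stand-ins rather than trained models, and the feature names are hypothetical; a real Random Forest trains each tree on a random sample of the data and features.

```python
from collections import Counter


# Hand-written stand-ins for trained trees; each looks at a different feature.
def tree_a(review):
    return "positive" if review["good_count"] > 0 else "negative"


def tree_b(review):
    return "positive" if review["not_count"] == 0 else "negative"


def tree_c(review):
    return "positive" if review["length"] > 20 else "negative"


def forest_predict(trees, review):
    """Return the label that most trees voted for."""
    votes = Counter(tree(review) for tree in trees)
    return votes.most_common(1)[0][0]


review = {"good_count": 1, "not_count": 0, "length": 8}  # made-up features
print(forest_predict([tree_a, tree_b, tree_c], review))  # positive (2 votes to 1)
```

Here `tree_c` alone would get this review wrong, but the other two outvote it.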

Gradient Boosting Machine

Gradient Boosting Machine is a very powerful but computationally expensive model. It reduces a loss function by training many “weak learners” one after another, each of which corrects part of the error left by the ones before it. However, since each weak learner is extremely weak on its own, a large number of them is required for the model to perform well. Moreover, it needs to be carefully tuned to prevent overfitting. [18]
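
A minimal sketch of the idea for regression is shown below. It assumes squared error as the loss, in which case each weak learner is simply fitted to the current residuals, and it uses one-threshold “stumps” on a single feature as weak learners. The data, names and hyperparameters are made up, and the input is assumed to contain at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Weak learner: one threshold on x, predicting the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_learners=50, learning_rate=0.1):
    """Start from the mean of ys, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    learners = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in learners)

    for _ in range(n_learners):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        learners.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4]
ys = [10.0, 20.0, 60.0, 80.0]  # made-up helpfulness percentages
baseline = gradient_boost(xs, ys, n_learners=0)  # predicts the mean only
model = gradient_boost(xs, ys, n_learners=50)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

The small learning rate means each stump only corrects a fraction of the remaining error, which is why many of them are needed.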

Feedforward Neural Network

Neural Network is a very effective but complex model. It mimics the human brain and works by using many layers that contain neurons. Each neuron can activate (i.e., send values to the neurons in the next layer that are connected to it) after receiving values from neurons in the previous layer. These values are scaled by weights, which can be different for every pair of neurons. Weights play the same role as parameters in other models and need to be optimized during training. [19]
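
The forward pass of such a network can be sketched as follows. The sigmoid activation, the bias terms and all the weight values are assumptions for illustration; in practice the weights are learned during training, which is not shown.

```python
import math


def sigmoid(z):
    """Activation function that maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias, and applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the values through every layer in turn."""
    values = inputs
    for weights, biases in layers:
        values = layer_forward(values, weights, biases)
    return values


# Arbitrary weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], layers)
print(output)
```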

Related Technologies

Machine learning and NLP are very popular areas, so a lot of sophisticated software is available for them.

MATLAB is a programming language that heavily focuses on maths. It is matrix-based and can express maths formulas in a very natural way. It is often used to analyze data and develop algorithms. [8] It is also possible to use MATLAB with other programming languages (e.g. Python). MATLAB is a very good tool for learning the maths behind machine learning, but it also has some obvious drawbacks. Firstly, it is not free, so it could cause unnecessary costs for Ocado if they need to modify our program in the future. Secondly, it is purely a tool for maths and research, and it doesn’t produce any deployable applications. [9]

Octave is a language similar to MATLAB. It’s usually used for solving mathematical problems and automated data processing. One advantage it has over MATLAB is that it’s free software, so no budget is needed for using Octave. [10] But it seems to have a worse GUI, and it is still not well suited to production. [11]

Python is also a great tool for machine learning. Thanks to its open community, there are many reliable libraries to support both Machine Learning and NLP. Besides, it is a popular language and is very easy to learn. Unlike the other two languages, Python makes it easy to build an application. However, it is considered to be computationally slow, and it can produce a lot of runtime errors because it’s a dynamically typed language. [8]

“Final” Decisions

Since our project is heavily research-based, we’re meant to constantly experiment with different solutions and evaluate their performance. Thus our “final decisions” section would be quite different from an ordinary project -- we would choose a certain programming language and stick with it, but we won’t choose any particular methods of Machine Learning or NLP as our final decision.

In terms of programming language, Python has been chosen for this project. The main reasons are as follows. Firstly, all the team members used Python last academic year, so we can get started more quickly with it. Since our project was already delayed by 2 months due to a contract issue, being able to start quickly was very important for us. Secondly, our clients are more experienced with Python for Machine Learning and NLP, so we can seek help from them if we encounter any issues when learning. Thirdly, Python is free, open-source software, and it has many useful libraries and online tutorials.

As for the libraries, we decided to use NumPy, SciPy and Pandas for data processing, TensorFlow for Neural Networks, and Scikit-learn, which is a very popular machine learning and NLP library. Scikit-learn contains many useful algorithms for “bag of words”, n-grams, linear regression, logistic regression and so on, and it also has very well-written documentation. [12] Thus we believe it will be very helpful when learning and experimenting with different solutions.
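
As a rough sketch of how these pieces fit together in Scikit-learn, the example below chains a bag-of-words vectorizer (with unigrams and bigrams) and a logistic regression classifier. The reviews and star ratings are made up, and the settings are illustrative rather than the configuration used in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: review text and its star rating.
texts = [
    "great product, works perfectly",
    "really good value",
    "not good, broke after a day",
    "terrible, would not buy again",
    "good quality and fast delivery",
    "awful, not worth the money",
]
stars = [5, 4, 1, 1, 5, 1]

# ngram_range=(1, 2) counts single words and pairs of consecutive words.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, stars)

predicted = model.predict(["not good at all"])
print(predicted)
```

With so little data the prediction itself is not meaningful; the point is only the shape of the workflow: vectorize the text, fit a classifier, then predict.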

As for the actual methods that we would use to implement the algorithm, as mentioned above, they are not fixed and can vary as we progress. For our first step, we decided to implement “bag of words” from scratch to gain a better understanding of how everything works behind the scenes, and to use logistic regression to train a model that predicts star ratings, which is a classification problem. Then we would move on to some non-linear algorithms like Random Forest. Eventually, if we have enough time, we can try some advanced models like Neural Networks. We would follow roughly the same procedure for the helpfulness problem.

Since our project is research-based, evaluating our machine learning model would be an essential part. In order to demonstrate the algorithm better, we decided to use Jupyter Notebook, which is a useful tool for live code demo and visualisation, to write the evaluation report. [13]


