Research
Machine Learning
We began by investigating whether any existing libraries or APIs already offered the functionality we were trying to build, and found Camfind. Camfind is an app and API to which you can submit an image and receive a simple textual description of it in return.
This seemed promising, since it is essentially what part of our project is attempting to achieve. However, most likely because it was designed as a smartphone app, the type of image it works best with is not necessarily what we need. For certain classes of image, mostly photographs, it worked very well, whereas for others the returned text was not very suitable. Another issue is that when Camfind returns an incorrect result, it is not clear how confident it was in that classification. We can assume that it works by choosing and combining the tags it is most confident are correct; when the returned tags are wrong, it would be helpful to know which tags were discarded, since some of them may be correct and hence useful for searching. The API is too closed off for this, though: all you can get back is the text description and nothing else. There is also no way for us to supply our own tags, and hence no way to leverage the existing Flickr user tags.
Along with the fact that it is a paid service for high-volume queries, this makes it difficult to use for our purposes. There is still the possibility of integrating it into our system without making it the key component: if we can filter out the photographs and paintings, Camfind may be able to give a good description of those images.
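To make the trade-off concrete, below is a minimal sketch of how such a call could look from Node.js. It is purely illustrative: the endpoint, headers, and response fields are invented placeholders, not Camfind’s actual API, and the point is simply that a service of this kind hands back a single description string with no per-tag confidences to inspect.

```typescript
// Illustrative sketch of querying an image-description service from Node.js.
// The URL, header names, and response shape are placeholders, not Camfind's real API.
async function describeImage(imageUrl: string, apiKey: string): Promise<string> {
  const response = await fetch("https://api.example-image-description.test/describe", {
    method: "POST",
    headers: { "Content-Type": "application/json", "X-Api-Key": apiKey },
    body: JSON.stringify({ image_url: imageUrl }),
  });
  const result = await response.json();
  // A closed service like this returns only the final description string;
  // there is no list of discarded tags or confidence scores to work with.
  return result.description as string;
}
```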
We researched a number of neural network libraries, including Caffe and Neuroph, which are commonly used in image classification today. Though they are very powerful and most likely capable of doing what our system requires, they are difficult to set up correctly without experience in the field. Depending on our success with simpler machine learning techniques, it may be worth returning to them later in the project. We also investigated Azure’s own machine learning suite, which was very straightforward to use in general but hard to configure for our purposes.
We looked into the Weka software, which seems to have the functionality required for at least part of our project. Weka is a collection of machine learning algorithms that also provides pre-processing facilities to aid in importing data, and modelling techniques for analysing datasets. Its algorithm implementations can be used directly from the UI without writing any custom code, which means it would be comparatively quick to try them out and evaluate them against each other and against other systems. It is not yet clear how easy it will be to make Weka part of our pipeline, but as it has an API that can be used with Node.js, it will hopefully be possible to automate our system from start to finish.
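Since Weka itself is a Java application, one plausible way to slot it into a Node.js pipeline is simply to shell out to the Weka jar from Node. The sketch below is an assumption about how that glue could look, using Weka’s standard command-line interface; the model file, ARFF dataset, and choice of the J48 classifier are only examples, not decisions we have made.

```typescript
// Sketch: driving Weka from Node.js by invoking the Weka jar as a child process.
// Paths, the chosen classifier, and file names are illustrative assumptions.
import { execFile } from "child_process";

function classifyWithWeka(modelPath: string, arffPath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      "java",
      [
        "-cp", "weka.jar",                // Weka's jar on the classpath
        "weka.classifiers.trees.J48",     // example classifier shipped with Weka
        "-l", modelPath,                  // load a previously trained model
        "-T", arffPath,                   // unlabelled/test data in ARFF format
        "-p", "0",                        // print a prediction for each instance
      ],
      (error, stdout) => (error ? reject(error) : resolve(stdout))
    );
  });
}
```

The predictions arrive as plain text on stdout, which a later pipeline stage would then parse into tags for each image.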
User Interface
There are two aspects to our user interface: the first is an interface for searching for images; the second is a way to display the output of the search to the user. Both of these types of functionality have been done many times before and can give us a good basis for how our system should behave.
The obvious comparison when it comes to searching is Google, since it is the de facto Internet search engine. However, it encourages users to input natural language queries or jumbled keyword queries where the significance of each part is not clear. What these have in common is that they put the obligation on Google to interpret the query correctly, on the assumption that the user is unwilling to spend the time or effort being more specific. For us, however, giving much focus to interpreting search queries is really outside the scope of our project. Also, we expect our users to be researchers who are happy to put in more time in order to give a more precise search query.
Although it isn’t immediately obvious on their page, Google does offer an ‘Advanced Search’ option where users can be far more specific in their search terms: for instance, specifying words that must be included, words that must not be included, and words that may or may not be included. There are also ways to filter and narrow down the results. This seems to be exactly how we want our search to behave, so we will take strong cues from it.
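As a rough illustration of how these constraints could map onto our tag-based search, the sketch below filters a set of images by ‘must include’ and ‘must exclude’ terms and uses the optional terms only for ranking. The data shapes and scoring rule are our own assumptions, not how Google implements it.

```typescript
// Sketch of an advanced-search style filter over tagged images.
// The ImageRecord shape and the ranking rule are illustrative assumptions.
interface ImageRecord {
  id: string;
  tags: string[];
}

function advancedSearch(
  images: ImageRecord[],
  mustInclude: string[],
  mustExclude: string[],
  mayInclude: string[] = []
): ImageRecord[] {
  return images
    .filter(
      (img) =>
        mustInclude.every((t) => img.tags.includes(t)) && // all required terms present
        mustExclude.every((t) => !img.tags.includes(t))   // no forbidden terms present
    )
    // Optional terms never exclude an image; they only improve its ranking.
    .sort(
      (a, b) =>
        mayInclude.filter((t) => b.tags.includes(t)).length -
        mayInclude.filter((t) => a.tags.includes(t)).length
    );
}
```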
For the other aspect, displaying the results, we found a couple of options. One is to go down the same route as Google and Flickr, presenting the most relevant results in a linear list. This is very useful when the user knows exactly what they want and is relying only on the ranking to surface it. However, when the user has a vaguer idea, looking through a long list of possibly irrelevant results is tedious and potentially useless.
There are systems such as Facebook’s Graph Search that focus more on showing the links between search results - the nodes in the graph. This kind of functionality would allow people to navigate around the dataset more easily without such a specific starting point.
Ultimately, some combination of these approaches would probably be best, as we can’t predict how our users might want to find what they are looking for, and it is possible they’ll want to use both together. The route we have chosen is to allow users to see images similar to ones they have already found, rather than showing them clusters of the entire search space. We think this is more intuitive, since as a user it is clearer that you are defining a ‘direction’ for your search by investigating the first, closest match.
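One way such a ‘find similar images’ step could work, assuming each image has already been reduced to a numeric feature vector elsewhere in the pipeline, is a simple nearest-neighbour ranking like the sketch below. The feature representation and the choice of cosine similarity are placeholders for whatever we eventually settle on.

```typescript
// Sketch: rank images by cosine similarity to a query image's feature vector.
// Assumes feature vectors have already been extracted for every image.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function findSimilar(
  query: number[],
  candidates: { id: string; features: number[] }[],
  topK = 10
): { id: string; score: number }[] {
  return candidates
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.features) }))
    .sort((a, b) => b.score - a.score) // most similar first
    .slice(0, topK);
}
```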
British Library Dataset
Although the British Library dataset is very large – over 1 million images – we thought it would be a good idea to manually browse through a subset of the images to get an appreciation of the classes of image we are dealing with. This is obviously only a small fraction, but it nevertheless revealed some interesting findings. For example, there are a large number of images that are in fact embellished capital letters from the beginnings of chapters. This type of image may be possible to detect, but could be very difficult to separate by the actual letter it contains, because the letters appear very different in shape to a standard letter (though close enough to be recognised by a human). An example can be seen below:

As a human, you could unconsciously recognise the shape of the tree and the letter ‘S’, and also notice that this would be a good fit for the first word, ‘sur’ (provided you speak French). However, this is not obvious to a computer, especially as, to begin with, it is not even clear which parts of the tree image make up the letter and which are just decoration. Another similar but slightly different example is graphs and diagrams:

This can clearly be distinguished as a diagram, but even as a human it is virtually impossible to say exactly what it is without further context. To classify it, you would have to look it up in its original book or find the corresponding scanned text.
For both of these examples, though, we will hopefully be able to separate the images into their own classes (initials and diagrams), so that any more specific tagging can be done manually. Although this wouldn’t enable fully comprehensive tagging of the images, it would make much more efficient use of a tagger’s time.
Another thing we observed is that very few of the images were originally in full colour: the scans themselves contain colour, but most images were originally black ink printed on a plain page. This matters because, when classifying images, we need a way to describe each image, or compress it into a set of ‘features’ that concisely capture information about it. We must therefore be careful to avoid features based on the colour of the images, since colour won’t be helpful. In fact, due to the different colours of paper the books were printed on, as well as any discoloration due to age or other factors, near-identical images could appear to have quite different colours.
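One simple precaution along these lines would be to collapse each pixel to a luminance value before computing any features, so that hue plays no part in the description. The sketch below shows a normalised greyscale intensity histogram as one such colour-independent feature; the flat RGB pixel layout and the number of bins are illustrative assumptions rather than decisions about our final feature set.

```typescript
// Sketch: a colour-independent feature - a normalised greyscale intensity histogram.
// Pixel layout (flat RGB array) and the bin count are illustrative assumptions.
function greyscaleHistogram(rgbPixels: Uint8Array, bins = 16): number[] {
  const histogram = new Array(bins).fill(0);
  for (let i = 0; i + 2 < rgbPixels.length; i += 3) {
    // Standard luminance weighting; hue information is discarded entirely.
    const luma =
      0.299 * rgbPixels[i] + 0.587 * rgbPixels[i + 1] + 0.114 * rgbPixels[i + 2];
    histogram[Math.min(bins - 1, Math.floor((luma / 256) * bins))]++;
  }
  const totalPixels = rgbPixels.length / 3;
  return histogram.map((count) => count / totalPixels); // normalise by image size
}
```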