Group 12: British Library Machine Learning Experiment
To view the application, visit http://blbigdata.herokuapp.com
Background and Context
Project Brief Given
Taking on the BL Big Data Experiment, this team will look to improve the existing engine and to provide interfaces for Machine Learning processes. The project follows up on the brief by the MScSSE team, with tasks and features for the Azure marketplace so that the sample data and algorithm models have a secure home for future UCL projects. There should be improvements to search methods, and interfaces for machine learning, starting with a maps enquiry engine for identification of mapping. The project is open in scope, so students can contribute new and empowering ideas that let other researchers combine assets and integrate them into their own studies; several APIs should be published to this effect. Interfaces for statistical modelling (SPSS, MATLAB and others) and other ways of classifying data are to be designed into the solution to improve the way that Digital Humanities researchers conduct their studies.
Problem Statement
The initial problem statement comes from a larger project being undertaken by the British Library. The British Library has scanned a large number of books from the period 1510 to 1946, the bulk of which are from the late 19th century, for which no digital copy is available. It has run Optical Character Recognition (OCR) on these scans, which outputs the text content of every page of each book. As a side effect of the OCR, some portions of some pages are not recognised as text. Some of these are pieces of text that are embellished in some way or have otherwise failed to be recognised, but most of the time these sections contain images of some description (photographs, paintings, drawings, graphs, diagrams). From these scanned images, a dataset of 1 million images was uploaded to the image sharing site Flickr and is now completely in the public domain. Each image uploaded to Flickr was also automatically given tags describing the book it was taken from, such as publication date and place, author and page number. Through the Flickr platform, users have been able to add their own tags too, though with such a large dataset many images still have no user-added tags.
Overview of System Functionalities
The system consists of three parts:
- The first part is a system that automatically generates tags for images. The idea is that we use our own knowledge to manually tag a small set of images, and then use Machine Learning algorithms to propagate those tags to images that have not yet been tagged.
- The second part is accessing the dataset and running our tagging algorithm on it, as well as storing the resulting tags in an appropriate way so that they can be used for retrieval.
- The third part is a straightforward way to search for images fitting a certain description, as well as to navigate through the dataset by looking for images similar to an existing one.
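To make the three parts more concrete, the following is a minimal sketch and not the project's actual implementation. It assumes each image has already been reduced to a numeric feature vector (how that is done is outside this sketch), and it swaps in a simple nearest-neighbour vote as the tagging algorithm. All function names, tags, image ids and feature values here are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_tags(labelled, features, k=3, min_votes=2):
    """Part 1: suggest tags for an untagged image.

    `labelled` is a list of (feature_vector, set_of_tags) pairs that were
    tagged by hand. A tag is suggested when at least `min_votes` of the
    `k` most similar labelled images carry it.
    """
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(item[0], features),
        reverse=True,
    )
    votes = Counter(tag for _, tags in ranked[:k] for tag in tags)
    return {tag for tag, count in votes.items() if count >= min_votes}


def build_index(tagged_images):
    """Part 2: store tags as an inverted index (tag -> set of image ids)."""
    index = defaultdict(set)
    for image_id, tags in tagged_images.items():
        for tag in tags:
            index[tag].add(image_id)
    return index


def search(index, *tags):
    """Part 3: return the ids of images that carry every requested tag."""
    if not tags:
        return set()
    return set.intersection(*(index.get(tag, set()) for tag in tags))


# Hypothetical hand-tagged examples with made-up 3-dimensional features.
labelled = [
    ([1.0, 0.0, 0.1], {"map"}),
    ([0.9, 0.1, 0.0], {"map", "coastline"}),
    ([0.0, 1.0, 0.2], {"portrait"}),
    ([0.1, 0.9, 0.0], {"portrait", "engraving"}),
]

# An untagged image whose features resemble the two map examples.
suggested = propagate_tags(labelled, [0.95, 0.05, 0.05])

# Store the suggested tags next to existing ones and query the index.
index = build_index({"img-1": suggested, "img-2": {"portrait", "engraving"}})
maps = search(index, "map")
```

Requiring more than one vote per tag is a deliberate choice in this sketch: with only a small hand-tagged set, it avoids propagating a tag that appears on a single, possibly unrepresentative, neighbour. The same similarity function could also back the "similar images" navigation by ranking images against an existing one.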
Context of the System
The system is designed to be used by historians and humanities researchers, and so does not have to be as robust as something that would be open for anyone to use. Since researchers using the system are likely to be more invested in what they are trying to find out, they can probably adapt to and learn a slightly more difficult but more powerful UI. This is not an excuse to ignore HCI principles.