Group 12: British Library Machine Learning Experiment


To view the application, visit http://blbigdata.herokuapp.com

Project Brief Given

Taking on the BL Big Data Experiment, this team will look to improve the existing engine and provide interfaces for machine learning processes. It follows up the brief by the MScSSE team, with tasks/features for the Azure marketplace such that the sample data and algorithm models have a secure home for future UCL projects. There should be improvements to search methods, and interfaces for machine learning, starting with a maps enquiry engine for the identification of mapping. The project is open in scope for students to contribute new and empowering ideas that let other researchers combine assets and integrate with the system for their studies; several APIs should be published to this effect. Interfaces for statistical modelling (SPSS, MATLAB and others) and other ways of classifying data are to be designed into the solution to improve the way that Digital Humanities researchers conduct their studies.


Problem Statement

The initial problem statement comes from a larger project currently being undertaken by the British Library. The Library has scanned a large number of books from the period 1510 – 1946, the bulk of which are from the late 19th century, for which there was no digital copy available. On these scans it has run Optical Character Recognition (OCR), which outputs the text content of every page of each book. As a side effect of the OCR, some portions of some pages are not recognised as text. Some of these are pieces of text that are embellished in some way or have otherwise failed to be recognised, but most of the time these portions contain images of some description (photographs, paintings, drawings, graphs, diagrams). From these portions a dataset of 1 million images was uploaded to the image sharing site Flickr, and it is now completely in the public domain. Each image uploaded to Flickr was also automatically given tags corresponding to information about the book it was taken from, such as publication date and place, author and page number. Through the Flickr platform, users have been able to add their own tags too, though with such a large dataset there are still many images without user-added tags.
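
To make the shape of this data concrete, the following is a minimal Python sketch of how one image and its two kinds of tag might be represented, and how the images still lacking user tags could be picked out. It is an illustration only: the class name, field names and sample values are all invented for this example and are not taken from the Flickr dataset or from the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """One extracted image. All names here are hypothetical, for illustration."""
    image_id: str
    # Tags assigned automatically from the source book
    # (publication date and place, author, page number).
    book_tags: dict[str, str]
    # Tags added afterwards by Flickr users; empty if nobody has tagged the image.
    user_tags: list[str] = field(default_factory=list)


def without_user_tags(records):
    """Return the records that have no user-added tags."""
    return [r for r in records if not r.user_tags]


# Placeholder values, made up purely to exercise the code above.
records = [
    ImageRecord(
        "img-001",
        {"date": "1885", "place": "London", "author": "Example Author", "page": "42"},
        ["map"],
    ),
    ImageRecord(
        "img-002",
        {"date": "1890", "place": "Edinburgh", "author": "Another Author", "page": "7"},
    ),
]

untagged = without_user_tags(records)
print([r.image_id for r in untagged])
```

Keeping the automatic book tags separate from the user tags means the images that only carry book metadata are easy to isolate.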


Overview of System Functionalities

The system comprises three parts:


Context of the System

The system is designed to be used by historians and humanities researchers, and so does not have to be as robust as something that would be open for anyone to use. Since researchers using the system are likely to be more invested in what they are trying to find out, they can probably adapt to and learn a slightly more difficult but more powerful UI. This is not an excuse to ignore HCI principles.