Group 12 - British Library Big Data Project Part 2

Original Requirements

ID	Description	Rationale	Type	Priority
1	Well-documented project that can be built upon in the future	This is only the first iteration of the project and it is important that any success can be made the most of in the future rather than needing to be re-implemented	Non-functional	Must Have
2	A pipeline that goes from inputting images to being able to search through those images by tags	The system will be used by arts and humanities researchers and so it is not important that they know about the inner workings of the system to be able to use it	Non-functional	Must Have
3	Segmented steps in our pipeline	The method of tagging, the way images are input, or even the search functionality may change in the future so it is important that it can be easily adaptable to this	Non-functional	Must Have
4	A set of images with correct tags	In order to try and train our classifier to work on the entire set we need to have a number of images that we are confident are correct	Functional	Must Have
5	An automated image tagging method	There are far too many images to be tagged by hand, especially if the nature of the tags changes over time and every image needs to be re-classified	Non-functional	Must Have
6	An automated way to tag images based on their medium (photograph/painting/diagram etc.)	Although this would still result in large sets of images, it at least narrows down the number of images to search through, and could be used in combination with other tags	Non-functional	Should Have
7	A way to infer information about the image by the book that contains it	By using the fact that the books have a number of attributes such as publication date, location and the book's genre, we can use this in conjunction with the visual information to more accurately classify the images	Non-functional	Could Have
8	A way to use the existing Flickr tags	There is a wealth of tags that already exist on Flickr for the BL dataset which is valuable data but it is not necessarily reliable. May be useful for generating new tags	Non-functional	Should Have
9	A way to find images by their tags	Once the images have been correctly tagged, there still needs to be a way of retrieving them. This could be either our own system or by passing the tags on to another system such as Flickr	Non-functional	Must Have
10	A way to group similar images for retrieval	It is helpful to find images that have similar attributes when searching, especially if the user is not quite sure exactly what they are looking for	Non-functional	Should Have
11	A way to describe the clusters of images in a way that a human could understand	This makes browsing and retrieval of images much more straightforward but is difficult and with the time left may not be feasible	Non-functional	Would like to have
12	An object classification method for images	Searching by an image's content is likely to be the most intuitive way to search for pictures, for example searching 'boat' for pictures of boats. It is also the hardest approach and may not be possible in the time given	Non-functional	Would like to have

Implementation of Requirements and Achievements

The user was very happy with our system because every one of his requirements was implemented successfully. The following table lists each requirement, whether it was implemented, and its priority:

ID	Description	Implemented successfully?	Priority of Requirement
1	Well-documented project that can be built upon in the future	Implemented Successfully	Must Have
2	A pipeline that goes from inputting images to being able to search through those images by tags	Implemented Successfully	Must Have
3	Segmented steps in our pipeline	Implemented Successfully	Must Have
4	A set of images with correct tags	Implemented Successfully	Must Have
5	An automated image tagging method	Implemented Successfully	Must Have
6	An automated way to tag images based on their medium (photograph/painting/diagram etc.)	Implemented Successfully	Should Have
7	A way to infer information about the image by the book that contains it	Implemented Successfully	Could Have
8	A way to use the existing Flickr tags	Implemented Successfully	Should Have
9	A way to find images by their tags	Implemented Successfully	Must Have
10	A way to group similar images for retrieval	Implemented Successfully	Should Have
11	A way to describe the clusters of images in a way that a human could understand	Implemented Successfully	Would like to have
12	An object classification method for images	Implemented Successfully	Would like to have

Further Explanations on Achievements

“Well-documented project that can be built upon in the future”: (Implemented successfully) We have written the documentation so that it shows the steps required to continue building the project. We have done this by including:
- A user manual which shows how the application works - it shows the steps required to carry out the main features.
- Code that is formatted so that people can understand its different segments. Comments, indentation and appropriate identifier names were used to achieve this.
“A pipeline that goes from inputting images to being able to search through those images by tags”: (Implemented successfully) This requirement was met because:
- All of the images used are input from the British Library image data-set (extracted from Flickr).
- Users can search for images that are stored on Flickr (image data-set).
“Segmented steps in our pipeline”: (Implemented successfully) All of the features have been separated into different modules throughout the code. We did this for several reasons:
- It is easier for future developers to continue developing certain features without having to worry about other features being affected.
- It is easier to read the code if the different features and steps are modular.
- If bugs occur when future teams work on our project, it is much easier to debug the code when the features are separated into different modules.
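As an illustration of the modular structure described above, here is a minimal sketch in which each pipeline step is an independent function and the pipeline is simply an ordered list of steps. The stage names (load_stage, tag_stage) and the record layout are hypothetical and are not taken from the project code.

```python
from typing import Callable, Dict, Iterable, List

# A record is a plain dict describing one image. Each stage takes a record
# and returns a new, enriched record. All names here are hypothetical.
Record = Dict[str, object]
Stage = Callable[[Record], Record]


def load_stage(record: Record) -> Record:
    # Stand-in for the input step: mark the record as loaded.
    return {**record, "loaded": True}


def tag_stage(record: Record) -> Record:
    # Stand-in for the tagging step: a real stage would call a tagger here.
    return {**record, "tags": ["untagged"]}


def run_pipeline(records: Iterable[Record], stages: List[Stage]) -> List[Record]:
    # Apply the stages in order to every record.
    results = []
    for record in records:
        for stage in stages:
            record = stage(record)
        results.append(record)
    return results


output = run_pipeline([{"id": "img-1"}], [load_stage, tag_stage])
```

Because the stages only share the record format, one stage (for example the tagger) can be replaced without touching the others.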
“A set of images with correct tags”: (Implemented successfully) We have tagged thousands of images in the data-set using two APIs and machine learning algorithms. It is not possible to tag every image 100% accurately, so we conducted thorough testing to check the percentage of images that are tagged correctly. We found that around 90% of them were tagged correctly using our algorithms.
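The percentage of correctly tagged images can be computed along the following lines. This is a sketch only: it assumes that an image counts as correct when its predicted tags exactly match a hand-verified set, which may differ from the criterion we used in testing, and the sample data is invented for illustration.

```python
from typing import Dict, Set


def tagging_accuracy(predicted: Dict[str, Set[str]],
                     verified: Dict[str, Set[str]]) -> float:
    """Fraction of verified images whose predicted tags match exactly."""
    if not verified:
        raise ValueError("no verified images to compare against")
    correct = sum(
        1 for image_id, tags in verified.items()
        if predicted.get(image_id) == tags
    )
    return correct / len(verified)


# Hypothetical sample, for illustration only (not the project's data).
verified = {"a": {"map"}, "b": {"portrait"}, "c": {"ship"}, "d": {"music"}}
predicted = {"a": {"map"}, "b": {"portrait"}, "c": {"boat"}, "d": {"music"}}
accuracy = tagging_accuracy(predicted, verified)
```

In this invented sample three of the four images match, so the function returns 0.75.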
“An automated image tagging method”: (Implemented successfully) Images are tagged algorithmically rather than by hand. We use two APIs (AlchemyAPI and Imagga API) to tag images based on what they represent. There are approximately one million images in the data-set, so it would take too long to tag all of them at once. Therefore, we have made a script which automatically tags a certain number of images every day.
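The daily batching idea can be sketched as follows. This is not our actual script: the tagger is passed in as a plain function so that no real AlchemyAPI or Imagga call is assumed, and the names tag_daily_batch, fake_tagger and store are hypothetical.

```python
from typing import Callable, Dict, List


def tag_daily_batch(tags_by_image: Dict[str, List[str]],
                    tagger: Callable[[str], List[str]],
                    batch_size: int) -> List[str]:
    """Tag up to batch_size images that have no tags yet; return their IDs."""
    untagged = [image_id for image_id, tags in tags_by_image.items() if not tags]
    batch = untagged[:batch_size]
    for image_id in batch:
        # A real run would call the external tagging service here.
        tags_by_image[image_id] = tagger(image_id)
    return batch


def fake_tagger(image_id: str) -> List[str]:
    # Placeholder standing in for an external tagging API.
    return ["tag-for-" + image_id]


store = {"img-1": [], "img-2": ["map"], "img-3": [], "img-4": []}
first_run = tag_daily_batch(store, fake_tagger, batch_size=2)
second_run = tag_daily_batch(store, fake_tagger, batch_size=2)
```

Running the function once per day (for example from a scheduler) works through the untagged images in fixed-size batches, and images that already have tags are skipped.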
“An automated way to tag images based on their medium (photograph/painting/diagram etc.)”: (Implemented successfully) We have also used machine learning algorithms to classify images based on characteristics such as:
- Whether the image is in black and white or colour.
- Whether the image contains musical notation.
- Whether the image is a line drawing or a photograph.
These classifications allowed us to tag images based on their medium.
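To make the first of these characteristics concrete, the sketch below separates black and white images from colour images. It is a simple hand-written heuristic, not the machine learning classifier we used: a pixel counts as coloured when its red, green and blue values differ by more than a tolerance, and the threshold values are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def is_greyscale(pixels: Iterable[Pixel],
                 tolerance: int = 8,
                 max_colour_fraction: float = 0.05) -> bool:
    """Return True when almost all pixels have near-equal R, G and B values."""
    total = 0
    coloured = 0
    for r, g, b in pixels:
        total += 1
        # A large spread between channels means the pixel has a visible hue.
        if max(r, g, b) - min(r, g, b) > tolerance:
            coloured += 1
    if total == 0:
        raise ValueError("image has no pixels")
    return coloured / total <= max_colour_fraction


# Tiny hand-made pixel lists, for illustration only.
grey_image = [(10, 10, 10), (200, 202, 199), (128, 128, 128)]
colour_image = [(255, 0, 0), (0, 128, 255), (128, 128, 128)]
```

Here is_greyscale(grey_image) returns True and is_greyscale(colour_image) returns False. A learned classifier would replace the fixed thresholds with parameters fitted to labelled images.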
“A way to infer information about the image by the book that contains it”: (Implemented successfully) We have made a feature that allows users to view information about the book that each image is from. The following information is displayed for each book:
- Volume of the book.
- Name of the publisher.
- The book title.
- The book author.
- The place of publication.
- The year it was published.
- The number of pages.
“A way to use the existing Flickr tags”: (Implemented successfully) Many of the existing Flickr tags remain associated with their images, so users can search for images using tags that were already applied on Flickr.
“A way to find images by their tags”: (Implemented successfully) On the home page, users can enter a search query and choose to find all of the images whose tags match it. The user is then redirected to a page that shows the matching images as search results.
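Tag search of this kind can be sketched with an inverted index that maps each tag to the images carrying it. The sketch assumes that a query of several words should match only images carrying every word as a tag, and that matching ignores case. Both are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_tag_index(tags_by_image: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each lower-cased tag to the set of image IDs that carry it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_id, tags in tags_by_image.items():
        for tag in tags:
            index[tag.lower()].add(image_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return sorted IDs of images tagged with every word in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Copy the first result set so the index itself is never modified.
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical tags, for illustration only.
index = build_tag_index({
    "img-1": ["Boat", "sea"],
    "img-2": ["boat", "harbour"],
    "img-3": ["map"],
})
```

With this sample, search(index, "boat") returns ["img-1", "img-2"], search(index, "boat sea") returns ["img-1"], and a query with no matching tag returns an empty list.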
“A way to group similar images for retrieval”: (Implemented successfully) All of the images that the user is presented with on the search page are similar in at least one way. The images are either from the same book or they have similar tags (based on the APIs and machine learning algorithms used).
“A way to describe the clusters of images in a way that a human could understand”: (Implemented successfully) In the search results page, the images are clustered by the books that they are from or by the tags that they have (if images have the same tags or are in the same book, they are clustered together).
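Clustering by book can be sketched as a simple grouping step in which the book title doubles as the human-readable label of each cluster. The record fields ("id", "book_title") and the sample titles are hypothetical and are used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def cluster_by_book(images: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group image IDs under the title of the book they come from."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for image in images:
        clusters[image["book_title"]].append(image["id"])
    return dict(clusters)


# Hypothetical search results, for illustration only.
results = [
    {"id": "img-1", "book_title": "Book A"},
    {"id": "img-2", "book_title": "Book B"},
    {"id": "img-3", "book_title": "Book A"},
]
clusters = cluster_by_book(results)
```

The same grouping works for tags by keying on a tag instead of the book title.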
“An object classification method for images”: (Implemented successfully) Images can be classified by different characteristics (including the colour of the image and whether it is a line drawing or a photograph).
