Uncovering Lives of WW1 

Team 1

Nour Obeid     George Badea       Junda He

Overview

Experiences of the First World War could differ greatly depending on location, type of military unit, medical issues, social class and more. We aim to provide more context for the experiences of different military units in different locations of the war, and seek to find the connections among them.

 

We are collaborating with the British Library, applying text mining and geo-referencing to a large collection of digitised books that the Library has provided.

 

We will apply machine learning, corpus linguistics and sentiment analysis techniques to identify place names within digitised publications from and about the First World War, and to generate a gazetteer from them.

 

Initial Data Gathering

Client requirements:

A system that is trained to:

  • identify place names mentioned in documents that are not in gazetteers
  • identify descriptions of places

Data collection: from a meeting and an interview with the client

 

User type:

Researchers - Historians - Literary scholars - Social scientists

Data collection: researching similar products and interviewing their developers

 

Software type: terminal program, mobile app or interactive website

We chose to make our system an interactive website because:

  • no download is required, making it faster and easier to use
  • the data is easily updatable

Initial requirements

Functional requirements:

  • The system SHOULD accept one input file
  • The system SHOULD identify place names
  • The system SHOULD identify the description of the place names
  • The system SHOULD extract sentences surrounding each place name

 

Non-functional requirements:

  • The input file MUST be in PDF or TXT format
  • The HTTP connection timeout SHOULD be one minute
  • The waiting time SHOULD be visualised
  • The language SHOULD be English
  • The system SHOULD be developed by April 2018

 

User Consultation

As the Geoparser is made mainly for research and academic purposes, the users are most likely to be professors, scholars or researchers in the humanities or social sciences. We interviewed Dr Mia Ridge, Digital Curator in the British Library's Digital Scholarship team, to gather the initial requirements.

We emailed the development team of the Edinburgh Geoparser, and Dr Beatrice Alex provided us with information about the users of this similar product and the collaboration projects built on it:

  • The default Edinburgh Geoparser (available in the download) was developed for newspaper text. It has been used in research projects where people adapt it to different domains.
  • The geoparser is mainly rule-based, apart from some of the candidate-ranking steps in the geo-resolution component. A rule-based parser works most reliably when applied to text from different domains, as a machine-learning-based system would need annotated training data for each new domain, which is costly to create.
  • We looked at projects that used the Edinburgh Geoparser and analysed each of them, including the LitLong project, the Survey of English Place-Names and the GAP project.

After the consultation, we decided to enable uploading of multiple files to increase efficiency.

Personas

Sketches

Throughout the design process we applied the elaboration-and-reduction method: our ideas gradually converged on the optimal design.

1st idea

2nd idea: more user-centred

While sketching we added to our requirements:

  • A cancel button must be displayed
  • The name of the uploaded file must be displayed

Automatic analysis

After uploading the file, the user can confirm or cancel the upload.

Scenario

Alvin has started his project. Initially, Alvin needs to use text mining and geo-referencing on a large collection of digitised books to build a database containing place names and relevant descriptions. It would take humans a long time to read the books and precisely extract the right sentences from the context.

 

So Alvin and his team use the Geoparser to analyse the books. Alvin types the URL of our website, selects many text files in one go, uploads them and clicks "submit", letting the algorithms extract the place names, relevant descriptions of each place, and the title and author of each book. After a while the algorithms finish reading and automatically build the database. Alvin then uses the output to continue his project.

Storyboard

Prototype

We built two different prototypes to support our design:

1. Digital prototype

I

II

III

IV

2. Cardboard prototype:

  • very quick to build
  • very effective for identifying problems in the early stages of design
  • quick to modify

I

II

III

IV

V

Evaluation

We used the cardboard prototype to evaluate our system.

User testing method:

  • We can get unbiased feedback from real users in real time
  • We can ask users to complete specific tasks

We consulted five users.

We added to our requirements:

  • A "change file" button
  • The output should be visualised via Google Maps
  • The system should accept up to 2 files

Heuristic Method

Final Requirements 

Functional requirements:

  • The system MUST accept an input file
  • The system MUST identify place names
  • The system MUST identify the description of the place names
  • The system MUST extract the sentences surrounding each place name

Non-functional requirements:

  • The input files MUST be in TXT or XML format
  • The HTTP connection timeout MUST be one minute
  • The system MUST be developed by April 2018
  • The language MUST be English
  • A cancel button SHOULD be displayed
  • The output SHOULD be visualised via Google Maps
  • The system SHOULD accept up to 2 input files

New Prototype 


I

II

III


IV

V

VI

VII

Research

We have a range of gazetteers to choose from. It is difficult to find a source that provides a perfect gazetteer for the WW1 domain, so it is necessary to experiment with different gazetteers before deciding which one to use. They are:

  • GeoNames: covers all countries and contains over eleven million place names; the drawback is that some historical place names will not be identified
  • DEEP: historical place names in England
  • Pleiades+: a community-built gazetteer of ancient places
  • Unlock: a general gazetteer, mainly for the UK

 

The system should be able to identify person names as well. Identifying place names and person names at the same time increases the accuracy of differentiating some ambiguous names. For example, "Lewis" could be a first name or an island in Scotland, and titles in phrases such as "Mrs Chesterfield" or "Earl of Essex" signal a person rather than a place.
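The idea can be sketched with a simple rule. This is not the Edinburgh Geoparser's actual logic; the cue lists and the toy gazetteer below are illustrative assumptions:

```python
# A toy disambiguation rule: a gazetteer entry preceded by a personal title
# (or "Title of X") is labelled as a person rather than a place.
PERSON_CUES = {"mr", "mrs", "miss", "dr", "earl", "lady", "lord"}
GAZETTEER = {"lewis", "essex", "chesterfield"}  # toy local gazetteer

def classify(tokens):
    """Label each token as PERSON, PLACE or O using the title-cue rule."""
    labels = []
    for i, tok in enumerate(tokens):
        word = tok.lower().rstrip(".,")
        prev = tokens[i - 1].lower().rstrip(".,") if i > 0 else ""
        if word in GAZETTEER:
            # A preceding title, or an "of" after a title, marks a person name.
            if prev in PERSON_CUES or (
                prev == "of" and i >= 2
                and tokens[i - 2].lower().rstrip(".,") in PERSON_CUES
            ):
                labels.append("PERSON")
            else:
                labels.append("PLACE")
        else:
            labels.append("O")
    return labels

print(classify("Mrs Chesterfield sailed to Lewis".split()))
# ['O', 'PERSON', 'O', 'O', 'PLACE']
```

A bare gazetteer lookup would have tagged "Chesterfield" as a place here; the person-name cue prevents that.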

Local gazetteers are important for improving the speed of processing large files. Compared with calling online gazetteer services, setting up a local database to store the gazetteers is more desirable.
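A local lookup could be sketched with SQLite, assuming the gazetteer has been exported to simple (name, latitude, longitude) rows. The table layout and sample rows below are illustrative, not GeoNames' real schema:

```python
# Sketch of a local gazetteer database: load rows once, then resolve
# place names with fast indexed lookups instead of network calls.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gazetteer (name TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?)",
    [("Ypres", 50.85, 2.89), ("Verdun", 49.16, 5.38)],
)
conn.execute("CREATE INDEX idx_name ON gazetteer(name)")  # speeds up repeated lookups

def resolve(place):
    """Return (lat, lon) for a place name, or None if it is not in the gazetteer."""
    return conn.execute(
        "SELECT lat, lon FROM gazetteer WHERE name = ?", (place,)
    ).fetchone()

print(resolve("Ypres"))  # (50.85, 2.89)
```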

Process Pipeline

The specified text file goes through a series of processing steps combined into one pipeline. It is first tokenised, part-of-speech tagged and lemmatised. After these initial steps, named entity recognition is performed to identify location and person names as well as dates. Finally, visualisations are created so that the file and the Geoparser output can be inspected using a map interface in a browser.

1. Tokenization identifies word tokens, wrapping w elements around each token and s elements around each sentence.

2. Part-of-speech tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.

3. Lemmatization is the process of reducing inflected words to their dictionary (lemma) form.
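The steps above can be sketched as a minimal pipeline. This toy version uses a regex tokenizer, crude suffix stripping and a gazetteer lookup as stand-ins for the Geoparser's real components (POS tagging, step 2, is omitted for brevity); the gazetteer and suffix rules are illustrative assumptions:

```python
import re

GAZETTEER = {"ypres", "somme"}  # toy stand-in for a real gazetteer

def tokenize(text):
    """Step 1: split text into sentences, each a list of word tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"[A-Za-z]+", s) for s in sentences]

def lemmatize(token):
    """Step 3: crude suffix stripping as a stand-in for real lemmatization."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def recognise_places(sentences):
    """NER step: return the tokens found in the gazetteer."""
    return [tok for sent in sentences for tok in sent if tok.lower() in GAZETTEER]

text = "The battalion marched to Ypres. Fighting continued near the Somme."
print(recognise_places(tokenize(text)))  # ['Ypres', 'Somme']
print(lemmatize("marched"))              # march
```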

The Edinburgh Geoparser

In December 2015, the Edinburgh Geoparser was released under the University of Edinburgh’s GPL license to be used by other researchers in the field of text mining as well as other scholars who are interested in geoparsing text.

The Geoparser allows you to process a piece of English-language text and extract and resolve the locations contained within it. Among other uses, geo-resolution of locations makes it possible to map the data.

 

Further Development

Specifying temporal expressions (dates and times) and person names

As mentioned before, the best results are obtained when place names and person names are identified at the same time.

In the future, the system should also be able to recognize dates and times and normalize them. By normalizing, we mean resolving an expression to an exact calendar time. For example, it would calculate the exact date referred to by "yesterday" by processing other data, such as the document's date.

Geo-parsing multiple text files

For researchers and scholars, processing only a single file falls far short of their needs: they may want to analyze a batch of documents at once.
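Batch processing could be sketched as a loop over a directory, with a hypothetical `parse_places` function standing in for the real geoparsing step:

```python
# Geoparse every .txt file in a folder and collect the results per file.
from pathlib import Path

def parse_places(text):
    """Stand-in for the real geoparsing step: here, a toy gazetteer lookup."""
    gazetteer = {"Ypres", "Verdun", "Somme"}
    return [w.strip(".,") for w in text.split() if w.strip(".,") in gazetteer]

def parse_directory(folder):
    """Run the parser over every .txt file, returning {filename: places}."""
    results = {}
    for path in sorted(Path(folder).glob("*.txt")):
        results[path.name] = parse_places(path.read_text(encoding="utf-8"))
    return results
```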

System Architecture

Future Plan

  • Rebuild the HCI website
  • Build the website interface
  • Research LT-TTT2 and NLTK
  • Set up the system architecture

References