Evaluation


MoSCoW List Achievement

ID | Description | Priority | State | Contributors
1 | Operation number | MUST | ✓ | Hannah
2 | Admin 0 | MUST | ✓ | Tosin
3 | Admin 1 | MUST | ✓ | Tosin
4 | Admin 2 | MUST | ✓ | Tosin
5 | P codes | MUST | ✓ | Hannah
6 | Operation start date | SHOULD | ✓ | Hannah
7 | Operation end dates | SHOULD | ✓ | Hannah
8 | Glide number | SHOULD | ✓ | Hannah
9 | Number of people affected | SHOULD | ✓ | Hannah
10 | Number of people assisted | SHOULD | ✓ | Hannah
11 | Host National Society | SHOULD | ✓ | Hannah

Key Functionalities: 100%
Optional Functionalities: 100%


Individual Contributions

Section of Project | Stone | Tosin | Hannah | Zayn
Client Liaison | 25 | 25 | 25 | 25
Requirements Analysis (HCI) | 25 | 25 | 25 | 25
Research | 25 | 25 | 25 | 25
UI Design | 80 | 0 | 0 | 20
Coding | 25 | 25 | 25 | 25
Testing | 25 | 25 | 25 | 25
Integration | 15 | 70 | 0 | 15
Report Website | 25 | 25 | 25 | 25
Monthly Blogs | 0 | 100 | 0 | 0
Monthly Videos | 0 | 0 | 100 | 0
Overall Contribution | 24.5% | 32% | 15% | 28.5%
Main roles | UI Designer, Front-End Developer | DevOps, Back-end Developer, Back-end Integration | Team Manager, Client Liaison, Back-end Developer | Team Manager, Client Liaison, Full-stack Developer

Known Bug List

ID | Bug Description | Priority
1 | When multiple Admin 0 locations appear in a document, the tool may incorrectly identify the actual country | High
2 | May find an incorrect ISO code for locations that are in neither the Admin 1 nor the Admin 2 location database | High
3 | Extracts ISO codes of locations that are not in the country | Medium
4 | The spaCy model extracts duplicate locations (e.g. 'Gatsibo' and 'Gatsibo District') | Low

Evaluating QA Models

Common Metrics to Evaluate QA Models

There are two dominant metrics used by many question answering datasets, including SQuAD: exact match (EM) and F1 score. These scores are computed on individual question+answer pairs. When multiple correct answers are possible for a given question, the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.

Exact Match

For each question+answer pair, if the characters of the model's prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.

F1

F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it's computed over the individual words in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.
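
To make these definitions concrete, the sketch below computes EM and a token-level F1 for a single prediction. The normalisation is a simplified version of what the official SQuAD evaluation script does (the official script also strips articles), and the example answers are purely illustrative:

import re
from collections import Counter

def normalize(text):
    # Lower-case, strip punctuation and split into tokens (simplified normalisation).
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction, truth):
    # 1 if the normalised strings are identical, 0 otherwise.
    return int(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    pred_tokens, truth_tokens = normalize(prediction), normalize(truth)
    # Count the words shared between the prediction and the ground truth.
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)
    recall = num_shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# When several correct answers exist, the maximum score is taken.
truths = ["Albert Einstein", "Einstein"]
prediction = "Einstein"
print(max(exact_match(prediction, t) for t in truths))  # 1
print(max(f1_score(prediction, t) for t in truths))     # 1.0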

Datasets Used

The Stanford Question Answering Dataset (SQuAD) is a set of question-answer pairs that present a strong challenge for NLP models. Due to its size (100,000+ questions), its difficulty (the model only has access to a single passage), and the fact that its answers are more complex and thus require more intensive reasoning, SQuAD is an excellent dataset to train NLP models on.
SQuAD 1.1, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
We therefore decided to use a model that has been pre-trained on SQuAD2.0 for improved performance.

Our Evaluations

We shortlisted five well-performing pre-trained QA models from Hugging Face and compared them using their F1 and EM scores on the SQuAD2.0 dataset, which are as follows:


Model Name | F1 | EM
distilbert-base-cased-distilled-squad | 86.996 | 79.600
deepset/roberta-base-squad2 | 82.950 | 79.931
deepset/minilm-uncased-squad2 | 79.548 | 76.192
bert-large-uncased-whole-word-masking-finetuned-squad | 83.876 | 80.885
deepset/bert-base-cased-squad2 | 74.671 | 71.152
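
For reference, here is a minimal sketch of how one of these pre-trained models can be loaded and queried with the Hugging Face transformers question-answering pipeline; the context and question below are illustrative and not taken from a DREF document:

from transformers import pipeline

# Load one of the shortlisted models pre-trained on SQuAD2.0.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Illustrative context and question only; in the tool the context comes from
# the text extracted from a DREF document.
context = (
    "The operation was launched on 12 March 2021 in response to flooding. "
    "Approximately 5,000 people were affected across the region."
)
result = qa(question="How many people were affected?", context=context)
print(result["answer"], result["score"])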

We also ran all five models on a sample document provided by our client, asked them questions related to our requirements, tabulated the results, and asked our peers to rate the sensibility of each answer on a scale of 0 to 5. The results are as follows:

Evaluating spaCy Models

Testing location detection

spaCy offers three main English NLP pipelines: small, medium, and large, which can be chosen depending on the task. The team decided that the large model would be best for our task of extracting all Geopolitical Entities (GPEs) from the document.

We carried out a test on all three models, using the same extract of text, taken from the first two paragraphs of the Wikipedia page on the United States. The extracted GPEs were then put into a list, as shown below:

Small model - ['The United States of America', 'U.S.A.', 'USA', 'the United States', 'U.S.', 'US', 'America', 'The United States', 'the Federated States', 'the Marshall Islands', 'the Republic of Palau', 'Canada', 'Mexico', 'Bahamas', 'Cuba', 'Russia', 'Washington', 'D.C.', 'New York City']

Medium model - ['The United States of America', 'USA', 'the United States', 'U.S.', 'US', 'America', 'The United States', 'the Marshall Islands', 'the Republic of Palau', 'Canada', 'Mexico', 'Bahamas', 'Cuba', 'Russia', 'Washington', 'D.C.', 'New York City']

Large model - ['The United States of America', 'U.S.A.', 'USA', 'the United States', 'U.S.', 'US', 'America', 'The United States', 'the Federated States of Micronesia', 'the Marshall Islands', 'the Republic of Palau', 'Canada', 'Mexico', 'Bahamas', 'Cuba', 'Russia', 'Washington', 'D.C.', 'New York City']
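
For reference, a minimal sketch of this kind of GPE extraction with spaCy (assuming the three pipelines are installed; the example text is a short illustrative sentence rather than the full Wikipedia extract):

import spacy

# Assumes the pipelines have been downloaded, e.g.:
#   python -m spacy download en_core_web_lg
for model_name in ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]:
    nlp = spacy.load(model_name)
    doc = nlp("The United States shares land borders with Canada and Mexico.")
    # Keep only the entities labelled as geopolitical entities (GPE).
    gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    print(model_name, gpes)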

Comments on results

Although all three models do a good job of identifying the locations in the text, in one instance the large model was able to recognise “the Federated States of Micronesia” as a GPE, whereas the small model recognised only part of the GPE (“the Federated States”) and the medium model did not identify it at all.

While this may not seem like an important factor, it is very important to our clients. Their main requirement was for us to extract precise locations from the DREF documents, so that they can correctly identify disasters and issue funds accordingly. As seen in the model statistics table below, the large model is significantly larger than the small and medium models. However, for the reasons stated previously, we prioritised accuracy over any other factor.

Model statistics

Other metrics we considered when choosing our model include storage size, F-score, and word vector size. A table comparing the three models using these metrics is shown below:

Metric | Small model | Medium model | Large model
Size | 12 MB | 40 MB | 560 MB
Word Vector Size | 0 keys, 0 unique vectors (0 dimensions) | 514k keys, 20k unique vectors (300 dimensions) | 514k keys, 514k unique vectors (300 dimensions)
F-Score | 0.85 | 0.85 | 0.85

Although all three models have the same F-score, the large model has a much larger word vector size than the small and medium models, making it more accurate for our purposes.


Evaluating Different String Matching Methods

To decide which method of string matching was best, we compared the Fuzzywuzzy module, edit distance, and cosine matching, using 40 random Admin 1 and Admin 2 locations extracted from 15 random documents. We tested two different cases:


1. We filter the dataframe obtained from reading the Admin 1 Code and Admin 2 Code spreadsheets using the Country Code.

2. We filter the dataframe obtained from reading the Admin 1 Code and Admin 2 Code spreadsheets using the Country Code, and additionally match the first letter of the Admin 1/Admin 2 locations with the first letter of the location whose code we are trying to obtain (this was done to see whether execution time could be reduced).


We compared the three types of string matching on the basis of the time taken to execute the program, the number of ‘Not Found’ values returned, and accuracy, calculated as (number of correct answers / total number of answers) × 100. The results obtained are as follows (a sketch of the fuzzy-matching setup is shown after the table):


Type of String Matching | Time Taken to Execute Program | No. of 'Not Found' Values Returned | Accuracy
Fuzzy Matching without first letter filter | 11.3 minutes | 0 | 95%
Fuzzy Matching with first letter filter | 17 minutes | 1 | 90%
Cosine Matching without first letter filter | 49.6 minutes | 0 | 85%
Cosine Matching with first letter filter | 16.29 minutes | 1 | 85%
Edit Distance without first letter filter | 10.3 minutes | 0 | 90%
Edit Distance with first letter filter | 10.42 minutes | 1 | 87.5%
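
A minimal sketch of the fuzzy-matching approach in case 1 is shown below; the column names ('country_code', 'admin1_name', 'admin1_pcode') and the spreadsheet filename are illustrative assumptions and differ from the actual spreadsheets:

import pandas as pd
from fuzzywuzzy import process

def find_pcode(location, country_code, admin1_df, score_cutoff=80):
    # Case 1: filter the Admin 1 dataframe by country code, then fuzzy-match
    # the extracted location name against the remaining rows.
    candidates = admin1_df[admin1_df["country_code"] == country_code]
    match = process.extractOne(
        location, candidates["admin1_name"].tolist(), score_cutoff=score_cutoff
    )
    if match is None:
        return "Not Found"
    name, _score = match
    return candidates[candidates["admin1_name"] == name].iloc[0]["admin1_pcode"]

# Hypothetical usage with a spreadsheet of Admin 1 codes.
admin1_df = pd.read_excel("admin1_codes.xlsx")  # illustrative filename
print(find_pcode("Gatsibo", "RWA", admin1_df))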

Critical Evaluation

User Interface

Our user interface was created using Tkinter, which was sufficient to meet our requirements. It is also very user friendly and meets HCI standards, as it is well labelled with instructions and has help buttons for further clarification.

However, a minor improvement would be to use a different UI library to make the interface more aesthetically pleasing and to better fit the Red Cross' style/theme.

Stability

The interface is deployed as a program on relevant devices. The system may take a while to complete an extraction, but so far we have not had an issue with the system crashing. Therefore, our system is very stable.

Efficiency

A significant shortcoming of our system is its efficiency. As there are so many integrated parts, the final extraction takes a while to complete. More specifically, this is due to the QA model and the fuzzy matching used to find the P-codes for locations. The P-code extraction is slow because of the pattern matching structure that we implemented: once the Admin locations have been extracted, the tool tries to find the matching P-code in the Excel database.

However, the databases contain approximately 40,000 rows of data. Our efforts to reduce this time by filtering on country codes brought extraction down to an average of around 2 minutes. Although this is a huge cut in time, it could be improved further by implementing a better algorithm to find the correct match more quickly.

Compatibility

Our program is deployed as an application built with the Python UI library Tkinter. It was optimised for Windows operating systems only. It was not a requirement from our clients for it to work on other operating systems, as they had specified that they are Windows-based.

Maintainability

Our system was structured with maintainability in mind. It separates the frontend from the backend, so that the communication between the two can be clearly seen in the code. We used an object-oriented approach to structure and separate tasks into different classes. All of this means that the code can easily be modified or extended for future use.
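
Purely as an illustration of this frontend/backend separation (the class and method names below are hypothetical and do not correspond to our actual code), the structure looks roughly like this:

class ExtractionBackend:
    """Wraps the QA model, the spaCy pipeline and the P-code lookup."""
    def extract(self, document_path: str) -> dict:
        # Run the NLP pipeline and return a field -> value mapping.
        raise NotImplementedError

class AppFrontend:
    """Tkinter UI layer; it only talks to the backend through extract()."""
    def __init__(self, backend: ExtractionBackend):
        self.backend = backend

    def on_submit(self, document_path: str) -> None:
        results = self.backend.extract(document_path)
        # Display the extracted fields in the UI.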

Project management

As a team, we practised an agile approach to software engineering. We had lab meetings every Tuesday, where we discussed progress with our TA. Aside from that, we also held a weekly team meeting to delegate tasks for the week and discuss progress and ideas, keeping each other on the same page.

Our main means of team communication was WhatsApp, and other meetings and discussions were held on Microsoft Teams. Our collaborative platform was Notion, where we stored meeting minutes, delegated weekly tasks to each member, and kept any other important documents or information. This helped us keep track of progress and ensure that everyone knew what tasks they were meant to be doing at any given time during the project.

Future Work

Though our system meets all of our clients' requirements, there are still some improvements that could be made. One major improvement would be the retrieval time, which could be reduced by building upon the fuzzy matching that has already been done. Another improvement would be to further train the existing QA model. As evaluated above, the QA model has some shortcomings in terms of answer accuracy, which could definitely be improved.

We also believe a feedback loop could be implemented, wherein users rate the accuracy of answers on the page where the answers are displayed. This would provide a continuous stream of feedback that could be used to further train the QA model as it is used.

While we have written the Text Summarizer and the Image Slicing code needed to extract details about the Operational Strategy, these have not yet been integrated into the tool, as our client expressed that they were not currently interested in focusing on them. This feature could be integrated into the tool in the future to increase functionality.