Implementation


Implementation Overview (diagram of the new integration pipeline)

1.) Retrieve file(s) from drag and drop in UI

2.) Read first page of document

3.) Run QA model on first page to extract answers

4.) Get ISO code of country extracted by QA model using fuzzy matching

5.) Run spaCy to extract locations

6.) Remove duplicate locations found

7.) Get P-codes of locations found using fuzzy matching

8.) Combine all extracted data and display on UI

9.) Save answers to the database

* Note that all files mentioned can be found in src/Backend/Integration.
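The numbered steps above can be sketched end to end as a single driver function. This is a minimal sketch only: the helpers are passed in as parameters, and their names (read_file, extract_answers, find_iso_code, extract_locations, find_p_codes) are illustrative assumptions, not the actual interfaces in src/Backend/Integration.

```python
def process_document(path, read_file, extract_answers, find_iso_code,
                     extract_locations, find_p_codes):
    # steps 1-2: read the document and keep the first page
    page_one, _rest = read_file(path)
    # step 3: run the QA model over the first page
    answers = extract_answers(page_one)
    # step 4: resolve the extracted country name to an ISO code
    iso_code = find_iso_code(answers.get("What is the Country of Disaster?", ""))
    # steps 5-6: spaCy location extraction with duplicates removed
    locations = extract_locations(page_one)
    # step 7: fuzzy-match the locations to their P-codes
    p_codes = find_p_codes(locations)
    # step 8: combine everything for the UI (step 9, saving, not shown)
    return {"answers": answers, "ISO_code": iso_code, "p_codes": p_codes}
```

In the real pipeline each parameter would be backed by one of the classes described in the sections below.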

Retrieving and reading file(s)

As shown in the UI design page, users are able to upload one file or multiple files. We use the Python library 'PyPDF2' to extract the text content of the PDFs. This is done using the read_file() method in the 'readfile.py' file.

 
                        
                            def read_file(self, file):
                                # open the PDF once and reuse the same reader
                                with open(file, 'rb') as f:
                                    pdfReader = PdfReader(f)
                                    # keep the first page separate from the rest
                                    pageone = pdfReader.pages[0].extract_text()
                                    text = ""
                                    for page in pdfReader.pages[1:]:
                                        text += page.extract_text()
                                return pageone, text
                        
                    

Answer extraction

Then we run the QA model on the first page to extract the answers to the questions shown below. The questions are stored as a list in the class attributes, and the answers are stored as a dictionary in the class attribute self.answers.

 
                        
                            def __init__(self, file):
                                self.file = file
                                self.questions=["What is the Country of Disaster?",
                                "What is the Operation Start Date?", 
                                "What is the Operation End Date?",
                                "What is the number of people affected?", 
                                "What is the number of people assisted?", 
                                "What is the Glide Number?", 
                                "What is the Operation n°?", 
                                "What is the Operation Budget?", 
                                "What is the Host National Society?"]
                                self.answers = {}
                                self.extract_ans()
                        
                    

The method that does the main extraction is extract_ans(), shown below. It loops through the questions and stores each answer in the self.answers dictionary, with the question as the key and the answer as the value.

 
                        
                            def extract_ans(self):
                                # get first page of document
                                reader = ReadFile()
                                first_page = reader.exec(self.file)[0]
                                context = first_page

                                # build the QA pipeline once, then loop through the questions
                                nlp = pipeline('question-answering', model = model, tokenizer = tokenizer)
                                for question in self.questions:
                                    answer = nlp({'question': question, 'context': context})['answer']
                                    self.answers[question] = str(answer)
                        
                    
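The shape of the question-and-answer loop can be exercised without downloading the model by stubbing out the Hugging Face pipeline. The sketch below is illustrative only: extract_answers and fake_qa are hypothetical names, and in the real code the qa callable would be pipeline('question-answering', ...).

```python
# Minimal sketch of the answer-extraction loop with the QA pipeline
# stubbed out, to show the question -> answer dictionary shape.
def extract_answers(context, questions, qa):
    answers = {}
    for question in questions:
        result = qa({'question': question, 'context': context})
        answers[question] = str(result['answer'])
    return answers

# stub QA callable that "answers" with the first word of the context
fake_qa = lambda inp: {'answer': inp['context'].split()[0], 'score': 1.0}
ans = extract_answers("Kenya floods, March 2020",
                      ["What is the Country of Disaster?"], fake_qa)
# ans == {'What is the Country of Disaster?': 'Kenya'}
```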

Location extraction

For extracting locations, we use the spaCy library. As seen in the exctract_loc() method below, we first load the spaCy model into the 'nlp' variable. We then create a 'doc' container by passing the text of the DREF document to 'nlp'; this container holds all the annotations, labels and entities of the text.

 
                        
                            def __init__(self, file):
                                self.file = file
                                self.locations = self.exctract_loc()
                        
                    
 
                        
                            def exctract_loc(self):   
                                nlp = spacy.load("en_core_web_lg")
                                # ...
                                doc = nlp(first_page)

                                locations = []
                                for ent in doc.ents:
                                    if ent.label_ == "GPE":
                                        locations.append(ent.text)

                                # Remove duplicate locations
                                clean = []
                                for i in locations:
                                    if i not in clean:
                                        clean.append(i)
                                return clean
                        
                    

The function gets all the entities labelled "GPE" (Geopolitical Entity) and places them in a temporary list, 'locations'. We then build another list, 'clean', which is the list of locations with exact-match duplicates removed. This adjusted list is then saved to the self.locations class attribute.
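The same order-preserving de-duplication can be written more compactly with dict.fromkeys, which keeps the first occurrence of each key:

```python
# Order-preserving duplicate removal: dict.fromkeys keeps the first
# occurrence of each location, matching the explicit loop's behaviour.
locations = ["Nairobi", "Kenya", "Nairobi", "Mombasa", "Kenya"]
clean = list(dict.fromkeys(locations))
# clean == ['Nairobi', 'Kenya', 'Mombasa']
```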

Code Matching

We decided to use fuzzy matching to compare the extracted locations against the Excel spreadsheet of Admin 0, 1 & 2 locations, thus finding the corresponding codes. We use the Python library 'Pandas' to read the Excel sheets. This is done in the file called exctract_loc().

Process for finding ISO codes:

1.) Filter the Excel sheet by the first letter of the country extracted by the QA model

2.) Try to find exact match

3.) If no exact match is found, find the best match by comparing the 'fuzz ratio'

The fuzz ratio is a similarity score between two strings: the higher the score, the more similar the strings are.
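The fuzzy-matching library is not named here; as an illustration of the idea, the standard-library difflib computes a comparable similarity ratio (the fuzz_ratio helper below is a hypothetical stand-in, not the project's code):

```python
from difflib import SequenceMatcher

def fuzz_ratio(a, b):
    # 0-100 similarity score, in the spirit of a fuzzy-matching ratio
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# identical strings score 100, and a near-miss spelling like
# fuzz_ratio("Moçambique", "Mozambique") scores much higher than an
# unrelated pair such as fuzz_ratio("Moçambique", "Madagascar")
```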

The function find_p_code() returns a tuple (tag, p-code), where the tag is 0 if there is no match, 1 if the location is Admin 1, and 2 if the location is Admin 2.

 
                            
                                def __init__(self, admin_0, loc_list):
                                    self.admin_0 = admin_0
                                    # list of dict of pcodes for admin 1 & locations
                                    self.loc_list = loc_list
                                    self.ISO_code = self.getISOCode()
                                    self.p_code_1 = []
                                    self.p_code_2 = []
                            
                        
 
                            
                                def loop_p_codes(self):
                                    for loc in self.loc_list:
                                        # call find_p_code() once per location
                                        tag, p_code = self.find_p_code(loc)
                                        entry = {"Location": loc, "P-Code": p_code}
                                        if tag == 1:
                                            self.p_code_1.append(entry)
                                        elif tag == 2:
                                            self.p_code_2.append(entry)
                            
                    
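find_p_code() itself is not shown above. A hedged sketch of the matching logic it is described as implementing (exact match first, then the best fuzz-ratio match) is below, using plain {name: p_code} dicts in place of the pandas sheets and difflib in place of the fuzzy-matching library; the names and the 0.8 threshold are assumptions for illustration:

```python
from difflib import SequenceMatcher

def _ratio(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_p_code(loc, admin_1, admin_2, threshold=0.8):
    # admin_1 / admin_2 are {name: p_code} dicts standing in for the
    # spreadsheet rows; this is a sketch, not the project's actual code.
    # 1. try an exact match first
    for tag, table in ((1, admin_1), (2, admin_2)):
        if loc in table:
            return tag, table[loc]
    # 2. otherwise keep the best fuzzy match above the threshold
    best, best_score = (0, None), threshold
    for tag, table in ((1, admin_1), (2, admin_2)):
        for name, code in table.items():
            score = _ratio(loc, name)
            if score > best_score:
                best, best_score = (tag, code), score
    return best
```

A location with a small spelling difference still resolves to its code, while an unrelated name falls through to the (0, None) no-match case.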

Display and save

The communication between the backend and the UI is done in the file called fronttoback.py. The function exctract_answers() runs the entire main function in new_integ.py and stores the result for display in the app.

Technology

Frontend

Framework: Tkinter

Libraries: react-cookie (for using cookies), react-redux (for using Redux), react-icons, react-router-dom (managing routes), runtime-env-cra (injecting the runtime environment in production), framer-motion (animations)

Backend

Framework: Hugging Face

Model: RoBERTa ("deepset/roberta-base-squad2")

Framework: spaCy

Model: "en_core_web_lg"

Deployment

Postgres

For deploying the database