HTTP Request Handling

To handle HTTP requests, we register URL patterns in the chatbotSite/chatbotSite/urls.py file so that our Django backend routes each request to the corresponding function in chatbotSite/dbapp/views.py.

urls.py
from django.urls import path
from dbapp import views

urlpatterns = [
    path('generate_json_e/', views.generate_json_from_existing),
    path('get_json/', views.get_json),
    path('del_json/', views.del_json),
    path('url_list/', views.list_url),
    path('register_url/', views.register_url)
]
views.py
…
@csrf_exempt
def generate_json_from_existing(request):  # POST - provide url and generate
    …
@csrf_exempt
def register_url(request):  # POST - register a URL into the database
    …
def get_json(request):  # GET - get by id
    …
@csrf_exempt
def list_url(request):  # GET
    …
@csrf_exempt
def del_json(request):  # DELETE - delete all
    …
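
For local testing, these endpoints can be exercised with a simple HTTP client. Below is a minimal sketch, assuming the Django development server is running at http://127.0.0.1:8000/ and that generate_json_from_existing expects a JSON body with url and ref (as the view's debug print suggests); the values themselves are illustrative.

import requests

# trigger chatbot JSON generation for a trust website
resp = requests.post(
    'http://127.0.0.1:8000/generate_json_e/',
    json={'url': 'www.exampletrust.nhs.uk', 'ref': 'example-ref-001'},
)
print(resp.status_code)

# list the registered URLs shown in the generation history table
print(requests.get('http://127.0.0.1:8000/url_list/').text)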

Chatbot File Generation

The chatbot JSON file is generated by extracting information from the trust's website and substituting it into target parts of a template Watson Assistant dialog JSON, which is stored as a Python dictionary in the dialogJson/template_json.py file.

template = {
    THE TEMPLATE
}
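
The full template is too long to reproduce here. As a trimmed illustration only, based on the keys accessed by the replacement code below rather than the actual template contents, each dialog node carries a title plus either an output block or, for the search node, an actions block:

template = {
    "dialog_nodes": [
        {
            "title": "Hours Info",
            "output": {"generic": [{"values": [{"text": "PLACEHOLDER"}]}]},
        },
        {
            "title": "Anything else",
            "actions": [{"parameters": {"site": "PLACEHOLDER"}}],
        },
        # ... the remaining nodes of the Watson Assistant dialog ...
    ]
}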

The generate_json_from_existing function in the dbapp/views.py file handles the POST request for generating the chatbot JSON file. It starts by calling the get_answer function from the webscraptool/get_answer_dict.py file to get the data for replacement. Next, it finds the target nodes in the template and fills in the answers.

@csrf_exempt
def generate_json_from_existing(request):  # POST - provide url and generate
    data = json.loads(request.body)
    …
    data["answers"] = get_answer(data["url"])
    …
    ans = Answers(data)
    template = dialog_json_template
    for index in range(len(template["dialog_nodes"])):
        try:
            node_title = template["dialog_nodes"][index]["title"]
            if node_title == "Anything else":
                template["dialog_nodes"][index]["actions"][0]["parameters"]["site"] = data["url"]
            elif node_title == "Hours Info":
                template["dialog_nodes"][index]["output"]["generic"][0]["values"][0]["text"] = ans.get_hours_info()
            elif node_title == "Phone Info":
                …
        except KeyError:
            print("KeyError with this node: " + str(index))
    …

Database Structure

Our MongoDB database consists of two collections: websites_to_dialogs and dialog_json. The websites_to_dialogs collection stores each NHS trust's website and the corresponding reference code; all the information in this collection is displayed in the generation history table. The dialog_json collection stores the chatbot JSON files used to create Watson Assistants.

[Screenshots of the websites_to_dialogs and dialog_json collections]
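
As an illustration only (the url and ref fields follow the pair seen in the view code; any other field names here are assumptions rather than the actual schema), documents in the two collections might look like this:

# websites_to_dialogs: one document per registered trust website
{"url": "www.exampletrust.nhs.uk", "ref": "example-ref-001"}

# dialog_json: one document per generated chatbot JSON file
{"ref": "example-ref-001", "dialog": {"dialog_nodes": ["..."]}}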

Web Scraping and Information Filtering

This process scrapes content from webpages under a given NHS trust domain and retrieves information according to our predefined questions. The answers are based on the information that the website provides.

Take the Domain from the Frontend
Our Django backend calls the get_answer function in get_answer_dict.py and passes the domain as an argument called url. We then use this domain to retrieve all the information we need.

views.py (Django backend)
def generate_json_from_existing(request):  
    # POST - provide url and generate
    data = json.loads(request.body)
    print(data)  # only url and ref
    if USE_DUMMY_DATA:
        data["answers"] = get_dummy_answer(data["url"])
    else:
        try:
            data["answers"] = get_answer(data["url"])
        except IndexError:
            print('index error')
. . . . . .

get_answer_dict.py
def get_answer(url):
. . . . . .

    time.sleep(5)
    tool = Tool()
    tool.setup(url)
. . . . . .
    return {
        'phone': filtered_phonenumber_text,
        'openingtimepage': filtered_openingtime_text,
        'contactpage': filtered_address_text,
        'appointment': appointment_text
    }

Webscraptool Controller

After we receive the NHS trust's domain, it is passed to the functions in the Tool class, which crawl URLs and extract useful information based on that domain.

The Tool class is stored in tool.py. This is our webscraptool controller, which calls the other functions we built for retrieving useful information.

The tool.py file contains six functions (a sketch of a typical call sequence follows the class skeleton below):
1. setup calls our spider to crawl URLs.
2. crawl_url_by_dictionary calls our function to check whether the locations in our page dictionary exist under a given domain.
3. filter_url filters the crawled URLs by the given keywords.
4. scrape_text calls our spider to scrape all the text on the given webpages.
5. filter_text extracts the information we need from the unstructured text (the raw text scraped from the webpages).
6. time_out forces our spider to terminate if it gets stuck due to internet or other technical issues.

tool.py
def timeout_handler():
    . . . . . .

class Tool(Abstract_Tool):
    def __init__(self) -> None:
        super().__init__()
        self.mainsite = ''
        self.url_dict = []

    def setup(self, link) -> None:
        # timer to terminate the scraping process when it gets stuck (takes too long) due to some error
        timer = threading.Timer(1000.0, timeout_handler)
        timer.start()

        . . . . . .
        # if time is not out, we need to cancel the timer
        timer.cancel()
        self.url_dict = results

    def crawl_url_by_dictionary(self, dict_name):
        . . . . . . 
        return get_vaild_url.run(self.mainsite, page_dictionary1[dict_name])

    def filter_url(self, keywords, blacklist_keywords) -> list:
        . . . . . .
        return result

    def scrape_text(self, filtered_urls) -> dict:
        . . . . . .
        return {'text': results}

    def filter_text(self, content_dict, category) -> dict:
        . . . . . .
        return {'filtered_text': result}

A detailed description of every function in this class can be found in our interface, abstract_tool.py.
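
Putting this together, a typical call sequence through the controller looks roughly like the sketch below; the keyword lists, dictionary key, and category name are illustrative, and the exact values used in production live in get_answer_dict.py.

from tool import Tool

tool = Tool()
tool.setup('www.exampletrust.nhs.uk')                      # crawl URLs under the domain
extra_urls = tool.crawl_url_by_dictionary('contact')       # probe common NHS page locations
contact_urls = tool.filter_url(['Contact'], ['Feedback'])  # keep titles matching a keyword, drop blacklisted ones
pages = tool.scrape_text(contact_urls)                     # scrape the raw text of each page
details = tool.filter_text(pages, 'phone')                 # extract the relevant information from the raw text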


Web Crawling and Scraping
In a typical run, we start by crawling the URLs under the NHS trust domain with our Scrapy spider. After it receives the domain URL, the spider puts it in self.start_urls as a starting point and follows all the links on the page until our depth limit is reached. Depth means how many clicks it takes from the starting URL to reach the current page. Here we set the depth limit to 4 to avoid spending too much time on web crawling.

crawl_url.py
. . . . . .    
# allow passing url as an attribute through command line
def __init__(self, url=None, *args, **kwargs):
    super(Url_Crawler, self).__init__(*args, **kwargs)
    self.start_urls = [f'http://{url}/']
    self.allowed_domains = [f'{url}', '127.0.0.1', 'localhost']
    self.links = []

# Crawling URL
def parse_item(self, response):
    depth = response.meta.get('depth', 0)
    # only follow 4 layers of links and then exit, to avoid crawling too many links
    if depth > 4:
        raise scrapy.exceptions.CloseSpider(reason='maximum depth reached!')
    item = {}
    item['title'] = response.css('title::text').get()
    item['url'] = response.request.url
    self.links.append(item)
    yield item
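
How Tool.setup actually launches the spider is elided above. As one possible way to run it from Python (an assumption about the setup, not necessarily how setup does it), Scrapy's CrawlerProcess can be used:

from scrapy.crawler import CrawlerProcess
from crawl_url import Url_Crawler  # assumption: the spider class is importable from crawl_url.py

process = CrawlerProcess(settings={'LOG_ENABLED': False})
process.crawl(Url_Crawler, url='www.exampletrust.nhs.uk')  # url is passed to __init__ as shown above
process.start()  # blocks until the crawl finishes or CloseSpider is raised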

After we get the URLs from the domain, we use the filter_url function in the Tool class to filter them. This function reads through self.url_dict and checks whether any of the keywords we selected appears in the title.

class Tool(Abstract_Tool):
. . . . . .

    def filter_url(self, keywords, blacklist_keywords) -> list:
        result = []
        # add homepage
        result.append({'title':'', 'url':'https://' + self.mainsite, 'keyword':''})
        
        # check if keyword is in title
        for url in self.url_dict:
            for keyword in keywords:

                if url['title'] is not None and keyword in url['title']:
                    # check black list
                    blacklist_flag = 0

                    for blacklist_keyword in blacklist_keywords:

                        if blacklist_keyword in url['title']:  
                            blacklist_flag = 1     
                    if blacklist_flag == 0:            
                        result.append({'title':url['title'], 'url':url['url'], 'keyword':keyword})
        
        return result
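
For example, given a Tool instance that has already been set up, keeping pages whose titles mention contact details while skipping feedback pages might look like this (the keyword lists are illustrative):

contact_urls = tool.filter_url(
    keywords=['Contact', 'Get in touch'],
    blacklist_keywords=['Feedback', 'Complaints'],
)
# each entry has the form {'title': ..., 'url': ..., 'keyword': ...}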

Then we scrape all the text on these webpages with our Scrapy spider in get_text.py. We scrape everything except content wrapped in script and style tags.

. . . . . .     
def parse(self, response):
    all_text = response.selector.xpath('//body/descendant-or-self::*[not(self::script | self::style)]/text()').getall()
    content = ''
    
    for text in all_text:
        if not str.isspace(text):
            content = content + text + '\n'

    yield {'text':content}
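
As a quick, self-contained illustration of what that XPath keeps and drops (the HTML below is made up), Scrapy's Selector can be used on its own:

from scrapy.selector import Selector

html = '<body><h1>Contact us</h1><script>var x = 1;</script><p>Call 020 0000 0000</p></body>'
texts = Selector(text=html).xpath(
    '//body/descendant-or-self::*[not(self::script | self::style)]/text()'
).getall()
print(texts)  # expected: ['Contact us', 'Call 020 0000 0000'], the script content is excluded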

Sometimes Scrapy will miss pages we need because of a firewall or other issues, so we built another function as a complementary part of our Scrapy web crawler. It uses the requests library to check through a dictionary of common locations on NHS websites where the information we are interested in is stored. If the request's status code is 301 or 200, the page exists and is considered "filtered" at the same time, so we do not need to call the filter_url function to filter it again.

get_vaild_url.py
def run(domain, page_dictionary):
    domain_url = 'http://' + domain +'/'
    urls_exist = []

    def check_url_validity(url):
        # return True if the URL exists (status code 200 or 301)
        headers = {'User-Agent': 
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        try:
            r = requests.head(url,headers=headers)
            r.close()
        except Exception as e:
            print('error:' + str(e))
            return False
        print(r.status_code)
        return r.status_code == 200 or r.status_code == 301

    for page in page_dictionary:
        time.sleep(1)
        if check_url_validity(domain_url + page):
            urls_exist.append({'url':domain_url + page})
    
    return urls_exist
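
A call might look like the sketch below; the page list here is illustrative, and the real dictionary of common NHS page locations lives in the module referenced by crawl_url_by_dictionary.

# illustrative list of common NHS site locations
pages = ['contact-us', 'opening-times', 'appointments']
existing_pages = run('www.exampletrust.nhs.uk', pages)
# e.g. [{'url': 'http://www.exampletrust.nhs.uk/contact-us'}, ...]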


Information Filtering

Preprocessing
We use the IBM NLU API for tokenization. This splits the unstructured raw text we scraped from the webpage into pieces and makes later text processing easier.

tokenizer.py
def tokenization(txt):
. . . . . .
            apikey = env.IBM_NLU_API_KEY
            apiurl = env.IBM_NLU_URL
            authenticator = IAMAuthenticator(apikey)
            natural_language_understanding = NaturalLanguageUnderstandingV1(
                version='2022-04-07',
                authenticator=authenticator
            )

            natural_language_understanding.set_service_url(apiurl)

            # only get sentences
            response = natural_language_understanding.analyze(
                text=txt,
                features=Features(
                syntax=SyntaxOptions(
                    sentences=True,
                    ))).get_result()
. . . . . .            
    
    return response
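
With only the syntax/sentences feature requested, the analyze result is a plain dict whose sentences sit under the syntax key. Below is a sketch of pulling them out, assuming tokenization returns the raw result as above; the sample text is made up.

raw_text = 'Opening hours: Monday to Friday, 9am to 5pm. Call 020 0000 0000.'
response = tokenization(raw_text)
sentences = [s['text'] for s in response.get('syntax', {}).get('sentences', [])]
for sentence in sentences:
    pass  # each sentence is then passed to the matching/extraction functions described below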

Data Extraction
We iterate through the results that IBM NLU generated, using the function we wrote in match_content.py to check which pieces of text contain the information we need, and then use the functions in get_addr.py, get_openingtime.py, and get_phone_num.py to extract the related information. More detailed workflows are documented in those files.

Finally, we filter out the duplicate results and move on to the DialogJson generation process.

filter_content.py
def addr(original):
    . . . . . .
    return addr_filtered

def openingtime(original):
    . . . . . .
    return opening_hours_filtered

def phonenumber(original):
    . . . . . .

    return filter_phonenum
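
The function bodies above are elided. As an illustration of the de-duplication step mentioned earlier, an order-preserving way to drop duplicate strings is:

def deduplicate(items):
    # keep only the first occurrence of each item, preserving the original order
    return list(dict.fromkeys(items))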


Cache Bing Results Locally
During testing, we discovered that the answers from Bing vary over time and that the Bing API charges once a certain number of API calls is reached. This function caches Bing results for general questions locally by sending a request to our Azure Function. This keeps the answers consistent and prevents further costs.

bing_azure_function_api.py
def get_bing_result(domain_url, question):
    url = env.BING_AZURE_FUNC_URL

    payload = {
        "q": question,
        "site": 'https://' + domain_url
    }
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": env.BING_API_KEY,
    }
. . . . . .
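
The caching itself is elided above. Below is a minimal sketch of one way it could be done; the file name and the JSON-on-disk format are assumptions, not the actual implementation.

import json
import os

CACHE_FILE = 'bing_cache.json'  # hypothetical local cache file

def get_bing_result_cached(domain_url, question):
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    key = domain_url + '|' + question
    if key not in cache:
        # fall back to the Azure Function call above and store the result
        cache[key] = get_bing_result(domain_url, question)
        with open(CACHE_FILE, 'w') as f:
            json.dump(cache, f)
    return cache[key]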


API URLs and Keys
All of the API URLs and keys used in this part (the Bing API and IBM NLU) are stored in env.py to make development easier.

env.py
MONGODB_URL = "mongodb://******"
BING_AZURE_FUNC_URL = "https://******"
BING_API_KEY = "******"
IBM_NLU_URL = "https://******"
IBM_NLU_API_KEY = "******"