Web Scraping and Information Filtering
This process scrapes content from webpages under a given NHS Trust domain and retrieves information according to our predefined questions. The answers are based on the information that the website provides.
Take the Domain from the Frontend
Our Django backend calls the get_answer function in get_answer_dict.py and passes the domain as an argument called url. We then use this domain to retrieve all the information we need.
view.py (Django backend)
def generate_json_from_existing(request):
    # POST - provide url and generate
    data = json.loads(request.body)
    print(data)  # only url and ref
    if USE_DUMMY_DATA:
        data["answers"] = get_dummy_answer(data["url"])
    else:
        try:
            data["answers"] = get_answer(data["url"])
        except IndexError:
            print('index error')
    . . . . . .
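For reference, the frontend only needs to POST a small JSON body (the url and a ref) to this view. The sketch below shows an equivalent request made with the Python requests library; the endpoint path and the example domain are placeholders, since the real route is defined in the project's URL configuration.

import requests

# Hypothetical endpoint path and domain - the real route lives in the project's urls.py
resp = requests.post(
    'http://127.0.0.1:8000/api/generate_json_from_existing/',
    json={
        'url': 'www.example-nhs-trust.nhs.uk',  # NHS Trust domain selected on the frontend
        'ref': 'example-ref',                   # reference id sent alongside the url
    },
)
print(resp.status_code)  # the view enriches the posted data with an "answers" dictionary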
get_answer_dict.py
def get_answer(url):
    . . . . . .
    time.sleep(5)
    tool = Tool()
    tool.setup(url)
    . . . . . .
    return {
        'phone': filtered_phonenumber_text,
        'openingtimepage': filtered_openingtime_text,
        'contactpage': filtered_address_text,
        'appointment': appointment_text
    }
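The elided middle of get_answer is where the Tool methods described in the next section are chained together. The sketch below only illustrates that flow for a single category; the keyword lists and the category name are assumptions for illustration, not the actual values used in get_answer_dict.py.

# Illustrative only: how the Tool methods could be chained for one category.
# The keyword lists and the category name below are assumed values.
tool = Tool()
tool.setup(url)                                              # crawl URLs under the NHS Trust domain
urls = tool.filter_url(['Contact', 'Contact us'], ['News'])  # assumed keywords and blacklist
scraped = tool.scrape_text(urls)                             # {'text': ...} for the filtered pages
filtered = tool.filter_text(scraped, 'address')              # assumed category name
filtered_address_text = filtered['filtered_text']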
Webscraptool Controller
After we have received the NHS Trust domain, it is passed to our functions in the Tool class, and we can crawl URLs and extract useful information based on the domain we have received.
The Tool class is stored in tool.py. This is our Webscraptool controller, which calls the other functions we built for retrieving useful information.
tool.py contains 6 functions:
1. setup calls the spider to crawl URLs.
2. crawl_url_by_dictionary calls our function to check whether the locations in our page dictionary exist under a given domain.
3. filter_url filters the crawled URLs by the given keywords.
4. scrape_text calls our Spider to scrape all the text on the given webpages.
5. filter_text extracts the information we need from a bunch of unstructured text (the raw text we scraped from the webpage).
6. time_out forces our Spider to terminate if it gets stuck due to network or other technical issues.
tool.py
def timeout_handler():
    . . . . . .

class Tool(Abstract_Tool):
    def __init__(self) -> None:
        super().__init__()
        self.mainsite = ''
        self.url_dict = []

    def setup(self, link) -> None:
        # timer to terminate the scraping process when it gets stuck (takes too long) due to some error
        timer = threading.Timer(1000.0, timeout_handler)
        timer.start()
        . . . . . .
        # if time is not out we need to cancel the timer
        timer.cancel()
        self.url_dict = results

    def crawl_url_by_dictionary(self, dict_name):
        . . . . . .
        return get_vaild_url.run(self.mainsite, page_dictionary1[dict_name])

    def filter_url(self, keywords, blacklist_keywords) -> list:
        . . . . . .
        return result

    def scrape_text(self, filtered_urls) -> dict:
        . . . . . .
        return {'text': results}

    def filter_text(self, content_dict, category) -> dict:
        . . . . . .
        return {'filtered_text': result}
A detailed description of every function in this class can be found in our interface, abstract_tool.py.
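abstract_tool.py is not reproduced in full here. Interfaces like this are typically declared as abstract base classes, so the sketch below shows an assumed shape reconstructed from the method signatures of Tool above; it is not the exact contents of abstract_tool.py.

from abc import ABC, abstractmethod

# Assumed shape of the interface, reconstructed from the Tool class above.
# Timeout handling (time_out) is implemented at module level in tool.py via timeout_handler.
class Abstract_Tool(ABC):
    @abstractmethod
    def setup(self, link) -> None:
        """Call the spider to crawl URLs under the given domain."""

    @abstractmethod
    def crawl_url_by_dictionary(self, dict_name):
        """Check whether the locations in our page dictionary exist under the domain."""

    @abstractmethod
    def filter_url(self, keywords, blacklist_keywords) -> list:
        """Filter the crawled URLs by the given keywords."""

    @abstractmethod
    def scrape_text(self, filtered_urls) -> dict:
        """Scrape all the text on the given webpages."""

    @abstractmethod
    def filter_text(self, content_dict, category) -> dict:
        """Extract the information we need from the raw scraped text."""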
Web Crawling and Scraping
For a typical process, we start by crawling the URLs under the NHS Trust domain with our Scrapy spider. After it receives the domain URL, it puts it in self.start_urls as a starting point and follows all the links on each page until our depth limit is reached. Depth means how many clicks it takes from the starting URL to reach the current page. Here we set the depth limit to 4 to avoid spending too much time on web crawling.
crawl_url.py
. . . . . .
# allow passing the url as an attribute through the command line
def __init__(self, url=None, *args, **kwargs):
    super(Url_Crawler, self).__init__(*args, **kwargs)
    self.start_urls = [f'http://{url}/']
    self.allowed_domains = [f'{url}', '127.0.0.1', 'localhost']
    self.links = []

# Crawling URLs
def parse_item(self, response):
    depth = response.meta['depth'] if 'depth' in response.meta else 0
    # only follow 4 layers of links and then exit, so as to avoid crawling too many links
    if depth > 4:
        raise scrapy.exceptions.CloseSpider(reason='maximum depth reached!')
    item = {}
    item['title'] = response.css('title::text').get()
    item['url'] = response.request.url
    self.links.append(item)
    yield item
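The spider can be started from the command line with "-a url=...", but it can also be driven programmatically, which is roughly how Tool.setup triggers the crawl. The sketch below uses Scrapy's CrawlerProcess; the import path, settings, and example domain are assumptions.

from scrapy.crawler import CrawlerProcess

# Assumed import path for the spider defined in crawl_url.py
from webscraptool.crawl_url import Url_Crawler

process = CrawlerProcess(settings={'LOG_ENABLED': False})
# Pass the NHS Trust domain in the same way as "-a url=..." on the command line
process.crawl(Url_Crawler, url='www.example-nhs-trust.nhs.uk')
process.start()  # blocks until the crawl finishes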
After we have the URLs from the domain, we use the filter_url function in the Tool class to filter them. This function reads through self.url_dict and checks whether any of the keywords we selected appears in the title.
class Tool(Abstract_Tool):
    . . . . . .
    def filter_url(self, keywords, blacklist_keywords) -> list:
        result = []
        # add homepage
        result.append({'title': '', 'url': 'https://' + self.mainsite, 'keyword': ''})
        # check if keyword is in title
        for url in self.url_dict:
            for keyword in keywords:
                if url['title'] is not None and keyword in url['title']:
                    # check black list
                    blacklist_flag = 0
                    for blacklist_keyword in blacklist_keywords:
                        if blacklist_keyword in url['title']:
                            blacklist_flag = 1
                    if blacklist_flag == 0:
                        result.append({'title': url['title'], 'url': url['url'], 'keyword': keyword})
        return result
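As a usage illustration, filter_url takes a list of keywords to keep and a list of blacklist keywords to reject. The keyword lists below are assumed example values, not the ones used in production.

# Illustrative call with assumed keyword lists
tool = Tool()
tool.setup('www.example-nhs-trust.nhs.uk')          # populates tool.url_dict
contact_pages = tool.filter_url(
    keywords=['Contact', 'Contact us', 'Find us'],  # assumed keep-list
    blacklist_keywords=['News', 'Press']            # assumed blacklist
)
# each entry has the shape {'title': ..., 'url': ..., 'keyword': ...}
for page in contact_pages:
    print(page['keyword'], page['url'])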
Then we scrape all the text on these webpages with our Scrapy spider in get_text.py. We scrape everything except for content wrapped in the script and style tags.
. . . . . .
def parse(self, response):
    all_text = response.selector.xpath('//body/descendant-or-self::*[not(self::script | self::style)]/text()').getall()
    content = ''
    for text in all_text:
        if not str.isspace(text):
            content = content + text + '\n'
    yield {'text': content}
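To see what that XPath expression keeps and drops, here is a small self-contained check using parsel, the selector library Scrapy uses internally. The HTML fragment is invented for illustration.

from parsel import Selector

# Invented HTML fragment for illustration
html = """
<body>
  <h1>Contact us</h1>
  <p>Phone: 020 7946 0000</p>
  <script>console.log('tracking');</script>
  <style>p { color: red; }</style>
</body>
"""

sel = Selector(text=html)
texts = sel.xpath(
    '//body/descendant-or-self::*[not(self::script | self::style)]/text()'
).getall()
# Only the visible text survives; the script and style contents are excluded
print([t for t in texts if not str.isspace(t)])  # ['Contact us', 'Phone: 020 7946 0000']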
Sometimes Scrapy misses pages we need because of a firewall or other issues, so we built another function as a complementary part of our Scrapy web crawler. It uses the requests library to check through a dictionary of common locations on NHS websites where the information we are interested in is usually stored. If the request's status code is 301 or 200, the page exists and is considered "filtered" at the same time, so we do not need to call the filter_url function on it again.
get_vaild_url.py
def run(domain, page_dictionary):
    domain_url = 'http://' + domain + '/'
    urls_exist = []

    def check_url_validity(url):
        # return True if the url exists (status code 200 or 301)
        headers = {'User-Agent':
                   'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        try:
            r = requests.head(url, headers=headers)
            r.close()
        except Exception as e:
            print('error:' + str(e))
            return False
        print(r.status_code)
        return r.status_code == 200 or r.status_code == 301

    for page in page_dictionary:
        time.sleep(1)
        if check_url_validity(domain_url + page):
            urls_exist.append({'url': domain_url + page})
    return urls_exist
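The page dictionary passed into run is a list of common relative paths to probe under the domain. The entries and the call below are assumed examples; the real page_dictionary1 lives elsewhere in the codebase.

# Assumed example of the page dictionary - the real entries live in the codebase
page_dictionary1 = {
    'contactpage': ['contact-us', 'about-us/contact-us'],
    'openingtimepage': ['opening-times', 'visiting-times'],
}

# Probe which of the "contactpage" locations actually exist under the domain
existing = run('www.example-nhs-trust.nhs.uk', page_dictionary1['contactpage'])
print(existing)  # e.g. [{'url': 'http://www.example-nhs-trust.nhs.uk/contact-us'}]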
Information Filtering
Preprocessing
We use the IBM NLU API to do tokenization. This splits the unstructured raw text we scraped from the webpage into pieces and makes later text processing easier.
tokenizer.py
def tokenization(txt):
    . . . . . .
    apikey = env.IBM_NLU_API_KEY
    apiurl = env.IBM_NLU_URL
    authenticator = IAMAuthenticator(apikey)
    natural_language_understanding = NaturalLanguageUnderstandingV1(
        version='2022-04-07',
        authenticator=authenticator
    )
    natural_language_understanding.set_service_url(apiurl)
    # only get sentences
    response = natural_language_understanding.analyze(
        text=txt,
        features=Features(
            syntax=SyntaxOptions(
                sentences=True,
            ))).get_result()
    . . . . . .
    return response
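With sentences=True, the syntax feature returns the detected sentence boundaries, so downstream code can iterate over individual sentences. The snippet below shows roughly how the result can be consumed; the response shape follows IBM's documented syntax output, and the sample text is invented.

# Roughly how the tokenization result can be consumed downstream.
# The sample text is invented; the response shape follows IBM NLU's syntax output.
response = tokenization('Call us on 020 7946 0000. We are open 9am to 5pm.')
for sentence in response.get('syntax', {}).get('sentences', []):
    print(sentence['text'])  # the sentence itself
    # sentence['location'] gives the [start, end] character offsets in the original text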
Data Extraction
We iterate through the result generated by IBM NLU and use the function we wrote in match_content.py to check which pieces of text contain the information we need. We then use the functions in get_addr.py, get_openingtime.py, and get_phone_num.py to extract the related information. More detailed workflows are documented in these files.
Finally, we filter out the duplicate results and move on to the DialogJson generating process.
filter_content.py
def addr(original):
    . . . . . .
    return addr_filtered

def openingtime(original):
    . . . . . .
    return opening_hours_filtered

def phonenumber(original):
    . . . . . .
    return filter_phonenum
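The bodies of these filters are elided above. As one illustration of the kind of de-duplicating filter they perform, the sketch below extracts and de-duplicates UK-style phone numbers with a regular expression; this is an assumed, simplified approach, not the actual implementation in filter_content.py.

import re

# Simplified, assumed sketch of a phone-number filter; not the real filter_content.py code
def phonenumber_sketch(original):
    # loose pattern for UK-style numbers such as "020 7946 0000" or "0300 123 4567"
    pattern = re.compile(r'\b0\d{2,4}[ -]?\d{3,4}[ -]?\d{3,4}\b')
    seen = []
    for sentence in original:
        for match in pattern.findall(sentence):
            if match not in seen:  # drop duplicates while keeping order
                seen.append(match)
    return seen

print(phonenumber_sketch(['Call 020 7946 0000 or 020 7946 0000 after 5pm.']))
# ['020 7946 0000']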
Cache Bing results locally
During our testing, we discovered that the answers from Bing vary over time and that the Bing API charges once a certain limit of API calls is reached. This function caches some Bing results for general questions locally by making a request to our Azure Functions. This keeps the answers consistent and prevents further costs.
bing_azure_function_api.py
def get_bing_result(domain_url, question):
    url = env.BING_AZURE_FUNC_URL
    payload = {
        "q": question,
        "site": 'https://' + domain_url
    }
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": env.BING_API_KEY,
    }
    . . . . . .
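The elided part sends the request and stores the result. A minimal sketch of that caching step is shown below, assuming the Azure Function accepts a POST with this payload and that results are written to a local JSON file keyed by domain and question; the cache filename and the exact response handling are assumptions, not the real storage mechanism.

import json
import os
import requests

import env  # API URLs and keys, see the next section

# Assumed local cache file; the real storage mechanism may differ
CACHE_FILE = 'bing_cache.json'

def get_bing_result_cached(domain_url, question):
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)

    key = domain_url + '|' + question
    if key in cache:
        return cache[key]  # reuse the stored answer for consistency

    payload = {"q": question, "site": 'https://' + domain_url}
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": env.BING_API_KEY,
    }
    r = requests.post(env.BING_AZURE_FUNC_URL, json=payload, headers=headers)
    cache[key] = r.json()

    with open(CACHE_FILE, 'w') as f:
        json.dump(cache, f)
    return cache[key]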
API URLs and Keys
All of the API keys used in this part (Bing API and IBM NLU) are stored in env.py to make development much easier.
MONGODB_URL = "mongodb://******"
BING_AZURE_FUNC_URL = "https://******"
BING_API_KEY = "******"
IBM_NLU_URL = "https://******"
IBM_NLU_API_KEY = "******"