Week 1 - 5
4 Oct - 9 Nov
First of all, in the very first week, we got introduced to each
other as well as to the clients and the supervisor. We had our first
discussion with our client, the Chartered Society of Physiotherapy
(CSP), which is the professional body and trade union for
physiotherapists in the United Kingdom. We were assigned a project
to develop a chatbot that describes key terms and phrases from the
Physiotherapy Health Informatics Strategy (PHIS). During this
project, we also got help from IBM with tackling technical problems.
To kick off, we set up everything that we need for this project. We
created GitHub repositories for both the actual project and the
portfolio website. We also created a Discord server as our team's
communication platform and a Notion workspace for assigning tasks.
Furthermore, we created a Gantt chart to track the project's
development. Lastly, we started our portfolio website by picking a
template that uses Bootstrap.
14 Nov - 28 Nov
After reading week, we finished the HCI assignment and also created
the first mockup for the CSP chatbot. Up to this week we had not met
the client (CSP); we had only received one email from them regarding
the requirements, so we had not yet shown a proper mockup to the
client.
Unfortunately, on 15 November, the module leader informed us that
our client had changed from CSP to NHS England. Because of this, we
needed to redesign our project, which meant drawing the sketches
again, redoing the prototype, and gathering the project's
requirements once more. Our new project is about building a
first-generation prototype of a "chatbot generator" for IT admins
who run their trusts' online services and search pages.
After we got the list of features for this chatbot generator, we
started by researching the first bullet point: how to automate web
scraping into a standardised format for LLM processing. We did
research on Large Language Models (LLMs) and web scraping/crawling.
Due to the changes in the project, we switched our tech stack to
MongoDB, React (as the front-end), Django (new), and IBM Watson
Assistant. We also researched the web scraping frameworks available
as Python libraries, namely BeautifulSoup, Selenium, and Scrapy. For
now, we want to use Selenium to try scraping some simple websites.
This was our very first idea about the implementation of the
project.
28 Nov - 12 Dec
After some rough weeks, we managed to show some progress on the
project. We integrated IBM Watson Assistant into a sample website
that runs on React. We also found out that Watson Assistant has an
integration feature called Search, which extends the scope of the
assistant by searching through our documents and websites.
Furthermore, we successfully tried web scraping with Selenium on a
website called "Books to Scrape". Even though we tried Selenium for
the scraping, we still want to find the Python library that best
fulfils our needs for scraping/crawling. We care a lot about
efficiency and about working with large datasets. Our contenders are
BeautifulSoup and Scrapy, and we will firm up our choice by next
week.
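For reference, here is a minimal sketch of the kind of Selenium experiment we ran on "Books to Scrape" (https://books.toscrape.com). The use of Chrome and the CSS selector are illustrative assumptions, not our exact script.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a local Chrome/ChromeDriver setup
try:
    driver.get("https://books.toscrape.com/")
    # Each book on the page sits inside an <article class="product_pod"> element
    titles = [
        link.get_attribute("title")
        for link in driver.find_elements(By.CSS_SELECTOR, "article.product_pod h3 a")
    ]
    print(titles[:5])
finally:
    driver.quit()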
12 Dec - 19 Dec
Week 10, the last week of term one, hip hip hooray!!! 🎆🤩
This week, for the technology stack, we reached the conclusion that
we will use MongoDB as the database, React as the front-end, Django
as the back-end, and Scrapy as the Python library for web
scraping/crawling. We chose Scrapy in the end because it provides
the features that we need: we are expecting large datasets from each
trust's website, and Scrapy is the perfect solution for us. Scrapy
is also more efficient and faster than the other contenders, and it
can extract data in different formats, such as CSV, XML, and JSON.
In this project, we will use the JSON format, as the data will be
transferred to a NoSQL database (MongoDB).
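As a rough illustration of this choice, a Scrapy spider yields items that the framework can export straight to JSON. The spider below is a placeholder sketch, not our real crawler; the name, start URL, and fields are illustrative.

import scrapy


class TrustPagesSpider(scrapy.Spider):
    """Illustrative spider: collect the title and URL of every page on one site."""
    name = "trust_pages"
    allowed_domains = ["books.toscrape.com"]  # placeholder domain, not an NHS trust
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow internal links so the crawl covers the whole site
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with "scrapy runspider trust_pages_spider.py -O pages.json" writes the scraped items as a JSON array, which is the shape we want to push into MongoDB.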
19 Dec - 8 Jan
During the winter break, we agreed to split the tasks into three,
one for each member. The first task is the web scraping part. The
second task is creating the Dialog JSON and transferring it to a
database. The last task is improving the portfolio website and
updating the development blog as well as the monthly videos.
For the portfolio website, we managed to improve it and completed
all the layouts, so later on we only need to fill in the content. We
also keep the development blog updated bi-weekly, along with the
monthly videos.
For the web scraping part, we started by finding a common layout and
structure across all NHS trusts' websites. We did this because there
are too many pages on a single website, and many of them contain
irrelevant information, which can affect the result. This strategy
also avoids the huge amount of time it would take to check the pages
one by one.
We found that there are patterns in the existing websites: the
relevant information is usually stored on pages with titles such as
"Contact us"/"Contact", "About us", and so on. So we decided to
collect the common keywords of these pages and filter the URLs
gathered in the previous stage against these keywords.
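Here is a minimal sketch of that keyword filter; the keyword list and example URLs are illustrative, not the exact ones we use.

# Keep only the URLs whose path mentions one of the common page keywords.
KEYWORDS = ["contact", "about", "find-us", "location", "services"]


def filter_urls(urls, keywords=KEYWORDS):
    return [url for url in urls if any(kw in url.lower() for kw in keywords)]


if __name__ == "__main__":
    crawled = [
        "https://www.example-trust.nhs.uk/about-us/",
        "https://www.example-trust.nhs.uk/news/2022/board-minutes/",
        "https://www.example-trust.nhs.uk/contact-us/",
    ]
    print(filter_urls(crawled))  # keeps only the about-us and contact-us pages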
9 Jan - 23 Jan
Our main goals for these two weeks were to integrate the web
scraping part into the Django back-end and to make sure that we use
real data, not dummy data.
First of all, we split the problem that we had during the winter
break into two sub-tasks: removing duplicates and handling trusts
with several clinics (which can make a website's layout differ). We
managed to solve the first sub-task, and we also handled the
several-clinics situation when filtering addresses.
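As an illustration of the de-duplication sub-task, the idea is simply to keep the first occurrence of each normalised entry; the normalisation below is an assumption for the sketch, not our exact rule.

def deduplicate(entries):
    """Drop repeated scraped entries while keeping their original order."""
    seen = set()
    unique = []
    for entry in entries:
        key = entry.strip().lower()  # naive normalisation for the sketch
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique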
The main obstacle we hit when integrating the web scraping part was
an incorrect path. The problem arises because the current working
directory changes, so the web scraping tool cannot import its models
or read and write JSON files correctly. To solve this, we took the
location where the web scraping tool itself is stored and used that
location whenever the tool needs to import models or manipulate JSON
files.
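In practice this means resolving paths relative to the tool's own file instead of the current working directory. A minimal sketch, with an illustrative file name:

import json
from pathlib import Path

# Folder where the web scraping tool itself lives, independent of os.getcwd()
BASE_DIR = Path(__file__).resolve().parent


def load_dialog_template(name="dialog_template.json"):
    # Reads the file relative to the tool's own directory
    with open(BASE_DIR / name, encoding="utf-8") as f:
        return json.load(f)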
For the front-end and back-end, we also made some improvements.
Firstly, we added a table at the bottom of our web app for tracking
the previous Dialog JSONs generated in the MongoDB database; this
table shows website links and reference codes in pairs. Furthermore,
we added a notification mechanism to help users stay informed about
the status of the Dialog JSON generation. In addition to the
front-end updates, we integrated the web scraping part to replace
the dummy data that was used before.
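As an aside, here is a hedged sketch of what the back-end behind the new history table could look like, assuming a MongoDB collection with "website" and "reference_code" fields; the connection string, database, and field names are illustrative, not our exact schema.

from django.http import JsonResponse
from pymongo import MongoClient

# Illustrative connection and collection names
client = MongoClient("mongodb://localhost:27017")
dialogs = client["chatbot_generator"]["dialogs"]


def dialog_history(request):
    """Return website/reference-code pairs for the table at the bottom of the app."""
    pairs = [
        {"website": doc.get("website"), "reference_code": doc.get("reference_code")}
        for doc in dialogs.find({}, {"website": 1, "reference_code": 1, "_id": 0})
    ]
    return JsonResponse({"history": pairs})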
23 Jan - 6 Feb
During these two weeks, we did not progress as much because of a
maths test, but there was still some important progress. Our main
goals were to refactor the source code and to scale up the answer
generation system.
To scale up the answer generation system, we read some papers on NLP
question answering (NLP QnA). Three papers from Facebook Research
are FiD, Atlas, and DPR, and there is also a useful model card on
Hugging Face for "rag-token-nq". In addition, we tried the Microsoft
Azure Health Bot service and dug deeper into Scrapy.
For the code refactoring, we started by rewriting everything as a
class instead of many functions spread across many files, because
our previous approach could confuse other people reading the code.
Instead of outputting everything as JSON files, we decided to use a
Python dictionary to hold all the text scraped by the spider. The
last thing we fixed was finding another way to run the Scrapy script
instead of using CMD commands. We fixed it by using the signal
library and by adding the arguments '--nothreading --noreload' when
running Django from the command line.
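For context, the usual way to launch a spider from Python code instead of the "scrapy crawl" command looks roughly like the sketch below; our actual wiring into Django differed and relied on the signal handling and the '--nothreading --noreload' arguments mentioned above.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider(spider_cls):
    """Run a Scrapy spider in-process instead of via the command line."""
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes; needs the main thread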
Mentioned websites and papers:
https://github.com/facebookresearch/FiD
https://github.com/facebookresearch/atlas
https://github.com/facebookresearch/DPR
https://huggingface.co/facebook/rag-token-nq
6 Feb - 27 Feb
During these two weeks, we made a lot of progress on the project. We
reached the conclusion that, to get the answers for the chatbot, we
will use our web scraping tool together with the Bing API: all
predefined questions will be answered based on the web scraping
tool, and the other, general questions will be passed to the Bing
API.
The flow of our chatbot generation service: a trust's website URL
goes into our web scraping tool, the scraped data is turned into a
Dialog JSON and stored in MongoDB, and the generated chatbot answers
predefined questions from that data while passing general questions
to the Bing API.
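Below is a hedged sketch of the routing step of that flow, assuming the scraped answers sit in a simple dictionary and that general questions go to the Bing Web Search API; the question keys, fields, and fallback message are illustrative.

import requests

SCRAPED_ANSWERS = {  # illustrative: the real service builds this from the web scraping tool
    "what is the address": "Example Trust, 1 Hospital Road, Example City",
    "how do i contact the trust": "Phone: 000 0000 0000",
}


def answer(question, bing_key):
    key = question.lower().strip("?").strip()
    if key in SCRAPED_ANSWERS:                      # predefined question
        return SCRAPED_ANSWERS[key]
    resp = requests.get(                            # general question: fall back to Bing
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": bing_key},
        params={"q": question, "count": 1},
    )
    results = resp.json().get("webPages", {}).get("value", [])
    return results[0]["snippet"] if results else "Sorry, I could not find an answer."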
27 Feb - 13 Mar
For these two weeks, we finished the project by wrapping everything
up. We started by fixing the bugs that occurred during the web
scraping process. To assure the client that our generation service
is adaptable, we tried various websites (under the NHS domain). We
did this successfully: we got all the relevant data, and the chatbot
could answer the questions as well.
13 Mar - 24 Mar
For the final two weeks of the project, we finalised everything,
from the actual project to the project website. We started by doing
all the testing for our chatbot generation service: Unit Testing,
System & Integration Testing, API Testing, and User Acceptance
Testing. Finally, we wrapped up the project by providing the "ready
for production" source code on GitHub. In addition, we also finished
the technical video, the final presentation video, the demo video,
and of course, the project website.
On the 21st of March, we also gave another demonstration to people
from IBM, Microsoft, PlayStation, etc. They were impressed by our
project and asked several interesting questions about the NHS
Auto-chatbot (scalability, future work).
Thank you to everyone who read our development blog from start to
finish!
We will see you around :D
Warm regards,
Rifqi, Zhiyu, and Yunsheng