Week 1 - 5
4 Oct - 9 Nov
First of all, in the very first week, we got introduced to each
other as well as to the clients and the supervisor. We had our first
discussion with our client, the Chartered Society of Physiotherapy
(CSP), which is the professional body and trade union for
physiotherapists in the United Kingdom. We were assigned a project
to develop a chatbot that describes key terms and phrases from the
Physiotherapy Health Informatics Strategy (PHIS). During this
project, we also got help from IBM with tackling technical problems.
To kick off, we set up everything that we need for this project. We
created GitHub repositories for both the actual project and the
portfolio website. We also created a Discord server as our team's
communication platform and a Notion workspace for assigning tasks.
Furthermore, we created a Gantt chart to track the project's
development. Lastly, we started our portfolio website by picking a
template that uses Bootstrap.
14 Nov - 28 Nov
After reading week, we finished the HCI assignment and also created
the first mockup for the CSP chatbot. Up to this week we had not met
the client (CSP); we had only received one email from them regarding
the requirements, so we had not yet shown a proper mockup to the
client.
Unfortunately, on 15 November, the module leader informed us that
our client had changed from CSP to NHS England. Because of this, we
needed to redesign our project, which meant drawing the sketches
again, redoing the prototype, and gathering the project's
requirements once more. Our new project is about building a
first-generation prototype of a "chatbot generator" for IT admins
who run their trusts' online services and search pages.
After we got the list of features for this chatbot generator, we
started by researching the first bullet point: how to automate web
scraping into a standardised format for LLM processing. We did
research on Large Language Models (LLMs) and web scraping/crawling.
Due to the changes in the project, we switched our tech stack to
MongoDB, React (as the front-end), Django (new), and IBM Watson
Assistant. We also researched the web scraping frameworks available
as Python libraries, namely BeautifulSoup, Selenium, and Scrapy. For
now, we want to use Selenium to try scraping some simple websites.
This was our very first idea about the implementation of the
project.
28 Nov - 12 Dec
After some rough weeks, we managed to show some progress on the
project. We integrated IBM Watson Assistant into a sample website
that runs on React. We also found out that Watson Assistant has an
integration feature called Search, which extends the scope of the
assistant by searching through our documents and websites.
Furthermore, we successfully tried web scraping with Selenium on a
website called "Books to Scrape". Even though we tried Selenium for
the scraping, we still want to find the Python library that best
fulfils our needs for scraping/crawling. We care a lot about
efficiency and about working with large datasets. Our contenders are
BeautifulSoup and Scrapy, and we will firm up our choice by next
week.
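For reference, here is a minimal sketch of the kind of Selenium experiment we ran on "Books to Scrape" (https://books.toscrape.com). The use of Chrome and the CSS selector are illustrative assumptions, not our exact script.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a local Chrome/ChromeDriver setup
try:
    driver.get("https://books.toscrape.com/")
    # Each book on the page sits inside an <article class="product_pod"> element
    titles = [
        link.get_attribute("title")
        for link in driver.find_elements(By.CSS_SELECTOR, "article.product_pod h3 a")
    ]
    print(titles[:5])
finally:
    driver.quit()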
12 Dec - 19 Dec
Week 10, the last week of term one, hip hip hooray!!! 🎆🤩
This week, for the technology stack, we reached the conclusion that
we will use MongoDB as the database, React as the front-end, Django
as the back-end, and Scrapy as the Python library for web
scraping/crawling. We chose Scrapy in the end because it provides
the features that we need: we are expecting large datasets from each
trust's website, and Scrapy is the perfect solution for us. Scrapy
is also more efficient and faster than the other contenders, and it
can extract data in different formats, such as CSV, XML, and JSON.
In this project, we will use the JSON format, as the data will be
transferred to a NoSQL database (MongoDB).
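As a rough illustration of this choice, a Scrapy spider yields items that the framework can export straight to JSON. The spider below is a placeholder sketch, not our real crawler; the name, start URL, and fields are illustrative.

import scrapy


class TrustPagesSpider(scrapy.Spider):
    """Illustrative spider: collect the title and URL of every page on one site."""
    name = "trust_pages"
    allowed_domains = ["books.toscrape.com"]  # placeholder domain, not an NHS trust
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow internal links so the crawl covers the whole site
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with "scrapy runspider trust_pages_spider.py -O pages.json" writes the scraped items as a JSON array, which is the shape we want to push into MongoDB.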
19 Dec - 8 Jan
During the winter break, we agreed to split the tasks into three,
one for each member. The first task is the web scraping part. The
second task is creating the Dialog JSON and transferring it to a
database. The last task is improving the portfolio website and
updating the development blog as well as the monthly videos.
For the portfolio website, we managed to improve it and completed
all the layouts, so later on we only need to fill in the content. We
also keep the development blog updated bi-weekly, along with the
monthly videos.
For the web scraping part, we started by finding a common layout and
structure across all NHS trusts' websites. We did this because there
are too many pages on a single website, and many of them contain
irrelevant information, which can affect the result. This strategy
also avoids the huge amount of time it would take to check the pages
one by one.
We found that there are patterns in the existing websites: the
relevant information is usually stored on pages with titles such as
"Contact us"/"Contact", "About us", and so on. So we decided to
collect the common keywords of these pages and filter the URLs
gathered in the previous stage against these keywords.
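Here is a minimal sketch of that keyword filter; the keyword list and example URLs are illustrative, not the exact ones we use.

# Keep only the URLs whose path mentions one of the common page keywords.
KEYWORDS = ["contact", "about", "find-us", "location", "services"]


def filter_urls(urls, keywords=KEYWORDS):
    return [url for url in urls if any(kw in url.lower() for kw in keywords)]


if __name__ == "__main__":
    crawled = [
        "https://www.example-trust.nhs.uk/about-us/",
        "https://www.example-trust.nhs.uk/news/2022/board-minutes/",
        "https://www.example-trust.nhs.uk/contact-us/",
    ]
    print(filter_urls(crawled))  # keeps only the about-us and contact-us pages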
9 Jan - 23 Jan
Our main goals for these two weeks were to integrate the web
scraping part into the Django back-end and to make sure that we use
real data, not dummy data.
First of all, we split the problem that we had during the winter
break into two sub-tasks: removing duplicates and handling trusts
with several clinics (which can make a website's layout differ). We
managed to solve the first sub-task, and we also handled the
several-clinics situation when filtering addresses.
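As an illustration of the de-duplication sub-task, the idea is simply to keep the first occurrence of each normalised entry; the normalisation below is an assumption for the sketch, not our exact rule.

def deduplicate(entries):
    """Drop repeated scraped entries while keeping their original order."""
    seen = set()
    unique = []
    for entry in entries:
        key = entry.strip().lower()  # naive normalisation for the sketch
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique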
The main obstacle we hit when integrating the web scraping part was
an incorrect path. The problem arises because the current working
directory changes, so the web scraping tool cannot import its models
or read and write JSON files correctly. To solve this, we took the
location where the web scraping tool itself is stored and used that
location whenever the tool needs to import models or manipulate JSON
files.
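In practice this means resolving paths relative to the tool's own file instead of the current working directory. A minimal sketch, with an illustrative file name:

import json
from pathlib import Path

# Folder where the web scraping tool itself lives, independent of os.getcwd()
BASE_DIR = Path(__file__).resolve().parent


def load_dialog_template(name="dialog_template.json"):
    # Reads the file relative to the tool's own directory
    with open(BASE_DIR / name, encoding="utf-8") as f:
        return json.load(f)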
For the front-end and back-end, we also made some improvements.
Firstly, we added a table at the bottom of our web app for tracking
the previous Dialog JSONs generated in the MongoDB database; this
table shows website links and reference codes in pairs. Furthermore,
we added a notification mechanism to help users stay informed about
the status of the Dialog JSON generation. In addition to the
front-end updates, we integrated the web scraping part to replace
the dummy data that was used before.
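As an aside, here is a hedged sketch of what the back-end behind the new history table could look like, assuming a MongoDB collection with "website" and "reference_code" fields; the connection string, database, and field names are illustrative, not our exact schema.

from django.http import JsonResponse
from pymongo import MongoClient

# Illustrative connection and collection names
client = MongoClient("mongodb://localhost:27017")
dialogs = client["chatbot_generator"]["dialogs"]


def dialog_history(request):
    """Return website/reference-code pairs for the table at the bottom of the app."""
    pairs = [
        {"website": doc.get("website"), "reference_code": doc.get("reference_code")}
        for doc in dialogs.find({}, {"website": 1, "reference_code": 1, "_id": 0})
    ]
    return JsonResponse({"history": pairs})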
23 Jan - 6 Feb
During these two weeks, we did not progress as much because of a
maths test, but there was still some important progress. Our main
goals were to refactor the source code and to scale up the answer
generation system.
To scale up the answer generation system, we read some papers on NLP
question answering (NLP QnA). Three papers from Facebook Research
are FiD, Atlas, and DPR, and there is also a useful model card on
Hugging Face for "rag-token-nq". In addition, we tried the Microsoft
Azure Health Bot service and dug deeper into Scrapy.
For the code refactoring, we started by rewriting everything as a
class instead of many functions spread across many files, because
our previous approach could confuse other people reading the code.
Instead of outputting everything as JSON files, we decided to use a
Python dictionary to hold all the text scraped by the spider. The
last thing we fixed was finding another way to run the Scrapy script
instead of using CMD commands. We fixed it by using the signal
library and by adding the arguments '--nothreading --noreload' when
running Django from the command line.
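For context, the usual way to launch a spider from Python code instead of the "scrapy crawl" command looks roughly like the sketch below; our actual wiring into Django differed and relied on the signal handling and the '--nothreading --noreload' arguments mentioned above.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spider(spider_cls):
    """Run a Scrapy spider in-process instead of via the command line."""
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes; needs the main thread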
Mentioned websites and papers:
https://github.com/facebookresearch/FiD
https://github.com/facebookresearch/atlas
https://github.com/facebookresearch/DPR
https://huggingface.co/facebook/rag-token-nq
6 Feb - 27 Feb
During these two weeks, we made a lot of progress on the project. We
reached the conclusion that, to get the answers for the chatbot, we
will use our web scraping tool together with the Bing API: all
predefined questions will be answered based on the web scraping
tool, and the other, general questions will be passed to the Bing
API.
The flow of our chatbot generation service: a trust's website URL
goes into our web scraping tool, the scraped data is turned into a
Dialog JSON and stored in MongoDB, and the generated chatbot answers
predefined questions from that data while passing general questions
to the Bing API.
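Below is a hedged sketch of the routing step of that flow, assuming the scraped answers sit in a simple dictionary and that general questions go to the Bing Web Search API; the question keys, fields, and fallback message are illustrative.

import requests

SCRAPED_ANSWERS = {  # illustrative: the real service builds this from the web scraping tool
    "what is the address": "Example Trust, 1 Hospital Road, Example City",
    "how do i contact the trust": "Phone: 000 0000 0000",
}


def answer(question, bing_key):
    key = question.lower().strip("?").strip()
    if key in SCRAPED_ANSWERS:                      # predefined question
        return SCRAPED_ANSWERS[key]
    resp = requests.get(                            # general question: fall back to Bing
        "https://api.bing.microsoft.com/v7.0/search",
        headers={"Ocp-Apim-Subscription-Key": bing_key},
        params={"q": question, "count": 1},
    )
    results = resp.json().get("webPages", {}).get("value", [])
    return results[0]["snippet"] if results else "Sorry, I could not find an answer."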
27 Feb - 13 Mar
For these two weeks, we finished the project by wrapping everything
up. We started by fixing the bugs that occurred during the web
scraping process. To assure the client that our generation service
is adaptable, we tried various websites (under the NHS domain). We
did this successfully: we got all the relevant data, and the chatbot
could answer the questions as well.
13 Mar - 24 Mar
For the final two weeks of the project, we finalised everything,
from the actual project to the project website. We started by doing
all the testing for our chatbot generation service: Unit Testing,
System & Integration Testing, API Testing, and User Acceptance
Testing. Finally, we wrapped up the project by providing the "ready
for production" source code on GitHub. In addition, we also finished
the technical video, the final presentation video, the demo video,
and of course, the project website.
On the 21st of March, we also gave another demonstration to people
from IBM, Microsoft, PlayStation, etc. They were impressed by our
project and asked several interesting questions about the NHS
Auto-chatbot (scalability, future work).
Thank you to everyone who read our development blog from start to
finish!
We will see you around :D
Warm regards,
Rifqi, Zhiyu, and Yunsheng