Each new iteration and feature we implemented had to be tested to confirm it performed correctly, an approach known as test-driven development (TDD). TDD and the agile development process we had in place complemented each other well in carrying out continuous iterations.
Agile software development advocates frequent, short development cycles. In the first cycle we generated unit tests whose aim was to evaluate the accuracy of our fine-tuned TaPas model; the goal was not to pass every test. Instead, we calculated the ratio of successful to failed tests and kept fine-tuning the TaPas model to raise the success rate over further iterations. This showed us where the model was working well and where we should focus our improvements.
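As an illustration of this approach, the sketch below computes the success ratio over a set of question-answer pairs; `predict_answer` and the sample cases are hypothetical stand-ins for our fine-tuned TaPas pipeline and its evaluation data.

```python
# Minimal sketch of the accuracy-oriented unit tests described above.
# `predict_answer` and the cases are hypothetical placeholders.

def evaluate_accuracy(predict_answer, cases):
    """Return the ratio of cases whose prediction matches the expected answer."""
    passed = sum(1 for question, expected in cases if predict_answer(question) == expected)
    return passed / len(cases)

if __name__ == "__main__":
    cases = [
        ("How many matches did the team win?", "12"),
        ("Who scored the most goals?", "Smith"),
    ]

    def dummy_predict(question):
        return "12"  # placeholder for the real TaPas prediction call

    print(f"Success rate: {evaluate_accuracy(dummy_predict, cases):.0%}")
```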
Our second major implementation saw the completion of the bot pipeline and user interface. Once this was complete, we generated a file for validating and testing dialogues end-to-end by running through test stories, which lets us check that the bot behaves as expected. Testing the assistant on test stories is the best way to gain confidence in how it will act in particular situations.
Although we used open-source packages, we also had to create some self-defined modules to link the functionality together and extend certain features.
Auxiliary functions are required to process the logits and coordinates returned by TaPas model predictions. They were written as needed, with reference to tutorials and guides [1]. Tests are required to ensure that they work as expected and neither break the system nor return false answers.
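The sketch below shows the shape such an auxiliary function takes, following the inference pattern documented in [1]; the model name is the public `google/tapas-base-finetuned-wtq` checkpoint rather than our fine-tuned one, and the table and queries are placeholders.

```python
# Sketch of a post-processing helper built on the pattern in [1].
import pandas as pd
from transformers import TapasForQuestionAnswering, TapasTokenizer

model_name = "google/tapas-base-finetuned-wtq"
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasForQuestionAnswering.from_pretrained(model_name)

def answer_queries(table: pd.DataFrame, queries: list[str]) -> list[str]:
    """Convert raw TaPas logits into human-readable answers."""
    table = table.astype(str)  # TaPas expects string-valued cells
    inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
    outputs = model(**inputs)
    # Turn the logits into predicted cell coordinates (and aggregation indices).
    coords, _agg = tokenizer.convert_logits_to_predictions(
        inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
    )
    answers = []
    for coordinates in coords:
        # A single coordinate means a single cell; multiple cells are joined.
        cells = [table.iat[row, col] for row, col in coordinates]
        answers.append(cells[0] if len(cells) == 1 else ", ".join(cells))
    return answers
```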
Where applicable, post-processing and validation functions are tested to check that they return the expected processed answers. The functions that involve requesting an endpoint could not be unit-tested directly, as they require a running server. However, we extracted the validation logic into standalone functions, which could therefore be tested where relevant. The endpoint-related functions have their own tests on the Django side, so their behaviour is still covered.
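As an illustration of extracting validation into its own testable function, consider the following sketch; `validate_answer` and its tests are hypothetical examples rather than our actual module code.

```python
# Hypothetical validation function, separated from any endpoint calls
# so it can be unit-tested without a running server.
import unittest

def validate_answer(answer: str) -> bool:
    """Reject empty or obviously malformed answers before they reach the user."""
    return bool(answer) and answer.strip() != "" and answer.lower() != "none"

class ValidateAnswerTest(unittest.TestCase):
    def test_accepts_normal_answer(self):
        self.assertTrue(validate_answer("12 goals"))

    def test_rejects_empty_answer(self):
        self.assertFalse(validate_answer(""))

if __name__ == "__main__":
    unittest.main()
```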
The tests for both self-defined modules must pass, so that possible errors and bugs are caught early.
Reference:
[1] Hugging Face, "TAPAS - Usage: Inference" [Online]. Available at: https://huggingface.co/docs/transformers/model_doc/tapas#usage-inference [Accessed: 3 March 2022]
We utilised Django's built-in testing functionality to ensure that whenever an object with a given entity value is requested, the correct entry is returned. In addition, the server endpoint is simulated and tested to confirm that it returns a success response whenever a request is sent. As our database model is very small, the range of meaningful tests is limited, so the number of tests is correspondingly small.
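A minimal sketch of such a test is shown below, assuming a hypothetical `Player` model and `/api/players/` endpoint; the `TestCase` class and test client are Django's standard testing APIs.

```python
# Sketch of our Django tests; the model and URL are hypothetical.
from django.test import TestCase

from .models import Player  # hypothetical database model

class PlayerApiTests(TestCase):
    def setUp(self):
        Player.objects.create(name="Smith", goals=12)

    def test_lookup_by_entity_value(self):
        # Requesting an object with a stated entity value returns the right entry.
        player = Player.objects.get(name="Smith")
        self.assertEqual(player.goals, 12)

    def test_endpoint_returns_success(self):
        # Django's test client simulates the endpoint without deploying a server.
        response = self.client.get("/api/players/")
        self.assertEqual(response.status_code, 200)
```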
The results are shown in the accompanying screenshot. The database's functionality is tested and all tests are expected to pass; whenever a failure occurs, we iterate to fix the bug as soon as possible.
To ensure that the system works and all parts coordinate well, we carried out integration testing with the chatbot, verifying that the dialogue flow works and that actions are triggered correctly according to the user's intents.
To verify that the bot correctly identifies the user's intent and the next action to take, tests are written in the form of stories that check the correctness of the dialogue flow. As shown in the screenshot, the user's input is simulated through the "user" field, and the bot's decisions have to match the indicated intents and actions.
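For illustration, a test story of this form might look as follows (Rasa's YAML test-story format); the intent, entity, and action names are hypothetical placeholders for our actual dialogue flow.

```yaml
stories:
- story: user asks a question about the table   # hypothetical story
  steps:
  - user: |
      how many goals did [Smith](player) score?
    intent: ask_table_question        # the bot must classify this intent
  - action: action_query_tapas        # and then trigger this custom action
  - action: utter_answer
```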
By doing so, we can test whether the system runs smoothly and whether the modules and functions associated with each action are triggered when appropriate.
To check how well our bot performs in the eyes of prospective users, we exposed it to two testers with different backgrounds, intentions, and skill levels. Neither has specialised technological knowledge, but both can perform basic computer tasks such as opening a browser or searching through files stored on the computer.
These testers reflect the people most likely to use the bot in the future, and their feedback is crucial for further development and adjustment of the project.
When testing the deployed bot, the users were asked to give feedback on the different stages of interacting with the chatbot by rating a set of statements. Possible answers range from 0 to 4, where 0 is the lowest score and 4 the highest.
The statements the users rated are listed in the table below: