Project Requirements

Partner Introduction & Project Background

For Phase 1, we are working with GDIHUB. Through discussions with our project partners at GDIHUB, we decided that the best approach to Phase 1 would be to build a retrieval-augmented generation (RAG) system in the form of a chatbot that can mine local documents for answers to research questions. This allows a much wider range of queries to be asked and answered.

Our partner for Phase 2 is Ossia Voice, an accessibility tool for Augmentative and Alternative Communication. It helps people who are unable to speak and who also have significant difficulty with movement, for example people with Motor Neurone Disease. The current version of Ossia uses the OpenAI API, so users accrue charges for API use, which puts the tool out of reach for many people because of their socioeconomic circumstances. We first want to create an offline literature review tool so that we can review and evaluate different speech engines. We will then use that research to fine-tune and evaluate the models the project needs. By the end of the project, we hope to have a solution that works without any API keys, making the Ossia Voice project available to a much wider range of users.

Survey Results

From a Patient with Motor Neurone Disease (MND)

Challenges in using technology for communication:

“One of the biggest challenges is the physical strain it takes to use a keyboard or touch screen for long periods. My muscle control is limited, so even small tasks like typing or moving a mouse can be exhausting.”

Current interaction and desired improvements:

“I use a voice recognition tool and sometimes an eye-tracking system. The voice recognition isn’t always accurate because my speech can be slurred. The eye-tracking works better, but it sometimes feels slow. I think better calibration or predictive text that truly understands context would help.”

Situations where communication is difficult and possible solutions:

“When I’m tired, everything becomes harder. In these moments, I wish the system would adapt to my energy or mood levels, maybe by offering simpler commands or allowing me to save commonly used phrases for faster responses.”

Ideal communication tool features:

“It would be a device that is fully hands-free, maybe controlled by my eye movements or brain signals. It would be smart enough to understand the context of conversations and offer suggestions. It would also be lightweight and portable so I could use it anywhere.”

Effort required to use current devices:

“It takes a lot of effort. On a good day, I can manage, but on bad days, even simple things feel overwhelming. Sometimes I avoid using them just because of how draining it is, and that makes me feel isolated. I need something that requires less energy to operate to make communication easier for me.”

From a Caretaker of a Person with MND

Challenges observed in communication:

“I see them struggling with accuracy. Sometimes it takes multiple tries to get the technology to understand them, whether it’s their voice or using the eye-tracker. This leads to a lot of frustration, especially during longer conversations.”

Key features that would reduce frustration:

“It needs to be faster and more intuitive. A predictive feature that learns their most common words and phrases would help, so they don’t have to type everything out. And it should be responsive without needing constant recalibration.”

Situations where current tools fail:

“In emergencies, when they need to communicate something urgent, the current tools just aren’t fast enough. We need something that allows them to alert me immediately, without any delays or effort on their part.”

Role in assisting communication and improvements needed:

“I help set up the devices, make sure they’re charged and calibrated, and troubleshoot when something goes wrong. Sometimes, I also help by selecting phrases for them when they’re too tired to use the device. It would help if the devices were more automated, so they didn’t need as much manual setup.”

Making communication tools more intuitive:

“Something that works consistently without much intervention would be ideal. If it could recognise their voice or eye movements more reliably, it would reduce the need for me to step in as much. A tool that offers more autonomy to them, even on bad days, would make a big difference in both of our lives.”

User Personas

Persona 1: Charles Lewis

Charles was an intelligent software developer who was known to be very open and conversational. That all changed when he was diagnosed with Motor Neurone Disease (MND). Charles's speech deteriorated very quickly, and he has found other text-to-speech tools difficult to use because typing has become exhausting as his motor skills decline. Previously very social, Charles now often feels like an afterthought at social events, which has left him feeling isolated and depressed. Charles is seeking software that will help him communicate with his friends and family while requiring minimal typing. He is also in a difficult financial situation: he has had to quit his job and is living on savings and disability benefits. As a result, he needs software that will not run up usage charges and is ideally free to use.

Charles has found the new Ossia Voice to be a revolution in his day-to-day life. He had previously used the OpenAI API version and was very impressed with the software, but could not continue using it due to financial constraints. With the option to use offline models, and with a good-quality device from his software engineering days, he was able to run a high-performing model. This has massively improved his social life, bringing it almost back to how it was before his diagnosis. He also made use of the voice cloning mechanism, and many of his friends have said that they feel like they have Charles back, where previously they felt they were talking to a machine. It has greatly improved his mental health and wellbeing and removed his earlier feelings of isolation.

Persona 2: Grace Patel

Grace is an elderly former teacher whom friends used to describe as a "social butterfly". A few years ago she was diagnosed with a muscle disorder that reduced her mobility and her ability to speak. Grace has a caregiver who helps her with daily tasks. She has found that current text-to-speech tools are not very intuitive and require a lot of manual setup. Her caregiver has also found that the current tools are unreliable and often require her to step in and help. Grace is looking for a tool that will help her communicate with her friends and family more easily, without constant recalibration and without relying on her caregiver as much.

Grace was astounded by how much the Ossia software we produced could do. Since she is not very technical, she found the large, easy-to-use menu very inviting, and it was clear to see even without her glasses. She found the software very simple, and because the offline models produce what she wants to say most of the time, she hardly has to type any words. Grace has mentioned that, because of this, she has not had to rely on her caregiver as much and that our software has helped her get her independence back.

As a former educator with a strong affinity for organising and accessing information quickly, Grace also tried out our RAG document miner tool. The tool allowed her to easily extract and reference key documents, enhancing her communication and helping her reconnect with the wealth of knowledge she’d accumulated over the years. She also mentioned that, thanks to the simple and intuitive user interface, she found it very easy to use despite her accessibility issues.

Use Cases

Use Case Diagram - Phase 1

Use Case List - Phase 1

ID	Use Case	Actor/User
UC1	Upload Documents	Researcher
UC2	Query information from uploaded documents	Researcher
UC3	Reset Database	Researcher
UC4	Save and Retrieve Response History	Researcher

Use Case Descriptions - Phase 1

ID	Actor	Description	Main Flow	Result
UC1	Researcher	Upload documents	Researcher selects a document and uploads it. If the file format is valid, the system stores it in its database and indexes it for future queries. Otherwise, an error is displayed.	Document is successfully uploaded and ready for search.
UC2	Researcher	Query information from uploaded documents	Researcher enters a query. System searches indexed documents and returns relevant information. If no results are found, the user is prompted to refine the query.	Relevant information is displayed to the researcher.
UC3	Researcher	Reset Database	Researcher selects the "Delete Database" option and confirms the action. If confirmed, the system clears all data. The chatbot will no longer be able to access previously uploaded documents.	Database is cleared and ready for new uploads.
UC4	Researcher	Save and retrieve response history	Researcher can choose to export the conversation by clicking the "Export Chat" button. This saves the entire chat as a .txt file in the folder of their choice.	Past responses are accessible for future reference.
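
As an illustration of UC1 and UC4, the sketch below shows one way the format check and the chat export could look in Python. It is not the project's actual code: the function names, the SUPPORTED_EXTENSIONS set, and the "speaker: message" line layout are all assumptions made for this example.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: only PDF is accepted here, since PDF extraction is the "Must"
# requirement; Word and Markdown are listed as "Could" and are left out.
SUPPORTED_EXTENSIONS = {".pdf"}


def validate_upload(filename: str) -> None:
    """UC1: raise ValueError when the file format is not supported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")


def format_chat(history: list[tuple[str, str]]) -> str:
    """UC4: render (speaker, message) pairs as plain text, one turn per line."""
    return "\n".join(f"{speaker}: {message}" for speaker, message in history) + "\n"


def export_chat(history: list[tuple[str, str]], destination: Path) -> Path:
    """UC4: write the whole conversation to a .txt file chosen by the user."""
    destination.write_text(format_chat(history), encoding="utf-8")
    return destination
```

In a GUI, the ValueError raised by validate_upload would be caught and shown as the error message described in UC1, and the destination path would come from a folder picker.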

Use Case Diagram - Phase 2

Use Case List - Phase 2

ID	Use Case	Actor/User
UC5	Holds mic to hear other people	Patient/Caregiver
UC6	Chooses appropriate response to the person	Patient/Caregiver
UC7	Upload Voice Clips for Voice Cloning	Caregiver
UC8	Can edit the possible responses by inputting keywords	Patient/Caregiver

Use Case Descriptions - Phase 2

ID	Actor	Description	Main Flow	Result
UC5	Patient/Caregiver	Holds mic to hear other people	The user activates the microphone. The system processes incoming sound and enhances it for clarity. It then converts the speech into text and displays it in the chat window. If the mic is not detected, an error is displayed.	Audio is captured and converted to text that the user can read on the screen.
UC6	Patient/Caregiver	Chooses appropriate response to the person	The system provides response options. The user selects a response. If the selection needs modification, the user can edit it before confirming. The chosen response is converted to speech and played back to the other person.	The chosen response is communicated to the other person.
UC7	Caregiver	Upload Voice Clips for Voice Cloning	The caregiver selects and uploads a voice clip. The system processes and stores the clip for cloning. The text-to-speech engine uses the clip to mimic the user's voice, so the generated voice sounds similar to how the patient previously sounded. Unsupported file formats cannot be uploaded.	Voice clip is stored and ready for voice cloning.
UC8	Patient/Caregiver	Can edit the possible responses by inputting keywords	The user inputs new keywords. The system queries the LLM with an updated prompt and updates response suggestions accordingly. If invalid keywords are used, an error message is displayed.	Customised responses are saved and available for selection.
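
The UC8 flow (keywords in, updated prompt out) can be sketched roughly as follows. This is a hypothetical illustration in Python, not the Ossia Voice implementation: the function name build_prompt, the prompt wording, the n_options parameter, and the rule that "invalid" means an empty keyword are all assumptions, since the use case does not define them.

```python
from __future__ import annotations


def build_prompt(heard_text: str, keywords: list[str], n_options: int = 3) -> str:
    """Build an updated LLM prompt from what was heard and the user's keywords.

    Assumption: a keyword is treated as invalid when it is empty or only
    whitespace; the real validation rules are not specified in the use case.
    """
    cleaned = [keyword.strip() for keyword in keywords]
    if not cleaned or any(not keyword for keyword in cleaned):
        raise ValueError("Keywords must be non-empty")
    return (
        f'The other person said: "{heard_text}"\n'
        f"Suggest {n_options} short replies the user could give, "
        f"using these keywords: {', '.join(cleaned)}."
    )
```

The returned string would then be sent to whichever offline or online model the user has selected, and the ValueError would surface as the error message mentioned in the main flow.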

MoSCoW Requirement List

Phase 1 (Functional Requirements)

ID	Requirement	Priority
1	Have a working GUI with an AI response and a textbox to enter keywords or prompt text	Must
2	Be able to extract text from PDFs	Must
3	Generate a suitable response using an offline LLM	Must
4	Use RAG and a vector database to search documents	Must
5	Chunk texts for easy retrieval and more understandable responses	Should
6	Add support to choose from files or folders	Should
7	Save history of previous responses	Should
8	Add support for Word documents	Could
9	Add support for Markdown files	Could
10	Add support for multi-column documents	Could
11	Allow for Do-Not-Include items	Could
12	Filters and option to sort	Could
13	Be able to link to similar online resources/OneDrive for further study	Could
14	Add support for queries regarding images in documents	Could
15	Add built-in helper page	Could
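
Requirement 5 (chunking texts for retrieval) could be approached with a sliding window over words, as in the minimal Python sketch below. The function name chunk_text, the word-based splitting, and the default chunk_size and overlap values are assumptions for illustration; the project may chunk by sentences, tokens, or pages instead.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and stored in the vector database (requirement 4), so that a query retrieves only the most relevant passages and not whole documents.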

Phase 2 (Functional Requirements)

ID	Requirement	Priority
1	Speech To Text working locally in browser	Must
2	Text To Speech working locally in browser	Must
3	Use offline LLM to generate possible keywords	Must
4	Allow users to choose between offline/online models and STT models	Must
5	Have suitable expression	Should
6	Ensure that the LLM generates a wide array of responses based on mood	Should
7	Diarisation to discern between multiple speakers**	Could
8	Allow user to upload voice clips for voice cloning in the user’s own voice	Could
9	Support real-time speech-to-text using Whisper with finalisation for larger models**	Could*

* (Not in MoSCoW document but preferred by client as a research feature)

** (More research than feature; client does not expect them in the main branch)

Non-Functional Requirements (Both)

ID	Requirement	Priority
10	Be easily usable for patients	Must
11	Be easy to understand and maintain for future development	Must
12	Have minimal latency or wait times while using the app	Should
13	Compile a Windows executable (Phase 1)	Should
14	Compile a macOS executable (Phase 1)	Could