For Phase 1, our partner is GDIHUB. Through discussions with our project partners there, we decided that the best approach to Phase 1 would be to build a retrieval-augmented generation (RAG) system in the form of a chatbot that can mine local documents for answers to research questions. This allows a much wider range of queries to be asked and answered.
Our partner for Phase 2 is Ossia Voice, an accessibility tool for Augmentative and Alternative Communication. It helps people who are unable to speak and who also have significant difficulty with movement, for example people with Motor Neurone Disease. The current version of Ossia uses the OpenAI API, so users accumulate charges from API use, which puts the tool out of reach for many people on lower incomes. We first want to create an offline literature review tool so that we can review and evaluate different speech engines. We will then use that research to fine-tune and evaluate the models we need. By the end of the project, we hope to have a solution that works without any API keys, making the Ossia Voice project available to a much wider range of users.
We collected requirements through stakeholder interviews, user feedback, and competitor analysis. A survey was conducted to understand the needs of potential users.
The survey gathered insights from a patient with Motor Neurone Disease (MND) and their caregiver. Their feedback helped us identify the key challenges in using assistive communication technologies.
Challenges in using technology for communication:
“One of the biggest challenges is the physical strain it takes to use a keyboard or touch screen for long periods. My muscle control is limited, so even small tasks like typing or moving a mouse can be exhausting.”
Current interaction and desired improvements:
“I use a voice recognition tool and sometimes an eye-tracking system. The voice recognition isn’t always accurate because my speech can be slurred. The eye-tracking works better, but it sometimes feels slow. I think better calibration or predictive text that truly understands context would help.”
Situations where communication is difficult and possible solutions:
“When I’m tired, everything becomes harder. In these moments, I wish the system would adapt to my energy or mood levels, maybe by offering simpler commands or allowing me to save commonly used phrases for faster responses.”
Ideal communication tool features:
“It would be a device that is fully hands-free, maybe controlled by my eye movements or brain signals. It would be smart enough to understand the context of conversations and offer suggestions. It would also be lightweight and portable so I could use it anywhere.”
Effort required to use current devices:
“It takes a lot of effort. On a good day, I can manage, but on bad days, even simple things feel overwhelming. Sometimes I avoid using them just because of how draining it is, and that makes me feel isolated. I need something that requires less energy to operate to make communication easier for me.”
Challenges observed in communication:
“I see them struggling with accuracy. Sometimes it takes multiple tries to get the technology to understand them, whether it’s their voice or using the eye-tracker. This leads to a lot of frustration, especially during longer conversations.”
Key features that would reduce frustration:
“It needs to be faster and more intuitive. A predictive feature that learns their most common words and phrases would help, so they don’t have to type everything out. And it should be responsive without needing constant recalibration.”
Situations where current tools fail:
“In emergencies, when they need to communicate something urgent, the current tools just aren’t fast enough. We need something that allows them to alert me immediately, without any delays or effort on their part.”
Role in assisting communication and improvements needed:
“I help set up the devices, make sure they’re charged and calibrated, and troubleshoot when something goes wrong. Sometimes, I also help by selecting phrases for them when they’re too tired to use the device. It would help if the devices were more automated, so they didn’t need as much manual setup.”
Making communication tools more intuitive:
“Something that works consistently without much intervention would be ideal. If it could recognise their voice or eye movements more reliably, it would reduce the need for me to step in as much. A tool that offers more autonomy to them, even on bad days, would make a big difference in both of our lives.”
Charles was an intelligent software developer known for being very open and conversational. That all changed when he was diagnosed with Motor Neurone Disease (MND). His speech deteriorated very quickly, and he has found other text-to-speech tools very difficult to use because typing has become exhausting as his motor skills decline. Previously very social, Charles now often feels like an afterthought at social events, which has left him feeling isolated and depressed. He is looking for software that helps him communicate with his friends and family while requiring minimal typing. Charles is also in a difficult financial situation: he had to quit his job and is living on savings and disability benefits. He therefore needs software that does not run up usage charges and is ideally free to use.
Charles has found the new Ossia Voice to be a revolution in his day-to-day life. He had previously used the software with an OpenAI API key and was very impressed with it, but he could not continue because of the cost. With the option to use offline models, and with a good-quality device left over from his software engineering days, he was able to run a high-performing model. This has massively improved his social life, almost to how it was before his diagnosis. He also made use of the voice cloning mechanism, and many of his friends have said that they feel like they have Charles back, where previously they felt they were talking to a machine. It has greatly improved his mental health and wellbeing and removed his earlier feelings of isolation.
Grace is an elderly former teacher whom friends used to describe as a "social butterfly". A few years ago she was diagnosed with a muscle disorder that has reduced her mobility and her ability to speak. Grace has a caregiver who helps her with her daily tasks. She finds the current text-to-speech tools unintuitive and heavy on manual setup, and her caregiver finds them unreliable, often having to step in and help. Grace is looking for a tool that lets her communicate with her friends and family more easily, without constant recalibration and without relying on her caregiver as much.
Grace was astounded by what the Ossia software we produced could do. Since she is not very technical, she found the large, easy-to-use menu inviting, and clear enough to read without her glasses. She found the software very simple, and because the offline models produce what she wants to say most of the time, she hardly has to type any words. Grace says that, as a result, she no longer relies on her caregiver as much and that our software has helped her regain her independence. As a former teacher with a strong affinity for organising and accessing information quickly, Grace also tried our RAG document miner tool. It allowed her to extract and reference key documents easily, enhancing her communication and helping her reconnect with the wealth of knowledge she’d accumulated over the years. She added that the simple, intuitive user interface made it very easy to use despite her accessibility needs.
ID | Use Case | Actor/User |
---|---|---|
UC1 | Upload Documents | Researcher |
UC2 | Query information from uploaded documents | Researcher |
UC3 | Reset Database | Researcher |
UC4 | Save and Retrieve Response History | Researcher |
ID | Actor | Description | Main Flow | Result |
---|---|---|---|---|
UC1 | Researcher | Upload documents | Researcher selects a document and uploads it. If the file format is valid, the system stores it in its database. Otherwise, an error is displayed. The document is then indexed for future queries. | Document is successfully uploaded and ready for search. |
UC2 | Researcher | Query information from uploaded documents | Researcher enters a query. System searches indexed documents and returns relevant information. If no results are found, the user is prompted to refine the query. | Relevant information is displayed to the researcher. |
UC3 | Researcher | Reset Database | Researcher selects the "Delete Database" option and confirms the action. If confirmed, the system clears all data. The chatbot will no longer be able to access previously uploaded documents. | Database is cleared and ready for new uploads. |
UC4 | Researcher | Save and retrieve response history | Researcher can choose to export the conversation by clicking the "Export Chat" button. This saves the entire chat as a .txt file in the folder of their choice. | Past responses are accessible for future reference. |
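
The flows in UC1 to UC4 can be illustrated with a minimal in-memory sketch. This is not the project's implementation: the class name `DocumentStore`, the accepted file suffixes, the keyword-matching search, and the message strings are all assumptions made for illustration, and text extraction is assumed to have happened before `upload` is called.

```python
from pathlib import Path

# Assumption: accepted formats follow the PDF/Word/Markdown requirements.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".md"}


class DocumentStore:
    """Hypothetical stand-in for the Phase 1 document database."""

    def __init__(self):
        self._docs = {}    # filename -> extracted text
        self.history = []  # list of (question, answer) pairs

    def upload(self, filename, text):
        """UC1: reject invalid formats, otherwise store the text."""
        suffix = Path(filename).suffix.lower()
        if suffix not in ALLOWED_SUFFIXES:
            raise ValueError(f"Unsupported file format: {filename}")
        self._docs[filename] = text

    def query(self, question):
        """UC2: naive keyword match standing in for RAG retrieval."""
        words = {w for w in question.lower().split() if len(w) > 2}
        hits = sorted(
            name for name, text in self._docs.items()
            if any(w in text.lower() for w in words)
        )
        if hits:
            answer = "Found in: " + ", ".join(hits)
        else:
            answer = "No results found. Please refine your query."
        self.history.append((question, answer))
        return answer

    def reset(self):
        """UC3: clear every stored document."""
        self._docs.clear()

    def export_chat(self, path):
        """UC4: write the whole conversation to a .txt file."""
        lines = [f"Q: {q}\nA: {a}" for q, a in self.history]
        Path(path).write_text("\n\n".join(lines), encoding="utf-8")
```

After `reset()` is called, `query` falls through to the "refine your query" message, which mirrors the UC3 note that previously uploaded documents are no longer accessible.
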
ID | Use Case | Actor/User |
---|---|---|
UC5 | Holds mic to hear other people | Patient/Caregiver |
UC6 | Chooses appropriate response to the person | Patient/Caregiver |
UC7 | Upload Voice Clips for Voice Cloning | Caregiver |
UC8 | Can edit the possible responses by inputting keywords | Patient/Caregiver |
ID | Actor | Description | Main Flow | Result |
---|---|---|---|---|
UC5 | Patient/Caregiver | Holds mic to hear other people | The user activates the microphone. The system processes incoming sound and enhances it for clarity. It then converts the speech into text and displays it in the chat window. If the mic is not detected, an error is displayed. | Audio is captured and converted to text for the user to read on the screen. |
UC6 | Patient/Caregiver | Chooses appropriate response to the person | The system provides response options. The user selects a response. If the selection needs modification, the user can edit it before confirming. The chosen response is converted to speech and played back to the other person. | The chosen response is communicated to the other person. |
UC7 | Caregiver | Upload Voice Clips for Voice Cloning | The caregiver selects and uploads a voice clip. The system processes and stores the clip for cloning. The text-to-speech engine uses the clip to mimic the user's voice, so the generated speech sounds similar to how the patient previously sounded. Unsupported file formats are rejected at upload. | Voice clip is stored and ready for voice cloning. |
UC8 | Patient/Caregiver | Can edit the possible responses by inputting keywords | The user inputs new keywords. The system queries the LLM with an updated prompt and updates response suggestions accordingly. If invalid keywords are used, an error message is displayed. | Customised responses are saved and available for selection. |
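
As a rough illustration of UC8, the sketch below shows one way an updated prompt could be assembled from user keywords before it is sent to the LLM. The function name `build_prompt`, the prompt wording, the `mood` and `n_options` parameters, and the rule that a keyword list with no non-blank entries is invalid are all assumptions; the source does not specify them.

```python
def build_prompt(heard_text, keywords, mood="neutral", n_options=3):
    """Return a prompt asking the LLM for reply suggestions.

    Hypothetical helper: the validation rule and the prompt wording
    are illustrative only.
    """
    # Drop blank entries and surrounding whitespace.
    cleaned = [k.strip() for k in keywords if k.strip()]
    if not cleaned:
        # Corresponds to the "invalid keywords" error path in UC8.
        raise ValueError("Please enter at least one keyword.")
    return (
        f'The other person said: "{heard_text}"\n'
        f"Suggest {n_options} short replies in a {mood} tone.\n"
        f"Each reply should reflect these keywords: {', '.join(cleaned)}."
    )
```

Calling `build_prompt("How are you?", ["tired", "coffee"], mood="cheerful")` would produce a three-line prompt whose last line lists both keywords; the returned string would then be passed to whichever offline or online model the user has selected.
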
ID | Requirement | Priority |
---|---|---|
1 | Have working GUI with AI response and textbox to enter keywords or prompt text | Must |
2 | Be able to extract text from PDFs | Must |
3 | Generate a suitable response using an offline LLM | Must |
4 | Use RAG and vector database to search documents | Must |
5 | Chunk texts for easy retrieval and more understandable responses | Should |
6 | Add support to choose from files or folders | Should |
7 | Save history of previous responses | Should |
8 | Add support for Word documents | Could |
9 | Add support for Markdown files | Could |
10 | Add support for multi-column documents | Could |
11 | Allow for Do-Not-Include items | Could |
12 | Filters and option to sort | Could |
13 | Be able to link to similar online resources/OneDrive for further study | Could |
14 | Add support for queries regarding images in documents | Could |
15 | Add built-in help page | Could |
ID | Requirement | Priority |
---|---|---|
1 | Speech To Text working locally in browser | Must |
2 | Text To Speech working locally in browser | Must |
3 | Use offline LLM to generate possible keywords | Must |
4 | Allow users to choose between offline/online models and STT models | Must |
5 | Have suitable expression | Should |
6 | Ensure that the LLM generates a wide array of responses based on mood | Should |
7 | Diarisation to discern between multiple speakers** | Could |
8 | Allow user to upload voice clips for voice cloning in the user’s own voice | Could |
9 | Support real-time speech-to-text using Whisper with finalisation for larger models** | Could* |
* (Not in MoSCoW document but preferred by client as a research feature)
** (More research than feature; client does not expect them in the main branch)
ID | Requirement | Priority |
---|---|---|
10 | Be easily usable for patients | Must |
11 | Be easily understandable and maintainable for future development | Must |
12 | Have minimal latency or wait times while using the app | Should |
13 | Compile Windows executable (Phase 1) | Should |
14 | Compile macOS executable (Phase 1) | Could |