Phase 1
Related Project Review
Before starting to build our own project, we researched existing RAG AI solutions. One well-known approach is the Ollama + AnythingLLM client setup: users install Ollama to download and manage LLMs, then use AnythingLLM to call the model and connect it to a database, after which they can ask questions through AnythingLLM.
However, this approach has several drawbacks:
Complex operation – Users need to install multiple third-party clients, configure them manually, and establish connections between them to achieve RAG functionality.
Limited flexibility – Users can only use models supported by the Ollama client and cannot freely adjust models or experiment with new open-source models.
Our Project Goal
To address these limitations, our goal is to enable users to utilize RAG AI functionality without relying on any third-party software. Additionally, we want users to have the freedom to choose and use any LLM they prefer, without being restricted by third-party platforms.

Figure 1: Ollama + AnythingLLM client
Large Language Model (LLM) Comparison
Targeting devices with 16 GB of memory, we tested the resource requirements and performance of major open-source large language models to determine which model was best suited for our RAG project.
Generation Speed Comparison
Model | Size | Memory Used (GB) | Generate Time (s) |
---|---|---|---|
llama3.2-3b | 3B | 10.94 | 11.74 |
qwen2.5-3b | 3B | 9.55 | 9.69 |
llama3.2-1b | 1B | 5.93 | 7.21 |
qwen2.5-1.5b | 1.5B | 7.47 | 6.35 |
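For reference, the sketch below shows how generation time and memory footprint can be measured for a locally loaded model. It is a minimal illustration only: the Hugging Face checkpoint name, prompt, and psutil-based memory check are assumptions, not the exact harness used for the table above.

```python
# Hypothetical timing/memory harness (illustrative; not the exact benchmark script).
import time
import psutil
from transformers import AutoModelForCausalLM, AutoTokenizer

def benchmark(model_id: str, prompt: str) -> None:
    proc = psutil.Process()
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    start = time.perf_counter()
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128)
    elapsed = time.perf_counter() - start

    mem_gb = proc.memory_info().rss / 1024 ** 3  # resident memory of this process, in GB
    print(f"{model_id}: {elapsed:.2f} s generation, {mem_gb:.2f} GB used")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Placeholder checkpoint name; substitute any model from the table above.
benchmark("meta-llama/Llama-3.2-3B-Instruct", "Summarize the study in two sentences.")
```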
Generation Response Evaluation
In our tests, we used the article “Enhancing Communication Equity: Evaluation of an Automated Speech Recognition Application in Ghana” to test each LLM's RAG capability. Link to article:
Enhancing Communication Equity: Evaluation of an Automated Speech Recognition Application in Ghana
Evaluation Criteria
Criteria | Excellent (4-5) | Good (2-3) | Poor (0-1) |
---|---|---|---|
Understanding of Concept | Thoroughly explains key concepts with relevant study examples. | Partially explains key concepts but lacks depth or examples. | Fails to explain concepts or provides incorrect information. |
Use of Supporting Evidence | Provides strong evidence from the study, including participant insights. | Uses some evidence, but lacks depth or specificity. | No or weak evidence from the study. |
Critical Analysis | Offers deep analysis of issues, potential solutions, and their impact. | Provides a basic analysis but lacks depth. | Superficial or no analysis. |
Clarity & Organization | Well-structured, clear, and logically presented answer. | Partially structured with minor clarity issues. | Disorganized, difficult to follow, or incomplete. |
Innovative Thinking | Offers unique insights or practical recommendations. | Some insights but lacks originality. | No insights or original thought. |
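Before the per-model scores, the sketch below illustrates the kind of RAG prompt these responses were generated from: the article text is split into chunks, a few relevant chunks are retrieved, and the question is appended. The fixed chunk size and keyword-overlap retriever are simplifying assumptions, not the exact pipeline.

```python
# Minimal RAG prompt construction (illustrative; chunking and retrieval are simplified).
def chunk(text: str, size: int = 500) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Score each chunk by word overlap with the question and keep the top k.
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(question: str, article_text: str) -> str:
    context = "\n\n".join(retrieve(question, chunk(article_text)))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```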
Qwen 2.5 3B
Category | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
---|---|---|---|---|
Understanding of Concept | 5 | 4 | 5 | 5 |
Use of Supporting Evidence | 5 | 4 | 5 | 4 |
Critical Analysis | 4 | 4 | 5 | 4 |
Clarity & Organization | 5 | 4 | 5 | 5 |
Innovative Thinking | 4 | 3 | 4 | 5 |
Total Score: 89
Llama 3.2 3B
Category | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
---|---|---|---|---|
Understanding of Concept | 4 | 3 | 4 | 5 |
Use of Supporting Evidence | 4 | 3 | 4 | 4 |
Critical Analysis | 4 | 3 | 4 | 4 |
Clarity & Organization | 5 | 4 | 5 | 5 |
Innovative Thinking | 3 | 3 | 3 | 4 |
Total Score: 78
Qwen 2.5 1.5B
Category | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
---|---|---|---|---|
Understanding of Concept | 4 | 3 | 4 | 5 |
Use of Supporting Evidence | 3 | 3 | 3 | 4 |
Critical Analysis | 3 | 3 | 3 | 4 |
Clarity & Organization | 4 | 3 | 4 | 4 |
Innovative Thinking | 3 | 2 | 3 | 4 |
Total Score: 69
Llama 3.2 1B
Category | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
---|---|---|---|---|
Understanding of Concept | 3 | 3 | 3 | 4 |
Use of Supporting Evidence | 2 | 2 | 3 | 3 |
Critical Analysis | 3 | 3 | 3 | 3 |
Clarity & Organization | 3 | 3 | 4 | 4 |
Innovative Thinking | 2 | 2 | 3 | 3 |
Total Score: 59
Phase 2
Related Project Review
As part of our Phase 2 research, we explored ElevenLabs, a platform specializing in AI-generated speech, similar in scope to Ossia. ElevenLabs provides human-like voice synthesis in multiple languages, making it a strong competitor in the text-to-speech (TTS) and voice AI space.
However, ElevenLabs has a serious drawback: the system relies heavily on manual text input, which can be challenging for users with mobility difficulties. In contrast, Ossia is designed to function with an average of fewer than one typed word per minute, significantly reducing the effort required from users with limited mobility.

Figure 2: The main page of ElevenLabs
Project Research
To achieve the best performance for our project, we carried out extensive research and testing to find the most suitable technical solutions. The research is divided into three parts: large language models (LLM), text-to-speech (TTS), and speech-to-text (STT).
Large Language Model (LLM) Comparison
To find the most suitable large language model for our project, we tested the generation time and response quality of major open-source LLMs, including Llama, Qwen, Gemma, etc. All LLMs were run on a MacBook Pro (M4 Max) with 48 GB of unified memory.
Generation Speed Comparison
Model | Size | Memory Used (GB) | Generate Time (s) |
---|---|---|---|
DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | 8B | 7.27 | 40.2 |
DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | 7B | Error | Error |
Llama-3.1-8B-Instruct-q4f16_1-MLC | 8B | 8.77 | 11.77 |
Qwen2.5-7B-Instruct-q4f16_1-MLC | 7B | 6.55 | 13.50 |
gemma-2-9b-it-q4f16_1-MLC | 9B | 9.23 | 13.64 |
Llama-3.2-3B-Instruct-q4f16_1-MLC | 3B | 3.63 | 7.26 |
Qwen2.5-3B-Instruct-q4f16_1-MLC | 3B | 3.12 | 12.45 |
gemma-2-2b-it-q4f16_1-MLC | 2B | 4.16 | 13.4 |
Llama-3.2-1B-Instruct-q4f16_1-MLC | 1B | 2.13 | 4.49 |
Qwen2.5-1.5B-Instruct-q4f16_1-MLC | 1.5B | 2.16 | 8.09 |
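The q4f16_1-MLC builds above run through the MLC runtime. The sketch below shows one way to time a single generation with the MLC-LLM Python engine; it is an assumption about the harness (the project may instead run these models through the WebLLM runtime in the browser), and the model ID is just one entry from the table.

```python
# Hedged sketch: timing one generation through MLC-LLM's OpenAI-style Python engine.
import time
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3.2-3B-Instruct-q4f16_1-MLC"  # any model from the table above
engine = MLCEngine(model)

start = time.perf_counter()
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Who is your favourite tennis player?"}],
    model=model,
    stream=False,
)
print(f"generation took {time.perf_counter() - start:.2f} s")
print(response.choices[0].message.content)
engine.terminate()
```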
Generated Response Comparison
Model | Answer to Question: "Who is your favourite tennis player?" | Rank |
---|---|---|
DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | Generates answers that make no sense | Unusable |
DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | Error | Unusable |
Llama-3.1-8B-Instruct-q4f16_1-MLC | Can generate some famous tennis players’ names and use them to create sentences | 2 |
Qwen2.5-7B-Instruct-q4f16_1-MLC | Can generate some famous tennis players’ names and use them to create sentences | 1 |
gemma-2-9b-it-q4f16_1-MLC | Can generate some famous tennis players’ names and use them to create sentences | 2 |
Llama-3.2-3B-Instruct-q4f16_1-MLC | Sometimes generates keywords that are not really relevant (cannot generate any tennis player names) | 5 |
Qwen2.5-3B-Instruct-q4f16_1-MLC | Likely to generate some tennis player names, but sometimes will just list related words about tennis | 4 |
gemma-2-2b-it-q4f16_1-MLC | Can generate answers as good as other 8B models | 3 |
Llama-3.2-1B-Instruct-q4f16_1-MLC | Cannot generate tennis-related words, and makes errors when using words to create sentences | Unusable |
Qwen2.5-1.5B-Instruct-q4f16_1-MLC | Can sometimes generate tennis player names, but most of the time only related words | 6 |
Conclusion
If the user's device has 16 GB of memory with a Radeon or NVIDIA GPU, we recommend using the 8B models; both Llama and Qwen perform well. For users with 8 GB of memory or less, gemma-2-2b performs better and takes only about 4 GB of memory. We do not recommend the 1B models at all unless the user's device simply cannot support anything better, since producing the desired sentence with them requires repeated user guidance.
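The recommendation above can be encoded as a simple selection rule. The thresholds and model IDs follow the tables in this section; the psutil-based memory check and the helper itself are illustrative assumptions, not part of the project code.

```python
# Sketch: choose a model tier from available system memory (illustrative only).
import psutil

def pick_model() -> str:
    mem_gb = psutil.virtual_memory().total / 1024 ** 3
    if mem_gb >= 16:
        return "Llama-3.1-8B-Instruct-q4f16_1-MLC"   # or Qwen2.5-7B-Instruct-q4f16_1-MLC
    if mem_gb >= 8:
        return "gemma-2-2b-it-q4f16_1-MLC"           # best quality at ~4 GB of memory
    return "Llama-3.2-1B-Instruct-q4f16_1-MLC"       # last resort on very constrained devices

print(pick_model())
```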
Text-To-Speech (TTS) Comparison
Model | Speed | Licence | Languages | Size | Voice Similarity | Extra Features | Accent/Style Quality | Links |
---|---|---|---|---|---|---|---|---|
E2-F5-TTS | Near real-time | MIT | English | Medium | High | Simple cloning | Good | - |
StyleTTS 2 | Moderate | Apache 2.0 | English | Medium | High | Style transfer, emotion control | Excellent | Link |
XTTS v2.0.3 | Fast | MIT | Arabic, Brazilian Portuguese, Mandarin Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, Turkish, Japanese, Korean, Hungarian, Hindi | Medium | High | Voice cloning with 6-second sample | Excellent | Link |
OpenVoice | Fast | MIT | Multilingual | Medium | High | Flexible style control, zero-shot cross-lingual cloning | Excellent | Link |
Real Time Voice Cloning | Real-time | MIT | English | Medium | High | Real-time cloning with minimal data | Good | Link |
VITS | Fast | Apache 2.0 | Multilingual | Medium | High | Voice cloning, emotion | Excellent | - |
Mimic 3 | Fast | BSD | Multilingual | Medium | High | Voice cloning, customizable voices | Good | - |
SpeechT5 | Moderate | MIT | Multilingual (English, Chinese, and others with fine-tuning) | Medium | High | Pre-trained for both TTS and ASR, supports speaker embedding-based voice cloning | Good | Link |
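As an example of how one of the candidates above can be driven, the sketch below runs SpeechT5 through Hugging Face transformers with a pre-computed speaker embedding. The embedding index and output filename are arbitrary placeholders, and voice cloning would substitute the user's own embedding.

```python
# Minimal SpeechT5 text-to-speech example (placeholder speaker embedding and output path).
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, welcome to our speech recognition system.", return_tensors="pt")

# Use a pre-computed x-vector as the speaker embedding (index 7306 is arbitrary).
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("tts_output.wav", speech.numpy(), samplerate=16000)
```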
Speech-To-Text (STT) Comparison
Clear Speech Results
Model | CER | WER | BLEU | ROUGE-1 (Precision/Recall/F1) | ROUGE-L (Precision/Recall/F1) |
---|---|---|---|---|---|
Whisper-tiny | 0.0752 | 0.2449 | 0.6532 | 0.8627 / 0.88 / 0.8713 | 0.8627 / 0.88 / 0.8713 |
Whisper-base | 0.0451 | 0.2041 | 0.6992 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |
Whisper-small | 0.0301 | 0.1020 | 0.8350 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |
Whisper-medium | 0.0301 | 0.1020 | 0.8350 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |
Unclear Speech Results
Model | CER | WER | BLEU | ROUGE-1 (Precision/Recall/F1) | ROUGE-L (Precision/Recall/F1) |
---|---|---|---|---|---|
Whisper-tiny | 0.2331 | 0.4898 | 0.4010 | 0.6415 / 0.6800 / 0.6602 | 0.6415 / 0.6800 / 0.6602 |
Whisper-base | 0.2256 | 0.4082 | 0.4411 | 0.7551 / 0.7400 / 0.7475 | 0.7551 / 0.7400 / 0.7475 |
Whisper-small | 0.1541 | 0.4490 | 0.4578 | 0.7843 / 0.8000 / 0.7921 | 0.7647 / 0.7800 / 0.7723 |
Whisper-medium | 0.0977 | 0.2245 | 0.6160 | 0.8936 / 0.8400 / 0.8660 | 0.8936 / 0.8400 / 0.8660 |
CER (Character Error Rate): This metric measures errors at the character level (such as insertions, deletions, or substitutions). A lower CER means the transcription is more accurate.
WER (Word Error Rate): This tells us how many mistakes there are at the word level. A lower WER indicates fewer mistakes in the transcription.
BLEU (Bilingual Evaluation Understudy): BLEU evaluates how closely the transcription matches the reference by looking at n-gram overlaps. A higher BLEU score means the transcription is more similar to the reference text.
ROUGE-1 (Unigram Overlap): ROUGE-1 checks how many individual words match between the transcription and the reference text, using precision, recall, and F-measure. A higher ROUGE-1 score suggests a better match in terms of basic word choice.
ROUGE-L (Longest Common Subsequence): ROUGE-L goes a step further by looking at the longest common sequence of words between the two texts, giving insight into how well the overall sentence structure is preserved. A higher ROUGE-L score indicates a closer match in both content and order.
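The sketch below shows one way to compute these metrics for a single reference/transcription pair using the jiwer, sacrebleu, and rouge_score packages; it is an illustration, not the exact evaluation script, and note that sacrebleu reports BLEU on a 0-100 scale while the tables above appear to use a 0-1 scale.

```python
# Illustrative metric computation for one transcription pair (not the exact evaluation script).
import jiwer
import sacrebleu
from rouge_score import rouge_scorer

reference = "Hello, welcome to our speech recognition system. The weather today is sunny."
hypothesis = "Hello, welcome to our speaker recognition system. Whether today is sunny"

cer = jiwer.cer(reference, hypothesis)                          # character error rate
wer = jiwer.wer(reference, hypothesis)                          # word error rate
bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score   # 0-100 scale

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, hypothesis)

print(f"CER={cer:.4f}  WER={wer:.4f}  BLEU={bleu:.2f}")
for key in ("rouge1", "rougeL"):
    s = rouge[key]
    print(f"{key}: P={s.precision:.4f} R={s.recall:.4f} F1={s.fmeasure:.4f}")
```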
Loading Time and Response Time Acceptance
Response Time Comparison
Model | Processing Time (s) |
---|---|
Whisper-tiny | 0.7146 |
Whisper-base | 1.1301 |
Whisper-small | 3.0367 |
Whisper-medium | 6.3204 |
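For reproducibility, the processing times above can be approximated by timing each model on the same clip with the openai-whisper package. The audio filename is a placeholder, and the project itself may run Whisper differently (e.g. in the browser), so this is a sketch rather than the measurement setup.

```python
# Sketch: per-model transcription timing with openai-whisper ("audio.wav" is a placeholder clip).
import time
import whisper

for size in ("tiny", "base", "small", "medium"):
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("audio.wav")
    elapsed = time.perf_counter() - start
    print(f"whisper-{size}: {elapsed:.4f} s -> {result['text']!r}")
```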
Loading Time Acceptance
Model | Satisfactory rate |
---|---|
Whisper-tiny | 100% |
Whisper-base | 100% |
Whisper-small | 88.9% |
Whisper-medium | 0% (loading error under slow network) |
For loading time, we set the acceptance criterion at an 80% satisfaction rate, and for response time, we set the acceptance criterion at 5 s. The tables above compare the loading time and response time of the different models.
Conclusion
Based on response time, loading time, and STT quality, we recommend using the Whisper-tiny, base, and small models for the project. The Whisper-medium model fails to load under slow network conditions, so it is not recommended.
Advantage over the default Chrome STT:
Better support for elderly users by improving recognition of unclear speech. The example below compares the default browser STT with Whisper-tiny:

Reference text: Hello, welcome to our speech recognition system. The weather today is sunny.
browser-default STT result: weather today
Whisper-tiny (the smallest Whisper model) result: Hello, welcome to our speaker recognition system. Whether today is sunny