Phase 1

Related Project Review

Before starting to build our own project, we conducted research on existing RAG AI solutions. One well-known approach is using the Ollama + AnythingLLM client. In this setup, users need to use Ollama to download and manage LLMs and then utilize AnythingLLM to call the model and connect to a database. Afterward, users can ask questions through AnythingLLM.

However, this approach has several drawbacks:

Complex operation – Users need to install multiple third-party clients, configure them manually, and establish connections between them to achieve RAG functionality.

Limited flexibility – Users can only use models supported by the Ollama client and cannot freely adjust models or experiment with new open-source models.

Our Project Goal

To address these limitations, our goal is to enable users to utilize RAG AI functionality without relying on any third-party software. Additionally, we want users to have the freedom to choose and use any LLM they prefer, without being restricted by third-party platforms.


Figure 1: Ollama + AnythingLLM client
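To make this goal concrete, the sketch below shows the core retrieve-then-generate loop of a RAG pipeline. It is only an outline: `embed()` and `generate()` are hypothetical placeholders for whichever embedding model and locally hosted LLM the project ends up using, not a fixed design.

```python
# Minimal RAG sketch: retrieve the most relevant document chunks for a
# question and prepend them to the prompt sent to the LLM.
# `embed()` and `generate()` are hypothetical placeholders for whatever
# embedding model and LLM the application actually ships with.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a vector representation of `text`."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: run the local LLM on `prompt`."""
    raise NotImplementedError

def answer(question: str, chunks: list[str], top_k: int = 3) -> str:
    # Embed every chunk and the question.
    chunk_vecs = np.stack([embed(c) for c in chunks])
    q_vec = embed(question)

    # Cosine similarity between the question and each chunk.
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])

    # The retrieved context is injected into the prompt before generation.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```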

Large Language Model (LLM) Comparison

Targeting devices with 16 GB of memory, we tested the memory requirements and performance of the major open-source large language models to determine which was best suited for our RAG project.

Generation Speed Comparison

| Model | Size | Memory Used (GB) | Generate Time (s) |
| --- | --- | --- | --- |
| llama3.2-3b | 3B | 10.94 | 11.74 |
| qwen2.5-3b | 3B | 9.55 | 9.69 |
| llama3.2-1b | 1B | 5.93 | 7.21 |
| qwen2.5-1.5b | 1.5B | 7.47 | 6.35 |
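As a rough illustration of how memory use and generation time like those above could be collected, here is a minimal harness sketch; `load_model()` and `model.generate()` are hypothetical stand-ins for whatever runtime hosts the model, and reading process memory with `psutil` is our assumption.

```python
# Rough benchmark sketch: resident memory after loading a model and the
# wall-clock time of one full generation. `load_model` and `model.generate`
# are hypothetical stand-ins for whatever runtime actually hosts the LLM.
import time
import psutil

def benchmark(load_model, prompt: str) -> tuple[float, float]:
    proc = psutil.Process()
    model = load_model()                        # hypothetical: load weights
    mem_gb = proc.memory_info().rss / 1024**3   # memory used after loading

    start = time.perf_counter()
    model.generate(prompt)                      # hypothetical: one response
    elapsed = time.perf_counter() - start
    return mem_gb, elapsed
```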

Generation Response Evaluation

In our tests, we used the article “Enhancing Communication Equity: Evaluation of an Automated Speech Recognition Application in Ghana” to test each LLM's RAG capability. Link to the article:

Enhancing Communication Equity: Evaluation of an Automated Speech Recognition Application in Ghana

Evaluation Criteria

| Criteria | Excellent (4-5) | Good (2-3) | Poor (0-1) |
| --- | --- | --- | --- |
| Understanding of Concept | Thoroughly explains key concepts with relevant study examples. | Partially explains key concepts but lacks depth or examples. | Fails to explain concepts or provides incorrect information. |
| Use of Supporting Evidence | Provides strong evidence from the study, including participant insights. | Uses some evidence, but lacks depth or specificity. | No or weak evidence from the study. |
| Critical Analysis | Offers deep analysis of issues, potential solutions, and their impact. | Provides a basic analysis but lacks depth. | Superficial or no analysis. |
| Clarity & Organization | Well-structured, clear, and logically presented answer. | Partially structured with minor clarity issues. | Disorganized, difficult to follow, or incomplete. |
| Innovative Thinking | Offers unique insights or practical recommendations. | Some insights but lacks originality. | No insights or original thought. |

Qwen 2.5 3B

| Criteria | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
| --- | --- | --- | --- | --- |
| Understanding of Concept | 5 | 4 | 5 | 5 |
| Use of Supporting Evidence | 5 | 4 | 5 | 4 |
| Critical Analysis | 4 | 4 | 5 | 4 |
| Clarity & Organization | 5 | 4 | 5 | 5 |
| Innovative Thinking | 4 | 3 | 4 | 5 |

Total Score: 89

Llama 3.2 3B

| Criteria | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
| --- | --- | --- | --- | --- |
| Understanding of Concept | 4 | 3 | 4 | 5 |
| Use of Supporting Evidence | 4 | 3 | 4 | 4 |
| Critical Analysis | 4 | 3 | 4 | 4 |
| Clarity & Organization | 5 | 4 | 5 | 5 |
| Innovative Thinking | 3 | 3 | 3 | 4 |

Total Score: 78

Qwen 2.5 1.5B

| Criteria | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
| --- | --- | --- | --- | --- |
| Understanding of Concept | 4 | 3 | 4 | 5 |
| Use of Supporting Evidence | 3 | 3 | 3 | 4 |
| Critical Analysis | 3 | 3 | 3 | 4 |
| Clarity & Organization | 4 | 3 | 4 | 4 |
| Innovative Thinking | 3 | 2 | 3 | 4 |

Total Score: 69

Llama 3.2 1B

| Criteria | Contextual Factors | Model Adaptation and Flexibility | Human-Technology Interaction | Broader Policy and Ethical Considerations |
| --- | --- | --- | --- | --- |
| Understanding of Concept | 3 | 3 | 3 | 4 |
| Use of Supporting Evidence | 2 | 2 | 3 | 3 |
| Critical Analysis | 3 | 3 | 3 | 3 |
| Clarity & Organization | 3 | 3 | 4 | 4 |
| Innovative Thinking | 2 | 2 | 3 | 3 |

Total Score: 59
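For reference, each total score above is simply the sum of the rubric cells: 5 criteria × 4 categories, each scored 0-5, giving a maximum of 100. A quick check with the Qwen 2.5 3B scores:

```python
# Each total is the sum of all rubric cells: 5 criteria x 4 categories,
# each scored 0-5, so the maximum possible score is 100.
qwen25_3b_scores = {
    "Understanding of Concept":   [5, 4, 5, 5],
    "Use of Supporting Evidence": [5, 4, 5, 4],
    "Critical Analysis":          [4, 4, 5, 4],
    "Clarity & Organization":     [5, 4, 5, 5],
    "Innovative Thinking":        [4, 3, 4, 5],
}
print(sum(sum(row) for row in qwen25_3b_scores.values()))  # 89
```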

Phase 2

Related Project Review

As part of our Phase 2 research, we explored ElevenLabs, a platform that, like Ossia, specializes in AI-generated speech. ElevenLabs provides human-like voice synthesis in multiple languages, making it a strong competitor in the text-to-speech (TTS) and voice AI space.

However, ElevenLabs has a serious drawback: it relies heavily on manual text input, which can be challenging for users with mobility difficulties. In contrast, Ossia is designed to function with an average of fewer than one word typed per minute, significantly reducing the effort required from users with limited mobility.


Figure 2: The main page of ElevenLabs

Project Research

To achieve the best performance for our project, we carried out extensive research and testing to find the most suitable technical solution. The research is divided into three parts: large language models, text-to-speech, and speech-to-text.

Large Language Model (LLM) Comparison

To find the most suitable large language model for our project, we tested the generation time and response quality of the major open-source LLMs, including Llama, Qwen, and Gemma. All LLMs were run on a MacBook Pro (M4 Max) with 48 GB of unified memory.

Generation Speed Comparison

| Model | Size | Memory Used (GB) | Generate Time (s) |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | 8B | 7.27 | 40.2 |
| DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | 8B | Error | Error |
| Llama-3.1-8B-Instruct-q4f16_1-MLC | 8B | 8.77 | 11.77 |
| Qwen2.5-7B-Instruct-q4f16_1-MLC | 7B | 6.55 | 13.50 |
| gemma-2-9b-it-q4f16_1-MLC | 9B | 9.23 | 13.64 |
| Llama-3.2-3B-Instruct-q4f16_1-MLC | 3B | 3.63 | 7.26 |
| Qwen2.5-3B-Instruct-q4f16_1-MLC | 3B | 3.12 | 12.45 |
| gemma-2-2b-it-q4f16_1-MLC | 2B | 4.16 | 13.4 |
| Llama-3.2-1B-Instruct-q4f16_1-MLC | 1B | 2.13 | 4.49 |
| Qwen2.5-1.5B-Instruct-q4f16_1-MLC | 1.5B | 2.16 | 8.09 |

Generated Response Comparison

| Model | Answer to Question: "Who is your favourite tennis player?" | Rank |
| --- | --- | --- |
| DeepSeek-R1-Distill-Llama-8B-q4f16_1-MLC | Gives answers that make no sense | Unusable |
| DeepSeek-R1-Distill-Qwen-7B-q4f16_1-MLC | Error | Unusable |
| Llama-3.1-8B-Instruct-q4f16_1-MLC | Can generate some famous tennis players' names and use them to create sentences | 2 |
| Qwen2.5-7B-Instruct-q4f16_1-MLC | Can generate some famous tennis players' names and use them to create sentences | 1 |
| gemma-2-9b-it-q4f16_1-MLC | Can generate some famous tennis players' names and use them to create sentences | 2 |
| Llama-3.2-3B-Instruct-q4f16_1-MLC | Sometimes generates keywords that are not really relevant (cannot generate any tennis player names) | 5 |
| Qwen2.5-3B-Instruct-q4f16_1-MLC | Likely to generate some tennis player names, but sometimes just lists related words about tennis | 4 |
| gemma-2-2b-it-q4f16_1-MLC | Can generate answers as good as the other 8B models | 3 |
| Llama-3.2-1B-Instruct-q4f16_1-MLC | Cannot generate tennis-related words, and has errors when using words to create sentences | Unusable |
| Qwen2.5-1.5B-Instruct-q4f16_1-MLC | Sometimes can generate tennis player names, but most of the time just related words | 6 |

Conclusion

If the user's device has 16 GB of memory and a Radeon or NVIDIA GPU, we recommend the 8B models; both Llama and Qwen performed well. For users with 8 GB of memory or less, gemma-2-2b performs better and takes only about 4 GB of memory. We do not recommend the 1B models at all unless the user's device simply cannot support anything better: producing the desired sentence with them requires repeated user guidance.
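The recommendation above can be summarised as a simple memory-based selection rule. The sketch below is only an illustration; the thresholds and the exact model IDs chosen are our assumptions, not fixed requirements.

```python
# Illustrative selection rule derived from the comparison above; the exact
# thresholds and model IDs are our assumptions, not fixed requirements.
def recommend_model(memory_gb: float, has_dedicated_gpu: bool) -> str:
    if memory_gb >= 16 and has_dedicated_gpu:
        # Both Llama and Qwen 8B-class models performed well in our tests.
        return "Llama-3.1-8B-Instruct-q4f16_1-MLC"
    if memory_gb >= 8:
        # gemma-2-2b gave the best small-model answers and needs ~4 GB.
        return "gemma-2-2b-it-q4f16_1-MLC"
    # Last resort: 1B-class models need repeated user guidance.
    return "Llama-3.2-1B-Instruct-q4f16_1-MLC"
```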

Text-To-Speech (TTS) Comparison

| Model | Speed | Licence | Languages | Size | Voice Similarity | Extra Features | Accent/Style Quality | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2-F5-TTS | Near real-time | MIT | English | Medium | High | Simple cloning | Good | - |
| StyleTTS 2 | Moderate | Apache 2.0 | English | Medium | High | Style transfer, emotion control | Excellent | Link |
| XTTS v2.0.3 | Fast | MIT | Arabic, Brazilian Portuguese, Mandarin Chinese, Czech, Dutch, English, French, German, Italian, Polish, Russian, Spanish, Turkish, Japanese, Korean, Hungarian, Hindi | Medium | High | Voice cloning with 6-second sample | Excellent | Link |
| OpenVoice | Fast | MIT | Multilingual | Medium | High | Flexible style control, zero-shot cross-lingual cloning | Excellent | Link |
| Real Time Voice Cloning | Real-time | MIT | English | Medium | High | Real-time cloning with minimal data | Good | Link |
| VITS | Fast | Apache 2.0 | Multilingual | Medium | High | Voice cloning, emotion | Excellent | - |
| Mimic 3 | Fast | BSD | Multilingual | Medium | High | Voice cloning, customizable voices | Good | - |
| SpeechT5 | Moderate | MIT | Multilingual (English, Chinese, and others with fine-tuning) | Medium | High | Pre-trained for both TTS and ASR, supports speaker embedding-based voice cloning | Good | Link |
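As an example of how one of these candidates can be tried locally, here is a minimal sketch for XTTS v2 using the Coqui TTS Python package; the package choice and the file paths are our assumptions for illustration, not part of the comparison above.

```python
# Quick XTTS v2 trial: clone a voice from a short reference clip and
# synthesise a sentence. Assumes the Coqui TTS package (`pip install TTS`);
# the file paths are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello, welcome to our speech recognition system.",
    speaker_wav="reference_voice.wav",  # ~6-second sample of the target voice
    language="en",
    file_path="output.wav",
)
```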

Speech-To-Text (STT) Comparison

Clear Speech Results

| Model | CER | WER | BLEU | ROUGE-1 (Precision/Recall/F1) | ROUGE-L (Precision/Recall/F1) |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny | 0.0752 | 0.2449 | 0.6532 | 0.8627 / 0.88 / 0.8713 | 0.8627 / 0.88 / 0.8713 |
| Whisper-base | 0.0451 | 0.2041 | 0.6992 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |
| Whisper-small | 0.0301 | 0.1020 | 0.8350 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |
| Whisper-medium | 0.0301 | 0.1020 | 0.8350 | 0.9388 / 0.92 / 0.9293 | 0.9388 / 0.92 / 0.9293 |

Unclear Speech Results

| Model | CER | WER | BLEU | ROUGE-1 (Precision/Recall/F1) | ROUGE-L (Precision/Recall/F1) |
| --- | --- | --- | --- | --- | --- |
| Whisper-tiny | 0.2331 | 0.4898 | 0.4010 | 0.6415 / 0.6800 / 0.6602 | 0.6415 / 0.6800 / 0.6602 |
| Whisper-base | 0.2256 | 0.4082 | 0.4411 | 0.7551 / 0.7400 / 0.7475 | 0.7551 / 0.7400 / 0.7475 |
| Whisper-small | 0.1541 | 0.4490 | 0.4578 | 0.7843 / 0.8000 / 0.7921 | 0.7647 / 0.7800 / 0.7723 |
| Whisper-medium | 0.0977 | 0.2245 | 0.6160 | 0.8936 / 0.8400 / 0.8660 | 0.8936 / 0.8400 / 0.8660 |

CER (Character Error Rate): This metric measures errors at the character level (such as insertions, deletions, or substitutions). A lower CER means the transcription is more accurate.

WER (Word Error Rate): This tells us how many mistakes there are at the word level. A lower WER indicates fewer mistakes in the transcription.

BLEU (Bilingual Evaluation Understudy): BLEU evaluates how closely the transcription matches the reference by looking at n-gram overlaps. A higher BLEU score means the transcription is more similar to the reference text.

ROUGE-1 (Unigram Overlap): ROUGE-1 checks how many individual words match between the transcription and the reference text, using precision, recall, and F-measure. A higher ROUGE-1 score suggests a better match in terms of basic word choice.

ROUGE-L (Longest Common Subsequence): ROUGE-L goes a step further by looking at the longest common sequence of words between the two texts, giving insight into how well the overall sentence structure is preserved. A higher ROUGE-L score indicates a closer match in both content and order.
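These metrics are computed by scoring each transcription against the reference text. The exact tooling is not prescribed here; the sketch below shows one plausible way to compute them, assuming the jiwer, nltk, and rouge-score packages, using the example sentences from the browser STT comparison later in this section.

```python
# One plausible way to compute the metrics above for a single transcription.
# Assumes the jiwer, nltk, and rouge-score packages; the actual tooling used
# for the tables is not prescribed here.
import jiwer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Hello, welcome to our speech recognition system. The weather today is sunny."
hypothesis = "Hello, welcome to our speaker recognition system. Whether today is sunny"

cer = jiwer.cer(reference, hypothesis)  # character error rate, lower is better
wer = jiwer.wer(reference, hypothesis)  # word error rate, lower is better

bleu = sentence_bleu(
    [reference.split()], hypothesis.split(),
    smoothing_function=SmoothingFunction().method1,  # avoid zeros on short texts
)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, hypothesis)  # precision / recall / F1 per metric

print(f"CER={cer:.4f}  WER={wer:.4f}  BLEU={bleu:.4f}")
print("ROUGE-1:", rouge["rouge1"])
print("ROUGE-L:", rouge["rougeL"])
```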

Loading Time and Response Time Acceptance

Response Time Comparison

| Model | Processing Time (s) |
| --- | --- |
| Whisper-tiny | 0.7146 |
| Whisper-base | 1.1301 |
| Whisper-small | 3.0367 |
| Whisper-medium | 6.3204 |
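As a sketch of how such per-model processing times can be collected, assuming the openai-whisper package and a local test recording (the audio file name is a placeholder):

```python
# Sketch of the response-time measurement: transcribe the same recording
# with each Whisper size and time the call. Assumes the openai-whisper
# package; "speech_sample.wav" is a placeholder file name.
import time
import whisper

for size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("speech_sample.wav")
    elapsed = time.perf_counter() - start
    print(f"Whisper-{size}: {elapsed:.4f} s -> {result['text']!r}")
```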

Loading Time Acceptance

| Model | Satisfaction Rate |
| --- | --- |
| Whisper-tiny | 100% |
| Whisper-base | 100% |
| Whisper-small | 88.9% |
| Whisper-medium | 0% (loading error under slow network) |

For loading time, we set the acceptance criterion at an 80% satisfaction rate, and for response time, at 5 s. The tables above compare the response times and loading-time acceptance of the different models.

Conclusion

Based on response time, loading time, and transcription quality, we recommend the Whisper-tiny, -base, and -small models for the project. The Whisper-medium model fails to load under slow network conditions, so it is not recommended.

Advantage over the browser's default Chrome STT:

Whisper better accommodates elderly users by improving recognition of unclear speech. Below is a comparison with the original browser STT:


Reference text: Hello, welcome to our speech recognition system. The weather today is sunny.

browser-default STT result: weather today

Whisper-tiny (the smallest Whisper model) result: Hello, welcome to our speaker recognition system. Whether today is sunny