Testing Both Experiments

Our project has two experiments:

  • Offline-LLM powered literature review tool with UCL GDIHUB (a short one-month project)
  • Offline LLM with Ossia voice (our main focus)

    Experiment 1: Offline-LLM powered literature review tool with UCL GDIHUB

    Testing Strategy

    Our testing approach is multi-layered to make sure everything works smoothly, from individual parts to the entire system under pressure.

    We start with unit tests to check each component, then move on to client (integration) tests that mimic real user interactions, and finally, we perform stress tests to see how the system behaves under heavy load. This process helps us catch problems early, confirm that all parts work well together, and ensure the system remains reliable even under extreme usage, such as loading 50+ documents and repeatedly opening and closing the application.

    Unit and Integration Testing

    ❓
    Purpose

    The unit tests target the database.py module to verify that:

      • the database can be cleared and reloaded with new data;
      • the database can be queried through the RAG model;
      • the database remains efficient by avoiding duplicate entries.
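    The checks above can be sketched as pytest-style unit tests. The DocumentDatabase class below is a hypothetical in-memory stand-in for the real vector store in database.py (the names and methods are illustrative, not the project's actual API):

    ```python
    # Illustrative sketch of the database.py unit tests. DocumentDatabase
    # is a minimal stand-in, not the project's real implementation.

    class DocumentDatabase:
        """In-memory stand-in for the RAG document store."""

        def __init__(self):
            self._docs = {}

        def clear(self):
            # Drop all stored documents so the store can be reloaded.
            self._docs = {}

        def add(self, doc_id, text):
            # Keyed storage gives de-duplication for free: re-adding the
            # same document id keeps a single copy.
            self._docs[doc_id] = text

        def query(self, keyword):
            # The real system retrieves by embedding similarity; here we
            # approximate retrieval with simple substring matching.
            return [t for t in self._docs.values() if keyword in t]


    def test_clear_and_reload():
        db = DocumentDatabase()
        db.add("p1", "transformers for literature review")
        db.clear()
        assert db.query("transformers") == []
        db.add("p2", "offline RAG pipelines")
        assert len(db.query("RAG")) == 1


    def test_query():
        db = DocumentDatabase()
        db.add("p1", "retrieval augmented generation")
        assert db.query("retrieval") == ["retrieval augmented generation"]


    def test_no_duplicates():
        db = DocumentDatabase()
        db.add("p1", "same paper")
        db.add("p1", "same paper")
        assert len(db.query("same")) == 1
    ```

    Collected by pytest, each function runs as one test case; the real suite follows the same pattern against the actual store.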
    πŸ”§
    Testing Tools
    • Python unittest framework
    • Mocking libraries
    πŸ“š
    Methodology

    Run all tests with the command pytest tests.

    πŸ“Š
    Results

    All 6 tests passed.

    πŸ’‘ Analysis & Conclusion

    These tests form the baseline for all subsequent development by ensuring that the data loaded into the database is correct.

    Compatibility Testing

    ❓
    Purpose

    To ensure our offline RAG system operates consistently across various operating systems and environments, making it accessible to all potential users regardless of their platform.

    πŸ”§
    Testing Tools
    • Machines with different OS configurations
    πŸ“š
    Methodology

    We established a systematic testing protocol across multiple environments:

    • Deployed and tested the application on Windows and macOS
    • Verified functionality with multiple Python versions
    • Windows machines: Windows 11 23H2/24H2 with an RTX 4060 GPU and 32 GB of RAM
    • macOS machines: macOS 15.3 on M4 Pro (24 GB) and M4 Max (48 GB)
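    A minimal sketch of the kind of environment check such a protocol can automate, recording the OS and Python version before launch (the minimum supported version shown is an assumption for illustration):

    ```python
    # Sketch of a pre-launch environment report for compatibility runs.
    # The (3, 9) minimum is an illustrative assumption, not the project's
    # documented requirement.
    import platform
    import sys


    def environment_report():
        return {
            "os": platform.system(),            # "Windows", "Darwin", or "Linux"
            "os_version": platform.version(),
            "python": sys.version_info[:3],
            "supported": sys.version_info >= (3, 9),  # assumed minimum
        }


    report = environment_report()
    ```

    Logging this report at the start of each test run makes it easy to trace platform-specific failures back to the exact OS and interpreter version.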
    πŸ“Š
    Results

    The application demonstrated consistent functionality across all tested platforms with only minor differences:

    • Successfully ran on both Windows and macOS
    • Compatible with all tested Python versions
    • Documentation updated with detailed installation requirements
    πŸ’‘ Analysis & Conclusion

    Cross-platform compatibility was successfully achieved, ensuring our solution is accessible to researchers regardless of their operating system preference. The application demonstrated consistent behavior and performance across all tested environments.

    Responsive Design Testing

    ❓
    Purpose

    To ensure the user interface adapts appropriately to different screen sizes and resolutions, providing an optimal experience across various display configurations.

    πŸ”§
    Testing Tools
    • Screen recording software
    • Various display resolutions and aspect ratios
    πŸ“š
    Methodology

    We implemented and tested adaptive UI features through:

    • Testing the application at various screen sizes from 1080P to 2.5K resolution
    • Verifying proper element repositioning during window resizing
    • Ensuring UI elements remain accessible and functional at all sizes
    πŸ“Š
    Results

    The responsive design features performed effectively:

    • UI successfully adapted to all tested window sizes
    • Elements maintained proper proportions and accessibility
    • Text remained readable across all display configurations
    πŸ’‘ Analysis & Conclusion

    The implementation of adaptive UI successfully ensures that our application provides a consistent user experience regardless of display size or resolution. This feature is particularly valuable for researchers who may use the tool across different devices or in varied workspace configurations.

    User Acceptance Testing

    ❓
    Purpose

    To validate that our offline RAG system meets the actual needs of researchers and effectively supports their literature review workflows in real-world scenarios. This testing phase confirms that the system delivers value to end users and identifies any usability improvements before final deployment.

    πŸ‘₯
    Testers
    • UCL GDIHUB researchers as actual users
    • Year 2 students as simulated users
    Detailed Analysis

    The user acceptance testing was conducted with researchers from UCL GDIHUB, who simulated real-world usage of the system. The test cases covered the full workflow, from adding documents to the database, applying and managing filter items, querying the system using retrieval-augmented generation (RAG), verifying the results, and exporting or deleting data.

    Throughout the testing process, simulated researchers interacted with the system as intended users, assessing usability, accuracy, and overall system performance. The testing confirmed that the system efficiently handled document management and filtering, provided precise and relevant responses to queries, and maintained stability during data operations.

    Based on this feedback, clients expressed satisfaction with the system's functionality. No major issues were reported, though minor suggestions were made for UI improvements to enhance the user experience. These insights have been incorporated into the deployed system to refine usability further.

    Next Steps

    Based on feedback, we have implemented the following improvements:

    • Enhanced UI

    Stress Testing

    ❓
    Purpose

    To evaluate stability under heavy load conditions, ensuring the offline RAG system can handle extensive document collections.

    πŸ”§
    Testing Tools
    • Memory profiler
    • Large document corpus (50+ academic papers)
    πŸ“š
    Methodology

    We conducted systematic stress testing through:

    • Loading progressively larger document collections (5, 20, 50+ documents)
    • Monitoring memory usage during extended operation periods
    • Testing with lengthy documents
    • Simulating user interactions in quick succession
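    The first two steps above can be sketched as a small harness that loads progressively larger batches and tracks peak memory with tracemalloc. The load_corpus helper is hypothetical, generating synthetic documents in place of the real academic papers:

    ```python
    # Sketch of the stress-test loop: synthetic corpora of increasing size,
    # with peak memory recorded per batch via tracemalloc. load_corpus is a
    # hypothetical stand-in for parsing real papers.
    import tracemalloc


    def load_corpus(n_docs, doc_size=10_000):
        # n_docs synthetic documents of roughly doc_size characters each.
        return ["x" * doc_size for _ in range(n_docs)]


    def stress_run(batch_sizes=(5, 20, 50)):
        results = {}
        for n in batch_sizes:
            tracemalloc.start()
            corpus = load_corpus(n)
            ingested = len(corpus)          # real code would index into the DB here
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()              # also resets traces for the next batch
            results[n] = {"docs": ingested, "peak_bytes": peak}
        return results
    ```

    Comparing peak_bytes across batch sizes shows whether memory grows roughly linearly with the corpus, which is the behavior we checked for during extended sessions.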
    πŸ“Š
    Results

    Stress testing revealed:

    • Successfully handled 50+ documents without significant performance degradation
    • Memory usage remained within acceptable bounds during extended sessions
    • No critical failures occurred during intensive testing scenarios
    πŸ’‘ Analysis & Conclusion

    The stress testing confirmed that our offline RAG system can reliably handle the document volumes and usage patterns expected in real-world research scenarios.

    Experiment 2: Offline LLM with Ossia voice

    Testing Strategy

    Our testing for the Ossia Voice project follows a comprehensive approach that addresses both technical functionality and user experience. We focus on validating each subsystem (STT, TTS, LLM) individually before moving to overall testing.

    Unit and Integration Testing

    ❓
    Purpose

    To verify that individual components of the Ossia voice system function correctly and work together seamlessly, ensuring reliability of the core offline speech processing functionality.

    πŸ”§
    Testing Tools
    • Browser console
    • Debugging windows in the form of Vue components
    πŸ“š
    Methodology

    Our testing methodology included:

    • Tests for speech processing modules
    • Tests for offline LLM responses
    • Integration tests from end to end

    Tests were run by all team members as well as external testers, since LLM and speech-recognition responses are unpredictable.

    πŸ“Š
    Results

    Testing results demonstrated:

    • All core modules passed manual verification of results
    • Integration points between components work correctly
    • Edge cases identified and addressed and errors are handled correctly
    πŸ’‘ Analysis & Conclusion

    The unit and integration testing confirmed that our offline-LLM powered Ossia system functions reliably across all core components. Minor issues were identified and resolved during the testing process, ensuring a stable foundation for the user acceptance phase.

    Compatibility Testing

    ❓
    Purpose

    To ensure the Ossia voice system works consistently across different operating systems, hardware configurations and different needs.

    πŸ”§
    Testing Tools
    • Multiple hardware configurations
    • Various operating systems (Windows, macOS)
    • Different assistive input devices (on-screen keyboard, touchpad, and mocked use of eye-tracking devices)
    πŸ“š
    Methodology

    We conducted compatibility testing across:

    • Windows 10/11 and macOS environments
    • Systems with various CPU/GPU configurations
    • Trial with common assistive input technologies
    πŸ“Š
    Results

    Compatibility testing revealed:

    • Successful operation across all tested operating systems
    • Compatible with standard assistive input devices
    πŸ’‘ Analysis & Conclusion

    The system demonstrated good compatibility across different environments, and the full pipeline was verified to deliver adequate speech processing and offline-LLM performance on each.

    Responsive Design Testing

    ❓
    Purpose

    To ensure the Ossia voice application interface adapts acceptably to different screen sizes and resolutions, providing an accessible experience for users working on regular desktops and laptops.

    πŸ”§
    Testing Tools
    • Browser DevTools
    • Various physical devices (Windows tablets, Windows and Mac laptops, Windows desktops)
    • Screen recording for further analysis
    πŸ“š
    Methodology

    Our responsive design testing approach included:

    • Testing the application across multiple screen sizes from 1080p to 4K resolution
    • Verifying button and interactive element sizing for accessibility in Chromium-based browsers
    • Ensuring critical interface components stay in place
    πŸ“Š
    Results

    The responsive design testing showed:

    • Interface successfully adapted to all tested screen sizes
    • Touch targets and UI elements maintained a suitable size for users with motor impairments
    • Text and speech controls and inputs remained accessible across all tested devices
    • Microsoft Edge for macOS (version 134.0.3124.68) has a bug that blocks file downloads on personal accounts (it works fine with a school Microsoft account); end users are instructed to reset their settings or switch to Google Chrome
    πŸ’‘ Analysis & Conclusion

    Our responsive design testing confirmed that the Ossia voice application provides consistent accessibility across different devices and screen resolutions. This is particularly important for users with NMDs who may use various devices depending on their environment and care situation. The adaptable interface ensures that users can effectively communicate with their loved ones and friends.

    User Acceptance Testing

    ❓
    Purpose

    To validate that the offline Ossia voice system meets the needs of people with NMDs, providing an accessible and effective communication tool without requiring an OpenAI API subscription.

    πŸ‘₯
    Testers
    • Simulated users with motor neuron disease (with certain body parts immobilized)
    • Accessibility specialists
    • Regular testers simulating patient's loved ones and friends
    πŸ“š
    Methodology

    User testing was conducted with:

    • Guided setup to validate the setup procedure
    • Tasks assessing accessibility and whether the system generated suitable word suggestions
    • Feedback collection through interviews and questionnaires
    πŸ“Š
    Results

    User feedback revealed:

    • High satisfaction with the offline-LLM functionality
    • Positive response to voice quality and naturalness
    • Appreciation for the elimination of API costs
    πŸ’‘ Analysis & Conclusion

    User acceptance testing confirmed that our offline-LLM solution successfully addresses the original goal of making Ossia Voice accessible without API dependencies. The system provides similar quality to online solutions while eliminating subscription costs, which was particularly valued by users who depend on the system for daily communication.