Testing Both Experiments

Our project has two experiments:

  • Offline-LLM powered literature review tool with UCL GDIHUB (a short one-month project)
  • Offline LLM with Ossia voice (our main focus)

    Experiment 1: Offline-LLM powered literature review tool with UCL GDIHUB

    Testing Strategy

    Our testing approach is multi-layered to make sure everything works smoothly, from individual parts to the entire system under pressure.

    We start with unit tests to check each component, then move on to client (integration) tests that mimic real user interactions, and finally, we perform stress tests to see how the system behaves under heavy load. This process helps us catch problems early, confirm that all parts work well together, and ensure the system remains reliable even under extreme usage, such as loading 50+ documents and repeatedly opening and closing the application.

    Unit and Integration Testing

    ❓
    Purpose

    The unit tests target the database.py module to verify that:

      • the database can be cleared and reloaded with new data;
      • the database can be queried through the RAG model;
      • the database remains efficient by avoiding duplicate entries.
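    The checks above can be sketched as pytest-style unit tests. The DocumentDatabase class below is a hypothetical in-memory stand-in for the real vector store in database.py (the names and methods are illustrative, not the project's actual API):

    ```python
    # Illustrative sketch of the database.py unit tests. DocumentDatabase
    # is a minimal stand-in, not the project's real implementation.

    class DocumentDatabase:
        """In-memory stand-in for the RAG document store."""

        def __init__(self):
            self._docs = {}

        def clear(self):
            # Drop all stored documents so the store can be reloaded.
            self._docs = {}

        def add(self, doc_id, text):
            # Keyed storage gives de-duplication for free: re-adding the
            # same document id keeps a single copy.
            self._docs[doc_id] = text

        def query(self, keyword):
            # The real system retrieves by embedding similarity; here we
            # approximate retrieval with simple substring matching.
            return [t for t in self._docs.values() if keyword in t]


    def test_clear_and_reload():
        db = DocumentDatabase()
        db.add("p1", "transformers for literature review")
        db.clear()
        assert db.query("transformers") == []
        db.add("p2", "offline RAG pipelines")
        assert len(db.query("RAG")) == 1


    def test_query():
        db = DocumentDatabase()
        db.add("p1", "retrieval augmented generation")
        assert db.query("retrieval") == ["retrieval augmented generation"]


    def test_no_duplicates():
        db = DocumentDatabase()
        db.add("p1", "same paper")
        db.add("p1", "same paper")
        assert len(db.query("same")) == 1
    ```

    Collected by pytest, each function runs as one test case; the real suite follows the same pattern against the actual store.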
    πŸ”§
    Testing Tools
    • Python unittest framework
    • Mocking libraries
    πŸ“š
    Methodology

    Run all tests with the command pytest tests.

    πŸ“Š
    Results

    All 6 tests passed.

    πŸ’‘ Analysis & Conclusion

    These tests form the baseline for all subsequent development by ensuring that the data loaded into the database is correct.

    Compatibility Testing

    ❓
    Purpose

    To ensure our offline RAG system operates consistently across various operating systems and environments, making it accessible to all potential users regardless of their platform.

    πŸ”§
    Testing Tools
    • Machines with different OS configurations
    πŸ“š
    Methodology

    We established a systematic testing protocol across multiple environments:

    • Deployed and tested the application on Windows and macOS
    • Verified functionality with multiple Python versions
    • Windows machines: Windows 11 23H2/24H2 with an RTX 4060 GPU and 32 GB of RAM
    • macOS machines: macOS 15.3 on M4 Pro (24 GB) and M4 Max (48 GB)
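    A minimal sketch of the kind of environment check such a protocol can automate, recording the OS and Python version before launch (the minimum supported version shown is an assumption for illustration):

    ```python
    # Sketch of a pre-launch environment report for compatibility runs.
    # The (3, 9) minimum is an illustrative assumption, not the project's
    # documented requirement.
    import platform
    import sys


    def environment_report():
        return {
            "os": platform.system(),            # "Windows", "Darwin", or "Linux"
            "os_version": platform.version(),
            "python": sys.version_info[:3],
            "supported": sys.version_info >= (3, 9),  # assumed minimum
        }


    report = environment_report()
    ```

    Logging this report at the start of each test run makes it easy to trace platform-specific failures back to the exact OS and interpreter version.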
    πŸ“Š
    Results

    The application demonstrated consistent functionality across all tested platforms with only minor differences:

    • Successfully ran on both Windows and macOS
    • Compatible with all tested Python versions
    • Documentation updated with detailed installation requirements
    πŸ’‘ Analysis & Conclusion

    Cross-platform compatibility was successfully achieved, ensuring our solution is accessible to researchers regardless of their operating system preference. The application demonstrated consistent behavior and performance across all tested environments.

    Responsive Design Testing

    ❓
    Purpose

    To ensure the user interface adapts appropriately to different screen sizes and resolutions, providing an optimal experience across various display configurations.

    πŸ”§
    Testing Tools
    • Screen recording software
    • Various display resolutions and aspect ratios
    πŸ“š
    Methodology

    We implemented and tested adaptive UI features through:

    • Testing the application at various screen sizes from 1080P to 2.5K resolution
    • Verifying proper element repositioning during window resizing
    • Ensuring UI elements remain accessible and functional at all sizes
    πŸ“Š
    Results

    The responsive design features performed effectively:

    • UI successfully adapted to all tested window sizes
    • Elements maintained proper proportions and accessibility
    • Text remained readable across all display configurations
    πŸ’‘ Analysis & Conclusion

    The implementation of adaptive UI successfully ensures that our application provides a consistent user experience regardless of display size or resolution. This feature is particularly valuable for researchers who may use the tool across different devices or in varied workspace configurations.

    User Acceptance Testing

    ❓
    Purpose

    To validate that our offline RAG system meets the actual needs of researchers and effectively supports their literature review workflows in real-world scenarios. This testing phase confirms that the system delivers value to end users and identifies any usability improvements before final deployment.

    πŸ‘₯
    Testers
    • UCL GDIHUB researchers as actual users
    • Year 2 students as simulated users
    Detailed Analysis

    The user acceptance testing was conducted with researchers from UCL GDIHUB, who simulated real-world usage of the system. The test cases covered the full workflow, from adding documents to the database, applying and managing filter items, querying the system using retrieval-augmented generation (RAG), verifying the results, and exporting or deleting data.

    Throughout the testing process, simulated researchers interacted with the system as intended users, assessing usability, accuracy, and overall system performance. The testing confirmed that the system efficiently handled document management and filtering, provided precise and relevant responses to queries, and maintained stability during data operations.

    Based on this feedback, clients expressed satisfaction with the system's functionality. No major issues were reported, though minor suggestions were made for UI improvements to enhance the user experience. These insights have been incorporated into the deployed system to refine usability further.

    Next Steps

    Based on feedback, we have implemented the following improvements:

    • Enhanced UI

    Stress Testing

    ❓
    Purpose

    To evaluate stability under heavy load conditions, ensuring the offline RAG system can handle extensive document collections.

    πŸ”§
    Testing Tools
    • Memory profiler
    • Large document corpus (50+ academic papers)
    πŸ“š
    Methodology

    We conducted systematic stress testing through:

    • Loading progressively larger document collections (5, 20, 50+ documents)
    • Monitoring memory usage during extended operation periods
    • Testing with lengthy documents
    • Simulating user interactions in quick succession
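    The first two steps above can be sketched as a small harness that loads progressively larger batches and tracks peak memory with tracemalloc. The load_corpus helper is hypothetical, generating synthetic documents in place of the real academic papers:

    ```python
    # Sketch of the stress-test loop: synthetic corpora of increasing size,
    # with peak memory recorded per batch via tracemalloc. load_corpus is a
    # hypothetical stand-in for parsing real papers.
    import tracemalloc


    def load_corpus(n_docs, doc_size=10_000):
        # n_docs synthetic documents of roughly doc_size characters each.
        return ["x" * doc_size for _ in range(n_docs)]


    def stress_run(batch_sizes=(5, 20, 50)):
        results = {}
        for n in batch_sizes:
            tracemalloc.start()
            corpus = load_corpus(n)
            ingested = len(corpus)          # real code would index into the DB here
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()              # also resets traces for the next batch
            results[n] = {"docs": ingested, "peak_bytes": peak}
        return results
    ```

    Comparing peak_bytes across batch sizes shows whether memory grows roughly linearly with the corpus, which is the behavior we checked for during extended sessions.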
    πŸ“Š
    Results

    Stress testing revealed:

    • Successfully handled 50+ documents without significant performance degradation
    • Memory usage remained within acceptable bounds during extended sessions
    • No critical failures occurred during intensive testing scenarios
    πŸ’‘ Analysis & Conclusion

    The stress testing confirmed that our offline RAG system can reliably handle the document volumes and usage patterns expected in real-world research scenarios.

    Experiment 2: Offline LLM with Ossia voice

    Testing Strategy

    Our testing for the Ossia Voice project follows a comprehensive approach that addresses both technical functionality and user experience. We focus on validating each subsystem (STT, TTS, LLM) individually before moving to overall testing.

    Unit and Integration Testing

    ❓
    Purpose

    To verify that individual components of the Ossia voice system function correctly and work together seamlessly, ensuring reliability of the core offline speech processing functionality.

    πŸ”§
    Testing Tools
    • Browser console
    • Debugging windows in the form of Vue components
    πŸ“š
    Methodology

    Our testing methodology included:

    • Tests for speech processing modules
    • Tests for offline LLM responses
    • Integration tests from end to end

    Tests were run by all team members as well as external testers, since LLM and speech-recognition responses are unpredictable.

    πŸ“Š
    Results

    Testing results demonstrated:

    • All core modules passed manual verification of results
    • Integration points between components work correctly
    • Edge cases identified and addressed and errors are handled correctly
    πŸ’‘ Analysis & Conclusion

    The unit and integration testing confirmed that our offline-LLM powered Ossia system functions reliably across all core components. Minor issues were identified and resolved during the testing process, ensuring a stable foundation for the user acceptance phase.

    Compatibility Testing

    ❓
    Purpose

    To ensure the Ossia voice system works consistently across different operating systems, hardware configurations and different needs.

    πŸ”§
    Testing Tools
    • Multiple hardware configurations
    • Various operating systems (Windows, macOS)
    • Different assistive input devices (on-screen keyboard, touchpad, and mocked use of eye-tracking devices)
    πŸ“š
    Methodology

    We conducted compatibility testing across:

    • Windows 10/11 and macOS environments
    • Systems with various CPU/GPU configurations
    • Trial with common assistive input technologies
    πŸ“Š
    Results

    Compatibility testing revealed:

    • Successful operation across all tested operating systems
    • Compatible with standard assistive input devices
    πŸ’‘ Analysis & Conclusion

    The system demonstrated good compatibility across different environments, and the full pipeline was verified to deliver adequate speech processing and offline-LLM performance on each.

    Responsive Design Testing

    ❓
    Purpose

    To ensure the Ossia voice application interface adapts acceptably to different screen sizes and resolutions, providing an accessible experience for users working on regular desktops and laptops.

    πŸ”§
    Testing Tools
    • Browser DevTools
    • Various physical devices (Windows tablets, Windows and Mac laptops, Windows desktops)
    • Screen recording for further analysis
    πŸ“š
    Methodology

    Our responsive design testing approach included:

    • Testing the application across multiple screen sizes from 1080p to 4K resolution
    • Verifying button and interactive element sizing for accessibility in Chromium-based browsers
    • Ensuring critical interface components stay in place
    πŸ“Š
    Results

    The responsive design testing showed:

    • Interface successfully adapted to all tested screen sizes
    • Touch targets and UI elements maintained a suitable size for users with motor impairments
    • Text and speech controls and inputs remained accessible across all tested devices
    • Microsoft Edge for macOS (version 134.0.3124.68) has a bug that blocks file downloads on personal accounts (it works fine with a school Microsoft account); end users are instructed to reset their settings or switch to Google Chrome
    πŸ’‘ Analysis & Conclusion

    Our responsive design testing confirmed that the Ossia voice application provides consistent accessibility across different devices and screen resolutions. This is particularly important for users with NMDs who may use various devices depending on their environment and care situation. The adaptable interface ensures that users can effectively communicate with their loved ones and friends.

    User Acceptance Testing

    ❓
    Purpose

    To validate that the offline Ossia voice system meets the needs of people with NMDs, providing an accessible and effective communication tool without requiring an OpenAI API subscription.

    πŸ‘₯
    Testers
    • Simulated users with motor neuron disease (with certain body parts immobilized)
    • Accessibility specialists
    • Regular testers simulating patient's loved ones and friends
    πŸ“š
    Methodology

    User testing was conducted with:

    • Guided setup to validate the setup procedure
    • Tasks assessing accessibility and whether the system generated suitable word suggestions
    • Feedback collection through interviews and questionnaires
    πŸ“Š
    Results

    User feedback revealed:

    • High satisfaction with the offline-LLM functionality
    • Positive response to voice quality and naturalness
    • Appreciation for the elimination of API costs
    πŸ’‘ Analysis & Conclusion

    User acceptance testing confirmed that our offline-LLM solution successfully addresses the original goal of making Ossia Voice accessible without API dependencies. The system provides similar quality to online solutions while eliminating subscription costs, which was particularly valued by users who depend on the system for daily communication.