Testing

Testing Strategy

As our product focussed mainly on usability for the teachers and children at the schools, we placed a stronger emphasis on user acceptance testing than on unit or integration testing. We tested our product with live users every week, whether fellow students or visitors at events, then fixed the bugs they found and implemented the features they asked for. We also held a weekly Friday meeting with our NAS clients to discuss the current state of the project and obtain direct feedback from them.

Unit and Integration Testing

Because our end users are non-technical occupational therapists and primary school children, we deliberately restricted what users can do in the application. This made unit testing straightforward, as there was only a handful of functions, all of which could be tested manually from the frontend.

The AI outputs varied widely, and the only tests we created were manual checks that the outputs were satisfactory for a handful of demo songs. For example, in one instance the AI did not produce any background prompts; after manual debugging, we found that the AI had replied that the lyrics were “not interesting” enough for it to extract any meaningful backgrounds.
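
A manual check like this could in principle be backed by a small automated sanity check. The sketch below is illustrative only: the `SongAnalysis` shape, its field names and `findAnalysisProblems` are hypothetical and not taken from the project's code. The default of 3 prompts per kind assumes the set of 3 backgrounds and 3 objects that we generate per song.

```typescript
// Hypothetical shape of the LLM output for one song (field names are assumptions).
interface SongAnalysis {
  backgroundPrompts: string[];
  objectPrompts: string[];
}

// Returns a list of human-readable problems; an empty list means the output looks usable.
function findAnalysisProblems(analysis: SongAnalysis, expectedPerKind: number = 3): string[] {
  const problems: string[] = [];
  const kinds: Array<[string, string[]]> = [
    ["background", analysis.backgroundPrompts],
    ["object", analysis.objectPrompts],
  ];
  for (const [kind, prompts] of kinds) {
    // Ignore blank entries so that "" or "   " does not count as a prompt.
    const usable = prompts.filter((p) => p.trim().length > 0);
    if (usable.length < expectedPerKind) {
      problems.push(`expected ${expectedPerKind} ${kind} prompts, got ${usable.length}`);
    }
  }
  return problems;
}

// The failure described above: the model returned no background prompts at all.
console.log(findAnalysisProblems({ backgroundPrompts: [], objectPrompts: ["star", "boat", "kite"] }));
```

A check of this kind only catches missing output; whether a prompt actually suits the song still needs a human to judge.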

Compatibility testing

Compatibility testing was done manually on Mac and Windows computers. Our client asked us to deploy on a Windows computer with Intel hardware.

The application runs smoothly on Intel-based machines running Windows, and the animations also perform reasonably well on Windows machines using other chips (AMD was tested).

However, electron-builder (the library used to package our product) packages the code for a specific OS, including the format of its path references. Because UNIX-based systems (macOS and Linux being the major ones) use a different file path format, the relative references to the image and audio files are not resolved correctly and the files fail to load. As a result, our system currently works only on Windows.
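
One common way for path problems like this to arise is an asset path written with a fixed separator. As a hedged illustration (we have not confirmed that this is the exact cause in our build), the sketch below rebuilds a relative asset path with the separator of the target platform. `toPlatformPath` and the example file name are hypothetical.

```typescript
type Separator = "/" | "\\";

// Split a relative path on either kind of slash and re-join it with the
// separator of the platform the app is running on.
function toPlatformPath(relativePath: string, separator: Separator): string {
  return relativePath
    .split(/[\\\/]+/)
    .filter((segment) => segment.length > 0)
    .join(separator);
}

// Hypothetical asset reference written Windows-style.
const windowsStyle = "assets\\images\\background1.png";
console.log(toPlatformPath(windowsStyle, "/"));  // assets/images/background1.png
console.log(toPlatformPath(windowsStyle, "\\")); // assets\images\background1.png
```

In an Electron app, Node's built-in `path.join` is the usual tool for building paths from separate segments, so a helper like this is only needed where paths are already stored as single strings.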

The performance on each type of machine is detailed below in the Performance/stress testing section.

Performance/stress testing

The hardware that we tested on had the following specs:

  1. Windows with Intel CPU: Intel i9 CPU
  2. Windows with Medium level Intel hardware: Intel i5 114000 CPU, 8GB RAM and Intel UHD 740 GPU
  3. Windows with Non-Intel hardware: AMD CPU

Although we could have used higher-end hardware, we agreed that our clients would not have access to it, so testing on it would not be realistic. Most of our clients at the NAS schools had access to CPUs but rarely to GPUs, and to extend our affordability criteria to those without Intel GPUs we chose to test CPU functionality only.

From the frontend, we timed how long it took to generate all LLM outputs on each of the 3 devices that we had access to. The graph below shows the distribution. Although the generation time depends slightly on the length of the song (and hence its lyrics), there is a clear trend across all songs: better Intel CPUs perform drastically better than the other options. Moreover, the non-Intel laptop had an AMD Ryzen 7000 series CPU, whose benchmarks are better than those of an 11th gen Intel i5. Despite this, the Intel CPUs consistently outperformed the AMD CPU, which supports our hypothesis that OpenVINO is well suited to execution on Intel hardware. Although the average time difference between these two is only around 1 minute per song, this adds up to a significant amount of time when running the program as a batch.
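
The timing itself needs very little code. The following is a minimal sketch of the kind of stopwatch that can wrap a generation run, not our actual measurement code: `startTimer`, `mean` and `batchOverheadMinutes` are hypothetical helper names, and the batch size of 30 songs is an assumed figure used only to show the arithmetic.

```typescript
// Returns a function that reports the milliseconds elapsed since startTimer was called.
// The clock is injectable so the helper can be tested without real waiting.
function startTimer(now: () => number = Date.now): () => number {
  const startedAt = now();
  return () => now() - startedAt;
}

// Arithmetic mean of a non-empty list of durations.
function mean(values: number[]): number {
  if (values.length === 0) throw new Error("cannot take the mean of an empty list");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Extra time a slower machine adds over a whole batch of songs.
function batchOverheadMinutes(perSongDifferenceMinutes: number, songCount: number): number {
  return perSongDifferenceMinutes * songCount;
}

// Usage: wrap one generation run (the work itself is elided here).
const stop = startTimer();
// ... run the LLM generation for one song ...
const elapsedMs = stop();
console.log(`generation took ${elapsedMs} ms`);

// With the roughly 1 minute per-song gap and an assumed batch of 30 songs:
console.log(batchOverheadMinutes(1, 30)); // 30
```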

[Graph: time taken to generate all LLM outputs per song on each of the 3 test machines]

We also measured the average memory usage while analysing a song with the LLM, which was very consistent at 6547MB. This is relatively high RAM usage, which may impede the user’s ability to use the laptop normally at the same time. However, as our product includes an option to run the LLM commands as a batch overnight, we believe this will have less of an impact in practice. Moreover, when we tested on a machine with only 4GB of memory, the LLM did lower its memory usage to fit the machine, at the cost of a slightly longer generation time, so the product is not completely inaccessible to those whose devices have less than 6GB of RAM.

[Graph: memory usage while analysing a song with the LLM]

For the Whisper and Stable Diffusion testing, we have not created a graph, as the difference between systems was small (at most around 30 seconds per song).

For Whisper:

  • the non-Intel machine took an average of 01:06 per song
  • the Intel i5 machine took an average of 00:53 per song
  • the Intel i9 machine took an average of 00:45 per song

For Stable Diffusion, we tested generating the full set of 3 backgrounds and 3 objects (6 images in total):

  • the non-Intel machine took 01:16 per song (of which 1 minute was taken to load the code)
  • the Intel i5 machine took 01:12 per song
  • the Intel i9 machine took 00:46 per song

The non-Intel and Intel i5 machines had similar specs and so produced similar times, but the i9 machine had a much better CPU and loaded the code much faster than the others.
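
To put the mm:ss figures above side by side, they can be converted to seconds. This is a small sketch with hypothetical helper names (`toSeconds`, `differenceInSeconds`); the inputs are the Whisper and Stable Diffusion timings listed above.

```typescript
// Convert an "mm:ss" string such as "01:06" into a number of seconds.
function toSeconds(mmss: string): number {
  const parts = mmss.split(":");
  if (parts.length !== 2) throw new Error(`not in mm:ss format: ${mmss}`);
  const minutes = Number(parts[0]);
  const seconds = Number(parts[1]);
  if (isNaN(minutes) || isNaN(seconds)) throw new Error(`not in mm:ss format: ${mmss}`);
  return minutes * 60 + seconds;
}

// How many seconds slower the first timing is than the second.
function differenceInSeconds(slower: string, faster: string): number {
  return toSeconds(slower) - toSeconds(faster);
}

// Whisper: non-Intel machine vs Intel i9 machine.
console.log(differenceInSeconds("01:06", "00:45")); // 21
// Stable Diffusion: non-Intel machine vs Intel i9 machine.
console.log(differenceInSeconds("01:16", "00:46")); // 30
```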

User acceptance testing

As our product focussed mainly on user satisfaction and on how easy it is for schools to use, we ran multiple rounds of user testing with various age groups.

Round 1 - Haggerston School and Enfield Grammar School

In our initial user tests, we collected data from 18 secondary school children using Microsoft Forms; graphs of the most important feedback are attached below. At this stage, our product was still a prototype without many features: it had a very basic particle system, and the shader worked for only one song. Even so, the students seemed more attracted to the shader mode than to the particle mode, which was more visually appealing but less stimulating.

From the graphs, we can see that the students widely found our product interesting, with the majority feeling that the visuals were well timed to the song and that the particles chosen by the AI matched the theme of the song. One interesting response was that the whiteboard looked plain, so we increased the maximum particle limit to 50 particles and the particle spawn rate to 2 per second.
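
As a rough illustration of how a particle limit and a spawn rate interact, here is a minimal sketch of a capped spawner. It is not our particle system: the `ParticleSpawner` class and its method names are hypothetical, and only the two default values (50 particles, 2 per second) come from the change described above.

```typescript
class ParticleSpawner {
  private maxParticles: number;
  private spawnPerSecond: number;
  // Fractional particles owed from previous frames.
  private carry: number = 0;

  constructor(maxParticles: number = 50, spawnPerSecond: number = 2) {
    this.maxParticles = maxParticles;
    this.spawnPerSecond = spawnPerSecond;
  }

  // Returns how many particles to add this frame, given the number currently
  // on screen and the time since the previous frame in seconds.
  update(activeCount: number, dtSeconds: number): number {
    this.carry += dtSeconds * this.spawnPerSecond;
    const wanted = Math.floor(this.carry);
    this.carry -= wanted;
    // Never exceed the cap, even if the spawn rate would allow more.
    const room = Math.max(0, this.maxParticles - activeCount);
    return Math.min(wanted, room);
  }
}

const spawner = new ParticleSpawner();
console.log(spawner.update(0, 0.5)); // 1
console.log(spawner.update(50, 1)); // 0
```

Starting from an empty whiteboard, a rate of 2 particles per second reaches the cap of 50 after 25 seconds, provided no particles expire in the meantime.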

[Graphs: Round 1 user feedback collected via Microsoft Forms]