Research

Having identified our goals and personas, we analyse similar applications to identify what makes a karaoke game successful. We also explore current technology and its potential limitations to inform our design approach. By synthesising these insights, we aim to identify best practices and integrate them into Reading Star.

Technology Review

Evaluation of potential technologies and frameworks

Requirements

Considering our clients' requests, we have some requirements that narrow down the possible technologies:

  • Must be native to Windows
  • Must have local Artificial Intelligence
  • Must be optimised for Intel hardware

Frontend Technologies

First, we review the frontend technologies that may fit these criteria. Microsoft Foundation Classes (MFC) is a popular choice for applications designed for older computers. However, after careful consideration, we decided that MFC would not allow us to create the captivating UI that is essential to a karaoke game.

We also considered .NET, but decided it would limit our extensibility. Although it could meet our current requirements, it would make it difficult to adapt our application to other platforms, such as mobile, later.

Ultimately, we decided React Native for Windows was the right blend of modern looks and native integration.

Backend Technologies

There are two main backend requirements: a music player and lyrics matching.

To add music to the game, we had to find a method that did not infringe intellectual property law and that integrated seamlessly. The choice was made for us by legal restrictions: we could not license songs individually, and we could not obtain them otherwise. We therefore had to integrate an existing provider into our application, and the only possible choice was the YouTube API. Hence, we decided to embed the YouTube player into the application and stream the songs.
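As an illustration of what embedding the player involves, the sketch below builds the address of an embedded YouTube player for a given video. It is a minimal sketch rather than the project's implementation: the helper name `youtube_embed_url` and the placeholder video ID are hypothetical, and only the `autoplay` and `start` player parameters are used.

```python
from urllib.parse import quote, urlencode


def youtube_embed_url(video_id: str, autoplay: bool = False, start_seconds: int = 0) -> str:
    """Build the URL of an embedded YouTube player for one video.

    Hypothetical helper for illustration; the report does not describe
    how the embed address is constructed in the application.
    """
    params = {}
    if autoplay:
        params["autoplay"] = 1          # start playback as soon as the player loads
    if start_seconds > 0:
        params["start"] = start_seconds  # skip ahead, e.g. past a long intro
    base = f"https://www.youtube.com/embed/{quote(video_id, safe='')}"
    return f"{base}?{urlencode(params)}" if params else base


# "VIDEO_ID" is a placeholder, not a real video.
print(youtube_embed_url("VIDEO_ID", autoplay=True, start_seconds=30))
```

Because the song is streamed from the provider's player, no audio file needs to be stored or distributed with the application.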

For matching lyrics, we found many different models and methods. We looked at open-source models and tested their accuracy on prototypes. Since we needed Intel optimisation, we used OpenVINO to increase efficiency on Intel hardware by converting models into its Intermediate Representation (IR).

We compared several automatic speech recognition (ASR) models during our research, both on published benchmarks such as LibriSpeech and by testing the models locally. NVIDIA's canary-1b model outperformed the other ASR models on most metrics, including Word Error Rate (WER). However, when tested with audio from neurodiverse children provided by the National Autistic Society, it failed to transcribe most of the audio accurately.

Another model we tested was Vosk, which excels at real-time transcription, an essential feature for our application. However, its models are stored in the Kaldi nnet3 format, which proved difficult to integrate into an OpenVINO inference pipeline.

We finally decided on OpenAI's Whisper tiny.en model. Whisper is trained on a much more diverse dataset than NVIDIA's Canary and therefore performed better with neurodiverse speech and different accents, and the small size of the tiny model allowed us to engineer an effective and efficient algorithm for live transcription. Additionally, OpenVINO already provided a pipeline for the Whisper model, which allowed us to optimise inference on Intel CPUs and NPUs and to build an accurate and efficient pipeline for live speech recognition.
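For readers unfamiliar with Word Error Rate, the sketch below shows how the metric is commonly computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. It is a minimal illustration using only the standard library, not the evaluation code used in our tests; the function name and the sample sentences are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the output
            insertion = current[j - 1] + 1   # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One of four reference words is missing from the transcription.
print(word_error_rate("twinkle twinkle little star", "twinkle little star"))  # 0.25
```

A lower value means a more accurate transcription. The same measure could, in principle, score a transcribed line against the expected lyric, although that use is an assumption here rather than a description of our matching algorithm.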

The final component is the Python framework used to implement the backend. We opted for FastAPI because of its support for the Asynchronous Server Gateway Interface (ASGI), which enables the low-latency request handling that is crucial to our application. FastAPI's lightweight design and seamless integration with our AI pipeline allowed us to build a responsive and scalable backend.
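To make the ASGI interface more concrete, the sketch below shows the calling convention that ASGI frameworks such as FastAPI build on: an application is an async callable that receives a connection `scope` and two awaitable channels, `receive` and `send`. It uses only the standard library and is not our backend code; the `/health` route and the `call` helper are hypothetical and exist only for this example.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Minimal ASGI application: answers GET /health with a JSON body."""
    if scope["type"] != "http":
        # Lifespan and WebSocket scopes are not handled in this sketch.
        raise NotImplementedError("only HTTP is handled in this sketch")

    ok = scope["method"] == "GET" and scope["path"] == "/health"
    status = 200 if ok else 404
    body = json.dumps({"status": "ok" if ok else "not found"}).encode()

    # An HTTP response is sent as two messages: the start, then the body.
    await send({
        "type": "http.response.start",
        "status": status,
        "headers": [
            (b"content-type", b"application/json"),
            (b"content-length", str(len(body)).encode()),
        ],
    })
    await send({"type": "http.response.body", "body": body})


async def call(path: str):
    """Drive the app in-process, playing the role an ASGI server normally plays."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent[0]["status"], json.loads(sent[1]["body"])


print(asyncio.run(call("/health")))  # (200, {'status': 'ok'})
```

Because every step is awaited, a server can interleave many such calls on one event loop instead of blocking on each request, which is the property that matters for live transcription.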

Summary

Key findings and conclusions from our research

Summary of Technical Decisions

Overall, after reviewing many potential technologies, we decided to use React Native for Windows, Whisper with OpenVINO, and the YouTube API as the core services of our application. These decisions align well with the project requirements. Our analysis of similar projects also concludes that this approach is feasible and consistent with our takeaways.
