Research

Having identified our goals and personas, we seek to analyse other similar applications to extract what makes a successful karaoke game. Additionally, we explore current technology and potential limitations to inform our design approach. By synthesising these insights we aim to identify best practices and integrate these elements into Reading Star.

Related Projects Review

Analysis of similar solutions and existing work

SingStar 2007

SingStar is one of the earliest and most popular examples of a home karaoke video game. The game was developed by Sony and released in 2007 for the PlayStation 3. SingStar lets players sing along to illuminated lyrics and scores them on the accuracy of their pitch. At the end of each phrase, a prompt such as "Good" or "SingStar" gives feedback and adds a corresponding amount to the score.

Figure 2: SingStar gameplay [3]

Key Features and Learnings

Main Features	What we learn
Real time scoring	Implementing a reliable score system is important for user engagement and fairness.
Illuminated lyrics	Visual cues help users follow along and enhance usability.
Instant feedback with words of encouragement	Immediate feedback can improve user morale and motivate the player.

We also examined the technology behind SingStar. The game used fast Fourier transforms (FFT) to analyse the frequency of the input signal, which was then compared with the stored frequencies for each song [5]. The drawback is that each song's phrases and frequencies must be configured manually, which leads to a smaller selection of songs. Additionally, SingStar did not transcribe the lyrics, so humming at the right pitch also yielded perfect scores. Hence, this approach is not viable for our project's goals. Nevertheless, analysing this related project taught us many ways to make the design and UI engaging and entertaining.
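As a rough illustration of the FFT-based pitch check described above, the following Python sketch estimates the dominant frequency of a short audio frame and compares it with a target note. It is not SingStar's actual implementation: the function names, the one-semitone tolerance and the synthetic 440 Hz tone are all assumptions made for illustration.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a non-zero power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin in the frame."""
    spectrum = fft(samples)
    half = len(samples) // 2
    # Only the first half of the spectrum is meaningful for real input;
    # bin 0 is skipped because it holds the constant (DC) offset.
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


def pitch_matches(sung_hz, target_hz, tolerance_semitones=1.0):
    """Hypothetical scoring rule: accept pitches within a semitone tolerance."""
    if sung_hz <= 0 or target_hz <= 0:
        return False
    distance = abs(12 * math.log2(sung_hz / target_hz))
    return distance <= tolerance_semitones


if __name__ == "__main__":
    sample_rate = 8192  # assumed rate, chosen so 440 Hz falls exactly on a bin
    frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(2048)]
    sung = dominant_frequency(frame, sample_rate)
    print(sung, pitch_matches(sung, target_hz=440.0))
```

Note that the target frequency for every phrase still has to be supplied by hand, which is the manual configuration cost mentioned above, and that a hummed note passes this check just as well as a sung word.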

Let's Sing 2024

Having looked at one of the best-regarded karaoke games of recent times, we now examine how a modern implementation of the same genre has evolved for the newest generation of technology. 'Let's Sing 2024' is a karaoke game published by Plaion GmbH for many platforms, including the PlayStation 5.

This iteration of the genre introduces something interesting: Let's Sing tells the player to stop humming if they have not sung the lyrics for an extended period of time [2]. While the game still bases the score only on tone, it does not want the player to hum the entire song. This suggests the game performs some lyrics matching, or at least uses speech-to-text functionality. Notably, Let's Sing does not penalise humming; it only displays a warning. The developers' view seems to be that heavily penalising incorrect words is not the right approach to karaoke. Moreover, the UI shares the features we noted in SingStar's implementation, which reinforces the importance of these elements.

Key Features and Learnings

Main Features	What we learn
Tone and rhythm matching with small nudges to sing the correct words	Leniency in matching words is best for the enjoyability of a karaoke game.
Diverse song library	Variety in music encourages users to try more songs and spend more time singing.

Overall, we drew five takeaways from observing how similar projects make a karaoke game fun.
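One of these takeaways, leniency in matching words, can be made concrete with a short Python sketch. It scores a transcribed line against the expected lyric while tolerating small transcription or pronunciation slips. This is only an illustration under assumptions: the function names and the 0.75 similarity cutoff are hypothetical and do not describe how Let's Sing or Reading Star is implemented.

```python
import difflib
import re


def normalise(text):
    """Lower-case the text and keep only word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lyric_match_ratio(expected_line, transcribed, cutoff=0.75):
    """Fraction of expected words that have a close match in the transcript.

    Each transcribed word can satisfy at most one expected word, and word
    order is ignored, which keeps the check deliberately forgiving.
    """
    expected = normalise(expected_line)
    remaining = normalise(transcribed)
    if not expected:
        return 1.0
    matched = 0
    for word in expected:
        close = difflib.get_close_matches(word, remaining, n=1, cutoff=cutoff)
        if close:
            matched += 1
            remaining.remove(close[0])
    return matched / len(expected)


if __name__ == "__main__":
    line = "Twinkle twinkle little star"
    print(lyric_match_ratio(line, "twinkle twinkel little star"))
    print(lyric_match_ratio(line, "mmm mmm"))
```

A lenient game could award full marks above some ratio and show a gentle reminder, rather than a penalty, when the ratio stays low.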

Technology Review

Evaluation of potential technologies and frameworks

Requirements

Considering our clients' requests, we have some requirements that narrow down the possible technologies:

Must be native to Windows
Must have local Artificial Intelligence
Must be optimised for Intel hardware

Frontend Technologies

Firstly, we review the frontend technologies that may fit these criteria. Microsoft Foundation Classes (MFC) is a popular choice for applications designed for older computers. However, after careful consideration, we decided that MFC would not allow us to create the captivating UI that is essential to a karaoke game.

We also considered .NET, but decided it would limit our extensibility. While it could meet our current requirements, it would make it difficult to adapt our application to other platforms, such as mobile, later.

Ultimately, we decided React Native for Windows was the right blend of modern looks and native integration.

Backend Technologies

There are two main backend requirements: a music player and lyrics matching.

To add music to the game, we needed a method that did not infringe intellectual property law and that integrated seamlessly. Legal restrictions made the choice for us: we could not license songs individually, and we could not obtain them any other way. We therefore had to integrate an existing provider into our application, and the only viable option we found was the YouTube API. Hence, we decided to embed the YouTube player in the application and stream the songs.
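To make the embedding idea concrete, the sketch below builds an embed URL for the YouTube player from a video ID. It is a minimal illustration rather than the application's actual code: the helper name and its defaults are hypothetical, "VIDEO_ID" is a placeholder, and the query parameters shown (autoplay, controls, enablejsapi, start) are standard embedded-player parameters.

```python
from urllib.parse import urlencode


def youtube_embed_url(video_id, autoplay=True, show_controls=False, start_seconds=0):
    """Build an embed URL for the YouTube player (hypothetical helper)."""
    params = {
        "autoplay": int(autoplay),        # start playback as soon as the player loads
        "controls": int(show_controls),   # hide player controls for a cleaner game UI
        "enablejsapi": 1,                 # allow the host page to control the player
    }
    if start_seconds:
        params["start"] = int(start_seconds)
    return f"https://www.youtube.com/embed/{video_id}?{urlencode(params)}"


if __name__ == "__main__":
    # "VIDEO_ID" is a placeholder, not a real video.
    print(youtube_embed_url("VIDEO_ID"))
```

Because the audio is streamed by the embedded player, the application never stores or redistributes the songs itself.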

For matching lyrics, we found many different models and methods. We looked at open-source models and tested their accuracy in prototypes. Since we needed Intel optimisation, we used OpenVINO, which increases efficiency on Intel hardware by converting models into an Intermediate Representation (IR).

We compared several speech recognition models during our research, using published metrics such as the LibriSpeech performance benchmark and testing the models locally. NVIDIA's Canary-1B model outperformed other automatic speech recognition (ASR) models on most metrics, including Word Error Rate. However, when tested with audio from neurodiverse children provided by the National Autistic Society, it failed to transcribe most of the audio accurately. Another model we tested was Vosk, which excels at real-time transcription, an essential capability for our application. However, its models are stored in the Kaldi nnet3 format, which proved difficult to integrate into an OpenVINO pipeline for inference. We finally decided on OpenAI's Whisper tiny.en model. Whisper is trained on a much more diverse dataset than NVIDIA's Canary and therefore performed better with neurodiverse speech and different accents, while the small size of the tiny model allowed us to engineer an effective and efficient algorithm for live transcription. Additionally, OpenVINO already provided a pipeline for the Whisper model, which allowed us to optimise inference on Intel CPUs and NPUs and build an accurate and efficient pipeline for live speech recognition.
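Word Error Rate, the metric named above, is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The Python sketch below implements this textbook definition; it is not necessarily the tooling used in our tests, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one of six reference words is dropped.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value is better, and the score can exceed 1.0 when the model inserts many extra words.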

The final component is the Python framework used to implement the backend. We opted for FastAPI because it supports the Asynchronous Server Gateway Interface (ASGI), which enables asynchronous, low-latency request handling that is crucial to our application. FastAPI's lightweight design and seamless integration with our AI pipeline allowed us to build a responsive and scalable backend.
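To show what ASGI means in practice, the sketch below is a bare ASGI application written with only the Python standard library, driven in-process by a small stand-in for a server. FastAPI applications expose this same callable interface and add routing and validation on top. The "/health" path and the JSON payload are hypothetical and are not part of our backend.

```python
import asyncio
import json


async def app(scope, receive, send):
    """Minimal ASGI application: answers any HTTP request with a JSON body."""
    if scope["type"] != "http":
        return  # ignore lifespan and websocket scopes in this sketch
    body = json.dumps({"status": "ok", "path": scope["path"]}).encode("utf-8")
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})


async def call_app(path):
    """Drive the app in-process, standing in for a real ASGI server."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": path, "headers": []}
    await app(scope, receive, send)
    return sent


if __name__ == "__main__":
    for message in asyncio.run(call_app("/health")):
        print(message)
```

Because every handler is a coroutine, the server can keep accepting requests while another request awaits slow work such as model inference, provided that work is awaited rather than run as a blocking call.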

Summary

Key findings and conclusions from our research

Summary of Technical Decisions

Overall, after reviewing many potential technologies, we decided to use React Native for Windows, Whisper with OpenVINO, and the YouTube API as the core services of our application. These decisions align well with the project requirements. Our analysis of similar projects also suggests that this approach is feasible and consistent with our takeaways.

References

Sources and citations

[1] "Let's Sing 2024." Plaion GmbH, Austria, 2024.

[2] R. Burke, "Let's Sing 2024 review - more on key than not," GamingTrend, https://gamingtrend.com/reviews/lets-sing-2024-review-more-on-key-than-not/ (accessed Mar. 2, 2025).

[3] G. Miller, "SingStar Review," IGN, https://www.ign.com/articles/2008/05/20/singstar-review (accessed Feb. 27, 2025).

[4] "SingStar." Sony Computer Entertainment, Foster City, Calif., 2007.

[5] L. Bedigian, "Striking a chord with SingStar Pop's Tamsin Lucas - PS2 News," PS2 Gamezone, https://web.archive.org/web/20070408075151/http://ps2.gamezone.com/news/04_03_07_09_27AM.htm (accessed Feb. 27, 2025).