AI Implementation
- C++
- Python
- Integration
C++
The C++ code creates a data pipeline leveraging OpenVINO GenAI for both the LLM and speech-to-text (S2T) tasks, storing the extracted data in JSON format for use by the frontend application via the nlohmann::json library.
The OpenVINO GenAI library we used was the dynamic library downloaded from here. This is the format recommended by their sample code, and although a static build exists, it offered barely any efficiency gain over the dynamic library when used with GenAI, while sacrificing flexibility by tying the build to specific hardware. The downside is that the setupvars.ps1 script must be run before the compiled executable, every time it is run: the library requires certain environment variables to be set in order to locate the OpenVINO DLL files, which the setup script sets temporarily.
Command line arguments are supplied to the compiled executable using the boost::program_options library to control which features the frontend wants extracted. The same flags are stored in the frontend AI caller, which runs the binary stored at the relative path ./resources/cppVer.exe.
po::options_description general_options("Allowed options");
general_options.add_options()
    ("help,h", "produce help message")
    ("debug,d", "enable debug mode")
    ("whisper,w", "use whisper mode")
    // ... other options
    ;

po::options_description llm_options("LLM only options");
llm_options.add_options()
    ("status", "extract status from lyrics")
    ("extractColour,c", "extract colours from lyrics")
    // ... other LLM options
    ;
These expose the following options, giving the user limited control over the AI generation in order to prevent inappropriate content or data formats that could break the database.
Allowed options:
-h, --help: produce help message
-d, --debug: enable debug mode
-w, --whisper: use whisper mode
-l, --llm: use llm mode
-s, --song: specify song id
--text_log: enable text logging
-m, --model: specify model name
-e, --electron: enable electron mode, exe is run from Super Happy Space
Whisper only options
--fixSampleRate: fix sample rate of audio file to 16kHz
LLM only options
--smallerLLM: use a smaller LLM model with fewer parameters
--status: extract status from lyrics
-c, --extractColour: extract colours from lyrics
-p, --extractParticle: extract particle effect from lyrics
-o, --extractObject: extract objects from lyrics
-b, --extractBackground: extract backgrounds from lyrics
--generateObjectPrompts: generate object image prompts
--generateBackgroundPrompts: generate background image prompts
--all: extract all llm features
Main Function
The main function orchestrates the overall flow of the application by:
- parsing the command line arguments
- configuring paths based on the runtime environment
- validating the requirements for the requested operation
- initialising and running the appropriate pipeline
- performing cleanup before exiting
int main(int argc, char *argv[]) {
    // Parse command-line options
    // ...

    // Set paths based on environment
    // ...

    // validate requirements e.g.
    // check if model type is specified
    if (!vm.count("whisper") && !vm.count("llm") &&
        !vm.count("stable-diffusion")) {
        std::cerr << "Error: Please specify a model type to use" << std::endl;
        return 1;
    }

    // Execute the requested pipeline
    if (vm.count("whisper")) {
        // Run Whisper pipeline
        // ...
    }
    if (vm.count("llm")) {
        // Run LLM pipeline
        // ...
    }

    // Cleanup
    // ...
    return 0;
}
After the pipeline is configured, the code base splits into 2 main classes:
Whisper
- Handles audio transcription of lyrics - calls the ov::genai::WhisperPipeline
LLM
- Performs text analysis and content generation - calls the ov::genai::LLMPipeline
Although I have used classes to bundle the relevant functions together, I did not build an OOP hierarchy with an AI superclass, as the 2 classes have little in common apart from their constructors. A class was used instead of a namespace because the relevant functions need to share variables between them, such as the pipeline and the final unordered_map that is eventually stored as JSON.
The constructor of each class sets up the appropriate pipeline from the OpenVINO GenAI library by passing in the device type and the model directory. The current implementation works with models in the OpenVINO IR format, which can be converted from most models on Hugging Face using the Optimum CLI tool. For our specific project, we used the default pre-converted models on OpenVINO's Hugging Face page.
Before setting up each AI pipeline in the constructors, the code automatically detects the compatible devices, defaulting to the NPU or GPU if one is available and to the first available device otherwise. OpenVINO is compatible with all CPU architectures by default, even if not optimised for them, so the CPU always appears in the device list and no error handling is needed for an empty availableDevices.
std::string getModelDevice() {
    ov::Core core;
    std::vector<std::string> availableDevices = core.get_available_devices();
    // Prefer an NPU or GPU if available
    for (const auto &device : availableDevices) {
        if (device.find("NPU") != std::string::npos) {
            return device;
        }
        if (device.find("GPU") != std::string::npos) {
            return device;
        }
    }
    return availableDevices[0];
}
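Putting these together, the constructors might look roughly like the following minimal sketch. This is illustrative only: the exact member initialisation, prompt preparation and config loading in the real code are not shown, and those details are assumptions.

// Illustrative sketch: both constructors pick a device with getModelDevice()
// and build the corresponding OpenVINO GenAI pipeline from the model directory.
LLM::LLM(std::string llmModelPath, std::string songName, bool debug)
    : device(getModelDevice()),
      pipe(llmModelPath, device),  // ov::genai::LLMPipeline(model_dir, device)
      songName(songName),
      debug(debug) {
    // prompt setup and retrieveCurrentOutput() omitted in this sketch
}

Whisper::Whisper(std::string songId, bool debug)
    : device(getModelDevice()),
      pipe(whisperModelPath, device),  // ov::genai::WhisperPipeline(model_dir, device)
      songId(songId),
      debug(debug) {}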
LLM Class
The LLM class is structured in the following format:
class LLM {
private:
    const std::string device;
    ov::genai::LLMPipeline pipe;
    const std::string songName;
    const std::string lyrics;
    const bool debug;
    std::string lyricsSetup;
    std::string shorterLyricsSetup;
    std::string outputFilePath;
    std::unordered_map<LLMOutputType, std::vector<std::string>> outputMap;

    std::string generate(std::string prompt, int max_new_tokens);
    void retrieveCurrentOutput();

public:
    LLM(std::string llmModelPath, std::string songName, bool debug);
    void extractColours();
    void extractStatus();
    void extractParticleEffect();
    void extractObjects();
    void extractBackgrounds();
    void generateObjectPrompts();
    void generateBackgroundPrompts();
    void jsonStoreData();
};
The constructor initialises the LLM pipeline, preparing both full and truncated prompts and loading any existing output using the retrieveCurrentOutput function. This function reads the JSON file located at outputFilePath and stores its data as an unordered_map. Although the LLM does not use any of the existing JSON data, nlohmann::json does not support updating a JSON file in place, so when writing to the database the whole file is deleted and rewritten. To preserve the original data that the LLM does not update, the initial values must therefore be read in first. LLMOutputType is a plain enum, detailed further in the Global Arguments section.
void retrieveCurrentOutput() {
    json j;

    // read existing json data from file if it exists
    std::ifstream inputFile(outputFilePath);
    if (inputFile.is_open()) {
        std::cout << "Reading existing data from file" << std::endl;
        inputFile >> j;
        inputFile.close();
    } else {
        j = json();
    }

    // store existing data in outputMap, wrapping single string values in a vector
    for (const auto &output : j.items()) {
        LLMOutputType outputType = outputTypeMapReverse.at(output.key());
        if (outputTypeIsVector.at(outputType)) {
            outputMap[outputType] = output.value();
        } else {
            outputMap[outputType] =
                std::vector<std::string>{output.value().get<std::string>()};
        }
    }
}
Each of the other LLM functions builds a final prompt by combining the global prompts, detailed in the Global Arguments section, with the lyrics. If the debug flag is set, the raw LLM output is printed to the terminal. This output cannot be seen from the frontend, since the terminal is only visible when the exe is run directly from a terminal, which is very unlikely in practice.
void LLM::extractColours() {
    std::cout << "Extracting colours from lyrics" << std::endl;
    std::string colourPrompt = lyricsSetup + colourExtractionPrompt;
    std::string colourOutput;
    try {
        colourOutput = generate(colourPrompt, 500);
    } catch (const std::bad_alloc &e) {
        std::cerr << "Bad allocation error: " << e.what() << std::endl;
        std::cerr << "Trying with shorter lyrics" << std::endl;
        colourOutput = generate(shorterLyricsSetup + colourExtractionPrompt, 500);
    }

    // ... other logic to parse the colours vector from the LLM output
    outputMap[COLOURS] = colours;
    outputMap[COLOURS_REASON] = {colourOutput};

    if (debug) {
        std::cout << "Colours extracted: " << std::endl;
        std::cout << colourOutput << std::endl;
    }
}
Whisper Class
The whisper class has the following structure:
class Whisper {
private:
    const std::string device;
    ov::genai::WhisperPipeline pipe;
    const std::string songId;
    const bool debug;

    void saveLyrics(std::string lyrics) {
        // Save lyrics to file
    }

public:
    Whisper(std::string songId, bool debug);
    void generateLyrics();
};
The constructor initialises the whisper pipeline with the appropriate model and device, storing it under the pipe field.
After the Whisper class has been instantiated, the generateLyrics public function is called for the configured songId. The Whisper config is hard-set to a maximum of 500 new tokens (around 380 words), English language and transcribe mode. For processing the audio file, I have reused the utils::audio::read_wav function defined in OpenVINO GenAI's sample code (available at https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/cpp/whisper_speech_recognition). It converts the wavPath into the raw speech input required by the library, hence the code requiring WAV files at a 16 kHz sample rate.
void Whisper::generateLyrics() {
    std::string wavPath = (wavDirPath / (songId + ".wav")).string();

    // Configure the generation settings
    ov::genai::WhisperGenerationConfig config = pipe.get_generation_config();
    config.max_new_tokens = 500;
    config.language = "<|en|>";
    config.task = "transcribe";
    config.return_timestamps = true;

    // Process the audio file
    ov::genai::RawSpeechInput rawSpeech = utils::audio::read_wav(wavPath);
    std::string lyrics = pipe.generate(rawSpeech, config);

    // Save the results
    saveLyrics(lyrics);
}
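For completeness, a minimal sketch of what the private saveLyrics helper might look like is shown below. Only lyricsDirPath and songId come from the surrounding code; the file name and the .txt extension are assumptions.

// Illustrative sketch of saveLyrics: writes the transcription to a text file
// named after the song so the LLM stage can read it back later.
// The ".txt" extension is an assumption.
void Whisper::saveLyrics(std::string lyrics) {
    std::filesystem::path lyricsPath = lyricsDirPath / (songId + ".txt");
    std::ofstream outputFile(lyricsPath);
    if (!outputFile.is_open()) {
        std::cerr << "Failed to open lyrics file: " << lyricsPath.string() << std::endl;
        return;
    }
    outputFile << lyrics;
    std::cout << "Lyrics saved to " << lyricsPath.string() << std::endl;
}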
Global Arguments
As the system relies on a specific directory structure for models and assets, global constants are declared for the commonly used paths. To support multiple platforms, std::filesystem::path is used so that the pre-compiled code is portable to any system.
// ----------------- paths -----------------
std::filesystem::path currentDirectory = std::filesystem::current_path();
std::string gemmaModelPath;
std::string smallerLLMPath;
std::string stableDiffusionModelPath;
std::filesystem::path whisperModelPath;
std::filesystem::path songDataPath;
std::string particleListFilePath;
std::string logPath;
std::filesystem::path lyricsDirPath;
std::filesystem::path wavDirPath;
std::filesystem::path imageDirPath;
These paths can be adjusted based on the runtime environment: specifically, when the -e, --electron flag is set, the paths are rewritten to adhere to our project structure after the frontend is packaged.
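A minimal sketch of what this path adjustment might look like is shown below. The configurePaths helper name and the exact directory names are assumptions, loosely based on the assets/ and resources/assets/ layouts used elsewhere in this section.

// Illustrative sketch only: remap asset paths when running inside the packaged
// Electron app (-e / --electron). Directory names are assumptions.
void configurePaths(bool electronMode) {
    std::filesystem::path assetRoot =
        electronMode ? currentDirectory / "resources" / "assets"
                     : currentDirectory / "assets";
    songDataPath = assetRoot / "SongData";
    wavDirPath   = assetRoot / "wav";
    imageDirPath = assetRoot / "images";
}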
Other global arguments are the prompt templates, which we have carefully crafted to guide the LLM's analysis.
std::string colourExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 5 unique,"
    "unusual colors (avoid common colors like red, green, or blue) that are "
    "explicitly mentioned or strongly implied."
    // ... more prompt text

std::string statusPrompt = // ...
std::string particleSelectionPrompt = // ...
std::string lyricsPrompt = // ...

std::string objectExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 3 unique, unusual "
    "objects that are explicitly mentioned or strongly implied."
    "Give the output in the following exact format for easy extraction using "
    "regex:"
    "Object 1: $Object name$"
    "Object 2: $Object name$"
    "Object 3: $Object name$";

std::string backgroundExtractionPrompt = // ...
std::string imageSetup = // ...
std::string imageSettings = // ...
std::string objectSettings = // ...
std::string backgroundSettings = // ...
A combination of these global prompts and the lyrics is used to generate the final prompt passed to the LLM. When multiple outputs are expected, specifically for the object extraction and the background extraction, the prompt gives explicit instructions to wrap each output in $ symbols. A helper function called getOptionsFromLlmOutput is then called to extract all the words wrapped in $ using the regex "\\$(.*?)\\$".
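A minimal sketch of such a helper is shown below; only the function name and the regex come from the actual code, the rest is an assumed implementation.

#include <regex>
#include <string>
#include <vector>

// Illustrative sketch of getOptionsFromLlmOutput: collects every substring
// wrapped in $...$ from the raw LLM output.
std::vector<std::string> getOptionsFromLlmOutput(const std::string &llmOutput) {
    std::vector<std::string> options;
    std::regex pattern("\\$(.*?)\\$");
    for (auto it = std::sregex_iterator(llmOutput.begin(), llmOutput.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        options.push_back((*it)[1].str());  // capture group 1, without the $ signs
    }
    return options;
}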
As this JSON is read from both the Electron code and the C++ code, the JSON keys must be kept in sync between them. To make switching a JSON key easier, a set of global hashmaps and enums is used to convert between all the possible LLM output types and their JSON keys. As the JSON data contains both string and list values, we store the data as a map of vectors and convert back to a string by taking the first value of the vector when outputTypeIsVector is false.
enum LLMOutputType {
    // fields which are not generated by LLM
    ID,
    TITLE,
    // ...
    // fields which are generated by LLM
    STATUS,
    COLOURS,
    // ...
};

const std::unordered_map<LLMOutputType, std::string> outputTypeMap = {
    {ID, "id"},
    {TITLE, "title"},
    // ...
};

const std::unordered_map<std::string, LLMOutputType> outputTypeMapReverse = {
    {"id", ID},
    {"title", TITLE},
    // ...
};

const std::unordered_map<LLMOutputType, bool> outputTypeIsVector = {
    {ID, false},
    {TITLE, false},
    // ...
};
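Building on these maps, a minimal sketch of how jsonStoreData might serialise outputMap back to disk is shown below. This is an assumed implementation that relies on the class members above; only the key maps and the rewrite-the-whole-file behaviour come from the text.

// Illustrative sketch of jsonStoreData: converts outputMap back to JSON using
// the key maps above and rewrites the whole file, as nlohmann::json cannot
// update the file in place.
void LLM::jsonStoreData() {
    json j;
    for (const auto &[outputType, values] : outputMap) {
        const std::string &key = outputTypeMap.at(outputType);
        if (outputTypeIsVector.at(outputType)) {
            j[key] = values;              // store the full list
        } else if (!values.empty()) {
            j[key] = values.front();      // single string value
        }
    }
    std::ofstream outputFile(outputFilePath);  // truncates the old file
    outputFile << j.dump(4);                   // pretty-printed JSON
}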
Python
The Python code uses the StableDiffusionPipeline class from the diffusers library to generate images from the prompts supplied by the command line arguments, which are specified using the argparse library with the following options:
parser = argparse.ArgumentParser("SD")
parser.add_argument("--prompt", help="prompt given to stable diffusion", type=str)
parser.add_argument("--device", help="device to run on", type=str, default="AUTO")
parser.add_argument("--model_id", help="model id to use", type=str, default="sdxs-512-dreamshaper")
parser.add_argument("--model_dir", help="model directory to use", type=str, default="AiResources")
parser.add_argument("--output-dir", help="output path to save results", type=str, default="assets/images")
parser.add_argument("--songId", help="song id to use", type=str)
parser.add_argument("--allSongs", help="use all songs", action="store_true")
parser.add_argument("-e", "--electron", help="run in electron mode", action="store_true")
args = parser.parse_args()
There are 3 different modes which the code accepts:
- Prompt mode
- Single Song mode
- All Song mode
In Prompt mode, the prompt supplied as a command line argument is passed directly to the T2I model for image generation. This is mainly for testing and debugging purposes and is not used in our product.
Single Song mode is the main mode adopted by our software. After the pipe is instantiated from the model directory, the code searches through the JSON files to find the one matching the songId, then iterates over all the background and object prompts, running the image generation for each prompt and storing the results as PNG files under the specified image directory for that song.
pipe = StableDiffusionPipeline.from_pretrained(MODEL_DIR)
jsonFields = ["background_prompts", "object_prompts"]

# SONG MODE
if args.songId:
    path = Path("assets", "SongData", args.songId + ".json")
    imagesDir = Path(args.output_dir, args.songId)
    if args.electron:
        path = Path("resources", "assets", "SongData", args.songId + ".json")
        imagesDir = Path("resources", "assets", "images", args.songId)
    if not path.exists():
        print("Song not found")
    if not imagesDir.exists():
        imagesDir.mkdir()
    with open(path) as f:
        data = json.load(f)
    for field in jsonFields:
        objectList = data[field]
        for (i, obj) in enumerate(objectList):
            sample_text = obj
            result = pipe(sample_text, num_inference_steps=1, guidance_scale=0)
            image = result.images[0]
            image.save(imagesDir / f"{field}_{i}.png")
            print(f"Finished {field}_{i}")
    print("Finished Stable Diffusion")
When generating the images, num_inference_steps (the number of denoising passes the AI uses to refine the image) was set to 1, as this model was specifically fine-tuned for quick inference with that setting. Anything higher gives a foggy mess, as shown below:
For the same reason, guidance_scale (how strongly the generation is steered to follow the prompt) was set to 0, as otherwise this particular model produces horrifying images, as shown below.
Image with num_inference_steps = 3
Image with guidance_scale = 0.5
The final mode, All Song mode, was written with the idea of running the image generation across every song. However, it deletes any previously AI-generated images and errors on songs for which the LLM has not yet been run, since the LLM only runs when the user manually triggers it (explained further in the section below). This option was therefore omitted from the final product, although it was kept as an experimental reference for future products.
Electron Main integration and AI Runner
The 2 compiled AI executable files are stored under resources/cppVer.exe
and resources/SD.exe
.
To run the LLM commands, the frontend invokes the ipcRenderer command run-gemma-with-options with the songId and a list of options specifying which features to extract. These options are converted into the equivalent exe flags by the buildGemmaCommand function. By default the command is run with -e (electron mode, adjusting the file paths to suit Electron's packaged layout), -l (LLM mode) and -s (specifying the songId of the JSON).
// Add the function to build Gemma command with options
function buildGemmaCommand(songId: string, options: Record<string, boolean>) {
  let command = `${exePath} -e -l -s ${songId}`;
  // Add flags based on options
  if (options.extractColour) command += ' -c';
  if (options.extractParticle) command += ' -p';
  // --- other flags
  return command;
}
After the command is created, Node's spawn function is used to run it, sourcing the setupvars.ps1 script first (explained in detail in the C++ implementation section above). This runs the AI as a background process, which stores its results in the JSON file once it finishes execution. Although the JSON files must be reloaded afterwards (done automatically when exiting the song info page, or manually via a button), the premise is that the user can keep enjoying the product while the AI is processing the songs.
// Source setupvars.ps1 in the same PowerShell session, then run the AI command
const process = spawn(
  'powershell',
  [
    '-ExecutionPolicy',
    'Bypass',
    '-Command',
    `& { . '${ps1Path}'; & ${command}; }`,
  ],
);
This method works well, except that there is no way to track from the frontend how far the AI has progressed. Since the user would otherwise have no indication of the AI's current progress, the runAIProcessWithTracking function was added to parse the console logs from the executable files.
// Store the process with its operationId
activeProcesses[operationId] = process;

process.stdout.on('data', (data) => {
  trackProgressFromStdout(data, sender, operationId);
});

process.stderr.on('data', (data) => {
  const errorMessage = data.toString();
  console.error(`⚠️ stderr: ${errorMessage}`);
  const errorData = {
    operationId,
    error: errorMessage
  };
  console.log("Sending error:", errorData);
  sender.send('ai-error', errorData);
});

process.on('close', (code) => {
  console.log(`✅ Process exited with code ${code}`);
  const completeData = {
    operationId,
    exitCode: code
  };
  console.log("Sending process complete:", completeData);
  sender.send('ai-process-complete', completeData);
  // Remove from active processes when done
  delete activeProcesses[operationId];
});
We defined preset completion messages hardcoded into both the AI executables and the Electron Main process. When each AI step, such as statusExtraction or aiSetup, finishes, the executable logs its preset message to the console; Electron Main picks this up and compares it against the current list of progress steps set by the frontend. If one matches, Electron Main sends a message to the Electron Renderer stating that the specific progress step has been completed. On receiving this, the Electron Renderer frontend updates the processing status to complete.
// Define the possible progress steps for tracking
const progressSteps = {
  whisper: 'Finished Whisper',
  llm: 'Finished LLM',
  stableDiffusion: 'Finished Stable Diffusion',
  aiSetup: 'Finished AI Setup',
  statusExtraction: 'Finished Status Extraction',
  // etc...
};

function trackProgressFromStdout(data: Buffer, sender: Electron.WebContents, operationId: string) {
  const output = data.toString();
  console.log(`📜 stdout: ${output}`);
  Object.entries(progressSteps).forEach(([key, message]) => {
    if (output.includes(message)) {
      const progressData = {
        operationId,
        step: key,
        message: message,
        completed: true
      };
      console.log("Sending progress update:", progressData);
      sender.send('ai-progress-update', progressData);
    }
  });
}
As the same code is needed in multiple places throughout the Electron Renderer, three shared pieces were created: AIProgressTracker.tsx to display the progress tracker React components, AIRunner.tsx to handle the commands sent to Electron Main and receive the status updates, and useAIProcessTracking.ts to provide the hooks that store the state of the tracked AI process. This consolidates the boilerplate previously duplicated across multiple locations to run the AI into a single place.
There is also a BatchLLMRunner, which calls the run-gemma-with-options command sequentially on every selected song with the chosen LLM options. This allows users to run all of the LLM processing overnight instead of spending time waiting for the AI to finish song by song.