AI Implementation

C++

The C++ code implements a data pipeline that uses OpenVINO GenAI for both the LLM and speech-to-text (S2T) stages, storing the extracted data in JSON format (via the nlohmann::json library) for use by the frontend application.

The OpenVINO GenAI library we used was the dynamic library downloaded from here. This is the format recommended by their sample code, and although a static build exists, it proved barely more efficient than the dynamic library when used with GenAI, while sacrificing flexibility by tying the compiled code to specific hardware. The downside is that the setupvars.ps1 script must be run before the compiled executable, every single time it is run: the library requires certain environment variables to be set in order to locate the OpenVINO DLL files, which the setup script sets temporarily.

Command line arguments are supplied to the compiled executable using the boost::program_options library, controlling which features the frontend requests for extraction. The same flags are stored in the frontend AI caller, which runs the binary stored at the relative path ./resources/cppVer.exe.

po::options_description general_options("Allowed options");
general_options.add_options()
    ("help,h", "produce help message")
    ("debug,d", "enable debug mode")
    ("whisper,w", "use whisper mode")
    // ... other options

po::options_description llm_options("LLM only options");
llm_options.add_options()
    ("status", "extract status from lyrics")
    ("extractColour,c", "extract colours from lyrics")
    // ... other LLM options
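
These option groups are then combined and parsed into a variables map. A minimal sketch of that step, assuming the usual boost::program_options flow and the vm name used later in main (the real code may group the options differently, e.g. with a separate whisper group):

// Sketch: combine the option groups and parse argv into vm (inside main).
po::options_description all_options;
all_options.add(general_options).add(llm_options);

po::variables_map vm;
po::store(po::parse_command_line(argc, argv, all_options), vm);
po::notify(vm);

if (vm.count("help")) {
    std::cout << all_options << std::endl;  // print the auto-generated help text
    return 0;
}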

These expose the following arguments, giving the user limited control over the AI generation so as to prevent inappropriate content or data formats that could break the database.

  Allowed options:
-h, --help: produce help message
-d, --debug: enable debug mode
-w, --whisper: use whisper mode
-l, --llm: use llm mode
-s, --song: specify song id
--text_log: enable text logging
-m, --model: specify model name
-e, --electron: enable electron mode, exe is run from Super Happy Space

Whisper only options
--fixSampleRate: fix sample rate of audio file to 16kHz

LLM only options
--smallerLLM: use a smaller LLM model with fewer parameters
--status: extract status from lyrics
-c, --extractColour: extract colours from lyrics
-p, --extractParticle: extract particle effect from lyrics
-o, --extractObject: extract objects from lyrics
-b, --extractBackground: extract backgrounds from lyrics
--generateObjectPrompts: generate object image prompts
--generateBackgroundPrompts: generate background image prompts
--all: extract all llm features

Main Function

The main function orchestrates the overall flow of the application by:

  • parsing the command line arguments
  • configuring paths based on the runtime environment
  • validating the requirements for the requested operation
  • initialising and running the appropriate pipeline
  • performing cleanup before exiting
int main(int argc, char *argv[]) {
    // Parse command-line options
    // ...

    // Set paths based on environment
    // ...

    // Validate requirements, e.g.
    // check that a model type is specified
    if (!vm.count("whisper") && !vm.count("llm") &&
        !vm.count("stable-diffusion")) {
        std::cerr << "Error: Please specify a model type to use" << std::endl;
        return 1;
    }

    // Execute the requested pipeline
    if (vm.count("whisper")) {
        // Run Whisper pipeline
        // ...
    }

    if (vm.count("llm")) {
        // Run LLM pipeline
        // ...
    }

    // Cleanup
    // ...
    return 0;
}

After the pipeline is configured, the code base splits into 2 main classes:

  1. Whisper - handles audio transcription of lyrics - calls the ov::genai::WhisperPipeline
  2. LLM - performs text analysis and content generation - calls the ov::genai::LLMPipeline

Although I have used classes to bundle the relevant functions together, I did not adopt a full OOP design with an AI superclass, as the two classes have little in common beyond calling the pipeline constructor. A class was chosen over a namespace because the relevant functions need to share state, such as the pipeline object and the final unordered_map that is stored as JSON.

The constructor of each class sets up the appropriate pipeline from the OpenVINO GenAI library by passing in the device type and the model directory. The current implementation works with models in the OpenVINO IR format, which can be converted from most models on Hugging Face using their Optimum CLI tool. For our specific project, we used the default pre-converted models on OpenVINO's Hugging Face page.

Before setting up each AI pipeline in the constructors, the code detects the compatible devices. It defaults to the NPU or GPU if one is available, and otherwise uses the first available device. OpenVINO supports all CPU architectures by default (even if not optimally), so the CPU is always listed and no error handling is needed for the case where availableDevices is empty.

std::string getModelDevice() {
    ov::Core core;
    std::vector<std::string> availableDevices = core.get_available_devices();

    // Prefer an NPU or GPU if one is available
    for (const auto &device : availableDevices) {
        if (device.find("NPU") != std::string::npos) {
            return device;
        }
        if (device.find("GPU") != std::string::npos) {
            return device;
        }
    }
    return availableDevices[0];
}
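
As an illustration, the device string returned here is then passed straight into the GenAI pipeline constructors; a hedged one-liner (modelPath is an illustrative name, and the real member initialisation happens inside the class constructors described below):

// Sketch: the selected device string is forwarded to the pipeline constructor.
std::string device = getModelDevice();
ov::genai::LLMPipeline pipe(modelPath, device);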

LLM Class

The LLM class is structured in the following format:

class LLM {
private:
    const std::string device;
    ov::genai::LLMPipeline pipe;
    const std::string songName;
    const std::string lyrics;
    const bool debug;
    std::string lyricsSetup;
    std::string shorterLyricsSetup;
    std::string outputFilePath;
    std::unordered_map<LLMOutputType, std::vector<std::string>> outputMap;

    std::string generate(std::string prompt, int max_new_tokens);
    void retrieveCurrentOutput();

public:
    LLM(std::string llmModelPath, std::string songName, bool debug);
    void extractColours();
    void extractStatus();
    void extractParticleEffect();
    void extractObjects();
    void extractBackgrounds();
    void generateObjectPrompts();
    void generateBackgroundPrompts();
    void jsonStoreData();
};

The constructor initialises the LLM pipeline, preparing both full and truncated prompts and loading any existing output using the retrieveCurrentOutput function. This function reads the JSON file at the outputFilePath field and stores its data as an unordered_map. Although the LLM does not use any of the data currently stored in the JSON file, nlohmann::json does not support updating a JSON file in place; when writing to the database, the whole file is deleted and rewritten. To preserve the original data that the LLM does not update, the initial values must therefore be read in first. LLMOutputType is an enum detailed further in the Global Arguments section.

void retrieveCurrentOutput() {
    json j;
    // Read existing json data from the file if it exists
    std::ifstream inputFile(outputFilePath);
    if (inputFile.is_open()) {
        std::cout << "Reading existing data from file" << std::endl;
        inputFile >> j;
        inputFile.close();
    } else {
        j = json();
    }
    // Store existing data in outputMap
    for (const auto &output : j.items()) {
        LLMOutputType outputType = outputTypeMapReverse.at(output.key());
        if (outputTypeIsVector.at(outputType)) {
            outputMap[outputType] = output.value();
        } else {
            outputMap[outputType] =
                std::vector<std::string>{output.value().get<std::string>()};
        }
    }
}
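
The complementary write path is jsonStoreData, which is not reproduced here in full. A minimal sketch, assuming it simply rewrites the whole file from outputMap using the outputTypeMap and outputTypeIsVector tables described in the Global Arguments section:

// Sketch of jsonStoreData: overwrite the JSON file from outputMap,
// writing single values as strings and multi-value entries as arrays.
void jsonStoreData() {
    json j;
    for (const auto &[outputType, values] : outputMap) {
        const std::string &key = outputTypeMap.at(outputType);
        if (outputTypeIsVector.at(outputType)) {
            j[key] = values;               // keep the whole vector as a JSON array
        } else if (!values.empty()) {
            j[key] = values.front();       // store the single value as a string
        }
    }
    std::ofstream outputFile(outputFilePath);
    outputFile << j.dump(4) << std::endl;  // delete-and-rewrite, as described above
}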

Each of the other functions in the LLM class creates a final prompt by combining the global prompts, detailed in the Global Arguments section, with the lyrics. If the debug flag is set, the raw LLM output is also printed to the terminal. This output is not accessible from the frontend, since the terminal is only visible when the exe is launched from a terminal, which is rarely the case.

void extractColours() {
    std::cout << "Extracting colours from lyrics" << std::endl;
    std::string colourPrompt = lyricsSetup + colourExtractionPrompt;
    std::string colourOutput;
    try {
        colourOutput = generate(colourPrompt, 500);
    } catch (const std::bad_alloc &e) {
        std::cerr << "Bad allocation error: " << e.what() << std::endl;
        std::cerr << "Trying with shorter lyrics" << std::endl;
        colourOutput = generate(shorterLyricsSetup + colourExtractionPrompt, 500);
    }

    // ... parse the LLM output into the colours vector

    outputMap[COLOURS] = colours;
    outputMap[COLOURS_REASON] = {colourOutput};

    if (debug) {
        std::cout << "Colours extracted: " << std::endl;
        std::cout << colourOutput << std::endl;
    }
}
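
The private generate helper is not shown above. A plausible sketch, assuming it simply forwards the prompt to ov::genai::LLMPipeline::generate with a per-call token budget:

// Sketch of the private generate wrapper (assumed implementation).
std::string LLM::generate(std::string prompt, int max_new_tokens) {
    ov::genai::GenerationConfig config = pipe.get_generation_config();
    config.max_new_tokens = max_new_tokens;  // cap the length of this response
    return pipe.generate(prompt, config);
}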

Whisper Class

The Whisper class has the following structure:

class Whisper {
private:
    const std::string device;
    ov::genai::WhisperPipeline pipe;
    const std::string songId;
    const bool debug;

    void saveLyrics(std::string lyrics) {
        // Save lyrics to file
    }

public:
    Whisper(std::string songId, bool debug);
    void generateLyrics();
};

The constructor initialises the Whisper pipeline with the appropriate model and device, storing it in the pipe field.
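
A hedged sketch of what that constructor might look like, assuming the member wiring implied by the class definition above:

// Sketch of the Whisper constructor: pick a device, then build the pipeline.
Whisper::Whisper(std::string songId, bool debug)
    : device(getModelDevice()),
      pipe(whisperModelPath, device),  // WhisperPipeline(models_path, device)
      songId(songId),
      debug(debug) {}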

After the Whisper class has been instantiated, the generateLyrics public function is called for the configured songId. The Whisper config is hard-set to a maximum of 500 new tokens (around 380 words) and English language in transcribe mode. For processing the audio file, I have reused the utils::audio::read_wav function defined in OpenVINO GenAI's sample code (available at https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/cpp/whisper_speech_recognition). It converts the wavPath into the raw speech input required by the library, which is why the code requires a specific WAV format of 16kHz.

void Whisper::generateLyrics() {
    std::string wavPath = (wavDirPath / (songId + ".wav")).string();

    // Configure the generation settings
    ov::genai::WhisperGenerationConfig config = pipe.get_generation_config();
    config.max_new_tokens = 500;
    config.language = "<|en|>";
    config.task = "transcribe";
    config.return_timestamps = true;

    // Process the audio file
    ov::genai::RawSpeechInput rawSpeech = utils::audio::read_wav(wavPath);
    std::string lyrics = pipe.generate(rawSpeech, config);

    // Save the results
    saveLyrics(lyrics);
}
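
saveLyrics itself just persists the transcript. A minimal sketch, assuming the lyrics land in lyricsDirPath under the song id (the exact file name and extension are assumptions):

// Sketch of saveLyrics; the real file naming may differ.
void Whisper::saveLyrics(std::string lyrics) {
    std::filesystem::path lyricsPath = lyricsDirPath / (songId + ".txt");
    std::ofstream lyricsFile(lyricsPath);
    lyricsFile << lyrics;
}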

Global Arguments

As the system relies on a specific directory structure for models and assets, global constants are declared for commonly used paths. To support multiple platforms, std::filesystem::path is used so that the pre-compiled code is portable to any system.

// ----------------- paths -----------------
std::filesystem::path currentDirectory = std::filesystem::current_path();
std::string gemmaModelPath;
std::string smallerLLMPath;
std::string stableDiffusionModelPath;
std::filesystem::path whisperModelPath;
std::filesystem::path songDataPath;
std::string particleListFilePath;
std::string logPath;
std::filesystem::path lyricsDirPath;
std::filesystem::path wavDirPath;
std::filesystem::path imageDirPath;

These paths can be adjusted based on the runtime environment: specifically, when the -e, --electron flag is set, the path structure is reformatted to match our project structure after the frontend is packaged.
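
As a purely illustrative example (the actual directory names are project-specific and set in main), the environment-dependent path setup could look roughly like this:

// Hypothetical sketch only: the real layout and names differ.
void configurePaths(bool electronMode) {
    // When launched from the packaged frontend (-e/--electron), assets are
    // assumed to sit in a resources folder next to the executable.
    std::filesystem::path base =
        electronMode ? currentDirectory / "resources" : currentDirectory;
    whisperModelPath = base / "models" / "whisper";
    wavDirPath = base / "assets" / "wav";
    lyricsDirPath = base / "assets" / "lyrics";
}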

Another global argument is the prompt templates, which we have carefully crafted to guide the LLM’s analysis.

std::string colourExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 5 unique,"
    "unusual colors (avoid common colors like red, green, or blue) that are "
    "explicitly mentioned or strongly implied."
    // ... more prompt text

std::string statusPrompt = // ...
std::string particleSelectionPrompt = // ...
std::string lyricsPrompt = // ...
std::string objectExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 3 unique, unusual "
    "objects that are explicitly mentioned or strongly implied."
    "Give the output in the following exact format for easy extraction using "
    "regex:"
    "Object 1: $Object name$"
    "Object 2: $Object name$"
    "Object 3: $Object name$";
std::string backgroundExtractionPrompt = // ...
std::string imageSetup = // ...
std::string imageSettings = // ...
std::string objectSettings = // ...
std::string backgroundSettings = // ...

A combination of these global prompts and the lyrics is used to generate the final prompt passed into the LLM. When multiple outputs are expected, specifically for the object extraction and the background extraction, the prompt gives explicit instructions to wrap each output in $ symbols. A helper function called getOptionsFromLlmOutput is then called to extract all the words wrapped in $ using the regex expression "\\$(.*?)\\$".
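
A minimal sketch of that helper, assuming it returns every $-delimited capture in order of appearance (requires <regex>, <string>, and <vector>):

// Sketch of getOptionsFromLlmOutput: collect every "$...$" capture group.
std::vector<std::string> getOptionsFromLlmOutput(const std::string &llmOutput) {
    std::vector<std::string> options;
    std::regex pattern("\\$(.*?)\\$");
    for (auto it = std::sregex_iterator(llmOutput.begin(), llmOutput.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        options.push_back((*it)[1].str());  // group 1 is the text between the $ signs
    }
    return options;
}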

As this JSON is read from both the Electron code and the C++ code, the JSON keys must be kept in sync between them. To make switching JSON keys easier, a set of global hashmaps and enums is used to convert between all the possible LLM output types. As the JSON data contains both string and list values, we store it as a map of vectors and convert an entry to a string by taking the first value of its vector when outputTypeIsVector is false.

enum LLMOutputType {
    // fields which are not generated by the LLM
    ID,
    TITLE,
    // ...

    // fields which are generated by the LLM
    STATUS,
    COLOURS,
    // ...
};

const std::unordered_map<LLMOutputType, std::string> outputTypeMap = {
    {ID, "id"},
    {TITLE, "title"},
    // ...
};

const std::unordered_map<std::string, LLMOutputType> outputTypeMapReverse = {
    {"id", ID},
    {"title", TITLE},
    // ...
};

const std::unordered_map<LLMOutputType, bool> outputTypeIsVector = {
    {ID, false},
    {TITLE, false},
    // ...
};