AI Implementation

C++

The C++ code implements a data pipeline that uses OpenVINO GenAI for both the LLM and speech-to-text (S2T) stages, storing the extracted data in JSON format (via the nlohmann::json library) for use by the frontend application.

The OpenVINO GenAI library we used was the dynamic library downloaded from here. This is the format recommended by their sample code, and although a static build exists, it proved barely more efficient than the dynamic library when used with GenAI, while sacrificing flexibility by tying the compiled code to specific hardware. The downside is that the setupvars.ps1 script must be run before the compiled executable, every single time it is run: the library requires certain environment variables to be set in order to locate the OpenVINO DLL files, which the setup script sets temporarily.

Command line arguments are supplied to the compiled executable using the boost::program_options library, controlling which features the frontend requests for extraction. The same flags are stored in the frontend AI caller, which runs the binary stored at the relative path ./resources/cppVer.exe.

po::options_description general_options("Allowed options");
general_options.add_options()
    ("help,h", "produce help message")
    ("debug,d", "enable debug mode")
    ("whisper,w", "use whisper mode")
    // ... other options

po::options_description llm_options("LLM only options");
llm_options.add_options()
    ("status", "extract status from lyrics")
    ("extractColour,c", "extract colours from lyrics")
    // ... other LLM options
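
These option groups are then combined and parsed into a variables map. A minimal sketch of that step, assuming the usual boost::program_options flow and the vm name used later in main (the real code may group the options differently, e.g. with a separate whisper group):

// Sketch: combine the option groups and parse argv into vm (inside main).
po::options_description all_options;
all_options.add(general_options).add(llm_options);

po::variables_map vm;
po::store(po::parse_command_line(argc, argv, all_options), vm);
po::notify(vm);

if (vm.count("help")) {
    std::cout << all_options << std::endl;  // print the auto-generated help text
    return 0;
}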

These expose the following arguments, giving the user limited control over the AI generation so as to prevent inappropriate content or data formats that could break the database.

  Allowed options:
-h, --help: produce help message
-d, --debug: enable debug mode
-w, --whisper: use whisper mode
-l, --llm: use llm mode
-s, --song: specify song id
--text_log: enable text logging
-m, --model: specify model name
-e, --electron: enable electron mode, exe is run from Super Happy Space

Whisper only options
--fixSampleRate: fix sample rate of audio file to 16kHz

LLM only options
--smallerLLM: use a smaller LLM model with fewer parameters
--status: extract status from lyrics
-c, --extractColour: extract colours from lyrics
-p, --extractParticle: extract particle effect from lyrics
-o, --extractObject: extract objects from lyrics
-b, --extractBackground: extract backgrounds from lyrics
--generateObjectPrompts: generate object image prompts
--generateBackgroundPrompts: generate background image prompts
--all: extract all llm features

Main Function

The main function orchestrates the overall flow of the application by:

  • parsing the command line arguments
  • configuring paths based on the runtime environment
  • validating the requirements for the requested operation
  • initialising and running the appropriate pipeline
  • performing cleanup before exiting
int main(int argc, char *argv[]) {
    // Parse command-line options
    // ...

    // Set paths based on environment
    // ...

    // Validate requirements, e.g.
    // check that a model type is specified
    if (!vm.count("whisper") && !vm.count("llm") &&
        !vm.count("stable-diffusion")) {
        std::cerr << "Error: Please specify a model type to use" << std::endl;
        return 1;
    }

    // Execute the requested pipeline
    if (vm.count("whisper")) {
        // Run Whisper pipeline
        // ...
    }

    if (vm.count("llm")) {
        // Run LLM pipeline
        // ...
    }

    // Cleanup
    // ...
    return 0;
}

After the pipeline is configured, the code base splits into 2 main classes:

  1. Whisper - handles audio transcription of lyrics - calls the ov::genai::WhisperPipeline
  2. LLM - performs text analysis and content generation - calls the ov::genai::LLMPipeline

Although I have used classes to bundle the relevant functions together, I did not adopt a full OOP design with an AI superclass, as the two classes have little in common beyond calling the pipeline constructor. A class was chosen over a namespace because the relevant functions need to share state, such as the pipeline object and the final unordered_map that is stored as JSON.

The constructor of each class sets up the appropriate pipeline from the OpenVINO GenAI library by passing in the device type and the model directory. The current implementation works with models in the OpenVINO IR format, which can be converted from most models on Hugging Face using their Optimum CLI tool. For our specific project, we used the default pre-converted models on OpenVINO's Hugging Face page.

Before setting up each AI pipeline in the constructors, the code detects the compatible devices. It defaults to the NPU or GPU if one is available, and otherwise uses the first available device. OpenVINO supports all CPU architectures by default (even if not optimally), so the CPU is always listed and no error handling is needed for the case where availableDevices is empty.

std::string getModelDevice() {
    ov::Core core;
    std::vector<std::string> availableDevices = core.get_available_devices();

    // Prefer an NPU or GPU if one is available
    for (const auto &device : availableDevices) {
        if (device.find("NPU") != std::string::npos) {
            return device;
        }
        if (device.find("GPU") != std::string::npos) {
            return device;
        }
    }
    return availableDevices[0];
}
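
As an illustration, the device string returned here is then passed straight into the GenAI pipeline constructors; a hedged one-liner (modelPath is an illustrative name, and the real member initialisation happens inside the class constructors described below):

// Sketch: the selected device string is forwarded to the pipeline constructor.
std::string device = getModelDevice();
ov::genai::LLMPipeline pipe(modelPath, device);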

LLM Class

The LLM class is structured in the following format:

class LLM {
private:
    const std::string device;
    ov::genai::LLMPipeline pipe;
    const std::string songName;
    const std::string lyrics;
    const bool debug;
    std::string lyricsSetup;
    std::string shorterLyricsSetup;
    std::string outputFilePath;
    std::unordered_map<LLMOutputType, std::vector<std::string>> outputMap;

    std::string generate(std::string prompt, int max_new_tokens);
    void retrieveCurrentOutput();

public:
    LLM(std::string llmModelPath, std::string songName, bool debug);
    void extractColours();
    void extractStatus();
    void extractParticleEffect();
    void extractObjects();
    void extractBackgrounds();
    void generateObjectPrompts();
    void generateBackgroundPrompts();
    void jsonStoreData();
};

The constructor initialises the LLM pipeline, preparing both full and truncated prompts and loading any existing output using the retrieveCurrentOutput function. This function reads the JSON file at the outputFilePath field and stores its data as an unordered_map. Although the LLM does not use any of the data currently stored in the JSON file, nlohmann::json does not support updating a JSON file in place; when writing to the database, the whole file is deleted and rewritten. To preserve the original data that the LLM does not update, the initial values must therefore be read in first. LLMOutputType is an enum detailed further in the Global Arguments section.

void retrieveCurrentOutput() {
    json j;
    // Read existing json data from the file if it exists
    std::ifstream inputFile(outputFilePath);
    if (inputFile.is_open()) {
        std::cout << "Reading existing data from file" << std::endl;
        inputFile >> j;
        inputFile.close();
    } else {
        j = json();
    }
    // Store existing data in outputMap
    for (const auto &output : j.items()) {
        LLMOutputType outputType = outputTypeMapReverse.at(output.key());
        if (outputTypeIsVector.at(outputType)) {
            outputMap[outputType] = output.value();
        } else {
            outputMap[outputType] =
                std::vector<std::string>{output.value().get<std::string>()};
        }
    }
}
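
The complementary write path is jsonStoreData, which is not reproduced here in full. A minimal sketch, assuming it simply rewrites the whole file from outputMap using the outputTypeMap and outputTypeIsVector tables described in the Global Arguments section:

// Sketch of jsonStoreData: overwrite the JSON file from outputMap,
// writing single values as strings and multi-value entries as arrays.
void jsonStoreData() {
    json j;
    for (const auto &[outputType, values] : outputMap) {
        const std::string &key = outputTypeMap.at(outputType);
        if (outputTypeIsVector.at(outputType)) {
            j[key] = values;               // keep the whole vector as a JSON array
        } else if (!values.empty()) {
            j[key] = values.front();       // store the single value as a string
        }
    }
    std::ofstream outputFile(outputFilePath);
    outputFile << j.dump(4) << std::endl;  // delete-and-rewrite, as described above
}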

Each of the other functions in the LLM class creates a final prompt by combining the global prompts, detailed in the Global Arguments section, with the lyrics. If the debug flag is set, the raw LLM output is also printed to the terminal. This output is not accessible from the frontend, since the terminal is only visible when the exe is launched from a terminal, which is rarely the case.

void extractColours() {
    std::cout << "Extracting colours from lyrics" << std::endl;
    std::string colourPrompt = lyricsSetup + colourExtractionPrompt;
    std::string colourOutput;
    try {
        colourOutput = generate(colourPrompt, 500);
    } catch (const std::bad_alloc &e) {
        std::cerr << "Bad allocation error: " << e.what() << std::endl;
        std::cerr << "Trying with shorter lyrics" << std::endl;
        colourOutput = generate(shorterLyricsSetup + colourExtractionPrompt, 500);
    }

    // ... parse the LLM output into the colours vector

    outputMap[COLOURS] = colours;
    outputMap[COLOURS_REASON] = {colourOutput};

    if (debug) {
        std::cout << "Colours extracted: " << std::endl;
        std::cout << colourOutput << std::endl;
    }
}
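
The private generate helper is not shown above. A plausible sketch, assuming it simply forwards the prompt to ov::genai::LLMPipeline::generate with a per-call token budget:

// Sketch of the private generate wrapper (assumed implementation).
std::string LLM::generate(std::string prompt, int max_new_tokens) {
    ov::genai::GenerationConfig config = pipe.get_generation_config();
    config.max_new_tokens = max_new_tokens;  // cap the length of this response
    return pipe.generate(prompt, config);
}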

Whisper Class

The Whisper class has the following structure:

class Whisper {
private:
    const std::string device;
    ov::genai::WhisperPipeline pipe;
    const std::string songId;
    const bool debug;

    void saveLyrics(std::string lyrics) {
        // Save lyrics to file
    }

public:
    Whisper(std::string songId, bool debug);
    void generateLyrics();
};

The constructor initialises the Whisper pipeline with the appropriate model and device, storing it in the pipe field.
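
A hedged sketch of what that constructor might look like, assuming the member wiring implied by the class definition above:

// Sketch of the Whisper constructor: pick a device, then build the pipeline.
Whisper::Whisper(std::string songId, bool debug)
    : device(getModelDevice()),
      pipe(whisperModelPath, device),  // WhisperPipeline(models_path, device)
      songId(songId),
      debug(debug) {}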

After the Whisper class has been instantiated, the generateLyrics public function is called for the configured songId. The Whisper config is hard-set to a maximum of 500 new tokens (around 380 words) and English language in transcribe mode. For processing the audio file, I have reused the utils::audio::read_wav function defined in OpenVINO GenAI's sample code (available at https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/cpp/whisper_speech_recognition). It converts the wavPath into the raw speech input required by the library, which is why the code requires a specific WAV format of 16kHz.

void Whisper::generateLyrics() {
    std::string wavPath = (wavDirPath / (songId + ".wav")).string();

    // Configure the generation settings
    ov::genai::WhisperGenerationConfig config = pipe.get_generation_config();
    config.max_new_tokens = 500;
    config.language = "<|en|>";
    config.task = "transcribe";
    config.return_timestamps = true;

    // Process the audio file
    ov::genai::RawSpeechInput rawSpeech = utils::audio::read_wav(wavPath);
    std::string lyrics = pipe.generate(rawSpeech, config);

    // Save the results
    saveLyrics(lyrics);
}
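
saveLyrics itself just persists the transcript. A minimal sketch, assuming the lyrics land in lyricsDirPath under the song id (the exact file name and extension are assumptions):

// Sketch of saveLyrics; the real file naming may differ.
void Whisper::saveLyrics(std::string lyrics) {
    std::filesystem::path lyricsPath = lyricsDirPath / (songId + ".txt");
    std::ofstream lyricsFile(lyricsPath);
    lyricsFile << lyrics;
}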

Global Arguments

As the system relies on a specific directory structure for models and assets, global constants are declared for commonly used paths. To support multiple platforms, std::filesystem::path is used so that the pre-compiled code is portable to any system.

// ----------------- paths -----------------
std::filesystem::path currentDirectory = std::filesystem::current_path();
std::string gemmaModelPath;
std::string smallerLLMPath;
std::string stableDiffusionModelPath;
std::filesystem::path whisperModelPath;
std::filesystem::path songDataPath;
std::string particleListFilePath;
std::string logPath;
std::filesystem::path lyricsDirPath;
std::filesystem::path wavDirPath;
std::filesystem::path imageDirPath;

These paths can be adjusted based on the runtime environment: specifically, when the -e, --electron flag is set, the path structure is reformatted to match our project structure after the frontend is packaged.
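
As a purely illustrative example (the actual directory names are project-specific and set in main), the environment-dependent path setup could look roughly like this:

// Hypothetical sketch only: the real layout and names differ.
void configurePaths(bool electronMode) {
    // When launched from the packaged frontend (-e/--electron), assets are
    // assumed to sit in a resources folder next to the executable.
    std::filesystem::path base =
        electronMode ? currentDirectory / "resources" : currentDirectory;
    whisperModelPath = base / "models" / "whisper";
    wavDirPath = base / "assets" / "wav";
    lyricsDirPath = base / "assets" / "lyrics";
}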

Another global argument is the prompt templates, which we have carefully crafted to guide the LLM’s analysis.

std::string colourExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 5 unique,"
    "unusual colors (avoid common colors like red, green, or blue) that are "
    "explicitly mentioned or strongly implied."
    // ... more prompt text

std::string statusPrompt = // ...
std::string particleSelectionPrompt = // ...
std::string lyricsPrompt = // ...
std::string objectExtractionPrompt =
    "Analyse the lyrics of the song provided and extract 3 unique, unusual "
    "objects that are explicitly mentioned or strongly implied."
    "Give the output in the following exact format for easy extraction using "
    "regex:"
    "Object 1: $Object name$"
    "Object 2: $Object name$"
    "Object 3: $Object name$";
std::string backgroundExtractionPrompt = // ...
std::string imageSetup = // ...
std::string imageSettings = // ...
std::string objectSettings = // ...
std::string backgroundSettings = // ...

A combination of these global prompts and the lyrics is used to generate the final prompt passed into the LLM. When multiple outputs are expected, specifically for the object extraction and the background extraction, the prompt gives explicit instructions to wrap each output in $ symbols. A helper function called getOptionsFromLlmOutput is then called to extract all the words wrapped in $ using the regex expression "\\$(.*?)\\$".
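
A minimal sketch of that helper, assuming it returns every $-delimited capture in order of appearance (requires <regex>, <string>, and <vector>):

// Sketch of getOptionsFromLlmOutput: collect every "$...$" capture group.
std::vector<std::string> getOptionsFromLlmOutput(const std::string &llmOutput) {
    std::vector<std::string> options;
    std::regex pattern("\\$(.*?)\\$");
    for (auto it = std::sregex_iterator(llmOutput.begin(), llmOutput.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        options.push_back((*it)[1].str());  // group 1 is the text between the $ signs
    }
    return options;
}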

As this JSON is read from both the Electron code and the C++ code, the JSON keys must be kept in sync between them. To make switching JSON keys easier, a set of global hashmaps and enums is used to convert between all the possible LLM output types. As the JSON data contains both string and list values, we store it as a map of vectors and convert an entry to a string by taking the first value of its vector when outputTypeIsVector is false.

enum LLMOutputType {
    // fields which are not generated by the LLM
    ID,
    TITLE,
    // ...

    // fields which are generated by the LLM
    STATUS,
    COLOURS,
    // ...
};

const std::unordered_map<LLMOutputType, std::string> outputTypeMap = {
    {ID, "id"},
    {TITLE, "title"},
    // ...
};

const std::unordered_map<std::string, LLMOutputType> outputTypeMapReverse = {
    {"id", ID},
    {"title", TITLE},
    // ...
};

const std::unordered_map<LLMOutputType, bool> outputTypeIsVector = {
    {ID, false},
    {TITLE, false},
    // ...
};