Audio Processing

Audio Analysis

The real-time beat detection and audio analysis functionalities are primarily managed within the Player component. This component utilises the Web Audio API to continuously analyse the audio stream, extracting frequency data and calculating energy levels across various frequency bands in order to identify musical beats and vocal dynamics in the audio playback.

Audio Context and Analyser Node Initialisation

The audio analysis pipeline is initiated by setting up an instance of the Web Audio API's AudioContext. Within this context, an AnalyserNode is created, configured with a Fast Fourier Transform (FFT) size of 2048 and a smoothing time constant of 0.85. These parameters provide sufficient frequency resolution and smooth data for accurate beat detection.

The audio stream from the HTML audio element is routed through a MediaElementAudioSourceNode into the analyser, allowing direct processing of the live audio playback data.
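
A minimal sketch of this setup is shown below; the variable names are illustrative assumptions rather than the actual Player code.

// A minimal sketch of the analyser setup; names are illustrative, not the actual Player implementation.
const audioElement = document.querySelector('audio') as HTMLMediaElement;

const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;                // yields 1024 frequency bins
analyser.smoothingTimeConstant = 0.85;  // smooths frame-to-frame fluctuations

// Route live playback through the analyser while keeping it audible.
const source = audioCtx.createMediaElementSource(audioElement);
source.connect(analyser);
analyser.connect(audioCtx.destination);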

Frequency Data Extraction

At regular intervals (approximately every 33 milliseconds), the analyser node extracts frequency data from the audio stream, representing the amplitude levels across a broad range of frequencies. This data is segmented into four distinct bands, each corresponding roughly to common musical ranges:

  • Bass frequencies: The lowest 8% of the frequency spectrum, capturing deep and impactful rhythmic elements such as kicks or bass notes.
  • Mid-low frequencies: From 8% to 25% of the spectrum, typically representing lower midrange sounds, including rhythmic elements like snare drums or rhythmic guitars.
  • Mid-high frequencies: From 25% to 60% of the spectrum, predominantly capturing vocals, melodic instruments, and other midrange dynamics.
  • High frequencies: The highest frequencies above 60%, capturing cymbals, high hats, and other percussive details.
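
As a sketch, assuming the 1024 frequency bins produced by the 2048-point FFT and the percentage boundaries listed above, the band segmentation might look like this (names are illustrative):

// Continuing from the analyser setup above; boundaries follow the percentages listed.
const freqData = new Uint8Array(analyser.frequencyBinCount); // 1024 bins for fftSize 2048

setInterval(() => {
  analyser.getByteFrequencyData(freqData); // amplitude 0-255 per frequency bin

  const n = freqData.length;
  const bass    = freqData.slice(0, Math.floor(n * 0.08));
  const midLow  = freqData.slice(Math.floor(n * 0.08), Math.floor(n * 0.25));
  const midHigh = freqData.slice(Math.floor(n * 0.25), Math.floor(n * 0.60));
  const high    = freqData.slice(Math.floor(n * 0.60));

  // ...per-band energy computation follows in the next section
}, 33); // roughly 30 analysis frames per second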

Energy Level Computation

The energy within each frequency band is computed by averaging the amplitude levels of frequencies within the segment. Special weighting is applied to the bass frequencies, as they have a pronounced effect on the perceived rhythm and beats within the audio:

  • Bass Energy is boosted slightly to enhance the sensitivity to rhythmic bass patterns.
  • Mid-Low, Mid-High, and High Energy levels are calculated by simple averaging to provide balanced sensitivity to changes across these ranges.
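
A sketch of the per-band computation, continuing from the band segmentation above; the 1.2x bass boost is an assumed illustrative factor, not the exact weighting used in the component.

// Per-band energy as the mean amplitude; the 1.2x bass boost is an assumed factor.
const average = (band: Uint8Array) => band.reduce((sum, v) => sum + v, 0) / band.length;

const bassEnergy    = average(bass) * 1.2;  // boosted for rhythmic sensitivity (factor assumed)
const midLowEnergy  = average(midLow);
const midHighEnergy = average(midHigh);
const highEnergy    = average(high);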

Vocal Energy Estimation

Vocal energy is specifically monitored by focusing on mid-high and high-frequency bands, which commonly represent the frequency spectrum of human voices. This is essential for detecting vocal-driven musical elements and distinguishing them from purely instrumental segments.

The algorithm maintains a history of vocal energy levels, applying a decay factor to smooth fluctuations. Two averages are computed:

  • Recent Vocal Average: Focuses on recent vocal energy levels, representing short-term vocal dynamics.
  • Long-Term Vocal Average: Captures overall vocal energy trends throughout playback.

These averages determine when vocal energy crosses a dynamic threshold, indicating sustained vocal presence, and are used to enhance the responsiveness of lighting and visual effects.
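
As a sketch continuing from the band energies above (the band weighting, decay factor, history length, and threshold multiplier are all assumed values):

// Illustrative vocal-energy smoothing; all constants here are assumptions for the sketch.
const vocalHistory: number[] = [];
let smoothedVocal = 0;
const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / (xs.length || 1);

const rawVocal = midHighEnergy * 0.7 + highEnergy * 0.3;     // band weighting assumed
smoothedVocal = smoothedVocal * 0.9 + rawVocal * 0.1;        // decay factor assumed

vocalHistory.push(smoothedVocal);
if (vocalHistory.length > 120) vocalHistory.shift();         // history length assumed

const recentVocalAvg   = mean(vocalHistory.slice(-10));      // short-term vocal dynamics
const longTermVocalAvg = mean(vocalHistory);                 // overall trend across playback

// Sustained vocal presence once the recent average clears a dynamic threshold.
const vocalActive = recentVocalAvg > longTermVocalAvg * 1.3; // multiplier assumed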

Beat Detection Algorithm

Beat detection operates by continuously comparing the total combined energy (weighted heavily towards bass frequencies) against a dynamically adjusted threshold, calculated based on historical averages. Specifically, a rolling energy history (20 samples) is maintained, from which an average energy baseline is derived.

When the current total energy exceeds this dynamically adjusted threshold (typically between 1.1 and 1.4 times the historical average), and a minimum hold time has elapsed since the previous detected beat, a beat event is triggered.

Beat triggering conditions:

  • Total energy significantly exceeds the dynamically computed threshold.
  • Minimum interval (beatHoldTime, default 100ms) elapsed since the last beat to prevent rapid re-triggering.
  • Bass energy exceeds a minimum floor (15 amplitude points) to avoid false positives from non-rhythmic high-frequency noise.

When a beat is detected, internal state variables update accordingly, prompting potential visual changes (e.g., lighting colours, brightness pulses).
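
Pulling these conditions together, a sketch of the detection step might look as follows; the bass weighting in the total energy and the callback name are assumptions, while the 20-sample history, 100 ms hold time, and bass floor of 15 match the values described above.

// Illustrative beat detection combining the conditions listed above; names are assumed.
const energyHistory: number[] = [];
let lastBeatTime = 0;
const beatHoldTime = 100; // ms, default hold time

const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / (xs.length || 1);

// Total energy is weighted towards bass (weighting factor assumed).
const totalEnergy = bassEnergy * 2 + midLowEnergy + midHighEnergy + highEnergy;

energyHistory.push(totalEnergy);
if (energyHistory.length > 20) energyHistory.shift(); // rolling 20-sample baseline

const threshold = mean(energyHistory) * 1.3;          // within the 1.1-1.4 range described above
const now = performance.now();

if (totalEnergy > threshold && now - lastBeatTime > beatHoldTime && bassEnergy > 15) {
  lastBeatTime = now;
  onBeatDetected(totalEnergy); // update state and trigger visual changes (hypothetical callback)
}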

Colour Management and Beat Responsiveness

Each beat detection contributes to a counter that, upon reaching a configurable threshold, triggers a colour change for the integrated lighting system (such as Philips Hue). This periodic colour transition is managed with internal timers and counters to ensure visual coherence, avoiding overly frequent or abrupt colour shifts.

The colours used are either retrieved dynamically from a database (associated with the currently playing track) or default to a preset array of vibrant colours if database retrieval is unsuccessful.
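
As a sketch, the counter-based rotation could look like the following; the change threshold of 4 beats and the fallback palette values are assumptions.

// Illustrative colour rotation on every Nth beat; the threshold of 4 and the palette are assumed.
const fallbackColours = ['#ff004c', '#00c8ff', '#ffd000', '#8a2be2']; // preset vibrant defaults
const trackColours: string[] = []; // filled from the database for the current track, when available

let beatCounter = 0;
let colourIndex = 0;
let currentColour = fallbackColours[0];

function onBeatDetected(totalEnergy: number) {
  beatCounter += 1;
  if (beatCounter >= 4) {
    beatCounter = 0;
    const palette = trackColours.length > 0 ? trackColours : fallbackColours;
    currentColour = palette[colourIndex++ % palette.length];
  }
  // totalEnergy also drives the brightness pulse described in the next section
}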

Brightness Modulation

Brightness levels of visual elements respond dynamically to beats and vocal energy:

  • On beat detection, brightness briefly increases proportionally to the total energy detected.
  • During sustained periods of elevated vocal energy, brightness further elevates, reflecting vocal intensity in the visual output.
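
A minimal sketch of that modulation, with the scaling factors and bounds as assumptions:

// Illustrative brightness modulation; scaling factors and bounds are assumed.
let brightness = 0.5; // normalised 0-1 baseline

function updateBrightness(totalEnergy: number, vocalActive: boolean, beat: boolean) {
  if (beat) {
    brightness = Math.min(1, 0.5 + totalEnergy / 512); // pulse proportional to detected energy
  }
  if (vocalActive) {
    brightness = Math.min(1, brightness + 0.15);       // extra lift during sustained vocals
  }
  brightness = Math.max(0.3, brightness * 0.95);       // decay back towards baseline between beats
}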

Integration with Philips Hue Lighting

When the Hue lighting integration (useHue) is active, detected beats and vocal dynamics are communicated in real time to the Hue bridge using a structured data format. This data includes detailed energy metrics (bass, mid, high, vocal), brightness levels, and the current colour, enabling the Hue lights to synchronise precisely with musical elements. More information about the implementation of light syncing can be found in the Philips Hue section.
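
The exact payload is defined by the Hue integration (see the Philips Hue section); purely as an illustration, the per-frame update might be shaped like this, where the field and channel names are assumptions:

// Illustrative per-frame update to the Hue integration; field and channel names are assumptions.
import { ipcRenderer } from 'electron';

function sendHueUpdate(beat: boolean, bass: number, mid: number, high: number,
                       vocal: number, brightness: number, colour: string) {
  ipcRenderer.send('hue-update', {
    beat,
    energy: { bass, mid, high, vocal },
    brightness,
    colour,
  });
}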

System Audio

The default mode for adding a song is the system audio mode. The tsx file AddCustomSong.tsx creates a form that the user can submit to the Electron Main via ipcRenderer. To create a new song, the database needs the following information:

  • songName
  • artistName
  • thumbnailPath
  • audioPath

The songName and artistName are taken from user-filled text fields, while the thumbnailPath uses the same logic as uploading custom background images: the form invokes the ipcRenderer channel open-file-dialog, which is handled in the Electron Main process as shown below. It opens a file dialog that accepts images with jpg, png, gif, or jpeg extensions and returns the selected path to be stored.

When the new song is created in the database by calling the save-custom-song channel, this image is copied to the path assets/images/{songId}/jacket.png and stored as the jacket. On song creation, the statuses are all set randomly.

ipcMain.handle('open-file-dialog', async () => {
  const result = await dialog.showOpenDialog({
    properties: ['openFile'],
    filters: [{ name: 'Images', extensions: ['jpg', 'png', 'gif', 'jpeg'] }]
  });
  return result;
});
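
On the renderer side, the form would invoke this channel roughly as follows; this is a sketch, and the state setter name is hypothetical.

// Sketch of the renderer-side call from AddCustomSong.tsx; exact preload wiring may differ.
import { ipcRenderer } from 'electron';

const pickThumbnail = async () => {
  const result = await ipcRenderer.invoke('open-file-dialog');
  if (!result.canceled && result.filePaths.length > 0) {
    setThumbnailPath(result.filePaths[0]); // hypothetical state setter for the chosen image path
  }
};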

Initially, the songId was stored as the YouTube ID, before the custom audio file option was added. It was later changed to a uuidv4 code, which is extremely unlikely to collide given the number of songs users are likely to add manually.
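
For illustration, generating such an identifier with the uuid package looks like this:

import { v4 as uuidv4 } from 'uuid';

const songId = uuidv4(); // e.g. '3b241101-e2bb-4255-8caf-4136c566a962'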

In the same way as above, the audio file can be linked to the database by selecting its file path via the select-audio-file channel, which follows the same logic as open-file-dialog but for audio files. Currently the code accepts mp3, wav, ogg, m4a, flac and webm files. After the audio file is chosen, the generated songId and the audio file path are sent on. The audio is then converted into a 16 kHz WAV file for Whisper analysis and an MP3 file for playback in the app.

const saveAudio = async (tempFile, outputFile, mp3OutputFile, onlyMp3) => {
  console.log('Converting audio to WAV and MP3...');
  // Convert directly to 16kHz WAV
  if (!onlyMp3) {
    await new Promise((resolve, reject) => {
      ffmpeg(tempFile)
        .audioFrequency(16000)
        .toFormat('wav')
        .on('error', (err) => {
          reject(new Error(`FFmpeg conversion error: ${err.message}`));
        })
        .on('end', resolve)
        .save(outputFile);
    });
  }

  // Convert to MP3
  await new Promise((resolve, reject) => {
    ffmpeg(tempFile)
      .audioBitrate('320k')
      .toFormat('mp3')
      .on('error', (err) => {
        reject(new Error(`FFmpeg conversion error: ${err.message}`));
      })
      .on('end', resolve)
      .save(mp3OutputFile);
  });

  // Clean up the temporary file
  if (tempFile.slice(-3) === 'm4a') {
    fs.unlinkSync(tempFile);
  }
}

As in the YouTube Audio section, Whisper is run to extract the lyrics directly after the WAV file is generated.

Although most audio files can be linked, we also implemented a Screen Recorder feature, which allows the user to record the system audio using the MediaRecorder API available in Electron's renderer. When the app is started, the Electron Main process sets up a media request handler that captures the video and audio of the primary screen and provides the stream to the Electron Renderer upon request. When start recording is pressed, a new MediaRecorder is initialised, capturing the audio in webm format. After the recording is stopped, the data is stored under recordedChunks.

const mediaRecorder = new MediaRecorder(audioStream, { mimeType: 'audio/webm' });
mediaRecorder.ondataavailable = (event) => {
  if (event.data.size > 0) {
    // Add directly to ref
    chunksRef.current.push(event.data);
    // Also update state for UI if needed
    setRecordedChunks([...chunksRef.current]);
  }
};

The recorded chunks are converted into a Blob before being stored to a custom location as a webm file. The path to this file is saved and stored in the same way as linking an audio file above.
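
A sketch of that conversion step is shown below; the save-recorded-audio channel name is an assumption.

// Sketch of assembling the recording into a single Blob and handing it to the main process.
import { ipcRenderer } from 'electron';

const saveRecording = async () => {
  const blob = new Blob(chunksRef.current, { type: 'audio/webm' });
  const bytes = new Uint8Array(await blob.arrayBuffer());
  return ipcRenderer.invoke('save-recorded-audio', bytes); // resolves to the saved path (channel assumed)
};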

To prevent users from recording for too long and clogging disk space, a timer with a default limit of 30 minutes is set. When the recording exceeds this limit, it automatically stops and the user is prompted to save the audio file. Saving can be cancelled to conserve space.

const MAX_DURATION = 30 * 60; // seconds

const interval = setInterval(() => {
  setDuration(prev => {
    const newDuration = prev + 1;
    if (newDuration >= MAX_DURATION) {
      clearInterval(interval);
      stopRecording();
      return MAX_DURATION;
    }
    return newDuration;
  });
}, 1000);

YouTube Audio

Besides the default mode of adding songs through System Audio, the application includes functionality to process audio from YouTube videos, enabling users to integrate online content seamlessly. This involves downloading the audio, converting it into multiple formats, extracting metadata, and organising it for playback and analysis.

Downloading and Conversion Process

To provide high-quality playback and analysis, the YouTube audio processing follows a structured pipeline:

  1. Download – The application uses yt-dlp to extract the highest quality audio stream from the given YouTube URL.
  2. Format Conversion – The audio file is processed using FFmpeg to generate:
    • A 16kHz WAV file, optimised for Whisper-based lyric extraction.
    • A 320kbps MP3 file, used for playback within the application.
  3. Metadata Extraction – The system retrieves essential details, including the song title, artist name, and thumbnail image.
  4. File Organisation – Processed files are stored within the application’s asset structure, ensuring accessibility for playback and further analysis.

Code Implementation

The main functionality is encapsulated in the youtubeToWav.js and youtubeToMP3.js files, with two primary functions:

  1. downloadYoutubeAudio: Handles downloading and converting the audio
  2. getYoutubeMetadata: Extracts and saves metadata including thumbnails

The following function handles the entire download and conversion workflow:

const downloadYoutubeAudio = async (url, onlyMp3) => {
  ...
  // 1. Set up paths and extract video ID
  const videoId = url.match(/(v=)([^&]*)/)[2];
  ...
  // 2. Download audio using yt-dlp
  await ytdlp(url, {
    format: 'bestaudio',
    output: tempFile,
    'no-check-certificates': true,
    'prefer-free-formats': true,
  });
  ...
  // 3. Convert to WAV if needed (16kHz for audio analysis)
  if (!onlyMp3) {
    await new Promise((resolve, reject) => {
      ffmpeg(tempFile)
        .audioFrequency(16000)
        .toFormat('wav')
        .on('error', reject)
        .on('end', resolve)
        .save(outputFile);
    });
  }
  ...
  // 4. Convert to MP3 (high quality)
  await new Promise((resolve, reject) => {
    ffmpeg(tempFile)
      .audioBitrate('320k')
      .toFormat('mp3')
      .on('error', reject)
      .on('end', resolve)
      .save(mp3OutputFile);
  });
  ...
  // 5. Return the video ID for further use
  return videoId;
}

Metadata Extraction

To enhance the user experience, the system extracts metadata from YouTube, ensuring each song is displayed with relevant details. This includes:

  1. Video title (used as song title)
  2. Artist name (used as artist)
  3. Thumbnail image (saved as album artwork)

This metadata provides visual and descriptive context for each track within the application.

const getYoutubeMetadata = async (url) => {
  try {
    const metadata = await ytdlp(url, {
      dumpSingleJson: true,
      noCheckCertificates: true,
      preferFreeFormats: true,
      verbose: true,
    });
    console.log('Metadata:', metadata);
    const title = metadata.title;
    const artist = metadata.uploader;
    const thumbnailUrl = metadata.thumbnail; // Get the thumbnail URL

    // Extract video ID from the URL
    const videoId = url.match(/(v=)([^&]*)/)[2];

    // Create directory for the thumbnail
    const thumbnailDir = getResourcePath('assets', 'images', videoId);
    const thumbnailPath = path.join(thumbnailDir, 'jacket.png');

    if (!fs.existsSync(thumbnailDir)) {
      fs.mkdirSync(thumbnailDir, { recursive: true });
    }

    // Download the thumbnail
    await new Promise((resolve, reject) => {
      https.get(thumbnailUrl, (response) => {
        if (response.statusCode !== 200) {
          reject(new Error(`Failed to download thumbnail: ${response.statusCode}`));
          return;
        }

        const fileStream = fs.createWriteStream(thumbnailPath);
        response.pipe(fileStream);

        fileStream.on('finish', () => {
          fileStream.close();
          resolve();
        });

        fileStream.on('error', (err) => {
          fs.unlinkSync(thumbnailPath);
          reject(err);
        });
      }).on('error', reject);
    });

    console.log('Title:', title);
    console.log('Artist:', artist);
    console.log('Thumbnail saved to:', thumbnailPath);

    // Return the relative path to be stored in the database
    return {
      title,
      artist,
      thumbnailPath: `images/${videoId}/jacket.png`
    };
  } catch (error) {
    console.error('Error fetching metadata:', error);
    throw error;
  }
};

Configuration and Setup

The implementation carefully manages paths to ensure FFmpeg binaries are correctly located:

// Configure FFmpeg paths
const ffmpegPath = mainPaths.ffmpegPath;
const ffprobePath = mainPaths.ffprobePath;

// Verify FFmpeg binaries exist
if (!fs.existsSync(ffmpegPath)) {
  throw new Error(`FFmpeg not found at: ${ffmpegPath}`);
}

// Configure FFmpeg
ffmpeg.setFfmpegPath(ffmpegPath);
ffmpeg.setFfprobePath(ffprobePath);

By organising each track with a unique video ID, the system prevents duplicate entries and ensures efficient storage management.

Integration with the Application

YouTube audio processing is fully integrated into the system, allowing users to:

  • Select a YouTube video for processing.
  • Download, convert, and analyse its audio seamlessly.
  • Access extracted metadata and visuals in the UI.

This functionality expands the application’s versatility, enabling users to incorporate online music effortlessly.

Downloading audio directly from YouTube may violate the platform’s terms of service and copyright laws in certain jurisdictions. To ensure compliance, the default method for adding songs in the application is through system audio recording, which legally captures any audio played on the user’s device. This approach allows users to record music from any source without directly extracting copyrighted content, aligning with fair use policies and local recording laws. Once recorded, the audio file is processed and linked to the database, similar to the system audio workflow, where song details (e.g., name, artist, thumbnail, and file path) are stored for seamless playback and management.