Speech-to-Text (STT) Implementation

1. Model Selection and Voice Cloning

This feature allows users to select AI models and upload audio samples for voice cloning using the following frameworks and libraries:

  • Vue.js: For building the user interface.
  • Vuetify: For UI components.
  • LocalStorage: For storing user settings and uploaded files.

Key Components

The SettingsOverlay.vue component includes:

  • Dropdowns for AI and Speech-to-Text model selection.
  • File input for uploading audio samples.
  • Functions to handle file uploads and model selection.

<v-select
  v-model="selectedModel"
  label="Select Model"
  :items="['OpenAI', '9b Model', '3b Model']"
  @update:modelValue="saveSelectedModel"
></v-select>

<v-file-input
  v-model="voiceClips"
  multiple
  label="Upload audio clips"
  accept="audio/*"
  @change="handleFileUpload"
></v-file-input>

<v-btn @click="startVoiceCloning" :disabled="uploadedClips.length === 0">Clone Voice</v-btn>
        

Code Explanation

This section uses Vue's Composition API for reactive state management. The selected model is stored in the settingsStore and saved to local storage, while audio files are managed through a file input component, with voice cloning triggered via the startVoiceCloning function.
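
A minimal sketch of the handlers referenced above, assuming a Pinia-style settingsStore; the store path and exact ref names are illustrative and may differ from the project's code:

// SettingsOverlay.vue (script setup) – illustrative sketch only
import { ref } from 'vue';
import { useSettingsStore } from '@/stores/settingsStore'; // assumed store location

const settingsStore = useSettingsStore();
const selectedModel = ref(settingsStore.selectedModel ?? 'OpenAI');
const voiceClips = ref([]);
const uploadedClips = ref([]);

// Persist the chosen model in the store and in localStorage
function saveSelectedModel(model) {
  settingsStore.selectedModel = model;
  localStorage.setItem('selectedModel', model);
}

// Keep the uploaded audio files so the Clone Voice button can be enabled
function handleFileUpload() {
  uploadedClips.value = Array.isArray(voiceClips.value) ? voiceClips.value : [voiceClips.value];
}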

2. Microphone Recording and Transcription

This feature enables microphone recording and transcription using the following technologies:

  • Vue.js: For building the user interface.
  • Hugging Face Transformers: For leveraging the Whisper ASR models.
  • MediaDevices API: For accessing the microphone input.

Key Components of Standard Speech-to-Text

The MicButton.vue component includes:

  • Functions to start and stop recording using the MediaDevices API.
  • Integration with Hugging Face Whisper models for transcription.

async function startRecording() {
  audioStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(audioStream);
  audioChunks = [];

  mediaRecorder.ondataavailable = (e) => {
    if (e.data.size > 0) audioChunks.push(e.data);
  };

  mediaRecorder.onstop = async () => {
    const audioBlob = new Blob(audioChunks, { type: 'audio/wav' });
    const output = await transcriber(URL.createObjectURL(audioBlob));
    if (output?.text) {
      model.value = output.text;
      emit("textAvailable", output.text);
    }
    cleanup();
  };

  mediaRecorder.start();
}
        

Code Explanation

This section uses the MediaDevices API for microphone access. Recorded audio is processed using the Hugging Face Whisper model to generate and display transcriptions.
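
For reference, a sketch of how the transcriber used in startRecording could be created with Transformers.js, plus the matching stop handler; the Whisper checkpoint name is only an example, not necessarily the one used in the project:

import { pipeline } from '@huggingface/transformers';

// Load a Whisper ASR pipeline once and reuse it for every recording
// ('Xenova/whisper-tiny.en' is an example checkpoint)
let transcriber = null;
async function loadTranscriber() {
  if (!transcriber) {
    transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  }
  return transcriber;
}

// Stopping the MediaRecorder triggers the onstop handler above, which runs the transcription
function stopRecording() {
  if (mediaRecorder && mediaRecorder.state !== 'inactive') {
    mediaRecorder.stop();
    audioStream.getTracks().forEach((t) => t.stop()); // release the microphone
  }
}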

Key Components of Speaker Diarization

This feature enables speaker diarization using the following technologies:

  • Pyannote: For speaker segmentation.

Key Components (demo code rewritten to explain the logic only)

1. Model Loading Architecture

Dual-model system with specialized initialization sequence:

// Component mounting sequence
onMounted(async () => {
  // Speech-to-Text Model
  transcriber = await pipeline("automatic-speech-recognition", selectedModel);

  // Diarization Processor
  segmentationProcessor = await AutoProcessor.from_pretrained(
    'onnx-community/pyannote-segmentation-3.0'
  );

  // Speaker Segmentation Model (WebAssembly optimized)
  segmentationModel = await AutoModelForAudioFrameClassification.from_pretrained(
    'onnx-community/pyannote-segmentation-3.0',
    { device: 'wasm', dtype: 'fp32' }
  );
});

2. Core Merging Algorithm

Three-phase temporal alignment strategy:

function mergeResults(transcription, diarization) {
  // Phase 1: Data sanitization – drop very short or low-confidence segments
  const validSegments = diarization.filter(s =>
    s.end - s.start >= 0.5 && s.confidence >= 0.8
  );

  // Phase 2: Temporal alignment (findSpeaker is defined in the next section)
  const formattedSegments = transcription.chunks.reduce((acc, chunk) => {
    const speaker = findSpeaker(chunk.timestamp, validSegments);
    return mergeSegments(acc, chunk, speaker);
  }, []);

  // Phase 3: Segment consolidation
  function mergeSegments(acc, chunk, speaker) {
    const last = acc[acc.length - 1];
    const newSegment = {
      start: chunk.timestamp[0],
      end: chunk.timestamp[1],
      text: chunk.text.trim(),
      speaker
    };

    // Merge consecutive chunks from the same speaker separated by less than 1.5s
    if (last && last.speaker === speaker && (chunk.timestamp[0] - last.end < 1.5)) {
      last.text += ` ${newSegment.text}`;
      last.end = newSegment.end;
      return acc;
    }
    return [...acc, newSegment];
  }

  return formattedSegments;
}

3. Speaker Matching Logic

Multi-criteria temporal analysis:

function findSpeaker([start, end], segments) {
  // 1. Full containment check
  const container = segments.find(s => 
    s.start <= start && s.end >= end
  );
  
  // 2. Significant overlap (35% threshold)
  const overlapped = segments.filter(s => {
    const overlap = Math.min(end, s.end) - Math.max(start, s.start);
    return overlap / (end - start) >= 0.35;
  });
  
  // 3. Midpoint proximity fallback
  const midpoint = (start + end) / 2;
  return container || overlapped[0] || 
    segments.reduce((a,b) => 
      Math.abs(midpoint - (a.start+a.end)/2) < 
      Math.abs(midpoint - (b.start+b.end)/2) ? a : b
    );
}
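
A hedged usage sketch tying the pieces together: transcribe with chunk timestamps, run the pyannote segmentation model, post-process the logits into speaker segments, and merge. The post_process_speaker_diarization call follows the Transformers.js pyannote example; the project's actual glue code may differ:

async function transcribeWithSpeakers(audio) { // audio: Float32Array at 16 kHz
  // 1. Transcription with timestamps so each chunk can be aligned to a speaker
  const transcription = await transcriber(audio, { return_timestamps: true });

  // 2. Speaker segmentation (pyannote) – frame-level logits
  const inputs = await segmentationProcessor(audio);
  const { logits } = await segmentationModel(inputs);

  // 3. Convert logits into { id, start, end, confidence } segments
  const [diarization] = segmentationProcessor.post_process_speaker_diarization(
    logits,
    audio.length
  );

  // 4. Temporal alignment of text chunks and speaker segments
  return mergeResults(transcription, diarization);
}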

Key Components of Realtime Speech-to-Text

1. Intelligent Noise Reduction

Real-time audio processing pipeline with multi-stage filtering:


  const noiseGate = audioContext.createDynamicsCompressor();
  noiseGate.threshold.value = -50; // Signal threshold in dB 
  noiseGate.knee.value = 40;       // Transition range in dB
  noiseGate.ratio.value = 12;      // Compression ratio of 12:1 for effective noise reduction
  noiseGate.attack.value = 0.003;  // Attack time in seconds (3ms) 
  noiseGate.release.value = 0.25;  // Release time in seconds (250ms)
  
  // Apply spectral filtering via low-pass filter
  const lowPassFilter = audioContext.createBiquadFilter();
  lowPassFilter.type = 'lowpass';
  lowPassFilter.frequency.value = 8000; // Cutoff frequency at 8kHz preserves speech harmonics
  
  // Configure audio processing signal chain
  mediaStreamSource.connect(noiseGate);
  noiseGate.connect(lowPassFilter);


// Buffer microphone audio and forward it for transcription once per second
const SEND_INTERVAL_MS = 1000;

scriptProcessor.onaudioprocess = e => {
  const chunk = e.inputBuffer.getChannelData(0);
  // Skip silent chunks
  if (!chunk.some(s => s !== 0)) return;

  // Buffer audio in 1s intervals
  audioBuffers.push(new Float32Array(chunk));
  if (Date.now() - lastSendTime >= SEND_INTERVAL_MS) {
    // Merge the buffered chunks and send them for transcription (see the sketch below)
  }
};
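
The merge step elided above could look like the following sketch: concatenate the buffered Float32Array chunks into one contiguous buffer, reset the buffer list and timer, and hand the audio off for recognition. The mergeBuffers helper and the worker message shape are illustrative, not taken from the project:

// Illustrative helper: concatenate buffered chunks into a single Float32Array
function mergeBuffers(buffers) {
  const totalLength = buffers.reduce((sum, b) => sum + b.length, 0);
  const merged = new Float32Array(totalLength);
  let offset = 0;
  for (const b of buffers) {
    merged.set(b, offset);
    offset += b.length;
  }
  return merged;
}

// Inside the onaudioprocess handler, once the interval has elapsed:
// const merged = mergeBuffers(audioBuffers);
// audioBuffers.length = 0;
// lastSendTime = Date.now();
// worker.postMessage({ audio: merged }, [merged.buffer]); // transfer to the fallback worker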

2. Fallback Processing System

Robust failure recovery mechanisms:

  • Main Thread Pipeline: High-quality primary processing
  • Web Worker Fallback: Secondary real-time processing (see the worker sketch after the code below)

async function stopRecording() {
  try {
    // Primary processing attempt
    const result = await transcriber(fullAudio);
    model.value = result.text;
    // send to other components
  } catch (error) {
    // Fallback to worker results
  }
}
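
A minimal sketch of the Web Worker fallback listed above, assuming a dedicated module worker; the file name, checkpoint, and message shape are placeholders rather than the project's actual code:

// transcription.worker.js – illustrative fallback worker
import { pipeline } from '@huggingface/transformers';

let workerTranscriber = null;

self.onmessage = async (event) => {
  const { audio } = event.data; // Float32Array posted from the main thread

  // Lazily load a small Whisper checkpoint for fast partial results
  if (!workerTranscriber) {
    workerTranscriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-tiny.en');
  }

  const { text } = await workerTranscriber(audio);
  self.postMessage({ text }); // partial result consumed by throttledUpdate on the main thread
};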

3. Realtime Display with Throttling


/**
 * Handle partial transcription results with throttling
 * Updates model value with intermediate results for responsive UI
 */
const throttledUpdate = throttle((text) => {
  if (!text) return;
  
  if (partialResult.value !== text) {
    if (partialResult.value) {
      accumulatedText.value += partialResult.value + ' ';
    }
    partialResult.value = text;
  }
  
  if (model !== undefined) {
    model.value = accumulatedText.value + text;
  }
}, 200);
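
The throttle helper is not defined in the snippet; it could come from lodash (lodash.throttle) or be a small hand-rolled utility. A minimal leading-edge version, for illustration only:

// Invoke fn at most once per `wait` milliseconds (leading edge only)
function throttle(fn, wait) {
  let last = 0;
  return (...args) => {
    const now = Date.now();
    if (now - last >= wait) {
      last = now;
      fn(...args);
    }
  };
}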

3. Message Generation

This feature generates messages using selected models with the following technologies:

  • CreateMLCEngine (WebLLM): For generating messages with in-browser LLM models.
  • OpenAI API: For generating responses using OpenAI's models.

Key Components

The MessageStore.js file includes:

  • Functions for message generation based on the selected model and the input prompt.
  • Integration with CreateMLCEngine or OpenAI API for generating messages.

const chatCompletionModel = settingStore.selectedLLMModel === "OpenAI"
  ? new OpenAIImplementation()
  : new WebLLMImplementation();

async create(messages, retryCount = 3) {
  try {
    // Wait for any in-flight engine initialization to finish
    while (this.engineLoading) {
      await sleep(100);
    }

    // Lazily create the engine on first use
    if (this.engine === null) {
      this.engineLoading = true;
      try {
        await this.setup();
      } finally {
        this.engineLoading = false;
      }
    }

    const completion = await this.engine.chat.completions.create({
      messages,
    });
    console.log(completion);
    console.log('The result is', completion.choices[0].message.content);

    // Strip any text around the JSON list and repair common formatting issues before parsing
    return JSON.parse(
      completion.choices[0].message.content.trim()
        .replace(/(^[^[]+|[^\]]+$)/g, '')            // drop text before '[' and after ']'
        .replace(/,\s*]$/, ']')                      // remove a trailing comma
        .replace(/"\s+"/g, '", "')                   // restore missing separators between strings
        .replace(/(?<=\{)\s*([^"]+?)\s*:/g, '"$1":') // quote bare object keys
    );
  } catch (error) {
    // Retry a limited number of times before surfacing the error
    if (retryCount > 0) {
      return this.create(messages, retryCount - 1);
    }
    throw error;
  }
}
        

Code Explanation

This section selects the appropriate model based on user settings and generates messages by sending commands to the model, then processing and returning the responses.
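
For context, a sketch of how the WebLLMImplementation referenced above might initialize its engine with CreateMLCEngine; the class shape and model id are assumptions, while engine.chat.completions.create matches the call used in create():

import { CreateMLCEngine } from '@mlc-ai/web-llm';

class WebLLMImplementation {
  constructor() {
    this.engine = null;
    this.engineLoading = false;
  }

  // Download and initialize an in-browser LLM
  // ('Llama-3.2-3B-Instruct-q4f16_1-MLC' is an example model id, not necessarily the one used here)
  async setup() {
    this.engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f16_1-MLC', {
      initProgressCallback: (report) => console.log('Loading model:', report.text),
    });
  }
}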


async function generateSentences() {
  const command = `Given the Current Conversation History, generate a list of 3 to 5 short generic sentences the 
  assistant may want to say. You must respond only with a valid JSON list of suggestions and NOTHING else.`;
  let messages = [
    {role: "system", content: getSentenceSystemMessage()},
    {role: "user", content: command}
  ];
  sentenceSuggestions.value = await chatCompletionModel.getResponse(messages, false, true) || sentenceSuggestions.value;
  activeEditHistory.value = activeEditHistory.value.concat([
    {role: "system", content: command},
    {role: "assistant", content: `{"suggestions": ["${sentenceSuggestions.value.join('", "')}"]}`}
  ]);
}

async function generateWords() {
  const command = `Given the Current Conversation History, generate a short list of key words or 
  very short phrases the user can select from to build a new sentence. You must respond only with a valid JSON list 
  of suggestions and NOTHING else.`;
  let messages = [
    {role: "system", content: getKeywordSystemMessage()},
    {role: "user", content: command}
  ];
  wordSuggestions.value = await chatCompletionModel.getResponse(messages, true, false) || wordSuggestions.value;
}
                  

Code Explanation

This section sends prompts to the LLM to generate sentence and word suggestions based on the conversation history.
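
A hypothetical sketch of how getSentenceSystemMessage might fold the conversation history into the system prompt; the wording and the messageHistory ref are illustrative only:

// Illustrative only – the real prompt wording lives in MessageStore.js
function getSentenceSystemMessage() {
  const history = messageHistory.value
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n');
  return 'You help the user communicate by suggesting short sentences they might want to say.\n' +
         `Current Conversation History:\n${history}`;
}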

Text-to-Speech (TTS) Implementation

Implementation Details

Frameworks/Libraries Used:

  • Hugging Face Transformers (@huggingface/transformers): For tokenization, model loading, and speech synthesis.
  • Web Audio API: For playing back the generated waveform.

Implementation Steps

1. Library Imports:


import { AutoTokenizer, AutoProcessor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, Tensor } from '@huggingface/transformers';
        

2. Tokenization and Processing:


const tokenizer = await AutoTokenizer.from_pretrained('Xenova/speecht5_tts');
const processor = await AutoProcessor.from_pretrained('Xenova/speecht5_tts');
        

3. Model Loading:


const model = await SpeechT5ForTextToSpeech.from_pretrained('Xenova/speecht5_tts', { dtype: 'fp32' });
const vocoder = await SpeechT5HifiGan.from_pretrained('Xenova/speecht5_hifigan', { dtype: 'fp32' });
        

4. Speaker Embeddings:


const speaker_embeddings_data = new Float32Array(
  await (await fetch('/custom_speaker_embedding_single.bin')).arrayBuffer()
);
const speaker_embeddings = new Tensor(
  'float32',
  speaker_embeddings_data,
  [1, speaker_embeddings_data.length]
);
        

5. Generate Speech:


const { input_ids } = tokenizer(text);
const { waveform } = await model.generate_speech(input_ids, speaker_embeddings, { vocoder });
        

6. Audio Playback:


const sampleRate = 16000; // Replace with your actual sampling rate if different

// Create an AudioContext
const audioContext = new AudioContext();

// Create an AudioBuffer with 1 channel, length equal to waveform.size, and the desired sample rate
const audioBuffer = audioContext.createBuffer(1, waveform.size, sampleRate);

// Copy the waveform data into the AudioBuffer (assumes waveform.data is a Float32Array)
audioBuffer.copyToChannel(waveform.data, 0, 0);

// Create a buffer source node and set its buffer to the AudioBuffer
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;

// Connect the source node to the AudioContext's destination (the speakers)
source.connect(audioContext.destination);

// Start playback. (Browsers require a user gesture to start AudioContext if it is suspended.)
source.start();
        

Summary

The TTS feature is implemented using the Hugging Face Transformers library for text tokenization, processing, and speech synthesis. The generated audio waveform is played back via the Web Audio API, and custom speaker embeddings personalize the voice output.
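
Putting the steps together, a minimal end-to-end helper using the objects created above (tokenizer, model, vocoder, speaker_embeddings); this mirrors the snippets in steps 5 and 6 rather than the project's exact code:

// Sketch: synthesize `text` with SpeechT5 and play it through the speakers
async function speak(text) {
  const { input_ids } = tokenizer(text);
  const { waveform } = await model.generate_speech(input_ids, speaker_embeddings, { vocoder });

  const audioContext = new AudioContext();
  const audioBuffer = audioContext.createBuffer(1, waveform.size, 16000); // 16 kHz output, as in step 6
  audioBuffer.copyToChannel(waveform.data, 0, 0);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);
  source.start(); // must follow a user gesture if the AudioContext is suspended
}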