1. Model Selection and Voice Cloning
This feature allows users to select AI models and upload audio samples for voice cloning using the following frameworks and libraries:
- Vue.js: For building the user interface.
- Vuetify: For UI components.
- LocalStorage: For storing user settings and uploaded files.
Key Components
The SettingsOverlay.vue component includes:
- Dropdowns for AI and Speech-to-Text model selection.
- File input for uploading audio samples.
- Functions to handle file uploads and model selection.
<v-select
  v-model="selectedModel"
  label="Select Model"
  :items="['OpenAI', '9b Model', '3b Model']"
  @update:modelValue="saveSelectedModel"
></v-select>

<v-file-input
  v-model="voiceClips"
  multiple
  label="Upload audio clips"
  accept="audio/*"
  @change="handleFileUpload"
></v-file-input>

<v-btn @click="startVoiceCloning" :disabled="uploadedClips.length === 0">Clone Voice</v-btn>
Code Explanation
This section uses Vue's Composition API for reactive state management. The selected model is stored in the settingsStore and saved to local storage, while audio files are managed through a file input component; voice cloning is triggered via the startVoiceCloning function.
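A minimal sketch of the script-side handlers behind this template, assuming the Composition API with a Pinia-style settings store; the import path, the store composable name, and the localStorage key are assumptions, while the ref and function names come from the bindings shown above:

import { ref } from 'vue';
import { useSettingsStore } from '@/stores/settingsStore'; // assumed store location

const settingsStore = useSettingsStore();
const selectedModel = ref(localStorage.getItem('selectedLLMModel') || 'OpenAI');
const voiceClips = ref([]);
const uploadedClips = ref([]);

// Persist the chosen model in the store and in localStorage so it survives reloads
function saveSelectedModel(model) {
  settingsStore.selectedLLMModel = model;
  localStorage.setItem('selectedLLMModel', model);
}

// Keep only audio files from the file input and track them for the Clone Voice button
function handleFileUpload() {
  uploadedClips.value = (voiceClips.value || []).filter(f => f.type.startsWith('audio/'));
}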
2. Microphone Recording and Transcription
This feature enables microphone recording and transcription using the following technologies:
- Vue.js: For building the user interface.
- Hugging Face Transformers: For leveraging the Whisper ASR models.
- MediaDevices API: For accessing the microphone input.
Key Components of the Standard Speech-to-Text Version
The MicButton.vue component includes:
- Functions to start and stop recording using the MediaDevices API.
- Integration with Hugging Face Whisper models for transcription.
async function startRecording() {
  // Request microphone access via the MediaDevices API
  audioStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(audioStream);
  audioChunks = [];

  // Collect audio data as it becomes available
  mediaRecorder.ondataavailable = (e) => {
    if (e.data.size > 0) audioChunks.push(e.data);
  };

  // When recording stops, wrap the chunks in a Blob and transcribe them with Whisper
  mediaRecorder.onstop = async () => {
    const audioBlob = new Blob(audioChunks, { type: 'audio/wav' });
    const output = await transcriber(URL.createObjectURL(audioBlob));
    if (output?.text) {
      model.value = output.text;
      emit("textAvailable", output.text);
    }
    cleanup();
  };

  mediaRecorder.start();
}
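startRecording leaves stopping and cleanup to two helpers that are not shown above. A minimal sketch, assuming mediaRecorder, audioStream, and audioChunks are the same module-level variables used in startRecording:

function stopRecording() {
  if (mediaRecorder && mediaRecorder.state !== 'inactive') {
    mediaRecorder.stop(); // fires the onstop handler above, which runs the transcriber
  }
}

function cleanup() {
  // Release the microphone and reset recording state
  audioStream?.getTracks().forEach(track => track.stop());
  audioChunks = [];
}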
Code Explanation
This section uses the MediaDevices API for microphone access. Recorded audio is processed using the Hugging Face Whisper model to generate and display transcriptions.
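The transcriber used above is a Transformers.js automatic-speech-recognition pipeline. A minimal sketch of how it could be created when the component mounts; the package name is an assumption (@huggingface/transformers, or the older @xenova/transformers depending on the project's dependency), and selectedModel is assumed to hold the user-selected Whisper checkpoint id (for example 'Xenova/whisper-base'):

import { onMounted } from 'vue';
import { pipeline } from '@huggingface/transformers';

let transcriber;

// Load the Whisper ASR pipeline once when the component mounts
onMounted(async () => {
  transcriber = await pipeline('automatic-speech-recognition', selectedModel);
});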
Key Components of Speaker Diarization
This feature enables speaker diarization using the following technologies:
- Pyannote: For speaker segmentation.
Key Components (demo code, rewritten to explain the logic only)
1. Model Loading Architecture
Dual-model system with specialized initialization sequence:
// Component mounting sequence
onMounted(async () => {
  // Speech-to-Text Model
  transcriber = await pipeline("automatic-speech-recognition", selectedModel);

  // Diarization Processor
  segmentationProcessor = await AutoProcessor.from_pretrained(
    'onnx-community/pyannote-segmentation-3.0'
  );

  // Speaker Segmentation Model (WebAssembly optimized)
  segmentationModel = await AutoModelForAudioFrameClassification.from_pretrained(
    'onnx-community/pyannote-segmentation-3.0',
    { device: 'wasm', dtype: 'fp32' }
  );
});
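With both models loaded, diarization segments can be produced along the lines of the Transformers.js pyannote example. This is a sketch only; the exact call site and audio handling in the actual component may differ:

// Sketch: produce speaker segments ({ id, start, end, confidence }) from raw audio samples
async function runDiarization(audio) {
  // audio: Float32Array of mono samples at the processor's sampling rate (16 kHz for pyannote)
  const inputs = await segmentationProcessor(audio);
  const { logits } = await segmentationModel(inputs);

  // Post-process frame-level logits into per-speaker segments
  const [segments] = segmentationProcessor.post_process_speaker_diarization(
    logits,
    audio.length
  );
  return segments;
}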
2. Core Merging Algorithm
Three-phase temporal alignment strategy:
function mergeResults(transcription, diarization) {
  // Phase 1: Data sanitization - drop very short or low-confidence speaker segments
  const validSegments = diarization.filter(s =>
    s.end - s.start >= 0.5 && s.confidence >= 0.8
  );

  // Phase 2: Temporal alignment - assign each transcription chunk to a speaker
  const formattedSegments = transcription.chunks.reduce((acc, chunk) => {
    const speaker = findOptimalSpeaker(chunk.timestamp, validSegments);
    return mergeSegments(acc, chunk, speaker);
  }, []);

  return formattedSegments;

  // Phase 3: Segment consolidation - merge adjacent chunks from the same speaker
  function mergeSegments(acc, chunk, speaker) {
    const last = acc[acc.length - 1];
    const newSegment = {
      start: chunk.timestamp[0],
      end: chunk.timestamp[1],
      text: chunk.text.trim(),
      speaker
    };
    if (last && last.speaker === speaker && (chunk.timestamp[0] - last.end < 1.5)) {
      last.text += ` ${newSegment.text}`;
      last.end = newSegment.end;
      return acc;
    }
    return [...acc, newSegment];
  }
}
3. Speaker Matching Logic
Multi-criteria temporal analysis:
function findOptimalSpeaker([start, end], segments) {
  if (segments.length === 0) return null; // no valid diarization segments to match against

  // 1. Full containment check
  const container = segments.find(s =>
    s.start <= start && s.end >= end
  );

  // 2. Significant overlap (35% threshold)
  const overlapped = segments.filter(s => {
    const overlap = Math.min(end, s.end) - Math.max(start, s.start);
    return overlap / (end - start) >= 0.35;
  });

  // 3. Midpoint proximity fallback
  const midpoint = (start + end) / 2;
  return container || overlapped[0] ||
    segments.reduce((a, b) =>
      Math.abs(midpoint - (a.start + a.end) / 2) <
      Math.abs(midpoint - (b.start + b.end) / 2) ? a : b
    );
}
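To make the data shapes concrete, here is a small hypothetical example of the two inputs and the merged output; all values are illustrative:

// Hypothetical inputs shaped like the Whisper chunks and pyannote segments used above
const transcription = {
  chunks: [
    { timestamp: [0.0, 2.1], text: " Hi, how are you?" },
    { timestamp: [2.3, 4.0], text: " I'm fine, thanks." }
  ]
};
const diarization = [
  { id: 0, start: 0.0, end: 2.2, confidence: 0.93 },
  { id: 1, start: 2.2, end: 4.1, confidence: 0.88 }
];

const merged = mergeResults(transcription, diarization);
// -> [ { start: 0.0, end: 2.1, text: "Hi, how are you?", speaker: { id: 0, ... } },
//      { start: 2.3, end: 4.0, text: "I'm fine, thanks.", speaker: { id: 1, ... } } ]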
Key Components of Realtime Speech-to-Text
1. Intelligent Noise Reduction
Real-time audio processing pipeline with multi-stage filtering:
// Assumed setup (not shown in the original snippet): source and processor nodes
const mediaStreamSource = audioContext.createMediaStreamSource(audioStream);
const scriptProcessor = audioContext.createScriptProcessor(4096, 1, 1);

const noiseGate = audioContext.createDynamicsCompressor();
noiseGate.threshold.value = -50;      // Signal threshold in dB
noiseGate.knee.value = 40;            // Transition range in dB
noiseGate.ratio.value = 12;           // Compression ratio of 12:1 for effective noise reduction
noiseGate.attack.value = 0.003;       // Attack time in seconds (3ms)
noiseGate.release.value = 0.25;       // Release time in seconds (250ms)

// Apply spectral filtering via low-pass filter
const lowPassFilter = audioContext.createBiquadFilter();
lowPassFilter.type = 'lowpass';
lowPassFilter.frequency.value = 8000; // Cutoff frequency at 8kHz preserves speech harmonics

// Configure audio processing signal chain
mediaStreamSource.connect(noiseGate);
noiseGate.connect(lowPassFilter);
lowPassFilter.connect(scriptProcessor);
scriptProcessor.connect(audioContext.destination); // keeps the processor running; its output stays silent

scriptProcessor.onaudioprocess = e => {
  const chunk = e.inputBuffer.getChannelData(0);
  // Skip silent chunks
  if (!chunk.some(s => s !== 0)) return;
  // Buffer audio in 1s intervals
  audioBuffers.push(new Float32Array(chunk));
  if (Date.now() - lastSendTime >= Interval) {
    // Merge the buffered chunks into one buffer and send it for transcription
    // (lastSendTime and Interval are module-level state not shown here)
  }
};
2. Fallback Processing System
Robust failure recovery mechanisms:
- Main Thread Pipeline: High-quality primary processing.
- Web Worker Fallback: Secondary real-time processing (a sketch of the worker wiring follows the code below).
async function stopRecording() {
  try {
    // Primary processing attempt on the main thread
    const result = await transcriber(fullAudio);
    model.value = result.text;
    // Send the final transcription to other components
  } catch (error) {
    // Fall back to the worker's accumulated partial results (see the throttled update below)
    model.value = accumulatedText.value + partialResult.value;
  }
}
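The worker side of the fallback is not shown above. A minimal sketch of how it could be wired in, assuming a hypothetical realtimeWorker.js module that runs its own Whisper pipeline and posts partial transcriptions back to the main thread:

// Main thread: spawn the fallback worker (the file name is hypothetical)
const worker = new Worker(new URL('./realtimeWorker.js', import.meta.url), { type: 'module' });

// Feed each merged audio buffer to the worker for realtime transcription
function sendToWorker(float32Chunk) {
  worker.postMessage({ type: 'audio', chunk: float32Chunk });
}

// Receive partial results and pass them to the throttled UI update shown below
worker.onmessage = (event) => {
  if (event.data.type === 'partial') {
    throttledUpdate(event.data.text);
  }
};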
3. Realtime Display with Throttling
/**
 * Handle partial transcription results with throttling.
 * Updates model value with intermediate results for responsive UI.
 */
const throttledUpdate = throttle((text) => {
  if (!text) return;
  if (partialResult.value !== text) {
    if (partialResult.value) {
      accumulatedText.value += partialResult.value + ' ';
    }
    partialResult.value = text;
  }
  if (model !== undefined) {
    model.value = accumulatedText.value + text;
  }
}, 200);
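throttle itself is not defined in the snippet; a library helper such as lodash's throttle would do, or a minimal leading-edge version like this sketch:

// Minimal throttle: invoke fn at most once per `wait` milliseconds (leading edge only)
function throttle(fn, wait) {
  let lastCall = 0;
  return (...args) => {
    const now = Date.now();
    if (now - lastCall >= wait) {
      lastCall = now;
      fn(...args);
    }
  };
}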
3. Message Generation
This feature generates messages using the selected model, with the following technologies:
- CreateMLCEngine: For generating messages with local LLMs running in the browser (WebLLM).
- OpenAI API: For generating responses using OpenAI's models.
Key Components
The MessageStore.js file includes:
- Functions for message generation based on the selected model and the prompt input.
- Integration with CreateMLCEngine or OpenAI API for generating messages.
const chatCompletionModel = settingStore.selectedLLMModel === "OpenAI"
  ? new OpenAIImplementation()
  : new WebLLMImplementation();

// create() lazily initializes the engine, requests a chat completion, and parses the JSON reply
async create(messages, retryCount = 3) {
  try {
    // Wait if another call is already initializing the engine
    while (this.engineLoading) {
      await sleep(100);
    }
    if (this.engine === null) {
      this.engineLoading = true;
      try {
        await this.setup();
      } finally {
        this.engineLoading = false;
      }
    }
    const completion = await this.engine.chat.completions.create({
      messages,
    });
    console.log(completion);
    console.log('The result is', completion.choices[0].message.content);
    // Strip everything outside the JSON list and repair common formatting glitches before parsing
    return JSON.parse(
      completion.choices[0].message.content.trim()
        .replace(/(^[^[]+|[^\]]+$)/g, '')
        .replace(/,\s*]$/, ']')
        .replace(/"\s+"/g, '", "')
        .replace(/(?<=\{)\s*([^"]+?)\s*:/g, '"$1":')
    );
  } catch (error) {
    // Retry while attempts remain (assumed use of the otherwise-unused retryCount parameter)
    if (retryCount > 0) {
      return this.create(messages, retryCount - 1);
    }
    throw error;
  }
}
Code Explanation
This section selects the appropriate model based on user settings and generates messages by sending commands to the model, then processing and returning the responses.
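Both implementations expose the same engine shape used by create() above (setup(), engine, engineLoading). The following is a rough sketch, where the class internals, the settingStore.openAIKey field, and the WebLLM model id are assumptions; only CreateMLCEngine, the OpenAI client, and chat.completions.create come from the actual dependencies:

import { CreateMLCEngine } from '@mlc-ai/web-llm';
import OpenAI from 'openai';

// In-browser generation via WebLLM
class WebLLMImplementation {
  engine = null;
  engineLoading = false;

  async setup() {
    // Example prebuilt model id; the app's '3b Model' option would map to something similar
    this.engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f16_1-MLC');
  }
}

// Hosted generation via the OpenAI API (same interface, different engine)
class OpenAIImplementation {
  engine = null;
  engineLoading = false;

  async setup() {
    // openAIKey is an assumed settings field
    this.engine = new OpenAI({ apiKey: settingStore.openAIKey, dangerouslyAllowBrowser: true });
  }
}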
async function generateSentences() {
  const command = `Given the Current Conversation History, generate a list of 3 to 5 short generic sentences the
    assistant may want to say. You must respond only with a valid JSON list of suggestions and NOTHING else.`
  let messages = [
    { role: "system", content: getSentenceSystemMessage() },
    { role: "user", content: command }
  ]
  sentenceSuggestions.value = await chatCompletionModel.getResponse(messages, false, true) || sentenceSuggestions.value
  activeEditHistory.value = activeEditHistory.value.concat([
    { role: "system", content: command },
    { role: "assistant", content: `{"suggestions": ["${sentenceSuggestions.value.join('", "')}"]}` }
  ])
}

async function generateWords() {
  const command = `Given the Current Conversation History, generate a short list of key words or
    very short phrases the user can select from to build a new sentence. You must respond only with a valid JSON list
    of suggestions and NOTHING else.`
  let messages = [
    { role: "system", content: getKeywordSystemMessage() },
    { role: "user", content: command }
  ]
  wordSuggestions.value = await chatCompletionModel.getResponse(messages, true, false) || wordSuggestions.value
}
Code Explanation
This section sends commands to the LLM to generate sentence and word suggestions based on the conversation history.
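For illustration, a hypothetical raw model reply and what the cleanup in create() turns it into (the reply text is invented):

// Hypothetical raw reply: models often wrap the JSON list in extra prose
const raw = 'Sure! Here are some suggestions: ["Yes, please", "Not right now", "Thank you"] Hope this helps.';

// The cleanup in create() strips everything outside the brackets before JSON.parse
const cleaned = raw.trim().replace(/(^[^[]+|[^\]]+$)/g, '');
console.log(JSON.parse(cleaned));
// -> ["Yes, please", "Not right now", "Thank you"]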