System Overview
The OssiaVoice project uses a modular architecture built around accessibility, responsiveness, and privacy. It combines language models, real-time speech processing, and user-specific settings management into an adaptive communication solution.
High-Level Architecture
```mermaid
flowchart TD
    User[User Interface] --> State[State Management - Pinia]
    State --> Logic[Business Logic]
    Logic --> Integration["External Integrations<br/>(where necessary)"]
    Integration --> Output["Audio Output & Suggestions"]
```
At a high level, the architecture is user-centric: input from the UI flows through state management and business logic to external integrations, which return suggestions and audio output.
Detailed Architecture Layers
```mermaid
flowchart TD
    subgraph UI[User Interface]
        Browser --> VueApp[Vue.js Application]
        VueApp --> Components["Vue Components: Message Builder, Accessibility Controls"]
    end
    subgraph State[State Management]
        Components --> PiniaStores["Pinia Stores: Message, Settings, Alert, Loading, Microphone"]
    end
    subgraph Logic[Business Logic]
        PiniaStores --> TextEngine[Text Generation Engine]
        TextEngine --> PromptGen["Prompt Generation & Editing Logic"]
    end
    subgraph Integration[Integrations]
        TextEngine --> OpenAI[OpenAI API]
        TextEngine --> WebLLM[WebLLM Models]
        Components --> STT[Speech-to-Text Module]
        Components --> TTS[Text-to-Speech Module]
    end
```
OssiaVoice is built from interconnected layers: Vue.js components form the UI, Pinia stores hold application state (messages, settings, alerts, loading, microphone), a text generation engine drives prompt creation and editing, and the integration layer connects to the OpenAI API, WebLLM models, and the speech modules.
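To make the state layer concrete, here is a minimal sketch of what one Pinia store might look like. The store name, fields, and actions are illustrative assumptions, not the project's actual API.

```typescript
// Hypothetical sketch of a message store in the Pinia state layer.
// Field and action names are illustrative, not the project's actual code.
import { defineStore } from 'pinia'

export const useMessageStore = defineStore('message', {
  state: () => ({
    // Words/phrases the user has assembled in the Message Builder
    messageWords: [] as string[],
    // Suggestions returned by the text generation engine
    suggestions: [] as string[],
  }),
  actions: {
    addWord(word: string) {
      this.messageWords.push(word)
    },
    setSuggestions(suggestions: string[]) {
      this.suggestions = suggestions
    },
    clear() {
      this.messageWords = []
      this.suggestions = []
    },
  },
})
```

Separate stores for messages, settings, alerts, loading state, and the microphone keep components thin: they render store state and dispatch actions, while the text generation engine reads from and writes to the same stores.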
Speech-to-Text Workflow
```mermaid
flowchart TD
    STT[Speech-to-Text Module] --> Mode{STT Branch}
    Mode --> Standard[Standard STT Engine]
    Mode --> Diarization[Diarization STT Engine]
    Mode --> Realtime[Realtime Whisper STT Engine]
    %% Standard Flow
    Standard --> A1[Audio]
    A1 --> W1[Whisper]
    W1 --> T1[Text]
    %% Diarization Flow
    Diarization --> A2[Audio]
    A2 --> W2[Whisper]
    A2 --> S2[Segment Model]
    W2 --> M2["Merge Algorithm (3 modes) + Filter"]
    S2 --> M2
    M2 --> T2[Text]
    %% Realtime Flow
    Realtime --> A3[Audio Flow]
    A3 --> WF["Worker File (WebGPU)"]
    WF --> RT[Realtime Text]
    Realtime --> A4["Audio (Finalized)"]
    A4 --> W3[Whisper]
    W3 --> FT[Text]
```
OssiaVoice's Speech-to-Text (STT) framework is designed for modularity and extensibility: engines are maintained in separate GitHub repositories, so new STT engines can be integrated and evolved continuously. This architecture lets developers introduce, update, or optimize individual STT modes (standard, diarization, or realtime) for different user scenarios.
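As a sketch of how such pluggable engines could be structured, the TypeScript below defines a hypothetical shared engine interface and a simplified version of the diarization merge, labeling each Whisper chunk with the speaker segment that overlaps it most. All names here are assumptions, and the real merge algorithm has three modes plus a filter rather than this single heuristic.

```typescript
// Hypothetical common interface that each pluggable STT engine could implement.
interface STTEngine {
  transcribe(audio: Float32Array): Promise<string>
}

// Shapes assumed for illustration: Whisper chunks and diarization segments,
// both carrying [start, end] timestamps in seconds.
interface WhisperChunk { text: string; start: number; end: number }
interface SpeakerSegment { speaker: string; start: number; end: number }

// Simplified merge: label each Whisper chunk with the speaker whose
// segment overlaps it the most. This shows only the basic idea behind
// combining the two model outputs.
function mergeBySpeaker(
  chunks: WhisperChunk[],
  segments: SpeakerSegment[],
): string {
  return chunks
    .map((chunk) => {
      let best = 'unknown'
      let bestOverlap = 0
      for (const seg of segments) {
        const overlap =
          Math.min(chunk.end, seg.end) - Math.max(chunk.start, seg.start)
        if (overlap > bestOverlap) {
          bestOverlap = overlap
          best = seg.speaker
        }
      }
      return `[${best}] ${chunk.text}`
    })
    .join('\n')
}
```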
Text-to-Speech Workflow
```mermaid
flowchart LR
    User[User] -->|Uploads Voice Clips| WebApp[OssiaVoice Web Application]
    WebApp -->|Sends clips for processing| PythonAPI[External Python API]
    PythonAPI -->|Generates Speaker Embeddings| Embeddings[Speaker Embeddings]
    Embeddings -->|Returned embeddings| WebApp
    WebApp -->|"Combines Embeddings & Text Input"| SpeechT5[SpeechT5 TTS Model]
    SpeechT5 -->|Generates Custom Voice Audio| Output[Custom Cloned Voice Audio]
    Output -->|Plays synthesized audio| User
```
The OssiaVoice system allows users to create highly personalized speech synthesis by uploading their own voice clips.
These voice samples are securely transmitted to an external Python API, which processes the audio to generate unique speaker embeddings.
The embeddings capture the distinct characteristics of the user's voice; combined with text input in the SpeechT5 TTS model, they let the system synthesize speech in a custom cloned voice.
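The final synthesis step might look like the sketch below, which assumes the browser-side SpeechT5 pipeline from Hugging Face's transformers.js and a hypothetical `/api/speaker-embeddings` endpoint. The document only specifies that embeddings come back from an external Python API, so the exact library and endpoint are assumptions.

```typescript
import { pipeline } from '@xenova/transformers'

// Hypothetical endpoint on the external Python API that turns uploaded
// voice clips into speaker embeddings. The path is illustrative.
async function fetchSpeakerEmbeddings(clips: Blob[]): Promise<Float32Array> {
  const form = new FormData()
  clips.forEach((clip, i) => form.append('clips', clip, `clip-${i}.wav`))
  const res = await fetch('/api/speaker-embeddings', { method: 'POST', body: form })
  return new Float32Array(await res.arrayBuffer())
}

// Combine the returned embeddings with text input in the SpeechT5 TTS model.
async function speakAs(text: string, embeddings: Float32Array) {
  // Unquantized weights tend to give better audio quality.
  const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
    quantized: false,
  })
  // transformers.js accepts speaker embeddings as a Float32Array (or a URL).
  const { audio, sampling_rate } = await synthesizer(text, {
    speaker_embeddings: embeddings,
  })
  // Play the raw samples through the Web Audio API.
  const ctx = new AudioContext({ sampleRate: sampling_rate })
  const buffer = ctx.createBuffer(1, audio.length, sampling_rate)
  buffer.copyToChannel(audio, 0)
  const source = ctx.createBufferSource()
  source.buffer = buffer
  source.connect(ctx.destination)
  source.start()
}
```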
Suggestion Generation Sequence
```mermaid
sequenceDiagram
    User ->> AppState: Initialize interaction
    AppState ->> Model: Select Language Model
    Model ->> Engine: Instantiate Text Generation Engine
    User ->> Engine: Request Suggestions
    Engine ->> Engine: Generate Prompts
    Engine ->> Model: Retrieve suggestions
    Model ->> AppState: Return suggestions (JSON)
    AppState ->> User: Update UI with suggestions
    User ->> Engine: Provide Edit Hint (optional)
    Engine ->> Model: Regenerate refined suggestions
    Model ->> AppState: Update suggestions
    AppState ->> User: Update UI with refined suggestions
```
This sequence traces the flow from the initial user action through suggestion generation, and shows how an optional edit hint triggers a refinement pass that updates the UI with revised suggestions.
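As an illustration of one round-trip in this sequence, here is a sketch using the official openai client with JSON-mode output. The prompt wording, model name, and response shape are assumptions; the document only states that suggestions come back from the model as JSON.

```typescript
import OpenAI from 'openai'

// Hypothetical sketch of one suggestion round-trip. The prompt text,
// model name, and JSON shape are assumptions for illustration.
interface SuggestionResponse { suggestions: string[] }

async function generateSuggestions(
  apiKey: string, // e.g. supplied by the user via the settings store
  conversation: string,
  editHint?: string,
): Promise<string[]> {
  const client = new OpenAI({ apiKey, dangerouslyAllowBrowser: true })
  const response = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    // Ask the model to answer as JSON so the UI can parse it directly.
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content:
          'Suggest short replies the user could say next. ' +
          'Respond as JSON: {"suggestions": ["...", "..."]}',
      },
      { role: 'user', content: conversation },
      // An optional edit hint triggers the refinement pass.
      ...(editHint
        ? [{ role: 'user' as const, content: `Refine the suggestions: ${editHint}` }]
        : []),
    ],
  })
  const parsed: SuggestionResponse = JSON.parse(
    response.choices[0].message.content ?? '{"suggestions":[]}',
  )
  return parsed.suggestions
}
```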