System Overview

The OssiaVoice project uses a modular architecture designed for accessibility, responsiveness, and privacy. It combines language models, real-time speech processing, and user-specific settings management into an adaptive communication solution.

High-Level Architecture

```mermaid
flowchart TD
    User[User Interface] --> State[State Management - Pinia]
    State --> Logic[Business Logic]
    Logic --> Integration["External Integrations<br/>(where necessary)"]
    Integration --> Output["Audio Output & Suggestions"]
```

The high-level architecture is user-centric: user input flows through state management and business logic to external integrations, which return audio output and communication suggestions.

Detailed Architecture Layers

```mermaid
flowchart TD
    subgraph UI[User Interface]
        Browser --> VueApp[Vue.js Application]
        VueApp --> Components["Vue Components: Message Builder, Accessibility Controls"]
    end
    subgraph State[State Management]
        Components --> PiniaStores["Pinia Stores: Message, Settings, Alert, Loading, Microphone"]
    end
    subgraph Logic[Business Logic]
        PiniaStores --> TextEngine[Text Generation Engine]
        TextEngine --> PromptGen["Prompt Generation & Editing Logic"]
    end
    subgraph Integration[Integrations]
        TextEngine --> OpenAI[OpenAI API]
        TextEngine --> WebLLM[WebLLM Models]
        Components --> STT[Speech-to-Text Module]
        Components --> TTS[Text-to-Speech Module]
    end
```

OssiaVoice is built from interconnected layers: Vue.js components for the UI, Pinia stores for state management, a text generation engine for business logic, and external integrations such as the OpenAI API, in-browser WebLLM models, and the speech-to-text and text-to-speech modules. A sketch of one such store follows.
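As a minimal sketch of the state layer, the following hypothetical Pinia store mirrors the Message store named in the diagram; the store shape and field names (`sentence`, `suggestions`, `editHint`) are assumptions for illustration, not OssiaVoice's actual code:

```typescript
// Hypothetical sketch of one of the Pinia stores named in the diagram.
// Store shape and field names are assumptions for illustration only.
import { defineStore } from 'pinia'

export const useMessageStore = defineStore('message', {
  state: () => ({
    sentence: '',                    // message being built by the user
    suggestions: [] as string[],     // candidate phrases from the text engine
    editHint: null as string | null, // optional user hint for regeneration
  }),
  actions: {
    setSuggestions(items: string[]) {
      this.suggestions = items
    },
    appendWord(word: string) {
      this.sentence = this.sentence ? `${this.sentence} ${word}` : word
    },
    clear() {
      this.sentence = ''
      this.suggestions = []
      this.editHint = null
    },
  },
})
```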

Speech-to-Text Workflow

```mermaid
flowchart TD
    STT[Speech-to-Text Module] --> Mode{STT Branch}
    Mode --> Standard[Standard STT Engine]
    Mode --> Diarization[Diarization STT Engine]
    Mode --> Realtime[Realtime Whisper STT Engine]

    %% Standard Flow
    Standard --> A1[Audio]
    A1 --> W1[Whisper]
    W1 --> T1[Text]

    %% Diarization Flow
    Diarization --> A2[Audio]
    A2 --> W2[Whisper]
    A2 --> S2[Segment Model]
    W2 --> M2["Merge Algorithm<br/>(3 modes) + Filter"]
    S2 --> M2
    M2 --> T2[Text]

    %% Realtime Flow
    Realtime --> A3[Audio Flow]
    A3 --> WF["Worker File (WebGPU)"]
    WF --> RT[Realtime Text]
    Realtime --> A4["Audio (Finalized)"]
    A4 --> W3[Whisper]
    W3 --> FT[Text]
```

OssiaVoice's Speech-to-Text (STT) framework is designed for modularity and extensibility: the three engines (standard, diarization, and realtime Whisper) sit behind a single branch point, and the open-source design allows new STT engines to be integrated over time. This lets developers introduce, update, or optimize individual STT modes for different user scenarios. In the diarization branch, Whisper transcription and a separate segmentation model run over the same audio, and a merge algorithm (with three modes) plus a filter combines their outputs to attribute text to speakers; one plausible merge strategy is sketched below.
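As an illustration of the diarization branch's merge step, here is a minimal sketch of one plausible merge mode (best time-overlap between Whisper text segments and speaker turns, with a minimum-overlap filter). All type and function names are assumptions, and the project's actual merge algorithm offers three modes that may differ from this one:

```typescript
// Hypothetical sketch of the diarization merge: assign each Whisper text
// segment the speaker whose diarization turn overlaps it most, and filter
// out segments with no meaningful speaker overlap. Names are illustrative.
interface TimedText { text: string; startMs: number; endMs: number }
interface SpeakerTurn { speaker: string; startMs: number; endMs: number }

function overlapMs(a: { startMs: number; endMs: number }, b: SpeakerTurn): number {
  return Math.max(0, Math.min(a.endMs, b.endMs) - Math.max(a.startMs, b.startMs))
}

function mergeDiarization(
  segments: TimedText[],
  turns: SpeakerTurn[],
  minOverlapMs = 50, // filter threshold: drop weakly attributed segments
): Array<TimedText & { speaker: string }> {
  const merged: Array<TimedText & { speaker: string }> = []
  for (const seg of segments) {
    let best: SpeakerTurn | null = null
    let bestOverlap = 0
    for (const turn of turns) {
      const o = overlapMs(seg, turn)
      if (o > bestOverlap) { bestOverlap = o; best = turn }
    }
    if (best && bestOverlap >= minOverlapMs) {
      merged.push({ ...seg, speaker: best.speaker })
    }
  }
  return merged
}
```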

Text-to-Speech Workflow

```mermaid
flowchart LR
    User[User] -->|Uploads Voice Clips| WebApp[OssiaVoice Web Application]
    WebApp -->|Sends clips for processing| PythonAPI[External Python API]
    PythonAPI -->|Generates Speaker Embeddings| Embeddings[Speaker Embeddings]
    Embeddings -->|Returned embeddings| WebApp
    WebApp -->|Combines Embeddings & Text Input| SpeechT5[SpeechT5 TTS Model]
    SpeechT5 -->|Generates Custom Voice Audio| Output[Custom Cloned Voice Audio]
    Output -->|Plays synthesized audio| User
```

The OssiaVoice system allows users to create highly personalized speech synthesis by uploading their own voice clips. These voice samples are securely transmitted to an external Python API, which processes the audio to generate unique speaker embeddings. The embeddings capture the distinct characteristics of the user's voice, enabling a custom vocal profile; at synthesis time, the SpeechT5 model combines this embedding with the user's text to produce audio in the cloned voice, as sketched below.
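A minimal sketch of this round trip, assuming Transformers.js for in-browser SpeechT5 synthesis and a hypothetical `/embeddings` endpoint on the external Python API; the endpoint path, payload format, and raw `Float32Array` embedding encoding are illustrative assumptions:

```typescript
// Sketch only: the /embeddings endpoint, its URL, and the response format
// are hypothetical stand-ins for the external Python API.
import { pipeline } from '@xenova/transformers'

// 1. Send the user's voice clips to the Python API and get an embedding back.
async function fetchSpeakerEmbedding(clips: Blob[]): Promise<Float32Array> {
  const form = new FormData()
  clips.forEach((clip, i) => form.append('clips', clip, `clip-${i}.wav`))
  const res = await fetch('https://example-python-api/embeddings', {
    method: 'POST',
    body: form,
  })
  return new Float32Array(await res.arrayBuffer()) // e.g. a 512-dim x-vector
}

// 2. Combine the embedding with text input using the SpeechT5 TTS pipeline.
async function speak(text: string, speakerEmbedding: Float32Array) {
  // The Transformers.js docs recommend unquantized weights for this model.
  const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
    quantized: false,
  })
  const result = await synthesizer(text, { speaker_embeddings: speakerEmbedding })
  // result.audio is a Float32Array of samples at result.sampling_rate (16 kHz)
  return result
}
```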

Suggestion Generation Sequence

```mermaid
sequenceDiagram
    User ->> AppState: Initialize interaction
    AppState ->> Model: Select Language Model
    Model ->> Engine: Instantiate Text Generation Engine
    User ->> Engine: Request Suggestions
    Engine ->> Engine: Generate Prompts
    Engine ->> Model: Retrieve suggestions
    Model ->> AppState: Return suggestions (JSON)
    AppState ->> User: Update UI with suggestions
    User ->> Engine: Provide Edit Hint (optional)
    Engine ->> Model: Regenerate refined suggestions
    Model ->> AppState: Update suggestions
```

This sequence traces the flow from the user's initial interaction through prompt generation, suggestion retrieval, UI updates, and optional hint-driven refinement of communication suggestions. A minimal sketch of one suggestion round follows.
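A minimal sketch of one suggestion round, assuming the OpenAI Chat Completions API is the selected backend; the system prompt wording, the `gpt-4o-mini` model choice, and the JSON response shape are illustrative assumptions (a WebLLM engine could sit behind the same function signature):

```typescript
// Sketch only: prompt text, model name, and JSON output contract are
// assumptions for illustration, not OssiaVoice's actual prompts.
async function getSuggestions(
  conversation: string,
  apiKey: string,
  editHint?: string,
): Promise<string[]> {
  const messages = [
    {
      role: 'system',
      content:
        'Suggest short replies the user might want to say next. ' +
        'Respond with a JSON array of strings only.',
    },
    { role: 'user', content: conversation },
  ]
  // Optional edit hint triggers the regeneration path in the sequence above.
  if (editHint) messages.push({ role: 'user', content: `Refine the suggestions: ${editHint}` })

  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: 'gpt-4o-mini', messages }),
  })
  const data = await res.json()
  // Assumes the model honors the JSON-array-only instruction.
  return JSON.parse(data.choices[0].message.content) as string[]
}
```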