Implementation
AI music teacher backend
FastAPI with llama.cpp
Project Structure Overview
Jamboxx-infinite-backend-llama
jamboxx_infinite_backends/
├── app/
│   ├── main.py          # Entrypoint with FastAPI app
│   ├── services/        # llama.cpp logic goes here
│   ├── config.py        # Configuration (e.g., model path, llama-cli settings)
│   ├── models/          # Pydantic models for request/response
│   └── api/             # Route definitions
├── run.py               # Development entry script
├── requirements.txt     # Dependencies
└── README.md
FastAPI Endpoint /chat/
Key Features
- Pydantic Models: Used for request/response data validation and parsing.
- Llama.cpp Service Integration: Implements model inference using llama-cli.
- App Initialization: Configures the FastAPI app and related services.
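The endpoint follows the pattern below. This is a minimal sketch rather than the project's actual code: the ChatRequest/ChatResponse models, the run_llama helper, and the model path are hypothetical names, and it assumes a llama.cpp llama-cli binary is available on the PATH.
Code Snippet (illustrative sketch):
# Minimal sketch of a /chat/ endpoint backed by llama-cli; identifiers are illustrative.
import subprocess
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODEL_PATH = "models/mistral-7b.gguf"   # assumed model location configured in config.py
LLAMA_CLI = "llama-cli"                 # assumed llama.cpp CLI binary

class ChatRequest(BaseModel):
    prompt: str

class ChatResponse(BaseModel):
    reply: str

def run_llama(prompt: str) -> str:
    """Invoke llama-cli once and capture its stdout as the model reply."""
    result = subprocess.run(
        [LLAMA_CLI, "-m", MODEL_PATH, "-p", prompt, "-n", "256"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

@app.post("/chat/", response_model=ChatResponse)
def chat(request: ChatRequest) -> ChatResponse:
    return ChatResponse(reply=run_llama(request.prompt))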
Unity Frontend Integration
Key Functionalities
- Uses UnityWebRequest to send POST requests to the FastAPI backend at http://127.0.0.1:8001/chat/.
- Sends a prompt string in JSON format to the backend.
- Receives streamed responses via DownloadHandlerRawText, updating the chat history in real time.
- UI elements include:
- Mood/genre dropdowns
- A slider for song length
- A panel for lyrics
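The same request contract can be exercised outside Unity for testing. The snippet below is a hypothetical Python client; the exact JSON field names used by LlamaManager.cs may differ.
Code Snippet (illustrative sketch):
# Hypothetical test client for the /chat/ endpoint the Unity frontend talks to.
import requests

payload = {"prompt": "Suggest a chord progression for a calm lo-fi track."}
with requests.post("http://127.0.0.1:8001/chat/", json=payload, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="")   # Unity appends streamed chunks to the chat history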
Data Flow Diagram
[Unity UI]
↓ (prompt)
[LlamaManager.cs]
↓ (POST request /chat/)
[FastAPI Endpoint]
↓
[Llama Service (llama-cli, Mistral-7B)]
↓
[Model Output]
↓
[Unity UI Updated]
AI voice cloning backend
Jamboxx-infinite-Backend-AI-Voice-Clone
Key Feature: Audio Processing Pipeline
Overview
The audio processing pipeline is a core feature of the project, enabling users to process audio files through multiple stages: separation, voice conversion, and merging. This feature is implemented using the FastAPI framework for building RESTful APIs, along with several specialized libraries for audio processing, such as NumPy, SoundFile, and PyTorch.
The pipeline is designed to handle complex audio processing tasks while maintaining modularity and scalability. Each stage of the pipeline is implemented as a separate service, ensuring clear separation of concerns and reusability.
Frameworks and Libraries Used
- FastAPI: Provides the API endpoints and request handling.
- NumPy: Used for numerical operations on audio data.
- SoundFile: Handles reading and writing audio files in various formats.
- PyTorch: Powers the machine learning models for voice conversion and audio separation.
- aiofiles: Enables asynchronous file operations for efficient I/O handling.
- UUID: Generates unique identifiers for temporary and output files.
- Logging: Logs debug and error information for monitoring and debugging.
Project Structure and Functional Analysis
The project is structured to implement a modular and scalable backend for audio processing tasks. Below is an analysis of the functionality implemented in each folder based on the provided file paths and code.
1. routers
Purpose: This folder contains the API route definitions for the FastAPI application. It acts as the entry point for client requests and orchestrates the interaction between services.
- Key File: router.py
  - Implements endpoints for various audio processing tasks:
    - /voice/convert: Handles voice conversion using the DDSP-SVC model.
    - /audio/separate: Separates audio into vocals and instruments.
    - /audio/merge: Merges vocals and instruments into a single audio file.
    - /audio/process-all: A complete pipeline that combines separation, conversion, and merging.
    - /model/load and /model/info: Manage and retrieve information about the loaded machine learning models.
- Frameworks Used:
- FastAPI: For defining RESTful API endpoints.
- Pydantic: For request parameter validation.
- aiofiles: For asynchronous file handling.
- Key Features:
- Modular endpoints for each audio processing task.
- Error handling and logging for debugging and monitoring.
- Temporary file management to ensure efficient resource usage.
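For illustration, a simplified skeleton of the pipeline route is sketched below; parameter names, defaults, and the temporary-file layout are assumptions rather than the project's exact signature.
Code Snippet (illustrative sketch):
# Sketch of the /audio/process-all route; the service calls are stubbed out with comments.
import os
import uuid
from fastapi import APIRouter, File, HTTPException, UploadFile

router = APIRouter()

@router.post("/audio/process-all")
async def process_all(
    file: UploadFile = File(...),
    speaker_id: int = 1,
    key: int = 0,                    # pitch shift in semitones
    vocals_volume: float = 1.0,
    instruments_volume: float = 1.0,
):
    temp_files: list[str] = []
    try:
        # Save the upload to a uniquely named temporary file
        input_path = f"/tmp/{uuid.uuid4()}.wav"
        temp_files.append(input_path)
        with open(input_path, "wb") as f:
            f.write(await file.read())

        # 1. separation -> 2. voice conversion -> 3. merging
        # (see the Implementation Details section for the actual service calls)

        return {"output_url": "/static/pipeline/<output-file>.wav"}
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc))
    finally:
        for path in temp_files:
            if path and os.path.exists(path):
                os.unlink(path)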
2. services
Purpose: This folder contains the core business logic for audio processing. Each service encapsulates a specific functionality, making the codebase modular and reusable.
- Key Files:
  - ddsp_service.py:
    - Implements the DDSP-SVC (Differentiable Digital Signal Processing - Singing Voice Conversion) model for voice conversion.
    - Provides methods like infer to process audio files with configurable parameters (e.g., speaker ID, pitch adjustment).
    - Frameworks Used:
      - PyTorch: For running the DDSP-SVC model.
      - NumPy: For numerical operations on audio data.
    - Key Features:
      - Supports dynamic model loading and inference.
      - Handles pitch extraction and audio enhancement.
  - separator_service.py:
    - Implements audio source separation using a pre-trained Demucs model.
    - Provides methods like separate_tracks to split audio into vocals and instruments.
    - Frameworks Used:
      - PyTorch: For running the Demucs model.
      - SoundFile: For reading and writing audio files.
    - Key Features:
      - Processes multi-channel audio data.
      - Normalizes and validates audio to ensure compatibility with downstream tasks.
      - Also includes merge_tracks for combining separated audio tracks with configurable volume levels.
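As a rough illustration of the merging behaviour described above (not the project's exact implementation), a merge_tracks-style helper can be expressed with NumPy and SoundFile:
Code Snippet (illustrative sketch):
# Scale each stem by its volume multiplier, sum them, and write the mix to disk.
import numpy as np
import soundfile as sf

def merge_tracks(vocals_path: str, instruments_path: str, output_path: str,
                 vocals_volume: float = 1.0, instruments_volume: float = 1.0) -> str:
    vocals, sr = sf.read(vocals_path)
    instruments, sr_i = sf.read(instruments_path)
    assert sr == sr_i, "both stems must share a sample rate"

    # Trim to the shorter stem so the arrays can be summed element-wise
    length = min(len(vocals), len(instruments))
    mix = vocals[:length] * vocals_volume + instruments[:length] * instruments_volume

    # Normalize only if the summed signal clips
    peak = float(np.max(np.abs(mix)))
    if peak > 1.0:
        mix = mix / peak

    sf.write(output_path, mix, sr)
    return output_path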
3. app/static/
Purpose: This folder serves as the storage location for processed audio files that are accessible via static URLs.
- Subdirectories:
  - convert/: Stores converted audio files from the /voice/convert endpoint.
  - separator/: Stores separated audio files (vocals and instruments) from the /audio/separate endpoint.
  - merge/: Stores merged audio files from the /audio/merge endpoint.
  - pipeline/: Stores the final output of the complete audio processing pipeline from /audio/process-all.
- Key Features:
- Organized storage for processed files.
- Enables clients to download files via static URLs.
4. app/models/
Purpose: This folder likely contains pre-trained machine learning models used for audio processing tasks.
- Key Features:
- Stores models for DDSP-SVC and Demucs.
- Supports dynamic model loading via the /model/load endpoint.
5. utils
Purpose: This folder likely contains utility functions for common tasks such as file handling, logging, and data validation.
- Key Features:
- Functions for saving and cleaning up temporary files.
- Helper methods for logging and error handling.
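The helpers themselves are not reproduced in this report; a plausible sketch of the kind of utilities described (UUID-based naming, asynchronous writes with aiofiles, and cleanup) is shown below. The function names are assumptions.
Code Snippet (illustrative sketch):
# Illustrative temporary-file helpers; not the project's actual utility API.
import os
import uuid
import aiofiles

TEMP_DIR = "/tmp/jamboxx"

async def save_temp_file(data: bytes, suffix: str = ".wav") -> str:
    """Write uploaded bytes to a uniquely named temporary file and return its path."""
    os.makedirs(TEMP_DIR, exist_ok=True)
    path = os.path.join(TEMP_DIR, f"{uuid.uuid4()}{suffix}")
    async with aiofiles.open(path, "wb") as f:
        await f.write(data)
    return path

def cleanup_temp_files(paths: list[str]) -> None:
    """Delete temporary files, ignoring any that were already removed."""
    for path in paths:
        if path and os.path.exists(path):
            os.unlink(path)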
6. app/tests/
Purpose: This folder likely contains unit and integration tests for the application.
- Key Features:
- Tests for API endpoints to ensure correct functionality.
- Mocking of services to isolate and test individual components.
- Validation of edge cases and error handling.
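A representative test could use FastAPI's TestClient, as in the hypothetical example below; the asserted response schema is an assumption.
Code Snippet (illustrative sketch):
# Hypothetical endpoint test; the real suite's fixtures and assertions will differ.
from fastapi.testclient import TestClient

from app.main import app   # assumed import path

client = TestClient(app)

def test_model_info_returns_ok():
    response = client.get("/model/info")
    assert response.status_code == 200
    assert response.json()   # assumed: the endpoint returns a JSON body describing the model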
7. app/main.py
Purpose: This file is likely the entry point for the FastAPI application.
- Key Features:
- Initializes the FastAPI app and includes the routes defined in the routers package.
- Configures middleware, logging, and static file serving.
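A minimal sketch of such an entry point, assuming the router module and static directory described elsewhere in this section:
Code Snippet (illustrative sketch):
# Sketch of app/main.py: app creation, router registration, logging, and static files.
import logging

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

from app.routers.router import router   # assumed module path

logging.basicConfig(level=logging.INFO)

app = FastAPI(title="Jamboxx Infinite AI Voice Clone Backend")
app.include_router(router)

# Processed audio (convert/, separator/, merge/, pipeline/) is served from app/static/
app.mount("/static", StaticFiles(directory="app/static"), name="static")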
Summary of Functional Design
The project is designed with a clear separation of concerns:
- Routers handle API requests and responses.
- Services encapsulate the core business logic for audio processing.
- Static Files provide a mechanism for serving processed audio files.
- Models and Utilities support the core functionality with reusable components.
Folder-Level Functional Analysis
The project is organized into several key folders, each serving a specific purpose in the audio processing pipeline. Below is an analysis of the functionality implemented in the enhancer, encoder, nsf_hifigan, and ddsp folders.
1. enhancer
Purpose: Implements audio enhancement functionality, focusing on improving the quality of audio signals after processing.
- Key File: enhancer.py
- Class: Enhancer
  - Provides an interface for enhancing audio signals using different enhancement models.
  - Supports adaptive key adjustments and silence padding for better audio quality.
  - Uses NsfHifiGAN as the primary enhancement model.
- Class: NsfHifiGAN
  - Implements a neural vocoder based on HiFi-GAN for high-quality audio synthesis.
  - Includes methods for mel-spectrogram extraction and audio generation.
  - Dependencies:
    - PyTorch: For model inference.
    - STFT: For mel-spectrogram computation.
    - Resample: For resampling audio to match model requirements.
Key Features:
- Adaptive resampling and pitch adjustment for enhanced audio fidelity.
- Modular design to support additional enhancement models in the future.
2. encoder
Purpose: Provides feature extraction and encoding capabilities for audio signals, enabling downstream tasks like pitch extraction and content vector generation.
- Subfolder: hubert/
  - File: model.py
    - Implements the HuBERT model for extracting soft content vectors from audio.
    - Uses the Fairseq and Transformers libraries for model loading and inference.
    - Supports multiple configurations for different feature extraction needs.
- Subfolder: rmvpe/
  - Files:
    - model.py: Implements the RMVPE model for robust pitch extraction.
    - inference.py: Provides utilities for running inference with the RMVPE model.
    - seq.py, spec.py: Handle sequence and spectrogram processing for pitch extraction.
    - constants.py: Defines constants used across the RMVPE implementation.
  - Key Features:
    - High-accuracy pitch extraction using RMVPE.
    - Modular design for integrating additional pitch extraction algorithms.
Key Features:
- Encodes audio into feature representations (e.g., pitch, content vectors) for use in voice conversion and synthesis.
- Supports multiple feature extraction models, including HuBERT and RMVPE.
3. nsf_hifigan
Purpose: Implements the NsfHiFiGAN neural vocoder for high-quality audio synthesis.
- Key Files:
  - models.py: Defines the architecture of the NsfHiFiGAN model.
  - nvSTFT.py: Implements the Short-Time Fourier Transform (STFT) for mel-spectrogram computation.
  - env.py: Handles environment-specific configurations for the vocoder.
  - utils.py: Provides utility functions for model loading and inference.
Key Features:
- High-quality audio synthesis using neural vocoding techniques.
- Efficient mel-spectrogram computation and audio generation.
- Modular utilities for model loading and configuration.
Dependencies:
- PyTorch: For model training and inference.
- STFT: For spectrogram-based audio processing.
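For orientation, mel-spectrogram computation of the kind nvSTFT.py performs can be sketched with torchaudio; this is a conceptual equivalent, not the project's implementation.
Code Snippet (illustrative sketch):
# Conceptual mel-spectrogram computation; nvSTFT.py implements its own STFT pipeline.
import torchaudio

waveform, sr = torchaudio.load("vocals.wav")           # shape: (channels, samples)
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=2048, hop_length=512, n_mels=128
)
mel = to_mel(waveform)                                  # shape: (channels, n_mels, frames)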
4. ddsp
Purpose: Implements the Differentiable Digital Signal Processing (DDSP) framework for voice conversion and synthesis.
- Key Files:
  - vocoder.py: Implements the F0_Extractor and other utilities for pitch extraction and audio synthesis.
    - Class: F0_Extractor
      - Extracts fundamental frequency (F0) from audio using algorithms like Parselmouth, PyWorld, and RMVPE.
      - Supports adaptive resampling and silence trimming for accurate pitch extraction.
  - core.py: Provides core utilities for frequency filtering, upsampling, and other DSP operations.
  - unit2control.py: Maps unit features to control signals for synthesis.
  - loss.py: Defines loss functions for training DDSP models.
  - pcmer.py: Implements pitch contour modeling for enhanced synthesis.
Key Features:
- Modular design for pitch extraction, unit encoding, and audio synthesis.
- Integration with multiple pitch extraction algorithms (e.g., Parselmouth, RMVPE).
- Differentiable DSP operations for end-to-end training and inference.
Dependencies:
- PyTorch: For model training and inference.
- Parselmouth and PyWorld: For pitch extraction.
- Resampy: For audio resampling.
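To make the F0 extraction step concrete, the standalone sketch below uses Parselmouth directly; it illustrates the idea rather than reproducing the project's F0_Extractor class.
Code Snippet (illustrative sketch):
# Standalone F0 extraction with Parselmouth (Praat); unvoiced frames are returned as 0 Hz.
import numpy as np
import parselmouth

def extract_f0(audio_path: str, f0_min: float = 65.0, f0_max: float = 800.0,
               hop_seconds: float = 0.01) -> np.ndarray:
    sound = parselmouth.Sound(audio_path)
    pitch = sound.to_pitch(time_step=hop_seconds, pitch_floor=f0_min, pitch_ceiling=f0_max)
    return pitch.selected_array["frequency"]   # one F0 value per analysis frame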
Summary of Folder-Level Contributions
- enhancer/: Focuses on post-processing and enhancing audio quality using neural vocoders like NsfHiFiGAN.
- encoder/: Provides feature extraction capabilities, including pitch and content vector encoding, using models like HuBERT and RMVPE.
- nsf_hifigan/: Implements the NsfHiFiGAN vocoder for high-quality audio synthesis.
- ddsp/: Implements the DDSP framework for voice conversion and synthesis, including pitch extraction and differentiable DSP operations.
This modular structure ensures that each folder encapsulates a specific aspect of the audio processing pipeline, enabling scalability, maintainability, and ease of integration.
Implementation Details
1. Audio Separation
The first step in the pipeline is separating the input audio into vocals and instruments. This is achieved using the AudioSeparatorService class, which leverages a pre-trained Demucs model for source separation.
- Input: A single audio file uploaded by the user.
- Output: Two separate audio files (vocals and instruments) saved as temporary files.
Code Snippet:
# Split the uploaded audio into vocal and instrumental stems (sf = soundfile)
vocals, instruments, sr = await separator_service.separate_tracks(input_path)
sf.write(vocals_path, vocals, sr)            # save the vocal stem
sf.write(instruments_path, instruments, sr)  # save the instrumental stem
- Key Steps:
  - The input audio is validated and converted to a compatible format.
  - The separate_tracks method processes the audio and returns the separated tracks.
  - The separated tracks are normalized and saved to temporary files.
2. Voice Conversion
The second step involves converting the vocals using a DDSP-SVC (Differentiable Digital Signal Processing - Singing Voice Conversion) model. This is handled by the DDSPService class.
- Input: The separated vocals file and user-specified parameters (e.g., speaker ID, pitch adjustment).
- Output: A converted vocals file.
Code Snippet:
# Run DDSP-SVC inference on the separated vocals
result_path = ddsp_service.infer(
    input_path=vocals_path,               # separated vocals from the previous stage
    output_path=converted_vocals_path,    # where the converted vocals are written
    spk_id=speaker_id,                    # target speaker / voice identity
    key=key,                              # pitch shift in semitones
    enhance=enhance,                      # whether to run the enhancer on the output
    pitch_extractor=pitch_extractor,      # e.g. parselmouth, pyworld, or rmvpe
    f0_min=f0_min,                        # lower bound for pitch extraction (Hz)
    f0_max=f0_max,                        # upper bound for pitch extraction (Hz)
    threhold=threhold,                    # volume threshold (spelling follows the upstream DDSP-SVC argument)
    enhancer_adaptive_key=enhancer_adaptive_key
)
- Key Steps:
  - The infer method applies the DDSP-SVC model to the vocals file.
  - Parameters such as pitch adjustment and speaker ID are passed to customize the conversion.
  - The converted vocals are saved to a temporary file.
3. Audio Merging
The final step merges the converted vocals with the original instruments. This is performed using the merge_tracks method of the AudioSeparatorService.
- Input: The converted vocals file, the instruments file, and volume multipliers for each track.
- Output: A merged audio file.
Code Snippet:
# Mix the converted vocals back with the original instrumental track
merged_path = await separator_service.merge_tracks(
    vocals_path=converted_vocals_path,
    instruments_path=instruments_path,
    output_path=output_path,
    vocals_volume=vocals_volume,            # relative gain applied to the vocal stem
    instruments_volume=instruments_volume   # relative gain applied to the instrumental stem
)
- Key Steps:
  - The merge_tracks method adjusts the volume of each track and combines them into a single audio file.
  - The merged file is saved to a static directory for client access.
Sequence Diagram
Below is a high-level sequence diagram illustrating the pipeline:
User -> FastAPI Endpoint (/audio/process-all)
-> AudioSeparatorService.separate_tracks()
-> Separate vocals and instruments
-> DDSPService.infer()
-> Convert vocals
-> AudioSeparatorService.merge_tracks()
-> Merge converted vocals with instruments
-> Return JSON response with file URLs
Error Handling and Cleanup
- Error Handling: Each stage of the pipeline is wrapped in a try-except block to catch and log errors. If an error occurs, an appropriate HTTP response is returned.
- Temporary File Cleanup: Temporary files are deleted in the finally block to prevent resource leaks.
Code Snippet:
finally:
    # Remove any temporary files created during the pipeline run
    for path in temp_files:
        if path and os.path.exists(path):
            os.unlink(path)
Conclusion
The audio processing pipeline is a robust and modular implementation that leverages state-of-the-art machine learning models and efficient file handling techniques. By combining FastAPI with specialized audio libraries, the project achieves high performance and scalability while maintaining code clarity and reusability.
Unity frontend
Virtual Instruments
Feature for playing virtual instruments with multiple input methods:
The script clickableButton.cs implements an interactive button system that handles multiple input methods (mouse clicks, keyboard presses, and hover interactions) with visual and audio feedback.
The system works together with a KeyMappingManager and PlayModeManager.
Attach the script to every note of the instrument.

- Checks for assigned key presses to play the note sound
- Handles hover mode interactions (IsMouseOverButton() checks the position of the mouse)
- Prevents rapid repeated triggers with the hasPlayed flag (the ResetPlayFlag method resets hasPlayed to false after a set duration)

- Plays the sound effect with a temporary AudioSource (the audio file is assigned to each note in the Unity Inspector)
- Updates the output text display
- Triggers a visual highlight effect (the HightlightEffect method temporarily changes the button color to gray to visualise the click, returning to the original color after a specified duration)

- The button loads its assigned key from the KeyMappingManager based on its name (formatted as “instrument-note”)
- See the explanation of the GetKeyForNote method in the KeyMappingManager.cs script
Feature for moving notes of virtual instruments:
The movableButton.cs script implements a draggable UI button system that allows users to reposition buttons on screen when in “move-mode”. The script utilises Unity’s Event System interfaces to handle pointer interactions.
Attach the script to every note of the instrument.

- Checks if move mode is enabled via the toggle
- Calculates the offset between:
- The click position (in local space of parent RectTransform)
- The button’s current position
- Stores this offset to use during dragging

- Checks if move mode is enabled
- Converts screen position to local position within parent
- Updates button position using initial offset to maintain relative positioning
Feature for recording playing instruments:
The InstrumentRecorder.cs script implements a complete audio recording system for Unity that captures microphone input and saves it as WAV files. The system provides a user interface with start/stop controls and status feedback.
Script attached to RecordingManager (empty) object.

- Begins microphone capture for up to 60 seconds at 44.1 kHz (the duration can be modified in the script)
- Updates button states and status text (the start button is disabled and the stop button is enabled)

- Stops microphone capture
- Updates the UI (the start button is re-enabled and the stop button is disabled)
- Prepares for file saving

- Configures file browser for WAV files
- Calls the SaveAbsolutePath method in the SavWav.cs script to save the file to any specified location
- Handles both successful save and cancel scenarios
- Provides user feedback through status text (the HideStatusTextAfterDelay method clears the status message after a set delay of 3 seconds)
Settings
Feature: Selecting instrument from dropdown
The InstrumentSelector.cs script implements an instrument selection system that allows users to choose between different virtual instruments through a dropdown UI. This manages the visibility of instrument GameObjects based on user selection.
Script attached to InstrumentManager (empty) object.



- Hides all instruments when the Settings page loads (HideAllInstruments calls SetActive(false) on each instrument GameObject)
- Sets default dropdown selection to “Select Instrument”
- Adds listener for selection changes

- Hides all instruments to ensure no instruments are active
- Uses switch-case to show selected instrument
Feature for assigning keys to notes:
The KeyMappingManager.cs script implements a comprehensive key binding system that allows users to assign custom keyboard keys to different instrument notes. It serves as a persistent singleton that maintains key mappings across all scenes in the project and provides visual feedback when selecting notes on the Settings page.
Script attached to KeyMappingManager (empty) object in Settings and Dashboard scenes.


- Implements singleton pattern to remain persistent across scene loads
- Subscribes to scene change events

- Parses button names (following the format of “instrument-note”)
- Initiates key listening mode
- Provides user feedback (the ShowMessage method displays the feedback message)

- Checks all possible keys each frame
- Prevents duplicate assignments (IsKeyAlreadyAssignedForInstrument checks the key mapping if a key is already assigned for the specific instrument)
- Handles successful assignments

- Maintains two separate dictionaries for efficient lookup
- Handles overwriting existing mappings
- Organises by instrument

- Updates the TMP_Text object called “AssignedKeyText” (attached as a child of each note) to display the assigned key for that note

- Ensures that all assigned key texts are updated when a scene is loaded

- Gets the assigned key for a specific note and instrument
- Used in clickableButton.cs to load the assigned key for each note of the virtual instruments
- The GetNoteForKey method works the same way, but instead returns the note for a given key press and instrument
Feature for switching between different play modes for virtual instruments:
The PlayModeManager.cs script implements a persistent play mode selection system that allows users to switch between different interaction modes (Click to play and Hover to play) for the virtual instruments. It maintains the selected mode across different scenes.
Script attached to PlayModeManager (empty) object in Settings and Dashboard scenes.

- Defines available play modes (click and hover)
- Defaults to click mode
- Implements the singleton pattern and persists across scenes
- Subscribes to scene load events

- Handles mode selection changes
- Updates current mode
Multiplayer
Overview
This project implements a basic multiplayer system using Unity Relay and Netcode for GameObjects (NGO). The networking is peer-hosted, where one player acts as the host and others join as clients via a Unity Relay Join Code.
Key Libraries Used
using Unity.Netcode;
using Unity.Services.Relay;
using Unity.Networking.Transport;
Server & Network Architecture
- Host: The player who starts the session acts as the game server.
- Unity Relay: Forwards data between host and clients.
- Clients: Connect through Relay using a Join Code.
- Network Management: The host manages all state synchronization, player interactions, and object spawning.
Port Information
| Component | Detail |
| --- | --- |
| Relay Port | Dynamically assigned by Unity |
| Client → Relay | Client sends data to Relay |
| Relay → Host | Relay forwards data to the host |
Networking Flow
1. Host Starts the Game
- The host creates a Relay session using Unity’s Relay SDK.
- A Join Code (e.g., FCQRWF) is generated.
- Unity assigns a dynamic port for the session.
- Unity Transport connects through Relay.

2. Client Joins the Session
- The client enters the Join Code.
- A connection request is sent to Relay.
- Relay forwards the request to the host, establishing a connection.

3. Gameplay Events & RPC Communication
- When a player interacts (e.g., presses a button), a ServerRpc is sent to the host.
- The host processes the action and sends a ClientRpc to update all players.
Sound Synchronization
Audio feedback is handled by the NetworkAudioManager. Here’s how sound events are synced across players:
- The client triggers a sound action.
- A ServerRpc is sent to the host: PlaySoundServerRpc().
- The host sends a ClientRpc to all players: PlaySoundClientRpc().


Handling Player Disconnects
- When a client disconnects, Unity Relay notifies the host.
- The host removes the player from the active session and updates any necessary game state.