AI Voice Conversion Development Blog #1
Development Blog: Enhancing the Audio Processing Backend
Date: 2025-01-05
Author: Wesley Xu
Overview
In this development cycle, we focused on expanding the functionality of the audio processing backend by integrating new workflows, debugging existing features, and updating the documentation. These updates aim to improve the system’s capabilities in voice interaction, encoding/decoding, and diffusion-based audio synthesis. Below, we provide a detailed breakdown of the changes made during this update.
1. Updated requirements.txt
Commit: 00b50b8
Description:
To support the new features and workflows, we updated the requirements.txt file to include additional dependencies. These dependencies are essential for running the new functionalities, such as voice interaction and diffusion workflows.
Key Changes:
- Added libraries for advanced audio processing, such as `torch`, `librosa`, and `soundfile`.
- Included dependencies for diffusion-based synthesis, such as `diffusers` and `transformers`.
- Ensured compatibility with the latest versions of `FastAPI` and `Pydantic`.
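For reference, the new additions to requirements.txt look roughly like the sketch below; the grouping and unpinned entries are illustrative, not the repository's exact file.

```
# Advanced audio processing
torch
librosa
soundfile

# Diffusion-based synthesis
diffusers
transformers

# API layer
fastapi
pydantic
```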
Impact: This update ensures that all required libraries are installed, reducing setup time for new developers and improving the reliability of the system.
2. Added Voice Interaction Workflow
Commit: 8ae9b90
Description:
We integrated a voice interaction workflow using the DDSP (Differentiable Digital Signal Processing) framework. This workflow allows users to interact with the system by converting their voice into a target speaker’s voice with enhanced quality and pitch control.
Key Features:
- Voice Conversion: Converts the input voice to a target speaker’s voice using DDSP-SVC.
- Pitch Control: Allows users to adjust the pitch of the converted voice using semitone-based controls (see the formula sketch after this list).
- Enhancement: Integrated the NsfHiFiGAN vocoder for high-quality audio synthesis.
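To make the semitone control concrete: shifting by n semitones multiplies the fundamental frequency by 2^(n/12). The helper below is a minimal sketch of that arithmetic, not the project's actual pitch-shifting code.

```python
import numpy as np

def shift_f0(f0_hz: np.ndarray, semitones: float) -> np.ndarray:
    """Scale an f0 contour by a semitone offset (12-tone equal temperament).

    Shifting by n semitones multiplies frequency by 2**(n / 12), so +12
    doubles the pitch and -12 halves it. Unvoiced frames (f0 == 0) are
    left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return np.where(f0_hz > 0, f0_hz * factor, f0_hz)

# Example: raising 220 Hz (A3) by 3 semitones gives ~261.6 Hz (C4).
print(shift_f0(np.array([220.0, 0.0]), 3.0))
```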
Implementation Details:
- Added a new endpoint `/voice/convert` in `router.py` to handle voice conversion requests (a hedged sketch follows this list).
- Utilized the `DDSPService` class to process the input audio and apply the voice conversion model.
- Enhanced the workflow with adaptive pitch extraction using RMVPE and Parselmouth.
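The sketch below shows roughly what the endpoint looks like. The `DDSPService.convert` signature and import path are assumptions for illustration; only the `/voice/convert` route and the use of `DDSPService` come from the actual change.

```python
# router.py (sketch) -- the DDSPService interface shown here is assumed.
from fastapi import APIRouter, File, Form, UploadFile
from fastapi.responses import Response

from services.ddsp_service import DDSPService  # hypothetical import path

router = APIRouter()
service = DDSPService()

@router.post("/voice/convert")
async def convert_voice(
    audio: UploadFile = File(...),
    speaker_id: int = Form(0),
    pitch_shift: float = Form(0.0),  # offset in semitones
):
    """Convert the uploaded voice into the target speaker's voice."""
    input_bytes = await audio.read()
    # Assumed interface: returns the converted audio as WAV bytes.
    output_bytes = service.convert(input_bytes, speaker_id, pitch_shift)
    return Response(content=output_bytes, media_type="audio/wav")
```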
Challenges:
- Ensuring compatibility between different pitch extraction algorithms.
- Handling edge cases where the input audio quality is poor or contains noise.
Outcome: This workflow significantly improves the system’s ability to handle voice conversion tasks, making it more versatile for real-world applications.
3. Added Voice Encode-Decode Workflow
Commit: aa1bd2d
Description:
We implemented a voice encode-decode workflow to extract and reconstruct audio features. This workflow is essential for tasks like feature extraction, voice cloning, and audio synthesis.
Key Features:
- Encoding: Extracts content vectors and pitch information from the input audio using models like HuBERT and RMVPE (an illustrative sketch follows this list).
- Decoding: Reconstructs audio from the extracted features using DDSP and NsfHiFiGAN.
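As an illustration of the encoding step, the runnable sketch below extracts content features with a generic pretrained HuBERT from torchaudio, plus a rough f0 contour. It stands in for the project's own checkpoints and RMVPE extractor, so the outputs are only illustrative.

```python
import torch
import torchaudio

# Load audio, mix down to mono, and resample to the 16 kHz HuBERT expects.
waveform, sr = torchaudio.load("input.wav")
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

# Content vectors: frame-level features from a pretrained HuBERT.
bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()
with torch.no_grad():
    features, _ = hubert.extract_features(waveform)
units = features[-1]  # (batch, frames, 768) last-layer content vectors

# Pitch: a rough f0 contour (a simple stand-in for RMVPE).
f0 = torchaudio.functional.detect_pitch_frequency(waveform, sample_rate=16_000)

print(units.shape, f0.shape)
```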
Implementation Details:
- Updated `encoder/hubert/model.py` to support feature extraction using HuBERT.
- Enhanced `ddsp/unit2control.py` to map extracted features to control signals for synthesis.
- Added utilities for mel-spectrogram computation and audio reconstruction in `nsf_hifigan/models.py` (an illustrative mel computation follows this list).
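Since NsfHiFiGAN conditions on mel-spectrograms, the computation typically looks like the sketch below. The parameter values (44.1 kHz, 2048-point FFT, 512-sample hop, 128 mel bins) are common vocoder defaults, not necessarily the exact configuration in `nsf_hifigan/models.py`.

```python
import librosa
import numpy as np

def compute_mel(path: str) -> np.ndarray:
    """Compute a log-mel-spectrogram suitable for vocoder conditioning."""
    audio, sr = librosa.load(path, sr=44_100)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=2048, hop_length=512, n_mels=128
    )
    # Log-compress, clamping tiny values to avoid log(0).
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

mel = compute_mel("input.wav")
print(mel.shape)  # (n_mels, frames)
```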
Challenges:
- Ensuring the accuracy of feature extraction and reconstruction.
- Managing computational overhead for real-time processing.
Outcome: This workflow enables advanced audio manipulation capabilities, paving the way for future features like voice cloning and style transfer.
4. Debugged and Added Diffusion Workflow
Commit: 638ba9b
Description:
We debugged and integrated a diffusion-based audio synthesis workflow. Diffusion models are state-of-the-art generative models that produce high-quality audio by iteratively refining noise.
Key Features:
- Diffusion Synthesis: Generates audio from latent representations using a diffusion model.
- Integration: Seamlessly integrates with the existing DDSP framework for end-to-end processing.
Implementation Details:
- Updated `ddsp/core.py` to include utilities for diffusion-based synthesis (a generic sampling sketch follows this list).
- Enhanced `ddsp/loss.py` to support loss functions specific to diffusion models.
- Added debugging logs and error handling to ensure smooth integration.
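To make the "iteratively refining noise" idea concrete, here is a minimal DDPM-style ancestral sampling loop. It is a generic textbook sketch, not the project's actual implementation in `ddsp/core.py`, and the `model(x, t)` noise-prediction interface is an assumption.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generic DDPM sampling: start from Gaussian noise and iteratively
    denoise with a model that predicts the noise added at each step.

    `model(x, t)` is assumed to return the predicted noise; this
    interface is illustrative, not the project's API.
    """
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # x_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]))
        # Posterior mean: remove the predicted noise component.
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Add fresh noise back in on every step except the last.
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x
```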
Challenges:
- Balancing synthesis quality and computational efficiency.
- Debugging issues related to model convergence and stability.
Outcome: The diffusion workflow adds cutting-edge generative capabilities to the system, enabling the synthesis of high-quality audio from scratch.
5. Updated Documentation
Commits: d8394d6, 46b9529
Description:
We updated the README.md file to reflect the new features and workflows. The documentation now includes detailed instructions for setting up and using the system.
Key Updates:
- Installation Instructions: Added steps for installing the new dependencies.
- Feature Descriptions: Provided an overview of the voice interaction, encode-decode, and diffusion workflows.
- Acknowledgments: Credited the libraries and frameworks used in the project.
Impact: The updated documentation makes it easier for new developers and users to understand and utilize the system.
Summary of Changes
| Feature | Commit | Description |
|---|---|---|
| Updated requirements.txt | 00b50b8 | Added necessary dependencies for new workflows. |
| Voice Interaction Workflow | 8ae9b90 | Integrated voice conversion using DDSP-SVC. |
| Voice Encode-Decode Workflow | aa1bd2d | Implemented feature extraction and reconstruction workflows. |
| Diffusion Workflow | 638ba9b | Debugged and integrated diffusion-based audio synthesis. |
| Documentation Update | d8394d6, 46b9529 | Added detailed instructions and acknowledgments in README.md. |
Future Work
- Real-Time Processing: Optimize the workflows for real-time audio processing.
- Model Training: Train custom models to improve performance on specific datasets.
- User Interface: Develop a frontend to make the system more accessible to non-technical users.
This update marks a significant milestone in the development of the audio processing backend, enhancing its capabilities and laying the groundwork for future innovations.