Research

Contents

Jamboxx overview and key features
Technology review
Summary of technical decisions
References

Research Methodology and Findings

Our project, Jamboxx-Infinite, builds on the original Jamboxx project. As developers with no prior music experience, we relied on Jamboxx to understand music-related functionality and user interaction. Below we review the existing project and explain how it informed our development process.

Jamboxx: Overview and Key Features

Project Name: Jamboxx
Main Features:
1. Play: Users can select and play a variety of virtual musical instruments.
2. Learn: Interactive mini-games designed to teach basic music theory and instrument skills.
3. Jam: A mode for freestyle music creation and experimentation.

Lessons Learned and Improvements in Jamboxx-Infinite

Virtual Instruments (Play Mode)
- Original Implementation: Jamboxx offered a functional but simplistic interface for instrument selection and playback.
- Our Enhancements:
  - Redesigned the interface with larger, more accessible icons to accommodate users with disabilities, particularly those relying on motion input.
  - Introduced a more graphical and intuitive UI, making it easier for school-aged children to navigate and engage with the instruments.

Educational Mini-Games (Learn Mode)
- Original Implementation: Jamboxx included basic games to teach music fundamentals.
- Our Enhancements:
  - Expanded the variety of mini-games to cover a broader range of music concepts.
  - Revamped the UI with cartoon-inspired visuals and a child-friendly color palette to increase appeal and engagement for younger users.
  - Integrated progressive difficulty levels to cater to different age groups and skill levels.

By building on Jamboxx’s foundation, Jamboxx-Infinite not only preserves the core functionalities but also introduces significant accessibility and user experience improvements, making it more inclusive and engaging for its target audience.

Technology Review

Voice Cloning

Model Selection
- TTS (Text-to-Speech) vs. SVC (Singing Voice Conversion) Models:
  - TTS synthesizes speech from text, so it cannot convert the voice in an existing recording.
  - SVC replaces the voice in an audio file while preserving the original pitch and timing, which makes it the better fit for music applications.
- Chosen Model: DDSP-SVC [1]
  - DDSP-SVC is built on differentiable digital signal processing (DDSP), which combines classical signal-processing components with neural networks to achieve high-quality voice conversion with minimal artifacts.
  - Alternative Considered: So-VITS-SVC [2]
    - So-VITS-SVC is user-friendly but needs more training data. We chose DDSP-SVC for its real-time performance and greater stability when trained on limited datasets.

Development Stack
- Backend: Python + FastAPI
  - Why not Java/Spring Boot?
    - Python has stronger ML library support (e.g., PyTorch, Transformers) and faster prototyping.
  - Key Libraries:
    - demucs: Audio source separation.
    - librosa/pyworld: Pitch and audio analysis.
    - torchcrepe: Pitch estimation.
    - transformers: Integration with pre-trained models.
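
To make "pitch estimation" concrete, here is a minimal sketch of what such a step computes: one fundamental-frequency (f0) value for a frame of audio. It is not how torchcrepe works internally (torchcrepe uses a neural network); plain autocorrelation is swapped in so the example runs with the standard library only. The function names, the 150-500 Hz search range, and the synthetic test tone are all illustrative assumptions.

```python
import math


def sine_wave(freq_hz: float, sample_rate: int, n_samples: int) -> list[float]:
    """Generate a pure tone, standing in for one frame of a sung note."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def estimate_f0(frame: list[float], sample_rate: int,
                fmin: float = 150.0, fmax: float = 500.0) -> float:
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Only lags corresponding to [fmin, fmax] are searched. Keeping this range
    narrower than one octave is a simplification that avoids octave errors.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(frame) - lag
        # Mean product of the frame with a copy of itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    frame = sine_wave(220.0, sample_rate=16000, n_samples=2048)
    print(f"estimated f0: {estimate_f0(frame, 16000):.1f} Hz")  # close to 220 Hz
```

A voice conversion model needs a pitch track like this so that the converted voice sings the same notes as the original; the resolution here is limited to whole-sample lags, which is one reason a dedicated estimator is used in practice.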

AI Teacher

Framework: llama.cpp
- Why not Hugging Face Transformers?
  - llama.cpp is built for local, offline inference and can run quantized models on consumer hardware, including CPU-only machines. Lower hardware requirements are critical for accessibility.
- Model: Mistral-7B
  - The Mistral 7B paper [3] reports that it outperforms the larger Llama 2 13B on all benchmarks evaluated, and the model is released under the permissive Apache 2.0 license.
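
The hardware argument can be checked with rough arithmetic. The sketch below estimates the memory needed for the model weights alone at two precisions. The parameter count (about 7.24 billion) is approximate, and the 4.5 bits per weight corresponds to llama.cpp's Q4_0 quantization; whether the project uses that particular format is an assumption made here for illustration. Runtime overhead such as the context cache is ignored.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count

# 16-bit weights, the usual unquantized format.
fp16_gib = weight_memory_gib(MISTRAL_7B_PARAMS, 16)

# Assumed 4-bit quantization at 4.5 bits per weight (4-bit values plus
# one 16-bit scale per block of 32 weights).
q4_gib = weight_memory_gib(MISTRAL_7B_PARAMS, 4.5)

if __name__ == "__main__":
    print(f"fp16 weights:  {fp16_gib:.1f} GiB")
    print(f"4-bit weights: {q4_gib:.1f} GiB")
```

Under these assumptions the quantized weights need under a third of the memory of the 16-bit weights, which is the difference between requiring a dedicated GPU and fitting in the RAM of an ordinary laptop.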
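
Compilation: Nuitka
- Why compile Python?
  - Compiling can improve startup time and makes the source harder to recover than shipping plain .py files.
- Why not PyInstaller/Cython?
  - PyInstaller bundles the interpreter with bytecode rather than compiling the code, and Cython is designed for compiling individual modules. Nuitka compiles the whole program to C while staying compatible with standard Python code.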
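
A build step along these lines could be scripted as follows. This is a sketch, not the project's actual build script: the entry-point name `main.py` and the `build` output directory are hypothetical, while `--standalone` and `--output-dir` are standard Nuitka options.

```python
import subprocess
import sys


def nuitka_command(entry_script: str, output_dir: str = "build") -> list[str]:
    """Assemble a Nuitka invocation for a self-contained build."""
    return [
        sys.executable, "-m", "nuitka",
        "--standalone",                # bundle dependencies with the binary
        f"--output-dir={output_dir}",  # keep build artifacts out of the source tree
        entry_script,
    ]


if __name__ == "__main__":
    # "main.py" is a hypothetical entry point; requires Nuitka to be installed.
    subprocess.run(nuitka_command("main.py"), check=True)
```

Invoking Nuitka through `sys.executable -m nuitka` ensures the build uses the same interpreter and installed packages as the development environment.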

Summary of Technical Decisions

Component	Choice	Reason
Voice Cloning	DDSP-SVC	Real-time, stable, minimal artifacts
Backend	Python + FastAPI	ML ecosystem, rapid development
AI Teacher	Mistral-7B + llama.cpp	Offline support, performance
Compilation	Nuitka	Efficiency, compatibility

References

[1] DDSP-SVC GitHub Repository. (2024). Available: https://github.com/yxlllc/DDSP-SVC

[2] So-VITS-SVC GitHub Repository. (2024). Available: https://github.com/svc-develop-team/so-vits-svc

[3] A. Q. Jiang et al. Mistral 7B. (2023). Available: arXiv:2310.06825.