Getting Started
Welcome to OpenVoiceLab - a web interface for working with VibeVoice TTS models.
What is OpenVoiceLab?
OpenVoiceLab is a web interface for working with text-to-speech (TTS) models. Instead of dealing with command-line tools and scattered scripts, you get a Gradio interface that handles the workflow.
Quick Start
Choose your path based on what you want to do:
Path 1: Try it out (5 minutes)
- Install OpenVoiceLab
- Generate speech with a pretrained model
- No training needed - works immediately with voice cloning
Path 2: Finetune a custom voice (several hours)
- Install OpenVoiceLab
- Collect 30+ minutes of audio (more is better)
- Process your audio data (10-20 min)
- Train a custom voice (several hours)
- Generate speech with your voice
Most people come here to finetune - that's Path 2.
Understanding Voice Cloning vs Finetuning
Voice Cloning (Zero-Shot) - No training required
- Provide a short audio sample (~30s)
- Model mimics it on the fly
- Works immediately
- Quality is mediocre compared with a finetuned voice
- Good for experimenting
Finetuning - Requires training
- Provide 30+ minutes of audio
- Model learns the voice
- Much better quality and consistency
You can also combine both approaches, but finetuning is generally better.
How Finetuning Works
OpenVoiceLab uses LoRA (Low-Rank Adaptation) to finetune VibeVoice efficiently:
- Data Preparation - Upload raw audio files; OpenVoiceLab segments and transcribes them automatically
- Training - Trains a LoRA adapter so the model sounds like your voice
- Generation - Load your adapter to generate speech in your trained voice
The adapter is only a few hundred MB and trains on consumer GPUs (16+ GB VRAM).
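To give a feel for why a LoRA adapter stays small, here is a minimal sketch of the general LoRA idea: the base weight matrix W stays frozen and only two low-rank matrices B and A are trained, so the effective weight is W + (alpha / rank) * B @ A. The layer size, rank, and function names below are illustrative assumptions, not VibeVoice's or OpenVoiceLab's actual configuration or code.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_params, lora_params) for one d_out x d_in weight matrix."""
    full = d_out * d_in              # parameters in the frozen base matrix W
    lora = rank * (d_out + d_in)     # parameters in B (d_out x rank) and A (rank x d_in)
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Compute W + (alpha / rank) * B @ A without modifying W."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size and rank, chosen only to make the arithmetic concrete.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora)             # 16777216 65536
print(f"{lora / full:.2%}")   # 0.39%
```

In this illustrative case the adapter trains well under 1% of the parameters of the matrix it adapts, which is the reason adapters are much smaller than the base model and cheaper to train.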
Requirements
Hardware Requirements
For Training:
- NVIDIA GPU with 16+ GB VRAM (recommended for the 1.5B model)
- For example, an RTX 3090, 4090, or similar
- Apple Silicon Macs work but are slower
- 24+ GB VRAM for best performance
For Inference (Generating Speech):
- 8+ GB VRAM (it can even run on CPU, just slower)
- Works on most modern computers
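The thresholds above can be summarized as a small lookup. This is only a sketch: `hardware_tier` is a hypothetical helper, not part of OpenVoiceLab, and it expects you to supply your GPU's VRAM in GB yourself rather than detecting it.

```python
def hardware_tier(vram_gb):
    """Map available VRAM (in GB) to the tiers listed in this guide."""
    if vram_gb >= 24:
        return "training (best performance)"
    if vram_gb >= 16:
        return "training"
    if vram_gb >= 8:
        return "inference"
    # Below 8 GB, inference can still fall back to the CPU, just slower.
    return "inference on CPU (slower)"


for gb in (24, 16, 12, 4):
    print(gb, "GB ->", hardware_tier(gb))
```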
Software Requirements
- Python 3.9 or newer
- That's it! OpenVoiceLab handles the rest
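If you are not sure which Python you have, a quick check like the following can confirm it meets the 3.9 minimum. The helper name `python_is_supported` is made up for this example; only `sys.version_info` is standard Python.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this guide


def python_is_supported(version_info=None):
    """Return True if the given (or current) interpreter is 3.9 or newer."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


print("Python OK" if python_is_supported() else "Please upgrade to Python 3.9 or newer")
```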
Data Requirements
- Minimum: 30 minutes of clean audio
- Recommended: 1-3 hours of audio
- Best: 3+ hours of varied audio
Audio quality matters more than quantity! 30 minutes of clean audio beats 3 hours of noisy audio.
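To see where a dataset falls against these targets, you could total up your recordings before uploading them. The sketch below assumes your files are uncompressed PCM `.wav` files in a single folder, which is an assumption of this example and not a statement about which formats OpenVoiceLab accepts; the function names and the `my_recordings` folder are hypothetical. It measures duration only and says nothing about audio quality.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of one PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(folder):
    """Total duration, in minutes, of all .wav files directly inside folder."""
    files = sorted(Path(folder).glob("*.wav"))
    return sum(wav_duration_seconds(p) for p in files) / 60


def dataset_tier(minutes):
    """Classify a total duration against the targets in this guide."""
    if minutes < 30:
        return "below minimum"
    if minutes < 60:
        return "minimum"
    if minutes < 180:
        return "recommended"
    return "best"


# Example usage with a hypothetical folder name:
# minutes = total_minutes("my_recordings")
# print(f"{minutes:.1f} min -> {dataset_tier(minutes)}")
```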
Next Steps
Ready to dive in? Follow these guides in order:
- Installation - Set up OpenVoiceLab on your computer
- Quick Start - Try generating speech with a pretrained model
- Data Preparation - Prepare your audio files for training (you must do this before finetuning)
- Finetuning - Train your custom voice model
Need Help?
- Check the FAQ for common questions
- Read the Troubleshooting guide if you encounter issues
- Join our Discord community for support
A Quick Note
OpenVoiceLab is currently in beta. Things work well, but you might encounter rough edges. The community is actively improving the project, and your feedback is valuable!
If something doesn't work as expected, it's probably not your fault - let us know on Discord or GitHub.