FAQ
Common questions about OpenVoiceLab and VibeVoice.
General Questions
What is OpenVoiceLab?
A web interface for finetuning and running VibeVoice text-to-speech models. It handles data preparation, training, and speech generation through a Gradio UI.
What is VibeVoice?
A text-to-speech model developed by Microsoft (now community-maintained). It can generate long-form conversational speech with multiple speakers.
Is OpenVoiceLab free?
Yes. OpenVoiceLab is open source under BSD-3-Clause license. VibeVoice models are also freely available.
Do I need a GPU?
For inference: No, but CPU is slow. GPU recommended for reasonable speed.
For training: Yes, you need an NVIDIA GPU with 16+ GB VRAM. Training on CPU is impractically slow.
Which GPU do I need?
Training: RTX 3090, 4090, A5000, or similar with 16+ GB VRAM.
Inference: Any GPU with 8+ GB VRAM works. CPU also works but is slow.
Does it work on Mac?
Yes. Apple Silicon (M1/M2/M3) Macs use MPS for acceleration. Works for both training and inference, though slower than high-end NVIDIA GPUs.
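OpenVoiceLab picks the device automatically. Conceptually, the selection looks something like this PyTorch sketch (illustrative, not the project's exact code):

```python
import torch

# Prefer NVIDIA CUDA, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
```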
Can I run this in the cloud?
Yes. Use any cloud GPU (Google Colab, Runpod, Vast.ai, etc.). Just install OpenVoiceLab and run it. Use --share flag to get a public URL.
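The --share flag maps to Gradio's built-in tunneling. A minimal Gradio sketch of what it does (the echo function and interface here are placeholders, not OpenVoiceLab's actual app):

```python
import gradio as gr

def echo(text):
    return text

# share=True asks Gradio to open a temporary public URL (a *.gradio.live tunnel),
# which is how the --share flag exposes the UI from a cloud machine.
gr.Interface(fn=echo, inputs="text", outputs="text").launch(share=True)
```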
Data and Training
How much audio do I need?
Minimum: 30 minutes of clean audio (for testing)
Recommended: 1-3 hours
More data generally helps, but quality matters more than quantity.
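To check how much usable audio you have before training, you can total the durations with the soundfile library (a hypothetical check script, not part of OpenVoiceLab; the folder name is a placeholder):

```python
from pathlib import Path
import soundfile as sf

# Sum the duration of every WAV/FLAC file in the dataset folder.
total = sum(sf.info(str(p)).duration
            for p in Path("my_dataset").rglob("*")
            if p.suffix.lower() in {".wav", ".flac"})
print(f"Total audio: {total / 60:.1f} minutes")  # aim for 30+ min, ideally 1-3 hours
```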
What audio format do I need?
Input: MP3, WAV, FLAC, M4A
OpenVoiceLab converts everything automatically during processing.
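The conversion happens automatically, but if you want to pre-convert files yourself, here is a pydub sketch (pydub needs ffmpeg installed; the 24 kHz mono target is an assumption, not a documented OpenVoiceLab requirement):

```python
from pydub import AudioSegment

# Convert any supported input (MP3, M4A, FLAC, ...) to mono WAV.
audio = AudioSegment.from_file("input.m4a")
audio = audio.set_channels(1).set_frame_rate(24000)  # assumed target rate
audio.export("output.wav", format="wav")
```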
Can I train on multiple speakers?
You can, but results may be mixed. Better to train separate models for each speaker.
Can I pause and resume training?
Not currently. Training runs must complete or be stopped. You can use early checkpoints if you stop training before completion.
How do I know if training worked?
Test it in the Inference tab. Load your checkpoint and generate speech. If it sounds like your training data, it worked.
Why doesn't my model sound like the training voice?
Possible reasons:
- Not enough training data
- Poor quality data
- Not enough epochs
- LoRA rank too low
- Model overtrained
Try training longer or with more data.
Can I finetune on non-English audio?
Yes! While VibeVoice officially supports only English and Chinese, community members have reported good results in many other languages, including German, French, Spanish, and Arabic.
Training vs Voice Cloning
What's the difference between finetuning and voice cloning?
These are two different approaches to customizing the voice:
Voice Cloning (Zero-Shot):
- No training required
- Provide a 10-30 second audio sample
- Model tries to mimic that voice on the fly
- Fast - works immediately
- Quality varies - not perfect replication
- This is the "Enable Voice Cloning" checkbox in Inference tab
Finetuning (Training):
- Requires a training run (2-6 hours)
- Needs 30 minutes to several hours of audio
- Model actually learns the voice patterns
- Creates a LoRA adapter specific to that voice
- Better quality and consistency than zero-shot
- This is what the Training tab does
Which should you use?
- Just experimenting? Use voice cloning (zero-shot)
- Need quick results? Use voice cloning
- Want best quality? Finetune a model
- Have lots of audio? Definitely finetune
- Professional project? Finetune
You can also combine them - finetune a model, then use voice cloning on top for extra guidance.
Can I use VibeVoice without finetuning?
Yes! The pretrained model works out of the box. Just:
- Load the model in Inference tab
- Enable voice cloning
- Select a reference voice
- Generate speech
The model will try to match the reference voice without any training.
Do I need to finetune to use OpenVoiceLab?
No. You can use just the inference features with the pretrained model and voice cloning.
Finetuning is optional - for when you want better quality on a specific voice.
Inference and Generation
Why is generation slow?
Large models are slow. An RTF (Real-Time Factor, generation time divided by audio duration) of 0.5-1.0 is normal on consumer GPUs; at RTF 0.5, one minute of audio takes about 30 seconds to generate. Use a GPU (not CPU) for better speed.
What is CFG scale?
Classifier-Free Guidance scale. Controls how closely the model follows a reference voice.
- Low (1.0-1.2): More creative, less similar to reference
- Medium (1.3-1.5): Balanced
- High (1.6-2.0): Very similar to reference
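Under the hood, classifier-free guidance typically blends a conditional prediction (with the reference voice) and an unconditional one. A generic sketch of the standard formula, not VibeVoice's exact code:

```python
# Standard classifier-free guidance combination:
# scale 1.0 keeps the unconditional-plus-difference baseline;
# higher scales push the output further toward the reference-conditioned prediction.
def apply_cfg(pred_uncond, pred_cond, cfg_scale):
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
```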
What is voice cloning?
Using a reference audio sample to guide speech generation. The model tries to match the style and characteristics of the reference.
Why does my output have background music?
VibeVoice sometimes generates spontaneous background sounds or music. This is a known behavior from the model's training data.
Try:
- Different reference voice
- Different text
- Regenerating (it's somewhat random)
Can I generate multiple speakers?
Yes, but you need to generate each speaker separately and combine the audio files yourself. OpenVoiceLab doesn't have built-in multi-speaker generation.
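To stitch the per-speaker clips together yourself, a pydub sketch (filenames and pause length are placeholders):

```python
from pydub import AudioSegment

# Concatenate per-speaker clips in conversation order, with short pauses between turns.
clips = ["speaker1_line1.wav", "speaker2_line1.wav", "speaker1_line2.wav"]
pause = AudioSegment.silent(duration=300)  # 300 ms between turns

conversation = AudioSegment.empty()
for path in clips:
    conversation += AudioSegment.from_wav(path) + pause
conversation.export("conversation.wav", format="wav")
```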
How long can I generate?
The 1.5B model supports up to 90 minutes of audio per generation; the 7B model supports up to 45 minutes.
Can I control emotion or emphasis?
Not directly. You can:
- Use reference voices with different emotions
- Include cues in text (exclamations, questions)
- Adjust CFG scale
- Try different reference samples
Technical Questions
What is LoRA?
Low-Rank Adaptation. A method for efficiently finetuning large models by training small adapter layers instead of the full model. Uses less memory and creates smaller files.
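With Hugging Face peft, a LoRA setup looks roughly like this (a generic sketch; the rank and target modules OpenVoiceLab actually uses may differ):

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                # rank: size of the low-rank adapter matrices
    lora_alpha=32,       # scaling factor applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
)
# model = get_peft_model(base_model, config)  # wraps the frozen base model
```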
Where are models stored?
Pretrained models: ~/.cache/huggingface/hub/
LoRA adapters: training_runs/run_TIMESTAMP/checkpoints/
Can I use my own model weights?
If you have VibeVoice weights, place them locally and point to that path in the Model Path field.
What Python version do I need?
Python 3.9 or newer. 3.10 or 3.11 recommended.
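To check your interpreter:

```python
import sys

# OpenVoiceLab needs Python 3.9+; 3.10 or 3.11 recommended.
print(sys.version)
assert sys.version_info >= (3, 9), "Python too old for OpenVoiceLab"
```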
Using Trained Models
Where is my trained model?
At training_runs/run_TIMESTAMP/checkpoints/
The timestamp is from when training started (e.g., run_20251008_143022).
Can I share my trained voice?
Yes. Share the entire checkpoints/ folder. Others load it as a LoRA adapter.
Note: Respect copyright and privacy. Don't share voices of people without permission.
Can I use my model with other tools?
The LoRA adapter is compatible with VibeVoice directly. You can load it in any tool that supports VibeVoice + LoRA.
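With peft, attaching an adapter to a base model generally looks like this (a generic sketch; how a given tool loads VibeVoice itself will vary):

```python
from peft import PeftModel

adapter_path = "training_runs/run_20251008_143022/checkpoints"
# `base_model` is assumed to be the pretrained VibeVoice model, already
# loaded by the host tool; PeftModel attaches the LoRA weights on top.
model = PeftModel.from_pretrained(base_model, adapter_path)
```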
How do I delete old models?
Delete the entire training_runs/run_TIMESTAMP/ folder:
rm -rf training_runs/run_20251008_143022
Troubleshooting
Why won't my model load?
Check:
- Path is correct
- Model files exist
- Enough VRAM available
- No other GPU processes interfering
See Troubleshooting for more.
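To check free VRAM from Python before loading (PyTorch, NVIDIA GPUs only):

```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Free VRAM: {free / 1e9:.1f} / {total / 1e9:.1f} GB")
```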
Why is data processing failing?
Common causes:
- Invalid audio files
- No speech detected (VAD issue)
- Out of memory
- Permissions issues
Check logs in logs/openvoicelab.log.
Where can I get help?
Check the guides for detailed walkthroughs, search existing GitHub issues, or ask on Discord. See "Didn't Find Your Answer?" at the bottom of this page.
Best Practices
What's the best workflow?
- Start with 30 minutes of clean audio
- Process it in Data tab
- Train with default settings (3 epochs, batch size 4)
- Test in Inference tab
- If quality is poor, train more epochs or get better data
How can I improve quality?
Data quality:
- Cleaner audio
- More audio (1+ hours)
- Single speaker
- Varied content
Training:
- More epochs (5-7)
- Higher LoRA rank (16-32)
- Good batch size for your GPU
Inference:
- Good reference voices
- Adjust CFG scale
- Try multiple generations
Should I use voice cloning?
Try both with and without. Sometimes voice cloning helps consistency, sometimes the pure finetuned voice sounds better.
Experiment to see what works for your case.
How often should I retrain?
- If results are poor, try different training settings
- If you get more audio data, train a new model
- Otherwise, one trained model can be used indefinitely
Project Status
Is OpenVoiceLab ready for production?
It's in beta. Core features work, but:
- Some rough edges remain
- Training results can vary
- Documentation is improving
Use it for experiments and projects, but expect some issues.
What features are planned?
See the GitHub repository for roadmap and issues.
Planned improvements include:
- Better audio chunking
- Voice prompt drop rate support
- More training options
Can I contribute?
Yes. OpenVoiceLab is open source. Contributions welcome on GitHub.
How is this different from other TTS tools?
OpenVoiceLab focuses specifically on VibeVoice finetuning with a user-friendly interface. Other tools may support different models or approaches.
Didn't Find Your Answer?
- Check the guides for detailed walkthroughs
- Search GitHub issues
- Ask on Discord
Still stuck? Open a GitHub issue with details about your problem.