FAQ

Common questions about OpenVoiceLab and VibeVoice.

General Questions

What is OpenVoiceLab?

A web interface for finetuning and running VibeVoice text-to-speech models. It handles data preparation, training, and speech generation through a Gradio UI.

What is VibeVoice?

A text-to-speech model developed by Microsoft (now community-maintained). It can generate long-form conversational speech with multiple speakers.

Is OpenVoiceLab free?

Yes. OpenVoiceLab is open source under BSD-3-Clause license. VibeVoice models are also freely available.

Do I need a GPU?

For inference: No, but CPU is slow. GPU recommended for reasonable speed.

For training: Yes, you need an NVIDIA GPU with 16+ GB VRAM. Training on CPU is impractically slow.

Which GPU do I need?

Training: RTX 3090, 4090, A5000, or similar with 16+ GB VRAM

Inference: Any GPU with 8+ GB VRAM works. CPU also works but is slow.

Does it work on Mac?

Yes. Apple Silicon (M1/M2/M3) Macs use MPS for acceleration. Works for both training and inference, though slower than high-end NVIDIA GPUs.
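
If you're not sure what acceleration PyTorch can see on your machine, a quick check (PyTorch is already installed in any OpenVoiceLab environment):

```python
# Report the available accelerator: CUDA GPU, Apple MPS, or CPU fallback.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    print("Apple Silicon MPS backend available")
else:
    print("No GPU acceleration found; everything will fall back to CPU")
```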

Can I run this in the cloud?

Yes. Use any cloud GPU (Google Colab, Runpod, Vast.ai, etc.). Just install OpenVoiceLab and run it; use the --share flag to get a public URL.
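
OpenVoiceLab's interface is built on Gradio, which is where the public-URL mechanism lives. As a standalone illustration (not OpenVoiceLab's actual entry point, and assuming --share maps to Gradio's share option), launch(share=True) prints a temporary public *.gradio.live URL:

```python
# Minimal Gradio app; share=True tunnels the UI to a temporary public URL.
import gradio as gr

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(share=True)
```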

Data and Training

How much audio do I need?

Minimum: 30 minutes of clean audio (for testing)

Recommended: 1-3 hours

More data generally helps, but quality matters more than quantity.

What audio format do I need?

Input: MP3, WAV, FLAC, M4A

OpenVoiceLab converts everything automatically during processing.
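
If you prefer to convert files yourself first, pydub (which wraps ffmpeg) handles any of these formats. The filenames here are hypothetical, and the 24 kHz mono target is an assumption rather than a documented requirement:

```python
# Convert an input file to mono WAV; the 24 kHz rate is an assumption,
# not a documented OpenVoiceLab requirement.
from pydub import AudioSegment

audio = AudioSegment.from_file("interview.m4a")
audio.set_frame_rate(24000).set_channels(1).export("interview.wav", format="wav")
```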

Can I train on multiple speakers?

You can, but results may be mixed. It's usually better to train a separate model for each speaker.

Can I pause and resume training?

Not currently. Training runs must complete or be stopped. You can use early checkpoints if you stop training before completion.

How do I know if training worked?

Test it in the Inference tab. Load your checkpoint and generate speech. If it sounds like your training data, it worked.

Why doesn't my model sound like the training voice?

Possible reasons:

  • Not enough training data
  • Poor quality data
  • Not enough epochs
  • LoRA rank too low
  • Model overtrained

Try training longer or with more data.

Can I finetune on non-English audio?

Yes! VibeVoice officially supports only English and Chinese, but community members report good results in many other languages, including German, French, Spanish, and Arabic.

Training vs Voice Cloning

What's the difference between finetuning and voice cloning?

These are two different approaches to customizing the voice:

Voice Cloning (Zero-Shot):

  • No training required
  • Provide a 10-30 second audio sample
  • Model tries to mimic that voice on the fly
  • Fast: works immediately
  • Quality varies: not a perfect replication
  • This is the "Enable Voice Cloning" checkbox in Inference tab

Finetuning (Training):

  • Requires a training run (2-6 hours)
  • Needs anywhere from 30 minutes to several hours of audio
  • Model actually learns the voice patterns
  • Creates a LoRA adapter specific to that voice
  • Better quality and consistency than zero-shot
  • This is what the Training tab does

Which should you use?

  • Just experimenting? Use voice cloning (zero-shot)
  • Need quick results? Use voice cloning
  • Want best quality? Finetune a model
  • Have lots of audio? Definitely finetune
  • Professional project? Finetune

You can also combine them: finetune a model, then use voice cloning on top for extra guidance.

Can I use VibeVoice without finetuning?

Yes! The pretrained model works out of the box. Just:

  1. Load the model in Inference tab
  2. Enable voice cloning
  3. Select a reference voice
  4. Generate speech

The model will try to match the reference voice without any training.

Do I need to finetune to use OpenVoiceLab?

No. You can use just the inference features with the pretrained model and voice cloning.

Finetuning is optional; use it when you want better quality on a specific voice.

Inference and Generation

Why is generation slow?

Large models are slow. A Real-Time Factor (RTF) of 0.5-1.0x (generation time divided by audio duration; lower is faster) is normal on consumer GPUs. Use a GPU rather than a CPU for better speed.
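
That definition makes it easy to estimate how long a generation will take; a quick back-of-the-envelope calculation:

```python
# Estimate wall-clock generation time from RTF
# (RTF = generation time / audio duration, so time = duration * RTF).
audio_minutes = 10
rtf = 0.8  # typical consumer-GPU value per the range above
print(f"~{audio_minutes * rtf:.0f} minutes to generate {audio_minutes} minutes of audio")
```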

What is CFG scale?

Classifier-Free Guidance scale. Controls how closely the model follows a reference voice.

  • Low (1.0-1.2): More creative, less similar to reference
  • Medium (1.3-1.5): Balanced
  • High (1.6-2.0): Very similar to reference
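
Conceptually (a generic sketch of classifier-free guidance, not OpenVoiceLab's actual code), the scale extrapolates from an unconditional prediction toward one conditioned on the reference:

```python
# Generic classifier-free guidance blend, illustrative only.
import numpy as np

def cfg_combine(uncond, cond, scale):
    # scale = 1.0 returns the plain conditional prediction;
    # scale > 1.0 pushes the output further toward the conditioning signal.
    return uncond + scale * (cond - uncond)

uncond, cond = np.array([0.2, 0.5]), np.array([0.4, 0.1])
print(cfg_combine(uncond, cond, 1.5))  # [0.5, -0.1]: past `cond`, away from `uncond`
```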

What is voice cloning?

Using a reference audio sample to guide speech generation. The model tries to match the style and characteristics of the reference.

Why does my output have background music?

VibeVoice sometimes generates spontaneous background sounds or music. This is a known behavior from the model's training data.

Try:

  • Different reference voice
  • Different text
  • Regenerating (it's somewhat random)

Can I generate multiple speakers?

Yes, but you need to generate each speaker separately and combine the audio files yourself. OpenVoiceLab doesn't have built-in multi-speaker generation.
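
One way to do the combining is with pydub (filenames are hypothetical; pydub requires ffmpeg):

```python
# Stitch separately generated speaker clips into one conversation.
from pydub import AudioSegment

alice = AudioSegment.from_wav("alice_line1.wav")
bob = AudioSegment.from_wav("bob_line1.wav")
pause = AudioSegment.silent(duration=300)  # 300 ms gap between turns

(alice + pause + bob).export("conversation.wav", format="wav")
```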

How long can I generate?

The 1.5B model supports up to 90 minutes of audio, while the 7B model supports up to 45 minutes.

Can I control emotion or emphasis?

Not directly. You can:

  • Use reference voices with different emotions
  • Include cues in text (exclamations, questions)
  • Adjust CFG scale
  • Try different reference samples

Technical Questions

What is LoRA?

Low-Rank Adaptation. A method for efficiently finetuning large models by training small adapter layers instead of the full model. Uses less memory and creates smaller files.
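
A toy sketch of the idea in PyTorch (illustrative only, not OpenVoiceLab's implementation): the frozen pretrained weight is augmented with a trainable low-rank product, so only a small fraction of the parameters update.

```python
# Toy LoRA layer: y = W x + (alpha / r) * B(A x), with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no effect at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384: only A and B train, vs ~262k params in the base layer
```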

Where are models stored?

Pretrained models: ~/.cache/huggingface/hub/

LoRA adapters: training_runs/run_TIMESTAMP/checkpoints/
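
To see what's taking up space in the Hugging Face cache, huggingface_hub ships a scanner:

```python
# List cached Hugging Face repos and their size on disk.
from huggingface_hub import scan_cache_dir

for repo in scan_cache_dir().repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")
```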

Can I use my own model weights?

If you have VibeVoice weights, place them locally and point to that path in the Model Path field.

What Python version do I need?

Python 3.9 or newer. 3.10 or 3.11 recommended.

Using Trained Models

Where is my trained model?

At training_runs/run_TIMESTAMP/checkpoints/

The timestamp is from when training started (e.g., run_20251008_143022).
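
Because the timestamped names sort chronologically, the newest run is easy to locate programmatically:

```python
# Find the most recent training run; run_YYYYMMDD_HHMMSS names sort by date.
from pathlib import Path

latest = sorted(Path("training_runs").glob("run_*"))[-1]
print(latest / "checkpoints")
```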

Can I share my trained voice?

Yes. Share the entire checkpoints/ folder. Others load it as a LoRA adapter.

Note: Respect copyright and privacy. Don't share voices of people without permission.

Can I use my model with other tools?

The LoRA adapter is compatible with VibeVoice directly. You can load it in any tool that supports VibeVoice + LoRA.
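
If the checkpoint follows the standard PEFT adapter layout (an assumption worth verifying against the files in checkpoints/), the peft library can attach it to a base model along these lines; the base model path is a placeholder:

```python
# Sketch: attach a LoRA adapter in standard PEFT format to a base model.
from transformers import AutoModel
from peft import PeftModel

base_model = AutoModel.from_pretrained("path/to/vibevoice-weights")  # placeholder
model = PeftModel.from_pretrained(base_model, "training_runs/run_TIMESTAMP/checkpoints")
```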

How do I delete old models?

Delete the entire training_runs/run_TIMESTAMP/ folder:

```bash
rm -rf training_runs/run_20251008_143022
```

Troubleshooting

Why won't my model load?

Check:

  • Path is correct
  • Model files exist
  • Enough VRAM available
  • No other GPU processes interfering

See Troubleshooting for more.

Why is data processing failing?

Common causes:

  • Invalid audio files
  • No speech detected (VAD issue)
  • Out of memory
  • Permissions issues

Check logs in logs/openvoicelab.log.

Where can I get help?

Check the Troubleshooting page first. If you're still stuck, open a GitHub issue on the project repository.

Best Practices

What's the best workflow?

  1. Start with 30 minutes of clean audio
  2. Process it in Data tab
  3. Train with default settings (3 epochs, batch size 4)
  4. Test in Inference tab
  5. If quality is poor, train more epochs or get better data

How can I improve quality?

Data quality:

  • Cleaner audio
  • More audio (1+ hours)
  • Single speaker
  • Varied content

Training:

  • More epochs (5-7)
  • Higher LoRA rank (16-32)
  • Good batch size for your GPU

Inference:

  • Good reference voices
  • Adjust CFG scale
  • Try multiple generations

Should I use voice cloning?

Try both with and without. Sometimes voice cloning helps consistency, sometimes the pure finetuned voice sounds better.

Experiment to see what works for your case.

How often should I retrain?

  • If results are poor, try different training settings
  • If you get more audio data, train a new model
  • Otherwise, one trained model can be used indefinitely

Project Status

Is OpenVoiceLab ready for production?

It's in beta. Core features work, but:

  • Some rough edges remain
  • Training results can vary
  • Documentation is improving

Use it for experiments and projects, but expect some issues.

What features are planned?

See the GitHub repository for roadmap and issues.

Planned improvements:

  • Better audio chunking
  • Voice prompt drop rate support
  • More training options

Can I contribute?

Yes. OpenVoiceLab is open source. Contributions welcome on GitHub.

How is this different from other TTS tools?

OpenVoiceLab focuses specifically on VibeVoice finetuning with a user-friendly interface. Other tools may support different models or approaches.

Didn't Find Your Answer?

Still stuck? Open a GitHub issue with details about your problem.

Released under the BSD-3-Clause License.