FAQ

Common questions about OpenVoiceLab and VibeVoice.

General Questions

What is OpenVoiceLab?

A web interface for finetuning and running VibeVoice text-to-speech models. It handles data preparation, training, and speech generation through a Gradio UI.

What is VibeVoice?

A text-to-speech model developed by Microsoft (now community-maintained). It can generate long-form conversational speech with multiple speakers.

Is OpenVoiceLab free?

Yes. OpenVoiceLab is open source under BSD-3-Clause license. VibeVoice models are also freely available.

Do I need a GPU?

For inference: No, but CPU is slow. GPU recommended for reasonable speed.

For training: Yes, you need an NVIDIA GPU with 16+ GB VRAM. Training on CPU is impractically slow.

Which GPU do I need?

Training: RTX 3090, 4090, A5000, or similar with 16+ GB VRAM

Inference: Any GPU with 8+ GB VRAM works. CPU also works but is slow.

Does it work on Mac?

Yes. Apple Silicon (M1/M2/M3) Macs use MPS for acceleration. Works for both training and inference, though slower than high-end NVIDIA GPUs.
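
If you're not sure what acceleration PyTorch can see on your machine, a quick check (PyTorch is already installed in any OpenVoiceLab environment):

```python
# Report the available accelerator: CUDA GPU, Apple MPS, or CPU fallback.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")
elif torch.backends.mps.is_available():
    print("Apple Silicon MPS backend available")
else:
    print("No GPU acceleration found; everything will fall back to CPU")
```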

Can I run this in the cloud?

Yes. Use any cloud GPU (Google Colab, Runpod, Vast.ai, etc.). Just install OpenVoiceLab and run it; use the --share flag to get a public URL.
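
OpenVoiceLab's interface is built on Gradio, which is where the public-URL mechanism lives. As a standalone illustration (not OpenVoiceLab's actual entry point, and assuming --share maps to Gradio's share option), launch(share=True) prints a temporary public *.gradio.live URL:

```python
# Minimal Gradio app; share=True tunnels the UI to a temporary public URL.
import gradio as gr

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(share=True)
```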

Data and Training

How much audio do I need?

Minimum: 30 minutes of clean audio (for testing)

Recommended: 1-3 hours

More data generally helps, but quality matters more than quantity.

What audio format do I need?

Input: MP3, WAV, FLAC, M4A

OpenVoiceLab converts everything automatically during processing.
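
If you prefer to convert files yourself first, pydub (which wraps ffmpeg) handles any of these formats. The filenames here are hypothetical, and the 24 kHz mono target is an assumption rather than a documented requirement:

```python
# Convert an input file to mono WAV; the 24 kHz rate is an assumption,
# not a documented OpenVoiceLab requirement.
from pydub import AudioSegment

audio = AudioSegment.from_file("interview.m4a")
audio.set_frame_rate(24000).set_channels(1).export("interview.wav", format="wav")
```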

Can I train on multiple speakers?

You can, but results may be mixed. It's usually better to train a separate model for each speaker.

Can I pause and resume training?

Not currently. Training runs must complete or be stopped. You can use early checkpoints if you stop training before completion.

How do I know if training worked?

Test it in the Inference tab. Load your checkpoint and generate speech. If it sounds like your training data, it worked.

Why doesn't my model sound like the training voice?

Possible reasons:

  • Not enough training data
  • Poor quality data
  • Not enough epochs
  • LoRA rank too low
  • Model overtrained

Try training longer or with more data.

Can I finetune on non-English audio?

Yes! VibeVoice officially supports only English and Chinese, but community members report good results in many other languages, including German, French, Spanish, and Arabic.

Training vs Voice Cloning

What's the difference between finetuning and voice cloning?

These are two different approaches to customizing the voice:

Voice Cloning (Zero-Shot):

  • No training required
  • Provide a 10-30 second audio sample
  • Model tries to mimic that voice on the fly
  • Fast: works immediately
  • Quality varies: not a perfect replication
  • This is the "Enable Voice Cloning" checkbox in Inference tab

Finetuning (Training):

  • Requires a training run (2-6 hours)
  • Needs anywhere from 30 minutes to several hours of audio
  • Model actually learns the voice patterns
  • Creates a LoRA adapter specific to that voice
  • Better quality and consistency than zero-shot
  • This is what the Training tab does

Which should you use?

  • Just experimenting? Use voice cloning (zero-shot)
  • Need quick results? Use voice cloning
  • Want best quality? Finetune a model
  • Have lots of audio? Definitely finetune
  • Professional project? Finetune

You can also combine them: finetune a model, then use voice cloning on top for extra guidance.

Can I use VibeVoice without finetuning?

Yes! The pretrained model works out of the box. Just:

  1. Load the model in Inference tab
  2. Enable voice cloning
  3. Select a reference voice
  4. Generate speech

The model will try to match the reference voice without any training.

Do I need to finetune to use OpenVoiceLab?

No. You can use just the inference features with the pretrained model and voice cloning.

Finetuning is optional; use it when you want better quality on a specific voice.

Inference and Generation

Why is generation slow?

Large models are slow. A Real-Time Factor (RTF) of 0.5-1.0x (generation time divided by audio duration; lower is faster) is normal on consumer GPUs. Use a GPU rather than a CPU for better speed.
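
That definition makes it easy to estimate how long a generation will take; a quick back-of-the-envelope calculation:

```python
# Estimate wall-clock generation time from RTF
# (RTF = generation time / audio duration, so time = duration * RTF).
audio_minutes = 10
rtf = 0.8  # typical consumer-GPU value per the range above
print(f"~{audio_minutes * rtf:.0f} minutes to generate {audio_minutes} minutes of audio")
```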

What is CFG scale?

Classifier-Free Guidance scale. Controls how closely the model follows a reference voice.

  • Low (1.0-1.2): More creative, less similar to reference
  • Medium (1.3-1.5): Balanced
  • High (1.6-2.0): Very similar to reference
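
Conceptually (a generic sketch of classifier-free guidance, not OpenVoiceLab's actual code), the scale extrapolates from an unconditional prediction toward one conditioned on the reference:

```python
# Generic classifier-free guidance blend, illustrative only.
import numpy as np

def cfg_combine(uncond, cond, scale):
    # scale = 1.0 returns the plain conditional prediction;
    # scale > 1.0 pushes the output further toward the conditioning signal.
    return uncond + scale * (cond - uncond)

uncond, cond = np.array([0.2, 0.5]), np.array([0.4, 0.1])
print(cfg_combine(uncond, cond, 1.5))  # [0.5, -0.1]: past `cond`, away from `uncond`
```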

What is voice cloning?

Using a reference audio sample to guide speech generation. The model tries to match the style and characteristics of the reference.

Why does my output have background music?

VibeVoice sometimes generates spontaneous background sounds or music. This is a known behavior from the model's training data.

Try:

  • Different reference voice
  • Different text
  • Regenerating (it's somewhat random)

Can I generate multiple speakers?

Yes, but you need to generate each speaker separately and combine the audio files yourself. OpenVoiceLab doesn't have built-in multi-speaker generation.
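
One way to do the combining is with pydub (filenames are hypothetical; pydub requires ffmpeg):

```python
# Stitch separately generated speaker clips into one conversation.
from pydub import AudioSegment

alice = AudioSegment.from_wav("alice_line1.wav")
bob = AudioSegment.from_wav("bob_line1.wav")
pause = AudioSegment.silent(duration=300)  # 300 ms gap between turns

(alice + pause + bob).export("conversation.wav", format="wav")
```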

How long can I generate?

The 1.5B model supports up to 90 minutes of audio, while the 7B model supports up to 45 minutes.

Can I control emotion or emphasis?

Not directly. You can:

  • Use reference voices with different emotions
  • Include cues in text (exclamations, questions)
  • Adjust CFG scale
  • Try different reference samples

Technical Questions

What is LoRA?

Low-Rank Adaptation. A method for efficiently finetuning large models by training small adapter layers instead of the full model. Uses less memory and creates smaller files.
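
A toy sketch of the idea in PyTorch (illustrative only, not OpenVoiceLab's implementation): the frozen pretrained weight is augmented with a trainable low-rank product, so only a small fraction of the parameters update.

```python
# Toy LoRA layer: y = W x + (alpha / r) * B(A x), with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no effect at step 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 16384: only A and B train, vs ~262k params in the base layer
```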

Where are models stored?

Pretrained models: ~/.cache/huggingface/hub/

LoRA adapters: training_runs/run_TIMESTAMP/checkpoints/
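
To see what's taking up space in the Hugging Face cache, huggingface_hub ships a scanner:

```python
# List cached Hugging Face repos and their size on disk.
from huggingface_hub import scan_cache_dir

for repo in scan_cache_dir().repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")
```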

Can I use my own model weights?

If you have VibeVoice weights, place them locally and point to that path in the Model Path field.

What Python version do I need?

Python 3.9 or newer. 3.10 or 3.11 recommended.

Using Trained Models

Where is my trained model?

At training_runs/run_TIMESTAMP/checkpoints/

The timestamp is from when training started (e.g., run_20251008_143022).
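
Because the timestamped names sort chronologically, the newest run is easy to locate programmatically:

```python
# Find the most recent training run; run_YYYYMMDD_HHMMSS names sort by date.
from pathlib import Path

latest = sorted(Path("training_runs").glob("run_*"))[-1]
print(latest / "checkpoints")
```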

Can I share my trained voice?

Yes. Share the entire checkpoints/ folder. Others load it as a LoRA adapter.

Note: Respect copyright and privacy. Don't share voices of people without permission.

Can I use my model with other tools?

The LoRA adapter is compatible with VibeVoice directly. You can load it in any tool that supports VibeVoice + LoRA.
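
If the checkpoint follows the standard PEFT adapter layout (an assumption worth verifying against the files in checkpoints/), the peft library can attach it to a base model along these lines; the base model path is a placeholder:

```python
# Sketch: attach a LoRA adapter in standard PEFT format to a base model.
from transformers import AutoModel
from peft import PeftModel

base_model = AutoModel.from_pretrained("path/to/vibevoice-weights")  # placeholder
model = PeftModel.from_pretrained(base_model, "training_runs/run_TIMESTAMP/checkpoints")
```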

How do I delete old models?

Delete the entire training_runs/run_TIMESTAMP/ folder:

```bash
rm -rf training_runs/run_20251008_143022
```

Troubleshooting

Why won't my model load?

Check:

  • Path is correct
  • Model files exist
  • Enough VRAM available
  • No other GPU processes interfering

See Troubleshooting for more.

Why is data processing failing?

Common causes:

  • Invalid audio files
  • No speech detected (VAD issue)
  • Out of memory
  • Permissions issues

Check logs in logs/openvoicelab.log.

Where can I get help?

Check the Troubleshooting page first. If you're still stuck, open a GitHub issue on the project repository.

Best Practices

What's the best workflow?

  1. Start with 30 minutes of clean audio
  2. Process it in Data tab
  3. Train with default settings (3 epochs, batch size 4)
  4. Test in Inference tab
  5. If quality is poor, train more epochs or get better data

How can I improve quality?

Data quality:

  • Cleaner audio
  • More audio (1+ hours)
  • Single speaker
  • Varied content

Training:

  • More epochs (5-7)
  • Higher LoRA rank (16-32)
  • Good batch size for your GPU

Inference:

  • Good reference voices
  • Adjust CFG scale
  • Try multiple generations

Should I use voice cloning?

Try both with and without. Sometimes voice cloning helps consistency, sometimes the pure finetuned voice sounds better.

Experiment to see what works for your case.

How often should I retrain?

  • If results are poor, try different training settings
  • If you get more audio data, train a new model
  • Otherwise, one trained model can be used indefinitely

Project Status

Is OpenVoiceLab ready for production?

It's in beta. Core features work, but:

  • Some rough edges remain
  • Training results can vary
  • Documentation is improving

Use it for experiments and projects, but expect some issues.

What features are planned?

See the GitHub repository for roadmap and issues.

Planned improvements:

  • Better audio chunking
  • Voice prompt drop rate support
  • More training options

Can I contribute?

Yes. OpenVoiceLab is open source. Contributions welcome on GitHub.

How is this different from other TTS tools?

OpenVoiceLab focuses specifically on VibeVoice finetuning with a user-friendly interface. Other tools may support different models or approaches.

Didn't Find Your Answer?

Still stuck? Open a GitHub issue with details about your problem.

Released under the BSD-3-Clause License.