Getting Started

Welcome to OpenVoiceLab - a web interface for working with VibeVoice TTS models.

What is OpenVoiceLab?

OpenVoiceLab is a web interface for working with text-to-speech (TTS) models. Instead of dealing with command-line tools and scattered scripts, you get a single Gradio interface that covers the whole workflow: data preparation, training, and speech generation.

Quick Start

Choose your path based on what you want to do:

Path 1: Try it out (5 minutes)

Install OpenVoiceLab
Generate speech with the pretrained model
No training needed - works immediately with voice cloning

Path 2: Finetune a custom voice (several hours)

Install OpenVoiceLab
Collect 30+ minutes of audio (more is better)
Process your audio data (10-20 min)
Train a custom voice (several hours)
Generate speech with your voice

Most people come here to finetune - that's Path 2.

Understanding Voice Cloning vs Finetuning

Voice Cloning (Zero-Shot) - No training required

Provide a short audio sample (~30s)
Model mimics it on the fly
Works immediately
Quality is mediocre
Good for experimenting

Finetuning - Requires training

Provide 30+ minutes of audio
Model learns the voice
Much better quality and consistency

You can also combine both approaches, but finetuning is generally better.

How Finetuning Works

OpenVoiceLab uses LoRA (Low-Rank Adaptation) to finetune VibeVoice efficiently:

Data Preparation - Upload raw audio files; OpenVoiceLab segments and transcribes them automatically
Training - Train a LoRA adapter that teaches the model to sound like your voice
Generation - Load your adapter to generate speech in your trained voice

The adapter is only a few hundred MB and trains on consumer GPUs (16+ GB VRAM).
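
To give a sense of why a LoRA adapter stays small compared with the full model, here is a minimal sketch that counts trainable parameters for a single linear layer. LoRA freezes the original weight matrix W and trains two thin matrices B and A, so the effective weight becomes W + B @ A. The layer size and rank below are illustrative assumptions, not the actual VibeVoice dimensions or OpenVoiceLab's training settings.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, adapter) trainable-parameter counts for one linear layer.

    Full finetuning updates the whole d_out x d_in weight matrix W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + B @ A.
    """
    full = d_out * d_in
    adapter = d_out * rank + rank * d_in
    return full, adapter


# Illustrative sizes only (assumptions, not VibeVoice's real dimensions).
full, adapter = lora_param_counts(d_out=2048, d_in=2048, rank=16)
print(f"full layer:   {full:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters "
      f"({100 * adapter / full:.2f}% of the layer)")
```

Because only the small matrices are trained and saved, the adapter can be stored and swapped separately from the base model.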

Requirements

Hardware Requirements

For Training:

NVIDIA GPU with 16+ GB VRAM (recommended for the 1.5B model)
RTX 3090, 4090, or similar
24+ GB VRAM for best performance
Apple Silicon Macs work but are slower

For Inference (Generating Speech):

8+ GB VRAM (inference can also run on CPU, just more slowly)
Works on most modern computers
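
The hardware guidance above can be summarized as a small lookup. This is only a sketch: `describe_gpu` is a hypothetical helper written for this page, not part of OpenVoiceLab, and the thresholds are simply the VRAM figures listed in this section.

```python
def describe_gpu(vram_gb):
    """Map available GPU memory (in GB) to the guidance in this section.

    Hypothetical helper for illustration; thresholds come from the
    hardware requirements listed above.
    """
    if vram_gb >= 24:
        return "training (best performance) and inference"
    if vram_gb >= 16:
        return "training (1.5B model) and inference"
    if vram_gb >= 8:
        return "inference only"
    # Below the listed inference figure: CPU inference still works, just slower.
    return "CPU inference (slower)"


for vram in (24, 16, 8, 4):
    print(f"{vram} GB VRAM -> {describe_gpu(vram)}")
```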

Software Requirements

Python 3.9 or newer
That's it! OpenVoiceLab handles the rest
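
If you are not sure which Python version you have, a quick check like the following can help. It is a generic sketch using only the standard library; the function name `python_is_supported` is made up for this example.

```python
import sys

# Minimum version stated in the software requirements above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info):
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_supported():
    print(f"Python {current} is new enough")
else:
    print(f"Python {current} is too old; install Python 3.9 or newer")
```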

Data Requirements

Minimum: 30 minutes of clean audio
Recommended: 1-3 hours of audio
Best: 3+ hours of varied audio

Audio quality matters more than quantity! 30 minutes of clean audio beats 3 hours of noisy audio.
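
Before you start, it can help to check how much audio you actually have. The sketch below adds up the length of the WAV files in a folder and compares the total with the tiers above. It is not part of OpenVoiceLab: the function names are hypothetical, it only reads uncompressed PCM `.wav` files (via Python's standard `wave` module), and it says nothing about audio quality.

```python
import wave
from pathlib import Path


def total_wav_minutes(folder):
    """Sum the duration of every .wav file directly inside `folder`, in minutes."""
    seconds = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration = number of frames / frames per second.
            seconds += wav.getnframes() / wav.getframerate()
    return seconds / 60


def data_tier(minutes):
    """Compare a total duration with the data requirements listed above."""
    if minutes >= 180:
        return "best (3+ hours)"
    if minutes >= 60:
        return "recommended (1-3 hours)"
    if minutes >= 30:
        return "minimum (30+ minutes)"
    return "below the 30-minute minimum"
```

For example, `data_tier(total_wav_minutes("my_audio"))` would report the tier for a folder named `my_audio` (a placeholder path).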

Next Steps

Ready to dive in? Follow these guides in order:

Installation - Set up OpenVoiceLab on your computer
Quick Start - Try generating speech with a pretrained model
Data Preparation - Prepare your audio files for training (you must do this before finetuning)
Finetuning - Train your custom voice model

Need Help?

Check the FAQ for common questions
Read the Troubleshooting guide if you encounter issues
Join our Discord community for support

A Quick Note

OpenVoiceLab is currently in beta. Things work well, but you might encounter rough edges. The community is actively improving the project, and your feedback is valuable!

If something doesn't work as expected, it's probably not your fault - let us know on Discord or GitHub.
