Getting Started

Welcome to OpenVoiceLab - a web interface for working with VibeVoice TTS models.

What is OpenVoiceLab?

OpenVoiceLab is a web interface for working with text-to-speech (TTS) models. Instead of dealing with command-line tools and scattered scripts, you get a single Gradio interface that covers data preparation, training, and speech generation.

Quick Start

Choose your path based on what you want to do:

Path 1: Try it out (5 minutes)

  1. Install OpenVoiceLab
  2. Generate speech with a pretrained model
  3. No training needed - zero-shot voice cloning works immediately

Path 2: Finetune a custom voice (several hours)

  1. Install OpenVoiceLab
  2. Collect 30+ minutes of audio (more is better)
  3. Process your audio data (10-20 min)
  4. Train a custom voice (several hours)
  5. Generate speech with your voice

Most people come here to finetune - that's Path 2.

Understanding Voice Cloning vs Finetuning

Voice Cloning (Zero-Shot) - No training required

  • Provide a short audio sample (about 30 seconds)
  • Model mimics it on the fly
  • Works immediately
  • Quality is mediocre
  • Good for experimenting

Finetuning - Requires training

  • Provide 30+ minutes of audio
  • Model learns the voice
  • Much better quality and consistency

You can also combine both approaches, but finetuning generally gives better quality and consistency than voice cloning alone.

How Finetuning Works

OpenVoiceLab uses LoRA (Low-Rank Adaptation) to finetune VibeVoice efficiently:

  1. Data Preparation - Upload raw audio files; OpenVoiceLab segments and transcribes them automatically
  2. Training - Train a LoRA adapter so the model sounds like your voice
  3. Generation - Load your adapter to generate speech in your trained voice

The adapter is only a few hundred MB and trains on consumer GPUs (16+ GB VRAM).
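To see why a LoRA adapter stays small, here is a minimal sketch of the parameter arithmetic. LoRA freezes a weight matrix W (d_out x d_in) and learns two low-rank factors, B (d_out x r) and A (r x d_in), so the adapted weight is W + (alpha / r) * B * A. The layer size and rank below are illustrative assumptions, not VibeVoice's actual configuration.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_in * d_out           # parameters updated when finetuning W directly
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints "full: 4,194,304  lora: 32,768  ratio: 0.78%"
```

Only the small A and B matrices are trained and saved, which is why the adapter file is a fraction of the size of the base model.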

Requirements

Hardware Requirements

For Training:

  • NVIDIA GPU with 16+ GB VRAM (recommended for the 1.5B model)
  • 24+ GB VRAM for best performance, e.g. RTX 3090, 4090, or similar
  • Apple Silicon Macs work but are slower

For Inference (Generating Speech):

  • 8+ GB VRAM recommended; CPU-only inference also works, just slower
  • Works on most modern computers
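If you are unsure where your machine falls, the thresholds above can be checked with a short script. This is a rough sketch, not part of OpenVoiceLab: the function names are hypothetical, it assumes PyTorch may or may not be installed, and it only detects NVIDIA GPUs through CUDA (an Apple Silicon Mac will be reported as CPU only).

```python
from typing import Optional

# Thresholds taken from the hardware requirements listed above.
TRAIN_BEST_GB = 24
TRAIN_MIN_GB = 16
INFER_MIN_GB = 8


def describe_hardware(vram_gb: Optional[float]) -> str:
    """Map available VRAM (in GB) to what the requirements say it supports."""
    if vram_gb is None:
        return "CPU only: inference works, just slower"
    if vram_gb >= TRAIN_BEST_GB:
        return "training (best performance) and inference"
    if vram_gb >= TRAIN_MIN_GB:
        return "training and inference"
    if vram_gb >= INFER_MIN_GB:
        return "inference only"
    return "below the 8 GB guideline for inference"


def detect_vram_gb() -> Optional[float]:
    """Return VRAM of the first CUDA device in decimal GB, or None if unavailable."""
    try:
        import torch  # optional dependency; absence is treated as "no GPU"
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Decimal GB (1e9 bytes) so a "24 GB" card is not rounded below 24.
    return torch.cuda.get_device_properties(0).total_memory / 1e9


if __name__ == "__main__":
    print(describe_hardware(detect_vram_gb()))
```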

Software Requirements

  • Python 3.9 or newer
  • That's it! OpenVoiceLab handles the rest

Data Requirements

  • Minimum: 30 minutes of clean audio
  • Recommended: 1-3 hours of audio
  • Best: 3+ hours of varied audio

Audio quality matters more than quantity! 30 minutes of clean audio beats 3 hours of noisy audio.
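Before starting data preparation, it can help to check how much audio you actually have. The sketch below is an independent helper, not an OpenVoiceLab API: it assumes your recordings are uncompressed PCM WAV files in a single folder (the folder name is hypothetical) and uses only the Python standard library. It measures duration only and says nothing about how clean the audio is.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path) -> float:
    """Duration of one PCM WAV file, computed as frames / sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(folder) -> float:
    """Total duration, in minutes, of all .wav files directly inside folder."""
    files = sorted(Path(folder).glob("*.wav"))
    return sum(wav_duration_seconds(f) for f in files) / 60


def dataset_tier(minutes: float) -> str:
    """Classify a dataset against the data requirements listed above."""
    if minutes < 30:
        return "below minimum"
    if minutes < 60:
        return "minimum"
    if minutes < 180:
        return "recommended"
    return "best"


if __name__ == "__main__":
    folder = "my_voice_recordings"  # hypothetical folder name
    if Path(folder).is_dir():
        minutes = total_minutes(folder)
        print(f"{minutes:.1f} minutes of audio: {dataset_tier(minutes)}")
```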

Next Steps

Ready to dive in? Follow these guides in order:

  1. Installation - Set up OpenVoiceLab on your computer
  2. Quick Start - Try generating speech with a pretrained model
  3. Data Preparation - Prepare your audio files for training (you must do this before finetuning)
  4. Finetuning - Train your custom voice model

Need Help?

A Quick Note

OpenVoiceLab is currently in beta. Things work well, but you might encounter rough edges. The community is actively improving the project, and your feedback is valuable!

If something doesn't work as expected, it's probably not your fault - let us know on Discord or GitHub.

Released under the BSD-3-Clause License.