
Finetuning Guide

This guide covers how to train VibeVoice on custom voices using the Training tab.

Prerequisites

Before training, you need:

  1. A processed dataset - See the data preparation guide
  2. GPU with 16+ GB VRAM - RTX 3090, 4090, or similar
  3. Time - Training can take anywhere from 10 minutes to several hours depending on dataset size and epochs

Training Process

Step 1: Open the Training Tab

Click the Training tab in OpenVoiceLab.

Step 2: Select Dataset

Click Refresh Datasets to load available datasets.

Select your dataset from the dropdown. It will show the dataset name and number of samples:

my_voice (1247 samples)

If you don't see your dataset:

  • Make sure you processed it in the Data tab first
  • Check that the dataset exists in the data/ folder (a quick way to check from Python is sketched after this list)
  • Try clicking Refresh Datasets again
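
To double-check what OpenVoiceLab can see on disk, you can list the processed datasets yourself. A minimal sketch, assuming each dataset sits in its own folder under data/ with a wavs/ subfolder (the layout the Data tab produces):

from pathlib import Path

data_dir = Path("data")
for dataset in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    # Count the audio clips the Data tab produced for this dataset
    n_wavs = len(list((dataset / "wavs").glob("*.wav")))
    print(f"{dataset.name}: {n_wavs} samples")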

Step 3: Configure Model

Model Path: Leave as vibevoice/VibeVoice-1.5B (recommended)

You can also use:

  • vibevoice/VibeVoice-7B - Higher quality but needs 48GB+ VRAM and trains more slowly
  • Local path if you've downloaded the model

Step 4: Set Training Parameters

Click Training Parameters to expand the options. A sketch showing how these settings map onto a typical LoRA training configuration follows the parameter descriptions below.

Epochs (default: 3)

  • How many times to train on the full dataset
  • Start with 3-5 epochs
  • More epochs = longer training, may improve quality
  • Too many epochs can cause overfitting (model sounds robotic)

Batch Size (default: 4)

  • Number of samples processed together
  • Reduce to 2 or 1 if you get out-of-memory errors
  • Increase to 8 if you have 24GB+ VRAM
  • Larger = faster training but more memory

Learning Rate (default: 1e-4)

  • How fast the model learns
  • Default works well for most cases
  • Don't change unless you know what you're doing

LoRA Rank (default: 8)

  • Complexity of the adapter
  • Default (8) balances quality and efficiency
  • Increase to 16 or 32 for potentially better quality (slower, more VRAM)
  • Decrease to 4 for faster training (may sacrifice quality)
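
For reference, these four settings correspond to standard LoRA fine-tuning knobs. The sketch below shows roughly how they might map onto peft and transformers configuration objects; the bundled trainer (ovl/training/trainer.py) may wire them differently, so treat this as an illustration rather than the exact implementation.

from peft import LoraConfig
from transformers import TrainingArguments

# Roughly what the Training tab defaults correspond to
lora_config = LoraConfig(
    r=8,                # LoRA Rank: adapter complexity
    lora_alpha=16,      # assumption: alpha = 2 x rank is a common convention
    lora_dropout=0.05,  # assumption: not exposed in the UI
)

training_args = TrainingArguments(
    output_dir="training_runs/example",
    num_train_epochs=3,             # Epochs
    per_device_train_batch_size=4,  # Batch Size
    learning_rate=1e-4,             # Learning Rate
)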

Step 5: Start Training

Click Start Training.

OpenVoiceLab will:

  1. Prepare the training data (converts it to JSONL format; a quick sanity-check sketch appears at the end of this step)
  2. Launch the training process in the background
  3. Start TensorBoard for monitoring

You'll see a status message:

✅ Training started: run_20251008_143022
Waiting for TensorBoard...

Training runs in a background process, so you can close the browser and it continues.
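
If you want to confirm the prepared data was written correctly, you can parse the generated JSONL file line by line. The exact filename and fields depend on the trainer, so the path below is an assumption; use whichever .jsonl file appears under your run directory.

import json
from pathlib import Path

# Adjust to the .jsonl file created for your run (this filename is an assumption)
jsonl_path = Path("training_runs/run_20251008_143022/train.jsonl")

records = [json.loads(line) for line in jsonl_path.read_text(encoding="utf-8").splitlines() if line.strip()]
print(f"{len(records)} training records")
print("fields:", sorted(records[0].keys()))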

Step 6: Monitor Training

TensorBoard

Wait 10-20 seconds, then click Refresh TensorBoard.

You'll see graphs showing:

  • Loss curves - Should decrease over time (lower = better)
  • Learning rate - How it changes during training
  • Steps per second - Training speed

A decreasing loss curve means training is working. If loss stops decreasing, training has plateaued.
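
The same numbers can be read directly from the event files with TensorBoard's Python API, which is handy if the embedded viewer is slow to load. A minimal sketch, assuming the event files live in the run directory and the training loss is logged under a tag containing "loss":

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point this at the directory containing the events.out.tfevents.* files
acc = EventAccumulator("training_runs/run_20251008_143022")
acc.Reload()

# Pick out whichever scalar tags carry a loss value (exact tag names are an assumption)
loss_tags = [t for t in acc.Tags()["scalars"] if "loss" in t.lower()]
for tag in loss_tags:
    points = acc.Scalars(tag)
    print(tag, "first:", points[0].value, "latest:", points[-1].value)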

Training Logs

Click the Training Logs accordion to expand.

Click Refresh Logs to see real-time output:

Step 100/1500 | Loss: 0.234 | LR: 0.0001 | 2.3s/it
Step 200/1500 | Loss: 0.198 | LR: 0.0001 | 2.1s/it

This shows:

  • Current step / total steps
  • Current loss value
  • Learning rate
  • Time per iteration

Training History

Click Refresh Runs to see all your training runs:

🟢 run_20251008_143022
- Created: 2025-10-08T14:30:22
- Status: running
- Dataset: data/my_voice

Step 7: Wait for Completion

Training takes time depending on:

  • Dataset size (more samples = longer)
  • Number of epochs
  • Batch size
  • GPU speed

Typical times:

  • 1000 samples, 3 epochs, batch size 4: 2-3 hours on RTX 4090
  • 2000 samples, 5 epochs, batch size 2: 5-6 hours
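
The total step count you see in the logs follows from these settings: roughly the number of samples divided by the batch size, times the number of epochs (gradient accumulation, if the trainer uses it, reduces the count further). A quick back-of-the-envelope check:

import math

samples, batch_size, epochs = 1247, 4, 3

steps_per_epoch = math.ceil(samples / batch_size)  # 312
total_steps = steps_per_epoch * epochs             # 936
print(f"roughly {total_steps} optimizer steps")
# Multiply by the s/it figure from the training logs to estimate wall-clock time.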

You can check progress in the training logs. Look for lines like:

Epoch 1/3 complete
Epoch 2/3 complete

When finished, status will change to "stopped" in Training History.

Step 8: Find Your Trained Model

The trained adapter is saved to:

training_runs/run_TIMESTAMP/checkpoints/

You'll need this path for inference. The full path looks like:

training_runs/run_20251008_143022/checkpoints/

Stopping Training Early

Click Stop Training to halt the current training run.

The latest checkpoint will still be saved and usable. You don't need to wait for all epochs to complete - sometimes 1-2 epochs is enough.

Training Parameters Explained

What is LoRA?

LoRA (Low-Rank Adaptation) trains small "adapter" layers instead of the full model. Benefits:

  • Much less VRAM needed (16GB vs 80GB+)
  • Faster training
  • Creates small files (a few hundred MB vs. multiple GB)
  • You can load/unload different adapters easily

The adapter modifies how the base model generates speech without changing the base model itself.

What are Epochs?

One epoch = training on the entire dataset once.

  • 3 epochs = model sees each sample 3 times
  • More epochs can improve quality but risk overfitting
  • Overfitting = model memorizes training data, sounds unnatural on new text

Start with 3 epochs. If results are poor, try 5. If results sound robotic, reduce to 2.

Batch Size vs VRAM

Batch size determines how many samples are processed simultaneously.

Larger batch sizes train faster but use more VRAM. If you get OOM (out of memory) errors, reduce batch size.
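
Before raising the batch size, you can check how much GPU memory is actually free with PyTorch (requires a CUDA build):

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
print(f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")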

Evaluating Results

After training, use the Inference tab to test your model:

  1. Load the base model (vibevoice/VibeVoice-1.5B)
  2. Check Load LoRA Adapter
  3. Enter the path: training_runs/run_TIMESTAMP/checkpoints/ (a script-level sketch of this step appears at the end of this section)
  4. Generate speech from text

Listen for:

  • Does it sound like the training voice?
  • Is speech natural or robotic?
  • Does it handle different text well?

If quality is poor, see the troubleshooting section below.
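
Under the hood, checking Load LoRA Adapter attaches the checkpoint to the base model. If you want to do the same thing from your own Python code, the usual peft pattern is sketched below; how OpenVoiceLab actually loads the VibeVoice base model is handled internally, so base_model here stands in for whatever model object you already have.

from peft import PeftModel

def attach_adapter(base_model, adapter_path="training_runs/run_20251008_143022/checkpoints"):
    # Attach the trained LoRA weights to the already-loaded base model
    model = PeftModel.from_pretrained(base_model, adapter_path)
    # Optionally fold the adapter into the base weights for faster inference
    return model.merge_and_unload()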

Training Multiple Voices

You can train multiple adapters and swap between them:

  1. Process different datasets (one per voice)
  2. Train each dataset separately
  3. Each creates a separate checkpoint in training_runs/
  4. At inference time, load different checkpoints to use different voices (see the listing sketch below)
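
To see which adapters you have available, list the checkpoint folders under training_runs/:

from pathlib import Path

for run in sorted(Path("training_runs").glob("run_*")):
    checkpoints = run / "checkpoints"
    if checkpoints.exists():
        print(f"{run.name} -> {checkpoints}")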

Troubleshooting

Out of memory during training

  • Reduce batch size to 2 or 1
  • Use a smaller LoRA rank (4 instead of 8)
  • Close other GPU applications
  • Use the 1.5B model instead of 7B
  • Process fewer samples at once

Training loss not decreasing

  • Check dataset quality (listen to samples in data/my_dataset/wavs/)
  • Make sure transcriptions are accurate
  • Try more epochs (5 instead of 3)
  • Try higher learning rate (1.5e-4)

Generated speech doesn't sound like training voice

  • Train for more epochs (5-7)
  • Increase LoRA rank (16 or 32)
  • Check training data quality
  • Make sure you're loading the adapter correctly at inference
  • Try using voice cloning with a reference sample

Training crashes or hangs

  • Check training logs for error messages
  • Make sure dataset is formatted correctly
  • Verify enough disk space
  • Try a smaller dataset first (100-200 samples) to test

TensorBoard won't load

  • Wait 30-60 seconds after starting training
  • Click Refresh TensorBoard
  • Check if port 6006 is blocked by a firewall (a quick port check is sketched below)
  • Look at tensorboard.log in the run directory
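
A quick way to tell whether anything is actually listening on TensorBoard's default port (6006) is a small socket check:

import socket

with socket.socket() as s:
    if s.connect_ex(("127.0.0.1", 6006)) == 0:
        print("TensorBoard is reachable on port 6006")
    else:
        print("Nothing is listening on port 6006 - TensorBoard may not have started yet")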

Tips for Better Results

Data quality matters most

  • 30 minutes of clean audio > 3 hours of noisy audio
  • Single speaker, minimal background noise
  • Natural speech patterns

Start conservative

  • Use default settings first
  • 3 epochs, batch size 4, LoRA rank 8
  • Only adjust if results are poor

Monitor training

  • Check loss curves in TensorBoard
  • Loss should decrease steadily
  • If loss flatlines early, training may need adjustment

Test early and often

  • Don't wait for all epochs to finish
  • After 1 epoch, test the checkpoint
  • If it's good enough, stop training

Experiment

  • Try different training parameters
  • Train on different data
  • Compare results

Next Steps

After training a model:

👉 Inference Guide - Generate speech with your finetuned voice

Care to share your finetuned model with the community?

  • Sign up for an account on Hugging Face
  • Visit this page and click Join this org
  • Click the New button, then click Model
  • Give your model a name and click Create
  • Click Files and versions, then the + Contribute button
  • Select Upload files, then drag and drop the fine-tuned checkpoint files from inside your training_runs/ run folder (NOT the entire training_runs/ folder)

This will allow other members of the community to use your model.

Advanced Topics

Resuming Training

Currently, you can't resume interrupted training. If training stops, you need to start over. The latest checkpoint before stopping is still usable.

Multi-Speaker Training

To train on multiple speakers:

  • Process each speaker's audio into separate datasets
  • Train separate adapters for each
  • Or combine datasets if you want a multi-speaker model

Training on Non-English

VibeVoice was trained primarily on English but has some multilingual capability. Training on other languages:

  • Make sure Whisper transcribes correctly
  • May need more data for good results
  • Results may vary

Custom Training Scripts

Advanced users can modify training parameters by editing ovl/training/trainer.py or running the training script directly with custom arguments.
