
Finetuning Guide

This guide covers how to train VibeVoice on custom voices using the Training tab.

Prerequisites

Before training, you need:

  1. A processed dataset - See the data preparation guide
  2. GPU with 16+ GB VRAM - RTX 3090, 4090, or similar
  3. Time - Training can take anywhere from 10 minutes to several hours depending on dataset size and epochs

Training Process

Step 1: Open the Training Tab

Click the Training tab in OpenVoiceLab.

Step 2: Select Dataset

Click Refresh Datasets to load available datasets.

Select your dataset from the dropdown. It will show the dataset name and number of samples:

my_voice (1247 samples)

If you don't see your dataset:

  • Make sure you processed it in the Data tab first
  • Check that the dataset exists in the data/ folder (a quick way to check from Python is sketched after this list)
  • Try clicking Refresh Datasets again
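
To double-check what OpenVoiceLab can see on disk, you can list the processed datasets yourself. A minimal sketch, assuming each dataset sits in its own folder under data/ with a wavs/ subfolder (the layout the Data tab produces):

from pathlib import Path

data_dir = Path("data")
for dataset in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    # Count the audio clips the Data tab produced for this dataset
    n_wavs = len(list((dataset / "wavs").glob("*.wav")))
    print(f"{dataset.name}: {n_wavs} samples")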

Step 3: Configure Model

Model Path: Leave as vibevoice/VibeVoice-1.5B (recommended)

You can also use:

  • vibevoice/VibeVoice-7B - Higher quality but needs 48GB+ VRAM and trains more slowly
  • Local path if you've downloaded the model

Step 4: Set Training Parameters

Click Training Parameters to expand the options. A sketch showing how these settings map onto a typical LoRA training configuration follows the parameter descriptions below.

Epochs (default: 3)

  • How many times to train on the full dataset
  • Start with 3-5 epochs
  • More epochs = longer training, may improve quality
  • Too many epochs can cause overfitting (model sounds robotic)

Batch Size (default: 4)

  • Number of samples processed together
  • Reduce to 2 or 1 if you get out-of-memory errors
  • Increase to 8 if you have 24GB+ VRAM
  • Larger = faster training but more memory

Learning Rate (default: 1e-4)

  • How fast the model learns
  • Default works well for most cases
  • Don't change unless you know what you're doing

LoRA Rank (default: 8)

  • Complexity of the adapter
  • Default (8) balances quality and efficiency
  • Increase to 16 or 32 for potentially better quality (slower, more VRAM)
  • Decrease to 4 for faster training (may sacrifice quality)
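
For reference, these four settings correspond to standard LoRA fine-tuning knobs. The sketch below shows roughly how they might map onto peft and transformers configuration objects; the bundled trainer (ovl/training/trainer.py) may wire them differently, so treat this as an illustration rather than the exact implementation.

from peft import LoraConfig
from transformers import TrainingArguments

# Roughly what the Training tab defaults correspond to
lora_config = LoraConfig(
    r=8,                # LoRA Rank: adapter complexity
    lora_alpha=16,      # assumption: alpha = 2 x rank is a common convention
    lora_dropout=0.05,  # assumption: not exposed in the UI
)

training_args = TrainingArguments(
    output_dir="training_runs/example",
    num_train_epochs=3,             # Epochs
    per_device_train_batch_size=4,  # Batch Size
    learning_rate=1e-4,             # Learning Rate
)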

Step 5: Start Training

Click Start Training.

OpenVoiceLab will:

  1. Prepare the training data (converts it to JSONL format; a quick sanity-check sketch appears at the end of this step)
  2. Launch the training process in the background
  3. Start TensorBoard for monitoring

You'll see a status message:

✅ Training started: run_20251008_143022
Waiting for TensorBoard...

Training runs in a background process, so you can close the browser and it continues.
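
If you want to confirm the prepared data was written correctly, you can parse the generated JSONL file line by line. The exact filename and fields depend on the trainer, so the path below is an assumption; use whichever .jsonl file appears under your run directory.

import json
from pathlib import Path

# Adjust to the .jsonl file created for your run (this filename is an assumption)
jsonl_path = Path("training_runs/run_20251008_143022/train.jsonl")

records = [json.loads(line) for line in jsonl_path.read_text(encoding="utf-8").splitlines() if line.strip()]
print(f"{len(records)} training records")
print("fields:", sorted(records[0].keys()))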

Step 6: Monitor Training

TensorBoard

Wait 10-20 seconds, then click Refresh TensorBoard.

You'll see graphs showing:

  • Loss curves - Should decrease over time (lower = better)
  • Learning rate - How it changes during training
  • Steps per second - Training speed

A decreasing loss curve means training is working. If loss stops decreasing, training has plateaued.
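
The same numbers can be read directly from the event files with TensorBoard's Python API, which is handy if the embedded viewer is slow to load. A minimal sketch, assuming the event files live in the run directory and the training loss is logged under a tag containing "loss":

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point this at the directory containing the events.out.tfevents.* files
acc = EventAccumulator("training_runs/run_20251008_143022")
acc.Reload()

# Pick out whichever scalar tags carry a loss value (exact tag names are an assumption)
loss_tags = [t for t in acc.Tags()["scalars"] if "loss" in t.lower()]
for tag in loss_tags:
    points = acc.Scalars(tag)
    print(tag, "first:", points[0].value, "latest:", points[-1].value)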

Training Logs

Click the Training Logs accordion to expand.

Click Refresh Logs to see real-time output:

Step 100/1500 | Loss: 0.234 | LR: 0.0001 | 2.3s/it
Step 200/1500 | Loss: 0.198 | LR: 0.0001 | 2.1s/it

This shows:

  • Current step / total steps
  • Current loss value
  • Learning rate
  • Time per iteration

Training History

Click Refresh Runs to see all your training runs:

🟢 run_20251008_143022
- Created: 2025-10-08T14:30:22
- Status: running
- Dataset: data/my_voice

Step 7: Wait for Completion

Training takes time depending on:

  • Dataset size (more samples = longer)
  • Number of epochs
  • Batch size
  • GPU speed

Typical times:

  • 1000 samples, 3 epochs, batch size 4: 2-3 hours on RTX 4090
  • 2000 samples, 5 epochs, batch size 2: 5-6 hours
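
The total step count you see in the logs follows from these settings: roughly the number of samples divided by the batch size, times the number of epochs (gradient accumulation, if the trainer uses it, reduces the count further). A quick back-of-the-envelope check:

import math

samples, batch_size, epochs = 1247, 4, 3

steps_per_epoch = math.ceil(samples / batch_size)  # 312
total_steps = steps_per_epoch * epochs             # 936
print(f"roughly {total_steps} optimizer steps")
# Multiply by the s/it figure from the training logs to estimate wall-clock time.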

You can check progress in the training logs. Look for lines like:

Epoch 1/3 complete
Epoch 2/3 complete

When finished, status will change to "stopped" in Training History.

Step 8: Find Your Trained Model

The trained adapter is saved to:

training_runs/run_TIMESTAMP/checkpoints/

You'll need this path for inference. The full path looks like:

training_runs/run_20251008_143022/checkpoints/

Stopping Training Early

Click Stop Training to halt the current training run.

The latest checkpoint will still be saved and usable. You don't need to wait for all epochs to complete - sometimes 1-2 epochs is enough.

Training Parameters Explained

What is LoRA?

LoRA (Low-Rank Adaptation) trains small "adapter" layers instead of the full model. Benefits:

  • Much less VRAM needed (16GB vs 80GB+)
  • Faster training
  • Creates small files (a few hundred MB vs. multiple GB)
  • You can load/unload different adapters easily

The adapter modifies how the base model generates speech without changing the base model itself.

What are Epochs?

One epoch = training on the entire dataset once.

  • 3 epochs = model sees each sample 3 times
  • More epochs can improve quality but risk overfitting
  • Overfitting = model memorizes training data, sounds unnatural on new text

Start with 3 epochs. If results are poor, try 5. If results sound robotic, reduce to 2.

Batch Size vs VRAM

Batch size determines how many samples are processed simultaneously.

Larger batch sizes train faster but use more VRAM. If you get OOM (out of memory) errors, reduce batch size.
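
Before raising the batch size, you can check how much GPU memory is actually free with PyTorch (requires a CUDA build):

import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current CUDA device
print(f"{free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")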

Evaluating Results

After training, use the Inference tab to test your model:

  1. Load the base model (vibevoice/VibeVoice-1.5B)
  2. Check Load LoRA Adapter
  3. Enter the path: training_runs/run_TIMESTAMP/checkpoints/ (a script-level sketch of this step appears at the end of this section)
  4. Generate speech from text

Listen for:

  • Does it sound like the training voice?
  • Is speech natural or robotic?
  • Does it handle different text well?

If quality is poor, see the troubleshooting section below.
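
Under the hood, checking Load LoRA Adapter attaches the checkpoint to the base model. If you want to do the same thing from your own Python code, the usual peft pattern is sketched below; how OpenVoiceLab actually loads the VibeVoice base model is handled internally, so base_model here stands in for whatever model object you already have.

from peft import PeftModel

def attach_adapter(base_model, adapter_path="training_runs/run_20251008_143022/checkpoints"):
    # Attach the trained LoRA weights to the already-loaded base model
    model = PeftModel.from_pretrained(base_model, adapter_path)
    # Optionally fold the adapter into the base weights for faster inference
    return model.merge_and_unload()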

Training Multiple Voices

You can train multiple adapters and swap between them:

  1. Process different datasets (one per voice)
  2. Train each dataset separately
  3. Each creates a separate checkpoint in training_runs/
  4. At inference time, load different checkpoints to use different voices (see the listing sketch below)
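
To see which adapters you have available, list the checkpoint folders under training_runs/:

from pathlib import Path

for run in sorted(Path("training_runs").glob("run_*")):
    checkpoints = run / "checkpoints"
    if checkpoints.exists():
        print(f"{run.name} -> {checkpoints}")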

Troubleshooting

Out of memory during training

  • Reduce batch size to 2 or 1
  • Use a smaller LoRA rank (4 instead of 8)
  • Close other GPU applications
  • Use the 1.5B model instead of 7B
  • Process fewer samples at once

Training loss not decreasing

  • Check dataset quality (listen to samples in data/my_dataset/wavs/)
  • Make sure transcriptions are accurate
  • Try more epochs (5 instead of 3)
  • Try higher learning rate (1.5e-4)

Generated speech doesn't sound like training voice

  • Train for more epochs (5-7)
  • Increase LoRA rank (16 or 32)
  • Check training data quality
  • Make sure you're loading the adapter correctly at inference
  • Try using voice cloning with a reference sample

Training crashes or hangs

  • Check training logs for error messages
  • Make sure dataset is formatted correctly
  • Verify enough disk space
  • Try a smaller dataset first (100-200 samples) to test

TensorBoard won't load

  • Wait 30-60 seconds after starting training
  • Click Refresh TensorBoard
  • Check if port 6006 is blocked by a firewall (a quick port check is sketched below)
  • Look at tensorboard.log in the run directory
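
A quick way to tell whether anything is actually listening on TensorBoard's default port (6006) is a small socket check:

import socket

with socket.socket() as s:
    if s.connect_ex(("127.0.0.1", 6006)) == 0:
        print("TensorBoard is reachable on port 6006")
    else:
        print("Nothing is listening on port 6006 - TensorBoard may not have started yet")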

Tips for Better Results

Data quality matters most

  • 30 minutes of clean audio > 3 hours of noisy audio
  • Single speaker, minimal background noise
  • Natural speech patterns

Start conservative

  • Use default settings first
  • 3 epochs, batch size 4, LoRA rank 8
  • Only adjust if results are poor

Monitor training

  • Check loss curves in TensorBoard
  • Loss should decrease steadily
  • If loss flatlines early, training may need adjustment

Test early and often

  • Don't wait for all epochs to finish
  • After 1 epoch, test the checkpoint
  • If it's good enough, stop training

Experiment

  • Try different training parameters
  • Train on different data
  • Compare results

Next Steps

After training a model:

👉 Inference Guide - Generate speech with your finetuned voice

Care to share your finetuned model with the community?

  • Sign up for an account on Hugging Face
  • Visit this page and click Join this org
  • Click the New button, then click Model
  • Give your model a name and click Create
  • Click Files and versions, then the + Contribute button
  • Select Upload files, then drag and drop the fine-tuned checkpoint files from inside your training_runs/ run folder (NOT the entire training_runs/ folder)

This will allow other members of the community to use your model.

Advanced Topics

Resuming Training

Currently, you can't resume interrupted training. If training stops, you need to start over. The latest checkpoint before stopping is still usable.

Multi-Speaker Training

To train on multiple speakers:

  • Process each speaker's audio into separate datasets
  • Train separate adapters for each
  • Or combine datasets if you want a multi-speaker model

Training on Non-English

VibeVoice was trained primarily on English but has some multilingual capability. Training on other languages:

  • Make sure Whisper transcribes correctly
  • May need more data for good results
  • Results may vary

Custom Training Scripts

Advanced users can modify training parameters by editing ovl/training/trainer.py or running the training script directly with custom arguments.
