Finetuning Guide
This guide covers how to train VibeVoice on custom voices using the Training tab.
Prerequisites
Before training, you need:
- A processed dataset - See the data preparation guide
- GPU with 16+ GB VRAM - RTX 3090, 4090, or similar
- Time - Training can take anywhere from 10 minutes to several hours depending on dataset size and epochs
Training Process
Step 1: Open the Training Tab
Click the Training tab in OpenVoiceLab.
Step 2: Select Dataset
Click Refresh Datasets to load available datasets.
Select your dataset from the dropdown. It will show the dataset name and number of samples:
```
my_voice (1247 samples)
```
If you don't see your dataset:
- Make sure you processed it in the Data tab first
- Check that the dataset exists in the data/ folder
- Try clicking Refresh Datasets again
Step 3: Configure Model
Model Path: Leave as vibevoice/VibeVoice-1.5B (recommended)
You can also use:
- vibevoice/VibeVoice-7B - Higher quality, but needs 48GB+ VRAM and trains more slowly
- A local path, if you've downloaded the model
Step 4: Set Training Parameters
Click Training Parameters to expand the options.
Epochs (default: 3)
- How many times to train on the full dataset
- Start with 3-5 epochs
- More epochs = longer training, may improve quality
- Too many epochs can cause overfitting (model sounds robotic)
Batch Size (default: 4)
- Number of samples processed together
- Reduce to 2 or 1 if you get out-of-memory errors
- Increase to 8 if you have 24GB+ VRAM
- Larger = faster training but more memory
Learning Rate (default: 1e-4)
- How fast the model learns
- Default works well for most cases
- Don't change unless you know what you're doing
LoRA Rank (default: 8)
- Complexity of the adapter
- Default (8) balances quality and efficiency
- Increase to 16 or 32 for potentially better quality (slower, more VRAM)
- Decrease to 4 for faster training (may sacrifice quality)
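OpenVoiceLab applies these settings for you, but for orientation, here is roughly how they would map onto a Hugging Face peft configuration. This is a sketch, not OpenVoiceLab's actual trainer code; the target module names in particular are assumptions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                   # LoRA Rank from the UI
    lora_alpha=16,                         # scaling factor, often 2x the rank
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumption: attention projections
)
# Epochs, batch size, and learning rate correspond to standard trainer
# arguments such as num_train_epochs=3, per_device_train_batch_size=4,
# and learning_rate=1e-4.
```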
Step 5: Start Training
Click Start Training.
OpenVoiceLab will:
- Prepare the training data (converting it to JSONL format; see the example below)
- Launch the training process in the background
- Start TensorBoard for monitoring
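For reference, JSONL means one JSON object per line. The exact fields are an internal detail of OpenVoiceLab's trainer, but an audio-text training record typically looks something like this (the field names here are illustrative, not the actual schema):

```
{"audio": "data/my_voice/wavs/0001.wav", "text": "Hello, this is the first sample."}
{"audio": "data/my_voice/wavs/0002.wav", "text": "Each line is one training example."}
```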
You'll see a status message:
```
✅ Training started: run_20251008_143022
Waiting for TensorBoard...
```
Training runs in a background process, so you can close the browser and training will continue.
Step 6: Monitor Training
TensorBoard
Wait 10-20 seconds, then click Refresh TensorBoard.
You'll see graphs showing:
- Loss curves - Should decrease over time (lower = better)
- Learning rate - How it changes during training
- Steps per second - Training speed
A decreasing loss curve means training is working. If loss stops decreasing, training has plateaued.
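If the embedded view won't load, you can also run TensorBoard yourself from a terminal. This assumes the run's event logs live under its folder in training_runs/ (the exact log subdirectory may differ):

```
tensorboard --logdir training_runs/run_20251008_143022 --port 6006
```

Then open http://localhost:6006 in your browser.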
Training Logs
Click the Training Logs accordion to expand.
Click Refresh Logs to see real-time output:
```
Step 100/1500 | Loss: 0.234 | LR: 0.0001 | 2.3s/it
Step 200/1500 | Loss: 0.198 | LR: 0.0001 | 2.1s/it
```
This shows:
- Current step / total steps
- Current loss value
- Learning rate
- Time per iteration
Training History
Click Refresh Runs to see all your training runs:
```
🟢 run_20251008_143022
- Created: 2025-10-08T14:30:22
- Status: running
- Dataset: data/my_voice
```
Step 7: Wait for Completion
Training takes time depending on:
- Dataset size (more samples = longer)
- Number of epochs
- Batch size
- GPU speed
Typical times:
- 1000 samples, 3 epochs, batch size 4: 2-3 hours on RTX 4090
- 2000 samples, 5 epochs, batch size 2: 5-6 hours
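You can also make a rough estimate yourself from the time-per-iteration shown in the logs. A minimal sketch, assuming one optimizer step per batch; this is a lower bound, and real runs (like the typical times above) can take considerably longer once data loading, logging, evaluation, and checkpointing are included:

```python
import math

def estimate_hours(samples: int, epochs: int, batch_size: int, sec_per_iter: float) -> float:
    """Lower-bound ETA: total optimizer steps x measured seconds per iteration."""
    steps_per_epoch = math.ceil(samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps * sec_per_iter / 3600

# 1000 samples, 3 epochs, batch size 4, ~2.2 s/it from the logs
print(f"{estimate_hours(1000, 3, 4, 2.2):.1f} h minimum")  # ~0.5 h before overhead
```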
You can check progress in the training logs. Look for lines like:
```
Epoch 1/3 complete
Epoch 2/3 complete
```
When training finishes, the status will change to "stopped" in Training History.
Step 8: Find Your Trained Model
The trained adapter is saved to:
```
training_runs/run_TIMESTAMP/checkpoints/
```
You'll need this path for inference. The full path looks like:
```
training_runs/run_20251008_143022/checkpoints/
```
Stopping Training Early
Click Stop Training to halt the current training run.
The latest checkpoint will still be saved and usable. You don't need to wait for all epochs to complete - sometimes 1-2 epochs is enough.
Training Parameters Explained
What is LoRA?
LoRA (Low-Rank Adaptation) trains small "adapter" layers instead of the full model. Benefits:
- Much less VRAM needed (16GB vs 80GB+)
- Faster training
- Creates small files (a few hundred MB instead of multiple GB)
- You can load/unload different adapters easily
The adapter modifies how the base model generates speech without changing the base model itself.
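To see why the adapter files are small, compare parameter counts for a single weight matrix. A minimal sketch (the 4096-dimensional layer size is illustrative, not VibeVoice's actual architecture):

```python
# LoRA: instead of updating a full weight matrix W (d x k), train two
# low-rank factors B (d x r) and A (r x k) and apply W + B @ A.
d, k, r = 4096, 4096, 8  # illustrative layer size, LoRA rank 8

full_params = d * k              # updating W directly: 16,777,216 params
lora_params = d * r + r * k      # training B and A:        65,536 params
print(f"LoRA trains {full_params // lora_params}x fewer params per layer")  # 256x
```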
What are Epochs?
One epoch = training on the entire dataset once.
- 3 epochs = model sees each sample 3 times
- More epochs can improve quality but risk overfitting
- Overfitting = model memorizes training data, sounds unnatural on new text
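For example, with the 1247-sample dataset above, 3 epochs means 3 × 1247 = 3,741 sample presentations in total, or about 312 optimizer steps per epoch at batch size 4 (1247 / 4, rounded up).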
Start with 3 epochs. If results are poor, try 5. If results sound robotic, reduce to 2.
Batch Size vs VRAM
Batch size determines how many samples are processed simultaneously.
Larger batch sizes train faster but use more VRAM. If you get OOM (out of memory) errors, reduce batch size.
Evaluating Results
After training, use the Inference tab to test your model:
- Load the base model (vibevoice/VibeVoice-1.5B)
- Check Load LoRA Adapter
- Enter the path: training_runs/run_TIMESTAMP/checkpoints/
- Generate speech from text
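The Inference tab handles this for you, but under the hood loading a LoRA adapter typically looks like the following in a Hugging Face peft stack. A sketch only, assuming the checkpoint is a standard peft adapter and the base model loads through transformers (the actual model class OpenVoiceLab uses may differ):

```python
from transformers import AutoModel
from peft import PeftModel

# Assumptions: the base model is loadable via transformers (custom models
# may need trust_remote_code=True) and the checkpoint is a peft adapter.
base = AutoModel.from_pretrained("vibevoice/VibeVoice-1.5B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "training_runs/run_20251008_143022/checkpoints/")
```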
Listen for:
- Does it sound like the training voice?
- Is speech natural or robotic?
- Does it handle different text well?
If quality is poor, see the troubleshooting section below.
Training Multiple Voices
You can train multiple adapters and swap between them:
- Process different datasets (one per voice)
- Train each dataset separately
- Each creates a separate checkpoint in training_runs/ (see the layout below)
- At inference time, load different checkpoints to use different voices
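With two voices trained, training_runs/ might look like this (run names are examples):

```
training_runs/
├── run_20251008_143022/      # voice A
│   └── checkpoints/
└── run_20251009_091500/      # voice B
    └── checkpoints/
```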
Troubleshooting
Out of memory during training
- Reduce batch size to 2 or 1
- Use a smaller LoRA rank (4 instead of 8)
- Close other GPU applications
- Use the 1.5B model instead of 7B
- Process fewer samples at once
Training loss not decreasing
- Check dataset quality (listen to samples in data/my_dataset/wavs/)
- Make sure transcriptions are accurate
- Try more epochs (5 instead of 3)
- Try a higher learning rate (e.g., 1.5e-4)
Generated speech doesn't sound like training voice
- Train for more epochs (5-7)
- Increase LoRA rank (16 or 32)
- Check training data quality
- Make sure you're loading the adapter correctly at inference
- Try using voice cloning with a reference sample
Training crashes or hangs
- Check training logs for error messages
- Make sure dataset is formatted correctly
- Verify enough disk space
- Try a smaller dataset first (100-200 samples) to test
TensorBoard won't load
- Wait 30-60 seconds after starting training
- Click Refresh TensorBoard
- Check if port 6006 is blocked by firewall
- Look at tensorboard.log in the run directory
Tips for Better Results
Data quality matters most
- 30 minutes of clean audio > 3 hours of noisy audio
- Single speaker, minimal background noise
- Natural speech patterns
Start conservative
- Use default settings first
- 3 epochs, batch size 4, LoRA rank 8
- Only adjust if results are poor
Monitor training
- Check loss curves in TensorBoard
- Loss should decrease steadily
- If loss flatlines early, training may need adjustment
Test early and often
- Don't wait for all epochs to finish
- After 1 epoch, test the checkpoint
- If it's good enough, stop training
Experiment
- Try different training parameters
- Train on different data
- Compare results
Next Steps
After training a model:
👉 Inference Guide - Generate speech with your finetuned voice
Care to share your finetuned model with the community?
- Sign up for an account on Hugging Face
- Visit this page and click Join this org
- Click the New button, then click Model
- Give your model a name and click Create
- Click Files and versions, then the + Contribute button
- Select Upload files and drag and drop the fine-tuned model files from your training_runs/ folder (NOT the whole folder)
This will allow other members of the community to use your model.
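Alternatively, you can upload from Python with the huggingface_hub library after logging in with huggingface-cli login. The paths and repo name below are examples; use the model name you created above:

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="training_runs/run_20251008_143022/checkpoints",  # your run
    repo_id="your-username/my-voice-lora",                        # example name
    repo_type="model",
)
```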
Advanced Topics
Resuming Training
Currently, you can't resume interrupted training. If training stops, you need to start over. The latest checkpoint before stopping is still usable.
Multi-Speaker Training
To train on multiple speakers:
- Process each speaker's audio into separate datasets
- Train separate adapters for each
- Or combine datasets if you want a multi-speaker model
Training on Non-English
VibeVoice was trained primarily on English but has some multilingual capability. Training on other languages:
- Make sure Whisper transcribes correctly
- May need more data for good results
- Results may vary
Custom Training Scripts
Advanced users can modify training parameters by editing ovl/training/trainer.py or running the training script directly with custom arguments.