Managing Voices
This guide covers the Voices tab and how to manage reference voice samples for voice cloning.
What are Reference Voices?
Reference voices are short audio samples used to guide speech generation. When you enable voice cloning in the Inference tab, the model uses these samples to match speaking style.
Think of it as showing the model an example: "generate speech that sounds like this."
The Voices Tab
The Voices tab lets you:
- See available reference voices
- Upload new voice samples
- Test voices with sample text
- Organize your voice library
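Behind the scenes this is just a folder of audio files. As a rough sketch (the `voices` folder name comes from this guide; the `list_voices` helper is illustrative, not OpenVoiceLab's actual code), the dropdown contents could be enumerated like this:

```python
from pathlib import Path

# Formats listed under "Technical Details" below
SUPPORTED = {".wav", ".mp3", ".flac", ".m4a"}

def list_voices(folder="voices"):
    """Return the voice sample filenames a dropdown would show."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.suffix.lower() in SUPPORTED)
```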
Default Voices
OpenVoiceLab comes with some default reference voices in the voices/ folder. These are used for voice cloning during inference.
Adding New Voices
Step 1: Prepare Your Audio
Voice samples should be:
- Short (roughly 10-30 seconds)
- Clean (no background noise)
- Single speaker
- Natural speech (not monotone)
- WAV, MP3, or FLAC format
You can use:
- A recording of the target speaker
- A clip from a podcast or video
- Any clear speech sample
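The checklist above can be partly automated. A minimal sketch for WAV files using only the Python standard library (the `check_reference` helper and its thresholds are illustrative; MP3/FLAC would need a decoder such as ffmpeg or soundfile):

```python
import wave

def check_reference(path, min_sec=5, max_sec=60):
    """Return a list of problems found in a WAV reference sample."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
        channels = w.getnchannels()
    problems = []
    if seconds < min_sec:
        problems.append(f"too short ({seconds:.1f}s)")
    if seconds > max_sec:
        problems.append(f"longer than needed ({seconds:.1f}s)")
    if channels != 1:
        problems.append("not mono (mono single-speaker audio is simplest)")
    return problems
```

An empty list means the sample at least has a sensible length and channel count; it says nothing about noise or speaker count, which you still have to judge by ear.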
Step 2: Upload to the Voice Manager
Open the Voices tab and drag and drop your audio file into the file upload area. Your voice will now be available in the dropdown.
Step 3: Refresh in OpenVoiceLab
In the Inference tab:
- Click the 🔄 refresh button next to the Voice dropdown
- Your new voice appears in the list
- Select it to use for generation
Choosing Good Reference Samples
What works well:
- Clear, natural speech
- Moderate speaking pace
- Expressive but not theatrical
- Representative of the desired style
What doesn't work well:
- Very quiet or loud audio
- Background music or noise
- Whispering or shouting
- Heavily processed audio (effects, autotune)
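"Very quiet or loud" can be checked objectively. A rough peak-level sketch for 16-bit WAV files (the `peak_level` helper and the cutoffs mentioned below are assumptions, not part of OpenVoiceLab):

```python
import struct, wave

def peak_level(path):
    """Peak amplitude of a 16-bit WAV as a fraction of full scale (0.0-1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit samples"
        count = w.getnframes() * w.getnchannels()
        samples = struct.unpack(f"<{count}h", w.readframes(w.getnframes()))
    return max(abs(s) for s in samples) / 32768
```

As a rule of thumb, a peak well below ~0.1 suggests the recording is too quiet, while a peak pinned at 1.0 often means clipping.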
Voice Cloning Behavior
The model uses reference voices to:
- Match speaking style and tone
- Capture prosody (rhythm and intonation)
- Approximate voice characteristics
It doesn't perfectly clone a voice; it guides the generation toward that style.
Important: Even with a reference voice, the model's base characteristics (from pretraining or finetuning) dominate. Reference voices provide guidance, not complete voice replacement.
Using Voices with Finetuned Models
When you finetune a model, you can still use reference voices:
Finetuned model + reference voice:
- Model generates in its finetuned style
- Reference voice adds extra guidance
- Can help consistency
Finetuned model without reference:
- Model uses purely learned style
- May sound more "pure" to the training data
- Less external influence
Try both to see what works better for your case.
Multiple Voices for One Person
You can have multiple reference samples for the same person:
voices/
├── john_casual.wav
├── john_formal.wav
└── john_excited.wav

Use different samples to guide different speaking styles from the same voice.
Organizing Your Voice Library
For larger collections, you can organize with prefixes:
voices/
├── male_deep.wav
├── male_energetic.wav
├── female_calm.wav
└── female_professional.wav

Or by project:
voices/
├── podcast_host.wav
├── podcast_guest.wav
├── tutorial_narrator.wav
└── character_villain.wav

Testing Voices
To test how a voice sounds:
- Go to Inference tab
- Load model (pretrained is fine)
- Enable voice cloning
- Select your voice
- Generate with sample text:
Hello, this is a test of the voice sample. Let's hear how it sounds.

- Listen and evaluate
Try the same text with different voices to compare.
Voice Quality Tips
Recording your own reference samples:
- Use a decent microphone
- Record in a quiet room
- Speak naturally, not reading robotically
- Include varied intonation
- Keep it short (10-20 seconds)
Using existing audio:
- Extract clean segments (no music/noise)
- Choose representative samples
- Avoid heavily compressed audio
- Use the highest quality source available
Technical Details
Supported Formats
- WAV (any sample rate)
- MP3
- FLAC
- M4A
The model internally processes at 24kHz, so your sample is resampled if needed.
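You normally don't need to resample anything yourself, but for intuition, here is a minimal sketch of what resampling to 24 kHz amounts to (nearest-neighbour only; this `resample_to_24k` helper is illustrative, and real pipelines use properly filtered resampling):

```python
import wave

def resample_to_24k(in_path, out_path, target_rate=24000):
    """Crude nearest-neighbour resample of a WAV file to target_rate."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    width = params.sampwidth * params.nchannels  # bytes per frame
    n_out = int(params.nframes * target_rate / params.framerate)
    out = bytearray()
    for i in range(n_out):
        # pick the nearest source frame for each output frame
        j = i * params.framerate // target_rate
        out += frames[j * width:(j + 1) * width]
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(target_rate)
        dst.writeframes(bytes(out))
```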
Sample Length
- Too short (< 5 seconds): May not capture enough style information
- Good range (10-30 seconds): Ideal for most cases
- Too long (> 60 seconds): Unnecessary; the first portion matters most
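If a sample is too long, trimming it to the first 30 seconds is straightforward. A small sketch for WAV files (the `trim_reference` helper is illustrative, not part of OpenVoiceLab):

```python
import wave

def trim_reference(in_path, out_path, max_seconds=30):
    """Write only the first max_seconds of a WAV sample to out_path."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(keep)
        channels, width = src.getnchannels(), src.getsampwidth()
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(frames)
```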
How Voice Cloning Works
VibeVoice uses "prefill": it processes the reference audio first, then generates new speech conditioned on that style.
The CFG (Classifier-Free Guidance) scale controls how strongly the model follows the reference:
- Low CFG = less influence from reference
- High CFG = more influence from reference
Common Issues
Voice doesn't sound like reference
- Reference sample may be too short
- Increase CFG scale (try 1.5-1.8)
- Try a different reference sample
- Model's base voice may be too different
Inconsistent results
- Voice cloning has some randomness
- Try generating multiple times
- Use a clearer reference sample
- Adjust CFG scale
Voice sounds robotic
- CFG scale may be too high
- Try lower CFG (1.0-1.3)
- Use a more natural reference sample
Voices Folder Location
The voices folder is at the root of the OpenVoiceLab directory:
openvoicelab/
├── voices/ # <- Put voice samples here
│ ├── voice1.wav
│ └── voice2.wav
├── data/
├── outputs/
└── ...

Backup and Sharing
To backup your voices:
cp -r voices/ voices_backup/

To share a voice with someone:
# Just send them the wav file
cp voices/my_voice.wav ~/Desktop/

They put it in their voices/ folder, and it works the same way.
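If you back up regularly, a timestamped copy avoids overwriting older backups. A small sketch (the `backup_voices` helper and naming scheme are assumptions, not an OpenVoiceLab feature):

```python
import shutil
import time

def backup_voices(src="voices", dest_prefix="voices_backup"):
    """Copy the voices folder to a new timestamped backup directory."""
    dest = f"{dest_prefix}_{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copytree(src, dest)
    return dest
```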
Next Steps
- Inference Guide - Use voices for generation
- FAQ - Common questions about voices
Example Workflow
Creating a custom narrator voice:
- Record or find a 20-second clip of desired narrator style
- Clean up audio (remove noise, normalize volume)
- Save as voices/narrator.wav
- Refresh voices in Inference tab
- Generate with narrator voice selected
- Adjust CFG scale to taste
Using podcast guest voices:
- Extract clean speech segments from podcast
- Save each guest as a separate file: guest1.wav, guest2.wav
- Add to voices folder
- Generate dialogue by switching voices for each speaker
Testing finetuned voices:
- Train model on speaker A
- Add reference samples from speaker A to voices
- Generate with and without reference
- Compare which sounds better