Managing Voices
This guide covers the Voices tab and how to manage reference voice samples for voice cloning.
What are Reference Voices?
Reference voices are short audio samples used to guide speech generation. When you enable voice cloning in the Inference tab, the model uses these samples to match speaking style.
Think of it as showing the model an example: "generate speech that sounds like this."
The Voices Tab
The Voices tab lets you:
- See available reference voices
- Upload new voice samples
- Test voices with sample text
- Organize your voice library
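Behind the scenes this is just a folder of audio files. As a rough sketch (the `voices` folder name comes from this guide; the `list_voices` helper is illustrative, not OpenVoiceLab's actual code), the dropdown contents could be enumerated like this:

```python
from pathlib import Path

# Formats listed under "Technical Details" below
SUPPORTED = {".wav", ".mp3", ".flac", ".m4a"}

def list_voices(folder="voices"):
    """Return the voice sample filenames a dropdown would show."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.suffix.lower() in SUPPORTED)
```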
Default Voices
OpenVoiceLab comes with some default reference voices in the voices/ folder. These are used for voice cloning during inference.
Adding New Voices
Step 1: Prepare Your Audio
Voice samples should be:
- Short (roughly 10-30 seconds)
- Clean (no background noise)
- Single speaker
- Natural speech (not monotone)
- WAV, MP3, or FLAC format
You can use:
- A recording of the target speaker
- A clip from a podcast or video
- Any clear speech sample
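The checklist above can be partly automated. A minimal sketch for WAV files using only the Python standard library (the `check_reference` helper and its thresholds are illustrative; MP3/FLAC would need a decoder such as ffmpeg or soundfile):

```python
import wave

def check_reference(path, min_sec=5, max_sec=60):
    """Return a list of problems found in a WAV reference sample."""
    with wave.open(path, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
        channels = w.getnchannels()
    problems = []
    if seconds < min_sec:
        problems.append(f"too short ({seconds:.1f}s)")
    if seconds > max_sec:
        problems.append(f"longer than needed ({seconds:.1f}s)")
    if channels != 1:
        problems.append("not mono (mono single-speaker audio is simplest)")
    return problems
```

An empty list means the sample at least has a sensible length and channel count; it says nothing about noise or speaker count, which you still have to judge by ear.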
Step 2: Upload to the Voice Manager
Open the Voices tab and drag and drop your audio file into the file upload area. Your voice will now be available in the dropdown.
Step 3: Refresh in OpenVoiceLab
In the Inference tab:
- Click the 🔄 refresh button next to the Voice dropdown
- Your new voice appears in the list
- Select it to use for generation
Choosing Good Reference Samples
What works well:
- Clear, natural speech
- Moderate speaking pace
- Expressive but not theatrical
- Representative of the desired style
What doesn't work well:
- Very quiet or loud audio
- Background music or noise
- Whispering or shouting
- Heavily processed audio (effects, autotune)
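"Very quiet or loud" can be checked objectively. A rough peak-level sketch for 16-bit WAV files (the `peak_level` helper and the cutoffs mentioned below are assumptions, not part of OpenVoiceLab):

```python
import struct, wave

def peak_level(path):
    """Peak amplitude of a 16-bit WAV as a fraction of full scale (0.0-1.0)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit samples"
        count = w.getnframes() * w.getnchannels()
        samples = struct.unpack(f"<{count}h", w.readframes(w.getnframes()))
    return max(abs(s) for s in samples) / 32768
```

As a rule of thumb, a peak well below ~0.1 suggests the recording is too quiet, while a peak pinned at 1.0 often means clipping.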
Voice Cloning Behavior
The model uses reference voices to:
- Match speaking style and tone
- Capture prosody (rhythm and intonation)
- Approximate voice characteristics
It doesn't perfectly clone a voice; it guides the generation toward that style.
Important: Even with a reference voice, the model's base characteristics (from pretraining or finetuning) dominate. Reference voices provide guidance, not complete voice replacement.
Using Voices with Finetuned Models
When you finetune a model, you can still use reference voices:
Finetuned model + reference voice:
- Model generates in its finetuned style
- Reference voice adds extra guidance
- Can help consistency
Finetuned model without reference:
- Model uses purely learned style
- May sound more "pure" to the training data
- Less external influence
Try both to see what works better for your case.
Multiple Voices for One Person
You can have multiple reference samples for the same person:
voices/
├── john_casual.wav
├── john_formal.wav
└── john_excited.wav

Use different samples to guide different speaking styles from the same voice.
Organizing Your Voice Library
For larger collections, you can organize with prefixes:
voices/
├── male_deep.wav
├── male_energetic.wav
├── female_calm.wav
└── female_professional.wav

Or by project:
voices/
├── podcast_host.wav
├── podcast_guest.wav
├── tutorial_narrator.wav
└── character_villain.wav

Testing Voices
To test how a voice sounds:
- Go to Inference tab
- Load model (pretrained is fine)
- Enable voice cloning
- Select your voice
- Generate with sample text:
Hello, this is a test of the voice sample. Let's hear how it sounds.

- Listen and evaluate
Try the same text with different voices to compare.
Voice Quality Tips
Recording your own reference samples:
- Use a decent microphone
- Record in a quiet room
- Speak naturally, not reading robotically
- Include varied intonation
- Keep it short (10-20 seconds)
Using existing audio:
- Extract clean segments (no music/noise)
- Choose representative samples
- Avoid heavily compressed audio
- Use the highest quality source available
Technical Details
Supported Formats
- WAV (any sample rate)
- MP3
- FLAC
- M4A
The model internally processes at 24kHz, so your sample is resampled if needed.
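You normally don't need to resample anything yourself, but for intuition, here is a minimal sketch of what resampling to 24 kHz amounts to (nearest-neighbour only; this `resample_to_24k` helper is illustrative, and real pipelines use properly filtered resampling):

```python
import wave

def resample_to_24k(in_path, out_path, target_rate=24000):
    """Crude nearest-neighbour resample of a WAV file to target_rate."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    width = params.sampwidth * params.nchannels  # bytes per frame
    n_out = int(params.nframes * target_rate / params.framerate)
    out = bytearray()
    for i in range(n_out):
        # pick the nearest source frame for each output frame
        j = i * params.framerate // target_rate
        out += frames[j * width:(j + 1) * width]
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(target_rate)
        dst.writeframes(bytes(out))
```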
Sample Length
- Too short (< 5 seconds): May not capture enough style information
- Good range (10-30 seconds): Ideal for most cases
- Too long (> 60 seconds): Unnecessary; the first portion matters most
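If a sample is too long, trimming it to the first 30 seconds is straightforward. A small sketch for WAV files (the `trim_reference` helper is illustrative, not part of OpenVoiceLab):

```python
import wave

def trim_reference(in_path, out_path, max_seconds=30):
    """Write only the first max_seconds of a WAV sample to out_path."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(keep)
        channels, width = src.getnchannels(), src.getsampwidth()
    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(frames)
```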
How Voice Cloning Works
VibeVoice uses "prefill": it processes the reference audio first, then generates new speech conditioned on that style.
The CFG (Classifier-Free Guidance) scale controls how strongly the model follows the reference:
- Low CFG = less influence from reference
- High CFG = more influence from reference
Common Issues
Voice doesn't sound like reference
- Reference sample may be too short
- Increase CFG scale (try 1.5-1.8)
- Try a different reference sample
- Model's base voice may be too different
Inconsistent results
- Voice cloning has some randomness
- Try generating multiple times
- Use a clearer reference sample
- Adjust CFG scale
Voice sounds robotic
- CFG scale may be too high
- Try lower CFG (1.0-1.3)
- Use a more natural reference sample
Voices Folder Location
The voices folder is at the root of the OpenVoiceLab directory:
openvoicelab/
├── voices/ # <- Put voice samples here
│ ├── voice1.wav
│ └── voice2.wav
├── data/
├── outputs/
└── ...

Backup and Sharing
To backup your voices:
cp -r voices/ voices_backup/

To share a voice with someone:
# Just send them the wav file
cp voices/my_voice.wav ~/Desktop/

They put it in their voices/ folder, and it works the same way.
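If you back up regularly, a timestamped copy avoids overwriting older backups. A small sketch (the `backup_voices` helper and naming scheme are assumptions, not an OpenVoiceLab feature):

```python
import shutil
import time

def backup_voices(src="voices", dest_prefix="voices_backup"):
    """Copy the voices folder to a new timestamped backup directory."""
    dest = f"{dest_prefix}_{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copytree(src, dest)
    return dest
```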
Next Steps
- Inference Guide - Use voices for generation
- FAQ - Common questions about voices
Example Workflow
Creating a custom narrator voice:
- Record or find a 20-second clip of desired narrator style
- Clean up audio (remove noise, normalize volume)
- Save as voices/narrator.wav
- Refresh voices in Inference tab
- Generate with narrator voice selected
- Adjust CFG scale to taste
Using podcast guest voices:
- Extract clean speech segments from podcast
- Save each guest as a separate file: guest1.wav, guest2.wav
- Add to voices folder
- Generate dialogue by switching voices for each speaker
Testing finetuned voices:
- Train model on speaker A
- Add reference samples from speaker A to voices
- Generate with and without reference
- Compare which sounds better