Generating Speech (Inference)

This guide covers using the Inference tab to generate speech from text.

Overview

The Inference tab lets you:

Load the VibeVoice model (with or without LoRA adapters)
Generate speech from text
Use voice cloning with reference samples
Adjust generation settings

Basic Workflow

Load the model
Enter text
Select voice (optional)
Adjust settings
Generate
Listen to output

Loading the Model

First Time Setup

Go to the Inference tab
Model Path: Use vibevoice/VibeVoice-1.5B (default)
- Or vibevoice/VibeVoice-7B if you have 24GB+ VRAM
Device: Auto-selected (cuda/mps/cpu)
Leave Load LoRA Adapter unchecked (for now)
Click Load Model

The first load downloads about 3-6 GB from Hugging Face. Subsequent loads reuse the downloaded files and are much faster.

When loaded, you'll see:

✓ Model loaded on cuda

Loading with LoRA (Finetuned Voice)

To use your trained voice:

Check Load LoRA Adapter
The LoRA Path field appears

Enter the path to your training checkpoint:

training_runs/run_20251008_143022/checkpoints/

Click Load Model

You'll see:

✓ Model loaded on cuda
✓ LoRA loaded from training_runs/run_20251008_143022/checkpoints/

Now the model will generate speech in your finetuned voice style.
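If the LoRA line does not appear, a mistyped path is a likely cause. The following is a minimal sketch for checking the path before clicking Load Model. It is not part of OpenVoiceLab: `check_lora_path` is a hypothetical helper, and it assumes only that the checkpoint is a non-empty directory, not which files the adapter contains.

```python
from pathlib import Path


def check_lora_path(path):
    """Return an error message for an unusable checkpoint path, or None if it looks fine."""
    p = Path(path)
    if not p.exists():
        return f"Path does not exist: {p}"
    if not p.is_dir():
        return f"Path is not a directory: {p}"
    if not any(p.iterdir()):
        return f"Directory is empty: {p}"
    return None


if __name__ == "__main__":
    # Example path taken from this guide; replace it with your own run folder.
    problem = check_lora_path("training_runs/run_20251008_143022/checkpoints/")
    print(problem or "Checkpoint directory looks usable")
```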

Unloading

Click Unload Model to free up VRAM. You'll need to load again before generating.

Generating Speech

Enter Text

In the Text box, enter what you want to synthesize:

Hello! This is a test of my custom voice.

Tips:

Use proper punctuation for natural pausing
Keep sentences to a reasonable length
Add commas for pauses
Use periods for stops
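To apply the sentence-length tip to longer scripts, you could flag overly long sentences before pasting the text in. This is a rough sketch: `long_sentences` is a hypothetical helper, the 30-word threshold is an arbitrary assumption rather than a VibeVoice limit, and the split on `.`, `!`, and `?` is deliberately naive.

```python
import re


def long_sentences(text, max_words=30):
    """Return the sentences in `text` that contain more than `max_words` words."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


if __name__ == "__main__":
    sample = "Hello! This is a test of my custom voice."
    # With the default threshold nothing is flagged for this short sample.
    print(long_sentences(sample))
```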

Voice Cloning Options

Enable Voice Cloning (checkbox)

When checked:

Uses a reference voice to guide generation
Makes output sound more like a specific speaker
Select a voice from the dropdown

When unchecked:

Uses the model's "default" learned voice
Good for generic speech
No reference needed

Voice (dropdown)

Select a reference voice sample. These are audio files in the voices/ folder.

Try different voices to hear style variations.
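To see which reference samples the dropdown is likely to offer, you can list the audio files in voices/ yourself. This is a sketch under assumptions: `list_voices` is a hypothetical helper, and it assumes the samples are `.wav` files. The guide only says they are audio files, so adjust `extensions` if yours differ.

```python
from pathlib import Path


def list_voices(folder="voices", extensions=(".wav",)):
    """Return the sorted names (without extension) of audio files in `folder`."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p.stem
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


if __name__ == "__main__":
    for name in list_voices():
        print(name)
```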

CFG Scale

Range: 1.0 - 2.0 (default: 1.3)

Controls how closely the model follows the reference voice:

1.0-1.2: More creative, natural, less similar to reference
1.3-1.5: Good balance (recommended)
1.6-2.0: Very similar to reference, may sound less natural

Start at 1.3 and adjust based on results.

Generate

Click Generate Speech.

You'll see a progress bar. Generation time depends on:

Text length
Your hardware
Model size (7B is slower than 1.5B)

Output

When complete:

Audio player appears with your generated speech

Status shows metrics:

✓ Generated successfully
Duration: 5.2s
Generation time: 2.1s
RTF: 0.40x

Audio saved to outputs/generated_TIMESTAMP.wav

RTF (Real-Time Factor):

< 1.0 = Faster than real-time
= 1.0 = Exactly real-time
> 1.0 = Slower than real-time

Lower is faster.
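The RTF in the sample status above is consistent with generation time divided by audio duration. The sketch below reproduces that arithmetic; `real_time_factor` is an illustrative name, not a function from OpenVoiceLab.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    # Values from the sample status: 2.1 s to generate 5.2 s of audio.
    rtf = real_time_factor(2.1, 5.2)
    print(f"RTF: {rtf:.2f}x")  # prints "RTF: 0.40x"
```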

Using Finetuned Models

After training your voice:

Load base model: vibevoice/VibeVoice-1.5B
Check Load LoRA Adapter
Enter checkpoint path
Generate as normal

The model combines:

Base VibeVoice knowledge (how to speak)
Your LoRA adapter (your voice style)

With vs Without Voice Cloning

With voice cloning enabled:

More control over speaking style
Can match specific reference samples
Good for mimicking a particular recording

With voice cloning disabled:

Uses purely the finetuned adapter
Model's learned voice without guidance
Good for seeing what the model learned

Try both to see which works better for your use case.

Generation Settings Explained

Voice Selection

Reference voices are stored in the voices/ folder. The model uses these to understand speaking style.

You can add your own reference voices. See the Managing Voices guide.

Device Selection

cuda: NVIDIA GPUs (fastest)
mps: Apple Silicon (fast)
cpu: Works anywhere (slowest)

The device is auto-selected on load. You can't change it without reloading the model.
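The auto-selection presumably follows the preference order in the list above: cuda, then mps, then cpu. The sketch below illustrates that ordering only. `pick_device` is a hypothetical function, and the application's actual selection code is not shown in this guide.

```python
def pick_device(cuda_available, mps_available):
    """Pick the fastest available backend in the order cuda > mps > cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # With PyTorch installed, the flags could come from
    # torch.cuda.is_available() and torch.backends.mps.is_available().
    print(pick_device(cuda_available=False, mps_available=False))
```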

Tips for Better Results

Text formatting:

Use natural punctuation
Avoid very long run-on sentences
Include appropriate capitalization
Don't use unusual symbols

Voice cloning:

Try different reference voices
Adjust CFG scale if output doesn't match expectations
Some voices work better than others

Finetuned models:

Make sure you trained on good-quality data
If results are poor, retrain with more epochs or better data
Voice cloning can help or hurt depending on the reference

Experimentation:

Generate the same text multiple times
Try different CFG scales
Test with and without voice cloning
Compare different reference voices

Troubleshooting

Model won't load

Out of memory:

Close other GPU applications
Try CPU device (slower)
Use 1.5B instead of 7B

Connection timeout:

Model is downloading - wait longer
Check internet connection

Generation fails

"Please load the model first":

Click Load Model before generating

"Please select a voice":

Either select a voice or disable voice cloning

Out of memory during generation:

Shorten your text
Close other GPU applications
Restart OpenVoiceLab

Audio sounds wrong

Robotic or unnatural:

Try lower CFG scale (1.0-1.2)
Use different reference voice
If you are using a finetuned model, it may be overtrained

Doesn't match expected voice:

Try higher CFG scale (1.5-1.8)
Use different reference voice
Check that LoRA loaded correctly

Background music or noise:

This is a known VibeVoice behavior
Try different text or reference voice
Rerun generation (results vary)

Generation is slow

This is normal for large models
GPU is much faster than CPU
7B model is slower than 1.5B
Shorter text generates faster

Output Files

Generated audio is saved to outputs/ with timestamps:

outputs/
├── generated_20251008_143022.wav
├── generated_20251008_143045.wav
└── generated_20251008_143108.wav

Files are in WAV format at a 24 kHz sample rate.
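To confirm the sample rate and duration of an output file, Python's standard `wave` module is enough for a quick check. This is a sketch: `wav_info` is a hypothetical helper, and it assumes the file is PCM-encoded, since the `wave` module cannot read other WAV encodings.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


if __name__ == "__main__":
    # Example filename from this guide; replace it with one of your own outputs.
    rate, seconds = wav_info("outputs/generated_20251008_143022.wav")
    print(f"{rate} Hz, {seconds:.1f}s")
```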

You can:

Play in any audio player
Convert to MP3 with ffmpeg
Edit in audio software
Use in your projects
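For the MP3 conversion, the basic ffmpeg invocation is `ffmpeg -i input.wav output.mp3`. The sketch below wraps it in Python. `mp3_command` and `convert_to_mp3` are hypothetical helpers, and they assume ffmpeg is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def mp3_command(wav_path):
    """Build the ffmpeg argument list that converts a WAV file to an MP3 next to it."""
    src = Path(wav_path)
    dst = src.with_suffix(".mp3")
    return ["ffmpeg", "-i", str(src), str(dst)]


def convert_to_mp3(wav_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(mp3_command(wav_path), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so this is safe without ffmpeg.
    print(" ".join(mp3_command("outputs/generated_20251008_143022.wav")))
```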

Advanced Usage

Batch Generation

To generate multiple outputs:

Enter text
Generate
Change text
Generate again
Repeat

Each output is saved with a new timestamp.
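After a batch of generations, the timestamps make it easy to find the most recent file. This sketch assumes the `generated_YYYYMMDD_HHMMSS.wav` pattern, which is inferred from the example filenames in this guide. `output_timestamp` and `newest_output` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import Path


def output_timestamp(filename):
    """Parse the timestamp from a name like generated_20251008_143022.wav."""
    stem = Path(filename).stem
    return datetime.strptime(stem, "generated_%Y%m%d_%H%M%S")


def newest_output(filenames):
    """Return the filename with the latest embedded timestamp."""
    return max(filenames, key=output_timestamp)


if __name__ == "__main__":
    names = [p.name for p in Path("outputs").glob("generated_*.wav")]
    if names:
        print(newest_output(names))
```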

Different Voices

To compare voices:

Generate with voice A
Change voice dropdown to voice B
Generate same text
Compare outputs

Tuning CFG Scale

To find optimal CFG scale:

Generate at 1.0
Generate at 1.5
Generate at 2.0
Listen and pick the best
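If three settings are too coarse, you can step through the 1.0 to 2.0 range in smaller increments. The sketch below only computes the values to type into the CFG Scale control. `cfg_sweep` is a hypothetical helper, and it does not drive the UI.

```python
def cfg_sweep(low=1.0, high=2.0, steps=3):
    """Return `steps` evenly spaced CFG values from `low` to `high`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (high - low) / (steps - 1)
    return [round(low + i * step, 2) for i in range(steps)]


if __name__ == "__main__":
    print(cfg_sweep())         # the three settings listed above
    print(cfg_sweep(steps=5))  # a finer sweep over the same range
```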

Next Steps

Managing Voices - Add your own reference voices
Finetuning - Train custom voice models
FAQ - Common questions

Common Workflows

Using pretrained model:

Load model (no LoRA)
Select reference voice
Generate speech

Using finetuned model:

Load model with LoRA adapter
Optionally use voice cloning
Generate speech

Testing different voices:

Load model once
Change voice dropdown
Regenerate to compare

Experimenting with settings:

Keep same text
Adjust CFG scale
Toggle voice cloning
Compare results
