Generating Speech (Inference)

This guide covers using the Inference tab to generate speech from text.

Overview

The Inference tab lets you:

  • Load the VibeVoice model (with or without LoRA adapters)
  • Generate speech from text
  • Use voice cloning with reference samples
  • Adjust generation settings

Basic Workflow

  1. Load the model
  2. Enter text
  3. Select voice (optional)
  4. Adjust settings
  5. Generate
  6. Listen to output

Loading the Model

First Time Setup

  1. Go to the Inference tab
  2. Model Path: Use vibevoice/VibeVoice-1.5B (default)
    • Or vibevoice/VibeVoice-7B if you have 24 GB or more of VRAM
  3. Device: Auto-selected (cuda/mps/cpu)
  4. Leave Load LoRA Adapter unchecked (for now)
  5. Click Load Model

The first load downloads roughly 3-6 GB from Hugging Face. Later loads reuse the downloaded files and are much faster.

When loaded, you'll see:

✓ Model loaded on cuda

Loading with LoRA (Finetuned Voice)

To use your trained voice:

  1. Check Load LoRA Adapter
  2. The LoRA Path field appears
  3. Enter path to your training checkpoint:
    training_runs/run_20251008_143022/checkpoints/
  4. Click Load Model

You'll see:

✓ Model loaded on cuda
✓ LoRA loaded from training_runs/run_20251008_143022/checkpoints/

Now the model will generate speech in your finetuned voice style.
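
If you have several training runs, a small helper can find the newest checkpoint folder to paste into the LoRA Path field. This is only a sketch: it assumes every run follows the training_runs/run_YYYYMMDD_HHMMSS/checkpoints/ layout shown above, and the function name latest_checkpoint_dir is hypothetical, not part of OpenVoiceLab.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint_dir(runs_root: str = "training_runs") -> Optional[Path]:
    """Return the checkpoints/ folder of the newest run, or None if there is none.

    Assumes run folders are named run_YYYYMMDD_HHMMSS, so sorting the names
    alphabetically also sorts them chronologically.
    """
    root = Path(runs_root)
    if not root.is_dir():
        return None
    # Keep only run folders that actually contain a checkpoints/ subfolder.
    runs = sorted(p for p in root.glob("run_*") if (p / "checkpoints").is_dir())
    return runs[-1] / "checkpoints" if runs else None


print(latest_checkpoint_dir())
```

The printed path (or None, if no run is found) is what you would enter in step 3.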

Unloading

Click Unload Model to free up VRAM. You'll need to load again before generating.

Generating Speech

Enter Text

In the Text box, enter what you want to synthesize:

Hello! This is a test of my custom voice.

Tips:

  • Use proper punctuation for natural pausing
  • Keep sentences to a reasonable length
  • Add commas for pauses
  • Use periods for stops

Voice Cloning Options

Enable Voice Cloning (checkbox)

When checked:

  • Uses a reference voice to guide generation
  • Makes output sound more like a specific speaker
  • Select a voice from the dropdown

When unchecked:

  • Uses the model's "default" learned voice
  • Good for generic speech
  • No reference needed

Voice (dropdown)

Select a reference voice sample. These are audio files in the voices/ folder.

Try different voices to hear style variations.

CFG Scale

Range: 1.0 - 2.0 (default: 1.3)

Controls how closely the model follows the reference voice:

  • 1.0-1.2: More creative, natural, less similar to reference
  • 1.3-1.5: Good balance (recommended)
  • 1.6-2.0: Very similar to reference, may sound less natural

Start at 1.3 and adjust based on results.

Generate

Click Generate Speech.

You'll see a progress bar. Generation time depends on:

  • Text length
  • Your hardware
  • Model size (7B is slower than 1.5B)

Output

When complete:

  • Audio player appears with your generated speech
  • Status shows metrics:
    ✓ Generated successfully
    Duration: 5.2s
    Generation time: 2.1s
    RTF: 0.40x
  • Audio saved to outputs/generated_TIMESTAMP.wav

RTF (Real-Time Factor):

  • < 1.0 = Faster than real-time
  • = 1.0 = Same speed
  • > 1.0 = Slower than real-time

Lower is faster.
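
RTF is conventionally the generation time divided by the duration of the audio produced, which matches the example status above (2.1 s to generate 5.2 s of audio). A minimal sketch of that arithmetic follows; the function name real_time_factor is illustrative and not taken from OpenVoiceLab.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Values from the example status output above.
rtf = real_time_factor(generation_seconds=2.1, audio_seconds=5.2)
print(f"RTF: {rtf:.2f}x")  # RTF: 0.40x
```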

Using Finetuned Models

After training your voice:

  1. Load base model: vibevoice/VibeVoice-1.5B
  2. Check Load LoRA Adapter
  3. Enter checkpoint path
  4. Generate as normal

The model combines:

  • Base VibeVoice knowledge (how to speak)
  • Your LoRA adapter (your voice style)

With vs Without Voice Cloning

With voice cloning enabled:

  • More control over speaking style
  • Can match specific reference samples
  • Good for mimicking a particular recording

With voice cloning disabled:

  • Uses purely the finetuned adapter
  • Model's learned voice without guidance
  • Good for seeing what the model learned

Try both to see which works better for your use case.

Generation Settings Explained

Voice Selection

Reference voices are stored in the voices/ folder. The model uses these to understand speaking style.

You can add your own reference voices - see the voices guide.

Device Selection

  • cuda: NVIDIA GPUs (fastest)
  • mps: Apple Silicon (fast)
  • cpu: Works anywhere (slowest)

The device is auto-selected when the model loads. To change it, you must reload the model.
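
This guide does not show how OpenVoiceLab picks the device internally. The sketch below shows how a cuda, then mps, then cpu preference is commonly checked with PyTorch, which can help you predict what will be selected on your machine. The function name pick_device is hypothetical, and the fallback to cpu when PyTorch is missing is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, report cpu.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```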

Tips for Better Results

Text formatting:

  • Use natural punctuation
  • Avoid very long run-on sentences
  • Include appropriate capitalization
  • Don't use unusual symbols

Voice cloning:

  • Try different reference voices
  • Adjust CFG scale if output doesn't match expectations
  • Some voices work better than others

Finetuned models:

  • Make sure you trained on good quality data
  • If results are poor, retrain with more epochs or better data
  • Voice cloning can help or hurt depending on the reference

Experimentation:

  • Generate the same text multiple times
  • Try different CFG scales
  • Test with and without voice cloning
  • Compare different reference voices

Troubleshooting

Model won't load

Out of memory:

  • Close other GPU applications
  • Try CPU device (slower)
  • Use 1.5B instead of 7B

Connection timeout:

  • The model may still be downloading; wait longer
  • Check internet connection

Generation fails

"Please load the model first":

  • Click Load Model before generating

"Please select a voice":

  • Either select a voice or disable voice cloning

Out of memory during generation:

  • Shorten your text
  • Close other GPU applications
  • Restart OpenVoiceLab

Audio sounds wrong

Robotic or unnatural:

  • Try a lower CFG scale (1.0-1.2)
  • Use a different reference voice
  • If you are using a finetuned model, it may be overtrained

Doesn't match expected voice:

  • Try a higher CFG scale (1.5-1.8)
  • Use a different reference voice
  • Check that the LoRA adapter loaded correctly

Background music or noise:

  • This is a known VibeVoice behavior
  • Try different text or reference voice
  • Rerun generation (results vary)

Generation is slow

  • This is normal for large models
  • GPU is much faster than CPU
  • 7B model is slower than 1.5B
  • Shorter text generates faster

Output Files

Generated audio is saved to outputs/ with timestamps:

outputs/
├── generated_20251008_143022.wav
├── generated_20251008_143045.wav
└── generated_20251008_143108.wav

Files are in WAV format at a 24 kHz sample rate.
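
To confirm the sample rate and duration of a generated file, Python's standard wave module is enough. This is a sketch that assumes the file is PCM-encoded; the wave module may reject other encodings such as floating-point WAV. The helper name describe_wav is illustrative.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            # Duration is the frame count divided by frames per second.
            "duration_seconds": frames / rate,
        }
```

Calling describe_wav("outputs/generated_20251008_143022.wav") on one of the files above should report a sample_rate of 24000 if the file matches the format described here.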

You can:

  • Play in any audio player
  • Convert to MP3 with ffmpeg
  • Edit in audio software
  • Use in your projects
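
For the MP3 conversion, one option is to call ffmpeg from Python. The sketch below assumes ffmpeg is installed and on your PATH with the libmp3lame encoder available; the function names and the quality setting (-qscale:a 2) are illustrative choices, not something OpenVoiceLab prescribes.

```python
import subprocess
from pathlib import Path


def mp3_command(wav_path: str) -> list[str]:
    """Build an ffmpeg command that writes an MP3 next to the WAV file."""
    src = Path(wav_path)
    dst = src.with_suffix(".mp3")
    return ["ffmpeg", "-i", str(src), "-codec:a", "libmp3lame",
            "-qscale:a", "2", str(dst)]


def convert_to_mp3(wav_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(mp3_command(wav_path), check=True)
```

For example, convert_to_mp3("outputs/generated_20251008_143022.wav") would write generated_20251008_143022.mp3 into the same folder.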

Advanced Usage

Batch Generation

To generate multiple outputs:

  1. Enter text
  2. Generate
  3. Change the text
  4. Generate again
  5. Repeat as needed

Each output is saved with a new timestamp.

Different Voices

To compare voices:

  1. Generate with voice A
  2. Change voice dropdown to voice B
  3. Generate same text
  4. Compare outputs

Tuning CFG Scale

To find optimal CFG scale:

  1. Generate at 1.0
  2. Generate at 1.5
  3. Generate at 2.0
  4. Listen and pick the best result

Common Workflows

Using pretrained model:

  1. Load model (no LoRA)
  2. Select reference voice
  3. Generate speech

Using finetuned model:

  1. Load model with LoRA adapter
  2. Optionally use voice cloning
  3. Generate speech

Testing different voices:

  1. Load model once
  2. Change voice dropdown
  3. Regenerate to compare

Experimenting with settings:

  1. Keep same text
  2. Adjust CFG scale
  3. Toggle voice cloning
  4. Compare results

Released under the BSD-3-Clause License.