Generating Speech (Inference)
This guide covers using the Inference tab to generate speech from text.
Overview
The Inference tab lets you:
- Load the VibeVoice model (with or without LoRA adapters)
- Generate speech from text
- Use voice cloning with reference samples
- Adjust generation settings
Basic Workflow
- Load the model
- Enter text
- Select voice (optional)
- Adjust settings
- Generate
- Listen to output
Loading the Model
First Time Setup
- Go to the Inference tab
- Model Path: Use vibevoice/VibeVoice-1.5B (default), or vibevoice/VibeVoice-7B if you have 24GB+ VRAM
- Device: Auto-selected (cuda/mps/cpu)
- Leave Load LoRA Adapter unchecked (for now)
- Click Load Model
The first load downloads ~3-6 GB from Hugging Face. Subsequent loads reuse the downloaded files and are much faster.
When loaded, you'll see:
✓ Model loaded on cuda
Loading with LoRA (Finetuned Voice)
To use your trained voice:
- Check Load LoRA Adapter
- LoRA Path field appears
- Enter path to your training checkpoint:
training_runs/run_20251008_143022/checkpoints/
- Click Load Model
You'll see:
✓ Model loaded on cuda
✓ LoRA loaded from training_runs/run_20251008_143022/checkpoints/
Now the model will generate speech in your finetuned voice style.
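If you have several training runs, a small helper can find the most recent checkpoint folder to paste into the LoRA Path field. This is only a sketch: it assumes the layout implied by the example path (training_runs/run_<timestamp>/checkpoints/), and the name latest_checkpoint_dir is hypothetical, not part of OpenVoiceLab.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint_dir(root: str = "training_runs") -> Optional[Path]:
    """Return the checkpoints/ folder of the newest run, or None if none exists."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # Assumed naming: run_YYYYMMDD_HHMMSS, so sorting by name also sorts by time.
    runs = sorted(
        p for p in root_path.glob("run_*") if (p / "checkpoints").is_dir()
    )
    return runs[-1] / "checkpoints" if runs else None


if __name__ == "__main__":
    print(latest_checkpoint_dir())
```

Run folders without a checkpoints/ subfolder are skipped, so an interrupted run is not suggested by mistake.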
Unloading
Click Unload Model to free up VRAM. You'll need to load again before generating.
Generating Speech
Enter Text
In the Text box, enter what you want to synthesize:
Hello! This is a test of my custom voice.
Tips:
- Use proper punctuation for natural pausing
- Keep sentences to a reasonable length
- Add commas for pauses
- Use periods for stops
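The tips above can be applied automatically before pasting text into the Text box. The following is a minimal sketch, not something OpenVoiceLab does for you: tidy_text is a hypothetical helper, and the set of punctuation it keeps is an assumption you may want to widen.

```python
import re

# Characters to drop: anything that is not a letter, digit, whitespace,
# or common punctuation. The allowed set is an assumption.
_UNUSUAL = re.compile(r"""[^\w\s.,!?;:'"()\-]""")


def tidy_text(text: str) -> str:
    """Normalize text so punctuation can guide pausing."""
    text = _UNUSUAL.sub("", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove any space left dangling before punctuation.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # End with a full stop if there is no terminal punctuation.
    if text and text[-1] not in ".!?":
        text += "."
    return text


if __name__ == "__main__":
    print(tidy_text("Hello   world ~ this is\na test"))
```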
Voice Cloning Options
Enable Voice Cloning (checkbox)
When checked:
- Uses a reference voice to guide generation
- Makes output sound more like a specific speaker
- Select a voice from the dropdown
When unchecked:
- Uses the model's "default" learned voice
- Good for generic speech
- No reference needed
Voice (dropdown)
Select a reference voice sample. These are audio files in the voices/ folder.
Try different voices to hear style variations.
CFG Scale
Range: 1.0 - 2.0 (default: 1.3)
Controls how closely the model follows the reference voice:
- 1.0-1.2: More creative, natural, less similar to reference
- 1.3-1.5: Good balance (recommended)
- 1.6-2.0: Very similar to reference, may sound less natural
Start at 1.3 and adjust based on results.
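For intuition about why higher values follow the reference more closely: CFG commonly stands for classifier-free guidance, whose usual formulation extrapolates from an unconditioned prediction toward the conditioned one. The sketch below shows that general formula on plain lists of numbers. Whether VibeVoice applies it in exactly this form is an assumption, and apply_cfg is a hypothetical name.

```python
from typing import List


def apply_cfg(
    unconditional: List[float], conditional: List[float], scale: float
) -> List[float]:
    """Blend two predictions: guided = uncond + scale * (cond - uncond)."""
    if len(unconditional) != len(conditional):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(unconditional, conditional)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    print(apply_cfg(uncond, cond, 1.0))  # equals the conditioned prediction
    print(apply_cfg(uncond, cond, 2.0))  # pushed further past it
```

At a scale of 1.0 the result is just the conditioned prediction; larger values exaggerate the difference, which is one plausible reason very high settings can sound less natural.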
Generate
Click Generate Speech.
You'll see a progress bar. Generation time depends on:
- Text length
- Your hardware
- Model size (7B is slower than 1.5B)
Output
When complete:
- Audio player appears with your generated speech
- Status shows metrics:
✓ Generated successfully
Duration: 5.2s
Generation time: 2.1s
RTF: 0.40x
- Audio saved to outputs/generated_TIMESTAMP.wav
RTF (Real-Time Factor) is generation time divided by audio duration:
- < 1.0 = Faster than real-time
- = 1.0 = Same speed
- > 1.0 = Slower than real-time
Lower is faster.
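The RTF in the sample status is consistent with dividing generation time by audio duration. A short worked example using those numbers (real_time_factor is a hypothetical helper name):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Generation time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    # Values from the sample status: 2.1s to generate 5.2s of audio.
    rtf = real_time_factor(2.1, 5.2)
    print(f"RTF: {rtf:.2f}x")  # prints "RTF: 0.40x"
```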
Using Finetuned Models
After training your voice:
- Load base model: vibevoice/VibeVoice-1.5B
- Check Load LoRA Adapter
- Enter checkpoint path
- Generate as normal
The model combines:
- Base VibeVoice knowledge (how to speak)
- Your LoRA adapter (your voice style)
With vs Without Voice Cloning
With voice cloning enabled:
- More control over speaking style
- Can match specific reference samples
- Good for mimicking a particular recording
With voice cloning disabled:
- Uses purely the finetuned adapter
- Model's learned voice without guidance
- Good for seeing what the model learned
Try both to see which works better for your use case.
Generation Settings Explained
Voice Selection
Reference voices are stored in the voices/ folder. The model uses these to understand speaking style.
You can add your own reference voices - see the voices guide.
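To see which files are likely to show up in the Voice dropdown, you can list the audio files in voices/. This is a sketch under assumptions: the accepted file extensions are a guess, and list_reference_voices is a hypothetical helper rather than an OpenVoiceLab function.

```python
from pathlib import Path
from typing import List

# Assumed set of audio extensions; adjust to what your install accepts.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac"}


def list_reference_voices(folder: str = "voices") -> List[str]:
    """Return the audio file names found directly inside the folder, sorted."""
    folder_path = Path(folder)
    if not folder_path.is_dir():
        return []
    return sorted(
        p.name
        for p in folder_path.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


if __name__ == "__main__":
    for name in list_reference_voices():
        print(name)
```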
Device Selection
- cuda: NVIDIA GPUs (fastest)
- mps: Apple Silicon (fast)
- cpu: Works anywhere (slowest)
The device is auto-selected on load. It can't be changed without reloading the model.
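Auto-selection of this kind is typically a simple priority check. The sketch below illustrates the idea and is not OpenVoiceLab's actual code: the cuda, then mps, then cpu order is an assumption based on the speed ranking above, and both function names are hypothetical. PyTorch is imported only if it is installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer cuda, then mps, then fall back to cpu."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; use cpu if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps may be absent on older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```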
Tips for Better Results
Text formatting:
- Use natural punctuation
- Avoid very long run-on sentences
- Include appropriate capitalization
- Don't use unusual symbols
Voice cloning:
- Try different reference voices
- Adjust CFG scale if output doesn't match expectations
- Some voices work better than others
Finetuned models:
- Make sure you trained on good quality data
- If results are poor, retrain with more epochs or better data
- Voice cloning can help or hurt depending on the reference
Experimentation:
- Generate the same text multiple times
- Try different CFG scales
- Test with and without voice cloning
- Compare different reference voices
Troubleshooting
Model won't load
Out of memory:
- Close other GPU applications
- Try CPU device (slower)
- Use 1.5B instead of 7B
Connection timeout:
- Model is downloading - wait longer
- Check internet connection
Generation fails
"Please load the model first":
- Click Load Model before generating
"Please select a voice":
- Either select a voice or disable voice cloning
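Both messages come from checks made before generation starts. The logic can be pictured roughly as follows; this is an illustrative sketch with a hypothetical function name, not the application's source.

```python
from typing import Optional


def check_request(
    model_loaded: bool, voice_cloning: bool, voice: Optional[str]
) -> Optional[str]:
    """Return the error message to show, or None if generation can proceed."""
    if not model_loaded:
        return "Please load the model first"
    # A reference voice is only required when voice cloning is enabled.
    if voice_cloning and not voice:
        return "Please select a voice"
    return None


if __name__ == "__main__":
    print(check_request(model_loaded=True, voice_cloning=True, voice=None))
```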
Out of memory during generation:
- Shorten your text
- Close other GPU applications
- Restart OpenVoiceLab
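One way to shorten the text without losing content is to split it at sentence boundaries and generate each piece separately. A minimal sketch follows; split_into_chunks and the max_chars limit are hypothetical, and the right limit for your hardware is something to find by trial.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two. Three.", max_chars=9):
        print(chunk)
```

A single sentence longer than max_chars is kept whole rather than cut mid-sentence. Each chunk produces its own output file, so the pieces need to be joined afterwards in an audio editor.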
Audio sounds wrong
Robotic or unnatural:
- Try lower CFG scale (1.0-1.2)
- Use different reference voice
- If using a finetuned model, it may be overtrained
Doesn't match expected voice:
- Try higher CFG scale (1.5-1.8)
- Use different reference voice
- Check that LoRA loaded correctly
Background music or noise:
- This is a known VibeVoice behavior
- Try different text or reference voice
- Rerun generation (results vary)
Generation is slow
- This is normal for large models
- GPU is much faster than CPU
- 7B model is slower than 1.5B
- Shorter text generates faster
Output Files
Generated audio is saved to outputs/ with timestamps:
outputs/
├── generated_20251008_143022.wav
├── generated_20251008_143045.wav
└── generated_20251008_143108.wav
Files are WAV format at 24kHz sample rate.
You can:
- Play in any audio player
- Convert to MP3 with ffmpeg
- Edit in audio software
- Use in your projects
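For scripted post-processing, the sketch below checks a file's sample rate and duration with Python's standard wave module and builds a basic ffmpeg command for MP3 conversion. It assumes the output is PCM-encoded WAV (the wave module cannot read other encodings) and that ffmpeg is on your PATH; both helper names are hypothetical.

```python
import wave
from pathlib import Path
from typing import List


def describe_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / rate,
        }


def ffmpeg_mp3_command(wav_path: str) -> List[str]:
    """Build an ffmpeg command that writes an .mp3 next to the input file."""
    mp3_path = str(Path(wav_path).with_suffix(".mp3"))
    # ffmpeg picks the MP3 encoder from the output file extension.
    return ["ffmpeg", "-i", wav_path, mp3_path]


if __name__ == "__main__":
    import subprocess
    import sys

    source = sys.argv[1]
    print(describe_wav(source))
    subprocess.run(ffmpeg_mp3_command(source), check=True)
```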
Advanced Usage
Batch Generation
To generate multiple outputs:
- Enter text
- Generate
- Change text
- Generate again
- Repeat
Each output is saved with a new timestamp.
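Because the timestamp is part of the file name, outputs can be ordered or matched to a session without opening them. A small sketch, assuming the generated_YYYYMMDD_HHMMSS.wav pattern shown in this guide; the helper names are hypothetical.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional


def output_timestamp(path: str) -> datetime:
    """Parse the timestamp out of a generated_YYYYMMDD_HHMMSS.wav file name."""
    stem = Path(path).stem  # e.g. "generated_20251008_143022"
    return datetime.strptime(stem, "generated_%Y%m%d_%H%M%S")


def newest_output(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the latest timestamp, or None for an empty input."""
    paths = list(paths)
    return max(paths, key=output_timestamp) if paths else None


if __name__ == "__main__":
    files = [str(p) for p in Path("outputs").glob("generated_*.wav")]
    print(newest_output(files))
```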
Different Voices
To compare voices:
- Generate with voice A
- Change voice dropdown to voice B
- Generate same text
- Compare outputs
Tuning CFG Scale
To find optimal CFG scale:
- Generate at 1.0
- Generate at 1.5
- Generate at 2.0
- Listen and pick best
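If three points are too coarse, the same sweep can use a finer, evenly spaced grid within the 1.0 to 2.0 range. A tiny sketch for producing the values to try; cfg_sweep is a hypothetical helper, and each value still has to be entered in the UI by hand.

```python
from typing import List


def cfg_sweep(low: float = 1.0, high: float = 2.0, steps: int = 3) -> List[float]:
    """Return evenly spaced CFG values from low to high, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (high - low) / (steps - 1)
    return [round(low + i * step, 2) for i in range(steps)]


if __name__ == "__main__":
    print(cfg_sweep())         # the three values suggested above
    print(cfg_sweep(steps=5))  # a finer grid
```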
Next Steps
- Managing Voices - Add your own reference voices
- Finetuning - Train custom voice models
- FAQ - Common questions
Common Workflows
Using pretrained model:
- Load model (no LoRA)
- Select reference voice
- Generate speech
Using finetuned model:
- Load model with LoRA adapter
- Optionally use voice cloning
- Generate speech
Testing different voices:
- Load model once
- Change voice dropdown
- Regenerate to compare
Experimenting with settings:
- Keep same text
- Adjust CFG scale
- Toggle voice cloning
- Compare results