Quick Start

Let's generate your first speech sample with OpenVoiceLab.

What We'll Do

In this guide, you'll:

  1. Load a pretrained model
  2. Generate speech from text
  3. Listen to your first AI-generated audio

This takes about 5 minutes and doesn't require any training or custom data.

Step 1: Launch OpenVoiceLab

If you haven't already, start OpenVoiceLab:

bash
# Linux/macOS
./scripts/run.sh

# Windows
scripts\run.bat

# Or manually
python -m ovl.cli

Open your browser to http://localhost:7860

Network Access

To access from other devices on your network, add --host 0.0.0.0 or --share:

bash
./scripts/run.sh --host 0.0.0.0  # or run.bat --host 0.0.0.0 on Windows
./scripts/run.sh --share          # or run.bat --share on Windows

Note: TensorBoard will not work from another device.

Step 2: Navigate to Inference Tab

Click on the Inference tab at the top. This is where you generate speech from text.

You should see:

  • Model Settings (left side)
  • Generate Speech controls (right side)

Step 3: Load the Model

The first time you use OpenVoiceLab, you need to load a model.

Configure Model Settings

In the Model Settings section:

  1. Model Path: Should already say vibevoice/VibeVoice-1.5B

    • This is the base model; it is downloaded from HuggingFace the first time you load it
    • If you have lots of VRAM (24GB+), you can try vibevoice/VibeVoice-7B for better quality
  2. Device: Should auto-select the best option (see the quick check after this list)

    • cuda if you have an NVIDIA GPU
    • mps if you have an Apple Silicon Mac
    • cpu if neither (works but slower)
  3. Leave Load LoRA Adapter unchecked for now

    • We'll use this later when you finetune your own voice
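
If you're not sure which device will be selected, you can check what PyTorch sees from a Python shell in the same environment. This is a quick diagnostic sketch, assuming PyTorch is installed there (the model backends rely on it); it is not part of the OpenVoiceLab UI:

python
import torch

# True if an NVIDIA GPU with working CUDA drivers is visible
print("CUDA available:", torch.cuda.is_available())

# True on Apple Silicon Macs with Metal (MPS) support
print("MPS available:", torch.backends.mps.is_available())

If both are False, OpenVoiceLab falls back to cpu, which works but is noticeably slower.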

Load the Model

Click the Load Model button.

First-Time Download

The first time you load the model, it will download about 3-6 GB from HuggingFace. This can take several minutes depending on your internet speed. Subsequent loads will be much faster!
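
If you would rather fetch the weights before opening the UI (for example, on a slow connection), a minimal sketch using the huggingface_hub library can pre-fill the local cache. The repo id below is the path shown in Model Settings; this step is optional and not required for the normal workflow:

python
from huggingface_hub import snapshot_download

# Pre-download the base model into the local HuggingFace cache so the
# first "Load Model" click doesn't have to wait for the download.
snapshot_download(repo_id="vibevoice/VibeVoice-1.5B")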

You should see a progress indicator. When done, the status will show:

✓ Model loaded on cuda

The button will change to Unload Model.

Step 4: Generate Your First Speech

Now the fun part - making the AI talk!

Enter Some Text

In the Text box, enter something like:

Hello! This is my first test with OpenVoiceLab.
I'm excited to explore voice synthesis and create custom voices.

Feel free to write anything you want - keep it under a few sentences for this first test.

Configure Generation Settings

  1. Enable Voice Cloning: Leave this checked

    • This uses a reference voice to guide the generation
    • Makes the output sound more natural
  2. Voice: Select a voice from the dropdown

    • These are reference voice samples included with OpenVoiceLab
    • Try different voices to hear the variation!
  3. CFG Scale: Leave at default (1.3)

    • This controls how closely the model follows the reference voice
    • Lower = more creative, Higher = more similar to reference
    • Range: 1.0 - 2.0

Generate!

Click the Generate Speech button.

You'll see a progress bar as the model generates audio. How long this takes depends on:

  • Text length (longer text = more time)
  • Your hardware (GPU is much faster than CPU)
  • Model size (7B is slower but higher quality)

Typical generation times:

  • RTX 4090: 2-5 seconds for a sentence
  • Apple M1/M2: 10-20 seconds
  • CPU: 30-60 seconds

Listen to Your Audio

When generation completes, you'll see:

  • Audio player with your generated speech
  • Status showing generation stats:
    ✓ Generated successfully
    Duration: 8.5s
    Generation time: 3.2s
    RTF: 0.38x

Click the play button to hear your AI-generated speech!

The audio is automatically saved to outputs/generated_TIMESTAMP.wav
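
You can also inspect the saved files outside the UI. Here is a small sketch using the soundfile library (an assumption; any audio library works) that reads a generated WAV and reports its duration. The file name is hypothetical, so substitute a real file from your outputs/ folder:

python
import soundfile as sf

# Hypothetical file name: use an actual file from your outputs/ folder
audio, sample_rate = sf.read("outputs/generated_20250101_120000.wav")

duration = len(audio) / sample_rate
print(f"{duration:.1f} s of audio at {sample_rate} Hz")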

Step 5: Experiment!

Now that you've generated your first audio, try experimenting:

Try Different Voices

Change the Voice dropdown and regenerate the same text. Notice how the speaking style changes!

Adjust CFG Scale

Try different CFG Scale values:

  • 1.0: More creative, less similar to reference
  • 1.5: More adherent to reference voice
  • 2.0: Maximum similarity (may sound less natural)

Try Different Text

Generate different types of content:

Conversational:

Hey there! How's it going? I've been working on this really cool project lately.

Narrative:

Once upon a time, in a land far away, there lived a curious inventor
who dreamed of bringing voices to life.

Technical:

The neural network processes input tokens through multiple transformer layers,
generating acoustic features at a frame rate of 7.5 hertz.

Disable Voice Cloning (Optional)

Uncheck Enable Voice Cloning to hear the model's "natural" voice without reference guidance. This can be interesting to compare!

Understanding Generation Settings

What is Voice Cloning?

Voice cloning uses a reference audio sample (the "Voice" you select) to guide how the model generates speech. Think of it as giving the AI an example voice to imitate.

With voice cloning enabled:

  • Output sounds similar to the reference voice
  • More consistent style and tone
  • Better for matching a specific speaker

With voice cloning disabled:

  • Model uses its "default" learned voice
  • More variation between generations
  • Useful when you want generic neutral speech

What is CFG Scale?

CFG (Classifier-Free Guidance) Scale controls how strictly the model follows the reference voice.

  • Lower values (1.0-1.2): More creative, natural-sounding, but less similar to reference
  • Medium values (1.3-1.5): Good balance (recommended)
  • Higher values (1.6-2.0): Very similar to reference, but may sound less natural

Start with 1.3 and adjust based on your preference.
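
As a rough mental model (this is the generic classifier-free guidance formula, not OpenVoiceLab's internal code), the model makes one prediction conditioned on the reference voice and one without it, then pushes the result along the direction of the conditioned prediction. The values below are made up purely for illustration:

python
import numpy as np

def apply_cfg(uncond, cond, cfg_scale):
    # Standard classifier-free guidance: at scale 1.0 this returns the
    # conditioned prediction; larger scales extrapolate further toward it.
    return uncond + cfg_scale * (cond - uncond)

# Hypothetical model outputs for a single generation step
uncond = np.array([0.2, 0.5, 0.1])  # prediction without the reference voice
cond = np.array([0.4, 0.3, 0.6])    # prediction with the reference voice

print(apply_cfg(uncond, cond, 1.0))  # equals cond
print(apply_cfg(uncond, cond, 1.8))  # pushed harder toward the reference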

What is RTF?

RTF (Real-Time Factor) measures generation speed:

  • RTF < 1.0: Faster than real-time (0.5x = generates 2x faster than audio playback)
  • RTF = 1.0: Same speed as real-time
  • RTF > 1.0: Slower than real-time (2.0x = takes 2 seconds to generate 1 second of audio)

Lower is better. RTF depends on your hardware.
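
In other words, RTF is just generation time divided by the duration of the audio produced. Using the example stats from Step 4:

python
# Numbers taken from the example status output earlier in this guide
generation_time = 3.2  # seconds spent generating
audio_duration = 8.5   # seconds of audio produced

rtf = generation_time / audio_duration
print(f"RTF: {rtf:.2f}x")  # ~0.38x, i.e. faster than real-time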

Troubleshooting Quick Start

Model won't load

Error: Out of memory

  • Close other applications using GPU
  • Try using CPU device instead (slower but works)
  • Use a smaller model if you tried the 7B variant

Error: Connection timeout

  • Model is downloading from HuggingFace - wait a bit longer
  • Check your internet connection
  • Try again if download was interrupted

Generation fails

Error: Please load the model first

  • Make sure you clicked "Load Model" and it shows as loaded

Error: Please enter some text

  • The text box is empty - type something!

Error: Please select a voice

  • Voice cloning is enabled but no voice selected
  • Select a voice from dropdown or disable voice cloning

Audio sounds weird

  • Try a different reference voice
  • Adjust CFG scale (try 1.2 or 1.5)
  • Regenerate - there's some randomness involved
  • Make sure your text has proper punctuation

What's Next?

Now that you've generated speech with a pretrained model, you're ready to explore more:

Create Your Own Voice

Follow the complete workflow to train a custom voice:

  1. Preparing Audio Data - Process your audio files
  2. Finetuning Your Voice - Train a custom voice model
  3. Generating Speech - Use your finetuned voice

Learn More

Tips for Best Results

Good practices:

  • Use proper punctuation for natural pausing
  • Keep sentences reasonable length
  • Use conversational language
  • Include emotional cues in text

Things to avoid:

  • Very long run-on sentences
  • Random capitalization or symbols
  • Extremely technical jargon (model may not pronounce well)
  • Non-English text (unless model was trained on it)

Happy voice generating! 🎙️
