Quick Start
Let's generate your first speech sample with OpenVoiceLab.
What We'll Do
In this guide, you'll:
- Load a pretrained model
- Generate speech from text
- Listen to your first AI-generated audio
This takes about 5 minutes and doesn't require any training or custom data.
Step 1: Launch OpenVoiceLab
If you haven't already, start OpenVoiceLab:
# Linux/macOS
./scripts/run.sh
# Windows
scripts\run.bat
# Or manually
python -m ovl.cli
Open your browser to http://localhost:7860
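If the page does not load, a quick check from Python can confirm whether anything is answering on that port. This is a minimal sketch using only the standard library; the helper name `ui_is_up` is hypothetical and not part of OpenVoiceLab.

```python
import urllib.error
import urllib.request


def ui_is_up(url: str = "http://localhost:7860", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Any 2xx/3xx status means a server is listening and responding.
            return 200 <= resp.status < 400
    except urllib.error.HTTPError:
        # The server answered, just with an error status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: nothing reachable.
        return False
```

Calling `ui_is_up()` with no arguments checks the default address from this step. A `False` result usually means the launch script is still starting or has exited with an error in the terminal.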
Network Access
To access from other devices on your network, add --host 0.0.0.0 or --share:
./scripts/run.sh --host 0.0.0.0 # or run.bat --host 0.0.0.0 on Windows
./scripts/run.sh --share # or run.bat --share on Windows
Note: TensorBoard will not work from another device.
Step 2: Navigate to Inference Tab
Click on the Inference tab at the top. This is where you generate speech from text.
You should see:
- Model Settings (left side)
- Generate Speech controls (right side)
Step 3: Load the Model
The first time you use OpenVoiceLab, you need to load a model.
Configure Model Settings
In the Model Settings section:
Model Path: Should already say vibevoice/VibeVoice-1.5B
- This is the base model that will download from HuggingFace
- If you have lots of VRAM (24GB+), you can try vibevoice/VibeVoice-7B for better quality
Device: Should auto-select the best option
- cuda if you have an NVIDIA GPU
- mps if you have an Apple Silicon Mac
- cpu if neither (works but slower)
Leave Load LoRA Adapter unchecked for now
- We'll use this later when you finetune your own voice
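The Device preference order described above (cuda, then mps, then cpu) can be sketched as follows. This is an illustration only, not OpenVoiceLab's actual code: `pick_device` is a hypothetical name, and the availability flags are passed in so the example runs without PyTorch installed. With PyTorch available, the flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Pick a device name in the order cuda > mps > cpu (hypothetical helper)."""
    if has_cuda:
        return "cuda"  # NVIDIA GPU
    if has_mps:
        return "mps"   # Apple Silicon GPU
    return "cpu"       # fallback: works everywhere, but slower


# Example: a machine with no NVIDIA GPU but an Apple Silicon chip.
print(pick_device(has_cuda=False, has_mps=True))
```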
Load the Model
Click the Load Model button.
First-Time Download
The first time you load the model, it will download about 3-6 GB from HuggingFace. This can take several minutes depending on your internet speed. Subsequent loads will be much faster!
You should see a progress indicator. When done, the status will show:
✓ Model loaded on cuda
The button will change to Unload Model.
Step 4: Generate Your First Speech
Now the fun part - making the AI talk!
Enter Some Text
In the Text box, enter something like:
Hello! This is my first test with OpenVoiceLab.
I'm excited to explore voice synthesis and create custom voices.
Feel free to write anything you want - just keep it to a few sentences for this first test.
Configure Generation Settings
Enable Voice Cloning: Leave this checked
- This uses a reference voice to guide the generation
- Makes the output sound more natural
Voice: Select a voice from the dropdown
- These are reference voice samples included with OpenVoiceLab
- Try different voices to hear the variation!
CFG Scale: Leave at default (1.3)
- This controls how closely the model follows the reference voice
- Lower = more creative, Higher = more similar to reference
- Range: 1.0 - 2.0
Generate!
Click the Generate Speech button.
You'll see a progress bar as the model generates audio. This takes anywhere from a few seconds to about a minute, depending on:
- Text length (longer text = more time)
- Your hardware (GPU is much faster than CPU)
- Model size (7B is slower but higher quality)
Typical generation times:
- RTX 4090: 2-5 seconds for a sentence
- Apple M1/M2: 10-20 seconds
- CPU: 30-60 seconds
Listen to Your Audio
When generation completes, you'll see:
- Audio player with your generated speech
- Status showing generation stats:
✓ Generated successfully
Duration: 8.5s
Generation time: 3.2s
RTF: 0.38x
Click the play button to hear your AI-generated speech!
The audio is automatically saved to outputs/generated_TIMESTAMP.wav
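The saved file can also be inspected from a script. The sketch below assumes the files are standard PCM WAV (the format Python's built-in `wave` module reads) and that they follow the generated_*.wav naming in outputs/ mentioned above. The function names are hypothetical and not part of OpenVoiceLab.

```python
import wave
from pathlib import Path
from typing import Optional


def latest_wav(outputs_dir: str = "outputs") -> Optional[Path]:
    """Return the most recently modified generated_*.wav, or None if there is none."""
    files = list(Path(outputs_dir).glob("generated_*.wav"))
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


def wav_duration_seconds(path) -> float:
    """Duration in seconds = number of frames / sample rate (PCM WAV assumed)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()
```

If `wave.open` raises `wave.Error`, the file is probably not plain PCM, and an audio library with broader format support would be needed instead.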
Step 5: Experiment!
Now that you've generated your first audio, try experimenting:
Try Different Voices
Change the Voice dropdown and regenerate the same text. Notice how the speaking style changes!
Adjust CFG Scale
Try different CFG Scale values:
- 1.0: More creative, less similar to reference
- 1.5: More adherent to reference voice
- 2.0: Maximum similarity (may sound less natural)
Try Different Text
Generate different types of content:
Conversational:
Hey there! How's it going? I've been working on this really cool project lately.
Narrative:
Once upon a time, in a land far away, there lived a curious inventor
who dreamed of bringing voices to life.
Technical:
The neural network processes input tokens through multiple transformer layers,
generating acoustic features at a frame rate of 7.5 hertz.
Disable Voice Cloning (Optional)
Uncheck Enable Voice Cloning to hear the model's "natural" voice without reference guidance. This can be interesting to compare!
Understanding Generation Settings
What is Voice Cloning?
Voice cloning uses a reference audio sample (the "Voice" you select) to guide how the model generates speech. Think of it as giving the AI an example voice to imitate.
With voice cloning enabled:
- Output sounds similar to the reference voice
- More consistent style and tone
- Better for matching a specific speaker
With voice cloning disabled:
- Model uses its "default" learned voice
- More variation between generations
- Useful when you want generic neutral speech
What is CFG Scale?
CFG (Classifier-Free Guidance) Scale controls how strictly the model follows the reference voice.
- Lower values (1.0-1.2): More creative, natural-sounding, but less similar to reference
- Medium values (1.3-1.5): Good balance (recommended)
- Higher values (1.6-2.0): Very similar to reference, but may sound less natural
Start with 1.3 and adjust based on your preference.
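For intuition, the general form of classifier-free guidance combines two predictions, one made with the conditioning (here, the reference voice) and one made without it, and the scale controls how far the result is pushed toward the conditioned one. The sketch below shows that general formula on plain lists of numbers. It is an assumption that the model applies guidance in this standard form; the exact internals are not described in this guide, and `apply_cfg` is a hypothetical name.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard CFG mix: uncond + scale * (cond - uncond), element by element."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy prediction with the reference voice
uncond = [0.0, 1.0]  # toy prediction without it

# scale = 1.0 reproduces the conditioned prediction exactly.
print(apply_cfg(cond, uncond, 1.0))
# Larger scales extrapolate past it, away from the unconditioned prediction.
print(apply_cfg(cond, uncond, 2.0))
```

This is why higher values track the reference more closely but can sound less natural: the output is pushed beyond what the model would predict on its own.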
What is RTF?
RTF (Real-Time Factor) measures generation speed:
- RTF < 1.0: Faster than real-time (0.5x = generates 2x faster than audio playback)
- RTF = 1.0: Same speed as real-time
- RTF > 1.0: Slower than real-time (2.0x = takes 2 seconds to generate 1 second of audio)
Lower is better. RTF depends on your hardware.
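As a worked example, RTF is generation time divided by audio duration. The small sketch below reproduces the figure from the sample status in Step 4 (3.2s to generate 8.5s of audio); the function name is illustrative.

```python
def real_time_factor(generation_time_s: float, audio_duration_s: float) -> float:
    """RTF = time spent generating / length of the audio produced."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_time_s / audio_duration_s


rtf = real_time_factor(3.2, 8.5)
print(f"RTF: {rtf:.2f}x")  # prints "RTF: 0.38x"
```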
Troubleshooting Quick Start
Model won't load
Error: Out of memory
- Close other applications using GPU
- Try using CPU device instead (slower but works)
- Switch back to the 1.5B model if you tried the 7B variant
Error: Connection timeout
- Model is downloading from HuggingFace - wait a bit longer
- Check your internet connection
- Try again if download was interrupted
Generation fails
Error: Please load the model first
- Make sure you clicked "Load Model" and it shows as loaded
Error: Please enter some text
- The text box is empty - type something!
Error: Please select a voice
- Voice cloning is enabled but no voice is selected
- Select a voice from the dropdown or disable voice cloning
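The three generation errors above correspond to simple preconditions that can be checked before clicking Generate Speech. The sketch below is a hypothetical illustration of that logic, not OpenVoiceLab's code: the function name, its parameters, and the order of the checks are all assumptions. Only the message strings come from this guide.

```python
from typing import Optional


def preflight_error(
    model_loaded: bool,
    text: str,
    voice_cloning: bool,
    voice: Optional[str],
) -> Optional[str]:
    """Return the matching error message, or None if generation can proceed."""
    if not model_loaded:
        return "Please load the model first"
    if not text.strip():
        # Whitespace-only input counts as empty.
        return "Please enter some text"
    if voice_cloning and not voice:
        # A voice is only required when voice cloning is enabled.
        return "Please select a voice"
    return None
```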
Audio sounds weird
- Try a different reference voice
- Adjust CFG scale (try 1.2 or 1.5)
- Regenerate - there's some randomness involved
- Make sure your text has proper punctuation
What's Next?
Now that you've generated speech with a pretrained model, you're ready to explore more:
Create Your Own Voice
Follow the complete workflow to train a custom voice:
- Preparing Audio Data - Process your audio files
- Finetuning Your Voice - Train a custom voice model
- Generating Speech - Use your finetuned voice
Learn More
- Inference Guide - Deep dive into all generation options
- Managing Voices - Add your own reference voices
- FAQ - Common questions and answers
Tips for Best Results
✅ Good practices:
- Use proper punctuation for natural pausing
- Keep sentences to a reasonable length
- Use conversational language
- Include emotional cues in text
❌ Things to avoid:
- Very long run-on sentences
- Random capitalization or symbols
- Extremely technical jargon (model may not pronounce well)
- Non-English text (unless model was trained on it)
Need Help?
- Check the FAQ for common questions
- Read the Troubleshooting guide
- Join our Discord community
Happy voice generating! 🎙️