Quick Start

Let's generate your first speech sample with OpenVoiceLab.

What We'll Do

In this guide, you'll:

  1. Load a pretrained model
  2. Generate speech from text
  3. Listen to your first AI-generated audio

This takes about 5 minutes and doesn't require any training or custom data.

Step 1: Launch OpenVoiceLab

If you haven't already, start OpenVoiceLab:

bash
# Linux/macOS
./scripts/run.sh

# Windows
scripts\run.bat

# Or manually
python -m ovl.cli

Open your browser to http://localhost:7860

Network Access

To access from other devices on your network, add --host 0.0.0.0 or --share:

bash
./scripts/run.sh --host 0.0.0.0  # or run.bat --host 0.0.0.0 on Windows
./scripts/run.sh --share          # or run.bat --share on Windows

Note: TensorBoard will not work from another device.

Step 2: Navigate to Inference Tab

Click on the Inference tab at the top. This is where you generate speech from text.

You should see:

  • Model Settings (left side)
  • Generate Speech controls (right side)

Step 3: Load the Model

The first time you use OpenVoiceLab, you need to load a model.

Configure Model Settings

In the Model Settings section:

  1. Model Path: Should already say vibevoice/VibeVoice-1.5B

    • This is the base model; it is downloaded from HuggingFace the first time you load it
    • If you have lots of VRAM (24GB+), you can try vibevoice/VibeVoice-7B for better quality
  2. Device: Should auto-select the best option (see the quick check after this list)

    • cuda if you have an NVIDIA GPU
    • mps if you have an Apple Silicon Mac
    • cpu if neither (works but slower)
  3. Leave Load LoRA Adapter unchecked for now

    • We'll use this later when you finetune your own voice
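
If you're not sure which device will be selected, you can check what PyTorch sees from a Python shell in the same environment. This is a quick diagnostic sketch, assuming PyTorch is installed there (the model backends rely on it); it is not part of the OpenVoiceLab UI:

python
import torch

# True if an NVIDIA GPU with working CUDA drivers is visible
print("CUDA available:", torch.cuda.is_available())

# True on Apple Silicon Macs with Metal (MPS) support
print("MPS available:", torch.backends.mps.is_available())

If both are False, OpenVoiceLab falls back to cpu, which works but is noticeably slower.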

Load the Model

Click the Load Model button.

First-Time Download

The first time you load the model, it will download about 3-6 GB from HuggingFace. This can take several minutes depending on your internet speed. Subsequent loads will be much faster!
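
If you would rather fetch the weights before opening the UI (for example, on a slow connection), a minimal sketch using the huggingface_hub library can pre-fill the local cache. The repo id below is the path shown in Model Settings; this step is optional and not required for the normal workflow:

python
from huggingface_hub import snapshot_download

# Pre-download the base model into the local HuggingFace cache so the
# first "Load Model" click doesn't have to wait for the download.
snapshot_download(repo_id="vibevoice/VibeVoice-1.5B")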

You should see a progress indicator. When done, the status will show:

✓ Model loaded on cuda

The button will change to Unload Model.

Step 4: Generate Your First Speech

Now the fun part - making the AI talk!

Enter Some Text

In the Text box, enter something like:

Hello! This is my first test with OpenVoiceLab.
I'm excited to explore voice synthesis and create custom voices.

Feel free to write anything you want - keep it under a few sentences for this first test.

Configure Generation Settings

  1. Enable Voice Cloning: Leave this checked

    • This uses a reference voice to guide the generation
    • Makes the output sound more natural
  2. Voice: Select a voice from the dropdown

    • These are reference voice samples included with OpenVoiceLab
    • Try different voices to hear the variation!
  3. CFG Scale: Leave at default (1.3)

    • This controls how closely the model follows the reference voice
    • Lower = more creative, Higher = more similar to reference
    • Range: 1.0 - 2.0

Generate!

Click the Generate Speech button.

You'll see a progress bar as the model generates audio. How long this takes depends on:

  • Text length (longer text = more time)
  • Your hardware (GPU is much faster than CPU)
  • Model size (7B is slower but higher quality)

Typical generation times:

  • RTX 4090: 2-5 seconds for a sentence
  • Apple M1/M2: 10-20 seconds
  • CPU: 30-60 seconds

Listen to Your Audio

When generation completes, you'll see:

  • Audio player with your generated speech
  • Status showing generation stats:
    ✓ Generated successfully
    Duration: 8.5s
    Generation time: 3.2s
    RTF: 0.38x

Click the play button to hear your AI-generated speech!

The audio is automatically saved to outputs/generated_TIMESTAMP.wav
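
You can also inspect the saved files outside the UI. Here is a small sketch using the soundfile library (an assumption; any audio library works) that reads a generated WAV and reports its duration. The file name is hypothetical, so substitute a real file from your outputs/ folder:

python
import soundfile as sf

# Hypothetical file name: use an actual file from your outputs/ folder
audio, sample_rate = sf.read("outputs/generated_20250101_120000.wav")

duration = len(audio) / sample_rate
print(f"{duration:.1f} s of audio at {sample_rate} Hz")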

Step 5: Experiment!

Now that you've generated your first audio, try experimenting:

Try Different Voices

Change the Voice dropdown and regenerate the same text. Notice how the speaking style changes!

Adjust CFG Scale

Try different CFG Scale values:

  • 1.0: More creative, less similar to reference
  • 1.5: More adherent to reference voice
  • 2.0: Maximum similarity (may sound less natural)

Try Different Text

Generate different types of content:

Conversational:

Hey there! How's it going? I've been working on this really cool project lately.

Narrative:

Once upon a time, in a land far away, there lived a curious inventor
who dreamed of bringing voices to life.

Technical:

The neural network processes input tokens through multiple transformer layers,
generating acoustic features at a frame rate of 7.5 hertz.

Disable Voice Cloning (Optional)

Uncheck Enable Voice Cloning to hear the model's "natural" voice without reference guidance. This can be interesting to compare!

Understanding Generation Settings

What is Voice Cloning?

Voice cloning uses a reference audio sample (the "Voice" you select) to guide how the model generates speech. Think of it as giving the AI an example voice to imitate.

With voice cloning enabled:

  • Output sounds similar to the reference voice
  • More consistent style and tone
  • Better for matching a specific speaker

With voice cloning disabled:

  • Model uses its "default" learned voice
  • More variation between generations
  • Useful when you want generic neutral speech

What is CFG Scale?

CFG (Classifier-Free Guidance) Scale controls how strictly the model follows the reference voice.

  • Lower values (1.0-1.2): More creative, natural-sounding, but less similar to reference
  • Medium values (1.3-1.5): Good balance (recommended)
  • Higher values (1.6-2.0): Very similar to reference, but may sound less natural

Start with 1.3 and adjust based on your preference.
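
As a rough mental model (this is the generic classifier-free guidance formula, not OpenVoiceLab's internal code), the model makes one prediction conditioned on the reference voice and one without it, then pushes the result along the direction of the conditioned prediction. The values below are made up purely for illustration:

python
import numpy as np

def apply_cfg(uncond, cond, cfg_scale):
    # Standard classifier-free guidance: at scale 1.0 this returns the
    # conditioned prediction; larger scales extrapolate further toward it.
    return uncond + cfg_scale * (cond - uncond)

# Hypothetical model outputs for a single generation step
uncond = np.array([0.2, 0.5, 0.1])  # prediction without the reference voice
cond = np.array([0.4, 0.3, 0.6])    # prediction with the reference voice

print(apply_cfg(uncond, cond, 1.0))  # equals cond
print(apply_cfg(uncond, cond, 1.8))  # pushed harder toward the reference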

What is RTF?

RTF (Real-Time Factor) measures generation speed:

  • RTF < 1.0: Faster than real-time (0.5x = generates 2x faster than audio playback)
  • RTF = 1.0: Same speed as real-time
  • RTF > 1.0: Slower than real-time (2.0x = takes 2 seconds to generate 1 second of audio)

Lower is better. RTF depends on your hardware.
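
In other words, RTF is just generation time divided by the duration of the audio produced. Using the example stats from Step 4:

python
# Numbers taken from the example status output earlier in this guide
generation_time = 3.2  # seconds spent generating
audio_duration = 8.5   # seconds of audio produced

rtf = generation_time / audio_duration
print(f"RTF: {rtf:.2f}x")  # ~0.38x, i.e. faster than real-time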

Troubleshooting Quick Start

Model won't load

Error: Out of memory

  • Close other applications using GPU
  • Try using CPU device instead (slower but works)
  • Use a smaller model if you tried the 7B variant

Error: Connection timeout

  • Model is downloading from HuggingFace - wait a bit longer
  • Check your internet connection
  • Try again if download was interrupted

Generation fails

Error: Please load the model first

  • Make sure you clicked "Load Model" and it shows as loaded

Error: Please enter some text

  • The text box is empty - type something!

Error: Please select a voice

  • Voice cloning is enabled but no voice selected
  • Select a voice from dropdown or disable voice cloning

Audio sounds weird

  • Try a different reference voice
  • Adjust CFG scale (try 1.2 or 1.5)
  • Regenerate - there's some randomness involved
  • Make sure your text has proper punctuation

What's Next?

Now that you've generated speech with a pretrained model, you're ready to explore more:

Create Your Own Voice

Follow the complete workflow to train a custom voice:

  1. Preparing Audio Data - Process your audio files
  2. Finetuning Your Voice - Train a custom voice model
  3. Generating Speech - Use your finetuned voice

Learn More

Tips for Best Results

Good practices:

  • Use proper punctuation for natural pausing
  • Keep sentences reasonable length
  • Use conversational language
  • Include emotional cues in text

Things to avoid:

  • Very long run-on sentences
  • Random capitalization or symbols
  • Extremely technical jargon (model may not pronounce well)
  • Non-English text (unless model was trained on it)

Happy voice generating! 🎙️
