
Preparing Audio Data

This guide covers how to process your audio files into datasets ready for training.

Overview

OpenVoiceLab needs training data in a specific format. The Data tab handles this automatically:

  1. You provide raw audio files (MP3, WAV, FLAC, M4A)
  2. Silero VAD segments the audio into speech clips
  3. Whisper transcribes each segment
  4. Output is saved for training

No manual segmentation or transcription needed.
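The four steps above can be pictured as one loop. This is only a shape sketch: `detect_speech` and `transcribe` are hypothetical stand-ins for Silero VAD and Whisper, which OpenVoiceLab runs for you.

```python
# Shape of the Data tab's pipeline. detect_speech() and transcribe() are
# illustrative stand-ins for Silero VAD and Whisper, not OpenVoiceLab's API.

def detect_speech(audio):
    # Silero VAD would return (start, end) sample ranges that contain speech.
    return [(0, 16000), (24000, 56000)]

def transcribe(clip):
    # Whisper would return the text spoken in the clip.
    return "placeholder transcription"

def build_dataset(audio):
    """Segment audio, transcribe each clip, and collect (filename, text) rows."""
    rows = []
    for i, (start, end) in enumerate(detect_speech(audio)):
        clip = audio[start:end]
        rows.append((f"clip_{i:06d}.wav", transcribe(clip)))
    return rows
```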

Audio Requirements

What Works Well

  • Single speaker audio only
  • Clean recordings with minimal background noise
  • Natural speech (conversations, readings, podcasts)
  • 30+ minutes of audio recommended
  • Variety in speaking styles and content

What Doesn't Work

  • Multiple speakers talking simultaneously
  • Heavy music or background noise
  • Very quiet or distorted audio
  • Non-speech content (sound effects, music-only)

Step-by-Step Process

1. Collect Your Audio Files

Put all your audio files in a single folder:

/path/to/my_audio/
├── recording1.mp3
├── recording2.wav
├── podcast_episode.m4a
└── audiobook_chapter.flac

Files can be any length - they'll be automatically split into segments. No metadata is needed - just raw audio files.
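If you want to double-check that a folder contains only formats the Data tab accepts, a quick sketch (the extension list mirrors the formats named above; `find_audio_files` is an illustrative helper, not part of OpenVoiceLab):

```python
from pathlib import Path

# Formats accepted by the Data tab, per the list above.
SUPPORTED = {".mp3", ".wav", ".flac", ".m4a"}

def find_audio_files(folder):
    """Return (supported audio files, files that would be skipped) in a folder."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    audio = sorted(p for p in files if p.suffix.lower() in SUPPORTED)
    skipped = sorted(p for p in files if p.suffix.lower() not in SUPPORTED)
    return audio, skipped
```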

2. Open the Data Tab

In OpenVoiceLab, click the Data tab.

3. Configure Processing

Input Directory

  • Enter the full path to your audio folder
  • Example: /Users/username/my_audio (macOS/Linux) or C:\Users\username\my_audio (Windows)

Dataset Name

  • Choose a name for this dataset
  • Use letters, numbers, and underscores only
  • Example: john_podcast, emma_voice, my_dataset
  • This name is used to reference the dataset later
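The naming rule above (letters, numbers, underscores only) amounts to a one-line regex check; `is_valid_dataset_name` is a hypothetical helper for illustration:

```python
import re

def is_valid_dataset_name(name):
    """True if the name uses only letters, numbers, and underscores (and is non-empty)."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None
```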

Whisper Model

  • Select transcription model from dropdown

Available models (from fastest to most accurate):

  • openai/whisper-tiny - Fastest, least accurate
  • openai/whisper-base - Slightly more accurate than tiny
  • openai/whisper-small - Better accuracy
  • openai/whisper-medium - Even better
  • openai/whisper-large-v3 - Most accurate
  • openai/whisper-large-v3-turbo - Fast + accurate (recommended)

If your data is relatively clean, whisper-large-v3-turbo is fine. Use larger models if transcription quality matters.

4. Start Processing

Click Start Processing.

You'll see progress as it:

  1. Segments audio files
  2. Transcribes each segment
  3. Saves the dataset

Processing time depends on:

  • Amount of audio
  • Whisper model size
  • Your GPU/CPU speed

On a GPU, processing typically finishes in a few minutes; on CPU alone it can take considerably longer.

5. Verify Dataset

When complete, you'll see your dataset listed with:

  • Number of samples created
  • Creation timestamp
  • Storage location

The dataset is saved to data/<dataset_name>/ with this structure:

data/my_dataset/
├── wavs/
│   ├── my_dataset_000000.wav
│   ├── my_dataset_000001.wav
│   └── ...
├── metadata.csv
└── info.json
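A quick sanity check that this layout exists can be sketched as follows (paths follow the tree above; `check_dataset_layout` is a made-up helper, not OpenVoiceLab's code):

```python
from pathlib import Path

def check_dataset_layout(root):
    """Return a list of expected dataset paths that are missing under `root`."""
    root = Path(root)
    expected = [root / "wavs", root / "metadata.csv", root / "info.json"]
    return [str(p) for p in expected if not p.exists()]
```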

metadata.csv format (LJSpeech style: the transcription is written to both text columns):

filename|text|text
my_dataset_000000.wav|This is the transcribed text|This is the transcribed text
my_dataset_000001.wav|Another segment of speech|Another segment of speech
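The pipe-delimited file parses with Python's standard csv module. A sketch of reading it back, assuming the header line shown above is present in the file:

```python
import csv

def load_metadata(path):
    """Read LJSpeech-style metadata (filename|text|text), skipping the header line."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="|")
        next(reader)  # skip the filename|text|text header
        return [(row[0], row[1]) for row in reader]
```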

Congratulations! Your audio data is now prepared for training, and you can proceed to the Finetuning Guide.

What Happens During Processing

Voice Activity Detection (VAD)

Silero VAD splits your audio files into speech segments:

  • Detects where speech starts and stops
  • Removes silence and non-speech sections
  • Creates 1-10 second clips (typically 2-5 seconds)
  • Each segment becomes a training sample
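The 1-10 second window mentioned above amounts to a simple filter over VAD timestamps. A sketch (the sample rate and bounds are the figures quoted here; `keep_clips` is illustrative, not OpenVoiceLab's code):

```python
def keep_clips(timestamps, sample_rate=16000, min_s=1.0, max_s=10.0):
    """Keep (start, end) sample ranges whose duration falls within [min_s, max_s] seconds."""
    kept = []
    for start, end in timestamps:
        duration = (end - start) / sample_rate
        if min_s <= duration <= max_s:
            kept.append((start, end))
    return kept
```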

Transcription

Whisper transcribes each segment:

  • Converts speech to text
  • No timestamps or formatting
  • Works with English and many other languages
  • Larger models = better accuracy but slower

LJSpeech Format

Output is formatted for training:

  • WAV files at original sample rate
  • CSV with filename and transcription pairs
  • Standard format used by many TTS systems

Checking Data Quality

After processing, you can manually inspect:

# View transcriptions
head data/my_dataset/metadata.csv

# Count samples
wc -l data/my_dataset/metadata.csv

# List segments (open a few in any audio player)
ls data/my_dataset/wavs/

Play a few random wav files to check:

  • Audio is clear
  • Segments are reasonable length
  • No weird artifacts or noise

If quality is poor, you may need better source audio.
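Beyond spot-listening, you can total the dataset's duration with Python's standard wave module; if it comes in well under the 30 minutes recommended above, consider adding more source audio. The helper name and directory path are illustrative:

```python
import wave
from pathlib import Path

def total_duration_seconds(wav_dir):
    """Sum the duration, in seconds, of every .wav file in a directory."""
    total = 0.0
    for path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            total += w.getnframes() / w.getframerate()
    return total
```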

Troubleshooting

"No audio segments found after VAD processing"

Your audio may be:

  • Too quiet (VAD can't detect speech)
  • Too noisy (VAD rejects it)
  • Not actually speech (music, silence)

Try:

  • Using different audio files
  • Boosting audio volume before processing
  • Using cleaner recordings

Processing is very slow

  • Whisper models are slow on CPU
  • Use a smaller model (whisper-tiny or whisper-base)
  • Enable GPU if you have one
  • Be patient - it's a one-time process

Transcriptions are wrong

  • Use a larger Whisper model for better accuracy
  • Check that your audio language matches Whisper's training
  • Some accents or technical terms may not transcribe well
  • Minor errors are usually okay for training

Out of memory

  • Close other applications
  • Use a smaller Whisper model
  • Process fewer files at once

Tips for Best Results

Audio quality > quantity

  • 30 minutes of clean audio beats 3 hours of noisy audio
  • Remove music, noise, and overlapping speakers before processing

Variety helps

  • Different speaking styles (excited, calm, serious)
  • Different content (conversation, narration, technical)
  • This helps the model generalize

Check transcriptions

  • Open metadata.csv and spot-check a few entries
  • If transcriptions are mostly wrong, use a larger Whisper model
  • Small errors are usually fine

Multiple takes

  • If first attempt produces poor quality, try again
  • Adjust source audio (normalize volume, remove noise)
  • Try a different Whisper model
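"Normalize volume" can be as simple as peak normalization. A pure-Python sketch on 16-bit PCM samples (a real workflow would use an audio tool; the 0.9 target level is an arbitrary choice to leave headroom):

```python
def peak_normalize(samples, target=0.9, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest peak reaches `target` of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target * full_scale / peak
    return [int(s * gain) for s in samples]
```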

Next Steps

Once you have a processed dataset:

👉 Finetuning Guide - Train a model on your dataset

Audio Collection Tips

Where to get audio for training:

  • Record yourself - Use a decent microphone, quiet room
  • Podcasts - Download episodes (respect copyright)
  • Audiobooks - Public domain books from LibriVox
  • YouTube - Use youtube-dl to extract audio (respect copyright)
  • Existing recordings - Past meetings, presentations, etc.

Copyright note: Only use audio you have rights to. Don't train on copyrighted material without permission.

Released under the BSD-3-Clause License.