Preparing Audio Data
This guide covers how to process your audio files into datasets ready for training.
Overview
OpenVoiceLab needs training data in a specific format. The Data tab handles this automatically:
- You provide raw audio files (MP3, WAV, FLAC, M4A)
- Silero VAD segments the audio into speech clips
- Whisper transcribes each segment
- Output is saved for training
No manual segmentation or transcription needed.
Audio Requirements
What Works Well
- Single speaker audio only
- Clean recordings with minimal background noise
- Natural speech (conversations, readings, podcasts)
- 30+ minutes of audio recommended
- Variety in speaking styles and content
What Doesn't Work
- Multiple speakers talking simultaneously
- Heavy music or background noise
- Very quiet or distorted audio
- Non-speech content (sound effects, music-only)
Step-by-Step Process
1. Collect Your Audio Files
Put all your audio files in a single folder:
```
/path/to/my_audio/
├── recording1.mp3
├── recording2.wav
├── podcast_episode.m4a
└── audiobook_chapter.flac
```
Files can be any length - they'll be automatically split into segments. No metadata is needed - just raw audio files.
2. Open the Data Tab
In OpenVoiceLab, click the Data tab.
3. Configure Processing
Input Directory
- Enter the full path to your audio folder
- Example: `/Users/username/my_audio` (macOS/Linux) or `C:\Users\username\my_audio` (Windows)
Dataset Name
- Choose a name for this dataset
- Use letters, numbers, and underscores only
- Example: `john_podcast`, `emma_voice`, `my_dataset`
- This name is used to reference the dataset later
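As an illustration of the naming rule only (this check is not part of OpenVoiceLab's UI), a name made of letters, numbers, and underscores can be validated with a simple regular expression:

```python
import re

def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only letters, numbers, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None

print(is_valid_dataset_name("john_podcast"))  # True
print(is_valid_dataset_name("emma voice!"))   # False (space and punctuation)
```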
Whisper Model
- Select transcription model from dropdown
Available models (from fastest to most accurate):
- `openai/whisper-tiny` - Fastest, least accurate
- `openai/whisper-base` - Small model
- `openai/whisper-small` - Better accuracy
- `openai/whisper-medium` - Even better
- `openai/whisper-large-v3` - Most accurate
- `openai/whisper-large-v3-turbo` - Fast + accurate (recommended)
If your data is relatively clean, `whisper-large-v3-turbo` is fine. Use larger models if transcription quality matters.
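For intuition about what the dropdown controls, here is a minimal sketch of transcribing a single clip with one of these checkpoints via the Hugging Face `transformers` library. OpenVoiceLab runs this step for you internally and may do it differently; `clip.wav` is just a placeholder file name.

```python
# Sketch: transcribing one clip with a Whisper checkpoint via transformers.
# Assumes `transformers` and `torch` are installed; "clip.wav" is a placeholder.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",      # swap for whisper-tiny, whisper-small, ...
    device=0 if torch.cuda.is_available() else -1,
)

result = asr("clip.wav")                        # path to a single speech segment
print(result["text"])                           # plain transcription, no timestamps
```

Smaller checkpoints load and run faster but make more transcription errors; larger ones are slower but more accurate, which is the same trade-off the dropdown exposes.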
4. Start Processing
Click Start Processing.
You'll see progress as it:
- Segments audio files
- Transcribes each segment
- Saves the dataset
Processing time depends on:
- Amount of audio
- Whisper model size
- Your GPU/CPU speed
With a GPU and a moderate amount of audio, processing generally finishes in a few minutes.
5. Verify Dataset
When complete, you'll see your dataset listed with:
- Number of samples created
- Creation timestamp
- Storage location
The dataset is saved to `data/<dataset_name>/` with this structure:
```
data/my_dataset/
├── wavs/
│   ├── my_dataset_000000.wav
│   ├── my_dataset_000001.wav
│   └── ...
├── metadata.csv
└── info.json
```

`metadata.csv` format:

```
filename|text|text
my_dataset_000000.wav|This is the transcribed text|This is the transcribed text
my_dataset_000001.wav|Another segment of speech|Another segment of speech
```

Congratulations! You've now prepared your audio data for training. You can proceed to the Finetuning Guide.
What Happens During Processing
Voice Activity Detection (VAD)
Silero VAD splits your audio files into speech segments:
- Detects where speech starts and stops
- Removes silence and non-speech sections
- Creates 1-10 second clips (typically 2-5 seconds)
- Each segment becomes a training sample
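To make the segmentation step concrete, here is a minimal sketch of running Silero VAD directly with `torch.hub`. It is not OpenVoiceLab's exact code, and the file name and duration parameters are illustrative assumptions.

```python
# Sketch: finding speech segments in one file with Silero VAD (not OpenVoiceLab's exact code).
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("recording2.wav", sampling_rate=16000)   # placeholder input file
speech_timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    min_speech_duration_ms=1000,    # illustrative: keep clips of at least ~1 s
    max_speech_duration_s=10,       # illustrative: cap clips at ~10 s
)

# Timestamps are returned in samples; divide by the sample rate for seconds.
for ts in speech_timestamps:
    print(ts["start"] / 16000, "->", ts["end"] / 16000, "seconds")
```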
Transcription
Whisper transcribes each segment:
- Converts speech to text
- No timestamps or formatting
- Works with English and many other languages
- Larger models = better accuracy but slower
LJSpeech Format
Output is formatted for training:
- WAV files at original sample rate
- CSV with filename and transcription pairs
- Standard format used by many TTS systems
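If you ever need to rebuild or extend `metadata.csv` by hand, the pipe-delimited layout shown earlier is easy to produce. This is a hedged sketch of the format, not OpenVoiceLab's own writer.

```python
# Sketch: producing the pipe-delimited metadata layout shown above.
samples = [
    ("my_dataset_000000.wav", "This is the transcribed text"),
    ("my_dataset_000001.wav", "Another segment of speech"),
]

with open("data/my_dataset/metadata.csv", "w", encoding="utf-8") as f:
    f.write("filename|text|text\n")              # header row, as in the example above
    for filename, text in samples:
        # The transcription is stored twice; both columns are identical here.
        f.write(f"{filename}|{text}|{text}\n")
```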
Checking Data Quality
After processing, you can manually inspect:
```bash
# View transcriptions
head data/my_dataset/metadata.csv

# Count samples
wc -l data/my_dataset/metadata.csv

# Listen to segments
ls data/my_dataset/wavs/
```

Play a few random wav files to check:
- Audio is clear
- Segments are reasonable length
- No weird artifacts or noise
If quality is poor, you may need better source audio.
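A quick way to check segment lengths is to scan the wavs folder with Python's standard library. This sketch assumes the directory layout shown earlier and the expected 1-10 second clip range.

```python
# Sketch: flagging segments whose duration falls outside the expected 1-10 second range.
import wave
from pathlib import Path

for path in sorted(Path("data/my_dataset/wavs").glob("*.wav")):
    with wave.open(str(path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if not 1.0 <= duration <= 10.0:
        print(f"{path.name}: {duration:.2f}s (outside the expected 1-10 s range)")
```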
Troubleshooting
"No audio segments found after VAD processing"
Your audio may be:
- Too quiet (VAD can't detect speech)
- Too noisy (VAD rejects it)
- Not actually speech (music, silence)
Try:
- Using different audio files
- Boosting audio volume before processing (see the sketch after this list)
- Using cleaner recordings
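If the recordings are simply too quiet, a rough gain boost before processing can help. This sketch assumes the third-party `soundfile` and `numpy` packages (not part of OpenVoiceLab) and placeholder file names.

```python
# Sketch: applying a simple peak-normalizing gain boost to a quiet recording.
# Assumes the third-party packages `soundfile` and `numpy`; file names are placeholders.
import numpy as np
import soundfile as sf

audio, sr = sf.read("quiet_recording.wav")
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio * (0.9 / peak)    # scale so the loudest sample sits just below full scale
sf.write("boosted_recording.wav", audio, sr)
```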
Processing is very slow
- Whisper models are slow on CPU
- Use a smaller model (whisper-tiny or whisper-base)
- Enable GPU if you have one
- Be patient - it's a one-time process
Transcriptions are wrong
- Use a larger Whisper model for better accuracy
- Check that your audio language matches Whisper's training
- Some accents or technical terms may not transcribe well
- Minor errors are usually okay for training
Out of memory
- Close other applications
- Use a smaller Whisper model
- Process fewer files at once
Tips for Best Results
Audio quality > quantity
- 30 minutes of clean audio beats 3 hours of noisy audio
- Remove music, noise, and overlapping speakers before processing
Variety helps
- Different speaking styles (excited, calm, serious)
- Different content (conversation, narration, technical)
- This helps the model generalize
Check transcriptions
- Open metadata.csv and spot-check a few entries
- If transcriptions are mostly wrong, use a larger Whisper model
- Small errors are usually fine
Multiple takes
- If the first attempt produces poor quality, try again
- Adjust the source audio (normalize volume, remove noise)
- Use a different Whisper model
Next Steps
Once you have a processed dataset:
👉 Finetuning Guide - Train a model on your dataset
Audio Collection Tips
Where to get audio for training:
- Record yourself - Use a decent microphone, quiet room
- Podcasts - Download episodes (respect copyright)
- Audiobooks - Public domain books from LibriVox
- YouTube - Use youtube-dl to extract audio (respect copyright)
- Existing recordings - Past meetings, presentations, etc.
Copyright note: Only use audio you have rights to. Don't train on copyrighted material without permission.