Preparing Audio Data
This guide covers how to process your audio files into datasets ready for training.
Overview
OpenVoiceLab needs training data in a specific format. The Data tab handles this automatically:
- You provide raw audio files (MP3, WAV, FLAC, M4A)
- Silero VAD segments the audio into speech clips
- Whisper transcribes each segment
- Output is saved for training
No manual segmentation or transcription needed.
Audio Requirements
What Works Well
- Single speaker audio only
- Clean recordings with minimal background noise
- Natural speech (conversations, readings, podcasts)
- 30+ minutes of audio recommended
- Variety in speaking styles and content
What Doesn't Work
- Multiple speakers talking simultaneously
- Heavy music or background noise
- Very quiet or distorted audio
- Non-speech content (sound effects, music-only)
Step-by-Step Process
1. Collect Your Audio Files
Put all your audio files in a single folder:
```
/path/to/my_audio/
├── recording1.mp3
├── recording2.wav
├── podcast_episode.m4a
└── audiobook_chapter.flac
```
Files can be any length - they'll be automatically split into segments. No metadata is needed - just raw audio files.
2. Open the Data Tab
In OpenVoiceLab, click the Data tab.
3. Configure Processing
Input Directory
- Enter the full path to your audio folder
- Example: `/Users/username/my_audio` (macOS/Linux) or `C:\Users\username\my_audio` (Windows)
Dataset Name
- Choose a name for this dataset
- Use letters, numbers, and underscores only
- Example: `john_podcast`, `emma_voice`, `my_dataset`
- This name is used to reference the dataset later
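As an illustration of the naming rule only (this check is not part of OpenVoiceLab's UI), a name made of letters, numbers, and underscores can be validated with a simple regular expression:

```python
import re

def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only letters, numbers, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None

print(is_valid_dataset_name("john_podcast"))  # True
print(is_valid_dataset_name("emma voice!"))   # False (space and punctuation)
```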
Whisper Model
- Select transcription model from dropdown
Available models (from fastest to most accurate):
- `openai/whisper-tiny` - Fastest, least accurate
- `openai/whisper-base` - Small model
- `openai/whisper-small` - Better accuracy
- `openai/whisper-medium` - Even better
- `openai/whisper-large-v3` - Most accurate
- `openai/whisper-large-v3-turbo` - Fast + accurate (recommended)
If your data is relatively clean, `whisper-large-v3-turbo` is fine. Use larger models if transcription quality matters.
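For intuition about what the dropdown controls, here is a minimal sketch of transcribing a single clip with one of these checkpoints via the Hugging Face `transformers` library. OpenVoiceLab runs this step for you internally and may do it differently; `clip.wav` is just a placeholder file name.

```python
# Sketch: transcribing one clip with a Whisper checkpoint via transformers.
# Assumes `transformers` and `torch` are installed; "clip.wav" is a placeholder.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",      # swap for whisper-tiny, whisper-small, ...
    device=0 if torch.cuda.is_available() else -1,
)

result = asr("clip.wav")                        # path to a single speech segment
print(result["text"])                           # plain transcription, no timestamps
```

Smaller checkpoints load and run faster but make more transcription errors; larger ones are slower but more accurate, which is the same trade-off the dropdown exposes.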
4. Start Processing
Click Start Processing.
You'll see progress as it:
- Segments audio files
- Transcribes each segment
- Saves the dataset
Processing time depends on:
- Amount of audio
- Whisper model size
- Your GPU/CPU speed
With a GPU and a moderate amount of audio, processing generally finishes in a few minutes.
5. Verify Dataset
When complete, you'll see your dataset listed with:
- Number of samples created
- Creation timestamp
- Storage location
The dataset is saved to `data/<dataset_name>/` with this structure:
```
data/my_dataset/
├── wavs/
│   ├── my_dataset_000000.wav
│   ├── my_dataset_000001.wav
│   └── ...
├── metadata.csv
└── info.json
```

`metadata.csv` format:

```
filename|text|text
my_dataset_000000.wav|This is the transcribed text|This is the transcribed text
my_dataset_000001.wav|Another segment of speech|Another segment of speech
```

Congratulations! You've now prepared your audio data for training. You can proceed to the Finetuning Guide.
What Happens During Processing
Voice Activity Detection (VAD)
Silero VAD splits your audio files into speech segments:
- Detects where speech starts and stops
- Removes silence and non-speech sections
- Creates 1-10 second clips (typically 2-5 seconds)
- Each segment becomes a training sample
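To make the segmentation step concrete, here is a minimal sketch of running Silero VAD directly with `torch.hub`. It is not OpenVoiceLab's exact code, and the file name and duration parameters are illustrative assumptions.

```python
# Sketch: finding speech segments in one file with Silero VAD (not OpenVoiceLab's exact code).
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("recording2.wav", sampling_rate=16000)   # placeholder input file
speech_timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    min_speech_duration_ms=1000,    # illustrative: keep clips of at least ~1 s
    max_speech_duration_s=10,       # illustrative: cap clips at ~10 s
)

# Timestamps are returned in samples; divide by the sample rate for seconds.
for ts in speech_timestamps:
    print(ts["start"] / 16000, "->", ts["end"] / 16000, "seconds")
```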
Transcription
Whisper transcribes each segment:
- Converts speech to text
- No timestamps or formatting
- Works with English and many other languages
- Larger models = better accuracy but slower
LJSpeech Format
Output is formatted for training:
- WAV files at original sample rate
- CSV with filename and transcription pairs
- Standard format used by many TTS systems
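If you ever need to rebuild or extend `metadata.csv` by hand, the pipe-delimited layout shown earlier is easy to produce. This is a hedged sketch of the format, not OpenVoiceLab's own writer.

```python
# Sketch: producing the pipe-delimited metadata layout shown above.
samples = [
    ("my_dataset_000000.wav", "This is the transcribed text"),
    ("my_dataset_000001.wav", "Another segment of speech"),
]

with open("data/my_dataset/metadata.csv", "w", encoding="utf-8") as f:
    f.write("filename|text|text\n")              # header row, as in the example above
    for filename, text in samples:
        # The transcription is stored twice; both columns are identical here.
        f.write(f"{filename}|{text}|{text}\n")
```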
Checking Data Quality
After processing, you can manually inspect:
```bash
# View transcriptions
head data/my_dataset/metadata.csv

# Count samples
wc -l data/my_dataset/metadata.csv

# Listen to segments
ls data/my_dataset/wavs/
```

Play a few random wav files to check:
- Audio is clear
- Segments are reasonable length
- No weird artifacts or noise
If quality is poor, you may need better source audio.
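A quick way to check segment lengths is to scan the wavs folder with Python's standard library. This sketch assumes the directory layout shown earlier and the expected 1-10 second clip range.

```python
# Sketch: flagging segments whose duration falls outside the expected 1-10 second range.
import wave
from pathlib import Path

for path in sorted(Path("data/my_dataset/wavs").glob("*.wav")):
    with wave.open(str(path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    if not 1.0 <= duration <= 10.0:
        print(f"{path.name}: {duration:.2f}s (outside the expected 1-10 s range)")
```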
Troubleshooting
"No audio segments found after VAD processing"
Your audio may be:
- Too quiet (VAD can't detect speech)
- Too noisy (VAD rejects it)
- Not actually speech (music, silence)
Try:
- Using different audio files
- Boosting audio volume before processing (see the sketch after this list)
- Using cleaner recordings
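If the recordings are simply too quiet, a rough gain boost before processing can help. This sketch assumes the third-party `soundfile` and `numpy` packages (not part of OpenVoiceLab) and placeholder file names.

```python
# Sketch: applying a simple peak-normalizing gain boost to a quiet recording.
# Assumes the third-party packages `soundfile` and `numpy`; file names are placeholders.
import numpy as np
import soundfile as sf

audio, sr = sf.read("quiet_recording.wav")
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio * (0.9 / peak)    # scale so the loudest sample sits just below full scale
sf.write("boosted_recording.wav", audio, sr)
```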
Processing is very slow
- Whisper models are slow on CPU
- Use a smaller model (whisper-tiny or whisper-base)
- Enable GPU if you have one
- Be patient - it's a one-time process
Transcriptions are wrong
- Use a larger Whisper model for better accuracy
- Check that your audio language matches Whisper's training
- Some accents or technical terms may not transcribe well
- Minor errors are usually okay for training
Out of memory
- Close other applications
- Use a smaller Whisper model
- Process fewer files at once
Tips for Best Results
Audio quality > quantity
- 30 minutes of clean audio beats 3 hours of noisy audio
- Remove music, noise, and overlapping speakers before processing
Variety helps
- Different speaking styles (excited, calm, serious)
- Different content (conversation, narration, technical)
- This helps the model generalize
Check transcriptions
- Open metadata.csv and spot-check a few entries
- If transcriptions are mostly wrong, use a larger Whisper model
- Small errors are usually fine
Multiple takes
- If the first attempt produces poor quality, try again
- Adjust the source audio (normalize volume, remove noise)
- Use a different Whisper model
Next Steps
Once you have a processed dataset:
👉 Finetuning Guide - Train a model on your dataset
Audio Collection Tips
Where to get audio for training:
- Record yourself - Use a decent microphone, quiet room
- Podcasts - Download episodes (respect copyright)
- Audiobooks - Public domain books from LibriVox
- YouTube - Use youtube-dl to extract audio (respect copyright)
- Existing recordings - Past meetings, presentations, etc.
Copyright note: Only use audio you have rights to. Don't train on copyrighted material without permission.