Training Datasets
50k hour dataset
Myrtle.ai's 50k hrs of training data is a mixture of the following open-source datasets:
- LibriSpeech-960h
- Common Voice Corpus 10.0 (version
cv-corpus-10.0-2022-07-04
) - Multilingual LibriSpeech (MLS)
- Peoples' Speech: filtered internally to take highest quality ~10k hrs out of 30k hrs total
This data has a maximum_duration
of 20s and a mean length of 14.67s.
If your dataset is organized in the json format, you can use this script to calculate its mean duration.
10k hour dataset
Myrtle.ai's 10k hrs of training data is a mixture of the following open-source datasets:
- LibriSpeech-960h
- Common Voice
- 961 hours from MLS
- Peoples' Speech: A ~6000 hour subset
This data has a maximum_duration
of 20s and a mean length of 14.02s.
The 10k hour dataset is a subset of the 50k hour dataset above but experiments indicate that models trained on it give better results on Earnings21 than those training on the 50k hour dataset.