JSON format
The JSON format is the default in this repository. If you are training on your own data, it is recommended to convert it into this format. Note that the data preparation steps differ slightly depending on the model you have decided to train, so please refer to the model configuration page first.
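To give a flavour of the format, the following sketch builds a minimal manifest entry in Python. The field names here are illustrative assumptions, not a guaranteed schema; inspect a generated librispeech-*.json file for the real field names.

```python
import json

# Hypothetical manifest entry: field names are illustrative assumptions.
# Check a generated librispeech-*.json file for the actual schema.
entry = {
    "transcript": "mister quilter is the apostle of the middle classes",
    "files": [{"fname": "train-clean-100/103/1240/103-1240-0000.flac"}],
    "original_duration": 14.085,  # seconds
}

# A manifest is a list of such entries serialized as JSON
manifest_json = json.dumps([entry], indent=2)
print(manifest_json)
```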
Prepare LibriSpeech in JSON format
This page takes LibriSpeech as it is distributed from the www.openslr.org website and prepares it into a JSON manifest format.
Quick Start
To run the data preparation steps for LibriSpeech and the base model, run the following from the training/ directory:
# Download data to /datasets/LibriSpeech: requires 60GB of disk
./scripts/download_librispeech.sh
./scripts/preprocess_librispeech.sh
To run preprocessing for the testing or large configurations, instead run:
SPM_SIZE=1023 CONFIG_NAME=testing-1023sp ./scripts/preprocess_librispeech.sh
SPM_SIZE=17407 CONFIG_NAME=large-17407sp ./scripts/preprocess_librispeech.sh
In the next two sections, these steps are described in more detail.
Further detail: download_librispeech.sh
Having run the script, the following folders should exist inside the container:
/datasets/LibriSpeech/
  train-clean-100/
  train-clean-360/
  train-other-500/
  dev-clean/
  dev-other/
  test-clean/
  test-other/
If ~/datasets on the host is mounted to /datasets, the downloaded data will be accessible outside the container at ~/datasets/LibriSpeech.
Further detail: preprocess_librispeech.sh
The script will:
- Create JSON manifests for each subset of LibriSpeech
- Create a sentencepiece tokenizer from the train-960h subset
- Record log-mel stats for the train-960h subset
- Populate the missing fields of a YAML configuration template
Having run the script, the respective files should exist at the following locations:
1. JSON manifests
/datasets/LibriSpeech/
  librispeech-train-clean-100.json
  librispeech-train-clean-360.json
  librispeech-train-other-500.json
  librispeech-dev-clean.json
  librispeech-dev-other.json
  librispeech-test-clean.json
  librispeech-test-other.json
2. Sentencepiece tokenizer
/datasets/sentencepieces/
  librispeech-1023sp.model
  librispeech-1023sp.vocab
3. Log-mel stats
/datasets/stats/STATS_SUBDIR/
  melmeans.pt
  meln.pt
  melvars.pt
The STATS_SUBDIR will differ depending on the model since these stats are affected by the feature extraction window size. They are:
- testing: /datasets/stats/librispeech-winsz0.02
- {base,large}: /datasets/stats/librispeech-winsz0.025
4. _run.yaml config
This file is created in the configs/ directory. Depending on the model you are training, you will have one of:
- testing: configs/testing-1023sp_run.yaml
- base: configs/base-8703sp_run.yaml
- large: configs/large-17407sp_run.yaml
The _run suffix indicates that this is a complete config, not just a template.
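As an illustration, the fields filled in by the script might reference the tokenizer and stats artifacts created above. The key names in this excerpt are hypothetical; check the generated configs/*_run.yaml for the actual layout.

```yaml
# Hypothetical excerpt: key names are illustrative only; see the generated
# configs/*_run.yaml for the real structure.
tokenizer:
  sentpiece_model: /datasets/sentencepieces/librispeech-1023sp.model
input_train:
  stats_path: /datasets/stats/librispeech-winsz0.025
```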
Preprocessing Other Datasets
To convert your own data into the JSON format, adapt the steps in scripts/preprocess_librispeech.sh. The JSON manifest creation step is specific to LibriSpeech, but the remaining steps should be configurable via environment variables passed to the script. For example, if you have created a copy of the script called scripts/preprocess_commonvoice.sh, you can run it like:
DATASET_NAME_LOWER_CASE=commonvoice DATA_DIR=/datasets/CommonVoice MAX_DURATION_SECS=10.0 scripts/preprocess_commonvoice.sh
where:
- DATASET_NAME_LOWER_CASE will determine the name of the generated SENTENCEPIECE and STATS_SUBDIR
- DATA_DIR is the path to which JSON manifests will be written
- MAX_DURATION_SECS is the number of seconds above which audio clips will be discarded during training
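The role of MAX_DURATION_SECS can be sketched in Python. Everything here is illustrative, including the manifest field names; the real schema and filtering live in scripts/preprocess_librispeech.sh and the training code.

```python
import json

# Illustrative sketch only: the manifest schema and duration filtering are
# defined by the repository's scripts, not by this example.
MAX_DURATION_SECS = 10.0  # clips longer than this are discarded during training

# (audio path, transcript, duration in seconds) triples; in practice these
# would come from walking DATA_DIR for your dataset.
utterances = [
    ("clips/a.wav", "hello world", 3.2),
    ("clips/b.wav", "a very long clip", 12.7),
    ("clips/c.wav", "short clip", 8.9),
]

# All clips go into the JSON manifest...
manifest = [
    {"files": [{"fname": path}], "transcript": text, "original_duration": dur}
    for path, text, dur in utterances
]

# ...and training then skips entries above MAX_DURATION_SECS
train_entries = [e for e in manifest if e["original_duration"] <= MAX_DURATION_SECS]
print(json.dumps(train_entries, indent=2))
```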
It is advised that you use all of your training data transcripts to build the sentencepiece tokenizer, but it is acceptable to use a subset of the data to calculate the mel stats via the --n_utterances_only flag to caiman_asr_train/utils/generate_mel_stats.py.
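Accumulating per-mel-bin statistics over a subset of utterances can be sketched as follows. This is a toy illustration, assuming random matrices in place of real log-mel features; the repository's generate_mel_stats.py is the authoritative implementation.

```python
import numpy as np

# Toy sketch: random matrices stand in for log-mel features of shape
# (n_frames, n_mels). Only the first n_utterances_only utterances contribute,
# mirroring the --n_utterances_only flag.
rng = np.random.default_rng(0)
n_mels = 80
n_utterances_only = 5

utterances = [
    rng.standard_normal((rng.integers(50, 100), n_mels)) for _ in range(20)
]

# Accumulate sum, sum of squares, and frame count across the subset
mel_sum = np.zeros(n_mels)
mel_sumsq = np.zeros(n_mels)
n_frames = 0
for feats in utterances[:n_utterances_only]:
    mel_sum += feats.sum(axis=0)
    mel_sumsq += (feats ** 2).sum(axis=0)
    n_frames += feats.shape[0]

# Per-bin mean and (biased) variance over all frames in the subset
mel_means = mel_sum / n_frames
mel_vars = mel_sumsq / n_frames - mel_means ** 2
```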
Next steps
Having run the data preparation steps, go to the training docs to start training.