# WebDataset format
This page gives instructions for reading training and validation data in the WebDataset format, as opposed to the default JSON format described in the Data Formats documentation.

In the WebDataset format, `<key>.{flac,wav}` audio files are stored in tar file shards alongside their associated `<key>.txt` transcripts. Samples in a tar file are read sequentially, which increases I/O rates compared with random access.
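For example, listing a shard's contents might show the following interleaved layout (filenames here are illustrative):

```bash
$ tar -tf train_000000.tar
utt_000001.flac
utt_000001.txt
utt_000002.flac
utt_000002.txt
```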
## Data Preparation

All commands in this README should be run from the `training` directory of the repo.
### WebDataset building

If you would like to build your own WebDataset, refer to the following resources:

- Script that converts from the WeNet legacy format to WebDataset: `make_shard_list.py`
- Tutorial on creating WebDataset shards

At tar file creation time, you must ensure that each audio file is stored sequentially with its associated `.txt` transcript file.
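Because `<key>.flac` sorts immediately before `<key>.txt` for the same key, one way to guarantee this ordering is to tar a lexicographically sorted file list. This is a minimal sketch, assuming a flat directory of paired audio/transcript files; the paths and shard name are illustrative:

```bash
# Sketch: build one shard so each audio file is immediately
# followed by its transcript (paths are assumptions).
cd /datasets/raw/shard_000000
ls *.flac *.txt | sort > filelist.txt  # <key>.flac sorts before <key>.txt
tar --no-recursion -cf /datasets/TarredDataset/train_000000.tar -T filelist.txt
```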
### Text normalization

As discussed in more detail here, it is necessary to normalize your transcripts so that they contain only spaces, apostrophes, and lower-case letters. It is recommended to do this on the fly by setting `normalize_transcripts: true` in your config file. Another option is to perform this step offline when you create the WebDataset shards.
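If you take the offline route, the following is a minimal sketch of such a normalization pass (it assumes plain-text transcripts, and that simple lower-casing plus character filtering matches the repo's own normalization rules, which you should verify):

```bash
# Sketch: lower-case, then keep only a-z, apostrophes, and spaces.
tr '[:upper:]' '[:lower:]' < transcript_raw.txt \
  | tr -cd "a-z' \n" > transcript.txt
```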
### Data preparation: `preprocess_webdataset.sh`

To create the artefacts described in the data preparation intro, run the following inside a running container:

```bash
DATA_DIR=/datasets/TarredDataset TRAIN_TAR_FILES="train_*tar.tar" DATASET_NAME_LOWER_CASE=librispeech ./scripts/preprocess_webdataset.sh
```
This script accepts the following arguments:

- `DATA_DIR`: Directory containing the tar files.
- `TRAIN_TAR_FILES`: One or more shard file paths or globs.
- `DATASET_NAME_LOWER_CASE`: Name of the dataset, used to name the sentencepiece model. Defaults to `librispeech`.
- `MAX_DURATION_SECS`: The maximum duration in seconds that you want to train on. Defaults to `16.7` as per LibriSpeech.
- `CONFIG_NAME`: Model name to use for the config from this table. Defaults to `base-8703sp`.
- `SPM_SIZE`: Sentencepiece model size. Must match `CONFIG_NAME`. Defaults to `8703`.
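For example, to override the defaults for a hypothetical dataset with a 20-second maximum duration (the dataset name and path below are assumptions):

```bash
DATA_DIR=/datasets/MyTarredDataset \
TRAIN_TAR_FILES="train_*.tar" \
DATASET_NAME_LOWER_CASE=mydataset \
MAX_DURATION_SECS=20.0 \
CONFIG_NAME=base-8703sp \
SPM_SIZE=8703 \
./scripts/preprocess_webdataset.sh
```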
## Training and validation

To trigger training or validation on data stored in WebDataset format, pass `--read_from_tar` to `train.sh` or `val.sh`.

You will also need to pass `--val_tar_files` (and, for training, `--train_tar_files`) as one or more tar shard files/globs within `--data_dir`. For example, if all of your training and validation tar files are in a flat `--data_dir` directory, you might run:
```bash
./scripts/train.sh --read_from_tar --data_dir=/datasets/TarredDataset --train_tar_files train_*.tar --val_tar_files dev_*.tar
```
where `{train,val}_tar_files` can be one or more filenames or fileglobs. In this mode, your training and validation tar files must have non-overlapping names. Alternatively, if you have a nested file structure, you can set `--data_dir=/` and then pass absolute paths/globs to `--train_tar_files` and `--val_tar_files`, for example:

```bash
./scripts/train.sh --read_from_tar --data_dir=/ --train_tar_files /datasets/TarredDataset/train/** --val_tar_files /datasets/TarredDataset/dev/**
```

Note that in the second case (when paths are absolute), glob expansion is performed by your shell rather than by the `WebDatasetReader` class.
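Conversely, when using relative globs with a non-root `--data_dir`, it can be safer to quote them so that they reach the `WebDatasetReader` unexpanded regardless of your shell settings; a sketch:

```bash
# Quoted globs are expanded by the WebDatasetReader class,
# not by the shell.
./scripts/train.sh --read_from_tar --data_dir=/datasets/TarredDataset \
    --train_tar_files 'train_*.tar' --val_tar_files 'dev_*.tar'
```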
You should refer to the Training command documentation for more details on training arguments unrelated to this data format.
For validation you might run:

```bash
./scripts/val.sh --read_from_tar --data_dir=/datasets/TarredDataset --val_tar_files dev_*.tar
# or, with absolute paths:
./scripts/val.sh --read_from_tar --data_dir=/ --val_tar_files /datasets/TarredDataset/dev/**
```
## WebDataset Limitations

Our WebDataset support currently has the following limitations:

- It isn't currently possible to mix and match JSON and WebDataset formats for the training and validation data passed to `./scripts/train.sh`.
- Each dataset (including the validation data) must contain more shards than `num_gpus`, so that each GPU can read from a different shard; a quick check is sketched below.
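As a sanity check for the shard-count requirement, you could count your shards before launching training. This is a sketch assuming 8 GPUs and the flat layout used in the examples above:

```bash
NUM_GPUS=8  # assumption: match this to your training setup
NUM_SHARDS=$(ls /datasets/TarredDataset/train_*.tar | wc -l)
if [ "$NUM_SHARDS" -le "$NUM_GPUS" ]; then
    echo "Need more than $NUM_GPUS shards, found $NUM_SHARDS" >&2
fi
```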