Data preparation
Having chosen which model configuration to train, you will need to complete the following preprocessing steps:
- Prepare your data in one of the supported training formats: JSON or WebDataset.
- Create a sentencepiece model from your training data.
- Record your training data log-mel stats for input feature normalization.
- Populate a YAML configuration file with the missing fields.
- Generate an n-gram language model from your training data.
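To illustrate what recording log-mel stats for input feature normalization can involve, here is a minimal sketch. It assumes the stats are a per-mel-bin mean and standard deviation taken over every frame in the training set; the function names `logmel_stats` and `normalize_frame` are hypothetical and are not part of this repository's tooling, which may compute or store the stats differently.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def logmel_stats(frames: Iterable[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return per-bin (means, stds) over all frames, accumulated in one pass.

    Hypothetical helper: each frame is one vector of log-mel values.
    """
    count = 0
    sums: List[float] = []
    sq_sums: List[float] = []
    for frame in frames:
        if not sums:
            # Size the accumulators from the first frame.
            sums = [0.0] * len(frame)
            sq_sums = [0.0] * len(frame)
        for i, value in enumerate(frame):
            sums[i] += value
            sq_sums[i] += value * value
        count += 1
    if count == 0:
        raise ValueError("no frames supplied")
    means = [s / count for s in sums]
    # Population variance; clamp tiny negatives caused by rounding.
    stds = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(sq_sums, means)
    ]
    return means, stds


def normalize_frame(
    frame: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Shift and scale one frame with the recorded stats (eps avoids divide-by-zero)."""
    return [(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)]
```

Accumulating sums rather than storing every frame keeps memory use constant, which matters when the training set is large.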
Text normalization
The examples assume a character set of size 28: a space, an apostrophe and 26 lower case letters.
Transcripts will be normalized on the fly during training, as set by normalize_transcripts: lowercase in the YAML config templates.
See Changing the character set for how to configure the character set and normalization.
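As a rough illustration of what lowercase normalization onto the 28-character set means, here is a minimal sketch. It assumes that characters outside the set are replaced by a space and that repeated whitespace is collapsed; the training code's actual handling of digits and punctuation may differ. `normalize_transcript` is a hypothetical name, not a function from this repository.

```python
import string

# The 28-character set described above: space, apostrophe, and a-z.
CHARSET = set(" '" + string.ascii_lowercase)


def normalize_transcript(text: str) -> str:
    """Lowercase the text and keep only characters in CHARSET.

    Assumption: out-of-set characters become spaces, then runs of
    whitespace are collapsed to a single space.
    """
    lowered = text.lower()
    kept = "".join(ch if ch in CHARSET else " " for ch in lowered)
    return " ".join(kept.split())


print(normalize_transcript("Hello, World! It's 9am."))
# prints: hello world it's am
```

Note that this sketch discards digits entirely; a production pipeline might spell them out instead.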
During validation, the predictions and reference transcripts
will be standardized.