Log-mel feature normalization

We normalize the acoustic log mel features based on the global mean and variance recorded over the training dataset.

Record dataset stats

The script generate_mel_stats.py computes these statistics and stores them in /datasets/stats/<dataset_name+window_size> as PyTorch tensors. For example usage see:

  • scripts/preprocess_librispeech.sh
  • scripts/preprocess_webdataset.sh

Training stability

Empirically, it was found that normalizing the input activations with dataset global mean and variance makes the early stage of training unstable. As such, the default behaviour is to move between two modes of normalization on a schedule during training. This is handled by the MelFeatNormalizer class and explained in the docstring below:

class MelFeatNormalizer:
    Perform audio normalization, optionally blending between two normalization types.

    The two types of normalization are:
        1. use pre-computed NormType.DATASET_STATS per mel bin and normalize each
        timestep independently
        2. use utterance-specific NormType.UTTERANCE_STATS per mel bin that are
        calculated over the time-dimension of the mel spectrogram

    The first of these is used for validation/inference. The second method isn't
    streaming compatible but is more stable during the early stages of training.
    Therefore, by default, the training script blends between the two methods on a


When running validation, the dataset global mean and variance are always used for normalization regardless of how far through the schedule the model is.

Backwards compatibility

Prior to v1.9.0, the per-utterance stats were used for normalization during training (and then streaming normalization was used during inference). To evaluate a model trained on <=v1.8.0, use the --norm_over_utterance flag to the val.sh script.