Training times

Training times and throughputs on an 8 x A100 (80GB) system are as follows:

| Model | Training time | Throughput  | No. of updates | grad_accumulation_batches | batch_split_factor |
|-------|---------------|-------------|----------------|---------------------------|--------------------|
| base  | 1.6 days      | 729 utt/sec | 100k           | 1                         | 8                  |
| large | 2.2 days      | 550 utt/sec | 100k           | 1                         | 16                 |

Training times and throughputs on an 8 x A5000 (24GB) system are as follows:

| Model | Training time | Throughput  | No. of updates | grad_accumulation_batches | batch_split_factor |
|-------|---------------|-------------|----------------|---------------------------|--------------------|
| base  | 3.1 days      | 379 utt/sec | 100k           | 1                         | 16                 |
| large | 8.5 days      | 140 utt/sec | 100k           | 8                         | 4                  |

where:

  • Throughput is the number of utterances seen per second during training (higher is better).
  • No. of updates is the number of optimizer steps at --global_batch_size=1024 required to train the models on the 50k-hour training dataset. You may need fewer steps when training with less data.
  • grad_accumulation_batches is the number of gradient accumulation steps performed on each GPU before taking an optimizer step.
  • batch_split_factor is the number of sub-batches that the PER_GPU_BATCH_SIZE is split into before these sub-batches are passed through the joint network and loss; see the sketch after this list for how these values relate.
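
As a rough illustration of how these hyper-parameters fit together, the sketch below computes the per-GPU and sub-batch sizes for the "base" configuration on the A100 system. The variable names are illustrative assumptions rather than the exact names used by the training scripts, and it assumes the usual relationship global batch = per-GPU batch × number of GPUs × gradient accumulation steps.

```python
# Minimal sketch of how the batch-size hyper-parameters relate
# (illustrative variable names, not the training scripts' own).
num_gpus = 8                   # 8 x A100 (80GB) system
global_batch_size = 1024       # --global_batch_size
grad_accumulation_batches = 1  # from the A100 "base" row above
batch_split_factor = 8         # from the A100 "base" row above

# Utterances processed by each GPU per optimizer step:
per_gpu_batch_size = global_batch_size // (num_gpus * grad_accumulation_batches)

# Utterances passed through the joint network and loss at a time:
sub_batch_size = per_gpu_batch_size // batch_split_factor

print(per_gpu_batch_size, sub_batch_size)  # 128 16
```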

For more details on these hyper-parameters, including how to set them, please refer to the batch size arguments documentation.
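
As a sanity check, the training times in the tables follow directly from the throughput and the number of updates. A back-of-the-envelope sketch for the "base" model on the A100 system (assuming throughput stays roughly constant over the run):

```python
# Estimate training time from the table values ("base" on 8 x A100).
updates = 100_000           # optimizer steps
global_batch_size = 1024    # utterances per update
throughput = 729            # utterances per second

seconds = updates * global_batch_size / throughput
print(f"{seconds / 86_400:.1f} days")  # ~1.6 days, matching the table
```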