Beam search decoder

By default, validation is carried out using a greedy decoder. To use a beam decoder instead, run:

./scripts/val.sh --decoder=beam

which runs with a default beam width of 4. To change the beam width, run, for example:

./scripts/val.sh --decoder=beam --beam_width=8

Note

All of the beam decoder options described on this page are available in train.sh as well as val.sh.

The beam decoder utilises an optimised version of beam search - adaptive beam search - which reduces decoding compute by limiting the number of beam expansions considered, without degrading WER. Two hypothesis pruning methods are employed:

  1. Hypotheses scoring more than beam_prune_score_thresh (default 0.4) below the best hypothesis's score are pruned.
  2. Tokens with a log-probability more than beam_prune_topk_thresh (default 1.5) below that of the most likely token are ignored.

Reducing beam_prune_score_thresh and beam_prune_topk_thresh makes pruning more aggressive; setting a threshold to a value < 0 disables the corresponding pruning method.

Adaptive beam search dynamically adjusts computation based on model confidence, using more compute when uncertain and behaving almost greedily when confident.
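As a rough illustration, the two pruning rules could look like the following (a minimal sketch; the function and variable names are illustrative, not the repository's API):

def prune_expansions(hyps, token_logprobs, score_thresh=0.4, topk_thresh=1.5):
    """Illustrative sketch of adaptive beam search pruning (not the repo's code).

    hyps: list of (tokens, score) pairs, where a higher score is better.
    token_logprobs: log-probabilities of candidate tokens for the next expansion.
    A negative threshold disables the corresponding rule.
    """
    # Rule 1: drop hypotheses scoring more than score_thresh below the best.
    if score_thresh >= 0:
        best = max(score for _, score in hyps)
        hyps = [(t, s) for t, s in hyps if s >= best - score_thresh]

    # Rule 2: ignore tokens more than topk_thresh below the most likely token.
    if topk_thresh >= 0:
        top = max(token_logprobs)
        candidates = [i for i, lp in enumerate(token_logprobs)
                      if lp >= top - topk_thresh]
    else:
        candidates = list(range(len(token_logprobs)))

    return hyps, candidates

When the model is confident, rule 2 typically leaves a single candidate token, so the search collapses to near-greedy decoding; when it is uncertain, more candidates survive and more compute is spent.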

Softmax temperature

The beam decoder applies a softmax temperature to the logits. The default is --temperature=1.4, as this was found to improve WER across a range of configurations. Increasing the temperature increases beam diversity and makes the greedy path less likely.
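Concretely, the temperature divides the logits before the log-softmax. A minimal PyTorch sketch (the function name is illustrative):

import torch
import torch.nn.functional as F

def tempered_log_probs(logits: torch.Tensor, temperature: float = 1.4) -> torch.Tensor:
    # A temperature > 1 flattens the distribution, raising the
    # log-probability of non-greedy tokens relative to the argmax.
    return F.log_softmax(logits / temperature, dim=-1)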

Fuzzy top-k logits

When using --decoder=beam, the model first calculates the logits for all classes (or tokens), applies the log-softmax function to get log probabilities, and then selects the top beam-width tokens to expand the beam.

However, in the hardware-accelerated solution, sending the full vocab-size logits tensor from the FPGA to the CPU is an I/O bottleneck. To address this, the hardware-accelerated solution sends a reduced set of logits to the CPU. Specifically, it sends the highest-value logits within each local block of the logits tensor. This enables a 'fuzzy top-k' operation that approximates the full top-k operation with some small difference.
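The idea can be sketched as follows (illustrative only; the block size and function name are assumptions, not the accelerated implementation):

import torch

def fuzzy_topk(logits: torch.Tensor, k: int, block_size: int = 32):
    """Approximate top-k: keep only the max logit of each local block,
    then take the top k over the reduced tensor. Illustrative sketch."""
    vocab = logits.shape[-1]
    pad = (-vocab) % block_size  # pad so the vocab divides into whole blocks
    padded = torch.nn.functional.pad(logits, (0, pad), value=float("-inf"))
    blocks = padded.view(*padded.shape[:-1], -1, block_size)
    block_max, block_argmax = blocks.max(dim=-1)
    vals, block_idx = block_max.topk(k, dim=-1)
    # Recover each surviving logit's index in the original vocabulary.
    idx = block_idx * block_size + block_argmax.gather(-1, block_idx)
    return vals, idx

The result is 'fuzzy' because if two of the true top-k logits fall in the same block, only the larger one survives.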

Our experiments show that using the reduced logits tensor (fuzzy top-k logits) does not impact the model's WER performance.

Using fuzzy top-k logits

To use a reduced tensor implementation similar to the accelerated version, run the following command:

./scripts/val.sh --num_gpus=2 --decoder=beam --fuzzy_topk_logits

Note that evaluation time increases by ~30% compared to standard beam search, which is why fuzzy top-k logits are disabled by default.

N-gram language models

N-gram language models are used during beam decoding to improve WER. This is enabled by default and is described in more detail in the N-gram language model documentation.
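For orientation, n-gram LMs are commonly combined with the acoustic model via shallow fusion, adding a weighted LM score when ranking beam expansions. A minimal sketch, assuming a hypothetical lm object and lm_weight parameter (see the N-gram language model documentation for the actual integration):

def score_expansion(hyp_score, token_logprob, lm, context, token, lm_weight=0.5):
    # Hypothetical shallow fusion: the n-gram LM's log-probability of the
    # candidate token, given the hypothesis so far, is added with a weight.
    return hyp_score + token_logprob + lm_weight * lm.logprob(context, token)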