WER Calculation

WER Formula

Word Error Rate (WER) is a metric commonly used for measuring the performance of Automatic Speech Recognition (ASR) systems.

It compares the hypothesis transcript generated by the model with the reference transcript, which is considered to be the ground truth. The metric counts the minimum number of words that must be substituted, removed, or inserted in the hypothesis text to make it match the reference text, and expresses that count as a percentage of the number of words in the reference.

For example:

Hypothesis: the       cat  and the brown dogs sat on the long bench
Reference:  the black cat  and the brown dog  sat on the      bench

Compared with the reference, the hypothesis contains:

1 deletion error (the word "black" is missing),
1 substitution error ("dogs" instead of "dog"), and
1 insertion error (the extra word "long").

The reference text contains 11 words in total.

With S substitutions, D deletions, I insertions, and N words in the reference text, the WER for this transcription is:

$$ WER = \frac{S + D + I}{N} \times 100 = \frac{1 + 1 + 1}{11} \times 100 \approx 27.27\% $$
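The calculation above can be reproduced with a word-level edit distance (Levenshtein distance). The following is a minimal sketch for illustration only, not the CAIMAN-ASR implementation; the function name `word_error_rate` is hypothetical, and the input is assumed to be whitespace-separated text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER as a percentage, using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    # prev[-1] is the minimum value of S + D + I.
    return 100.0 * prev[-1] / len(ref)


reference = "the black cat and the brown dog sat on the bench"
hypothesis = "the cat and the brown dogs sat on the long bench"
print(f"{word_error_rate(reference, hypothesis):.2f}%")  # prints 27.27%
```

The sketch keeps only two rows of the distance table, which is enough to obtain the total error count. Reporting S, D, and I separately would require keeping the full table and backtracking through it.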

WER Standardization

When standardize_wer: true is set in the YAML config, the text of both hypotheses and references is standardized before the WER is calculated, so that the model accuracy is not penalised for mistakes due to differences in capitalisation, punctuation, etc.

Currently, CAIMAN-ASR uses the Whisper EnglishSpellingNormalizer. The Whisper standardization rules applied are the following:

Remove text between brackets (< > or [ ]).
Remove punctuation (parentheses, commas, periods, etc.).
Remove filler words like hmm, uh, etc.
Substitute contractions with full words, e.g. won't -> will not.
Convert British into American English spelling, e.g. standardise -> standardize. The list of words is included in the file english.json.

We additionally apply the following transformations:

Remove diacritics ("café" becomes "cafe").
Lowercase the text.
Expand digits and symbols into words ("$1.02" becomes "one dollar two cents", "cats & dogs" becomes "cats and dogs").
Expand common abbreviations ("Dr. Smith" becomes "doctor smith").

For example:

Hypothesis:     that's  what we'll standardise in today's example
Reference:  hmm that is what we'll standardize in today's example

After applying the Whisper standardization rules, both sentences become:

Hypothesis: that is what we will standardize in today's example
Reference:  that is what we will standardize in today's example

The two sentences are now identical, so the WER is 0%.
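To make the effect of these rules concrete, here is a deliberately simplified sketch of a standardizer. It is not the Whisper normalizer and not the CAIMAN-ASR code: the function name `standardize` and the three small lookup tables are hypothetical stand-ins for the much larger rule sets, and the expansion of digits, symbols, and abbreviations is left out.

```python
import re
import unicodedata

# Tiny illustrative tables (assumptions); the real rule sets are far larger.
FILLERS = {"hmm", "uh", "um", "mm"}
CONTRACTIONS = {"that's": "that is", "we'll": "we will", "won't": "will not"}
BRITISH_TO_AMERICAN = {"standardise": "standardize"}


def standardize(text: str) -> str:
    text = text.lower()
    # Remove diacritics: decompose characters, then drop the combining marks.
    text = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    # Remove text between < > or [ ].
    text = re.sub(r"<[^>]*>|\[[^\]]*\]", " ", text)
    # Remove punctuation, keeping apostrophes so contractions can be looked up.
    text = re.sub(r"[^\w\s']", " ", text)

    words = []
    for word in text.split():
        if word in FILLERS:
            continue
        word = CONTRACTIONS.get(word, word)
        word = BRITISH_TO_AMERICAN.get(word, word)
        words.append(word)
    return " ".join(words)


hypothesis = "that's what we'll standardise in today's example"
reference = "hmm that is what we'll standardize in today's example"
print(standardize(hypothesis))
print(standardize(reference))
```

Both calls print the same sentence, "that is what we will standardize in today's example", which is why the standardized WER for this pair is 0%.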