WER Calculation
WER Formula
Word Error Rate (WER) is a metric commonly used for measuring the performance of Automatic Speech Recognition (ASR) systems.
It compares the hypothesis transcript generated by the model with the reference transcript, which is treated as the ground truth. The metric counts the minimum number of word-level edits (substitutions, deletions, and insertions) that separate the hypothesis from the reference, and expresses that count as a percentage of the number of words in the reference.
For example:
Hypothesis: the cat and the brown dogs sat on the long bench
Reference: the black cat and the brown dog sat on the bench
Compared with the reference, the hypothesis contains:
- 1 deletion error (the word "black" is missing),
- 1 substitution error ("dogs" instead of "dog"), and
- 1 insertion error (the extra word "long").
The reference text contains 11 words in total.
With S substitutions, D deletions, I insertions, and N words in the reference, the WER for this transcription is:
$$ WER = \frac{S + D + I}{N} \times 100 = \frac{1 + 1 + 1}{11} \times 100=27.27\% $$
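The calculation above can be sketched in Python as a word-level edit distance (Levenshtein distance) divided by the reference length. This is a minimal illustration, not the CAIMAN-ASR implementation: the function name word_error_rate is hypothetical, and words are obtained by simple whitespace splitting.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent, using word-level edit distance.

    Hypothetical helper for illustration; words are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    # S + D + I is the final edit distance; N is the reference length.
    return 100.0 * prev[-1] / len(ref)


reference = "the black cat and the brown dog sat on the bench"
hypothesis = "the cat and the brown dogs sat on the long bench"
print(f"{word_error_rate(reference, hypothesis):.2f}%")
```

The dynamic-programming table only yields the total S + D + I; recovering the individual counts would require backtracking through the table, which is omitted here for brevity.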
WER Standardization
When standardize_wer: true is set in the YAML config, the text of both hypotheses and references is standardized before the WER is calculated, so that the model's accuracy is not penalised for differences in capitalisation, punctuation, etc.
Currently, CAIMAN-ASR uses the Whisper EnglishSpellingNormalizer. The Whisper standardization rules applied are the following:
- Remove text between brackets (< > or [ ]).
- Remove punctuation (parentheses, commas, periods, etc.).
- Remove filler words like hmm, uh, etc.
- Substitute contractions with full words, e.g. won't -> will not.
- Convert British into American English spelling, e.g. standardise -> standardize. The list of words is included in the file english.json.
We additionally apply the following transformations:
- Remove diacritics ("café" becomes "cafe")
- Lowercase the text
- Expand digits and symbols into words ("$1.02" becomes "one dollar two cents", "cats & dogs" becomes "cats and dogs")
- Expand common abbreviations ("Dr. Smith" becomes "doctor smith")
For example:
Hypothesis: that's what we'll standardise in today's example
Reference: hmm that is what we'll standardize in today's example
After applying the Whisper standardization rules, the sentences become:
Hypothesis: that is what we will standardize in today's example
Reference: that is what we will standardize in today's example
These are identical, so the WER is 0%.
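To make the effect of standardization concrete, the following Python sketch applies a small subset of the rules listed above. It is a simplified stand-in, not the Whisper normalizer or the CAIMAN-ASR code: the function name standardize and the three lookup tables are hypothetical and cover only the words in this example, and digit and abbreviation expansion are left out.

```python
import re
import unicodedata

# Toy lookup tables (assumptions for illustration only); the real
# normalizer relies on much larger lists such as english.json.
FILLERS = {"hmm", "uh", "um"}
CONTRACTIONS = {"won't": "will not", "that's": "that is", "we'll": "we will"}
BRITISH_TO_AMERICAN = {"standardise": "standardize"}


def standardize(text: str) -> str:
    """Apply a reduced set of standardization rules to a transcript."""
    text = text.lower()
    # Remove text between < > or [ ].
    text = re.sub(r"<[^>]*>|\[[^\]]*\]", " ", text)
    # Remove diacritics: decompose characters, then drop combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )
    # Remove punctuation but keep apostrophes, which contractions need.
    text = re.sub(r"[^\w\s']", " ", text)

    words = []
    for word in text.split():
        if word in FILLERS:
            continue  # drop filler words
        word = CONTRACTIONS.get(word, word)
        word = BRITISH_TO_AMERICAN.get(word, word)
        words.append(word)
    return " ".join(words)


hypothesis = standardize("that's what we'll standardise in today's example")
reference = standardize("hmm that is what we'll standardize in today's example")
print(hypothesis == reference)
```

Both inputs reduce to the same string, so a WER computed on the standardized text is 0%.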