User-perceived Latency

User-perceived latency (UPL) is defined as the time difference between the instant when the speaker finishes saying a word and the instant when that word appears in the transcript on screen. In practice, it can be measured by streaming audio to a server in real time and measuring the response time for each word. The following table summarizes UPL for the base and large models, measured by streaming the LibriSpeech dev-clean dataset to an on-site server with an FPGA backend running at the maximum number of real-time streams (RTS).

| Model / decoding | Mean UPL | p90 UPL | p99 UPL |
|------------------|----------|---------|---------|
| base / greedy    | 159 ms   | 303 ms  | 451 ms  |
| large / beam     | 163 ms   | 326 ms  | 629 ms  |
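As a rough illustration, the sketch below shows how per-word UPL could be measured for a single stream. The `stream_client` object with its `send` and `poll_words` methods is a hypothetical placeholder, not the actual client API; a real measurement would use the deployed streaming interface and reference word-end times from a forced alignment.

```python
import time
import statistics

def measure_upl(stream_client, audio_chunks, chunk_ms, word_end_offsets):
    """Stream audio in real time and compute per-word user-perceived latency.

    stream_client    -- hypothetical client exposing send(chunk) and poll_words()
    audio_chunks     -- audio split into chunk_ms-sized pieces
    chunk_ms         -- chunk duration in milliseconds (e.g. 60 for CAIMAN-ASR)
    word_end_offsets -- reference end time of each word, seconds from stream start
    """
    start = time.monotonic()
    emission_times = []  # wall-clock time at which each word appeared in the transcript
    for i, chunk in enumerate(audio_chunks):
        # Pace the stream so the audio is sent in real time.
        target = start + i * chunk_ms / 1000.0
        time.sleep(max(0.0, target - time.monotonic()))
        stream_client.send(chunk)
        for _word in stream_client.poll_words():  # newly transcribed words, if any
            emission_times.append(time.monotonic())
    # UPL for a word = time it appeared minus the time the speaker finished saying it.
    upls_ms = [
        (emitted - (start + spoken)) * 1000.0
        for emitted, spoken in zip(emission_times, word_end_offsets)
    ]
    # quantiles(n=100) returns 99 cut points; index 89 is p90, index 98 is p99.
    cuts = statistics.quantiles(upls_ms, n=100)
    return statistics.mean(upls_ms), cuts[89], cuts[98]
```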

UPL is the sum of the following latencies:

  • audio frame latency (recording device)
  • compute latency (model)
  • network latency (network)
  • emission latency (model)
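Because these contributions are additive, a latency budget can be sanity-checked with simple arithmetic. The snippet below is only an illustration of the decomposition; it loosely mixes a mean and a percentile figure taken from the tables in this section.

```python
# Illustrative UPL budget for an on-premise base-model deployment.
audio_frame_ms = 60 / 2  # half the 60 ms chunk length, on average
compute_ms = 25          # CL99 for base / greedy (see the compute latency table)
network_ms = 1           # on-premise round trip, well below 1 ms in practice
emission_ms = 121        # mean EL for the base model (see the emission latency table)

upl_ms = audio_frame_ms + compute_ms + network_ms + emission_ms
print(f"estimated UPL ~ {upl_ms:.0f} ms")
```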

Audio Frame Latency

Streaming ASR systems buffer audio for a fixed duration before sending it to the server. The buffer length is typically set to match the audio chunk length used during training; CAIMAN-ASR models were trained on 60 ms audio chunks. However, a word can end at any point during buffering, so the average contribution of this latency is half the audio frame length (30 ms for CAIMAN-ASR).
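The "half the frame length on average" figure follows from word endings being roughly uniformly distributed within a chunk. The short simulation below illustrates this for the 60 ms chunk length; it is a standalone sketch, not part of the product code.

```python
import random

chunk_ms = 60.0
# If a word ends at a uniformly random point inside the chunk, the remaining wait
# until the chunk is full (and can be sent) averages half the chunk length.
waits = [chunk_ms - random.uniform(0.0, chunk_ms) for _ in range(100_000)]
print(sum(waits) / len(waits))  # ~30 ms
```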

Compute Latency

The compute latency measures how long the model takes to make a prediction for one audio frame. It depends on the model size, accelerator backend, server load, decoding strategy, and whether state resets are turned on. The contribution of the compute latency is strictly additive. The following table summarizes 99th-percentile compute latencies (CL99) at the maximum number of real-time streams for an FPGA backend and various setups.

| Model | Parameters | Decoding      | CL99  | CL99 + state resets |
|-------|------------|---------------|-------|---------------------|
| base  | 85M        | greedy        | 25 ms | 45 ms               |
| base  | 85M        | beam, width=4 | 80 ms | 50 ms               |
| large | 196M       | greedy        | 25 ms | 55 ms               |
| large | 196M       | beam, width=4 | 40 ms | 60 ms               |
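A 99th-percentile compute latency can be derived from per-frame timings as sketched below. The `infer_frame` call is a hypothetical stand-in for the accelerator backend; the percentile arithmetic is the point of the example.

```python
import time
import statistics

def compute_latency_p99(infer_frame, frames):
    """Time each per-frame inference call and return the 99th-percentile latency in ms."""
    timings_ms = []
    for frame in frames:
        t0 = time.perf_counter()
        infer_frame(frame)  # hypothetical backend call
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(timings_ms, n=100)[98]
```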

Network Latency

The network latency corresponds to sending an audio chunk to the server and receiving a response back. Its contribution is roughly equal to the round-trip time, as measured using ping. If the solution is deployed on-premises, the expected value is well below 1 ms; if it is deployed in the cloud, the network latency can exceed 100 ms.
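When ping is unavailable, the round trip can be approximated by timing TCP connection setup to the server, as in the sketch below. This is only a rough proxy for the per-chunk round trip, and the hostname shown is a placeholder.

```python
import socket
import time

def tcp_rtt_ms(host, port, attempts=10):
    """Approximate network round-trip time by timing TCP connection setup."""
    samples = []
    for _ in range(attempts):
        t0 = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # connection established corresponds to roughly one round trip
        samples.append((time.perf_counter() - t0) * 1000.0)
    return min(samples)  # the minimum is closest to the raw network RTT

# Example usage (placeholder hostname):
# print(tcp_rtt_ms("asr-server.local", 443))
```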

Emission Latency

The emission latency (EL) is explained in detail in this document. Together with the network latency, it is one of only two latencies the user can directly influence, as explained here. Its contribution is strictly additive. The table below summarizes emission latency for the latest models, averaged across all HF Leaderboard evaluation datasets.

| Model | Mean EL | p90 EL |
|-------|---------|--------|
| base  | 121 ms  | 281 ms |
| large | 82 ms   | 287 ms |
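Emission latency for an utterance can be computed offline by comparing reference word-end timestamps (e.g. from a forced alignment) with the audio times at which the model emitted each word. The sketch below assumes both inputs are given in seconds from the start of the audio and that words are emitted in order.

```python
import statistics

def emission_latency_ms(reference_word_ends, emitted_word_times):
    """Per-word emission latency: when the model emitted each word minus when it was spoken.

    reference_word_ends -- word-end times from a forced alignment, seconds from audio start
    emitted_word_times  -- audio time at which the model emitted each word, seconds
    """
    lats_ms = [
        (emitted - spoken) * 1000.0
        for emitted, spoken in zip(emitted_word_times, reference_word_ends)
    ]
    # Return the mean and p90, matching the columns of the table above.
    return statistics.mean(lats_ms), statistics.quantiles(lats_ms, n=100)[89]
```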