Key Features

CAIMAN-ASR enables at-scale automatic speech recognition (ASR), supporting up to 2000 real-time streams per accelerator card.

CAIMAN-ASR flow

Lowest end-to-end latency

CAIMAN-ASR leverages the parallel processing advantages of Achronix’s Speedster7t® FPGA, the power behind the accelerator cards, to achieve extremely low latency inference. This enables NLP workloads to be performed in a human-like response time for end-to-end conversational AI.

Simple to integrate into existing systems

CAIMAN-ASR's Websocket API can be easily connected to your service.

Scale up rapidly & easily

CAIMAN-ASR runs on industry-standard PCIe accelerator cards, enabling existing racks to be upgraded quickly for up to 20x greater call capacity. The VectorPath® S7t-VG6 accelerator card from BittWare is available off-the-shelf today.

Efficient inference, at scale

CAIMAN-ASR uses as much as 90% less energy to process the same number of real-time streams as an unaccelerated solution, significantly reducing energy costs and enhancing ESG (environmental, social, and governance) credentials.

Streaming transcription

CAIMAN-ASR is provided pre-trained for English language transcription. For applications requiring specialist vocabularies or alternative languages, the neural model can easily be retrained with customers’ own bespoke datasets using the ML framework PyTorch.

Model Configurations

The solution supports two models: base and large of sizes 85M and 196M parameters respectively. These can be decoded with various configurations that trade off accuracy with latency and throughput. These trade-offs are described in more detail in the performance page but the 'fastest' and 'most accurate' configurations are summarized below:

Description	Model	Parameters	Decoding	Realtime streams (RTS)	p99 latency at max RTS	p99 latency at RTS=32	Huggingface Leaderboard WER
fastest	`base`	85M	greedy	2000	25 ms	15 ms	13.50%
most-accurate	`large`	196M	beam, width=4	500	40 ms	20 ms	11.59%

where:

Realtime streams (RTS) is the number of concurrent streams that can be serviced by a single accelerator using default settings
p99 latency is the 99th-percentile latency to process a single 60 ms audio frame and return any predictions. Note that latency increases with the number of concurrent streams.
WER is the Word Error Rate, a measure of the accuracy of the model. Lower is better.
Huggingface Leaderboard WER is the WER of the model on the Huggingface Open ASR Leaderboard. WER is averaged across 8 test datasets.

The solution scales linearly up to 8 accelerators, and a single server has been measured to support 16000 RTS with the base model.