Emission Latency

Emission latency (EL) is the time between the end of a spoken word in an audio file and the moment the model outputs the final token of that word. After receiving the final audio frame of a word, the model might not predict the word until it has heard a few more frames of audio. EL measures this delay.
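As a minimal illustration of the definition above, the per-word latency is the emission time minus the time at which the word ends in the audio. The function name and the timestamps below are hypothetical and chosen only for this sketch.

```python
def emission_latency(word_end_time: float, emit_time: float) -> float:
    """Delay in seconds between the end of a spoken word and the
    emission of the final token of that word."""
    return emit_time - word_end_time


# Hypothetical values: the word ends at 1.50 s in the audio and the
# model emits the final token of the word at 1.82 s.
latency = emission_latency(word_end_time=1.50, emit_time=1.82)
print(f"{latency:.2f} s")  # prints 0.32 s
```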

To calculate the model's EL during validation, pass the --calculate_emission_latency flag, e.g.

./scripts/val.sh --calculate_emission_latency

When this flag is enabled, CTM files containing model timestamps are exported to --output_dir.

Emission latencies are calculated by aligning the model-exported CTM files with the corresponding ground truth CTM files. Ground truth CTM files are expected to be in the same directory as the validation manifest or tar files and to share the same base name. For example, if --data_dir=/path/to/dataset and --val_manifests=data.json, the expected filepath of the ground truth CTM file is /path/to/dataset/data.ctm. See Forced Alignment for details on producing ground truth CTM files.
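The naming convention above can be sketched in a few lines of Python. The helper name ground_truth_ctm_path is hypothetical and not part of the repository. It only mirrors the rule described here: same directory, same base name, .ctm extension.

```python
from pathlib import Path


def ground_truth_ctm_path(data_dir: str, manifest: str) -> Path:
    """Return the expected ground truth CTM path for a manifest or tar file.

    Hypothetical helper: keeps the directory and base name of the input
    file and swaps its extension for .ctm.
    """
    return (Path(data_dir) / manifest).with_suffix(".ctm")


# Mirrors the example in the text: data.json maps to data.ctm.
print(ground_truth_ctm_path("/path/to/dataset", "data.json"))
```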

The script outputs the mean latency, as well as the 50th and 90th percentile latencies.

By default, latencies outside the range ±2 s are treated as outliers and excluded from all calculations.
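The reported statistics can be reproduced on a list of per-word latencies roughly as follows. This is a sketch under stated assumptions, not the script's implementation: the function names are hypothetical, the ±2 s bound is treated as inclusive, and percentiles use linear interpolation, which may differ from what the script does.

```python
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated q-th percentile (0 to 100) of pre-sorted values."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def summarise_latencies(latencies, max_abs=2.0):
    """Drop outliers beyond +-max_abs seconds, then return summary statistics."""
    # Assumption: a latency of exactly +-max_abs is kept.
    kept = sorted(x for x in latencies if abs(x) <= max_abs)
    if not kept:
        raise ValueError("no latencies left after outlier removal")
    return {
        "mean": statistics.fmean(kept),
        "p50": percentile(kept, 50),
        "p90": percentile(kept, 90),
    }


# Hypothetical latencies in seconds; 3.0 and -2.5 fall outside +-2 s.
stats = summarise_latencies([0.1, 0.2, 0.3, 0.4, 0.5, 3.0, -2.5])
print(stats)
```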

If model-exported CTM files and the corresponding ground truth CTM files already exist, the measure_latency.py script can be used instead of performing a complete validation run. To do so, run the script with paths to the ground truth and model CTM files:

python caiman_asr_train/latency/measure_latency.py --gt_ctm /path/to/ground_truth.ctm --model_ctm /path/to/model.ctm

To include substitution errors in latency calculations, add the --include_subs flag:

python caiman_asr_train/latency/measure_latency.py --gt_ctm /path/to/ground_truth.ctm --model_ctm /path/to/model.ctm --include_subs
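To make the effect of including substitutions concrete, the sketch below pairs ground truth words with model words and optionally keeps substituted words. It uses difflib.SequenceMatcher from the Python standard library as a stand-in aligner. This document does not describe how measure_latency.py performs its alignment, so the function name and logic here are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def align_words(gt_words, model_words, include_subs=False):
    """Return (gt_index, model_index) pairs of words to compare.

    Correctly recognised words are always paired. Substitutions are
    paired only when include_subs is True and the mismatched runs
    have equal length, so the words can be matched one to one.
    """
    matcher = SequenceMatcher(a=gt_words, b=model_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        is_sub = tag == "replace" and (i2 - i1) == (j2 - j1)
        if tag == "equal" or (include_subs and is_sub):
            pairs.extend(zip(range(i1, i2), range(j1, j2)))
    return pairs


# Hypothetical transcripts: the model substitutes "hat" for "cat".
gt = ["the", "cat", "sat", "down"]
hyp = ["the", "hat", "sat", "down"]
print(align_words(gt, hyp))                     # substitution skipped
print(align_words(gt, hyp, include_subs=True))  # substitution paired
```

Each returned pair identifies a ground truth word and a model word whose timestamps could then be compared. Insertions and deletions have no counterpart and are never paired in this sketch.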

To export a scatter plot of EL against time from the start of the sequence, pass a filepath to the optional --output_img_path argument e.g.

python caiman_asr_train/latency/measure_latency.py --gt_ctm /path/to/ground_truth.ctm --model_ctm /path/to/model.ctm --output_img_path /path/to/img.png

EL logging is also compatible with val_multiple.sh.

Forced Alignment

The script forced_align.py is used to align audio recordings with their corresponding transcripts, producing ground truth timestamps for each word.

To perform forced alignment, execute the script with the required arguments e.g.

python caiman_asr_train/latency/forced_align.py --dataset_dir /path/to/dataset --manifests data.json

By default, CTM files are exported to the same location as the manifest files and share the same base name. For example, if --dataset_dir /path/to/dataset and --manifests data.json, the default filepath of the CTM file is /path/to/dataset/data.ctm. The directory to which CTM files are saved can be changed as follows:

python caiman_asr_train/latency/forced_align.py --dataset_dir /path/to/dataset --manifests manifest.json --output_dir /custom/output/directory

Multiple manifest files can be passed to the script e.g.

python caiman_asr_train/latency/forced_align.py --dataset_dir /path/to/dataset --manifests manifest1.json manifest2.json

The script also supports one or more tar files:

python caiman_asr_train/latency/forced_align.py --read_from_tar --tar_files data1.tar data2.tar --dataset_dir /path/to/dataset

Both absolute and relative paths are accepted for --manifests and --tar_files.

By default, utterances are split into 5-minute segments. This makes it possible to perform forced alignment on datasets with very long utterances (e.g. Earnings21) without encountering memory issues. Most datasets have utterances shorter than 5 minutes and are therefore unaffected. To change the segment length, pass the optional --segment_len argument with an integer number of minutes e.g.

python caiman_asr_train/latency/forced_align.py --segment_len 15 --dataset_dir /path/to/dataset --manifests data.json
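The segmentation idea can be sketched as computing consecutive time windows over an utterance. The function below is hypothetical and illustrates only the splitting arithmetic. How forced_align.py actually cuts and rejoins segments is not described here.

```python
def segment_bounds(duration_s, segment_len_min=5):
    """Return (start, end) times in seconds for consecutive segments.

    The last segment is shorter when the duration is not a multiple
    of the segment length.
    """
    segment_s = segment_len_min * 60
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


# A hypothetical 12-minute utterance yields two full 5-minute segments
# and a 2-minute remainder; a 2-minute utterance is left as one segment.
print(segment_bounds(720))
print(segment_bounds(120))
```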

To run forced alignment on the CPU, pass the --cpu flag:

python caiman_asr_train/latency/forced_align.py --cpu --dataset_dir /path/to/dataset --manifests data.json

CTM

CTM (Conversation Time Mark) is a space-separated file format in which each line contains the entries:

<recording_id> <channel_id> <token_start_time> <token_duration> <token_value>

and an optional sixth entry <confidence_score>. CTM allows either token-level or word-level timestamps.
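A line in this format can be parsed as sketched below. The class and function names are hypothetical, and the example line is invented for illustration. Adding the duration to the start time gives the time at which the token ends.

```python
from typing import NamedTuple, Optional


class CtmEntry(NamedTuple):
    recording_id: str
    channel_id: str
    start_time: float  # seconds
    duration: float    # seconds
    token: str
    confidence: Optional[float] = None


def parse_ctm_line(line: str) -> CtmEntry:
    """Parse one space-separated CTM line with five or six fields."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}: {line!r}")
    recording_id, channel_id, start, duration, token = fields[:5]
    confidence = float(fields[5]) if len(fields) == 6 else None
    return CtmEntry(recording_id, channel_id, float(start), float(duration),
                    token, confidence)


# Invented example line: recording "utt1", channel "1", a token starting
# at 0.48 s and lasting 0.32 s.
entry = parse_ctm_line("utt1 1 0.48 0.32 hello")
end_time = entry.start_time + entry.duration
print(entry.token, f"{end_time:.2f}")
```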