You can turn on profiling by passing --profiler in your training command. Note that profiling will likely slow down training and is intended as a debugging feature.

Some of the profiling results are only saved after the train completes so it is necessary to avoid killing with Ctrl + C if you want to record the full profiling results. It is recommended to profile a small number of --training_steps. Also, set --n_utterances_only [N_UTTERANCES_ONLY] to sample from the training dataset.

Profiling results will be saved in [output_dir]/benchmark/. This consists of:

  • yappi logs named program[rank]_[timestamp].prof. These can be viewed via SnakeViz:

    Launch a container with the command SNAKEVIZ_PORT=[an unused port] ./scripts/docker/ .... Inside the container, run

    ./scripts/profile/launch_snakeviz.bash /results/benchmark/program[rank]_[timestamp].prof

    This will print an interactive URL that you can view in a web browser.

  • top logs named top_log_[timestamp].html. These can be viewed outside the container using a web browser.

  • nvidia-smi text logs named nvidia_smi_log_[timestamp].txt.

  • Manual timings of certain parts of the training loop for each training step constituting an epoch. These are text files named timings_stepN_rankM_[timestamp].txt.

  • system information in system_info_[timestamp].txt.

SnakeViz note

The SnakeViz port defaults to 64546. If this clashes with an existing port, set a new value for the environment variable SNAKEVIZ_PORT when starting Docker with

Sending results

In order to share debug information with please run the following script:

OUTPUT_DIR=/<results dir to share> TAR_FILE=logs_to_share.tar.gz ./scripts/tar_logs_exclude_ckpts.bash

This will compress the logs excluding any checkpoints present in OUTPUT_DIR. The resulting logs_to_share.tar.gz file can be shared with or another third-party.