Profiling

You can turn on profiling by passing --profiler in your training or validation command. Note that profiling will likely slow down the script and is intended as a debugging feature.

Some of the profiling results are only saved after the script completes so it is necessary to avoid killing with Ctrl + C if you want to record the full profiling results. As such, when profiling training it is recommended to:

  • profile a small number of --training_steps
  • set --n_utterances_only [N_UTTERANCES_ONLY] to sample from the training dataset.

Similarly, when profiling validation it is recommended to use --nth_batch_only=<batch idx>

Profiling results will be saved in [output_dir]/benchmark/. This consists of:

  • yappi logs named program[rank]_[timestamp].prof. These can be viewed via SnakeViz:

    Launch a container with the command SNAKEVIZ_PORT=[an unused port] ./scripts/docker/launch.sh .... Inside the container, run

    ./scripts/profile/launch_snakeviz.bash /results/benchmark/program[rank]_[timestamp].prof
    

    This will print an interactive URL that you can view in a web browser.

  • top logs named top_log_[timestamp].html. These can be viewed outside the container using a web browser.

  • nvidia-smi text logs named nvidia_smi_log_[timestamp].txt.

  • Manual timings of certain parts of the training loop for each training step constituting an epoch. These are text files named timings_stepN_rankM_[timestamp].txt.

  • system information in system_info_[timestamp].txt.

Sending results

In order to share debug information with Myrtle.ai please run the following script:

OUTPUT_DIR=/<results dir to share> TAR_FILE=logs_to_share.tar.gz ./scripts/tar_logs_exclude_ckpts.bash

This will compress the logs excluding any checkpoints present in OUTPUT_DIR. The resulting logs_to_share.tar.gz file can be shared with Myrtle.ai or another third-party.