Training on heterogeneous CPUs

Modern desktop computers may have heterogeneous CPUs, i.e. a mix of performance cores and slower efficiency cores. If you launch DALI with the default number of cores, training can slow down because the DALI pipeline is bottlenecked by the much slower efficiency cores. This effect can be quite pronounced. For example, on a 13th Gen Intel(R) Core(TM) i7-13700K, training with 24 DALI threads runs at 350 UTT/s, whereas training on 8 cores runs at close to 500 UTT/s. To determine the correct number of cores, run lstopo --of console (you may need to apt install hwloc first), then set --dali_processes_per_cpu= as appropriate.
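If you would rather not read the lstopo output by eye, the performance-core count can also be estimated from sysfs. The following is a minimal sketch, not part of this repository. It assumes a Linux kernel that exposes hybrid Intel CPUs under /sys/devices/cpu_core/cpus, which may not be the case on your system; the helper names parse_cpu_list and performance_core_count are hypothetical. It only reports a count, so check how that number maps onto --dali_processes_per_cpu for your setup.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def performance_core_count(
    hybrid_path="/sys/devices/cpu_core/cpus",  # assumption: only present on hybrid CPUs
    topo_root="/sys/devices/system/cpu",
):
    """Return the number of physical performance cores, or None if not exposed."""
    hybrid_file = Path(hybrid_path)
    if not hybrid_file.exists():
        return None  # not a hybrid CPU, or the kernel does not expose this file

    # Hyperthreads of one physical core share the same sibling list,
    # so the number of distinct sibling lists is the physical core count.
    sibling_groups = set()
    for cpu in parse_cpu_list(hybrid_file.read_text()):
        siblings = Path(topo_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
        if siblings.exists():
            sibling_groups.add(siblings.read_text().strip())
        else:
            sibling_groups.add(str(cpu))  # fall back to one core per logical CPU
    return len(sibling_groups)


if __name__ == "__main__":
    count = performance_core_count()
    if count is None:
        print("No hybrid core information found; use lstopo --of console instead.")
    else:
        print(f"Physical performance cores: {count}")
```

If the script finds no hybrid information, fall back to lstopo --of console as described above.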