Optimizing AI model parallelism parameters is often essential for achieving efficient training and scalability on highly distributed systems. It is equally critical to ensure optimal system and topology configuration for the training job. NCCL gives end users the flexibility to specify the underlying system topology and CPU affinity through the NCCL topology file, and it offers environment variables for obtaining optimal performance on a given platform. This blog post explains the intricacies of system topology and CPU pinning, presents experiments that demonstrate their benefits, and recommends the optimal configuration for Azure ND-series VMs.
(Co-authored by: Rafael Salas, Sreevatsa Anantharamu, Jithin Jose, Elizabeth Fatade)
Introduction
NCCL
NVIDIA Collective Communications Library (NCCL) is one of the most widely used communication libraries for AI training and inference. It features GPU-focused collective and point-to-point communication designs that are vital for AI workloads.
NUMA-GPU-HCA affinity
Non-uniform memory access (NUMA) architecture splits the CPU cores into groups, each attached to its own local memory. GPUs and HCAs are connected to a specific NUMA node via the PCIe interconnect. NCCL launches CPU processes to coordinate GPU-related inter-node communication, so process-to-NUMA binding matters a great deal for optimal performance. NCCL uses the Low Latency (LL) protocol for small-to-medium message communication. In this protocol, the GPU data is copied to a pinned CPU buffer before being sent over the network. Copying data from a GPU to a NUMA node that it is not directly connected to incurs additional inter-NUMA communication, which adds overhead. Furthermore, communicating this CPU buffer through a NIC attached to the neighboring NUMA node requires yet another inter-NUMA traversal.
The following diagram shows the system topology of an NVIDIA DGX H100 system as a reference.
To determine the CPU cores that belong to a NUMA node, use:
lscpu
To determine the GPU-to-NUMA affinity, use:
cat /sys/bus/pci/devices/<busIdOfGPU>/numa_node
or for NVIDIA GPUs, you can alternatively use:
nvidia-smi topo -m
To determine the HCA-to-NUMA mapping, use:
cat /sys/bus/pci/devices/<busIdOfHCA>/numa_node
or use:
lstopo
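The per-device affinity information can also be gathered in one pass with a small script. The following is a minimal sketch, assuming standard sysfs paths and that the data-center GPUs report PCI class 0x030200 (3D controller); adjust these assumptions to your platform.

#!/bin/bash
# Print the NUMA node of each GPU (identified by PCI class 0x030200)
echo "== GPUs =="
for dev in /sys/bus/pci/devices/*; do
  if [ "$(cat "$dev/class")" = "0x030200" ]; then
    echo "GPU $(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
  fi
done
# Print the NUMA node of each InfiniBand HCA
echo "== HCAs =="
for hca in /sys/class/infiniband/*; do
  echo "HCA $(basename "$hca") -> NUMA node $(cat "$hca/device/numa_node")"
done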
NCCL Topology file
Application developers can pass in the system topology by specifying the NCCL topology file when launching the job. The NCCL topology file provides the following information to the NCCL library:
- GPU-to-NUMA mapping
- GPU-to-HCA, HCA-to-NUMA mapping
- NUMA-to-CPU-core mapping
- Speed and type of GPU-GPU interconnect.
This enables NCCL to choose the most efficient communication paths for the given system topology. Azure ND-series VMs, such as the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other via NVLink. If you are using a VM image from the Azure AI and HPC Marketplace, the topology file is located in the /opt/microsoft directory. The topology files are also available in the azhpc-images GitHub repository under the topology directory.
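To sanity-check what a given topology file encodes, you can grep for the relevant XML tags. This is just an inspection sketch; the tag names (cpu, gpu, nic, net) follow the NCCL topology XML schema, and <topo-file.xml> is a placeholder for the file matching your SKU.

# List the CPU (NUMA), GPU, and NIC entries described by the topology file
grep -E "<(cpu|gpu|nic|net)" /opt/microsoft/<topo-file.xml>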
Performance experiments
This section presents results that demonstrate the benefit of correct CPU pinning via the NCCL topology file. We use the NCCL Tests benchmarks and run experiments for the different NCCL calls, comparing the performance of the default mpirun binding with the correct binding specified via the NCCL topology file.
Setup
The setup used for the experiments is:
- Two NDv5 nodes: 8 H100 GPUs per node, 8 400G NICs per node
- Azure VM image: microsoft-dsvm:ubuntu-hpc:2204:latest
- NCCL version: 2.25.1+cuda12.6
- MPI library: HPC-X v2.18
Impact of System Topology
In this section, we show the performance impact of system topology awareness in NCCL by comparing NCCL benchmark results for the default and topology-aware configurations.
For the default case (default mpirun binding), we use the following command line to launch the NCCL benchmarks:
mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1
For the system topology-aware case, we use the following command line to launch the NCCL benchmarks:
mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1
Here, hostfile is a file containing the IP addresses of the two nodes. NCCL_BENCHMARK is the NCCL benchmark name. We run experiments with all six benchmarks: all_reduce_perf, all_gather_perf, sendrecv_perf, reduce_scatter_perf, broadcast_perf, and alltoall_perf. <topo-file.xml> is the Azure SKU topology file that can be obtained as described in the "NCCL Topology File" section. NCCL topology files for different Azure HPC/AI SKUs are available here: Topology Files.
Topology file-based binding is enabled by setting the NCCL_TOPO_FILE variable to the path of the NCCL topology file and setting NCCL_IGNORE_CPU_AFFINITY to 1. Setting NCCL_IGNORE_CPU_AFFINITY to 1 is crucial for NCCL to assign the process affinity based solely on the NCCL topology file. If this variable is not set, NCCL honors both the affinity set by the MPI library and the NCCL topology file, setting the affinity to the intersection of the two sets. If the intersection is empty, NCCL simply retains the affinity set by the MPI library.
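One way to verify which CPU mask is actually applied is to run a benchmark with NCCL_DEBUG=INFO and inspect the affinity messages NCCL prints at initialization (the exact wording of these log lines may vary across NCCL versions). A minimal sketch based on the topology-aware command above:

# Run a short all_reduce benchmark and filter for the per-GPU affinity NCCL reports
mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/all_reduce_perf -b 4 -e 8M -g 1 -c 0 -f 2 -R 1 \
  2>&1 | grep -i affinity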
Additional NCCL Optimizations
We also list some additional NCCL tuning configurations for better performance. Note that these settings drive the benchmark performance to its peak, but it is important to fine-tune them for the end training application, as some of them impact SM utilization.
| Config | Value |
| --- | --- |
| NCCL_MIN_CHANNELS | 32 |
| NCCL_P2P_NET_CHUNKSIZE | 512K |
| NCCL_IB_QPS_PER_CONNECTION | 4 |
| NCCL_PXN_DISABLE | 1 |
Increasing NCCL_MIN_CHANNELS to 32 increases the throughput of certain collectives (especially ncclReduceScatter). Increasing NCCL_P2P_NET_CHUNKSIZE to 512K (from the default value of 128K) gives better throughput for ncclSend and ncclRecv calls when channel buffers are used for communication instead of user-registered buffers. Increasing NCCL_IB_QPS_PER_CONNECTION from 1 to 4 also slightly increases the throughput of collectives. Setting NCCL_PXN_DISABLE to 1 is essential to enable the zero-copy design for ncclSend and ncclRecv calls. Inter-node zero-copy designs are present in NCCL 2.24 onwards but are activated only if NCCL_PXN_DISABLE is set to 1 and the user buffers are registered with the NIC via NCCL calls (the "-R 1" flag registers user buffers in NCCL Tests). We found the bandwidth of the zero-copy point-to-point design to be around 10 GB/s higher than that of the copy-based design.
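As an illustration, a benchmark launch with these tuning knobs added to the topology-aware command from the previous section would look like the following sketch (same assumptions as before; tune the values for your own application):

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  -x NCCL_MIN_CHANNELS=32 \
  -x NCCL_P2P_NET_CHUNKSIZE=512K \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  -x NCCL_PXN_DISABLE=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1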
Results
Small message sizes
Impact on latency for small message sizes.

In the above figure, we compare the small-message latency of the different collectives using the NCCL topology file-based binding against the default mpirun binding. For all the NCCL benchmarks, we see consistently higher performance with the NCCL topology file-based binding. The reason for this improvement is that, for small message sizes, NCCL uses the Low Latency (LL) protocol, and this protocol uses pinned CPU buffers for inter-node communication.
In the LL protocol, the GPU copies the data to be sent over the network to the CPU buffer. The CPU thread polls this buffer to check whether all the flags are set, and once they are, it sends the data over the network. With the default mpirun pinning, all eight processes per node and their allocated memory reside on NUMA node 0. However, GPUs 0-3 have affinity to NUMA node 0 and GPUs 4-7 have affinity to NUMA node 1. Therefore, the data copied from GPUs 4-7 needs to traverse the NUMA interconnect to reach the pinned CPU buffer. Furthermore, while communicating the CPU buffer via the NIC closest to the GPU, the inter-NUMA interconnect needs to be traversed again. The figures above clearly show this additional overhead from inter-NUMA interconnect latency.
With NCCL topology file-based binding, this overhead does not exist. This is because the topology file contains information about the GPUs attached to each NUMA node and the CPU mask for all the cores in each NUMA node. Using this information, NCCL correctly binds the processes to their NUMA nodes.
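The cost of crossing the NUMA interconnect described above can also be observed directly on the node. A quick check, assuming numactl is installed, is to print the NUMA distance matrix, where larger values indicate a more expensive path between nodes:

# Show relative NUMA node distances (local access is reported as 10)
numactl --hardware | grep -A 3 "node distances"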
Medium message sizes
Impact on bandwidth for medium message sizes.

The medium-message bandwidth of the different collectives is compared in the above figure, using the NCCL topology file-based binding against the default mpirun binding. As with small message sizes, NCCL uses the LL protocol for most of these medium-message runs. Nearly all NCCL benchmarks show an improvement in bandwidth with the NCCL topology file-based binding. The reason for this improvement is the same as for the small message sizes.
Large message sizes
Impact on bandwidth for large message sizes.

The bandwidth for large message sizes is compared in the above figure for the different collectives, using the NCCL topology file-based binding against the default mpirun binding. As the message size grows, there is not much improvement in bandwidth from the NCCL topology file-based binding for any of the NCCL benchmarks. This is because the dominant NCCL protocol in this message-size range is SIMPLE. In this protocol, on systems with GPUDirect RDMA (which is the case for Azure ND-series SKUs), the GPU buffer is communicated directly over the network for inter-node communication; there is no intermediate copy of the message to a CPU buffer. In this scenario, GPU-CPU communication is used only to update a few flags for auxiliary tasks such as indicating buffer readiness. The increase in the latency of these auxiliary tasks is insignificant compared to the total message transfer time. Thus, the impact of CPU pinning on the achieved bandwidth diminishes for large message sizes.
Conclusion
This blog post describes the impact of system topology awareness and NCCL tuning for AI workload optimization and lists the relevant NCCL configuration options. The impact of topology awareness is significant for small and medium messages where NCCL uses the LL protocol. We also show the performance results comparing the different configurations and highlight the importance of performance tuning.
To summarize, the recommended configuration for NCCL (versions newer than v2.24) on Azure HPC/AI VMs (NDv4/NDv5) is:
| Config | Value |
| --- | --- |
| NCCL_MIN_CHANNELS | 32 |
| NCCL_P2P_NET_CHUNKSIZE | 512K |
| NCCL_IB_QPS_PER_CONNECTION | 4 |
| NCCL_PXN_DISABLE | 1 |
| NCCL_TOPO_FILE | <corresponding-topology-file.xml> |
| NCCL_IGNORE_CPU_AFFINITY | 1 |
Note
It is recommended to fine-tune these parameters for the final training application and the corresponding NCCL collectives being used, as some of these parameters impact SM utilization.
For Azure HPC/AI VMs, the path to this NCCL topology file is preconfigured in /etc/nccl.conf inside the VM image. For container runs inside the VM, it is recommended to mount both the NCCL topology file and the /etc/nccl.conf file from the VM into the container.
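As an illustrative sketch of that recommendation, the docker invocation below mounts both files into the container; the image name and training command are placeholders, and the flags should be adapted to your container setup.

# Mount the host's NCCL config and topology files into the container
docker run --gpus all --network host \
  -v /etc/nccl.conf:/etc/nccl.conf:ro \
  -v /opt/microsoft:/opt/microsoft:ro \
  <your-training-image> <your-training-command>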