AI Infrastructure
Running DeepSeek-R1 on a single NDv5 MI300X VM
Contributors: Davide Vanzo, Yuval Mazor, Jesse Lopez

DeepSeek-R1 is an open-weights reasoning model built on DeepSeek-V3, designed for conversational AI, coding, and complex problem-solving. It has gained significant attention beyond the AI/ML community due to its strong reasoning capabilities, often competing with OpenAI's models. One of its key advantages is that it can be run locally, giving users full control over their data.

The NDv5 MI300X VM features 8x AMD Instinct MI300X GPUs, each equipped with 192 GB of HBM3 and interconnected via Infinity Fabric 3.0. With up to 5.2 TB/s of memory bandwidth per GPU, the MI300X provides the capacity and speed needed to process large models efficiently, enabling users to run DeepSeek-R1 at full precision on a single VM.

In this blog post, we'll walk you through the steps to provision an NDv5 MI300X instance on Azure and run DeepSeek-R1 for inference using the SGLang inference framework.

Launching an NDv5 MI300X VM

Prerequisites

- Check that your subscription has sufficient vCPU quota for the ND MI300X v5 VM family (see the quota documentation). If needed, contact your Microsoft account representative to request a quota increase.
- A Bash terminal with the Azure CLI installed and logged into the appropriate tenant. Alternatively, Azure Cloud Shell can be used.

Provision the VM

1. Using the Azure CLI, create an Ubuntu 22.04 VM on ND_MI300X_v5:

az group create --location <REGION> -n <RESOURCE_GROUP_NAME>

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH>

Optionally, the deployment can use the cloud-init.yaml file shown below, passed as --custom-data <CLOUD_INIT_FILE_PATH>, to automate the additional preparation described in the next section:

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH> --custom-data <CLOUD_INIT_FILE_PATH>

Note: The GPU drivers may take a couple of minutes to load completely after the VM is first created.

Additional preparation

Beyond provisioning the VM, there are additional steps to prepare the environment to run DeepSeek-R1, or other AI workloads, optimally. This includes setting up the node's 8 NVMe disks in a RAID-0 configuration to act as the cache location for Docker and Hugging Face. The following steps assume you are connected to the VM and working in a Bash shell.

1. Prepare the NVMe disks in a RAID-0 configuration:

mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme

2. Configure Hugging Face to use the RAID-0. This environment variable must also be propagated to any containers pulling images or data from Hugging Face:

mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
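Before moving on to the Docker configuration, it can help to confirm that the array is assembled and the cache location is set. A quick, hedged sanity check using standard Linux tools (not part of the original steps):

cat /proc/mdstat            # /dev/md128 should be active across the eight nvme*n1 devices
df -h /mnt/resource_nvme    # the XFS volume should be mounted here
echo $HF_HOME               # should print /mnt/resource_nvme/hf_cache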
3. Configure Docker to use the RAID-0:

mkdir -p /mnt/resource_nvme/docker

sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
  "data-root": "/mnt/resource_nvme/docker"
}
EOF

sudo chmod 0644 /etc/docker/daemon.json
sudo systemctl restart docker

All of these additional preparation steps can be automated at VM creation using cloud-init. The example cloud-init.yaml file below can be passed when provisioning the VM as described above.

#cloud-config
package_update: true

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      NVME_DISKS_NAME=`ls /dev/nvme*n1`
      NVME_DISKS=`ls -latr /dev/nvme*n1 | wc -l`
      echo "Number of NVMe Disks: $NVME_DISKS"
      if [ "$NVME_DISKS" == "0" ]
      then
          exit 0
      else
          mkdir -p /mnt/resource_nvme
          # Needed in case something did not unmount as expected. This will delete any data that may be left behind.
          mdadm --stop /dev/md*
          mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
          mkfs.xfs -f /dev/md128
          mount /dev/md128 /mnt/resource_nvme
      fi
      chmod 1777 /mnt/resource_nvme
  - path: /etc/profile.d/hf_home.sh
    permissions: '0755'
    content: |
      export HF_HOME=/mnt/resource_nvme/hf_cache
  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - mkdir -p /mnt/resource_nvme/docker
  - mkdir -p /mnt/resource_nvme/hf_cache
  # PAM group not working for docker group, so this will add all users to docker group
  - bash -c 'for USER in $(ls /home); do usermod -aG docker $USER; done'
  - systemctl restart docker

Using MI300X

If you are familiar with NVIDIA's CUDA tools and environment, AMD provides equivalents as part of the ROCm stack.

MI300X + ROCm    NVIDIA + CUDA    Description
rocm-smi         nvidia-smi       CLI for monitoring the system and making changes
rccl             nccl             Library for communication between GPUs

Running DeepSeek-R1

1. Pull the container image. It is on the order of 10 GB in size, so it may take a few minutes to download:

docker pull rocm/sglang-staging:20250303

2. Start the SGLang server. The model (~642 GB) is downloaded the first time the server is launched and will take at least a few minutes to download. Once the application outputs "The server is fired up and ready to roll!", you can begin making queries to the model.

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sglang-staging:20250303 \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --host 0.0.0.0

3. You can now make queries to DeepSeek-R1. For example, the following requests, issued from another shell on the same host, return the model metadata and generate a sample response:

curl http://localhost:30000/get_model_info

{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}

curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Once upon a time,",
  "sampling_params": {
    "max_new_tokens": 16,
    "temperature": 0.6
  }
}'
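SGLang also exposes an OpenAI-compatible API on the same port, which is convenient for existing client libraries. The request below is a hedged sketch, not a command from the original post: the endpoint and fields follow the standard chat-completions format, and the model name may need to be adjusted to whatever your SGLang version reports.

curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1",
  "messages": [{"role": "user", "content": "Explain RAID-0 in one sentence."}],
  "max_tokens": 64,
  "temperature": 0.6
}'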
Conclusion

In this post, we detailed how to run the full-size 671B DeepSeek-R1 model on a single Azure NDv5 MI300X instance. This includes setting up the machine, installing the necessary drivers, and executing the model. Happy inferencing!

References

- https://github.com/deepseek-ai/DeepSeek-R1
- https://github.com/deepseek-ai/DeepSeek-V3
- https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html
- https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/azure-announces-new-ai-optimized-vm-series-featuring-amd%e2%80%99s-flagship-mi300x-gpu/3980770
- https://docs.sglang.ai/index.html

Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology File
(Co-authored by: Rafael Salas, Sreevatsa Anantharamu, Jithin Jose, Elizabeth Fatade)

Introduction

NCCL

The NVIDIA Collective Communications Library (NCCL) is one of the most widely used communication libraries for AI training and inference. It features GPU-focused collective and point-to-point communication designs that are vital for AI workloads.

NUMA-GPU-HCA affinity

Non-uniform memory access (NUMA) architecture splits the CPU cores into groups attached to their own local memory. GPUs and HCAs are connected to a specific NUMA node via the PCIe interconnect. NCCL launches CPU processes to coordinate GPU-related inter-node communication, so process-to-NUMA binding matters a great deal for optimal performance.

NCCL uses the Low Latency (LL) protocol for small-to-medium message communication. In this protocol, GPU data is copied to a pinned CPU buffer before being sent over the network. Copying data from a GPU to a NUMA node that it is not directly connected to incurs additional inter-NUMA communication overhead. Furthermore, communicating this CPU buffer through a NIC connected to the neighboring NUMA node also requires additional inter-NUMA communication.

The following diagram shows the system topology of an NVIDIA DGX H100 system as a reference.

H100 system topology (figure from https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html)

To determine the CPU cores that belong to a NUMA node, use:

lscpu

To determine the GPU-to-NUMA affinity, use:

cat /sys/bus/pci/devices/<busIdOfGPU>/numa_node

or, for NVIDIA GPUs, alternatively:

nvidia-smi topo -m

To determine the HCA-to-NUMA mapping, use:

cat /sys/bus/pci/devices/<busIdOfHCA>/numa_node

or:

lstopo

NCCL Topology file

Application developers can pass the system topology to NCCL by specifying an NCCL topology file when launching the job. The topology file provides the following information to the NCCL library:

- GPU-to-NUMA mapping
- GPU-to-HCA and HCA-to-NUMA mapping
- NUMA-to-CPU-core mapping
- Speed and type of the GPU-GPU interconnect

This enables NCCL to choose the most efficient communication paths for the system topology. Azure ND-series VMs, like the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. If you are using a VM image from the Azure AI and HPC Marketplace, the topology file is located in the /opt/microsoft directory. The topology files are also available in the azhpc-images GitHub repository under the topology directory.

Performance experiments

This section presents results that demonstrate the benefit of correct CPU pinning via the NCCL topology file. We use the NCCL Tests benchmarks and run experiments for the different NCCL calls, comparing the performance of the default mpirun binding to the correct binding specified via the NCCL topology file.

Setup

The setup used for the experiments is:

- Two NDv5 nodes: 8 H100 GPUs per node, 8 400G NICs per node
- Azure VM image: microsoft-dsvm:ubuntu-hpc:2204:latest
- NCCL version: 2.25.1+cuda12.6
- MPI library: HPC-X v2.18
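Before running the benchmarks, it can be useful to confirm the GPU and HCA NUMA placement on each node using the sysfs queries shown earlier. The helper below is a hedged sketch, not part of the original post; 0x10de and 0x15b3 are the standard PCI vendor IDs for NVIDIA and Mellanox hardware.

#!/bin/bash
# Hedged helper: print the NUMA node of every NVIDIA or Mellanox PCI function
# (GPUs, HCAs, and related devices) by reading the sysfs attributes above.
for dev in /sys/bus/pci/devices/*; do
    vendor=$(cat "$dev/vendor")
    case "$vendor" in
        0x10de|0x15b3)   # 0x10de = NVIDIA, 0x15b3 = Mellanox
            echo "$(basename "$dev") vendor=$vendor numa_node=$(cat "$dev/numa_node")"
            ;;
    esac
done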
Impact of System Topology

In this section, we show the performance impact of system topology awareness in NCCL, comparing NCCL benchmark results for the default and topology-aware configurations.

For the default case (default mpirun binding), we use the following command line to launch the NCCL benchmarks:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

For the system topology-aware case, we use the following command line:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

Here, hostfile is a file containing the IP addresses of the two nodes, and NCCL_BENCHMARK is the NCCL benchmark name. We run experiments with all six benchmarks: all_reduce_perf, all_gather_perf, sendrecv_perf, reduce_scatter_perf, broadcast_perf, and alltoall_perf. <topo-file.xml> is the Azure SKU topology file, obtained as described in the "NCCL Topology file" section; NCCL topology files for different Azure HPC/AI SKUs are available here: Topology Files.

Topology file-based binding is enabled by setting the NCCL_TOPO_FILE variable to the path of the NCCL topology file and setting NCCL_IGNORE_CPU_AFFINITY to 1. Setting NCCL_IGNORE_CPU_AFFINITY to 1 is crucial so that NCCL assigns process affinity based solely on the NCCL topology file. If this variable is not set, NCCL honors both the affinity set by the MPI library and the NCCL topology file by setting the affinity to the intersection of the two sets; if the intersection is empty, NCCL simply retains the affinity set by the MPI library.

Additional NCCL Optimizations

We also list some additional NCCL tuning configurations for better performance. Note that these drive the benchmark performance to its peak, but it is important to fine-tune these parameters for the end training application, as some of them affect SM utilization.

Config                        Value
NCCL_MIN_CHANNELS             32
NCCL_P2P_NET_CHUNKSIZE        512K
NCCL_IB_QPS_PER_CONNECTION    4
NCCL_PXN_DISABLE              1

Increasing NCCL_MIN_CHANNELS to 32 increases the throughput of certain collectives (especially ncclReduceScatter). Increasing NCCL_P2P_NET_CHUNKSIZE to 512K (from the default value of 128K) gives better throughput for ncclSend and ncclRecv calls when channel buffers are used instead of user-registered buffers for communication. Increasing NCCL_IB_QPS_PER_CONNECTION from 1 to 4 also slightly increases the throughput of collectives. Setting NCCL_PXN_DISABLE to 1 is essential to enable the zero-copy design for ncclSend and ncclRecv calls. Inter-node zero-copy designs are present in NCCL 2.24 onwards but are activated only if NCCL_PXN_DISABLE is set to 1 and the user buffers are registered with the NIC via NCCL calls (the "-R 1" flag registers user buffers in the NCCL tests). We found the bandwidth of the zero-copy point-to-point design to be around 10 GB/s higher than that of the copy-based design.
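These tunings can also be persisted in an NCCL configuration file instead of being exported per job. The snippet below is a hedged sketch: it appends the values from the table to /etc/nccl.conf, the system-wide configuration file that the Azure HPC images already ship (see the summary at the end of this post). NCCL_P2P_NET_CHUNKSIZE is given here in bytes (524288 = 512K).

sudo tee -a /etc/nccl.conf > /dev/null <<EOF
NCCL_MIN_CHANNELS=32
NCCL_P2P_NET_CHUNKSIZE=524288
NCCL_IB_QPS_PER_CONNECTION=4
NCCL_PXN_DISABLE=1
EOF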
Results

Small message sizes

In the above figure, we compare the small-message latency of the different collectives using the NCCL topology file-based binding against the default mpirun binding. For all the NCCL benchmarks, we see consistently higher performance with the NCCL topology file-based binding. The reason for this improvement is that, for small message sizes, NCCL uses the Low Latency (LL) protocol, and this protocol uses pinned CPU buffers for inter-node communication. In the LL protocol, the GPU copies the data to be sent over the network into the CPU buffer. The CPU thread polls this buffer to check whether all the flags are set, and once they are, the thread sends the data over the network.

With the default mpirun pinning, all eight processes per node and their allocated memory are on NUMA node 0. However, GPUs 0-3 have affinity to NUMA node 0 and GPUs 4-7 have affinity to NUMA node 1. Therefore, the data copied from GPUs 4-7 must traverse the NUMA interconnect to reach the pinned CPU buffer. Furthermore, while communicating the CPU buffer via the NIC closest to the GPU, the inter-NUMA interconnect must be traversed again. The figures above clearly show this additional overhead from inter-NUMA interconnect latency. With NCCL topology file-based binding, this overhead does not exist, because the topology file lists the GPUs attached to each NUMA node and the CPU mask for all the cores in a NUMA node. Using this information, NCCL correctly binds the processes to the NUMA nodes.

Medium message sizes

The medium-message bandwidth is compared in the above figure for the different collectives, again using the NCCL topology file-based binding versus the default mpirun binding. As with small message sizes, NCCL uses the LL protocol for most of these medium-message runs. Nearly all NCCL benchmarks show an improvement in bandwidth with the NCCL topology file-based binding, for the same reason as for small message sizes.

Large message sizes

The bandwidth for large message sizes is compared in the above figure for the different collectives, using the NCCL topology file-based binding versus the default mpirun binding. As the message sizes get larger, there is little improvement in bandwidth from the NCCL topology file-based binding across the NCCL benchmarks. This is because the dominant NCCL protocol in this message-size range is SIMPLE. In this protocol, on systems with GPUDirect RDMA (which is the case for Azure ND-series SKUs), the GPU buffer is communicated directly over the network for inter-node communication; there is no intermediate copy of the message to a CPU buffer. In this scenario, GPU-CPU communication is used only to update a few flags for auxiliary tasks such as indicating buffer readiness. The increase in the latency of these auxiliary tasks is insignificant compared to the total message transfer time. Thus, the impact of CPU pinning on the achieved bandwidth diminishes for large message sizes.

Conclusion

This blog post describes the impact of system topology awareness and NCCL tuning on AI workload performance and lists the relevant NCCL configuration options. The impact of topology awareness is significant for small and medium messages, where NCCL uses the LL protocol. We also show performance results comparing the different configurations and highlight the importance of performance tuning.
To summarize, the recommended configuration for NCCL (versions newer than v2.24) on Azure HPC/AI VMs (NDv4/NDv5) is:

Config                        Value
NCCL_MIN_CHANNELS             32
NCCL_P2P_NET_CHUNKSIZE        512K
NCCL_IB_QPS_PER_CONNECTION    4
NCCL_PXN_DISABLE              1
NCCL_TOPO_FILE                <corresponding-topology-file.xml>
NCCL_IGNORE_CPU_AFFINITY      1

For Azure HPC/AI VMs, the NCCL topology file is already referenced in /etc/nccl.conf inside the VM image. For container runs inside the VM, it is recommended to mount both the NCCL topology file and the /etc/nccl.conf file from the VM into the container.
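For example, a container launch might mount these files as shown below. This is a hedged sketch rather than a command from the original post: the image name and command are placeholders, /opt/microsoft is the topology-file directory mentioned earlier, and GPU access flags depend on your container runtime setup.

docker run --rm --gpus all --ipc=host \
  -v /etc/nccl.conf:/etc/nccl.conf:ro \
  -v /opt/microsoft:/opt/microsoft:ro \
  <your-training-image> <your-training-command>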
Running DeepSeek on AKS with Ollama

Introduction

This guide provides step-by-step instructions on how to run DeepSeek on Azure Kubernetes Service (AKS). The setup uses an ND-H100-v5 VM to accommodate the 4-bit quantized 671-billion-parameter model on a single node.

Prerequisites

Before proceeding, ensure you have an AKS cluster with a node pool containing at least one ND-H100-v5 instance, and make sure the NVIDIA GPU drivers are properly set up. Important: set the OS disk size to 1024 GB to accommodate the model. For detailed instructions on creating an AKS cluster, refer to this guide.

Deploying Ollama with DeepSeek on AKS

We will use the PyTorch container from NVIDIA to deploy Ollama. First, configure authentication on AKS.

Create an NGC API Key

Visit NGC API Key Setup and generate an API key.

Create a Secret in the AKS Cluster

Run the following command to create a Kubernetes secret for authentication:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY>

Deploy the Pod

Use the following Kubernetes manifest to deploy Ollama, download the model, and start serving it:

apiVersion: v1
kind: Pod
metadata:
  name: ollama
spec:
  imagePullSecrets:
    - name: nvcr-secret
  containers:
    - name: ollama
      image: nvcr.io/nvidia/pytorch:24.09-py3
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /dev/shm
          name: shmem
      resources:
        requests:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
        limits:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
      command:
        - bash
        - -c
        - |
          export DEBIAN_FRONTEND=noninteractive
          export OLLAMA_MODELS=/mnt/data/models
          mkdir -p $OLLAMA_MODELS
          apt update
          apt install -y curl pciutils net-tools
          # Install Ollama
          curl -fsSL https://ollama.com/install.sh | sh
          # Start Ollama server in the background
          ollama serve 2>&1 | tee /mnt/data/ollama.log &
          # Wait for Ollama to fully start before pulling the model
          sleep 5
          # Fix parameter issue (https://github.com/ollama/ollama/issues/8599)
          cat >Modelfile <<EOF
          FROM deepseek-r1:671b
          PARAMETER num_ctx 24576
          PARAMETER num_predict 8192
          EOF
          ollama create deepseek-r1:671b-fixed -f Modelfile
          # Pull model
          ollama pull deepseek-r1:671b-fixed
          # Keep the container running
          wait
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 128Gi
      name: shmem

Connecting to Ollama

By default, this setup does not create a service. You can either define one or use port forwarding to connect to Ollama from your local machine. To forward the port, run:

kubectl port-forward pod/ollama 11434:11434

Now you can interact with the DeepSeek reasoning model using your favorite chat client. For example, in Chatbox, you can ask: "Tell me a joke about large language models."
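With the port-forward in place, you can also query the model directly over Ollama's REST API. A hedged example (the /api/generate endpoint is standard Ollama; adjust the model name if you chose a different tag than the one created above):

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b-fixed",
  "prompt": "Tell me a joke about large language models.",
  "stream": false
}'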
Deploying ZFS Scratch Storage for NVMe on Azure Kubernetes Service (AKS)

This guide demonstrates how to use ZFS LocalPV to efficiently manage the NVMe storage available on Azure NDv5 H100 VMs. Equipped with eight 3.5 TB NVMe disks, these VMs are tailored for high-performance workloads like AI/ML and large-scale data processing. By combining the flexibility of AKS with the advanced storage capabilities of ZFS, you can dynamically provision stateful node-local volumes while aggregating the NVMe disks for optimal performance.
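As a rough illustration of the approach (a hedged sketch under assumed names, not steps from the original guide): the NVMe disks on each node can be striped into a single zpool, and a ZFS LocalPV StorageClass can then provision node-local volumes out of that pool. The pool and class names below (nvmepool, zfs-nvme) are placeholders, and the provisioner name follows the OpenEBS ZFS LocalPV convention.

# On each NVMe node (e.g. via a privileged DaemonSet or node shell):
# create a striped pool across the NVMe disks.
zpool create -f nvmepool $(ls /dev/nvme*n1)

# StorageClass that asks ZFS LocalPV to provision volumes from that pool.
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-nvme
provisioner: zfs.csi.openebs.io
parameters:
  poolname: "nvmepool"
  fstype: "zfs"
EOF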
Optimizing Language Model Inference on Azure

Poorly optimized inference can lead to skyrocketing costs for customers, making it crucial to establish clear performance benchmarks. This blog sets the standard for expected performance, helping customers make informed decisions that maximize efficiency and minimize expense with the new Azure ND H200 v5-series.