Experience Next-Gen HPC Innovation: AMD Lab Empowers ‘Try Before You Buy’ on Azure
In today’s fast-paced digital landscape, High-Performance Computing (HPC) is a critical engine powering innovation across industries—from automotive and aerospace to energy and manufacturing. To keep pace with escalating performance demands and the need for agile, risk-free testing environments, AMD has partnered with Microsoft and leading Independent Software Vendors (ISVs) to introduce the AMD HPC Innovation Lab. This pioneering sandbox environment on Azure is a “try before you buy” solution designed to empower customers to run their HPC workloads, assess performance, and experience AMD’s newest hardware innovations that deliver enhanced performance, scalability, and consistency—all without any financial commitments.

Introducing the AMD Innovation Lab: A New Paradigm in Customer Engagement

The AMD HPC Innovation Lab represents a paradigm shift in customer engagement for HPC solutions. Traditionally, organizations had to invest significant time and resources to build and manage on-premises testing environments, dealing with challenges such as hardware maintenance, scalability issues, and high operational costs. Without the opportunity to fully explore the benefits of cloud solutions through a trial offer, they often missed out on the advantages of cloud computing. With this innovative lab, customers now have the opportunity to experiment with optimized HPC environments in a simple, user-friendly interface. The process is straightforward: upload your input file or choose from the pre-configured options, run your workload, and then download your output file for analysis. This streamlined approach allows businesses to compare performance results on an apples-to-apples basis against other providers or existing on-premises setups.

Empowering Decision Makers

For Business Decision Makers (BDMs) and Technical Decision Makers (TDMs), the lab offers a compelling value proposition. It eliminates the complexities and uncertainties often associated with traditional testing environments by providing a risk-free opportunity to:

- Thoroughly Evaluate Performance: With access to AMD’s cutting-edge chipsets and Azure’s robust cloud infrastructure, organizations can conduct detailed proof-of-concept evaluations without incurring long-term costs.
- Accelerate Decision-Making: The streamlined testing process not only speeds up the evaluation phase but also accelerates the overall time to value, enabling organizations to make informed decisions quickly.
- Optimize Infrastructure: Created in partnership with ISVs and optimized by both AMD and Microsoft, the lab ensures that the infrastructure is fine-tuned for HPC workloads. This guarantees that performance assessments are both accurate and reflective of real-world scenarios.

Seamless Integration with Leading ISVs

A notable strength of the AMD HPC Innovation Lab is its collaborative design with top ISVs like Ansys, Altair, Siemens, and others. These partnerships ensure that the lab’s environment is equipped with industry-leading applications and solvers, such as Ansys Fluent for fluid dynamics and Ansys Mechanical for structural analysis. Each solver is pre-configured to provide a balanced and consistent performance evaluation, ensuring that users can benchmark their HPC workloads against industry standards with ease.

Sustainability and Scalability

Beyond performance and ease-of-use, the AMD HPC Innovation Lab is built with sustainability in mind.
By leveraging Azure’s scalable cloud infrastructure, businesses can conduct HPC tests without the overhead and environmental impact of maintaining additional on-premises resources. This not only helps reduce operational costs but also supports corporate sustainability goals by minimizing the carbon footprint associated with traditional HPC setups.

An Exciting Future for HPC Testing

The innovation behind the AMD HPC Innovation Lab is just the beginning. With plans to continuously expand the lab catalog and include more ISVs, the platform is set to evolve as a comprehensive testing ecosystem. This ongoing expansion will provide customers with an increasingly diverse set of tools and environments tailored to meet a wide array of HPC needs. Whether you’re evaluating performance for fluid dynamics, structural simulations, or electromagnetic fields, the lab’s growing catalog promises to deliver precise and actionable insights.

Ready to Experience the Future of HPC?

The AMD HPC Innovation Lab on Azure offers a unique and exciting opportunity for organizations looking to harness the power of advanced computing without upfront financial risk. With its intuitive interface, optimized infrastructure, and robust ecosystem of ISVs, this sandbox environment is a game-changer in HPC testing and validation. Take advantage of this no-cost, high-impact solution to explore, experiment, and experience firsthand the benefits of AMD-powered HPC on Azure. To learn more and sign up for the program, visit https://aka.ms/AMDInnovationLab/LearnMore

Running DeepSeek-R1 on a single NDv5 MI300X VM
Contributors: Davide Vanzo, Yuval Mazor, Jesse Lopez

DeepSeek-R1 is an open-weights reasoning model built on DeepSeek-V3, designed for conversational AI, coding, and complex problem-solving. It has gained significant attention beyond the AI/ML community due to its strong reasoning capabilities, often competing with OpenAI’s models. One of its key advantages is that it can be run locally, giving users full control over their data.

The NDv5 MI300X VM features 8x AMD Instinct MI300X GPUs, each equipped with 192GB of HBM3 and interconnected via Infinity Fabric 3.0. With up to 5.2 TB/s of memory bandwidth per GPU, the MI300X provides the necessary capacity and speed to process large models efficiently - enabling users to run DeepSeek-R1 at full precision on a single VM.

In this blog post, we’ll walk you through the steps to provision an NDv5 MI300X instance on Azure and run DeepSeek-R1 for inference using the SGLang inference framework.

Launching an NDv5 MI300X VM

Prerequisites

- Check that your subscription has sufficient vCPU quota for the ND MI300X v5 VM family (see the quota documentation). If needed, contact your Microsoft account representative to request a quota increase.
- A Bash terminal with Azure CLI installed and logged into the appropriate tenant. Alternatively, Azure Cloud Shell can also be used.

Provision the VM

1. Using Azure CLI, create an Ubuntu-22.04 VM on ND_MI300x_v5:

az group create --location <REGION> -n <RESOURCE_GROUP_NAME>

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH>

Optionally, the deployment can use the cloud-init.yaml file, specified as --custom-data <CLOUD_INIT_FILE_PATH>, to automate the additional preparation described below:

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH> --custom-data <CLOUD_INIT_FILE_PATH>

Note: The GPU drivers may take a couple of minutes to load completely after the VM has been created.

Additional preparation

Beyond provisioning the VM, there are additional steps to prepare the environment to optimally run DeepSeek or other AI workloads, including setting up the 8 NVMe disks on the node in a RAID-0 configuration to act as the cache location for Docker and Hugging Face. The following steps assume you have connected to the VM and are working in a Bash shell.

1. Prepare the NVMe disks in a RAID-0 configuration:

sudo mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme

2. Configure Hugging Face to use the RAID-0. This environment variable should also be propagated to any containers pulling images or data from Hugging Face.

mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
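Before moving on, it can help to confirm the cache location is set up as expected. The following quick checks are a minimal sketch (not part of the original walkthrough) and assume the RAID-0 and HF_HOME steps above completed successfully:

cat /proc/mdstat              # md128 should appear as an active raid0 array
df -h /mnt/resource_nvme      # confirms the XFS filesystem is mounted
echo $HF_HOME                 # should print /mnt/resource_nvme/hf_cache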
3. Configure Docker to use the RAID-0:

mkdir -p /mnt/resource_nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
  "data-root": "/mnt/resource_nvme/docker"
}
EOF
sudo chmod 0644 /etc/docker/daemon.json
sudo systemctl restart docker

All of these additional preparation steps can be automated at VM creation using cloud-init. The example cloud-init.yaml file can be used when provisioning the VM as described above:

#cloud-config
package_update: true

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      NVME_DISKS_NAME=`ls /dev/nvme*n1`
      NVME_DISKS=`ls -latr /dev/nvme*n1 | wc -l`
      echo "Number of NVMe Disks: $NVME_DISKS"
      if [ "$NVME_DISKS" == "0" ]
      then
          exit 0
      else
          mkdir -p /mnt/resource_nvme
          # Needed in case something did not unmount as expected. This will delete any data that may be left behind
          mdadm --stop /dev/md*
          mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
          mkfs.xfs -f /dev/md128
          mount /dev/md128 /mnt/resource_nvme
      fi
      chmod 1777 /mnt/resource_nvme
  - path: /etc/profile.d/hf_home.sh
    permissions: '0755'
    content: |
      export HF_HOME=/mnt/resource_nvme/hf_cache
  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - mkdir -p /mnt/resource_nvme/docker
  - mkdir -p /mnt/resource_nvme/hf_cache
  # PAM group not working for docker group, so this will add all users to docker group
  - bash -c 'for USER in $(ls /home); do usermod -aG docker $USER; done'
  - systemctl restart docker

Using MI300X

If you are familiar with NVIDIA and CUDA tools and environments, AMD provides equivalents as part of the ROCm stack.

MI300X + ROCm | Nvidia + CUDA | Description
rocm-smi      | nvidia-smi    | CLI for monitoring the system and making changes
rccl          | nccl          | Library for communication between GPUs

Running DeepSeek-R1

1. Pull the container image. It is O(10) GB in size, so it may take a few minutes to download.

docker pull rocm/sglang-staging:20250303

2. Start the SGLang server. The model (~642 GB) is downloaded the first time it is launched and will take at least a few minutes to download. Once the application outputs “The server is fired up and ready to roll!”, you can begin making queries to the model.

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sglang-staging:20250303 \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --host 0.0.0.0

3. You can now make queries to DeepSeek-R1. For example, these requests to the model from another shell on the same host return model metadata and generate a sample response.

curl http://localhost:30000/get_model_info
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}

curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{
  "text": "Once upon a time,",
  "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 }
}'
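If you want to keep an eye on GPU memory while the model loads in step 2, or during inference, rocm-smi (the ROCm counterpart of nvidia-smi noted in the table above) works well. This is an illustrative check, not part of the original walkthrough:

watch -n 5 rocm-smi        # VRAM usage climbs as the ~642 GB model is sharded across the 8 GPUs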
Conclusion

In this post, we detail how to run the full-size 671B DeepSeek-R1 model on a single Azure NDv5 MI300X instance. This includes setting up the machine, installing the necessary drivers, and executing the model. Happy inferencing!

References

https://github.com/deepseek-ai/DeepSeek-R1
https://github.com/deepseek-ai/DeepSeek-V3
https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html
https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/azure-announces-new-ai-optimized-vm-series-featuring-amd%e2%80%99s-flagship-mi300x-gpu/3980770
https://docs.sglang.ai/index.html

Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology file
(Co-authored by: Rafael Salas, Sreevatsa Anantharamu, Jithin Jose, Elizabeth Fatade)

Introduction

NCCL

The NVIDIA Collective Communications Library (NCCL) is one of the most widely used communication libraries for AI training and inference. It features GPU-focused collective and point-to-point communication designs that are vital for AI workloads.

NUMA-GPU-HCA affinity

Non-uniform memory access (NUMA) architecture splits the CPU cores into groups attached to their own local memory. GPUs and HCAs are connected to a specific NUMA node via the PCIe interconnect. NCCL launches CPU processes to coordinate GPU-related inter-node communication, so the process-to-NUMA binding matters a lot for ensuring optimal performance.

NCCL uses the Low Latency (LL) protocol for small-to-medium message communication. In this protocol, the GPU data is copied to a pinned CPU buffer before being sent over the network. Copying data from a GPU to a NUMA node that it is not directly connected to leads to additional inter-NUMA communication that adds overhead. Furthermore, communicating this CPU buffer through a NIC connected to the neighboring NUMA node also requires additional inter-NUMA communication.

The following diagram shows the system topology of the NVIDIA DGX H100 system as a reference.

[Figure: H100 system topology (from https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html)]

To determine the CPU cores that belong to a NUMA node, use:

lscpu

To determine the GPU-to-NUMA affinity, use:

cat /sys/bus/pci/devices/<busIdOfGPU>/numa_node

or, for NVIDIA GPUs, you can alternatively use:

nvidia-smi topo -m

To determine the HCA-to-NUMA mapping, use:

cat /sys/bus/pci/devices/<busIdOfHCA>/numa_node

or use:

lstopo

NCCL Topology file

Application developers can pass in the system topology by specifying the NCCL topology file while launching the job. The NCCL topology file passes the following information to the NCCL library:

- GPU-to-NUMA mapping
- GPU-to-HCA and HCA-to-NUMA mapping
- NUMA-to-CPU-core mapping
- Speed and type of the GPU-GPU interconnect

This enables NCCL to choose the most efficient system topology for communication. Azure ND-series VMs, like the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. If you are using a VM image from the AI and HPC marketplace, the topology file is located in the /opt/microsoft directory. The topology files are also located in the azhpc-images GitHub repository under the topology directory.
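To double-check the mappings that end up in the topology file, the per-device sysfs queries shown earlier can be wrapped in a small loop. This is an illustrative sketch, not from the original post; it assumes NVIDIA GPUs (PCI vendor 0x10de) and Mellanox HCAs (0x15b3):

#!/bin/bash
# Print the NUMA node of every NVIDIA GPU and Mellanox HCA by walking sysfs.
for dev in /sys/bus/pci/devices/*; do
    vendor=$(cat "$dev/vendor")
    case "$vendor" in
        0x10de) kind="GPU" ;;   # NVIDIA
        0x15b3) kind="HCA" ;;   # Mellanox / NVIDIA networking
        *) continue ;;
    esac
    echo "$kind $(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
done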
Performance experiments

This section presents results that demonstrate the benefit of correct CPU pinning via the NCCL topology file. We use the NCCL Tests benchmarks and run experiments for the different NCCL calls, comparing the performance of the default mpirun binding to the correct binding specified via the NCCL topology file.

Setup

The setup used for the experiments is:

- Two NDv5 nodes: 8 H100 GPUs per node, 8 400G NICs per node
- Azure VM image: microsoft-dsvm:ubuntu-hpc:2204:latest
- NCCL version: 2.25.1+cuda12.6
- MPI library: HPC-X v2.18

Impact of System Topology

In this section, we show the performance impact of system topology awareness in NCCL. We show the NCCL benchmark performance results comparing default and topology-aware configurations.

For the default case (default mpirun binding), we use the following command line for launching the NCCL benchmarks:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

For the system topology-aware case, we use the following command line:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

Here, hostfile is a file containing the IP addresses of the two nodes. NCCL_BENCHMARK is the NCCL benchmark name; we run experiments with all six benchmarks: all_reduce_perf, all_gather_perf, sendrecv_perf, reduce_scatter_perf, broadcast_perf, and alltoall_perf. <topo-file.xml> is the Azure SKU topology file that can be obtained as described in the "NCCL Topology file" section. NCCL topology files for different Azure HPC/AI SKUs are available here: Topology Files.

Topology file-based binding is set by pointing the NCCL_TOPO_FILE variable to the path of the NCCL topology file and setting NCCL_IGNORE_CPU_AFFINITY to 1. Setting NCCL_IGNORE_CPU_AFFINITY to 1 is crucial to assign the process affinity in NCCL solely based on the NCCL topology file. If this variable is not set, NCCL honors the affinity set by the MPI library as well as the NCCL topology file by setting the affinity to the intersection of the two sets. If the intersection is empty, NCCL simply retains the affinity set by the MPI library.

Additional NCCL Optimizations

We also list some additional NCCL tuning configurations for better performance. Note that these drive the benchmark performance to the peak, but it is important to fine-tune these parameters for the end training application, as some of them have an impact on SM utilization.

Config: Value
NCCL_MIN_CHANNELS: 32
NCCL_P2P_NET_CHUNKSIZE: 512K
NCCL_IB_QPS_PER_CONNECTION: 4
NCCL_PXN_DISABLE: 1

Increasing NCCL_MIN_CHANNELS to 32 increases the throughput of certain collectives (especially ncclReduceScatter). Increasing NCCL_P2P_NET_CHUNKSIZE to 512K (from the default value of 128K) gives better throughput for ncclSend and ncclRecv calls when channel buffers are used instead of user-registered buffers for communication. Increasing NCCL_IB_QPS_PER_CONNECTION from 1 to 4 also slightly increases the throughput of collectives. Setting NCCL_PXN_DISABLE to 1 is essential to enable the zero-copy design for ncclSend and ncclRecv calls. Inter-node zero-copy designs are present in NCCL 2.24 onwards but are activated only if NCCL_PXN_DISABLE is set to 1 and the user buffers are registered with the NIC via NCCL calls (the "-R 1" flag registers user buffers in NCCL Tests). We found the bandwidth of the zero-copy point-to-point design to be around 10 GB/s better than the copy-based design.
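As a minimal sketch (not from the original post), these tuning values can be exported in the job environment or passed via additional -x flags in the mpirun commands above:

export NCCL_MIN_CHANNELS=32
export NCCL_P2P_NET_CHUNKSIZE=524288      # 512K, expressed in bytes
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_PXN_DISABLE=1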
Results

Small message sizes

In the figure above, we compare the small-message latency for the different collectives using the NCCL topology file-based binding against the default mpirun binding. For all the NCCL benchmarks, we see consistently higher performance with the NCCL topology file-based binding.

The reason for this improvement is that, for small message sizes, NCCL uses the Low Latency (LL) protocol, and this protocol uses pinned CPU buffers for inter-node communication. In the LL protocol, the GPU copies the data to be sent over the network to the CPU buffer. The CPU thread polls this buffer to check if all the flags are set, and once they are set, the thread sends the data over the network. With the default mpirun pinning, all eight processes per node and their allocated memory are on NUMA node 0. However, GPUs 0-3 have affinity to NUMA node 0 and GPUs 4-7 have affinity to NUMA node 1. Therefore, the data copied from GPUs 4-7 needs to traverse the NUMA interconnect to reach the pinned CPU buffer. Furthermore, while communicating the CPU buffer via the NIC closest to the GPU, the inter-NUMA interconnect needs to be traversed again. The figures above clearly show this additional overhead from inter-NUMA interconnect latency. With NCCL topology file-based binding, this overhead does not exist, because the topology file contains information about the GPUs attached to each NUMA node and the CPU mask for all the cores in a NUMA node. Using this information, NCCL correctly binds the processes to the NUMA nodes.

Medium message sizes

The medium-message bandwidth is compared in the figure above for the different collectives using the NCCL topology file-based binding against the default mpirun binding. Similar to small message sizes, NCCL uses the LL protocol for most of these medium-message runs. Nearly all NCCL benchmarks show an improvement in bandwidth with the NCCL topology file-based binding. The reason for this improvement is the same as for the small message sizes.

Large message sizes

The bandwidth for large message sizes is compared in the figure above for the different collectives using the NCCL topology file-based binding against the default mpirun binding. As the message sizes get larger, there is not much improvement in bandwidth from the NCCL topology file-based binding for any of the NCCL benchmarks. This is because the dominant NCCL protocol in this message size range is SIMPLE. In this protocol, on systems with GPUDirect RDMA (which is the case for Azure ND-series SKUs), the GPU buffer is communicated directly over the network for inter-node communication; there is no intermediate copy of the message to a CPU buffer. In this scenario, GPU-CPU communication is used only to update a few flags for auxiliary tasks such as indicating buffer readiness. The increase in the latency of these auxiliary tasks is insignificant compared to the total message transfer time. Thus, the impact of CPU pinning on the achieved bandwidth diminishes for large message sizes.
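One way to sanity-check the protocol behavior described above is to force a protocol explicitly. This is an optional, illustrative experiment, not part of the original study; NCCL_PROTO accepts LL, LL128, or Simple and can be forwarded like the other variables via -x in the mpirun commands shown earlier:

export NCCL_PROTO=LL       # force the LL protocol; use Simple to mimic the large-message path
export NCCL_DEBUG=INFO     # verbose NCCL initialization logging for inspection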
Conclusion

This blog post describes the impact of system topology awareness and NCCL tuning for AI workload optimization and lists the relevant NCCL configuration options. The impact of topology awareness is significant for small and medium messages, where NCCL uses the LL protocol. We also show the performance results comparing the different configurations and highlight the importance of performance tuning.

To summarize, the recommended configuration for NCCL (versions newer than v2.24) on Azure HPC/AI VMs (NDv4/NDv5) is:

Config: Value
NCCL_MIN_CHANNELS: 32
NCCL_P2P_NET_CHUNKSIZE: 512K
NCCL_IB_QPS_PER_CONNECTION: 4
NCCL_PXN_DISABLE: 1
NCCL_TOPO_FILE: <corresponding-topology-file.xml>
NCCL_IGNORE_CPU_AFFINITY: 1

For Azure HPC/AI VMs, this NCCL topology file is available in /etc/nccl.conf inside the VM image. For container runs inside the VM, it is recommended to mount the NCCL topology file and the /etc/nccl.conf file from the VM into the container.
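As an illustration of that last recommendation, the fragment below sketches how a container launch might mount the VM's NCCL configuration and topology file. The image name and exact topology file name are placeholders, not from the original post:

docker run --rm --gpus all \
  -v /etc/nccl.conf:/etc/nccl.conf:ro \
  -v /opt/microsoft:/opt/microsoft:ro \
  -e NCCL_TOPO_FILE=/opt/microsoft/<corresponding-topology-file.xml> \
  -e NCCL_IGNORE_CPU_AFFINITY=1 \
  <your-training-image>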
Add a new Partition to a running CycleCloud SLURM cluster

Learn how to seamlessly integrate a new partition into an active SLURM cluster within Azure CycleCloud without restarting the entire cluster. This guide provides step-by-step instructions to add a new nodearray, configure cluster settings, and scale the cluster to match the SLURM partition. Optimize your HPC environment with Azure CycleCloud and SLURM for efficient resource utilization and enhanced performance.

Using Azure CycleCloud with Weka
What is Azure CycleCloud?

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing HPC environments on Azure. With Azure CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale. CycleCloud is used for running workloads like scientific simulations, rendering tasks, genomics and bioinformatics, financial modeling, artificial intelligence, machine learning, and other data-intensive operations that require large amounts of compute power. CycleCloud supports GPU computing, which is useful for the workloads described above.

One of the strengths of Azure CycleCloud is its ability to automatically scale resources up or down based on demand. If your workload requires more GPU power (such as for deep learning training), CycleCloud can provision additional GPU-enabled instances as needed. The question remains: if the GPUs provisioned by CycleCloud are waiting for storage I/O operations, not only is the performance of the application severely impacted, the GPUs are also underutilized, meaning you are not fully exploiting the resources you are paying for. This brings us to Weka.io. But before we talk about the problems WEKA and CycleCloud solve, let's talk about what WEKA is.

What is WEKA?

The WEKA® Data Platform was purpose-built to seamlessly and sustainably deliver speed, simplicity, and scale that meets the needs of modern enterprises and research organizations without compromise. Its advanced, software-defined architecture supports next-generation workloads in virtually any location with cloud simplicity and on-premises performance. At the heart of the WEKA® Data Platform is a modern, fully distributed parallel filesystem, WekaFS™, which can span thousands of NVMe SSDs spread across multiple hosts and seamlessly extend itself over S3-compatible object storage.

You can deploy WEKA software on a cluster of Microsoft Azure LSv3 VMs with local SSD to create a high-performance storage layer. WEKA can also take advantage of Azure Blob Storage to scale your namespace at the lowest cost. You can automate your WEKA deployment through HashiCorp Terraform templates for fast, easy installation. Data stored in your WEKA environment is accessible to applications through multiple protocols, including NFS, SMB, POSIX, and S3.

Key components of the WEKA Data Platform in Azure include:

- The architecture is deployed directly in the customer tenant, within a subscription ID of the customer's choosing.
- WEKA software is deployed across 6 or more Azure LSv3 VMs. The LSv3 VMs are clustered to act as one single device.
- The WekaFS™ namespace is extended transparently onto Azure Hot Blob.
- Scale-up and scale-down functions are driven by Logic Apps and Function Apps.
- All client secrets are kept in Azure Key Vault.
- Deployment is fully automated using Terraform WEKA templates.

What is the integration?

Using the WEKA-CycleCloud template available here, any compute nodes deployed via CycleCloud will automatically install the WEKA agent and automatically mount the WEKA filesystem. Users can deploy 10, 100, even thousands of compute nodes, and they will all mount the fastest storage in Azure (WEKA). Full integration steps are available here: WEKA/CycleCloud for SLURM Integration.
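As a quick way to confirm the integration on a freshly provisioned compute node, a couple of checks can help. This is an illustrative sketch, assuming the WEKA client CLI is installed by the cluster-init and that the filesystem mounts with the wekafs type:

weka status                 # reports WEKA agent and cluster health
df -hT | grep -i weka       # the WEKA mount should appear with the wekafs filesystem type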
Benefits

The combined solution brings together the best of both worlds. With the CycleCloud / WEKA template, customers get:

- Simplified HPC management. With CycleCloud, you can provision clusters with a few clicks using preconfigured templates, and the clusters will all be mounted directly to WEKA.
- A high-performance end-to-end architecture. CycleCloud and WEKA let users combine the benefits of CPUs/GPUs with ultra-fast storage. This is essential to ensure high throughput and low latency for computational workloads. The goal is to ensure that the storage subsystem can keep up with the high-speed demands of the CPU/GPU, especially in scenarios where you're running compute-heavy workloads like deep learning, scientific simulations, or large-scale data processing.
- Cost optimization #1. Both CycleCloud and WEKA allow for autoscaling (up and down). Adjust the number of compute resources (CycleCloud) as well as the number of storage backend nodes (WEKA) based on workload needs.
- Cost optimization #2. WEKA offers intelligent data tiering to help optimize performance and storage costs. The tiering system is designed to automatically move data between different storage classes based on access patterns, which maximizes efficiency while minimizing expenses.

Conclusion

The CycleCloud and WEKA integration delivers a simplified HPC (AI/ML) cloud management platform, exceptional performance for data-intensive workloads, and cost optimization via elastic scaling, flash optimization, and data tiering, all in one user interface. This enables organizations to achieve high throughput, low latency, and optimal CPU/GPU resource utilization for their most demanding applications and use cases. Try it today!

Special thanks to Raj Sharma and the WEKA team for their work on this integration!

Running DeepSeek on AKS with Ollama
Introduction

This guide provides step-by-step instructions on how to run DeepSeek on Azure Kubernetes Service (AKS). The setup utilizes an ND-H100-v5 VM to accommodate the 4-bit quantized 671-billion-parameter model on a single node.

Prerequisites

Before proceeding, ensure you have an AKS cluster with a node pool containing at least one ND-H100-v5 instance. Additionally, make sure the NVIDIA GPU drivers are properly set up. Important: set the OS disk size to 1024GB to accommodate the model. For detailed instructions on creating an AKS cluster, refer to this guide.

Deploying Ollama with DeepSeek on AKS

We will use the PyTorch container from NVIDIA to deploy Ollama. First, configure authentication on AKS.

Create an NGC API Key

Visit NGC API Key Setup and generate an API key.

Create a Secret in the AKS Cluster

Run the following command to create a Kubernetes secret for authentication:

kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<YOUR_NGC_API_KEY>

Deploy the Pod

Use the following Kubernetes manifest to deploy Ollama, download the model, and start serving it:

apiVersion: v1
kind: Pod
metadata:
  name: ollama
spec:
  imagePullSecrets:
    - name: nvcr-secret
  containers:
    - name: ollama
      image: nvcr.io/nvidia/pytorch:24.09-py3
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /dev/shm
          name: shmem
      resources:
        requests:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
        limits:
          nvidia.com/gpu: 8
          nvidia.com/mlnxnics: 8
      command:
        - bash
        - -c
        - |
          export DEBIAN_FRONTEND=noninteractive
          export OLLAMA_MODELS=/mnt/data/models
          mkdir -p $OLLAMA_MODELS
          apt update
          apt install -y curl pciutils net-tools
          # Install Ollama
          curl -fsSL https://ollama.com/install.sh | sh
          # Start Ollama server in the foreground
          ollama serve 2>&1 | tee /mnt/data/ollama.log &
          # Wait for Ollama to fully start before pulling the model
          sleep 5
          # Fix parameter issue (https://github.com/ollama/ollama/issues/8599)
          cat >Modelfile <<EOF
          FROM deepseek-r1:671b
          PARAMETER num_ctx 24576
          PARAMETER num_predict 8192
          EOF
          ollama create deepseek-r1:671b-fixed -f Modelfile
          # Pull model
          ollama pull deepseek-r1:671b-fixed
          # Keep the container running
          wait
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 128Gi
      name: shmem

Connecting to Ollama

By default, this setup does not create a service. You can either define one or use port forwarding to connect Ollama to your local machine. To forward the port, run:

kubectl port-forward pod/ollama 11434:11434

Now, you can interact with the DeepSeek reasoning model using your favorite chat client. For example, in Chatbox, you can ask: "Tell me a joke about large language models."
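If you prefer to test without a chat client, a direct request against the forwarded port works too. This is a minimal sketch (not from the original post) using Ollama's standard REST endpoint and the model name created above:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:671b-fixed",
  "prompt": "Tell me a joke about large language models.",
  "stream": false
}'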
GPU Slicing in CycleCloud Slurm with CUDA Multi-Process Service (MPS)

High-performance computing thrives on efficient GPU resource sharing, and integrating NVIDIA's CUDA Multi-Process Service (MPS) with CycleCloud-managed Slurm clusters can revolutionize how teams optimize their workloads. CUDA MPS streamlines GPU sharing by creating a shared GPU context for multiple CUDA processes. This happens transparently, thanks to an MPS control daemon that manages workloads behind the scenes. It ensures that users can seamlessly run their CUDA programs without modification, each benefiting from their own MPS server for efficient workload distribution and GPU access. When combined with the dynamic scaling of Azure CycleCloud and the robust job scheduling of Slurm, this setup becomes a game-changer. Teams can maximize GPU utilization, minimize costs, and scale effortlessly, all while maintaining the flexibility to adapt to diverse workloads.

Unlike NVIDIA's Multi-Instance GPU (MIG), which is tied to specific GPU models like the A100 and H100, MPS works across any GPU with a compute capability of 3.5 or higher. While MIG provides hardware-level isolation, MPS excels at enabling multiple concurrent CUDA processes on the same GPU, requiring careful memory management to maintain performance. Refer to my previous post on Creating a SLURM Cluster (Without CycleCloud) for Scheduling NVIDIA MIG-Based GPU Accelerated workloads.

In this guide, we'll explore how to integrate CUDA MPS with Slurm using Azure CycleCloud. You'll discover how to run multiple GPU jobs on a single GPU, unlocking new levels of efficiency and scalability for your HPC operations.

Prerequisites

CycleCloud
- Version: 8.7
- Status: Must be installed and operational

Scheduler Node
- VM Size: Standard D4s v5 (4 vCPUs, 16 GiB memory)
- Image: Ubuntu-HPC 2204 - Gen2 (microsoft-dsvm:ubuntu-hpc:2204:latest)
- Scheduling Software: Slurm 24.05.4

Execute Node
- VM Size: Standard_NC24ads_A100_v4
- Image: Ubuntu-HPC 2204 - Gen2 (microsoft-dsvm:ubuntu-hpc:2204:latest)
- Scheduling Software: Slurm 24.05.4
- NVIDIA driver: integrated into the image

Here is the procedure to integrate CUDA MPS in CycleCloud. Log in to the CycleCloud server, clone the following repository, and upload the project to the locker:

git clone https://github.com/vinil-v/slurm-cuda-mps
cd slurm-cuda-mps/
cyclecloud project upload "<Locker name>"

In your CycleCloud Slurm cluster configuration, add this project as a cluster-init in the scheduler configuration. This enables CUDA MPS support in the CycleCloud Slurm cluster. I have tested this with A100, H100, and T4 GPU VMs, and it works as expected.

The custom project makes the following changes to the Slurm and GRES configuration:

----- slurm.conf -----
(base) vinil@slurmgpu-hpc-1:~$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu,mps

----- azure.conf -----
(base) vinil@slurmgpu-hpc-1:~$ grep mps /etc/slurm/azure.conf
Nodename=slurmgpu-hpc-1 Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=214016 Gres=gpu:1,mps:100

----- gres.conf -----
(base) vinil@slurmgpu-hpc-1:~$ cat /etc/slurm/gres.conf
Nodename=slurmgpu-hpc-1 Name=gpu Count=1 File=/dev/nvidia0
Nodename=slurmgpu-hpc-1 Name=mps Count=100 File=/dev/nvidia0
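After the cluster-init runs, a quick sanity check (an illustrative sketch, not part of the original write-up) is to confirm that Slurm actually advertises both the gpu and mps GRES on the execute node:

scontrol show node slurmgpu-hpc-1 | grep -i gres     # should list gpu:1 and mps:100
sinfo -N -o "%N %G"                                  # GRES per node across the partition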
Testing the setup

Here's the job script used to request a partial GPU allocation for compute. In this example, the job requests 25% of a GPU, allowing up to 4 jobs per GPU, since each GPU is divided into 100 MPS shares. The CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable controls GPU usage; Slurm sets this variable based on the --gres=mps directive.

To test this setup, I used the distributed_training.py script. You need to set up an Anaconda environment to run this job. Since MPS doesn't enforce strict memory limits per process, managing memory at the application level is essential for efficient memory use in multi-process setups. To ensure successful concurrent execution of 4 jobs on a single GPU, I added code to limit each process's memory usage to 15GB.

# Limit GPU memory to 15 GB
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)  # Enable memory growth
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=15360)])
    except RuntimeError as e:
        print(e)  # Memory growth must be set at program startup

Slurm job script:

#!/bin/bash
#SBATCH --job-name=cuda_mps_job        # Job name
#SBATCH --output=cuda_mps_output.%j    # Output file
#SBATCH --error=cuda_mps_error.%j      # Error file
#SBATCH --ntasks=1                     # Number of tasks (processes)
#SBATCH --cpus-per-task=6              # Number of CPU cores per task (adjust as needed)
#SBATCH --gres=mps:25                  # Request MPS shares
#SBATCH --time=01:00:00                # Time limit (adjust as needed)
#SBATCH --partition=hpc                # Specify the GPU partition (adjust as needed)

# Define directories for MPS control pipes and logs
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$SLURM_JOB_ID

# Create the necessary directories
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
mkdir -p $CUDA_MPS_LOG_DIRECTORY

# Start the MPS control daemon if it isn't already running
if ! pgrep -x "nvidia-cuda-mps-control" > /dev/null; then
    echo "Starting MPS control daemon..."
    nvidia-cuda-mps-control -d
fi

# Set the MPS thread utilization limit (optional)
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Load your Python environment
source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env

# Run your CUDA applications
echo "Running CUDA applications..."
python distributed_training.py

# Stop the MPS control daemon
echo "Stopping MPS control daemon..."
echo quit | nvidia-cuda-mps-control

# Clean up MPS directories
rm -rf $CUDA_MPS_PIPE_DIRECTORY
rm -rf $CUDA_MPS_LOG_DIRECTORY

We can see that 4 jobs are running on 1 GPU under MPS control; you can see the M+C process type in the nvidia-smi output.

(base) vinil@slurmgpu-hpc-1:~$ nvidia-smi
Tue Nov  5 05:51:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000001:00:00.0   Off |                    0 |
| N/A   41C    P0             79W /  300W |   62748MiB /  81920MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     66542    M+C   python                                      15648MiB |
|    0   N/A  N/A     66543    M+C   python                                      15648MiB |
|    0   N/A  N/A     66544    M+C   python                                      15648MiB |
|    0   N/A  N/A     66545    M+C   python                                      15648MiB |
|    0   N/A  N/A     66560      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66563      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66564      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66567      C   nvidia-cuda-mps-server                         30MiB |
+-----------------------------------------------------------------------------------------+

We can observe that four jobs are currently running, while the fifth job is in the queue. In this setup, there is only one node with a single GPU, which is why the fifth job remains queued. Each job utilizes 25% of the GPU's capacity.

vinil@slurmgpu-scheduler:~$ squeue
JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
    3       hpc cuda_mps  vinil  R  0:02     1 slurmgpu-hpc-1
    4       hpc cuda_mps  vinil  R  0:02     1 slurmgpu-hpc-1
    2       hpc cuda_mps  vinil  R  0:05     1 slurmgpu-hpc-1
    1       hpc cuda_mps  vinil  R  1:03     1 slurmgpu-hpc-1
    5       hpc cuda_mps  vinil PD  0:00     1 (Resources)

Current Limitations

While MPS effectively handles multiple processes on a single GPU, it struggles with seamless multi-GPU support in Slurm environments. Testing on an 8-GPU machine revealed that only the first GPU was utilized, leaving the others idle, highlighting a need for further exploration to unlock full multi-GPU utilization under MPS.

The issue arises from Slurm's interaction with CUDA and its management of GPU visibility through environment variables like CUDA_VISIBLE_DEVICES and control groups (cgroups). Slurm uses CUDA_VISIBLE_DEVICES to isolate GPUs for jobs, mapping the assigned GPU to 0 from the job's perspective. While effective for single-GPU setups, this approach limits MPS, which requires visibility of all GPUs it might manage. Additionally, Slurm's use of cgroups to enforce GPU allocation confines MPS servers to the assigned GPUs, preventing access to others. Compounding this, MPS allocates a shared GPU context per user (UID) and doesn't inherently distribute processes across multiple GPUs, meaning jobs assigned to a single GPU by Slurm remain limited to that GPU's context.

Manual Setup for Running CUDA-MPS Jobs in a Multi-GPU Environment

To begin, we need to start a separate CUDA MPS daemon for each GPU. Below is a script I created to enable the CUDA MPS daemon across all GPUs:

#!/bin/bash
set -eux

# Set the number of available GPUs
GPUS=$(nvidia-smi -L | wc -l)

# Loop through each GPU
for i in $(seq 0 $((GPUS-1))); do
    nvidia-smi -i $i -c EXCLUSIVE_PROCESS
    mkdir -p /tmp/mps_$i /tmp/mps_log_$i
    export CUDA_VISIBLE_DEVICES=$i
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
    nvidia-cuda-mps-control -d
done

To run a job on a specific GPU, setting the CUDA_VISIBLE_DEVICES variable is unnecessary. Instead, we leverage the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables to direct jobs to specific GPUs.
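Before submitting work against the per-GPU daemons, it is worth confirming they all came up. This is an illustrative check, not part of the original post:

pgrep -af nvidia-cuda-mps-control    # expect one control daemon per GPU
ls -d /tmp/mps_*                     # pipe and log directories created by the script above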
The MPS control daemon, MPS server, and associated MPS clients communicate using named pipes and UNIX domain sockets, which are created by default in /tmp/nvidia-mps. The CUDA_MPS_PIPE_DIRECTORY variable lets you customize the location of these pipes and sockets. It is crucial that this variable is set consistently across all MPS clients sharing the same MPS server and control daemon.

Below is the command I used to submit a job to GPU 0. The CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable was set to 25%, allowing up to four concurrent jobs on the GPU:

#!/bin/bash
source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25
CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0 CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_0 python case0/distributed_training.py

This approach enables job submission to each GPU by specifying unique CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY values. Below is the nvidia-smi output showcasing the results.

+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000001:00:00.0   Off |                    0 |
| N/A   33C    P0             64W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000002:00:00.0   Off |                    0 |
| N/A   32C    P0             65W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off | 00000003:00:00.0   Off |                    0 |
| N/A   32C    P0             61W /  400W |    8519MiB /  40960MiB |     16%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off | 00000004:00:00.0   Off |                    0 |
| N/A   32C    P0             64W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          Off | 0000000B:00:00.0   Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          Off | 0000000C:00:00.0   Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          Off | 0000000D:00:00.0   Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     24%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          Off | 0000000E:00:00.0   Off |                    0 |
| N/A   32C    P0             61W /  400W |    8519MiB /  40960MiB |     14%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    122702      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A    137145    M+C   python                                       8480MiB |
|    1   N/A  N/A    123715      C   nvidia-cuda-mps-server                         30MiB |
|    1   N/A  N/A    137143    M+C   python                                       8480MiB |
|    2   N/A  N/A    124455      C   nvidia-cuda-mps-server                         30MiB |
|    2   N/A  N/A    137141    M+C   python                                       8480MiB |
|    3   N/A  N/A    130412      C   nvidia-cuda-mps-server                         30MiB |
|    3   N/A  N/A    137148    M+C   python                                       8480MiB |
|    4   N/A  N/A    131362      C   nvidia-cuda-mps-server                         30MiB |
|    4   N/A  N/A    137142    M+C   python                                       8480MiB |
|    5   N/A  N/A    132061      C   nvidia-cuda-mps-server                         30MiB |
|    5   N/A  N/A    137144    M+C   python                                       8480MiB |
|    6   N/A  N/A    137147    M+C   python                                       8480MiB |
|    6   N/A  N/A    137180      C   nvidia-cuda-mps-server                         30MiB |
|    7   N/A  N/A    137146    M+C   python                                       8480MiB |
|    7   N/A  N/A    137186      C   nvidia-cuda-mps-server                         30MiB |
+-----------------------------------------------------------------------------------------+

Conclusion

By integrating CUDA MPS into CycleCloud, GPU resource sharing reaches new levels of efficiency. Despite its limitations, MPS offers an excellent solution for environments prioritizing flexibility over strict isolation. This project is a step toward optimizing GPU usage for diverse workloads in cloud-based HPC clusters.

References:
- Nvidia CUDA Multi-Process Service
- Azure CycleCloud
- Slurm MPS Management

Deploying ZFS Scratch Storage for NVMe on Azure Kubernetes Service (AKS)
This guide demonstrates how to use ZFS LocalPV to efficiently manage the NVMe storage available on Azure NDv5 H100 VMs. Equipped with eight 3.5TB NVMe disks, these VMs are tailored for high-performance workloads like AI/ML and large-scale data processing. By combining the flexibility of AKS with the advanced storage capabilities of ZFS, you can dynamically provision stateful node-local volumes while aggregating NVMe disks for optimal performance.

Breaking the Speed Limit with WEKA File System on top of Azure Hot Blob
WEKA delivers unbeatable performance for your most demanding applications running in Microsoft Azure, supporting high I/O, low latency, small files, and mixed workloads with zero tuning and automatic storage rebalancing. Examine how WEKA's patented filesystem, WekaFS™, and its parallel processing algorithms accelerate Blob storage performance. The WEKA® Data Platform is purpose-built to deliver speed, simplicity, and scale that meets the needs of modern enterprises and research organizations without compromise. At the heart of the WEKA® Data Platform is a modern, fully distributed parallel filesystem, WekaFS™, which can span thousands of NVMe SSDs spread across multiple hosts and seamlessly extend itself over compatible object storage.