virtual machines
Experience Next-Gen HPC Innovation: AMD Lab Empowers ‘Try Before You Buy’ on Azure
In today’s fast-paced digital landscape, High-Performance Computing (HPC) is a critical engine powering innovation across industries—from automotive and aerospace to energy and manufacturing. To keep pace with escalating performance demands and the need for agile, risk-free testing environments, AMD has partnered with Microsoft and leading Independent Software Vendors (ISVs) to introduce the AMD HPC Innovation Lab. This sandbox environment on Azure is a “try before you buy” solution designed to let customers run their HPC workloads, assess performance, and experience AMD’s newest hardware innovations that deliver enhanced performance, scalability, and consistency—all without any financial commitment.

Introducing the AMD Innovation Lab: A New Paradigm in Customer Engagement

The AMD HPC Innovation Lab represents a paradigm shift in customer engagement for HPC solutions. Traditionally, organizations had to invest significant time and resources to build and manage on-premises testing environments, dealing with challenges such as hardware maintenance, scalability issues, and high operational costs. Without the opportunity to explore the benefits of cloud solutions through a trial offer, they often missed out on the advantages of cloud computing. With this lab, customers can experiment with optimized HPC environments through a simple, user-friendly interface. The process is straightforward: upload your input file or choose from the pre-configured options, run your workload, and download your output file for analysis. This streamlined approach lets businesses compare performance results on an apples-to-apples basis against other providers or existing on-premises setups.

Empowering Decision Makers

For Business Decision Makers (BDMs) and Technical Decision Makers (TDMs), the lab offers a compelling value proposition. It eliminates the complexities and uncertainties of traditional testing environments by providing a risk-free opportunity to:

Thoroughly evaluate performance: With access to AMD’s cutting-edge chipsets and Azure’s robust cloud infrastructure, organizations can conduct detailed proof-of-concept evaluations without incurring long-term costs.
Accelerate decision-making: The streamlined testing process speeds up the evaluation phase and the overall time to value, enabling organizations to make informed decisions quickly.
Optimize infrastructure: Created in partnership with ISVs and optimized by both AMD and Microsoft, the lab’s infrastructure is fine-tuned for HPC workloads, so performance assessments are accurate and reflective of real-world scenarios.

Seamless Integration with Leading ISVs

A notable strength of the AMD HPC Innovation Lab is its collaborative design with top ISVs such as Ansys, Altair, and Siemens. These partnerships ensure that the lab is equipped with industry-leading applications and solvers, such as Ansys Fluent for fluid dynamics and Ansys Mechanical for structural analysis. Each solver is pre-configured to provide a balanced and consistent performance evaluation, so users can benchmark their HPC workloads against industry standards with ease.

Sustainability and Scalability

Beyond performance and ease of use, the AMD HPC Innovation Lab is built with sustainability in mind.
By leveraging Azure’s scalable cloud infrastructure, businesses can conduct HPC tests without the overhead and environmental impact of maintaining additional on-premises resources. This not only helps reduce operational costs but also supports corporate sustainability goals by minimizing the carbon footprint associated with traditional HPC setups.

An Exciting Future for HPC Testing

The innovation behind the AMD HPC Innovation Lab is just the beginning. With plans to continuously expand the lab catalog and include more ISVs, the platform is set to evolve into a comprehensive testing ecosystem. This ongoing expansion will provide customers with an increasingly diverse set of tools and environments tailored to a wide array of HPC needs. Whether you’re evaluating performance for fluid dynamics, structural simulations, or electromagnetic fields, the lab’s growing catalog promises to deliver precise and actionable insights.

Ready to Experience the Future of HPC?

The AMD HPC Innovation Lab on Azure offers a unique opportunity for organizations looking to harness the power of advanced computing without upfront financial risk. With its intuitive interface, optimized infrastructure, and robust ecosystem of ISVs, this sandbox environment is a game-changer in HPC testing and validation. Take advantage of this no-cost, high-impact solution to explore, experiment, and experience firsthand the benefits of AMD-powered HPC on Azure. To learn more and sign up for the program, visit https://aka.ms/AMDInnovationLab/LearnMore

Running DeepSeek-R1 on a single NDv5 MI300X VM
Contributors: Davide Vanzo, Yuval Mazor, Jesse Lopez

DeepSeek-R1 is an open-weights reasoning model built on DeepSeek-V3, designed for conversational AI, coding, and complex problem-solving. It has gained significant attention beyond the AI/ML community due to its strong reasoning capabilities, often competing with OpenAI’s models. One of its key advantages is that it can be run locally, giving users full control over their data.

The NDv5 MI300X VM features 8x AMD Instinct MI300X GPUs, each equipped with 192 GB of HBM3 and interconnected via Infinity Fabric 3.0. With up to 5.2 TB/s of memory bandwidth per GPU, the MI300X provides the capacity and speed needed to process large models efficiently, enabling users to run DeepSeek-R1 at full precision on a single VM.

In this blog post, we’ll walk you through the steps to provision an NDv5 MI300X instance on Azure and run DeepSeek-R1 for inference using the SGLang inference framework.

Launching an NDv5 MI300X VM

Prerequisites

Check that your subscription has sufficient vCPU quota for the VM family “StandardNDISv5MI300X” (see the quota documentation). If needed, contact your Microsoft account representative to request a quota increase.
A Bash terminal with the Azure CLI installed and logged into the appropriate tenant. Alternatively, Azure Cloud Shell can be used.

Provision the VM

1. Using the Azure CLI, create an Ubuntu 22.04 VM on ND_MI300x_v5:

az group create --location <REGION> -n <RESOURCE_GROUP_NAME>

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH>

Optionally, the deployment can use the cloud-init.yaml file (specified as --custom-data <CLOUD_INIT_FILE_PATH>) to automate the additional preparation described below:

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH> --custom-data <CLOUD_INIT_FILE_PATH>

Note: The GPU drivers may take a couple of minutes to load completely after the VM has been created.

Additional preparation

Beyond provisioning the VM, there are additional steps to prepare the environment to run DeepSeek-R1 (or other AI workloads) optimally, including setting up the node’s 8 NVMe disks in a RAID-0 configuration to act as the cache location for Docker and Hugging Face. The following steps assume you have connected to the VM and are working in a Bash shell.

1. Prepare the NVMe disks in a RAID-0 configuration:

mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme

2. Configure Hugging Face to use the RAID-0. This environment variable should also be propagated to any containers pulling images or data from Hugging Face.

mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
3. Configure Docker to use the RAID-0:

mkdir -p /mnt/resource_nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
  "data-root": "/mnt/resource_nvme/docker"
}
EOF
sudo chmod 0644 /etc/docker/daemon.json
sudo systemctl restart docker

All of these additional preparation steps can be automated at VM creation using cloud-init. The example cloud-init.yaml file below can be used when provisioning the VM as described above.

#cloud-config
package_update: true

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      NVME_DISKS_NAME=`ls /dev/nvme*n1`
      NVME_DISKS=`ls -latr /dev/nvme*n1 | wc -l`
      echo "Number of NVMe Disks: $NVME_DISKS"
      if [ "$NVME_DISKS" == "0" ]
      then
        exit 0
      else
        mkdir -p /mnt/resource_nvme
        # Needed in case something did not unmount as expected. This will delete any data that may be left behind
        mdadm --stop /dev/md*
        mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
        mkfs.xfs -f /dev/md128
        mount /dev/md128 /mnt/resource_nvme
      fi
      chmod 1777 /mnt/resource_nvme
  - path: /etc/profile.d/hf_home.sh
    permissions: '0755'
    content: |
      export HF_HOME=/mnt/resource_nvme/hf_cache
  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - mkdir -p /mnt/resource_nvme/docker
  - mkdir -p /mnt/resource_nvme/hf_cache
  # PAM group not working for docker group, so this will add all users to docker group
  - bash -c 'for USER in $(ls /home); do usermod -aG docker $USER; done'
  - systemctl restart docker

Using MI300X

If you are familiar with NVIDIA and CUDA tools and environments, AMD provides equivalents as part of the ROCm stack.

MI300X + ROCm   Nvidia + CUDA   Description
rocm-smi        nvidia-smi      CLI for monitoring the system and making changes
rccl            nccl            Library for communication between GPUs

Running DeepSeek-R1

1. Pull the container image. It is O(10) GB in size, so it may take a few minutes to download.

docker pull rocm/sglang-staging:20250303

2. Start the SGLang server. The model (~642 GB) is downloaded the first time the server is launched, which will take at least a few minutes. Once the application outputs “The server is fired up and ready to roll!”, you can begin making queries to the model.

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sglang-staging:20250303 \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --host 0.0.0.0

3. You can now make queries to DeepSeek-R1. For example, the following requests, issued from another shell on the same host, return model metadata and generate a sample response.

curl http://localhost:30000/get_model_info
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}

curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{ "text": "Once upon a time,", "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 } }'
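Because the server downloads roughly 642 GB of weights on first launch, it can take a while before it accepts requests. As an optional convenience (not part of the original walkthrough), a minimal sketch that polls the /get_model_info endpoint shown above from another shell until the server responds:

# wait until the SGLang server answers on port 30000
until curl -sf http://localhost:30000/get_model_info > /dev/null; do
  echo "SGLang server not ready yet, retrying in 30s..."
  sleep 30
done
echo "Server is up."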
Conclusion

In this post, we detail how to run the full-size 671B DeepSeek-R1 model on a single Azure NDv5 MI300X instance. This includes setting up the machine, installing the necessary drivers, and executing the model. Happy inferencing!

References

https://github.com/deepseek-ai/DeepSeek-R1
https://github.com/deepseek-ai/DeepSeek-V3
https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html
https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/azure-announces-new-ai-optimized-vm-series-featuring-amd%e2%80%99s-flagship-mi300x-gpu/3980770
https://docs.sglang.ai/index.html

Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology file
(Co-authored by: Rafael Salas, Sreevatsa Anantharamu, Jithin Jose, Elizabeth Fatade)

Introduction

NCCL

The NVIDIA Collective Communications Library (NCCL) is one of the most widely used communication libraries for AI training and inference. It features GPU-focused collective and point-to-point communication designs that are vital for AI workloads.

NUMA-GPU-HCA affinity

Non-uniform memory access (NUMA) architecture splits the CPU cores into groups attached to their own local memory. GPUs and HCAs are connected to a specific NUMA node via the PCIe interconnect. NCCL launches CPU processes to coordinate GPU-related inter-node communication, so process-to-NUMA binding matters a great deal for optimal performance.

NCCL uses the Low Latency (LL) protocol for small-to-medium message communication. In this protocol, the GPU data is copied to a pinned CPU buffer before being sent over the network. Copying data from a GPU to a NUMA node that it is not directly connected to incurs additional inter-NUMA communication overhead. Furthermore, communicating this CPU buffer through a NIC connected to the neighboring NUMA node also requires additional inter-NUMA communication.

The following diagram shows the system topology of the NVIDIA DGX H100 system as a reference.

H100 system topology (figure from https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html)

To determine the CPU cores that belong to a NUMA node, use:

lscpu

To determine the GPU-to-NUMA affinity, use:

cat /sys/bus/pci/devices/<busIdOfGPU>/numa_node

or, for NVIDIA GPUs, you can alternatively use:

nvidia-smi topo -m

To determine the HCA-to-NUMA mapping, use:

cat /sys/bus/pci/devices/<busIdOfHCA>/numa_node

or use:

lstopo

NCCL Topology file

Application developers can pass the system topology to NCCL by specifying an NCCL topology file when launching the job. The NCCL topology file provides the following information to the NCCL library:

GPU-to-NUMA mapping
GPU-to-HCA and HCA-to-NUMA mapping
NUMA-to-CPU-core mapping
Speed and type of the GPU-GPU interconnect

This enables NCCL to choose the most efficient communication paths for the system topology. Azure ND-series VMs, such as the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. If you are using a VM image from the Azure AI and HPC marketplace, the topology file is located in the /opt/microsoft directory. The topology files are also available in the azhpc-images GitHub repository under the topology directory.

Performance experiments

This section presents results that demonstrate the benefit of correct CPU pinning via the NCCL topology file. We use the NCCL Tests benchmarks and run experiments for the different NCCL calls, comparing the performance of the default mpirun binding to the correct binding specified via the NCCL topology file.

Setup

The setup used for the experiments is:

Two NDv5 nodes: 8 H100 GPUs per node, 8 400G NICs per node
Azure VM image: microsoft-dsvm:ubuntu-hpc:2204:latest
NCCL version: 2.25.1+cuda12.6
MPI library: HPC-X v2.18
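To sanity-check the GPU- and HCA-to-NUMA mapping on a running VM, the sysfs queries above can be wrapped in a short loop. This is a minimal sketch; the PCI vendor IDs (10de for NVIDIA, 15b3 for Mellanox) are assumptions about the devices present and may also match additional functions such as audio or bridge devices:

# print the NUMA node of every NVIDIA (10de) and Mellanox (15b3) PCI device
for vendor in 10de 15b3; do
  for bdf in $(lspci -D -d "${vendor}:" | awk '{print $1}'); do
    echo "device ${bdf} (vendor ${vendor}) -> NUMA node $(cat /sys/bus/pci/devices/${bdf}/numa_node)"
  done
done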
Impact of System Topology

In this section, we show the performance impact of system topology awareness in NCCL. We show the NCCL benchmark performance results comparing the default and topology-aware configurations.

For the default case (default mpirun binding), we use the following command line to launch the NCCL benchmarks:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

For the system topology-aware case, we use the following command line:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

Here, hostfile is a file containing the IP addresses of the two nodes, and NCCL_BENCHMARK is the NCCL benchmark name. We run experiments with all six benchmarks: all_reduce_perf, all_gather_perf, sendrecv_perf, reduce_scatter_perf, broadcast_perf, and alltoall_perf. <topo-file.xml> is the Azure SKU topology file that can be obtained as described in the "NCCL Topology file" section. NCCL topology files for the different Azure HPC/AI SKUs are available here: Topology Files.

Topology file-based binding is enabled by setting the NCCL_TOPO_FILE variable to the path of the NCCL topology file and setting NCCL_IGNORE_CPU_AFFINITY to 1. Setting NCCL_IGNORE_CPU_AFFINITY to 1 is crucial so that NCCL assigns the process affinity based solely on the NCCL topology file. If this variable is not set, NCCL honors both the affinity set by the MPI library and the NCCL topology file by setting the affinity to the intersection of the two sets; if the intersection is empty, NCCL simply retains the affinity set by the MPI library.

Additional NCCL Optimizations

We also list some additional NCCL tuning configurations for better performance. Note that these drive the benchmark performance to its peak, but it is important to fine-tune these parameters for the end training application, as some of them affect SM utilization.

Config                        Value
NCCL_MIN_CHANNELS             32
NCCL_P2P_NET_CHUNKSIZE        512K
NCCL_IB_QPS_PER_CONNECTION    4
NCCL_PXN_DISABLE              1

Increasing NCCL_MIN_CHANNELS to 32 increases the throughput of certain collectives (especially ncclReduceScatter). Increasing NCCL_P2P_NET_CHUNKSIZE to 512K (from the default value of 128K) gives better throughput for ncclSend and ncclRecv calls when channel buffers are used instead of user-registered buffers for communication. Increasing NCCL_IB_QPS_PER_CONNECTION from 1 to 4 also slightly increases the throughput of collectives. Setting NCCL_PXN_DISABLE to 1 is essential to enable the zero-copy design for ncclSend and ncclRecv calls. Inter-node zero-copy designs are present in NCCL 2.24 onwards but are activated only if NCCL_PXN_DISABLE is set to 1 and the user buffers are registered with the NIC via NCCL calls (the "-R 1" flag registers user buffers in NCCL Tests). We found the bandwidth of the zero-copy point-to-point design to be around 10 GB/s higher than the copy-based design.
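For convenience, the topology-aware launch and the tuning parameters above can be combined into a single command line. The following sketch simply merges the two; NCCL_P2P_NET_CHUNKSIZE is given here in bytes (512K = 524288), and all_reduce_perf is used as the example benchmark:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  -x NCCL_MIN_CHANNELS=32 \
  -x NCCL_P2P_NET_CHUNKSIZE=524288 \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  -x NCCL_PXN_DISABLE=1 \
  /opt/nccl-tests/build/all_reduce_perf -b 4 -e 8G -g 1 -c 0 -f 2 -R 1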
Results

Small message sizes

In the figure above, we compare the small-message latency of the different collectives using the NCCL topology file-based binding against the default mpirun binding. For all the NCCL benchmarks, we see consistently higher performance with the NCCL topology file-based binding. The reason for this improvement is that, for small message sizes, NCCL uses the Low Latency (LL) protocol, which uses pinned CPU buffers for inter-node communication. In the LL protocol, the GPU copies the data to be sent over the network to the CPU buffer. The CPU thread polls this buffer to check whether all the flags are set and, once they are, sends the data over the network.

With the default mpirun pinning, all eight processes per node and their allocated memory are on NUMA node 0. However, GPUs 0-3 have affinity to NUMA node 0 and GPUs 4-7 have affinity to NUMA node 1. Therefore, the data copied from GPUs 4-7 needs to traverse the NUMA interconnect to reach the pinned CPU buffer. Furthermore, while communicating the CPU buffer via the NIC closest to the GPU, the inter-NUMA interconnect needs to be traversed again. The figures above clearly show this additional overhead from inter-NUMA interconnect latency. With NCCL topology file-based binding, this overhead does not exist, because the topology file contains the information about which GPUs are attached to each NUMA node and the CPU mask for all the cores in a NUMA node. Using this information, NCCL correctly binds the processes to the NUMA nodes.

Medium message sizes

The medium-message bandwidth is compared in the figure above for the different collectives, again using the NCCL topology file-based binding versus the default mpirun binding. As with small message sizes, NCCL uses the LL protocol for most of these medium-message runs. Nearly all NCCL benchmarks show an improvement in bandwidth with the NCCL topology file-based binding, for the same reason as for the small message sizes.

Large message sizes

The bandwidth for large message sizes is compared in the figure above for the different collectives using the NCCL topology file-based binding versus the default mpirun binding. As the message sizes get larger, there is not much improvement in bandwidth from the NCCL topology file-based binding for any of the NCCL benchmarks. This is because the dominant NCCL protocol for this message size range is SIMPLE. In this protocol, on systems with GPUDirect RDMA (which is the case for Azure ND-series SKUs), the GPU buffer is communicated directly over the network for inter-node communication; there is no intermediate copy of the message to a CPU buffer. In this scenario, GPU-CPU communication is used only to update a few flags for auxiliary tasks such as indicating buffer readiness. The increase in the latency of these auxiliary tasks is insignificant compared to the total message transfer time. Thus, the impact of CPU pinning on the achieved bandwidth diminishes for large message sizes.

Conclusion

This blog post describes the impact of system topology awareness and NCCL tuning on AI workload optimization and lists the relevant NCCL configuration options. The impact of topology awareness is significant for small and medium messages, where NCCL uses the LL protocol. We also show performance results comparing the different configurations and highlight the importance of performance tuning.
To summarize, the recommended configuration for NCCL (versions newer than v2.24) on Azure HPC/AI VMs (NDv4/NDv5) is:

Config                        Value
NCCL_MIN_CHANNELS             32
NCCL_P2P_NET_CHUNKSIZE        512K
NCCL_IB_QPS_PER_CONNECTION    4
NCCL_PXN_DISABLE              1
NCCL_TOPO_FILE                <corresponding-topology-file.xml>
NCCL_IGNORE_CPU_AFFINITY      1

For Azure HPC/AI VMs, the NCCL topology file is configured via /etc/nccl.conf inside the VM image. For container runs inside the VM, it is recommended to mount the NCCL topology file and the /etc/nccl.conf file from the VM into the container.
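Following that recommendation, a containerized run might mount those files as in the sketch below. The image name, workload command, and exact topology file name are placeholders, and --gpus all assumes the NVIDIA container toolkit is installed on the VM:

docker run --rm \
  --gpus all \
  --network=host \
  --ipc=host \
  -v /etc/nccl.conf:/etc/nccl.conf:ro \
  -v /opt/microsoft:/opt/microsoft:ro \
  -e NCCL_TOPO_FILE=/opt/microsoft/<corresponding-topology-file.xml> \
  -e NCCL_IGNORE_CPU_AFFINITY=1 \
  <training-image> <training-command>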
Benchmark EDA workloads on Azure Intel Emerald Rapids (EMR) VMs

Co-authors: Nalla Ram and Wiener Evgeny, Intel

Electronic Design Automation (EDA) consists of a collection of software tools and workflows used for designing semiconductor products, most notably advanced computer chips. With today's rapid pace of innovation, there is increasing demand for higher performance, smaller chip sizes, and lower power consumption. EDA tools require multiple nodes and numerous CPUs (cores) in a cluster to meet this demand. A high-performance network and a centralized file system support this multi-node cluster, ensuring that all components act as a unified whole to provide both consistency and scale-out performance. The cloud offers EDA engineers access to substantial computing capacity, enabling faster design and characterization while removing the need for in-house CPUs that may remain idle during off-peak times. It also provides enhanced quality through higher simulation coverage and facilitates global collaboration among designers.

Objective

In this article, we evaluate the performance of the latest Azure VMs, which use the 5th Gen Intel® Xeon® Platinum 8537C (Emerald Rapids, "EMR") processor, comparing them to the previous Ice Lake generation. We compared the new D64dsv6 and FX64v2 VMs, both running on the same EMR CPU model, against the previous-generation D64dsv5 VM. The new Dsv6 and Dlsv6 series VMs provide two different memory-to-vCPU ratios: Dsv6 supports up to 128 vCPUs and 512 GiB of RAM. The Esv6 and Edsv6 series VMs are also available, offering up to 192 vCPUs and over 1800 GiB of RAM. FXv2-series VMs feature an all-core turbo frequency of up to 4.0 GHz, supporting a 21:1 memory-to-vCPU ratio with base sizes, and an even higher ratio with constrained-core sizes. Please refer to New Azure Dlsv6, Dsv6, Esv6 VMs with new CPU, Azure Boost, and NVMe Support and Azure FXv2-series Virtual Machines based on the Emerald Rapids processor for more details.

We use two leading EDA tools, Cadence Spectre-X and Synopsys VCS, to benchmark design simulations. We explore different real-world scenarios, including single-threaded, multi-threaded, and multiple jobs running on one node. Additionally, we conduct a cost-effectiveness analysis to give silicon companies guidance when considering migrating EDA workloads to Azure.

Testing environment

Figure 1: Testing environment

Compute VMs, the license server VM, and storage all reside in the same Proximity Placement Group to reduce network latency. We used Azure NetApp Files (ANF) as our NFS storage solution with a Premium 4 TiB volume, which provides up to 256 MB/s throughput.

Cadence Spectre-X Use Case

Business Value

Cadence Spectre-X is a top-tier EDA tool designed for large-scale verification simulations of complex analog, RF, and mixed-signal blocks. With multi-threaded simulation capabilities, users can run single analyses, such as TRAN or HB, on a multi-core machine. Spectre-X excels at distributing simulation workloads across the Azure cloud, leveraging thousands of CPU cores to enhance performance and capacity.

Test Case

The test design is a post-layout DSPF design with over 100,000 circuit entries. All tools, design files, and output files are stored on the shared Azure NetApp Files (ANF) volume (please refer to Benefits of using Azure NetApp Files for EDA). Simulations are run by altering the number of threads per job using the +mt option, and the total time for each run is recorded from the output log files. Here is an example command for running a single-threaded (+mt=1) Spectre-X job:

spectre -64 +preset=cx +mt=1 input.scs -o SPECTREX_cx_1t +lqt 0 -f sst2
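To sweep the thread counts used in this test, the command above can be wrapped in a small loop. This is an illustrative sketch: the output names are arbitrary, and wall-clock timing via /usr/bin/time is used as a stand-in for parsing the elapsed time from the Spectre-X log files as described above:

# run Spectre-X with 1, 2, 4, and 8 threads and record the wall-clock time of each run
for nt in 1 2 4 8; do
  /usr/bin/time -o SPECTREX_cx_${nt}t.time -f "%e seconds" \
    spectre -64 +preset=cx +mt=${nt} input.scs -o SPECTREX_cx_${nt}t +lqt 0 -f sst2
done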
During the test, we observed that the number of utilized CPUs matched the number of threads per job. We also noted low storage read/write activity (fewer than 2,000 IOPS), low network bandwidth usage, and minimal memory use. This indicates that Spectre-X is a highly compute-intensive, CPU-bound workload.

Benchmark Results

The table below shows the total elapsed time (in seconds) for the Spectre-X simulation runs with thread counts of 1, 2, 4, and 8; a lower value indicates better performance. These results demonstrate the Spectre-X tool's efficiency in distributing workloads on both the Ice Lake instances (D64dsv5) and the Emerald Rapids instances (D64dsv6 and FX64v2). As expected, the D64dsv6 instances, with an all-core turbo frequency of up to 3.6 GHz, perform 12 to 18% better, and the FX64v2 instances, with a CPU frequency of up to 4.0 GHz, perform 22 to 29% better than the D64dsv5 instances.

Figure 2: Total elapsed time (in seconds) for Spectre-X simulations with thread counts of 1, 2, 4, and 8.
Figure 3: Performance improvement of multi-threaded Spectre-X jobs.

Cost-effectiveness Analysis

By estimating the total time and VM cost of running 500 single-threaded jobs, we found that the D64dsv6 instances were the most cost-effective option, while the FX64v2 instances achieved the fastest total time. This gives customers options depending on whether they prioritize cost savings or faster job completion when choosing Azure EMR VMs.

Figure 4: Cost-effectiveness estimate for running 500 single-threaded Spectre-X jobs.

Synopsys VCS Use Case

Business value

VCS is the Synopsys functional verification solution. A significant portion of compute capacity in chip design (up to 40%) is consumed by front-end validation tools, specifically RTL simulators such as Synopsys VCS. The chip logic design cycle consists of recurrent design and validation cycles. The validation cycle involves running a set of simulator tests (a regression) on the latest design snapshot. The ability to scale and accelerate the VCS test regression and keep validation up to date with design changes is crucial for a project to meet its time-to-market and quality goals, as shown schematically in Figure 5.

Figure 5: Scaling front-end regression accelerates the design validation cycle and improves quality; in the example above (left panel), Models B and C would land only after Design A validation is completed.

Test case

As a test case, we used a containerized, representative VCS test of an Intel design:

Complex RTL design (>10M gates)
SVTB (SystemVerilog Test Bench) simulation test running 100K cycles
Resident memory footprint per simulation instance of 7 GB

Benchmark Results

We ran our test-case VCS simulation separately on the D64dsv5 (Xeon 3, Ice Lake) system and the FX64v2 (Xeon 5, Emerald Rapids) system, scaling from 1 to 32 parallel VCS tests on each. Since VCS simulation is a CPU-intensive application, we anticipated performance acceleration on the FX64v2. The new Emerald Rapids CPU architecture offers higher instructions per cycle (IPC) at the same frequency compared to previous generations. It operates at a higher all-core turbo frequency (4.0 GHz vs. 3.5 GHz for Ice Lake), features larger L2 and L3 caches and faster UPI NUMA links, supports DDR5 memory instead of DDR4, and includes PCIe 5.0 compared to PCIe 4.0 in Ice Lake.
The results presented here are for the specific Intel IP design used; results may vary based on individual configuration and the design used for testing. As expected, we observed a speedup of 17 to 43% for the Emerald Rapids instance compared to the Ice Lake instance across the range of simultaneous simulations shown in the charts below (see Figure 6 and Figure 7).

Figure 6: Scaling the parallel VCS simulation tests on Emerald Rapids vs. Ice Lake Azure instances. The vertical axis shows average simulation test runtime in seconds.
Figure 7: Speedup percentage for VCS on the Emerald Rapids instance compared to the Ice Lake Azure instance.

Summary

This article evaluates the performance of the latest Azure VMs using the 5th Gen Intel® Xeon® Platinum 8537C (Emerald Rapids) processor by comparing them to the previous Ice Lake generation. Using two EDA tools, Cadence Spectre-X and Synopsys VCS, the benchmarks cover real-world scenarios including single-threaded, multi-threaded, and multiple jobs running on one node. Results show that Spectre-X performs 12 to 18% better on D64ds v6 instances and 22 to 29% better on FX64v2 instances compared to D64ds v5 instances. The D64ds v6 instances were found to be more cost-effective, while the FX64v2 instances achieved the shortest total runtime. For Synopsys VCS, the benchmarks revealed a speedup of 17 to 43% for Emerald Rapids instances over Ice Lake instances across various parallel simulation counts. These findings give EDA customers guidance on which Azure EMR instances to select based on cost-efficiency analysis.
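The cost-effectiveness estimate above can be reproduced with simple arithmetic once per-job runtimes are measured. A minimal sketch with hypothetical runtime and price values (replace them with your measured single-threaded runtimes and your region's VM rates):

JOBS=500
CORES_PER_VM=64              # D64dsv6 / FX64v2 vCPU count
RUNTIME_PER_JOB_SEC=3600     # hypothetical single-threaded Spectre-X runtime
PRICE_PER_HOUR=3.50          # hypothetical pay-as-you-go VM price (USD/hour)

# one single-threaded job per vCPU, run in "waves" until all jobs complete
WAVES=$(( (JOBS + CORES_PER_VM - 1) / CORES_PER_VM ))
TOTAL_HOURS=$(echo "$WAVES * $RUNTIME_PER_JOB_SEC / 3600" | bc -l)
EST_COST=$(echo "$TOTAL_HOURS * $PRICE_PER_HOUR" | bc -l)
echo "waves=$WAVES total_hours=$TOTAL_HOURS estimated_cost_usd=$EST_COST"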
Announcing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC

Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite and designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025, and see them in action at Supercomputing 2024 in Atlanta.

Optimizing Language Model Inference on Azure
Inefficient inference optimization can lead to skyrocketing costs for customers, making it crucial to establish clear performance benchmarking numbers. This blog sets the standard for expected performance, helping customers make informed decisions that maximize efficiency and minimize expenses with the new Azure ND H200 v5-series.

Azure NV V710 v5: Empowering Real-Time AI/ML Inferencing and Advanced Visualization in the Cloud
Today, we’re excited to introduce the Azure NV V710 v5, the latest VM tailored for small-to-medium AI/ML inferencing, Virtual Desktop Infrastructure (VDI), visualization, and cloud gaming workloads.

Introducing the Azure NMads MA35D: Exclusive Media Processing Powerhouse in the Cloud
Azure is proud to announce the exclusive launch of the NMads MA35D Virtual Machine (VM) – a groundbreaking addition to our portfolio specifically designed to redefine media processing and video transcoding in the cloud.

Benchmarking 6th gen. Intel-based Dv6 (preview) VM SKUs for HPC Workloads in Financial Services
Introduction

In the fast-paced world of Financial Services, High-Performance Computing (HPC) systems in the cloud have become indispensable. From instrument pricing and risk evaluations to portfolio optimizations and regulatory workloads like CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC systems that require complex parallelization frameworks (e.g. depending on MPI and InfiniBand networking), many financial calculations can be executed efficiently on general-purpose SKUs in Azure. Depending on the codes used to perform the calculations, many implementations leverage vendor-specific optimizations such as AVX-512 from Intel. With the recent announcement of the public preview of the 6th generation of Intel-based Dv6 VMs (see here), this article explores the performance evolution across three generations of D32ds VMs, from D32dsv4 to D32dsv6. We follow a testing methodology similar to the January 2023 article "Benchmarking on Azure HPC SKUs for Financial Services Workloads" (link here).

Overview of the D-Series VMs in focus

The official announcement notes that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. Key highlights include:

Up to 27% higher vCPU performance and a threefold increase in L3 cache compared to the previous generation Intel Dl/D/Ev5 VMs.
Support for up to 192 vCPUs and more than 1.8 TiB of memory.
Azure Boost, which provides up to 400,000 IOPS and 12 GB/s remote storage throughput, and up to 200 Gbps VM network bandwidth.
A 46% increase in local SSD capacity and more than three times the read IOPS.
NVMe interface for both local and remote disks.

Note: Enhanced security through Total Memory Encryption (TME) technology is not activated in the preview deployment and will be benchmarked once available.

Technical specifications for three generations of D32ds SKUs

VM Name               D32ds_v4 / D32ds_v5 / D32ds_v6
Number of vCPUs       32 / 32 / 32
InfiniBand            N/A / N/A / N/A
Processor             Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) / Intel® Xeon® Platinum 8370C (Ice Lake) / Intel® Xeon® Platinum 8573C (Emerald Rapids)
Peak CPU Frequency    3.4 GHz / 3.5 GHz / 3.0 GHz
RAM per VM            128 GB / 128 GB / 128 GB
RAM per core          4 GB / 4 GB / 4 GB
Attached Disk         1200 GiB SSD / 1200 GiB SSD / 440 GiB SSD

Benchmarking Setup

For our benchmarking setup, we utilised the user-friendly, open-source Phoronix Test Suite (link) to run two tests from the OpenBenchmarking.org finance test suite, specifically targeting quantitative finance workloads. The tests in the finance suite are divided into two groups, each running independent benchmarks. In addition to the finance test suite, we also ran AI-Benchmark to evaluate the evolution of AI inferencing capabilities across the three VM generations.
Finance Bench         QuantLib    AI Benchmark
Bonds OpenMP          Size XXS    Device Inference Score
Repo OpenMP           Size S      Device AI Score
Monte-Carlo OpenMP                Device Training Score

Software dependencies

Component                 Version
OS Image                  Ubuntu marketplace image: 24_04-lts
Phoronix Test Suite       10.8.5
QuantLib Benchmark        1.35-dev
Finance Bench Benchmark   2016-07-25
AI Benchmark Alpha        0.1.2
Python                    3.12.3

To run the benchmark on a freshly created D-Series VM, execute the following commands (after updating the installed packages to the latest version):

git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
cd phoronix-test-suite
sudo apt-get install php-cli php-xml cmake
sudo ./install-sh
phoronix-test-suite benchmark finance

For the AI Benchmark tests, a few additional steps are required, for example creating a virtual environment for the additional Python packages and installing the tensorflow and ai-benchmark packages:

sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv virtualenv
source virtualenv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark

Benchmarking Runtimes and Results

The purpose of this article is to share the results of a set of benchmarks that closely align with the use cases mentioned in the introduction. Most of these use cases are predominantly CPU-bound, which is why we have limited the benchmark to D-Series VMs. For memory-bound codes that would benefit from a higher memory-to-core ratio, the new Ev6 SKU could be a suitable option. In the picture below, you can see a representative benchmarking run on a Dv6 VM, where nearly 100% of the CPUs were utilised during execution. The individual runs of the Phoronix test suite, starting with Finance Bench and followed by QuantLib, are clearly visible.

Runtimes

Benchmark            VM Size             Start Time   End Time   Duration   Minutes
Finance Benchmark    Standard D32ds v4   12:08        15:29      03:21      201.00
Finance Benchmark    Standard D32ds v5   11:38        14:12      02:34      154.00
Finance Benchmark    Standard D32ds v6   11:39        13:27      01:48      108.00

Finance Bench Results
QuantLib Results
AI Benchmark Alpha Results

Discussion of the results

The results show significant performance improvements in QuantLib across the D32v4, D32v5, and D32v6 versions. Specifically, the tasks per second for Size S increased by 47.18% from D32v5 to D32v6, while Size XXS saw an increase of 45.55%. Benchmark times for 'Repo OpenMP' and 'Bonds OpenMP' also decreased, indicating better performance: 'Repo OpenMP' times were reduced by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6; similarly, 'Bonds OpenMP' times decreased by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6.

In terms of Monte-Carlo OpenMP performance, the D32v6 showed the best result with a time of 51,927.04 ms, followed by the D32v5 at 56,443.91 ms and the D32v4 at 57,093.94 ms. The improvements were -1.14% from D32v4 to D32v5 and -8.00% from D32v5 to D32v6.

AI Benchmark Alpha scores for device inference and training also improved significantly. Inference scores increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6. Training scores increased by 21.82% from D32v4 to D32v5 and by 43.49% from D32v5 to D32v6. Finally, the Device AI scores improved across the versions, with the D32v4 scoring 6726, the D32v5 scoring 7996, and the D32v6 scoring 11436; the percentage increases were 18.88% from D32v4 to D32v5 and 43.02% from D32v5 to D32v6.
Next Steps & Final Comments

The public preview of the new Intel SKUs has already shown very promising benchmarking results, indicating a significant performance improvement over the previous D-series generations, which are still widely used in FSI scenarios. Note that your custom code or purchased libraries might exhibit different characteristics than the selected benchmarks, so we recommend validating the performance indicators with your own setup. In this benchmarking setup, we have not disabled Hyper-Threading on the CPUs, so the available cores are exposed as virtual cores; if this scenario is of interest to you, please reach out to the authors for more information. Additionally, Azure offers a wide range of VM families to suit various needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs such as the HC and HB VMs. A dedicated validation based on your individual code and workload is recommended here as well, to ensure the best-suited SKU is selected for the task at hand.
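As noted above, Hyper-Threading was left enabled in this setup. A quick way to check how threads are exposed on a given VM before running your own benchmarks is a minimal sketch like the following:

# show sockets, cores, and threads per core as exposed to the guest
lscpu | grep -E "^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))"

# per-vCPU view of core siblings (vCPUs sharing a physical core)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list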