virtual machines
Experience Next-Gen HPC Innovation: AMD Lab Empowers ‘Try Before You Buy’ on Azure
In today’s fast-paced digital landscape, High-Performance Computing (HPC) is a critical engine powering innovation across industries—from automotive and aerospace to energy and manufacturing. To keep pace with escalating performance demands and the need for agile, risk-free testing environments, AMD has partnered with Microsoft and leading Independent Software Vendors (ISVs) to introduce the AMD HPC Innovation Lab. This pioneering sandbox environment on Azure is a “try before you buy” solution designed to empower customers to run their HPC workloads, assess performance, and experience AMD’s newest hardware innovations that deliver enhanced performance, scalability, and consistency—all without any financial commitment.

Introducing the AMD Innovation Lab: A New Paradigm in Customer Engagement
The AMD HPC Innovation Lab represents a paradigm shift in customer engagement for HPC solutions. Traditionally, organizations had to invest significant time and resources to build and manage on-premises testing environments, dealing with challenges such as hardware maintenance, scalability issues, and high operational costs. Without the opportunity to explore cloud solutions through a trial offer, they often missed out on the advantages of cloud computing. With this lab, customers can now experiment with optimized HPC environments through a simple, user-friendly interface. The process is straightforward: upload your input file or choose from the pre-configured options, run your workload, and then download your output file for analysis. This streamlined approach allows businesses to compare performance results on an apples-to-apples basis against other providers or existing on-premises setups.

Empowering Decision Makers
For Business Decision Makers (BDMs) and Technical Decision Makers (TDMs), the lab offers a compelling value proposition. It eliminates the complexities and uncertainties often associated with traditional testing environments by providing a risk-free opportunity to:
Thoroughly Evaluate Performance: With access to AMD’s cutting-edge chipsets and Azure’s robust cloud infrastructure, organizations can conduct detailed proof-of-concept evaluations without incurring long-term costs.
Accelerate Decision-Making: The streamlined testing process not only speeds up the evaluation phase but also accelerates the overall time to value, enabling organizations to make informed decisions quickly.
Optimize Infrastructure: Created in partnership with ISVs and optimized by both AMD and Microsoft, the lab ensures that the infrastructure is fine-tuned for HPC workloads, so performance assessments are both accurate and reflective of real-world scenarios.

Seamless Integration with Leading ISVs
A notable strength of the AMD HPC Innovation Lab is its collaborative design with top ISVs such as Ansys, Altair, and Siemens. These partnerships ensure that the lab’s environment is equipped with industry-leading applications and solvers, such as Ansys Fluent for fluid dynamics and Ansys Mechanical for structural analysis. Each solver is pre-configured to provide a balanced and consistent performance evaluation, ensuring that users can benchmark their HPC workloads against industry standards with ease.

Sustainability and Scalability
Beyond performance and ease of use, the AMD HPC Innovation Lab is built with sustainability in mind.
By leveraging Azure’s scalable cloud infrastructure, businesses can conduct HPC tests without the overhead and environmental impact of maintaining additional on-premises resources. This not only helps reduce operational costs but also supports corporate sustainability goals by minimizing the carbon footprint associated with traditional HPC setups.

An Exciting Future for HPC Testing
The innovation behind the AMD HPC Innovation Lab is just the beginning. With plans to continuously expand the lab catalog and include more ISVs, the platform is set to evolve into a comprehensive testing ecosystem. This ongoing expansion will provide customers with an increasingly diverse set of tools and environments tailored to meet a wide array of HPC needs. Whether you’re evaluating performance for fluid dynamics, structural simulations, or electromagnetic fields, the lab’s growing catalog promises to deliver precise and actionable insights.

Ready to Experience the Future of HPC?
The AMD HPC Innovation Lab on Azure offers a unique and exciting opportunity for organizations looking to harness the power of advanced computing without upfront financial risk. With its intuitive interface, optimized infrastructure, and robust ecosystem of ISVs, this sandbox environment is a game-changer in HPC testing and validation. Take advantage of this no-cost, high-impact solution to explore, experiment, and experience firsthand the benefits of AMD-powered HPC on Azure. To learn more and sign up for the program, visit https://aka.ms/AMDInnovationLab/LearnMore

Running DeepSeek-R1 on a single NDv5 MI300X VM
Contributors: Davide Vanzo, Yuval Mazor, Jesse Lopez

DeepSeek-R1 is an open-weights reasoning model built on DeepSeek-V3, designed for conversational AI, coding, and complex problem-solving. It has gained significant attention beyond the AI/ML community due to its strong reasoning capabilities, often competing with OpenAI’s models. One of its key advantages is that it can be run locally, giving users full control over their data. The NDv5 MI300X VM features 8x AMD Instinct MI300X GPUs, each equipped with 192GB of HBM3 and interconnected via Infinity Fabric 3.0. With up to 5.2 TB/s of memory bandwidth per GPU, the MI300X provides the capacity and speed to process large models efficiently, enabling users to run DeepSeek-R1 at full precision on a single VM. In this blog post, we’ll walk you through the steps to provision an NDv5 MI300X instance on Azure and run DeepSeek-R1 for inference using the SGLang inference framework.

Launching an NDv5 MI300X VM

Prerequisites
Check that your subscription has sufficient vCPU quota for the VM family “StandardNDISv5MI300X” (see Quota documentation). If needed, contact your Microsoft account representative to request a quota increase.
A Bash terminal with Azure CLI installed and logged into the appropriate tenant. Alternatively, Azure Cloud Shell can also be used.

Provision the VM
1. Using Azure CLI, create an Ubuntu-22.04 VM on ND_MI300x_v5:

az group create --location <REGION> -n <RESOURCE_GROUP_NAME>

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH>

Optionally, the deployment can use the cloud-init.yaml file, specified as --custom-data <CLOUD_INIT_FILE_PATH>, to automate the additional preparation described below:

az vm create --name mi300x --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --ssh-key-values <PUBLIC_SSH_PATH> --custom-data <CLOUD_INIT_FILE_PATH>

Note: The GPU drivers may take a couple of minutes to fully load after the VM has been created.

Additional preparation
Beyond provisioning the VM, there are additional steps to prepare the environment to optimally run DeepSeek-R1 or other AI workloads, including setting up the 8 NVMe disks on the node in a RAID-0 configuration to act as the cache location for Docker and Hugging Face. The following steps assume you have connected to the VM and are working in a Bash shell.

1. Prepare the NVMe disks in a RAID-0 configuration:

mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme

2. Configure Hugging Face to use the RAID-0. This environment variable should also be propagated to any containers pulling images or data from Hugging Face.

mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
3. Configure Docker to use the RAID-0:

mkdir -p /mnt/resource_nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
  "data-root": "/mnt/resource_nvme/docker"
}
EOF
sudo chmod 0644 /etc/docker/daemon.json
sudo systemctl restart docker

All of these additional preparation steps can be automated at VM creation using cloud-init. The example cloud-init.yaml file can be used in provisioning the VM as described above.

#cloud-config
package_update: true

write_files:
  - path: /opt/setup_nvme.sh
    permissions: '0755'
    owner: root:root
    content: |
      #!/bin/bash
      NVME_DISKS_NAME=`ls /dev/nvme*n1`
      NVME_DISKS=`ls -latr /dev/nvme*n1 | wc -l`
      echo "Number of NVMe Disks: $NVME_DISKS"
      if [ "$NVME_DISKS" == "0" ]
      then
        exit 0
      else
        mkdir -p /mnt/resource_nvme
        # Needed in case something did not unmount as expected. This will delete any data that may be left behind
        mdadm --stop /dev/md*
        mdadm --create /dev/md128 -f --run --level 0 --raid-devices $NVME_DISKS $NVME_DISKS_NAME
        mkfs.xfs -f /dev/md128
        mount /dev/md128 /mnt/resource_nvme
      fi
      chmod 1777 /mnt/resource_nvme
  - path: /etc/profile.d/hf_home.sh
    permissions: '0755'
    content: |
      export HF_HOME=/mnt/resource_nvme/hf_cache
  - path: /etc/docker/daemon.json
    permissions: '0644'
    content: |
      {
        "data-root": "/mnt/resource_nvme/docker"
      }

runcmd:
  - ["/bin/bash", "/opt/setup_nvme.sh"]
  - mkdir -p /mnt/resource_nvme/docker
  - mkdir -p /mnt/resource_nvme/hf_cache
  # PAM group not working for docker group, so this will add all users to docker group
  - bash -c 'for USER in $(ls /home); do usermod -aG docker $USER; done'
  - systemctl restart docker

Using MI300X
If you are familiar with NVIDIA and CUDA tools and environments, AMD provides equivalents as part of the ROCm stack.

MI300X + ROCm | Nvidia + CUDA | Description
rocm-smi | nvidia-smi | CLI for monitoring the system and making changes
rccl | nccl | Library for communication between GPUs

Running DeepSeek-R1
1. Pull the container image. It is O(10) GB in size, so it may take a few minutes to download.

docker pull rocm/sglang-staging:20250303

2. Start the SGLang server. The model (~642 GB) is downloaded the first time the server is launched and will take at least a few minutes to download. Once the application outputs “The server is fired up and ready to roll!”, you can begin making queries to the model.

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sglang-staging:20250303 \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --host 0.0.0.0

3. You can now make queries to DeepSeek-R1. For example, these requests to the model from another shell on the same host return model metadata and generate a sample response.

curl http://localhost:30000/get_model_info
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}

curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{ "text": "Once upon a time,", "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 } }'

Conclusion
In this post, we detail how to run the full-size 671B DeepSeek-R1 model on a single Azure NDv5 MI300X instance. This includes setting up the machine, installing the necessary drivers, and executing the model. Happy inferencing!
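Beyond the native /generate endpoint shown above, SGLang also exposes an OpenAI-compatible API, which lets existing OpenAI-style clients point at the local server. The following is a minimal sketch assuming the server from step 2 is still running on port 30000 and that this container build exposes the /v1/chat/completions route (check the SGLang documentation referenced below if it does not):

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Explain RAID-0 in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.6
  }'

If the route is available, the response is a standard chat-completion JSON object, so client libraries that speak the OpenAI API can be reused unchanged.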
References
https://github.com/deepseek-ai/DeepSeek-R1
https://github.com/deepseek-ai/DeepSeek-V3
https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html
https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/azure-announces-new-ai-optimized-vm-series-featuring-amd%e2%80%99s-flagship-mi300x-gpu/3980770
https://docs.sglang.ai/index.html

Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology file
(Co-authored by: Rafael Salas, Sreevatsa Anantharamu, Jithin Jose, Elizabeth Fatade)

Introduction

NCCL
The NVIDIA Collective Communications Library (NCCL) is one of the most widely used communication libraries for AI training and inference. It features GPU-focused collective and point-to-point communication designs that are vital for AI workloads.

NUMA-GPU-HCA affinity
Non-uniform memory access (NUMA) architecture splits the CPU cores into groups attached to their own local memory. GPUs and HCAs are connected to a specific NUMA node via PCIe interconnect. NCCL launches CPU processes to coordinate GPU-related inter-node communication, so process-to-NUMA binding matters a great deal for ensuring optimal performance. NCCL uses the Low Latency (LL) protocol for small-to-medium message communication. In this protocol, the GPU data is copied to a pinned CPU buffer before being sent over the network. Copying data from a GPU to a NUMA node that it is not directly connected to adds inter-NUMA communication overhead. Furthermore, communicating this CPU buffer through a NIC connected to the neighboring NUMA node also requires additional inter-NUMA communication. The following diagram shows the system topology of the NVIDIA DGX H100 system as a reference.

H100 system topology (figure from https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html)

To determine the CPU cores that belong to a NUMA node, use:
lscpu

To determine the GPU-to-NUMA affinity, use:
cat /sys/bus/pci/devices/<busIdOfGPU>/numa_node
or, for NVIDIA GPUs, you can alternatively use:
nvidia-smi topo -m

To determine the HCA-to-NUMA mapping, use:
cat /sys/bus/pci/devices/<busIdOfHCA>/numa_node
or use:
lstopo

NCCL Topology file
Application developers can pass in the system topology by specifying an NCCL topology file when launching the job. The NCCL topology file provides the following information to the NCCL library:
GPU-to-NUMA mapping
GPU-to-HCA and HCA-to-NUMA mapping
NUMA-to-CPU-core mapping
Speed and type of GPU-GPU interconnect
This enables NCCL to choose the most efficient system topology for communication. Azure ND-series VMs, like the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. If you are using a VM image from the AI and HPC marketplace, this topology file is located in the /opt/microsoft directory. The topology files are also located in the azhpc-images GitHub repository under the topology directory. A consolidated affinity check is sketched after the setup description below.

Performance experiments
This section presents results that demonstrate the benefit of correct CPU pinning via the NCCL topology file. We use the NCCL Tests benchmarks and run experiments for the different NCCL calls, comparing the performance of the default mpirun binding to the correct binding specified via the NCCL topology file.

Setup
The setup used for the experiments is:
Two NDv5 nodes: 8 H100 GPUs per node, 8 400G NICs per node
Azure VM image: microsoft-dsvm:ubuntu-hpc:2204:latest
NCCL version: 2.25.1+cuda12.6
MPI library: HPC-X v2.18
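Putting the affinity checks above together, here is a minimal sketch that prints the NUMA node of every GPU and HCA on a node. The sysfs paths are standard Linux; the vendor-ID matching (0x10de for NVIDIA, 0x15b3 for Mellanox) is an assumption that may need adjusting for other SKUs:

#!/bin/bash
# Print the NUMA node of each NVIDIA GPU and Mellanox HCA visible in sysfs.
# numa_node reads -1 when the platform does not report an affinity.
echo "GPUs (vendor 0x10de):"
for dev in /sys/bus/pci/devices/*; do
  if [ "$(cat "$dev/vendor" 2>/dev/null)" = "0x10de" ]; then
    echo "  $(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
  fi
done
echo "HCAs (vendor 0x15b3):"
for dev in /sys/bus/pci/devices/*; do
  if [ "$(cat "$dev/vendor" 2>/dev/null)" = "0x15b3" ]; then
    echo "  $(basename "$dev") -> NUMA node $(cat "$dev/numa_node")"
  fi
done

On a healthy ND-series node the GPUs and NICs should appear split across the NUMA nodes, matching the mapping encoded in the NCCL topology file.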
Impact of System Topology
In this section, we show the performance impact of system topology awareness in NCCL by comparing NCCL benchmark results for the default and topology-aware configurations.

For the default case (default mpirun binding), we use the following command line for launching the NCCL benchmarks:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

For the system topology-aware case, we use the following command line:

mpirun -np 16 \
  --map-by ppr:8:node \
  -hostfile ./hostfile \
  -mca coll_hcoll_enable 0 \
  -x LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x UCX_TLS=rc \
  -x UCX_NET_DEVICES=mlx5_ib0:1 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_DEBUG=WARN \
  -x NCCL_TOPO_FILE=<topo-file.xml> \
  -x NCCL_IGNORE_CPU_AFFINITY=1 \
  /opt/nccl-tests/build/$NCCL_BENCHMARK -b 4 -e 8G -g 1 -c 0 -f 2 -R 1

Here, hostfile is a file containing the IP addresses of the two nodes, and NCCL_BENCHMARK is the NCCL benchmark name. We run experiments with all six benchmarks: all_reduce_perf, all_gather_perf, sendrecv_perf, reduce_scatter_perf, broadcast_perf, and alltoall_perf (a sketch that loops over all six appears after this section). <topo-file.xml> is the Azure SKU topology file that can be obtained as described in the "NCCL Topology file" section. NCCL topology files for different Azure HPC/AI SKUs are available here: Topology Files.

Topology file-based binding is enabled by setting the NCCL_TOPO_FILE variable to the path of the NCCL topology file and setting NCCL_IGNORE_CPU_AFFINITY to 1. Setting NCCL_IGNORE_CPU_AFFINITY to 1 is crucial for assigning the process affinity in NCCL solely based on the NCCL topology file. If this variable is not set, NCCL honors both the affinity set by the MPI library and the NCCL topology file by setting the affinity to the intersection of the two sets. If the intersection is empty, NCCL simply retains the affinity set by the MPI library.

Additional NCCL Optimizations
We also list some additional NCCL tuning configurations for better performance. Note that these drive the benchmark performance to the peak, but it is important to fine-tune these parameters for the end training application, as some of them have an impact on SM utilization.

Config | Value
NCCL_MIN_CHANNELS | 32
NCCL_P2P_NET_CHUNKSIZE | 512K
NCCL_IB_QPS_PER_CONNECTION | 4
NCCL_PXN_DISABLE | 1

Increasing NCCL_MIN_CHANNELS to 32 increases the throughput of certain collectives (especially ncclReduceScatter). Increasing NCCL_P2P_NET_CHUNKSIZE to 512K (from the default value of 128K) gives better throughput for ncclSend and ncclRecv calls when channel buffers are used instead of user-registered buffers for communication. Increasing NCCL_IB_QPS_PER_CONNECTION from 1 to 4 also slightly increases the throughput of collectives. Setting NCCL_PXN_DISABLE to 1 is essential to enable the zero-copy design for ncclSend and ncclRecv calls. Inter-node zero-copy designs are present in NCCL 2.24 onwards but are activated only if NCCL_PXN_DISABLE is set to 1 and the user buffers are registered with the NIC via NCCL calls (the "-R 1" flag registers user buffers in NCCL Tests). We found the bandwidth of the zero-copy point-to-point design to be around 10 GB/s higher than the copy-based design.
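To reproduce the comparison across all six benchmarks without retyping the command, a small wrapper like the following can help. This is a minimal sketch assuming the paths and two-node hostfile from the setup above; the topology file path is an example and should be replaced with the file matching your SKU:

#!/bin/bash
# Run all six NCCL Tests benchmarks with topology-aware binding.
TOPO_FILE=/opt/microsoft/ndv5-topo.xml   # assumed path; use the topology file for your SKU
HOSTFILE=./hostfile

for NCCL_BENCHMARK in all_reduce_perf all_gather_perf sendrecv_perf \
                      reduce_scatter_perf broadcast_perf alltoall_perf; do
  echo "=== ${NCCL_BENCHMARK} ==="
  mpirun -np 16 \
    --map-by ppr:8:node \
    -hostfile ${HOSTFILE} \
    -mca coll_hcoll_enable 0 \
    -x LD_LIBRARY_PATH \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -x UCX_TLS=rc \
    -x UCX_NET_DEVICES=mlx5_ib0:1 \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -x NCCL_DEBUG=WARN \
    -x NCCL_TOPO_FILE=${TOPO_FILE} \
    -x NCCL_IGNORE_CPU_AFFINITY=1 \
    /opt/nccl-tests/build/${NCCL_BENCHMARK} -b 4 -e 8G -g 1 -c 0 -f 2 -R 1
done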
Results

Small message sizes
In the figure above, we compare the small message latency for the different collectives using the NCCL topology file-based binding versus the default mpirun binding. For all the NCCL benchmarks, we see consistently higher performance when using the NCCL topology file-based binding. The reason for this improvement is that, for small message sizes, NCCL uses the Low Latency (LL) protocol, and this protocol uses pinned CPU buffers for inter-node communication. In the LL protocol, the GPU copies the data to be sent over the network to the CPU buffer. The CPU thread polls this buffer to check if all the flags are set, and once they are, the thread sends the data over the network. With the default mpirun pinning, all eight processes per node and their allocated memory are on NUMA node 0. However, GPUs 0-3 have affinity to NUMA node 0 and GPUs 4-7 have affinity to NUMA node 1. Therefore, the data copied from GPUs 4-7 needs to traverse the NUMA interconnect to reach the pinned CPU buffer. Furthermore, while communicating the CPU buffer via the NIC closest to the GPU, the inter-NUMA interconnect needs to be traversed again. The figures above clearly show this additional inter-NUMA latency overhead. With NCCL topology file-based binding, this overhead does not exist, because the topology file contains information about the GPUs attached to each NUMA node and the CPU mask for all the cores in a NUMA node. Using this information, NCCL correctly binds the processes to the NUMA nodes.

Medium message sizes
The medium message size bandwidth is compared in the figure above for the different collectives using the NCCL topology file-based binding versus the default mpirun binding. As with small message sizes, NCCL uses the LL protocol for most of these medium message runs. Nearly all NCCL benchmarks show an improvement in bandwidth when using the NCCL topology file-based binding, for the same reason as for the small message sizes.

Large message sizes
The bandwidth for large message sizes is compared in the figure above for the different collectives using the NCCL topology file-based binding versus the default mpirun binding. As the message size gets larger, there is not much improvement in bandwidth from the NCCL topology file-based binding for any of the NCCL benchmarks. This is because the dominant NCCL protocol in this message size range is SIMPLE. In this protocol, on systems with GPUDirect RDMA (which is the case for Azure ND-series SKUs), the GPU buffer is communicated directly over the network for inter-node communication; there is no intermediate copy of the message to a CPU buffer. In this scenario, GPU-CPU communication is used only to update a few flags for auxiliary tasks such as indicating buffer readiness. The increase in the latency of these auxiliary tasks is insignificant compared to the total message transfer time. Thus, the impact of CPU pinning on the achieved bandwidth diminishes for large message sizes.
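Independent of message size, it is worth confirming that the intended binding is actually in effect while a benchmark is running. A minimal check, assuming the NCCL Tests binaries live under the path used above (the pgrep pattern is an assumption):

#!/bin/bash
# While a benchmark is running, print the CPU affinity of each local rank.
# With topology-file binding, ranks driving GPUs 4-7 should be pinned to NUMA node 1 cores.
for pid in $(pgrep -f nccl-tests/build); do
  taskset -cp "$pid"
done
# Cross-check against the core ranges reported per NUMA node:
lscpu | grep -i "numa node"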
Conclusion
This blog post describes the impact of system topology awareness and NCCL tuning for AI workload optimization and lists the relevant NCCL configuration options. The impact of topology awareness is significant for small and medium messages, where NCCL uses the LL protocol. We also show the performance results comparing the different configurations and highlight the importance of performance tuning.

To summarize, the recommended configuration for NCCL (versions newer than v2.24) on Azure HPC/AI VMs (NDv4/NDv5) is:

Config | Value
NCCL_MIN_CHANNELS | 32
NCCL_P2P_NET_CHUNKSIZE | 512K
NCCL_IB_QPS_PER_CONNECTION | 4
NCCL_PXN_DISABLE | 1
NCCL_TOPO_FILE | <corresponding-topology-file.xml>
NCCL_IGNORE_CPU_AFFINITY | 1

For Azure HPC/AI VMs, the NCCL topology file is referenced from /etc/nccl.conf inside the VM image. For container runs inside the VM, it is recommended to mount the NCCL topology file and the /etc/nccl.conf file from the VM into the container, as sketched below.
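As a concrete illustration of that recommendation, here is a minimal sketch of a container launch that carries both files into the container. The image name and topology file path are placeholders; use the file shipped with your VM image under /opt/microsoft, and note that --gpus all assumes the NVIDIA container toolkit configured on these images:

docker run --rm -it \
  --gpus all \
  --ipc=host \
  -v /opt/microsoft/ndv5-topo.xml:/opt/microsoft/ndv5-topo.xml:ro \
  -v /etc/nccl.conf:/etc/nccl.conf:ro \
  -e NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml \
  -e NCCL_IGNORE_CPU_AFFINITY=1 \
  <your-training-image> \
  bash

Inside the container, NCCL then reads the same /etc/nccl.conf and topology file as on the host, so the pinning behavior matches the bare-VM runs.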
Network Design Ideas for VMs in Azure
Hello, I am analyzing the current Azure environment at my new job and trying to figure out the architectural choices, mostly networking-wise. Currently, we have 10 VMs, each VM has its own VNet, and they are all in the same region. In my experience so far, I have never seen such a network design in Azure before. If all VMs are in the same region, we could have one VNet and utilize subnets and NSGs to segment the VMs and control the traffic. Having so many different VNets makes it very complex to manage. Looking for opinions on what other people think. Is this just a bad design, or is it meant to keep the VMs separate from each other?

New Da/Ea/Fav6 VMs with increased performance and Azure Boost are now generally available
By Sasha Melamed, Senior Product Manager, Azure Compute

We are excited to announce General Availability of the new Dalsv6, Dasv6, Easv6, Falsv6, Fasv6, and Famsv6-series Azure Virtual Machines (VMs) based on the 4th Gen AMD EPYC™ processor (Genoa). These VMs deliver significantly improved performance and price/performance versus the prior Dasv5 and Easv5 VMs, NVMe connectivity for faster local and remote storage access, and Azure Boost for improved performance and enhanced security. With the broad selection of compute, memory, and storage configurations available with these new VM series, there is a best-fit option for a wide range of workloads.

What’s New
The new Dalsv6, Dasv6, and Easv6 VMs are offered with vCPU counts ranging from 2 to 96 vCPUs. The new general-purpose and memory-optimized VMs come in a variety of memory (GiB)-to-vCPU ratios, including the Dalsv6 at 2:1, Dasv6 at 4:1, and Easv6 at 8:1. The VMs are also available with and without a local disk, so you can choose the option that best fits your workload. Workloads can expect up to 20% CPU performance improvement over the Dasv5 and Easv5 VMs and up to 15% better price/performance.

Further expanding our offerings, we are proud to introduce the first compute-optimized VM series based on AMD processors, also in three memory-to-vCPU ratios. The new Falsv6, Fasv6, and Famsv6 VMs offer the fastest x86 CPU performance in Azure and up to 2x CPU performance improvement over our previous v5 VMs, as shown in the graph below.

We are excited to announce that the new Dalsv6, Dasv6, Easv6, and suite of Fasv6 virtual machines are powered by Azure Boost. Azure Boost has been providing benefits to millions of existing Azure VMs in production today, such as enabling exceptional remote storage performance and significant improvements in networking throughput and latency. Our latest Azure Boost infrastructure innovation, in combination with the new AMD-based VMs, delivers improvements in performance, security, and reliability. The platform provides sub-second servicing capabilities for the most common infrastructure updates, delivering a 10x reduction in impact. To learn more about Azure Boost, read our blog.

To drive the best storage performance for your workloads, the new AMD-based VMs come with the NVMe interface for local and remote disks. Many workloads will benefit from improvements over the previous generation of AMD-based VMs, with up to:
80% better remote storage performance
400% faster local storage speeds
25% networking bandwidth improvement
45% higher NVMe SSD capacity per vCPU for Daldsv6, Dadsv6, and Eadsv6-series VMs with local disks

The 4th Gen AMD EPYC™ processors provide new capabilities for these VMs, including:
Always-On Transparent Secure Memory Encryption, ensuring that your sensitive information remains secure without compromising performance.
AVX-512 to handle compute-intensive tasks such as scientific simulations, financial analytics, AI, and machine learning.
Vector Neural Network Instructions, enhancing the performance of neural network inference operations and making it easier to deploy and scale AI solutions.
Bfloat16 for efficient training and inference of deep learning models, providing a balance between performance and precision.

Dasv6, Dadsv6, Easv6, Eadsv6, Fasv6, and Fadsv6-series VMs are SAP Certified. Whether you’re running a simple test infrastructure, mission-critical enterprise applications, high-performance computing tasks, or AI workloads, our new VMs are ready to meet your needs.
Explore the new capabilities and start leveraging the power of Azure today!

General-purpose workloads
The new Dasv6-series VMs offer a balanced ratio of memory to vCPU performance and increased scalability, up to 96 vCPUs and 384 GiB of RAM. The new Dalsv6-series VMs are ideal for workloads that require less RAM per vCPU, with a maximum of 192 GiB of RAM. The Dalsv6 series is the first 2 GiB/vCPU memory offering in our family of AMD-based VMs and can reduce your costs when running non-memory-intensive applications, including web servers, gaming, video encoding, AI/ML, and batch processing. The Dasv6-series VMs work well for many general computing workloads, such as e-commerce systems, web front ends, desktop virtualization solutions, customer relationship management applications, entry-level and mid-range databases, application servers, and more.

Series | vCPU | Memory (GiB) | Max Local NVMe Disk (GiB) | Max IOPS for Local Disk | Max Uncached Disk IOPS for Managed Disks | Max Managed Disks Throughput (MBps)
Dalsv6 | 2-96 | 4-192 | N/A | N/A | 4 - 172K | 90 - 4,320
Daldsv6 | 2-96 | 4-192 | 1x110 - 6x880 | 1.8M | 4 - 172K | 90 - 4,320
Dasv6 | 2-96 | 8-384 | N/A | N/A | 4 - 172K | 90 - 4,320
Dadsv6 | 2-96 | 8-384 | 1x110 - 6x880 | 1.8M | 4 - 172K | 90 - 4,320

Memory-intensive workloads
For more memory-demanding workloads, the new Easv6-series VMs offer high memory-to-vCPU ratios with increased scalability, up to 96 vCPUs and 672 GiB of RAM. The Easv6-series VMs are ideal for memory-intensive enterprise applications, data warehousing, business intelligence, in-memory analytics, and financial transactions.

Series | vCPU | Memory (GiB) | Max Local NVMe Disk (GiB) | Max IOPS for Local Disk | Max Uncached Disk IOPS for Managed Disks | Max Managed Disks Throughput (MBps)
Easv6 | 2-96 | 16-672 | N/A | N/A | 4 - 172K | 90 - 4,320
Eadsv6 | 2-96 | 16-672 | 1x110 - 6x880 | 1.8M | 4 - 172K | 90 - 4,320

Compute-intensive workloads
For compute-intensive workloads, the new Falsv6, Fasv6, and Famsv6 VM series come without Simultaneous Multithreading (SMT), meaning a vCPU equals one physical core. These VMs are the best fit for workloads demanding the highest CPU performance, such as scientific simulations, financial modeling and risk analysis, gaming, and video rendering.

Series | vCPU | Memory (GiB) | Max Uncached Disk IOPS for Managed Disks | Max Managed Disks Throughput (MBps) | Max Network Bandwidth (Gbps)
Falsv6 | 2-64 | 4-128 | 4 - 115K | 90 - 2,880 | 12.5 - 36
Fasv6 | 2-64 | 8-256 | 4 - 115K | 90 - 2,880 | 12.5 - 36
Famsv6 | 2-64 | 16-512 | 4 - 115K | 90 - 2,880 | 12.5 - 36

Customers are excited about the new AMD v6 VMs
FlashGrid offers software solutions that help Oracle Database users on Azure achieve maximum database uptime and minimize the risk of outages. “The Easv6 series VMs make it easier to support Oracle RAC workloads with heavy transaction processing on Azure using FlashGrid Cluster. The NVMe protocol enhances disk error handling, which is important for failure isolation in high-availability database architectures. The CPU boost frequency of 3.7 GHz and higher network bandwidth per vCPU enable database clusters to handle spikes in client transactions better while keeping a lower vCPU count to limit licensing costs. The Easv6 VMs have passed our extensive reliability and compatibility testing and are now available for new deployments and upgrades.” – Art Danielov, CEO, FlashGrid Inc.

Helio is a platform for large-scale computing workloads, optimizing for costs, scale, and emissions.
Its main focus is 3D rendering. “Our architectural and media & entertainment (VFX) 3D rendering workloads have been accelerated by an average of ~42% with the new v6 generation, while maintaining low cost and high scale. In addition, we are seeing significant improvements in disk performance with the new NVMe interface, resulting in much faster render asset load times.” -- Kevin Häfeli, CEO / Cofounder, Helio AG

Silk's Software-Defined Cloud Storage delivers unparalleled price/performance for the most demanding, real-time applications. “Silk has tested the new Da/Eav6 VM offering from Azure and we are looking forward to enabling our customers to benefit from its new capabilities, allowing higher throughput at lower cost, while providing increased reliability.” -- Adik Sokolovski, Chief R&D Officer, Silk

ZeniMax Online Studios creates online RPG worlds where you can play and create your own stories. “The new VMs we tested provided a significant performance boost in our build tasks. The super-fast storage not only made the workflows smoother and faster, but it also helped highlight other bottlenecks in our design and allowed us to improve our pipeline overall. We are excited for their availability and plan on utilizing these machines to expand our workload in Azure.” -- Merrick Moss, Product Owner, ZeniMax Online Studios

Getting started
The new VMs are now available in the East US, East US 2, Central US, South Central US, West US 3, West Europe, and North Europe regions, with more to follow. Check out pricing on the following pages for Windows and Linux. You can learn more about the new VMs in the documentation for the Dal-series, Da-series, Ea-series, and Fa-series. We also recommend reading the NVMe overview and FAQ. You can find the Ultra Disk and Premium SSD v2 regional availability to pair with the new NVMe-based v6 series at their respective links. A quick-start example of creating one of the new sizes with the Azure CLI follows below.
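Here is a minimal, hedged sketch of that quick start. The size name Standard_D16als_v6 follows the Dalsv6 naming pattern but is an assumption here; confirm the exact size name and regional availability (for example with az vm list-sizes --location <region>) before deploying:

# Resource group in one of the regions where the v6 VMs are available
az group create --name rg-v6-demo --location eastus2

# Create a Dalsv6-series VM (size name assumed: Standard_D16als_v6)
az vm create \
  --resource-group rg-v6-demo \
  --name d16alsv6-demo \
  --image Ubuntu2204 \
  --size Standard_D16als_v6 \
  --admin-username azureuser \
  --generate-ssh-keys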
Boosting Performance with the Latest Generations of Virtual Machines in Azure

Microsoft Azure recently announced the availability of the new generation of VMs (v6)—including the Dl/Dv6 (general purpose) and El/Ev6 (memory-optimized) series. These VMs are powered by the latest Intel Xeon processors and are engineered to deliver:
Up to 30% higher per-core performance compared to previous generations.
Greater scalability, with options of up to 128 vCPUs (Dv6) and 192 vCPUs (Ev6).
Significant enhancements in CPU cache (up to 5× larger), memory bandwidth, and NVMe-enabled storage.
Improved security with features like Intel® Total Memory Encryption (TME) and enhanced networking via the new Microsoft Azure Network Adaptor (MANA).

Evaluated Virtual Machines and Geekbench Results
The table below summarizes the configuration and Geekbench results for the two VMs we tested. VM1 represents a previous-generation machine with more memory, while VM2 is from the new Dld e6 series, showing superior performance despite having half the RAM.

VM1: D16s v5 (16 vCPUs, 64 GB RAM)
VM2: D16ls v6 (16 vCPUs, 32 GB RAM)

Key Observations:
Single-Core Performance: VM2 scores 2013 compared to VM1’s 1570, a 28.2% improvement. This demonstrates that even with half the memory and the same vCPU count, the new Dld e6 series provides significantly better performance per core.
Multi-Core Performance: VM2 achieves a multi-core score of 12,566 versus 9,454 for VM1, a 32.9% increase in performance.
Enhanced Throughput in Specific Workloads:
File Compression: 1909 MB/s (VM2) vs. 1654 MB/s (VM1) – a 15.4% improvement.
Object Detection: 2851 images/s (VM2) vs. 1592 images/s (VM1) – a remarkable 79.2% improvement.
Ray Tracing: 1798 Kpixels/s (VM2) vs. 1512 Kpixels/s (VM1) – an 18.9% boost.
These results reflect the significant advancements enabled by the new generation of Intel processors.

Evolution of Hardware in Azure: From Ice Lake-SP to Emerald Rapids

Technical Specifications of the Processors Evaluated
Understanding the dramatic performance improvements begins with a look at the processor specifications:

Intel Xeon Platinum 8370C (Ice Lake-SP) – VM1
Architecture: Ice Lake-SP
Base Frequency: 2.79 GHz
Max Frequency: 3.5 GHz
L3 Cache: 48 MB
Supported Instructions: AVX-512, VNNI, DL Boost

Intel Xeon Platinum 8573C (Emerald Rapids) – VM2
Architecture: Emerald Rapids
Base Frequency: 2.3 GHz
Max Frequency: 4.2 GHz
L3 Cache: 260 MB
Supported Instructions: AVX-512, AMX, VNNI, DL Boost

Impact on Performance
Cache Size Increase: The jump from 48 MB to 260 MB of L3 cache is a key factor. A larger cache reduces dependency on RAM accesses, thereby lowering latency and significantly boosting performance in memory-intensive workloads such as AI, big data, and scientific simulations.
Enhanced Frequency Dynamics: While the base frequency of the Emerald Rapids processor is slightly lower, its higher maximum frequency (4.2 GHz vs. 3.5 GHz) means that performance-critical tasks can benefit from this burst capability under load.
Advanced Instruction Support: The introduction of AMX (Advanced Matrix Extensions) in Emerald Rapids, along with the robust AVX-512 support, optimizes the execution of complex mathematical and AI workloads.
Efficiency Gains: These processors also offer improved energy efficiency, reducing the energy consumed per compute unit.
This efficiency translates into lower operational costs and a more sustainable cloud environment.

Beyond Our Tests: Overview of the New v6 Series
While our tests focused on the Dld e6 series, Azure’s new v6 generation includes several families designed for different workloads:

1. Dlsv6 and Dldsv6-series
Segment: General purpose with NVMe local storage (where applicable)
vCPUs Range: 2 – 128
Memory: 4 – 256 GiB
Local Disk: Up to 7,040 GiB (Dldsv6)
Highlights: 5× increased CPU cache (up to 300 MB) and higher network bandwidth (up to 54 Gbps)

2. Dsv6 and Ddsv6-series
Segment: General purpose
vCPUs Range: 2 – 128
Memory: Up to 512 GiB
Local Disk: Up to 7,040 GiB in Ddsv6
Highlights: Up to 30% improved performance over the previous Dv5 generation and Azure Boost for enhanced IOPS and network performance

3. Esv6 and Edsv6-series
Segment: Memory-optimized
vCPUs Range: 2 – 192* (with larger sizes available in Q2)
Memory: Up to 1.8 TiB (1832 GiB)
Local Disk: Up to 10,560 GiB in Edsv6
Highlights: Ideal for in-memory analytics, relational databases, and enterprise applications requiring vast amounts of RAM
Note: Sizes with higher vCPUs and memory (e.g., E128/E192) will be generally available in Q2 of this year.

Key Innovations in the v6 Generation
Increased CPU Cache: Up to 5× more cache (from 60 MB to 300 MB) dramatically improves data access speeds.
NVMe for Storage: Enhanced local and remote storage performance, with up to 3× more IOPS locally and the capability to reach 400K IOPS remotely via Azure Boost.
Azure Boost: Delivers higher throughput (up to 12 GB/s remote disk throughput) and improved network bandwidth (up to 200 Gbps for larger sizes).
Microsoft Azure Network Adaptor (MANA): Provides improved network stability and performance for both Windows and Linux environments.
Intel® Total Memory Encryption (TME): Enhances data security by encrypting the system memory.
Scalability: Options ranging from 128 vCPUs/512 GiB RAM in the Dv6 family to 192 vCPUs/1.8 TiB RAM in the Ev6 family.
Performance Gains: Benchmarks and internal tests (such as SPEC CPU Integer) indicate improvements of 15%–30% across various workloads, including web applications, databases, analytics, and generative AI tasks.

My personal perspective and point of view
The new Azure v6 VMs mark a significant advancement in cloud computing performance, scalability, and security. Our Geekbench tests clearly show that the Dld e6 series—powered by the latest Intel Xeon Platinum 8573C (Emerald Rapids)—delivers up to 30% better performance than previous-generation machines with more resources. Coupled with the hardware evolution from Ice Lake-SP to Emerald Rapids—which brings a dramatic increase in cache size, improved frequency dynamics, and advanced instruction support—the new v6 generation sets a new standard for high-performance workloads. Whether you’re running critical enterprise applications, data-intensive analytics, or next-generation AI models, the enhanced capabilities of these VMs offer significant benefits in performance, efficiency, and cost-effectiveness.

References and Further Reading:
Microsoft’s official announcement: Azure Dld e6 VMs
Internal tests performed with Geekbench 6.4.0 (AVX2) in the Germany West Central Azure region.
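As a practical footnote to the comparison above, the processor generation backing a given VM can be checked directly from inside the guest. This is a generic Linux check, not specific to the tests in this post:

# Reports e.g. "Intel(R) Xeon(R) Platinum 8573C" on the v6 sizes discussed above
lscpu | grep -i "model name"
# Cache sizes as seen by the guest
lscpu | grep -i "cache"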
Announcing General Availability of Azure Dl/D/E v6 VMs powered by Intel EMR processor & Azure Boost
Today we are excited to announce General Availability of the new Azure General Purpose and Memory Optimized Virtual Machines powered by the 5th Gen Intel® Xeon® processor (code-named Emerald Rapids). The new virtual machines are available in three different memory-to-core ratios and are offered with or without a local NVMe SSD. The General Purpose families include the Dlsv6, Dldsv6, Dsv6, and Ddsv6-series. The Memory Optimized families include the Esv6 and Edsv6-series.
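Before deploying, it can help to confirm which of the new sizes have rolled out to your target region and whether any restrictions apply to your subscription. A minimal check with the Azure CLI (the region and size filter are examples only):

# List D-family SKUs available in East US 2, including any subscription restrictions
az vm list-skus \
  --location eastus2 \
  --size Standard_D \
  --resource-type virtualMachines \
  --output table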
Azure Windows Virtual Machine Activation: two new KMS IP addresses (…and why you should care)

This blog contains important information about KMS IP address changes that may impact Windows virtual machine activation for Azure Global Cloud customers who have configured custom routes or firewall rules to allow KMS IP addresses.

Who will be affected?
In July 2022, we announced two new KMS IP addresses, 20.118.99.224 and 40.83.235.53, in Azure Global Cloud via Azure Update - Generally available: New KMS DNS in Azure Global Cloud. We expect that most Azure Windows Virtual Machine customers will not be impacted. However, Azure Global Cloud customers who have followed troubleshooting guides, like the ones listed below, to configure custom routes or firewall rules that allow Windows VMs to reach the KMS IP address must take action to include these two new KMS IP addresses, 20.118.99.224 and 40.83.235.53. Otherwise, after October 3rd, 2022, your Windows Virtual Machines will report warnings of failing to reach the Windows licensing servers for activation.
https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/custom-routes-enable-kms-activation
https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshoot-activation-problems
https://docs.microsoft.com/en-us/azure/firewall/protect-azure-virtual-desktop

How will customers be affected?
As explained in Generally available: New KMS DNS in Azure Global Cloud, most Windows Virtual Machines in Global Cloud rely on the new azkms.core.windows.net endpoint for Windows activation. The new azkms.core.windows.net currently points to kms.core.windows.net. After October 3rd, 2022, azkms.core.windows.net will point to the two new IP addresses 20.118.99.224 and 40.83.235.53.
For customers who follow https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/custom-routes-enable-kms-activation: without including these two new IP addresses, 20.118.99.224 and 40.83.235.53, in your custom routes, your Windows Virtual Machines will not be able to connect to the new KMS server for Windows activation.
For customers who follow https://docs.microsoft.com/en-us/azure/firewall/protect-azure-virtual-desktop: without including these two new IP addresses, 20.118.99.224 and 40.83.235.53, in your firewall rules, your Windows Virtual Machines will not be able to connect to the new KMS server for Windows activation.
When failing to connect to the KMS server for activation, Azure Windows Virtual Machines report warnings like the following: “We can't activate Windows on this device as we can't connect to your organization's activation server. Make sure you're connected to your organization's network and try again. If you continue having problems with activation, contact your organization's support person. Error code: 0xC004F074.”
As explained in Key Management Services (KMS) activation planning, “KMS activations are valid for 180 days, a period known as the activation validity interval. KMS clients must renew their activation by connecting to the KMS host at least once every 180 days to stay activated. By default, KMS client computers attempt to renew their activation every seven days. After a client's activation is renewed, the activation validity interval begins again”. Within the 180-day KMS activation validity interval, customers can still access the full functionality of the Windows virtual machine. Customers should fix activation issues during the 180-day KMS activation validity interval.
Action required
For customers who follow https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/custom-routes-enable-kms-activation: include the two new IP addresses 20.118.99.224 and 40.83.235.53 in your custom routes before October 3rd, 2022.
For customers who follow https://docs.microsoft.com/en-us/azure/firewall/protect-azure-virtual-desktop: include the two new IP addresses 20.118.99.224 and 40.83.235.53 in your firewall rules before October 3rd, 2022.

How to check
You can remote login to your Windows Virtual Machines and complete the following:
Open PowerShell.
Run the following commands to confirm connectivity to the new KMS IP addresses:
test-netconnection azkms.core.windows.net -port 1688
test-netconnection 20.118.99.224 -port 1688
test-netconnection 40.83.235.53 -port 1688
If the connections are successful, no further action is needed. If any connection fails, go to the “Action required” section.

Important timeline
After October 3rd, 2022, most Azure Windows Virtual Machines will rely on the two new KMS IP addresses 20.118.99.224 and 40.83.235.53 for Windows activation, when azkms.core.windows.net points to these two new IP addresses.
After March 1st, 2023, all Azure Windows Virtual Machines will rely on the two new KMS IP addresses 20.118.99.224 and 40.83.235.53 for Windows activation, when kms.core.windows.net points to 20.118.99.224.