
Azure High Performance Computing (HPC) Blog

Azure announces new AI optimized VM series featuring AMD’s flagship MI300X GPU

MarcCharest
Microsoft
Nov 15, 2023

 
In our relentless pursuit of pushing the boundaries of artificial intelligence, we understand that cutting-edge infrastructure and expertise are needed to harness the full potential of advanced AI. At Microsoft, we've amassed a decade of experience in supercomputing and have consistently supported the most demanding AI training and generative inferencing workloads. Today, we're excited to announce the latest milestone in our journey: a virtual machine (VM) with an unprecedented 1.5 TB of high bandwidth memory (HBM) that leverages the power of AMD’s flagship MI300X GPU. Azure VMs powered by the MI300X GPU give customers even more choices among AI-optimized VMs.

 

ND MI300X v5, Azure’s AI-focused virtual machine series with the largest HBM capacity

What differentiates an ND MI300X v5 VM from our existing family of ND-series VMs is the inclusion of 8 x AMD Instinct MI300X GPUs, interconnected via Infinity Fabric 3.0, in each VM. MI300X offers industry-leading HBM capacity and bandwidth: 192 GB of HBM3 memory per GPU, with bandwidth of up to 5.2 TB/s. With greater capacity and more bandwidth, customers can process larger models faster, using fewer GPUs. On top of this, these virtual machines use the same winning platform as our other ND-series VMs, which includes the latest technologies like:

  • 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU, for 3.2 Tb/s per VM
  • 4th Gen Intel Xeon Scalable processors
  • PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
  • 16 channels of DDR5 DIMMs
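
The per-VM figures follow from the per-GPU figures by simple multiplication. The short Python sketch below reproduces that arithmetic. The constant and function names are illustrative, and the 1024 GB per TB conversion is an assumption chosen because it matches the quoted 1.5 TB.

```python
# Per-GPU figures quoted in the post; the names here are illustrative only.
GPUS_PER_VM = 8
HBM_PER_GPU_GB = 192            # HBM3 capacity per MI300X GPU
INFINIBAND_PER_GPU_GBPS = 400   # InfiniBand bandwidth per GPU, in Gb/s


def vm_totals(gpus=GPUS_PER_VM):
    """Aggregate the per-GPU figures into per-VM totals."""
    hbm_gb = gpus * HBM_PER_GPU_GB
    return {
        "hbm_gb": hbm_gb,
        # Assumption: 1 TB = 1024 GB, which yields the quoted 1.5 TB.
        "hbm_tb": hbm_gb / 1024,
        "infiniband_tbps": gpus * INFINIBAND_PER_GPU_GBPS / 1000,
    }


totals = vm_totals()
print(totals)
```

Under these assumptions, eight GPUs give 1,536 GB (1.5 TB) of HBM and 3.2 Tb/s of InfiniBand bandwidth per VM, matching the numbers above.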

1.5 TB of HBM Means Processing Bigger AI Models

HBM is essential for AI applications due to its high bandwidth, low power consumption, and compact size, which makes it well suited to workloads that need to process vast amounts of data quickly. ND MI300X v5 offers 192 GB of HBM per GPU, or 1.5 TB per VM, the highest HBM capacity available in the cloud. This lets customers run the largest, most advanced AI models faster and with fewer GPUs, saving power, cost, and time to solution.
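
To make the "fewer GPUs" point concrete, here is a rough back-of-the-envelope sketch in Python. It estimates only the minimum number of GPUs needed to hold a model's weights. The function name, the example model sizes, and the 16-bit (2 bytes per parameter) default are assumptions for illustration, not figures from this announcement. Real deployments also need memory for activations, KV cache, and framework overhead.

```python
import math

HBM_PER_GPU_GB = 192  # per-GPU HBM capacity quoted in the post


def min_gpus_for_weights(params_billions, bytes_per_param=2,
                         hbm_per_gpu_gb=HBM_PER_GPU_GB):
    """Lower bound on the GPUs needed just to hold the model weights.

    Assumes 1 GB = 1e9 bytes and ignores activations, KV cache,
    optimizer state, and framework overhead, so real needs are higher.
    """
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / hbm_per_gpu_gb)


# Hypothetical model sizes, in billions of parameters, at 16-bit precision.
for size in (70, 180, 700):
    print(f"{size}B parameters: at least {min_gpus_for_weights(size)} GPU(s)")
```

In this simplified model, the weights of a hypothetical 70-billion-parameter model at 16-bit precision (about 140 GB) fit within a single 192 GB GPU. Passing a smaller `hbm_per_gpu_gb`, such as 80, raises the minimum to two.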

 

Microsoft and AMD’s Long-Term Partnership

ND MI300X v5 is the culmination of a long-term partnership between Microsoft and AMD. Years ago, we built clusters of MI50 and MI100 GPUs at Microsoft to optimize the training and inferencing of large models with ROCm on AMD GPUs. ONNX Runtime, DeepSpeed, and MSCCL are all examples of Microsoft frameworks that now support AMD GPUs. Then we built the world’s top cloud supercomputer (according to the June 2023 TOP500 list) with VMs featuring AMD MI200 GPUs, and we have been running several internal production workloads on it in Azure. Today marks the next phase in this journey: the ND MI300X v5 VMs. Customers can seamlessly switch to ND MI300X v5 from our other ND-series VMs because AMD’s open software platform, ROCm, contains all the libraries, compilers, runtimes, and tools necessary for accelerating compute-intensive applications. Major machine learning frameworks such as PyTorch and TensorFlow have ROCm support built in and are supported in an ND MI300X v5 VM. This seamlessness results in lower engineering costs and faster time-to-market for customers’ solutions.
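
As a small illustration of what built-in ROCm support looks like in practice, the Python sketch below reports which GPU backend an installed PyTorch build targets. It relies on ROCm builds of PyTorch exposing AMD GPUs through the familiar `torch.cuda` interface and setting `torch.version.hip`. The helper name `describe_backend` is hypothetical, and the output depends entirely on the environment it runs in.

```python
def describe_backend():
    """Report which GPU backend the installed PyTorch build targets, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is set on ROCm builds; torch.version.cuda on CUDA builds.
    hip = getattr(torch.version, "hip", None)
    cuda = getattr(torch.version, "cuda", None)
    if hip:
        backend = f"ROCm/HIP {hip}"
    elif cuda:
        backend = f"CUDA {cuda}"
    else:
        backend = "CPU only"

    # On ROCm builds, AMD GPUs are reached through the same torch.cuda calls.
    if torch.cuda.is_available():
        return f"{backend}, {torch.cuda.device_count()} device(s) visible"
    return f"{backend}, no GPU visible"


print(describe_backend())
```

Because device selection goes through the same `torch.cuda` calls on either backend, model code written as `tensor.to("cuda")` typically needs no changes when moving to a ROCm build.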

 

AI in Azure's DNA

Azure is pioneering AI by embracing a comprehensive approach to silicon diversity and developing the most flexible and extensible AI infrastructure. We offer the best capabilities of our industry partners and give customers more options to meet their unique needs. We are proud to extend our AI and supercomputing capabilities with ND MI300X v5. ND MI300X v5 is available for early access now and will become part of the Azure portfolio soon, allowing you to unlock the potential of AI at scale in the cloud.

Updated Nov 17, 2023
  • lemurian-theo
    Copper Contributor

    I am not able to find these VMs in the Azure portal. Could anyone share a link to these VMs?

  • Charles_Dodgson
    Copper Contributor

    AMD TSVs are so exciting that I'm willing to forgive ROCm. PyTorch (or TensorFlow if you're in to that kind of thing) is the real common software stack.

     

    p.s. Your grammar checker is wrong. It is "in to", not "into".