
Microsoft Mechanics Blog

High Performance Computing in Azure - with Mark Russinovich

Zachary-Cavanell
Jan 13, 2025

See the latest innovations in silicon design from AMD, including new system-on-a-chip high-bandwidth memory breakthroughs that deliver up to 7 terabytes per second of memory bandwidth in a single virtual machine, and find out how it’s possible to get more than 8x speed-ups moving from the previous generation to HBv5 without sacrificing compatibility. These use AMD EPYC™ 9004 processors with AMD 3D V-Cache™ Technology.

And find out how Microsoft’s own silicon, including custom ARM-based Cobalt CPUs and Maia AI accelerators, delivers performance and power efficiency.

Mark Russinovich, Azure CTO, Deputy CISO, Technical Fellow, and Microsoft Mechanics lead contributor, shows Jeremy Chapman how, across workloads spanning Databricks, Siemens, Snowflake, and Microsoft Teams, Azure provides the tools to improve efficiency and performance in your datacenter at hyperscale.

Maximize memory bandwidth.

Up to 7 TB/s of memory bandwidth with Azure’s HBv5 VMs for superior performance in memory-intensive applications. See how it works.

Overcome memory bandwidth bottlenecks.

Check out Azure’s advanced HB-series VMs, featuring custom-made AMD processors, only available on Azure, for unparalleled performance. Watch here.

Choose the right hardware for your HPC applications.

Ensure optimal performance with CPUs or GPUs. Watch Mark Russinovich demonstrate how to get started with Azure and make the best choice for your workloads.

Watch our video here.

QUICK LINKS: 

00:00 — 7TB/s memory bandwidth in a single VM 
00:51 — Efficiency and optimization 
02:33 — Choose the right hardware for workloads 
04:52 — Microsoft Cobalt CPUs and Maia AI accelerators 
06:14 — Hardware innovation for diverse workloads 
07:53 — Speedups with HBv5 VMs 
09:04 — Compatibility moving from HBv4 to HBv5 
11:29 — Future of HPC 
12:01 — Wrap up 

Link References 

Check out https://aka.ms/AzureHPC 

For more about HBv5 go to https://aka.ms/AzureHBv5

Unfamiliar with Microsoft Mechanics? 

Microsoft Mechanics is Microsoft’s official video series for IT. You can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft.

To keep getting this insider knowledge, join us on social:


Video Transcript:

- So what would you do with 7 terabytes per second of memory bandwidth? Well, High Performance Computing, or HPC, with the latest hardware is today’s focus. And I’m joined once again by Mark Russinovich, a man who needs no introduction. Welcome back.

- Happy to be back.

- So, 7 terabytes per second, that amount of memory bandwidth is a tall order. So what makes a thing like this even possible?

- Well, it is possible, and this is all on a CPU. Let me prove it to you. Here I have the Stream benchmark set up. If you’re not familiar with the Stream Triad benchmark, it’s a benchmark focused on measuring memory bandwidth. And you can see that this virtual machine running in Azure is hitting almost 7TB per second of memory bandwidth, like we said. Just to put that into perspective, that’s an over 8x improvement, generation over generation, compared to our previous Azure HB- and HX-series VMs using AMD’s Genoa-X processors.
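If you haven’t seen it before, the Triad operation at the heart of that benchmark is just a scale-and-add over three large arrays. Here’s a minimal, single-threaded sketch for reference; the official STREAM benchmark parallelizes this loop across every core (for example with OpenMP), and the array size and timing approach here are illustrative assumptions rather than the real source:

```c
// Minimal sketch of the STREAM Triad pattern (illustrative only, not the
// official benchmark source). Triad reads b[i] and c[i] and writes a[i],
// so each element moves 3 x 8 = 24 bytes of memory traffic for doubles.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 26)  // ~67M doubles per array, large enough to defeat caches

int main(void) {
    double *a = (double *)malloc(N * sizeof(double));
    double *b = (double *)malloc(N * sizeof(double));
    double *c = (double *)malloc(N * sizeof(double));
    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];  // the Triad kernel
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("Triad: %.2f GB/s\n", 24.0 * N / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```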

- And it’s really a lot of bandwidth. So where do you see this coming in in terms of being really useful and helpful?

- I can think of a number of critical workloads where the primary bottleneck is the speed of memory. Think of scientific workloads, for example. If you’re forecasting weather like the UK Met Office, or you’re tracking dangerous storms that could be life-threatening and disruptive, these require the highest-precision forecasts, and that’s where you need high performance compute that can enable precise computations at speed. And I can show you the difference between HBv4 and HBv5 in terms of memory speed. Let’s look at another example with computational fluid dynamics, or CFD, where we want to iterate faster with different simulations. Here, I have an ANSYS benchmark ready to simulate a wind tunnel test of airplane wings. On the left side, I’ve configured the HBv4 virtual machine, and on the right is the HBv5 machine. I’ll start the run on the left from our HBv4 system, and now I’ll kick off the HBv5 one on the right. Each system takes 2 to 3 minutes to run and there’s nothing to look at while it completes, so I’ll fast forward in time a little. Now I’ll open up the logs, starting with the HBv4 VM on the left, then the HBv5 VM on the right, so that we can compare each benchmark. If we focus in on the “Solver” rows for each, you can see the wall time per iteration of 0.911 seconds for the HBv4 and 0.345 seconds for the HBv5. That makes the Solver speed on the HBv5 about 2.6 times faster (0.911 ÷ 0.345 ≈ 2.6), at almost 42 million cells per second per iteration, and the Solver rating is also 2.6x higher. So, yes, it’s faster. And the real benefit is that you can perform more test iterations in the same amount of time on a single system. When memory bandwidth is the bottleneck, it doesn’t matter what other resources you throw at it, including faster processors, more memory capacity, better disk speed or network speed. Those alone won’t help the application until memory speeds improve. And that’s why high-bandwidth memory makes such a huge difference.

- And I want to come back to this to unpack what makes this even possible for CPU and memory-intensive workloads. But can you answer me one question first: why wouldn’t you just use a GPU for something like this?

- So, GPUs and AI accelerators are great for certain kinds of workloads, both for AI training and inference as well as various modeling and simulation applications, but it doesn’t make sense to use them for everything, and they can actually introduce latency bottlenecks for specific HPC workloads. I have some code here to highlight a few inefficiencies when you’re programming for GPUs today versus for CPUs, especially with memory management. On the left side of the screen is an example of code that runs on a CPU natively, and on the right is the same code, but ported to a GPU. Now let’s dive into the differences, and I’ll show you a few operations that would introduce latency. For example, in this operation with GPU code, you have to explicitly manage your memory by moving data to the GPU. So first, there is an extra step where we have to allocate memory not just on the host CPU, but on the GPU device as well. Then, you have to copy the data from the host CPU memory to the GPU device memory. Obviously, you wouldn’t need to do either of these on the CPU alone. Then this section is where the work is done and where everything is processed. It’s more or less the same on both sides. And for each operation like this on the GPU, you’d then need to move the data back, like you’re seeing in this line, to the host CPU. The code for the CPU on the left doesn’t need any of these operations that can add latency and introduce more complexity. And it gives the GPU code on the right more points of failure, too.
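To make that concrete, here’s a rough sketch of the pattern Mark describes: the same loop written for the CPU, next to a CUDA version that has to allocate device memory, copy data over, and copy the result back. This is illustrative code written for this article, not the source from the demo:

```cuda
// Illustrative CPU-vs-GPU comparison (not the demo's actual code).
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void scale_add(double *a, const double *b, const double *c,
                          double scalar, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + scalar * c[i];  // same math as the CPU loop
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(double);
    double *a = (double *)malloc(bytes);
    double *b = (double *)malloc(bytes);
    double *c = (double *)malloc(bytes);
    for (int i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

    // CPU version: operate on host memory directly -- no data movement.
    for (int i = 0; i < n; i++) a[i] = b[i] + 3.0 * c[i];

    // GPU version: extra steps the CPU path never pays for.
    double *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);  // 1) allocate on the device as well as the host
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);  // 2) copy host -> device
    cudaMemcpy(d_c, c, bytes, cudaMemcpyHostToDevice);
    scale_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, 3.0, n);  // 3) compute
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);  // 4) copy results back

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    printf("done\n");
    return 0;
}
```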

- And whether your workload is CPU or GPU dependent, one thing many HPC workloads have in common is that they don’t need systems running 24/7, 365 days a year, and that’s where cloud computing makes a lot of sense. You’re not locked into those expensive hardware purchases, sometimes for 5+ years.

- Right, and in the cloud on Azure, we give you a lot more choice. Computationally-intensive workloads can have very different requirements, and in Azure, we want to make sure you have the right tool for the job. We have a wide variety of diverse platforms and offerings available to cater to your workloads. This way you can make the right choice in terms of performance per dollar. Some HPC workloads, for example, still require 64-bit float as well as 32-bit float for double- or single-precision math, and these make more sense to run on a CPU for better cost performance. And on Azure, you are not limited to any one instance type. Whether you decide to use GPUs or CPUs, you can tailor your compute instance to meet your precise needs.

- And speaking of which, this actually includes Microsoft’s own silicon, in addition to new chip infrastructures that we’ve co-designed with the broader hardware ecosystem.

- And as we’ve spoken about before on past shows, our work here spans more than a decade to bring you the best choice. We’re committed to silicon diversity, both through the work we’re doing with our silicon partners and our own in-house silicon development. For example, about a year ago we announced that we’re building Microsoft custom silicon with our in-house developed Cobalt CPUs, which are power-efficient, ARM-based designs. We’ve made Cobalt 100-based virtual machines generally available, and we’ve already seen massive workloads running on them spanning Databricks, Siemens, and Snowflake. And you can also provision these VMs directly from the Azure portal, like any other VM. They’re already powering critical first-party workloads, like Microsoft Teams media processing and a few of its Copilot capabilities. At the same time, we also announced our work on in-house-developed AI accelerators with Maia 100. And these are already being used with the Copilot service, in addition to Azure OpenAI Services, for inference at scale. The goal with both of these efforts is really to increase efficiency, optimizing performance per watt, while at the same time pushing the boundaries in related areas when running datacenters at hyperscale, like developing our own dedicated liquid cooling tower for Microsoft silicon that comprises a rack-level, closed-loop liquid cooling system for more efficient heat transfer. And in our own datacenters, we’ve also optimized for high-bandwidth network connectivity as we connect arrays of multiple CPUs and GPUs to meet performance demands.

- Yeah, so given Microsoft’s own silicon innovation here, then, how are we influencing the broader hardware ecosystem to support different workloads that you see a demand for?

- I’ll give you a few examples. When we started building out cloud-based supercomputers with thousands of connected CPUs and GPUs, not only were network speed and reliability critical, but we also started to see memory capacity limitations causing bottlenecks. We’ve been working with our hardware partners for years on their silicon designs, and if we focus just on AMD and the example we started with, our collaboration on their MI300 series AI accelerators last year resulted in 192GB of high-bandwidth memory. At the time, the NVIDIA H100 series had just 80GB, so this was more than double that, and it’s still higher than the 2024 H100 refresh, which was upgraded to 144GB. And beyond AI workloads, for HPC workloads, together we enabled AMD EPYC server CPUs with 3D V-Cache, superfast cache vertically stacked on the CPU, specifically for our HX-series VMs. In fact, I recently spoke to AMD’s CTO, Mark Papermaster. Here’s what he had to say.

- So, we’re actually users of that technology here at AMD. We’re using the HX VM series to verify our next-generation chips, and I’m really excited that we’re seeing up to a 60% speed-up. And we’re continuing our work together. The new specialized EPYC processors used in Azure’s HBv5 VMs take this to a whole new level, with more than an 8x increase in memory bandwidth to 7TB per second. The HBv5 bandwidth increase enables new levels of performance for workloads with large datasets that far exceed the capacity of the 3D V-Cache.

- And it’s interesting that AMD themselves are actually using the AMD-powered Azure HX series VMs to test their new CPU chip designs in HPC scenarios. In fact, Mark Papermaster mentioned the speedups with HBv5. So what makes this even possible?

- Well, so, for CPUs, memory bandwidth and memory capacity have always been tied together. And now, to support more demanding workloads, we’ve decoupled memory bandwidth from memory capacity on the CPU. Let me explain what difference this makes. For CPUs, memory is traditionally provided by DIMMs connected via slots on the motherboard; each slot has just one memory channel going to it, and it’s connected to the CPU via copper traces on the board. We’ve now moved away from a separate component-based architecture to a unified system-on-a-chip design where high-bandwidth memory can coexist on the same package as the compute cores, which moves the memory closer to the CPU. In this case, we have eight stacks of memory with 16 channels per stack. So, there are a lot more memory channels to connect the HBM to the CPU. This means for HBv5, we have 128 memory channels per socket, compared to 12 channels per socket on HBv4. So, by moving high-bandwidth memory and the CPU onto the same package, you gain more power efficiency, and this is also how modern GPUs are designed.
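As a rough back-of-envelope example of why channel count matters: a single DDR5-4800 DIMM channel moves about 38.4 GB/s (4800 MT/s × 8 bytes), so 12 channels per socket top out near 460 GB/s, while 128 on-package HBM channels multiply the available paths by more than 10x. The snippet below just works that arithmetic; the per-channel data rate is an assumed illustrative figure, not an official HBv4 or HBv5 spec:

```c
// Back-of-envelope bandwidth arithmetic (assumed figures, for illustration).
#include <stdio.h>

int main(void) {
    const double ddr5_gbs_per_channel = 4800e6 * 8 / 1e9;  // 38.4 GB/s per DDR5-4800 channel
    const int dimm_channels = 12;   // channels per socket in the HBv4 example above
    const int hbm_channels  = 128;  // channels per socket in the HBv5 example above

    printf("DIMM-based peak per socket: ~%.0f GB/s\n",
           dimm_channels * ddr5_gbs_per_channel);
    printf("Channel multiplier with on-package HBM: %.1fx\n",
           (double)hbm_channels / dimm_channels);
    return 0;
}
```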

- Right, and earlier in your code demo, you demonstrated the challenge of having to port applications from a CPU to a GPU to get that performance. So how much effort will it be, then, for a customer who’s maybe watching right now to go from, say, HBv4 to HBv5 and still take advantage of high-bandwidth memory?

- Yeah, we get that question a lot, in terms of porting from one processor architecture to another, like from x86 to ARM, in order to capture all of the promised performance. This was really important to us when we were working with AMD on the HBv5 platform: to make it easy to move from HBv4 to HBv5 and get the performance benefits. Here, I have the same side-by-side setup as before. On the left, I’m connected to an HBv4 VM, and on the right I’m in HBv5. First, I’m going to clone the benchmark that I showed in the beginning for measuring memory bandwidth. Now I can move into that directory, and I’m going to compile it to run on this HBv4. That’s going to take a moment to compile. I want to show that this benchmark will work without any modification on both generations of processors, and I’ll prove that by checksum-ing the same file in both VMs. First, with it compiled, I’ll go ahead and run it on our HBv4 VM. It ran. And you’ll notice this Triad value is about 717 GB/s, versus the roughly 6.7 TB/s we saw on the HBv5. Now I’m going to run md5sum stream to get a checksum of the binary we just compiled, just so that I can prove it’s the same binary when we run it on HBv5. There’s our checksum. Now, I’ll move over to the HBv5 VM on the right, and I’ll SCP, or copy, the binary over from the HBv4 VM. I’ll copy the path from the HBv4 into the HBv5, then bring the stream file over. Next, I’ll also bring my scripts over. And from here, I’ll run the md5sum stream command again on the HBv5 to prove that it’s the same binary. And with both checksums in view, you can see it is. Let’s see if it runs. And as you can see, we get the full performance of HBv5 without having to do any porting or optimizing of the code, or even recompiling. So, that means if you have apps that are starved for memory bandwidth, you can just run them on HBv5 and get that performance.

- Right, so no additional work then if your code’s already running on HBv4, but you get the massive memory performance gains when you move it to HBv5.

- That’s right. And all of this was by design from the outset. Normally, when you make significant changes to silicon architecture resulting in that much performance improvement, that level of change isn’t without consequence, and together with AMD we’ve achieved this without impacting compatibility.

- And this is really interesting insider perspective. So where do you see things going, then, in the HPC space?

- So, whether we realize it or not, HPC has a profound impact on our world, from safer cars, planes, and buildings, to cheaper energy, to even studying the fluid dynamics of beer. You’ll increasingly see AI/ML integrated into HPC applications to improve how we simulate the physical world. And as that happens, we’ll see more merging of traditional HPC models with AI/ML techniques. And our goal is to continually guide that innovation so that we can help you achieve this without losing accuracy, while maintaining overall cost performance.

- So thanks so much for joining us today, Mark. And if you want to learn more about High Performance Computing on Azure, be sure to check out aka.ms/AzureHPC, and also aka.ms/AzureHBv5 for more info about HBv5. Subscribe to Microsoft Mechanics if you haven’t already, and thank you so much for watching.

 
