Educator Developer Blog

Understanding your GPU Performance on Azure with GPU Monitor

Lee_Stott
Mar 21, 2019
First published on MSDN on May 03, 2018
I get lots of questions from academics.

Many are now about the performance and optimisation of cloud services, or simply about understanding what students are doing with the resources.

Many are specifically about measuring and managing the Azure GPUs used in the teaching of DNN, ML and AI (https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu). The most common is 'what's the best practice for monitoring GPU cores/RAM usage on N-series DSVM(s)?'

There are solutions like logging into each VM and running "watch nvidia-smi", but this is not scalable and is complex to manage across an estate of machines or clusters.
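To make the per-VM approach concrete, here is a minimal sketch of what polling nvidia-smi from Python might look like. It uses the standard --query-gpu and --format=csv,noheader,nounits options. The function names and dictionary keys are my own choices for illustration, and it assumes every queried field comes back as a number.

```python
import subprocess

# Fields requested from nvidia-smi; these are standard --query-gpu properties.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Assumes all four fields are numeric (no "[N/A]" values).
        index, util, mem_used, mem_total = [v.strip() for v in line.split(",")]
        rows.append({
            "index": int(index),
            "utilization_pct": float(util),
            "memory_used_mib": float(mem_used),
            "memory_total_mib": float(mem_total),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed readings (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling sample_gpus() on an N-series VM should return one dict per GPU. It still has to run on every machine, which is exactly the management problem described above.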

So the request is: how can I do this simply and have a nice visual of usage across my class or cohort?

Wouldn't it be great to have a single view of the utilisation in some form of dashboard visual?

Well, you now can, thanks to Microsoft colleagues Mathew Salvaris and Miguel Fierro, who have created an app for monitoring GPUs on a single machine and across a cluster.

You can use it to record various GPU measurements during a specific period using the context-based loggers, or continuously using the gpumon CLI command. The context logger can record either to a file, which can be read back into a dataframe, or to an InfluxDB database.
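If the idea of a context-based logger is new to you, the sketch below shows the general pattern: a with-block that samples in the background for as long as your code runs, then writes the readings to a file. This is not gpumon's API. The names log_to_file, read_log and sampler are hypothetical and only there to illustrate the idea. See the project's notebook for the real calls.

```python
import csv
import threading
import time
from contextlib import contextmanager


@contextmanager
def log_to_file(path, sampler, interval=1.0):
    """Call sampler() every `interval` seconds while the with-block runs.

    Hypothetical helper for illustration; not gpumon's API.
    `sampler` returns a list of dicts that all share the same keys.
    """
    stop = threading.Event()

    def worker():
        with open(path, "w", newline="") as fh:
            writer = None
            while True:
                for row in sampler():
                    row = {"timestamp": time.time(), **row}
                    if writer is None:
                        # Header is taken from the first row seen.
                        writer = csv.DictWriter(fh, fieldnames=list(row))
                        writer.writeheader()
                    writer.writerow(row)
                fh.flush()
                # wait() returns True once stop is set, which ends the loop.
                if stop.wait(interval):
                    break

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    try:
        yield
    finally:
        stop.set()
        thread.join()


def read_log(path):
    """Read the logged rows back as a list of dicts (values are strings)."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))
```

You would wrap a training run in `with log_to_file("gpu.csv", my_sampler):` where my_sampler is any function returning GPU readings. Because the file is plain CSV, pandas.read_csv("gpu.csv") would load it into a dataframe.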

Data from the InfluxDB database can then be accessed using the Python InfluxDB client, or viewed in real time using dashboards such as Grafana.
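As a rough sketch, reading recent points back with the Python InfluxDB client (the influxdb package for InfluxDB 1.x) might look like the following. The measurement name you pass in, the "gpudata" database name and the host and port defaults are assumptions for illustration. Use whatever names your logger was configured with.

```python
def build_query(measurement, minutes=10):
    """Build an InfluxQL query for the last `minutes` of a measurement."""
    return 'SELECT * FROM "{}" WHERE time > now() - {}m'.format(
        measurement, int(minutes)
    )


def fetch_recent(measurement, minutes=10, host="localhost", port=8086,
                 database="gpudata"):
    """Return recent points as a list of dicts.

    The database name and connection defaults are assumed values.
    """
    # Imported here so the query helper works without the package installed.
    from influxdb import InfluxDBClient  # pip install influxdb

    client = InfluxDBClient(host=host, port=port, database=database)
    try:
        result = client.query(build_query(measurement, minutes))
        return list(result.get_points())
    finally:
        client.close()
```

Each returned point is a dict with a "time" key plus whatever fields and tags were logged, so the list can be passed straight to pandas.DataFrame if you prefer working with a dataframe.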

They have a great example, available as a Jupyter notebook, which can be found here

Below is an example dashboard using the InfluxDB log context and a Grafana dashboard


You can download the installation and source from https://github.com/msalvaris/gpu_monitor
Updated Mar 21, 2019