Azure High Performance Computing (HPC) Blog

GPU Slicing in CycleCloud Slurm with CUDA Multi-Process Service (MPS)

vinilv
Jan 14, 2025

High-performance computing thrives on efficient GPU resource sharing, and integrating NVIDIA’s CUDA Multi-Process Service (MPS) with CycleCloud-managed Slurm clusters can revolutionize how teams optimize their workloads.

CUDA MPS streamlines GPU sharing by creating a shared GPU context for multiple CUDA processes. This happens transparently, thanks to an MPS control daemon that manages workloads behind the scenes. It ensures that users can seamlessly run their CUDA programs without modification, each benefiting from their own MPS server for efficient workload distribution and GPU access.
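For orientation, here is a minimal sketch of starting and querying the MPS control daemon by hand on a GPU node. These are standard nvidia-cuda-mps-control commands using the default pipe and log locations; the CycleCloud integration and job script later in this post handle this for you, so this block is purely illustrative.

# Start the MPS control daemon in the background (one daemon per user/UID)
nvidia-cuda-mps-control -d

# Ask the control daemon which MPS servers it has spawned
echo get_server_list | nvidia-cuda-mps-control

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control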

When combined with the dynamic scaling of Azure CycleCloud and the robust job scheduling of Slurm, this setup becomes a game-changer. Teams can maximize GPU utilization, minimize costs, and scale effortlessly, all while maintaining the flexibility to adapt to diverse workloads.

Unlike NVIDIA’s Multi-Instance GPU (MIG), which is tied to specific GPU models like A100 and H100, MPS works across any GPU with a compute capability of 3.5 or higher. While MIG provides hardware-level isolation, MPS excels at enabling multiple concurrent CUDA processes on the same GPU, requiring careful memory management to maintain performance.
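If you want to confirm that a GPU qualifies before relying on MPS, you can query its compute capability directly. A quick check is shown below; the compute_cap query field assumes a reasonably recent driver (roughly R510 or newer).

# List each GPU with its compute capability (MPS requires 3.5 or higher)
nvidia-smi --query-gpu=name,compute_cap --format=csv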

Refer to my previous post on Creating a SLURM Cluster (Without CycleCloud) for Scheduling NVIDIA MIG-Based GPU Accelerated workloads.

In this guide, we’ll explore how to integrate CUDA MPS with Slurm using Azure CycleCloud. You’ll discover how to run multiple GPU jobs on a single GPU, unlocking new levels of efficiency and scalability for your HPC operations.

Prerequisites

CycleCloud
•    Version: 8.7
•    Status: Must be installed and operational

Scheduler Node
•    VM Size: Standard_D4s_v5 (4 vCPUs, 16 GiB memory)
•    Image: Ubuntu-HPC 2204 - Gen2 (microsoft-dsvm:ubuntu-hpc:2204:latest)
•    Scheduling Software: Slurm 24.05.4

Execute Node
•    VM Size: Standard_NC24ads_A100_v4
•    Image: Ubuntu-HPC 2204 - Gen2 (microsoft-dsvm:ubuntu-hpc:2204:latest)
•    Scheduling Software: Slurm 24.05.4
•    NVIDIA Driver: integrated into the image

Here is the procedure to integrate CUDA MPS in CycleCloud.

Log in to the CycleCloud server, clone the following repository, and upload the project to the locker.

git clone https://github.com/vinil-v/slurm-cuda-mps
cd slurm-cuda-mps/
cyclecloud project upload "<Locker name>"
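
If you are not sure which locker name to use in the upload command above, the CycleCloud CLI can list the lockers available to your account; this is a standard cyclecloud command, shown here only as a convenience.

# List available lockers to find the name to use in the upload command
cyclecloud locker list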

In your CycleCloud Slurm cluster configuration, add this project as a cluster-init under the scheduler node settings.
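If you prefer editing the cluster template rather than the UI, the cluster-init reference on the scheduler node looks roughly like the sketch below; the project, spec, and version names here are illustrative, so take the actual values from the cloned project's project.ini.

[[node scheduler]]
    # Illustrative cluster-init reference; replace project:spec:version
    # with the values defined in the slurm-cuda-mps project you uploaded.
    [[[cluster-init slurm-cuda-mps:default:1.0.0]]]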

This enables CUDA MPS support in the CycleCloud Slurm cluster. I have tested it with A100, H100, and T4 GPU VMs, and it works as expected.

The custom project makes the following changes to the Slurm and GRES configuration files.

----- slurm.conf -----
(base) vinil@slurmgpu-hpc-1:~$ grep Gres /etc/slurm/slurm.conf
GresTypes=gpu,mps

----- azure.conf -----
(base) vinil@slurmgpu-hpc-1:~$ grep mps /etc/slurm/azure.conf
Nodename=slurmgpu-hpc-1 Feature=cloud STATE=CLOUD CPUs=24 ThreadsPerCore=1 RealMemory=214016 Gres=gpu:1,mps:100

----- gres.conf -----
(base) vinil@slurmgpu-hpc-1:~$ cat /etc/slurm/gres.conf
Nodename=slurmgpu-hpc-1 Name=gpu Count=1 File=/dev/nvidia0
Nodename=slurmgpu-hpc-1 Name=mps Count=100 File=/dev/nvidia0
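
Once the GPU node has joined the cluster, you can confirm that Slurm advertises both the gpu and mps GRES. These are standard Slurm commands, included here only as a quick sanity check; the node name matches the example above.

# Show the GRES advertised by the GPU node
scontrol show node slurmgpu-hpc-1 | grep -i gres

# Or list GRES per node across the cluster
sinfo -N -o "%N %G"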

Testing the setup:

Here's the job script used to request a partial GPU allocation. In this example, the job requests 25 of the 100 MPS shares each GPU is divided into, i.e. 25% of a GPU, so up to four such jobs can run per GPU. GPU usage is throttled by the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable, which Slurm sets based on the --gres=mps request.

To test this setup, I used the distributed_training.py script. You need to set up an Anaconda environment to run this job.
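As a rough sketch of that environment (the exact packages depend on your training script), something like the following is enough for the TensorFlow snippet used below; the environment name training_env matches the one activated in the job script, and the tensorflow[and-cuda] spelling assumes a recent TensorFlow release.

# Create and activate a conda environment for the training job (illustrative)
conda create -y -n training_env python=3.10
conda activate training_env

# Install TensorFlow with GPU support
pip install "tensorflow[and-cuda]"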

Since MPS doesn’t enforce strict memory limits per process, managing memory at the application level is essential for efficient memory use in multi-process setups.

To ensure successful concurrent execution of 4 jobs on a single GPU, I added code to limit each process's memory usage to 15GB.

# Limit GPU memory to 15 GB per process
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)  # Enable memory growth
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=15360)])  # 15 GB cap
    except RuntimeError as e:
        print(e)  # Memory settings must be applied at program startup

Slurm job script:

#!/bin/bash
#SBATCH --job-name=cuda_mps_job       # Job name
#SBATCH --output=cuda_mps_output.%j    # Output file
#SBATCH --error=cuda_mps_error.%j      # Error file
#SBATCH --ntasks=1                     # Number of tasks (processes)
#SBATCH --cpus-per-task=6              # Number of CPU cores per task (adjust as needed)
#SBATCH --gres=mps:25                  # Request 25 of 100 MPS shares (25% of one GPU)
#SBATCH --time=01:00:00                # Time limit (adjust as needed)
#SBATCH --partition=hpc                # Specify the GPU partition (adjust as needed)

# Define directories for MPS control pipes and logs
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$SLURM_JOB_ID

# Create the necessary directories
mkdir -p $CUDA_MPS_PIPE_DIRECTORY
mkdir -p $CUDA_MPS_LOG_DIRECTORY

# Start the MPS control daemon if it isn't already running
if ! pgrep -x "nvidia-cuda-mps-control" > /dev/null; then
    echo "Starting MPS control daemon..."
    nvidia-cuda-mps-control -d
fi

# Set the MPS thread utilization limit (optional)
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

# Load your Python environment
source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env

# Run your CUDA applications
echo "Running CUDA applications..."
python distributed_training.py

# Stop the MPS control daemon
echo "Stopping MPS control daemon..."
echo quit | nvidia-cuda-mps-control

# Clean up MPS directories
rm -rf $CUDA_MPS_PIPE_DIRECTORY
rm -rf $CUDA_MPS_LOG_DIRECTORY
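
To reproduce the queue behaviour shown below, I simply submitted several copies of this script; the filename mps_job.sh is illustrative, so use whatever name you saved the script under.

# Submit five copies of the job; only four fit on one GPU at 25 shares each
for i in $(seq 1 5); do sbatch mps_job.sh; done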

 

We can see four jobs running on a single GPU under MPS control; note the M+C process type in the nvidia-smi output.

(base) vinil@slurmgpu-hpc-1:~$ nvidia-smi
Tue Nov  5 05:51:22 2024      
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                    0 |
| N/A   41C    P0             79W /  300W |   62748MiB /  81920MiB |     94%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     66542    M+C   python                                      15648MiB |
|    0   N/A  N/A     66543    M+C   python                                      15648MiB |
|    0   N/A  N/A     66544    M+C   python                                      15648MiB |
|    0   N/A  N/A     66545    M+C   python                                      15648MiB |
|    0   N/A  N/A     66560      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66563      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66564      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A     66567      C   nvidia-cuda-mps-server                         30MiB |
+-----------------------------------------------------------------------------------------+

We can observe that four jobs are currently running, while the fifth job is in the queue. In this setup, there is only one node with a single GPU, which is why the fifth job remains queued. Each job utilizes 25% of the GPU's capacity.

vinil@slurmgpu-scheduler:~$ squeue
 JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
     3       hpc cuda_mps  vinil  R  0:02     1 slurmgpu-hpc-1
     4       hpc cuda_mps  vinil  R  0:02     1 slurmgpu-hpc-1
     2       hpc cuda_mps  vinil  R  0:05     1 slurmgpu-hpc-1
     1       hpc cuda_mps  vinil  R  1:03     1 slurmgpu-hpc-1
     5       hpc cuda_mps  vinil PD  0:00     1 (Resources)

Current Limitations: 

While MPS effectively handles multiple processes on a single GPU, it struggles with seamless multi-GPU support in Slurm environments. Testing on an 8-GPU machine revealed that only the first GPU was utilized, leaving the others idle, highlighting a need for further exploration to unlock full multi-GPU utilization under MPS.

The issue arises from Slurm's interaction with CUDA and its management of GPU visibility through environment variables like CUDA_VISIBLE_DEVICES and control groups (cgroups). Slurm uses CUDA_VISIBLE_DEVICES to isolate GPUs for jobs, mapping the assigned GPU to 0 from the job’s perspective. While effective for single-GPU setups, this approach limits MPS, which requires visibility of all GPUs it might manage. Additionally, Slurm's use of cgroups to enforce GPU allocation confines MPS servers to the assigned GPUs, preventing access to others. Compounding this, MPS allocates a shared GPU context per user (UID) and doesn’t inherently distribute processes across multiple GPUs, meaning jobs assigned to a single GPU by Slurm remain limited to that GPU’s context.
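One way to see this remapping in action (a quick diagnostic, not part of the original setup) is to print what a single-GPU job actually sees; Slurm presents whichever GPU it allocated as device 0.

# Inside a job that requested one GPU, Slurm remaps the allocated GPU to device 0
srun --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'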

Manual Setup for Running CUDA-MPS Jobs in a Multi-GPU Environment

To begin, we need to start a separate CUDA MPS daemon for each GPU. Below is a script I created to enable the CUDA MPS daemon across all GPUs:

#!/bin/bash
set -eux

# Determine the number of available GPUs
GPUS=$(nvidia-smi -L | wc -l)

# Start a dedicated MPS control daemon for each GPU
for i in $(seq 0 $((GPUS-1))); do
    nvidia-smi -i $i -c EXCLUSIVE_PROCESS
    mkdir -p /tmp/mps_$i /tmp/mps_log_$i
    export CUDA_VISIBLE_DEVICES=$i
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i
    export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i
    nvidia-cuda-mps-control -d
done
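
A matching teardown loop (a sketch of my own under the same directory-naming assumptions) stops each daemon and returns the GPUs to the default compute mode:

#!/bin/bash
set -eux

GPUS=$(nvidia-smi -L | wc -l)

# Stop each per-GPU MPS control daemon and restore the default compute mode
for i in $(seq 0 $((GPUS-1))); do
    echo quit | CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i nvidia-cuda-mps-control
    nvidia-smi -i $i -c DEFAULT
    rm -rf /tmp/mps_$i /tmp/mps_log_$i
done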

To run a job on a specific GPU, setting the CUDA_VISIBLE_DEVICES variable is unnecessary. Instead, we leverage the CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY environment variables to direct jobs to specific GPUs.

The MPS control daemon, MPS server, and associated MPS clients communicate using named pipes and UNIX domain sockets, which are created by default in /tmp/nvidia-mps. The CUDA_MPS_PIPE_DIRECTORY variable lets you customize the location of these pipes and sockets. It is crucial that this variable is consistently set across all MPS clients sharing the same MPS server and control daemon.
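As an optional check, you can ask the daemon behind a particular pipe directory which servers it is currently running; this uses the standard get_server_list control command with the per-GPU directory created above.

# Query the MPS control daemon that owns GPU 0's pipe directory
echo get_server_list | CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0 nvidia-cuda-mps-control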

Below is the script I used to submit a job to GPU 0. The CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable was set to 25, allowing up to four concurrent jobs on that GPU:

#!/bin/bash
source /shared/home/vinil/anaconda3/etc/profile.d/conda.sh
conda activate training_env

# Point this job at GPU 0's MPS daemon and cap it at 25% of the GPU's SMs
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_0
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_0

python case0/distributed_training.py

This approach enables job submission to each GPU by specifying unique CUDA_MPS_PIPE_DIRECTORY and CUDA_MPS_LOG_DIRECTORY values.
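For instance, a small driver loop can fan one job out to each of the eight GPUs. This is an illustrative sketch: it assumes the per-GPU pipe and log directories created earlier and a copy of the training script in the current directory.

#!/bin/bash
# Launch one training process per GPU, each talking to its own MPS daemon
for i in $(seq 0 7); do
    CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25 \
    CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_$i \
    CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log_$i \
    python distributed_training.py &
done
wait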

Below is the nvidia-smi output showcasing the results.

+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000001:00:00.0 Off |                    0 |
| N/A   33C    P0             64W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off |   00000002:00:00.0 Off |                    0 |
| N/A   32C    P0             65W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off |   00000003:00:00.0 Off |                    0 |
| N/A   32C    P0             61W /  400W |    8519MiB /  40960MiB |     16%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off |   00000004:00:00.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          Off |   0000000B:00:00.0 Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          Off |   0000000C:00:00.0 Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     22%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          Off |   0000000D:00:00.0 Off |                    0 |
| N/A   32C    P0             63W /  400W |    8519MiB /  40960MiB |     24%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          Off |   0000000E:00:00.0 Off |                    0 |
| N/A   32C    P0             61W /  400W |    8519MiB /  40960MiB |     14%   E. Process |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                     
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    122702      C   nvidia-cuda-mps-server                         30MiB |
|    0   N/A  N/A    137145    M+C   python                                       8480MiB |
|    1   N/A  N/A    123715      C   nvidia-cuda-mps-server                         30MiB |
|    1   N/A  N/A    137143    M+C   python                                       8480MiB |
|    2   N/A  N/A    124455      C   nvidia-cuda-mps-server                         30MiB |
|    2   N/A  N/A    137141    M+C   python                                       8480MiB |
|    3   N/A  N/A    130412      C   nvidia-cuda-mps-server                         30MiB |
|    3   N/A  N/A    137148    M+C   python                                       8480MiB |
|    4   N/A  N/A    131362      C   nvidia-cuda-mps-server                         30MiB |
|    4   N/A  N/A    137142    M+C   python                                       8480MiB |
|    5   N/A  N/A    132061      C   nvidia-cuda-mps-server                         30MiB |
|    5   N/A  N/A    137144    M+C   python                                       8480MiB |
|    6   N/A  N/A    137147    M+C   python                                       8480MiB |
|    6   N/A  N/A    137180      C   nvidia-cuda-mps-server                         30MiB |
|    7   N/A  N/A    137146    M+C   python                                       8480MiB |
|    7   N/A  N/A    137186      C   nvidia-cuda-mps-server                         30MiB |
+-----------------------------------------------------------------------------------------+

 

Conclusion

By integrating CUDA MPS into CycleCloud, GPU resource sharing reaches new levels of efficiency. Despite its limitations, MPS offers an excellent solution for environments prioritizing flexibility over strict isolation. This project is a step toward optimizing GPU usage for diverse workloads in cloud-based HPC clusters.

References:

NVIDIA CUDA Multi-Process Service

Azure CycleCloud

Slurm MPS Management

 
