Machine learning (ML), deep learning (DL), and AI workloads are becoming increasingly complex, demanding efficient use of hardware resources and time. With the growing need to accelerate model training and improve runtime performance, it's crucial to leverage the right tools to profile and optimize these workloads. One such tool is NVIDIA Nsight Systems, a powerful performance analysis tool that helps you visualize and optimize the behavior of your applications running on GPUs.
Why Optimize ML/DL Workloads?
In the world of ML and DL, training times can significantly affect productivity and model iteration cycles. Whether you're working with large datasets, sophisticated models, or distributed systems, optimizing workload performance is essential for faster results. Often, bottlenecks in computation or data preprocessing can reduce the overall speed of training and inference, leading to wasted resources and delayed model deployment.
This is where tools like NVIDIA Nsight Systems come in. By offering insights into how your application interacts with the underlying hardware, Nsight Systems can help pinpoint inefficiencies and optimize both GPU and CPU usage.
What is NVIDIA Nsight Systems?
NVIDIA Nsight Systems is a performance analysis tool designed to help you analyze and optimize your applications' behavior on GPUs, CPUs, and across multi-node systems. It provides a detailed, time-accurate view of system-wide activities such as kernel launches, memory transfers, and CPU/GPU interactions. Nsight Systems integrates seamlessly with CUDA, cuDNN, TensorRT, and other NVIDIA libraries, making it an ideal tool for optimizing ML, DL, and AI workloads.
You can download and install Nsight Systems from the official NVIDIA website here
Key Components in the Nsight Systems UI
When you open the Nsight Systems UI, you’ll be greeted with a comprehensive set of visualizations and performance metrics. Here are some key components of the interface:
- Timeline View: This shows the execution timeline of different threads and processes on the system. You can see how CPU and GPU workloads overlap and where bottlenecks occur.
- CUDA Kernels: Here, you can monitor the performance of CUDA kernels. It includes kernel execution time, memory accesses, and other critical metrics.
- CPU/GPU Activity: This section displays the amount of time spent on CPU and GPU operations. It helps identify if there's an imbalance between CPU and GPU workloads.
- NVTX Ranges: NVIDIA Tools Extension (NVTX) ranges are user-defined markers that help track specific events in your code. You can use these markers to visualize where time is being spent in your model or application (a short example follows below).
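For instance, in PyTorch you can emit NVTX ranges through the torch.cuda.nvtx API. Here is a minimal sketch; the range names and the code inside them are illustrative, not taken from this post's training script:

import torch

# Each push/pop pair appears as a named range on the Nsight Systems timeline.
torch.cuda.nvtx.range_push("data_loading")
# ... fetch and preprocess a batch ...
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("forward_backward")
# ... model forward pass, loss computation, backward pass ...
torch.cuda.nvtx.range_pop()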
Optimizing a Deep Learning Model: Example with FashionMNIST Dataset
Let’s dive into a practical example using NVIDIA Nsight Systems to optimize a deep learning model built on the FashionMNIST dataset.
First, let's create a simple DNN model for image classification using PyTorch:
import torch.nn as nn
import torch.nn.functional as F

class ImageClassifier(nn.Module):
    def __init__(self):
        super(ImageClassifier, self).__init__()
        # Two convolution + pooling stages: 1x28x28 -> 6x12x12 -> 16x4x4
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # Fully connected head ending in the 10 FashionMNIST classes
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 240)
        self.fc3 = nn.Linear(240, 480)
        self.fc4 = nn.Linear(480, 240)
        self.fc5 = nn.Linear(240, 120)
        self.fc6 = nn.Linear(120, 100)
        self.fc7 = nn.Linear(100, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.relu(self.fc4(x))
        x = F.relu(self.fc5(x))
        x = F.relu(self.fc6(x))
        x = self.fc7(x)
        return x
Step 1: Baseline Performance - Batch Size 100
Initially, we run the training loop for 3 epochs using a batch size of 100; the total training time is around 28 seconds. Profiling this run with Nsight Systems, the performance report (.nsys-rep file) reveals small CUDA operations followed by long CPU wait periods in which time is primarily spent loading and preprocessing the data. This points to a bottleneck in data handling rather than in GPU computation.
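For reference, here is a minimal sketch of what this baseline might look like; the transform, optimizer, loss function, and the example nsys command are illustrative assumptions rather than the original code:

# Baseline: batch size 100 with default (single-process) data loading.
# The run can be captured with, e.g.: nsys profile -t cuda,nvtx -o baseline python train.py
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.FashionMNIST(root="data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=100, shuffle=True)

model = ImageClassifier().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):
    torch.cuda.nvtx.range_push(f"Epoch {epoch}")
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        optimizer.step()
    torch.cuda.nvtx.range_pop()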
Step 2: Optimizing with Larger Batch Size
We next increase the batch size to 1000 to see if it improves performance. However, profiling shows no significant gain: simply increasing the batch size isn't enough, because the bottleneck is still on the data-loading side.
Step 3: Optimizing Data Loading with Multiple Workers
To address this, we parallelize data loading by setting num_workers = os.cpu_count() in the DataLoader, as sketched below. For this experiment, I am using a Standard_NC24ads_A100_v4 VM, which has 24 cores. This reduces the total runtime to 3.1 seconds, a marked improvement over the initial 28 seconds; the GPU now spends more time on computation rather than waiting for data.
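A minimal sketch of the adjusted DataLoader, assuming the same train_set as in the baseline above:

import os
from torch.utils.data import DataLoader

# One worker process per CPU core so batches are loaded and preprocessed
# in parallel with GPU computation.
train_loader = DataLoader(
    train_set,
    batch_size=1000,
    shuffle=True,
    num_workers=os.cpu_count(),
)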
Step 4: Reducing Fork Operations and Enhancing Asynchronous Data Transfer
Upon further analysis of the Nsight Systems timeline, we notice a significant delay between when the NVTX ranges are triggered and when code execution actually starts. This delay is attributed to the many fork operations: each fork creates a new worker process, and creating these processes repeatedly consumes a lot of time.
To address this, we:
- Set pin_memory=True in the DataLoader to enable faster data transfer from host to device memory. Pinned (page-locked) memory is host memory that cannot be paged out, so the GPU can transfer from it directly. Normally, when data is moved from CPU to GPU, it is first copied into a pinned staging buffer before the transfer. By allocating batches in pinned memory up front, this extra copy is skipped, resulting in faster data transfers.
- Set persistent_workers=True to reduce the number of fork operations during training. When this option is set, the worker processes are created once and are kept alive for the entire duration of the DataLoader's lifetime. This can save the overhead of creating and destroying worker processes for each epoch, especially when the data loading pipeline is complex or when each epoch involves loading a large amount of data. This is particularly beneficial when you have a large number of epochs or when the initialization of workers is costly.
- Use data.cuda(non_blocking=True) and target.cuda(non_blocking=True) for asynchronous data transfer from CPU to GPU (see the sketch after this list).
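Putting these changes together, the DataLoader and transfer code might look like the following sketch; variable names follow the earlier examples:

import os
from torch.utils.data import DataLoader

# Workers are forked once and reused across epochs (persistent_workers),
# and batches are staged in page-locked host memory (pin_memory).
train_loader = DataLoader(
    train_set,
    batch_size=1000,
    shuffle=True,
    num_workers=os.cpu_count(),
    pin_memory=True,
    persistent_workers=True,
)

for data, target in train_loader:
    # non_blocking=True lets the host-to-device copy overlap with CPU work,
    # which is only effective because the source tensors are pinned.
    data = data.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    # ... forward pass, loss, backward pass, optimizer step ...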
After making these adjustments, the training time drops to just 2.2 seconds, a reduction of over 92% from the original 28 seconds.
From the above figure, we can see that the runtime for epoch 1 dropped to 596.471 ms from 1.134 s, a roughly 50% improvement. The kernel operations are also more densely packed, indicating good GPU utilization: about 93% of the time is spent in CUDA kernel execution, with only 7.8% spent on host-to-device and device-to-host memory transfers, a good sign that most of the time goes to useful CUDA work.
Step 5: The Power of Automatic Mixed Precision (AMP)
One final optimization to consider is the use of AMP (Automatic Mixed Precision) training. By default, PyTorch uses FP32 precision, which can lead to higher memory requirements and slower computation. Switching to AMP allows the use of FP16 precision where possible, reducing memory usage and speeding up computation.
Although this specific model is relatively simple, AMP can significantly benefit larger models with larger batch sizes. The reduced memory footprint and improved computational efficiency can result in faster training times.
To enable mixed precision training, make the following changes to your code:
1. Enable automatic casting of operations by wrapping the forward pass and loss computation with `autocast`.
2. Initialize a GradScaler for AMP: `scaler = GradScaler()`. This scales the loss to avoid gradient underflow during backpropagation in mixed precision.
import time

import torch
from torch.amp import GradScaler, autocast

# Initialize GradScaler for AMP
scaler = GradScaler()

start = time.time()

# Start profiling
torch.cuda.profiler.start()

# Training loop
for epoch in range(3):
    torch.cuda.nvtx.range_push(f"Epoch {epoch}")
    for data, target in train_loader:
        # Asynchronous host-to-device transfer (effective with pin_memory=True)
        data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
        optimizer.zero_grad()
        # Run the forward pass and loss computation in mixed precision
        with autocast(device_type='cuda'):
            output = model(data)
            loss = loss_fn(output, target)
        # Scale loss and backpropagate
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.nvtx.range_pop()

torch.cuda.profiler.stop()
print(f"Training time: {time.time() - start:.2f} s")
Step 6: Using DistributedDataParallel for Multi-Node GPUs
For large-scale ML/DL workloads, a single GPU might not be enough. If you're using multiple GPUs, or multiple nodes with multiple GPUs, switching from DataParallel to DistributedDataParallel can greatly improve performance. DistributedDataParallel replicates the model on each GPU and splits the data across them, reducing training time and improving scalability. A minimal single-node sketch follows below.
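As a rough single-node illustration, assuming the script is launched with torchrun (which sets the LOCAL_RANK environment variable) and reusing the model and dataset names from the earlier examples:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# One process per GPU; torchrun supplies LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Replicate the model on this process's GPU and wrap it in DDP.
model = DDP(ImageClassifier().cuda(local_rank), device_ids=[local_rank])

# DistributedSampler hands each process a distinct shard of the dataset.
sampler = DistributedSampler(train_set)
train_loader = DataLoader(train_set, batch_size=1000, sampler=sampler,
                          num_workers=4, pin_memory=True,
                          persistent_workers=True)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
    for data, target in train_loader:
        data = data.cuda(local_rank, non_blocking=True)
        target = target.cuda(local_rank, non_blocking=True)
        # ... forward, loss, backward, optimizer step as before ...

dist.destroy_process_group()

Each GPU runs its own copy of this script, and DDP synchronizes gradients across processes during the backward pass, so the training loop itself stays essentially unchanged.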
Conclusion
By profiling the training process with NVIDIA Nsight Systems, we identified several optimization opportunities that resulted in a dramatic reduction in training time, from 28 seconds to just over 2 seconds, a reduction of more than 92%. From adjusting batch sizes to optimizing data loading and transfer, every step contributed to more efficient use of GPU resources. Additionally, using AMP and DistributedDataParallel can offer further performance improvements, especially with larger models and multi-GPU setups.
As machine learning workloads continue to grow in complexity, tools like NVIDIA Nsight Systems are invaluable for ensuring that your models train efficiently and that your resources are used effectively. By combining these optimizations, you can significantly reduce training time and accelerate your ML and DL workflows.
You can access the full code here
I would like to extend my heartfelt gratitude to eyasttaifour for his invaluable insights, suggestions, and contributions to this blog post. His expertise and support played a crucial role in shaping the content, and I am deeply appreciative of the collaborative spirit.