Forum Discussion

ITManager8815
Copper Contributor
Jan 24, 2025

Determining sizing requirements for GPU enabled Azure VM

Greetings,

We are trying to determine the correct VM sizing for our AI workload, which performs natural language processing (NLP). This workload does not require any training; it will be used only for inference.

We have the following software configuration:

  1. a heavily multithreaded C# application that performs a great deal of socket I/O. The application has concentrated bursts in which 10-20 threads run concurrently to perform tasks (mostly socket I/O).

This app communicates via dedicated sockets to:

  1. a Python application which performs various NLP tasks. This app is also multithreaded to handle multiple incoming requests from the .NET app. This app sends queries to a local LLM (model size will vary based on query type). We estimate we will need to support sub-second performance (at the very least) on a 7B parameter model. Ultimately, we may need to go to larger model sizes if accuracy is insufficient. The amount of text passed to the LLM will range from 300-3000 tokens.
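Whether sub-second responses are achievable can be sanity-checked with a back-of-envelope estimate: single-stream LLM decoding is typically memory-bandwidth bound, so per-token latency is roughly the model's weight size divided by GPU memory bandwidth. A minimal sketch (the 2,000 GB/s bandwidth figure is an assumed A100-class value, not a measurement):

```python
def ms_per_token(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Approximate per-token decode latency, assuming decoding is memory-bandwidth bound."""
    weight_gb = params_billion * bytes_per_param  # GB of weights streamed per generated token
    return weight_gb / bandwidth_gb_s * 1000.0

# 7B model, FP16 weights (2 bytes/param), ~2,000 GB/s HBM (A100-80GB class)
print(ms_per_token(7, 2.0, 2000))  # -> 7.0 ms/token
```

At roughly 7 ms per generated token, a short answer completes well under a second; prompt prefill for 300-3,000 input tokens adds compute-bound time on top, so this is an optimistic lower bound, not a guarantee.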

In short, we need:

  1. a CPU with enough cores to handle multiple concurrent threads on the .NET side. The app will have five or six background threads running continuously, plus sudden bursts of activity requiring at least 10-20 shorter-lived threads.
  2. a GPU with enough VRAM to handle, at the very least, a 7B parameter model. We may ultimately need larger models if accuracy proves insufficient.

We need the ideal configuration of GPU/VRAM and CPU/RAM to handle these tasks, and also, potentially, larger LLM sizes of up to 14B or 70B parameters.
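As a rough guide, weight memory scales linearly with parameter count and precision, plus overhead for the KV cache and activations. A hypothetical estimator (the 20% overhead factor is an assumption; real overhead depends on context length and batch size):

```python
def vram_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM need: weight bytes plus ~20% for KV cache and activations (assumed)."""
    weights_gb = params_billion * bits_per_param / 8  # bits -> bytes
    return weights_gb * overhead

for size in (7, 14, 70):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{vram_gb(size, bits):.0f} GB")
```

By this estimate a 7B model at FP16 needs ~17 GB (fits on one 24-40 GB GPU), 14B at FP16 needs ~34 GB, and 70B needs ~168 GB at FP16 but only ~42 GB at 4-bit quantization.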

We are looking at the NC-series VMs, with a budget of about $1,000/month (see https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/#pricing). Any feedback on the optimal configuration in terms of CPU/GPU would be greatly appreciated.

Thank you in advance.

  • How about this:

    • CPU Requirements: The NC-series VMs provide numerous vCPUs and substantial memory, which should comfortably handle your multithreaded C# application. For instance, the NC24ads A100 v4 size offers 24 vCPUs and 220 GB of RAM. (The older NC24r and the NVv4 series are less suitable here: the NC24r uses previous-generation K80 GPUs, and NVv4 uses partitioned AMD GPUs aimed at visualization rather than CUDA-based LLM inference.)
    • GPU Requirements: Given your need to support at least a 7B parameter model and potentially larger ones, the NC A100 v4-series is a strong fit. It provides NVIDIA A100 Tensor Core GPUs, which are among the best options for AI inference workloads. A single A100 (40 or 80 GB) comfortably holds a 7B model at FP16; a 70B model, however, needs roughly 140 GB at FP16, so it would require either multiple GPUs or aggressive quantization (e.g., 4-bit) to fit on one 80 GB card.
    • Budget Considerations: The NC A100 v4 instances, such as the NC96ads_A100_v4, offer a balance between performance and cost. They allow you to take advantage of NVIDIA's Multi-Instance GPU (MIG) technology, which means you can partition the GPU to optimize resource usage and potentially reduce costs.

    Note, however, that running an NC A100 v4 instance around the clock at pay-as-you-go rates will substantially exceed a $1,000/month budget. To stay within it, consider Spot VMs, reserved-instance pricing, or running the VM only during active hours, and verify the numbers for your region with the Azure Pricing Calculator.
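One quick sanity check on the budget: Azure bills a full month as roughly 730 hours, so a fixed monthly budget caps the sustainable hourly rate for a 24x7 VM. A simple sketch (actual regional rates must come from the pricing calculator):

```python
HOURS_PER_MONTH = 730  # Azure's standard monthly-hours assumption

def max_hourly_rate(monthly_budget: float, hours: float = HOURS_PER_MONTH) -> float:
    """Highest pay-as-you-go rate the budget covers for round-the-clock operation."""
    return monthly_budget / hours

print(round(max_hourly_rate(1000), 2))  # -> 1.37 USD/hr ceiling for 24x7 use
```

Any GPU size priced above that ceiling forces a trade-off: fewer running hours, Spot/reserved discounts, or a larger budget.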
