Azure High Performance Computing (HPC) Blog

Optimizing Language Model Inference on Azure

Microsoft

Oct 02, 2024

By Shantanu Deepak Patankar, Software Engineer Intern, and Hugo Affaticati, Technical Program Manager 2

Inefficient inference optimization can lead to skyrocketing costs for customers, making i...

Updated Nov 13, 2024

Version 2.0

HugoAffaticati

Microsoft

Nov 13, 2024

Hello Dmonakhov, thank you for your interest in our benchmarking approach!

The optimum values referenced in this post serve as examples to illustrate our optimization process. Variations in these values are expected, particularly as we continue updating the virtual machine configurations. Throughput performance depends on both the VM version and on how the engines are configured. For instance, hyperparameters like --max_num_tokens and --max_seq_len impact memory allocation during engine build, which in turn influences throughput. These parameters can be tailored to specific use cases, enabling optimal configurations across different engine setups.

As you mentioned, while the exact throughput values may vary, the general shape of the throughput-to-batch-size curve remains consistent. For further customization, you can set --max_num_tokens above max_batch_size * max_seq_len here: GitHub Link. With increased values, the Azure team has successfully enabled engines to support larger batch sizes.
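
As a rough illustration of the relationship described above, the following sketch computes the product max_batch_size * max_seq_len and checks whether a chosen max_num_tokens value lies above it. The function names and the numeric values are hypothetical and chosen only for the arithmetic; they are not taken from the benchmark code or from TensorRT-LLM.

```python
def token_budget(max_batch_size: int, max_seq_len: int) -> int:
    """Product that the reply suggests exceeding with max_num_tokens."""
    return max_batch_size * max_seq_len


def is_above_budget(max_num_tokens: int, max_batch_size: int, max_seq_len: int) -> bool:
    """True when max_num_tokens is strictly above max_batch_size * max_seq_len."""
    return max_num_tokens > token_budget(max_batch_size, max_seq_len)


# Hypothetical example values, not recommendations.
budget = token_budget(max_batch_size=64, max_seq_len=2048)
print(budget)                               # 131072
print(is_above_budget(200_000, 64, 2048))   # True
print(is_above_budget(100_000, 64, 2048))   # False
```

Whether a value above this product is appropriate depends on the available GPU memory and the engine version, so treat the check as a starting point only.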

Thank you again for your valuable feedback!

dmonakhov
Copper Contributor
Nov 14, 2024
"For further customization, you can set --max_num_tokens above max_batch_size * max_seq_len here: GitHub Link. With increased values, the Azure team has successfully enabled engines to support larger batch sizes."
The benchmark you mention uses TensorRT-LLM for inference, which has a bug that prevents tensor sizes above 1<<31 (see https://github.com/NVIDIA/TensorRT-LLM/issues/2422). In practice, this means that for seq_len=1024,128 it is impossible to scale above batch_size=512. So there is no way to scale the seq_len=1024,128 configuration you mentioned up to batch_size=750, as claimed in the post. The benchmark AI-benchmarking-guide/Benchmarks/LLMBenchmark.py simply crashes on assertions inside TensorRT-LLM. You are probably using a different inference benchmark that does not have this limitation, or TensorRT-LLM with a smaller seq_len; for example, seq_len=128,8 can scale up to batch_size=4k.
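
The two ceilings quoted above (batch_size=512 for seq_len=1024,128 and batch_size=4k for seq_len=128,8) are consistent with a simple size check. The sketch below is only a guess at the arithmetic: it assumes the limiting tensor holds batch_size * input_len * hidden_size elements and that hidden_size is 4096. Neither the formula nor the hidden size is stated in this thread, and both should be checked against the linked issue.

```python
TENSOR_LIMIT = 1 << 31  # tensor-size ceiling cited in the comment


def max_batch_size(input_len: int, hidden_size: int = 4096,
                   limit: int = TENSOR_LIMIT) -> int:
    """Largest batch size for which batch * input_len * hidden_size <= limit.

    ASSUMPTION: the formula and the default hidden_size of 4096 are
    illustrative guesses, not taken from TensorRT-LLM.
    """
    return limit // (input_len * hidden_size)


print(max_batch_size(1024))  # 512
print(max_batch_size(128))   # 4096

# Under the same assumptions, batch_size=750 with input_len=1024 exceeds the limit.
print(750 * 1024 * 4096 > TENSOR_LIMIT)  # True
```

If the assumed formula holds, halving the input length doubles the reachable batch size, which matches the pattern in the numbers reported above.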