Hello Dmonakhov, thank you for your interest in our benchmarking approach!
The optimum values referenced in this post serve as examples to illustrate our optimization process. Variations in these values are expected, particularly as we continue updating the virtual machine configurations. Throughput performance depends on both the VM version and on how the engines are configured. For instance, hyperparameters like --max_num_tokens and --max_seq_len impact memory allocation during engine build, which in turn influences throughput. These parameters can be tailored to specific use cases, enabling optimal configurations across different engine setups.
As you mentioned, the exact throughput values may vary, but the general shape of the throughput-to-batch-size curve remains consistent. For further customization, you can set --max_num_tokens above max_batch_size * max_seq_len here: GitHub Link. With increased values, the Azure team has successfully enabled engines to support larger batch sizes.
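To make the comparison above concrete, here is a minimal Python sketch of the arithmetic only. It does not call any engine-build tooling; the helper names token_budget and exceeds_budget and the sample numbers are hypothetical, chosen purely for illustration.

```python
def token_budget(max_batch_size: int, max_seq_len: int) -> int:
    """Tokens in one batch if every sequence were at full length.

    Hypothetical helper: this is just the product max_batch_size * max_seq_len.
    """
    return max_batch_size * max_seq_len


def exceeds_budget(max_num_tokens: int, max_batch_size: int, max_seq_len: int) -> bool:
    """True when max_num_tokens is set strictly above max_batch_size * max_seq_len."""
    return max_num_tokens > token_budget(max_batch_size, max_seq_len)


if __name__ == "__main__":
    # Illustrative values only, not taken from the post.
    batch, seq_len = 64, 2048
    print("budget:", token_budget(batch, seq_len))
    print("above budget:", exceeds_budget(200_000, batch, seq_len))
```

A check like this can be run before an engine build to confirm which side of the product a chosen --max_num_tokens value falls on.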
Thank you again for your valuable feedback!