Apps on Azure Blog
5 MIN READ

Using NVIDIA Triton Inference Server on Azure Container Apps

theringe
Microsoft
Mar 06, 2025

This tutorial provides a step-by-step guide to deploying the NVIDIA Triton Inference Server on Azure Container Apps and running image inference with a sample ONNX model.

TOC

  1. Introduction to Triton
  2. System Architecture
    • Architecture
    • Focus of This Tutorial
  3. Setup Azure Resources
    • File and Directory Structure
    • ARM Template
    • ARM Template From Azure Portal
  4. Testing Azure Container Apps
  5. Conclusion
  6. References

 

1. Introduction to Triton

Triton Inference Server is an open-source, high-performance inferencing platform developed by NVIDIA to simplify and optimize AI model deployment. Designed for both cloud and edge environments, Triton enables developers to serve models from multiple deep learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, TensorRT, and OpenVINO, using a single standardized interface. Its goal is to streamline AI inferencing while maximizing hardware utilization and scalability.

 

A key feature of Triton is its support for multiple model execution modes, including dynamic batching, concurrent model execution, and multi-GPU inferencing. These capabilities allow organizations to efficiently serve AI models at scale, reducing latency and optimizing throughput. Triton also offers built-in support for HTTP/REST and gRPC endpoints, making it easy to integrate with various applications and workflows. Additionally, it provides model monitoring, logging, and GPU-accelerated inference optimization, enhancing performance across different hardware architectures.
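To make the endpoint support concrete, here is a minimal, purely local sketch (not part of the Azure deployment in this tutorial): it serves a model repository with the same Triton image used later and queries the HTTP/REST endpoint. The repository path and model name are placeholders.

# Local sketch only: serve a placeholder model repository with the Triton image used later in this tutorial.
# Ports: 8000 = HTTP/REST, 8001 = gRPC, 8002 = metrics.
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:22.04-py3 \
  tritonserver --model-repository=/models

# In another terminal: readiness check and model metadata (the model name is a placeholder).
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
curl -s http://localhost:8000/v2/models/densenet_onnx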

 

Triton is widely used in AI-powered applications such as autonomous vehicles, healthcare imaging, natural language processing, and recommendation systems. It integrates seamlessly with NVIDIA AI tools, including TensorRT for high-performance inference and DeepStream for video analytics. By providing a flexible and scalable deployment solution, Triton enables businesses and researchers to bring AI models into production with ease, ensuring efficient and reliable inferencing in real-world applications.

 

2. System Architecture

Architecture

Development Environment

OS: Ubuntu
Version: Ubuntu 18.04 Bionic Beaver
Docker version: 26.1.3

Azure Resources

Storage Account: SKU - General Purpose V2
Container Apps Environment: SKU - Consumption
Container Apps: N/A

 

Focus of This Tutorial

This tutorial walks you through the following stages:

  1. Setting up Azure resources
  2. Publishing the project to Azure
  3. Testing the application

Each of these stages can be accomplished with a number of different tools and approaches. The choices used in this tutorial are marked in the tables below.

 

| Local OS | Windows | Linux | Mac |
| --- | --- | --- | --- |
|  |  | V |  |

| How to setup Azure resources and deploy | Portal (i.e., REST API) | ARM | Bicep | Terraform |
| --- | --- | --- | --- | --- |
|  |  | V |  |  |

3. Setup Azure Resources

File and Directory Structure

Please open a terminal and enter the following commands:

git clone https://github.com/theringe/azure-appservice-ai.git
cd azure-appservice-ai

After the clone completes, you should see the following file, which is the focus of this tutorial:

| File and Path | Purpose |
| --- | --- |
| triton/tools/arm-template.json | The ARM template that sets up all the Azure resources related to this tutorial, including a Container Apps environment, a Container App, and a Storage Account populated with the sample dataset. |

ARM Template

We need to create the following resources or services:

 

|  | Manual Creation Required | Resource/Service |
| --- | --- | --- |
| Container Apps Environment | Yes | Resource |
| Container Apps | Yes | Resource |
| Storage Account | Yes | Resource |
| Blob | Yes | Service |
| Deployment Script | Yes | Resource |

 

Let’s take a look at the triton/tools/arm-template.json file; the configuration for all of these resources is defined there.

Since most of the configuration values don’t require changes, I’ve placed them in the variables section of the ARM template rather than the parameters section. This helps keep the configuration simpler. However, I’d still like to briefly explain some of the more critical settings.
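If you want to see these values yourself, a quick way to peek at the variables section is shown below (assuming jq is installed locally):

# Print the variables section of the ARM template.
jq '.variables' triton/tools/arm-template.json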

 

As you can see, I’ve adopted a camelCase naming convention that combines the [Resource Type] with the [Setting Name and Hierarchy], which makes it easier to tell where each setting is used. In the template the configurations are sorted by resource name, but the following list is grouped by functionality for clarity.

| Configuration Name | Value | Purpose |
| --- | --- | --- |
| storageAccountContainerName | data-and-model | [Purpose 1: Blob Container for Model Storage] Use this fixed name for the Blob Container. |
| scriptPropertiesRetentionInterval | P1D | [Purpose 2: Script for Uploading Models to Blob Storage] No adjustments are needed. This script launches a one-time instance right after the Blob Container is created, downloads the sample model files, and uploads them to the Blob Container. The Deployment Script resource is automatically deleted after one day. |
| caeNamePropertiesPublicNetworkAccess | Enabled | [Purpose 3: For Testing] The Container Apps environment must be reachable from your local machine for testing, so external access is enabled. |
| appPropertiesConfigurationIngressExternal | true | [Purpose 3: For Testing] Same as above. |
| appPropertiesConfigurationIngressAllowInsecure | true | [Purpose 3: For Testing] Same as above. |
| appPropertiesConfigurationIngressTargetPort | 8000 | [Purpose 3: For Testing] The Triton service container uses port 8000. |
| appPropertiesTemplateContainers0Image | nvcr.io/nvidia/tritonserver:22.04-py3 | [Purpose 3: For Testing] The Triton service container uses this publicly available image. |
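For reference, a minimal sketch of invoking this template with the Azure CLI might look like the following; the resource group name and location are placeholders, and if the template defines any required parameters, az will prompt for them:

# Minimal sketch: deploy the ARM template with the Azure CLI.
# "rg-triton-aca" and "eastus" are placeholders; adjust them to your environment.
az group create --name rg-triton-aca --location eastus
az deployment group create \
  --resource-group rg-triton-aca \
  --template-file triton/tools/arm-template.json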

 

ARM Template From Azure Portal

In addition to invoking the ARM template with the Azure CLI as sketched above, if the JSON file is hosted at a publicly accessible URL, you can also load its configuration directly into the Azure Portal by following the method described in the article [Deploy to Azure button - Azure Resource Manager]. Here is my example:

Click Me
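For reference, the link behind such a button is simply the Portal's template-deployment URL with the raw template URL percent-encoded. The following sketch builds one; the raw GitHub URL and branch are assumptions based on the repository cloned earlier:

# Build a "Deploy to Azure"-style link for a publicly hosted template (requires jq for URL encoding).
# The raw template URL below is an assumption; point it at wherever your template is hosted.
TEMPLATE_URL="https://raw.githubusercontent.com/theringe/azure-appservice-ai/main/triton/tools/arm-template.json"
ENCODED=$(jq -rn --arg u "$TEMPLATE_URL" '$u|@uri')
echo "https://portal.azure.com/#create/Microsoft.Template/uri/${ENCODED}"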

After filling in all the required information, click Create.

Once the creation process is complete, we are ready to test.
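Optionally, before testing, you can confirm that the Deployment Script populated the Blob Container. A quick check with the Azure CLI might look like this; the storage account name is a placeholder, and your identity needs data-plane access to the container (or switch to --auth-mode key):

# Optional check: list the blobs uploaded by the Deployment Script.
# <your-storage-account> is a placeholder; data-and-model is the container name from the template.
az storage blob list \
  --account-name <your-storage-account> \
  --container-name data-and-model \
  --auth-mode login \
  --output table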

 

4. Testing Azure Container Apps

In our local environment, use the following command to start a one-off Docker container. We will use NVIDIA's official Triton client (SDK) image and send a sample image from it to the Triton service that was just deployed to Container Apps.

# Replace XXX.YYY.ZZZ.azurecontainerapps.io with the actual FQDN of your app. There is no need to add https://
docker run --rm nvcr.io/nvidia/tritonserver:22.04-py3-sdk /workspace/install/bin/image_client -u XXX.YYY.ZZZ.azurecontainerapps.io -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

After sending the request, you should see the prediction results, indicating that the deployed Triton server service is functioning correctly.
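If the client reports an error instead, a couple of quick checks against Triton's HTTP API can confirm that the service is reachable and the sample model is loaded. This assumes the insecure-ingress setting from the template, so plain HTTP works; replace the FQDN with yours:

# Server readiness (expect HTTP 200) and metadata of the sample model.
curl -s -o /dev/null -w "%{http_code}\n" http://XXX.YYY.ZZZ.azurecontainerapps.io/v2/health/ready
curl -s http://XXX.YYY.ZZZ.azurecontainerapps.io/v2/models/densenet_onnx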

 

5. Conclusion

Beyond basic model hosting, Triton Inference Server's greatest strength lies in its ability to efficiently serve AI models at scale. It supports multiple deep learning frameworks, allowing seamless deployment of diverse models within a single infrastructure. With features like dynamic batching, multi-GPU execution, and optimized inference pipelines, Triton ensures high performance while reducing latency. While it may not replace custom-built inference solutions for highly specialized workloads, it excels as a standardized and scalable platform for deploying AI across cloud and edge environments. Its flexibility makes it ideal for applications such as real-time recommendation systems, autonomous systems, and large-scale AI-powered analytics.

6. References

Quickstart — NVIDIA Triton Inference Server

Deploying an ONNX Model — NVIDIA Triton Inference Server

Model Repository — NVIDIA Triton Inference Server

Triton Tutorials — NVIDIA Triton Inference Server

 

Updated Mar 06, 2025
Version 1.0