This tutorial provides a step-by-step guide to deploying the NVIDIA Triton Inference Server on Azure Container Apps and using a sample ONNX model for image inference.
TOC
- Introduction to Triton
- System Architecture
- Architecture
- Focus of This Tutorial
- Setup Azure Resources
- File and Directory Structure
- ARM Template
- ARM Template From Azure Portal
- Testing Azure Container Apps
- Conclusion
- References
1. Introduction to Triton
Triton Inference Server is an open-source, high-performance inferencing platform developed by NVIDIA to simplify and optimize AI model deployment. Designed for both cloud and edge environments, Triton enables developers to serve models from multiple deep learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, TensorRT, and OpenVINO, using a single standardized interface. Its goal is to streamline AI inferencing while maximizing hardware utilization and scalability.
A key feature of Triton is its support for multiple model execution modes, including dynamic batching, concurrent model execution, and multi-GPU inferencing. These capabilities allow organizations to efficiently serve AI models at scale, reducing latency and optimizing throughput. Triton also offers built-in support for HTTP/REST and gRPC endpoints, making it easy to integrate with various applications and workflows. Additionally, it provides model monitoring, logging, and GPU-accelerated inference optimization, enhancing performance across different hardware architectures.
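As a quick illustration of those HTTP/REST endpoints, the sketch below probes a running Triton server with plain curl. It assumes a server listening on localhost:8000 and the densenet_onnx sample model used later in this tutorial; adjust the host and model name to your own setup.
# Check that the server is live and ready to accept inference requests (200 = ready)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
# Fetch server metadata (name, version, supported extensions)
curl -s http://localhost:8000/v2
# Fetch metadata for a specific model (here, the densenet_onnx sample)
curl -s http://localhost:8000/v2/models/densenet_onnx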
Triton is widely used in AI-powered applications such as autonomous vehicles, healthcare imaging, natural language processing, and recommendation systems. It integrates seamlessly with NVIDIA AI tools, including TensorRT for high-performance inference and DeepStream for video analytics. By providing a flexible and scalable deployment solution, Triton enables businesses and researchers to bring AI models into production with ease, ensuring efficient and reliable inferencing in real-world applications.
2. System Architecture
Architecture
Development Environment

| Item | Value |
| --- | --- |
| OS | Ubuntu |
| Version | Ubuntu 18.04 (Bionic Beaver) |
| Docker version | 26.1.3 |

Azure Resources

| Resource | SKU |
| --- | --- |
| Storage Account | General Purpose V2 |
| Container Apps Environments | Consumption |
| Container Apps | N/A |
Focus of This Tutorial
This tutorial walks you through the following stages:
- Setting up Azure resources
- Publishing the project to Azure
- Testing the application
Each of these aspects can be handled with a variety of tools and solutions. The ones used in this tutorial are listed in the tables below.
Local OS

| Windows | Linux | Mac |
| --- | --- | --- |
|  | V |  |

How to set up Azure resources and deploy

| Portal (i.e., REST API) | ARM | Bicep | Terraform |
| --- | --- | --- | --- |
|  | V |  |  |
3. Setup Azure Resources
File and Directory Structure
Please open a terminal and enter the following commands:
git clone https://github.com/theringe/azure-appservice-ai.git
cd azure-appservice-ai
After the commands finish, you should see the following directory structure:
| File and Path | Purpose |
| --- | --- |
| triton/tools/arm-template.json | The ARM template that sets up all the Azure resources related to this tutorial, including a Container Apps environment, a Container App, and a Storage Account with the sample dataset. |
ARM Template
We need to create the following resources or services:
| Name | Manual Creation Required | Resource/Service |
| --- | --- | --- |
| Container Apps Environments | Yes | Resource |
| Container Apps | Yes | Resource |
| Storage Account | Yes | Resource |
| Blob | Yes | Service |
| Deployment Script | Yes | Resource |
Let’s take a look at the triton/tools/arm-template.json file; the configuration for all of the resources above is defined there.
Since most of the configuration values don’t require changes, I’ve placed them in the variables section of the ARM template rather than the parameters section. This helps keep the configuration simpler. However, I’d still like to briefly explain some of the more critical settings.
As you can see, I’ve adopted a camelCase naming convention, which combines the [Resource Type] with [Setting Name and Hierarchy]. This makes it easier to understand where each setting will be used. The configurations in the diagram are sorted by resource name, but the following list is categorized by functionality for better clarity.
| Configuration Name | Value | Purpose |
| --- | --- | --- |
| storageAccountContainerName | data-and-model | [Purpose 1: Blob Container for Model Storage] Use this fixed name for the Blob Container. |
| scriptPropertiesRetentionInterval | P1D | [Purpose 2: Script for Uploading Models to Blob Storage] No adjustments are needed. This script launches a one-time instance immediately after the Blob Container is created; it downloads the sample model files and uploads them to the Blob Container. The Deployment Script resource is automatically deleted after one day. |
| caeNamePropertiesPublicNetworkAccess | Enabled | [Purpose 3: For Testing] ACA requires your local machine to perform tests; therefore, external access must be enabled. |
| appPropertiesConfigurationIngressExternal | true | [Purpose 3: For Testing] Same as above. |
| appPropertiesConfigurationIngressAllowInsecure | true | [Purpose 3: For Testing] Same as above. |
| appPropertiesConfigurationIngressTargetPort | 8000 | [Purpose 3: For Testing] The Triton service container listens on port 8000. |
| appPropertiesTemplateContainers0Image | nvcr.io/nvidia/tritonserver:22.04-py3 | [Purpose 3: For Testing] The Triton service container uses this public image. |
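For completeness, here is a minimal az CLI sketch for deploying this template from the root of the cloned repository. The resource group name and location are placeholders of my own choosing, not values from the repository; append --parameters if you want to override anything the template exposes.
# Create a resource group (name and location are placeholders; choose your own)
az group create --name triton-aca-demo --location eastus
# Deploy the ARM template from the cloned repository
az deployment group create --resource-group triton-aca-demo --template-file triton/tools/arm-template.json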
ARM Template From Azure Portal
In addition to invoking the ARM template with the az CLI, you can load its configuration directly into the Azure Portal when the JSON file is hosted at a publicly accessible URL, following the method described in the article [Deploy to Azure button - Azure Resource Manager]. Here is my example.
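The link format itself comes from that article: it is https://portal.azure.com/#create/Microsoft.Template/uri/ followed by the URL-encoded address of the raw template file. The sketch below shows one way to compose such a link; the raw GitHub URL (including the branch name) is my assumption, so point it at wherever your copy of the template is actually hosted.
# Raw URL of the ARM template (branch name assumed; adjust to your fork/branch)
RAW_URL="https://raw.githubusercontent.com/theringe/azure-appservice-ai/main/triton/tools/arm-template.json"
# URL-encode it and compose the "Deploy to Azure" portal link
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1], safe=''))" "$RAW_URL")
echo "https://portal.azure.com/#create/Microsoft.Template/uri/${ENCODED}"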
After filling in all the required information, click Create.
Once the creation process is complete, we can run a test.
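Optionally, before testing, you can confirm that the deployment script has uploaded the sample model files by listing the blobs in the data-and-model container. The storage account name below is a placeholder for the one created by the template, and the command assumes your account has data-plane read access (for example, the Storage Blob Data Reader role).
# List the uploaded model files (replace <storage-account-name> with the account created by the template)
az storage blob list --account-name <storage-account-name> --container-name data-and-model --auth-mode login --output table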
4. Testing Azure Container Apps
In our local environment, use the following command to start a one-time Docker container. We will use NVIDIA's official client (SDK) image and, from inside it, send a sample image to the Triton service that was just deployed to Container Apps.
# Replace XXX.YYY.ZZZ.azurecontainerapps.io with the actual FQDN of your app. There is no need to add https://
docker run --rm nvcr.io/nvidia/tritonserver:22.04-py3-sdk /workspace/install/bin/image_client -u XXX.YYY.ZZZ.azurecontainerapps.io -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
After sending the request, you should see the prediction results, indicating that the deployed Triton server service is functioning correctly.
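If the client does not return predictions, a quick way to narrow things down is to check that the Triton endpoint is reachable and that the model is loaded. The checks below reuse the same placeholder FQDN and Triton's standard /v2 HTTP endpoints.
# Replace XXX.YYY.ZZZ.azurecontainerapps.io with the actual FQDN of your app
# A 200 response means the server is ready to accept inference requests
curl -s -o /dev/null -w "%{http_code}\n" https://XXX.YYY.ZZZ.azurecontainerapps.io/v2/health/ready
# Confirm that the densenet_onnx model is loaded and ready
curl -s -o /dev/null -w "%{http_code}\n" https://XXX.YYY.ZZZ.azurecontainerapps.io/v2/models/densenet_onnx/ready
If the health check fails, the container logs (available in the Azure Portal or via az containerapp logs show) are the next place to look.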
5. Conclusion
Beyond basic model hosting, Triton Inference Server's greatest strength lies in its ability to efficiently serve AI models at scale. It supports multiple deep learning frameworks, allowing seamless deployment of diverse models within a single infrastructure. With features like dynamic batching, multi-GPU execution, and optimized inference pipelines, Triton ensures high performance while reducing latency. While it may not replace custom-built inference solutions for highly specialized workloads, it excels as a standardized and scalable platform for deploying AI across cloud and edge environments. Its flexibility makes it ideal for applications such as real-time recommendation systems, autonomous systems, and large-scale AI-powered analytics.
6. References
Quickstart — NVIDIA Triton Inference Server
Deploying an ONNX Model — NVIDIA Triton Inference Server
Model Repository — NVIDIA Triton Inference Server
Triton Tutorials — NVIDIA Triton Inference Server