In the modern deep learning era, post-Transformer, there has been exponential growth in AI startups. These startups leverage cutting-edge research to build innovative products for both users and businesses. However, creating and deploying a scalable machine-learning-embedded product that meets end-to-end client requirements is challenging. Challenges include handling streaming data volumes, integrating new models, and achieving extremely low latency. Enter Triton Inference Server, open-source serving software that enables the deployment of deep learning models for inference in production environments.
In this blog post, we will explore how Triton is used to deploy a multi-model deep learning architecture. We'll also take a deep dive into the standout features of the Triton Inference Server.
Challenges with Multi-Model Real-Time Inference
Real-time inference of deployed deep learning models is a pretty complex task. Here, we will discuss a few challenges that need to be addressed for successful deployment in various applications.
Low Latency: Deployed models are held to strict latency budgets (in practice, often around 200 ms) while also being optimized for high throughput.
Scalability: Scaling the infrastructure to handle a large number of simultaneous requests requires load balancing and distributed computing so that varying loads can be served cost-effectively.
Model Complexity: Multi-model systems can be highly complex, as they need to understand and integrate information from different sources. Training and deploying these models for real-time inference can be resource-intensive.
Resource Management: Managing resources, such as CPU, GPU, memory, and storage, is crucial for real-time inference. Ensuring that resources are allocated efficiently to handle multi-model data can be complex.
Deployment and Maintenance: Deploying and maintaining multi-model real-time systems in the field can be complex, as they often require ongoing updates, maintenance, and monitoring.
In the upcoming section, we will look at the key functionalities and most useful features of the Triton Inference Server.
Understanding Triton Inference Server
Figure: Triton Server Architecture with highlighted features (source: Triton Architecture)
Let's have a look at the 7 most powerful features of the Triton Inference Server:
1. Computational Devices Plugin
In a deployment environment, there can be multiple associated models fulfilling a single task or an ensemble of tasks, and these models might need different types of compute backends (GPUs/CPUs). Triton supports hosting multiple models on different computational devices such as GPUs and CPUs.
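For example, where and how many instances of a model run is controlled through the instance_group setting in its config.pbtxt. A minimal sketch (the instance counts here are illustrative, not recommendations):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]
```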
2. Monitoring Feature
When we host a server in production, it is really important that we plug in monitoring pipelines, whether for model monitoring or infrastructure monitoring. The Triton server has a built-in metrics system that exposes GPU utilization, throughput, and latency statistics in Prometheus format over an HTTP endpoint, so it can be seamlessly integrated with monitoring stacks such as Prometheus and Grafana, which scrape and visualize these metrics.
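A quick way to inspect these metrics is to scrape the endpoint directly. A minimal sketch, assuming Triton's default metrics port of 8002 and that the requests package is available:

```python
import requests

# Fetch the Prometheus-format metrics exposed by Triton over HTTP.
metrics = requests.get("http://localhost:8002/metrics", timeout=5).text

# Print the inference success counters as a quick sanity check.
for line in metrics.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```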
3. Multiple Framework Support
The Triton Inference Server provides flexibility to host models built upon different deep learning frameworks such as PyTorch, TensorFlow, ONNX, etc. This feature enables us to seamlessly develop task-specific models and deploy them to serve a client's request.
4. Scheduler
When we use multiple models, each with multiple versions and each receiving a high volume of inference requests, we need an effective orchestrator and scheduler. This functionality is built into the Triton Inference Server.
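One commonly used scheduling feature is the dynamic batcher, which groups individual requests into larger batches on the server side. A minimal sketch of enabling it in a model's config.pbtxt (the batch sizes and queue delay are illustrative values, not recommendations):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```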
5. HTTP and gRPC
Triton exposes both HTTP/REST and gRPC endpoints, each on its own port (by default, 8000 for HTTP and 8001 for gRPC), so clients can integrate over whichever protocol suits them.
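For instance, the Python client libraries can target either protocol. A minimal sketch assuming the default ports and a locally running server:

```python
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient

# HTTP/REST client (default port 8000)
http_client = httpclient.InferenceServerClient(url="localhost:8000")

# gRPC client (default port 8001)
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")

print(http_client.is_server_live(), grpc_client.is_server_live())
```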
6. Client Application
Triton provides client SDKs in several languages, including Python, C++, and Java, which can be used to create client applications that interact with the inference server.
7. Model Analyzer
Triton's Model Analyzer offers a range of optimization features. We can think of it as a grid-search tool that explores various optimization options and reports the most optimized combination of model configurations.
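As an illustration, a typical invocation might look like the following sketch; the repository path and model name are placeholders to adapt to your setup:

```bash
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models <model_name>
```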
Getting Started
Docker and Triton Installations
You can check the installation of the server as:
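A minimal sketch, assuming Docker is already installed; replace <yy.mm> with an NGC release tag, and run the health check once the server is up:

```bash
# Pull the Triton server container from NGC.
docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3

# With the server running, the readiness endpoint should return HTTP 200.
curl -v localhost:8000/v2/health/ready
```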
Setting Up the Project Repository
Triton expects the models, configurations, version files, etc. in a specific file structure format. These can be stored in a local file system or a cloud object store like E2E Solutions; check out the object storage options on their website.
Triton for Gen AI Inference
In this section, we will deploy and run inference on a generative AI model, DCGAN trained on the FashionGen dataset, from PyTorch Hub, and understand how to use Triton for single-model inference. The official Triton Inference Server GitHub repository contains Quick_Deploy examples for different frameworks such as ONNX, TensorFlow, vLLM, and so on.
Figure: Inference pipeline for deployment and generation with DCGAN.
In this section, we will use the single-pipeline approach for client-model interaction: we will deploy the pipeline without explicitly breaking the model apart from the rest of the pipeline.
Setting Up the Project Repository
As discussed earlier, we will have to set up our model repository in a format that Triton can read, as shown below:
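A layout along these lines is what Triton expects, with the folder names matching the model name and version used in this post:

```
model_repository/
└── DCGAN_FASHIONGEN/
    ├── config.pbtxt
    └── 1/
        └── model.pt
```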
After creating the file structure, save the ‘model.pt’ file in the version folder ‘1’ and the ‘config.pbtxt’ file in the DCGAN_FASHIONGEN folder; the numbered subfolder is what indicates the model version to Triton.
Now that our file structure is ready, we will turn to PyTorch Hub to save our generative AI model, DCGAN.
Preparing the TorchScript Model
We will prepare a DCGAN.py file to export the model and save a TorchScript trace of the generator, which produces 64x64 images.
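A minimal sketch of DCGAN.py, assuming the facebookresearch/pytorch_GAN_zoo entry point on PyTorch Hub; the getNetG() accessor is an assumption and may instead be exposed as model.netG in some versions of the library:

```python
# DCGAN.py - load the pretrained FashionGen DCGAN from PyTorch Hub,
# trace its generator with TorchScript, and save it for Triton.
import torch

use_gpu = torch.cuda.is_available()
model = torch.hub.load('facebookresearch/pytorch_GAN_zoo:hub', 'DCGAN',
                       pretrained=True, useGPU=use_gpu)

# Build a batch of latent noise vectors; the wrapper returns (noise, labels).
noise, _ = model.buildNoiseData(1)

# Trace only the generator so Triton can serve it as a TorchScript model.
# getNetG() is assumed here; depending on the pytorch_GAN_zoo version the
# generator may instead be available as model.netG.
generator = model.getNetG()
traced = torch.jit.trace(generator, noise)
traced.save('model.pt')  # the FashionGen DCGAN produces 64x64 images
```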
Now, as we have the model and repository structure ready, let's look at the config.pbtxt model configuration file. At a minimum, the configuration file must specify the platform (or backend) property, the max_batch_size property, and the input and output tensors of the model; an example follows the field descriptions below.
Setting Up the Model Configuration
- Name: ‘name’ is an optional field, the value of which should match the name of the directory of the model.
- Backend: This field indicates which backend is being used to run the model. Triton supports a wide variety of backends like TensorFlow, PyTorch, Python, ONNX and more.
- max_batch_size: As the name implies, this field defines the maximum batch size that the model can support.
- Input and output: The input and output sections specify the name, shape, datatype, and more, while providing operations like reshaping.
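Putting these fields together, a minimal config.pbtxt for the traced DCGAN generator might look like the sketch below. The INPUT__0/OUTPUT__0 names follow the PyTorch backend's naming convention for traced models, and the latent size of 64 is an assumption that should match the noise shape returned by buildNoiseData:

```
name: "DCGAN_FASHIONGEN"
backend: "pytorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 64 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 64, 64 ]
  }
]
```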
To launch the server, we will use the freely available pre-built Docker containers from NGC.
Launch the Triton Server
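A sketch of a typical launch command; adjust the container tag and the host path to your model repository:

```bash
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models
```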
Now we have to build a client, which must cover three basic points:
- It should set up a connection with the Triton Inference Server.
- It should specify the names of the input and output layers of our model.
- It should send an inference request to the Triton Inference Server.
First download the dependencies (torchvision) inside our workspace:
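For example (the tritonclient install is needed only if the package is not already present in your environment):

```bash
pip install torchvision
pip install "tritonclient[http]"
```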
The client then needs to do the following, as shown in the combined sketch below:
- Set up a connection with the server.
- Specify the names of the input and output layers of the model.
- Send an inference request to the server.
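A minimal client sketch covering these three steps; the INPUT__0/OUTPUT__0 tensor names and the latent size of 64 mirror the example configuration above and are assumptions to adjust to your config.pbtxt:

```python
# client.py - query the deployed DCGAN model over HTTP.
import numpy as np
import torch
from torchvision.utils import save_image
import tritonclient.http as httpclient

# 1. Set up a connection with the Triton Inference Server.
client = httpclient.InferenceServerClient(url="localhost:8000")

# 2. Specify the names of the input and output layers of the model.
noise = np.random.randn(1, 64).astype(np.float32)  # latent size 64 is an assumption
inputs = httpclient.InferInput("INPUT__0", list(noise.shape), "FP32")
inputs.set_data_from_numpy(noise)
outputs = httpclient.InferRequestedOutput("OUTPUT__0")

# 3. Send an inference request to the Triton Inference Server.
response = client.infer(model_name="DCGAN_FASHIONGEN",
                        inputs=[inputs], outputs=[outputs])
images = response.as_numpy("OUTPUT__0")

# Save the generated 64x64 image(s) to disk for inspection.
save_image(torch.from_numpy(images), "generated.png", normalize=True)
print("Saved generated.png with shape", images.shape)
```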
The model output is a batch of generated 64x64 fashion-item images.
Multi-Model Inference with NVIDIA Triton
In this section, we will understand the workflow for deploying multiple models. For this, we will take a deeper look at the official docs in the triton-inference-server GitHub repository.
The agenda for this part is to deploy a pipeline that transcribes text from images. Unlike the previous section, we will break the pipeline apart, leveraging different backends for the models' input and output processing while deploying the core models on different framework backends.
Ensemble pipeline: We will divide the problem into two major components: the text detection pipeline and the text recognition pipeline. The Triton Inference Server allows us to deconstruct the model deployment pipeline and build an ensemble model, applying pre- and post-processing steps along with the exported models.
So let's get started with the multi-model deployment, and along the way we will learn some more cool features of Triton.
Downloading the Model
- Text Detection
- Text Recognition
Export to ONNX
- NGC TensorFlow container environment: docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/tensorflow:<yy.mm>-tf2-py3
- install tf2onnx: pip install -U tf2onnx
- Converting OpenCV's EAST model to ONNX:
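The conversion command looks roughly like the sketch below, assuming the frozen graph file is named frozen_east_text_detection.pb as in OpenCV's sample; the input and output tensor names follow OpenCV's EAST model (as used in the Triton conceptual guide), so verify them against your frozen graph:

```bash
python -m tf2onnx.convert \
  --input frozen_east_text_detection.pb \
  --inputs "input_images:0" \
  --outputs "feature_fusion/Conv_7/Conv2D:0,feature_fusion/concat_3:0" \
  --output detection.onnx
```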
Setting Up the Model's Repository
As we have already looked at Triton's way of reading our model, let's now focus on setting up a model repository for these multiple models.
This file structure can be set up in the following manner:
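A layout along these lines, with the model names matching the configuration files discussed next:

```
model_repository/
├── text_detection/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── text_recognition/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```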
Setting Up the Model Configurations
The Triton Inference Server GitHub repository provides the configuration files of text_detection and text_recognition.
Installing and Importing Dependencies
We will need the following dependencies for image processing and HTTP client.
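A sketch of the typical setup; the package names below are the standard ones for these imports, so install whichever are missing from your environment:

```bash
pip install opencv-python numpy "tritonclient[http]"
```

```python
import cv2                               # image loading and pre-processing
import numpy as np                       # array manipulation
import tritonclient.http as httpclient   # HTTP client for the Triton server
```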
Launching the Triton Server
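As before, a sketch of the launch command, now pointing the container at the multi-model repository created above:

```bash
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models
```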
Building a Client Application
As we discussed, there are three key points to building a client. The conceptual guide in the Triton repository defines a few helper functions that take care of the pre- and post-processing steps in our pipeline; you can check out its client.py file.
Let's recall three golden points for the client application:
- It should set up a connection with the Triton Inference Server.
- It should specify the names of the input and output layers of our model.
- It should send an inference request to the Triton Inference Server.
We will repeat the process for the text recognition model and finally post-process and print the results, as sketched in the example after this list:
- Process the responses from the detection model.
- Create an input object for the recognition model.
- Query the server.
- Process the response from the recognition model.
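A condensed sketch of these steps follows. The helper functions detection_preprocessing, detection_postprocessing, and recognition_postprocessing are assumed to come from the conceptual guide's client.py (the import path here is hypothetical), the detection tensor names follow OpenCV's EAST model, and the recognition tensor names are placeholders to replace with the values from text_recognition's config.pbtxt:

```python
import cv2
import numpy as np
import tritonclient.http as httpclient

# Assumed helpers from the conceptual guide's client.py (hypothetical import path).
from client_utils import (detection_preprocessing, detection_postprocessing,
                          recognition_postprocessing)

# Tensor names must match text_recognition's config.pbtxt; these are placeholders.
REC_INPUT_NAME = "recognition_input"
REC_OUTPUT_NAME = "recognition_output"

client = httpclient.InferenceServerClient(url="localhost:8000")

# --- Text detection ---
raw_image = cv2.imread("sample.jpg")
preprocessed = detection_preprocessing(raw_image)  # assumed to return a batched float32 array

det_input = httpclient.InferInput("input_images:0", list(preprocessed.shape), "FP32")
det_input.set_data_from_numpy(preprocessed, binary_data=True)
det_response = client.infer(model_name="text_detection", inputs=[det_input])

# Process the responses from the detection model: crop out the detected text regions.
scores = det_response.as_numpy("feature_fusion/Conv_7/Conv2D:0")
geometry = det_response.as_numpy("feature_fusion/concat_3:0")
cropped_images = detection_postprocessing(scores, geometry, preprocessed)

# --- Text recognition ---
rec_input = httpclient.InferInput(REC_INPUT_NAME, list(cropped_images.shape), "FP32")
rec_input.set_data_from_numpy(cropped_images.astype(np.float32), binary_data=True)

# Query the server and post-process the recognized text.
rec_response = client.infer(model_name="text_recognition", inputs=[rec_input])
print(recognition_postprocessing(rec_response.as_numpy(REC_OUTPUT_NAME)))
```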
Voilà! Now we understand how seamless multi-model deployment using the Triton Inference Server is. Do check out the Triton Inference Server GitHub repository to try out the text detection and recognition deployment yourself.