In the modern deep learning era, post-Transformer, there has been exponential growth in AI startups. These startups leverage cutting-edge research to build innovative products for both users and businesses. However, creating and deploying a scalable machine-learning-embedded product that meets end-to-end client requirements is challenging. Challenges include handling streaming data volumes, integrating new models, and achieving extremely low latency. Enter Triton Inference Server, open-source serving software that enables the deployment of deep learning models for inference in production environments.
In this blog post, we will explore how Triton is used to deploy a multi-model deep learning architecture. We'll also take a deep dive into the standout features of the Triton Inference Server.
Challenges with Multi-Model Real-Time Inference
Real-time inference of deployed deep learning models is a pretty complex task. Here, we will discuss a few challenges that need to be addressed for successful deployment in various applications.
Low Latency: Deployed models are held to strict latency budgets (in practice, often around 200 ms) while also being optimized for high throughput.
Scalability: Scaling the infrastructure to handle a large number of simultaneous requests requires load balancing and distributed computing so that varying loads can be served cost-effectively.
Model Complexity: Multi-model systems can be highly complex, as they need to understand and integrate information from different sources. Training and deploying these models for real-time inference can be resource-intensive.
Resource Management: Managing resources, such as CPU, GPU, memory, and storage, is crucial for real-time inference. Ensuring that resources are allocated efficiently to handle multi-model data can be complex.
Deployment and Maintenance: Deploying and maintaining multi-model real-time systems in the field can be complex, as they often require ongoing updates, maintenance, and monitoring.
In the upcoming section, we will look at the key functionalities and most useful features of the Triton Inference Server.
Understanding Triton Inference Server
Figure: Triton Server Architecture with highlighted features (source: Triton Architecture)
Let's have a look at the 7 most powerful features of the Triton Inference Server:
1. Computational Devices Plugin
In a deployment environment, there can be multiple associated models fulfilling a single task or an ensemble of tasks, and these models might need different types of compute backends (GPUs/CPUs). Triton supports hosting multiple models on different computational devices such as GPUs and CPUs.
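For example, where and how many instances of a model run is controlled through the instance_group setting in its config.pbtxt. A minimal sketch (the instance counts here are illustrative, not recommendations):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  },
  {
    count: 1
    kind: KIND_CPU
  }
]
```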
2. Monitoring Feature
When we host a server in production, it is really important that we plug in monitoring pipelines, whether for model monitoring or infrastructure monitoring. The Triton server has a built-in metrics system that exposes GPU utilization, throughput, and latency statistics in Prometheus format over an HTTP endpoint, so it can be seamlessly integrated with monitoring stacks such as Prometheus and Grafana, which scrape and visualize these metrics.
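A quick way to inspect these metrics is to scrape the endpoint directly. A minimal sketch, assuming Triton's default metrics port of 8002 and that the requests package is available:

```python
import requests

# Fetch the Prometheus-format metrics exposed by Triton over HTTP.
metrics = requests.get("http://localhost:8002/metrics", timeout=5).text

# Print the inference success counters as a quick sanity check.
for line in metrics.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```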
3. Multiple Framework Support
The Triton Inference Server provides flexibility to host models built upon different deep learning frameworks such as PyTorch, TensorFlow, ONNX, etc. This feature enables us to seamlessly develop task-specific models and deploy them to serve a client's request.
4. Scheduler
When we use multiple models, each with multiple versions and each receiving a high volume of inference requests, we need an effective orchestrator and scheduler. This functionality is built into the Triton Inference Server.
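One commonly used scheduling feature is the dynamic batcher, which groups individual requests into larger batches on the server side. A minimal sketch of enabling it in a model's config.pbtxt (the batch sizes and queue delay are illustrative values, not recommendations):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```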
5. HTTP and gRPC
Triton exposes both HTTP/REST and gRPC endpoints, each on its own port (by default, 8000 for HTTP and 8001 for gRPC), so clients can integrate over whichever protocol suits them.
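For instance, the Python client libraries can target either protocol. A minimal sketch assuming the default ports and a locally running server:

```python
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient

# HTTP/REST client (default port 8000)
http_client = httpclient.InferenceServerClient(url="localhost:8000")

# gRPC client (default port 8001)
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")

print(http_client.is_server_live(), grpc_client.is_server_live())
```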
6. Client Application
Triton provides client SDKs in several languages, including Python, C++, and Java, which can be used to create client applications that interact with the inference server.
7. Model Analyzer
Triton's Model Analyzer offers a range of optimization features. We can think of it as a grid-search tool that explores various optimization options and reports the most optimized combination of model configurations.
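As an illustration, a typical invocation might look like the following sketch; the repository path and model name are placeholders to adapt to your setup:

```bash
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models <model_name>
```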
Getting Started
Docker and Triton Installations
You can check the installation of the server as:
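A minimal sketch, assuming Docker is already installed; replace <yy.mm> with an NGC release tag, and run the health check once the server is up:

```bash
# Pull the Triton server container from NGC.
docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3

# With the server running, the readiness endpoint should return HTTP 200.
curl -v localhost:8000/v2/health/ready
```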
Setting Up the Project Repository
Triton expects the models, configurations, version files, etc. in a specific file structure format. These can be stored in a local file system or a cloud object store like E2E Solutions; check out the object storage options on their website.
Triton for Gen AI Inference
In this section, we will deploy and run inference on a generative AI model, DCGAN trained on the FashionGen dataset, from PyTorch Hub, and understand how to use Triton for single-model inference. The official Triton Inference Server GitHub repository contains Quick_Deploy examples for different frameworks such as ONNX, TensorFlow, vLLM, and so on.
Figure: Inference pipeline for deployment and generation with DCGAN.
In this section, we will use the single-pipeline approach for client-model interaction: we will deploy the pipeline without explicitly breaking the model apart from the rest of the pipeline.
Setting Up the Project Repository
As discussed earlier, we will have to set up our model repository in a format that Triton can read, as shown below:
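A layout along these lines is what Triton expects, with the folder names matching the model name and version used in this post:

```
model_repository/
└── DCGAN_FASHIONGEN/
    ├── config.pbtxt
    └── 1/
        └── model.pt
```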
After creating the file structure, save the ‘model.pt’ file in the version folder ‘1’ and the ‘config.pbtxt’ file in the DCGAN_FASHIONGEN folder; the numbered subfolder is what indicates the model version to Triton.
Now that our file structure is ready, we will turn to PyTorch Hub to save our generative AI model, DCGAN.
Preparing the TorchScript Model
We will prepare a DCGAN.py file to export the model and save a TorchScript trace of the generator, which produces 64x64 images.
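A minimal sketch of DCGAN.py, assuming the facebookresearch/pytorch_GAN_zoo entry point on PyTorch Hub; the getNetG() accessor is an assumption and may instead be exposed as model.netG in some versions of the library:

```python
# DCGAN.py - load the pretrained FashionGen DCGAN from PyTorch Hub,
# trace its generator with TorchScript, and save it for Triton.
import torch

use_gpu = torch.cuda.is_available()
model = torch.hub.load('facebookresearch/pytorch_GAN_zoo:hub', 'DCGAN',
                       pretrained=True, useGPU=use_gpu)

# Build a batch of latent noise vectors; the wrapper returns (noise, labels).
noise, _ = model.buildNoiseData(1)

# Trace only the generator so Triton can serve it as a TorchScript model.
# getNetG() is assumed here; depending on the pytorch_GAN_zoo version the
# generator may instead be available as model.netG.
generator = model.getNetG()
traced = torch.jit.trace(generator, noise)
traced.save('model.pt')  # the FashionGen DCGAN produces 64x64 images
```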
Now, as we have the model and repository structure ready, let's look at the config.pbtxt model configuration file. At a minimum, the configuration file must specify the platform (or backend) property, the max_batch_size property, and the input and output tensors of the model; an example follows the field descriptions below.
Setting Up the Model Configuration
- Name: ‘name’ is an optional field, the value of which should match the name of the directory of the model.
- Backend: This field indicates which backend is being used to run the model. Triton supports a wide variety of backends like TensorFlow, PyTorch, Python, ONNX and more.
- max_batch_size: As the name implies, this field defines the maximum batch size that the model can support.
- Input and output: The input and output sections specify the name, shape, datatype, and more, while providing operations like reshaping.
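Putting these fields together, a minimal config.pbtxt for the traced DCGAN generator might look like the sketch below. The INPUT__0/OUTPUT__0 names follow the PyTorch backend's naming convention for traced models, and the latent size of 64 is an assumption that should match the noise shape returned by buildNoiseData:

```
name: "DCGAN_FASHIONGEN"
backend: "pytorch"
max_batch_size: 8
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 64 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, 64, 64 ]
  }
]
```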
To launch the server, we will use the freely available pre-built Docker containers from NGC.
Launch the Triton Server
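A sketch of a typical launch command; adjust the container tag and the host path to your model repository:

```bash
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models
```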
Now we have to build a client, which must cover three basic points:
- It should set up a connection with the Triton Inference Server.
- It should specify the names of the input and output layers of our model.
- It should send an inference request to the Triton Inference Server.
First download the dependencies (torchvision) inside our workspace:
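For example (the tritonclient install is needed only if the package is not already present in your environment):

```bash
pip install torchvision
pip install "tritonclient[http]"
```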
The client then needs to do the following, as shown in the combined sketch below:
- Set up a connection with the server.
- Specify the names of the input and output layers of the model.
- Send an inference request to the server.
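A minimal client sketch covering these three steps; the INPUT__0/OUTPUT__0 tensor names and the latent size of 64 mirror the example configuration above and are assumptions to adjust to your config.pbtxt:

```python
# client.py - query the deployed DCGAN model over HTTP.
import numpy as np
import torch
from torchvision.utils import save_image
import tritonclient.http as httpclient

# 1. Set up a connection with the Triton Inference Server.
client = httpclient.InferenceServerClient(url="localhost:8000")

# 2. Specify the names of the input and output layers of the model.
noise = np.random.randn(1, 64).astype(np.float32)  # latent size 64 is an assumption
inputs = httpclient.InferInput("INPUT__0", list(noise.shape), "FP32")
inputs.set_data_from_numpy(noise)
outputs = httpclient.InferRequestedOutput("OUTPUT__0")

# 3. Send an inference request to the Triton Inference Server.
response = client.infer(model_name="DCGAN_FASHIONGEN",
                        inputs=[inputs], outputs=[outputs])
images = response.as_numpy("OUTPUT__0")

# Save the generated 64x64 image(s) to disk for inspection.
save_image(torch.from_numpy(images), "generated.png", normalize=True)
print("Saved generated.png with shape", images.shape)
```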
The model output is a batch of generated 64x64 fashion-item images.
Multi-Model Inference with NVIDIA Triton
In this section, we will understand the workflow for deploying multiple models. For this, we will take a deeper look at the official docs in the triton-inference-server GitHub repository.
The agenda for this part is to deploy a pipeline that transcribes text from images. Unlike the previous section, we will break the pipeline apart, leveraging different backends for the models' input and output processing while deploying the core models on different framework backends.
Ensemble pipeline: We will divide the problem into two major components: the text detection pipeline and the text recognition pipeline. The Triton Inference Server allows us to deconstruct the model deployment pipeline and build an ensemble model, applying pre- and post-processing steps along with the exported models.
So let's get started with the multi-model deployment, and along the way we will learn some more cool features of Triton.
Downloading the Model
- Text Detection
- Text Recognition
Export to ONNX
- NGC TensorFlow container environment: docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/tensorflow:<yy.mm>-tf2-py3
- install tf2onnx: pip install -U tf2onnx
- Converting OpenCV's EAST model to ONNX:
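The conversion command looks roughly like the sketch below, assuming the frozen graph file is named frozen_east_text_detection.pb as in OpenCV's sample; the input and output tensor names follow OpenCV's EAST model (as used in the Triton conceptual guide), so verify them against your frozen graph:

```bash
python -m tf2onnx.convert \
  --input frozen_east_text_detection.pb \
  --inputs "input_images:0" \
  --outputs "feature_fusion/Conv_7/Conv2D:0,feature_fusion/concat_3:0" \
  --output detection.onnx
```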
Setting Up the Model's Repository
As we have already looked at Triton's way of reading our model, let's now focus on setting up a model repository for these multiple models.
This file structure can be set up in the following manner:
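A layout along these lines, with the model names matching the configuration files discussed next:

```
model_repository/
├── text_detection/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── text_recognition/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```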
Setting Up the Model Configurations
The Triton Inference Server GitHub repository provides the configuration files of text_detection and text_recognition.
Installing and Importing Dependencies
We will need the following dependencies for image processing and HTTP client.
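A sketch of the typical setup; the package names below are the standard ones for these imports, so install whichever are missing from your environment:

```bash
pip install opencv-python numpy "tritonclient[http]"
```

```python
import cv2                               # image loading and pre-processing
import numpy as np                       # array manipulation
import tritonclient.http as httpclient   # HTTP client for the Triton server
```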
Launching the Triton Server
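As before, a sketch of the launch command, now pointing the container at the multi-model repository created above:

```bash
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<yy.mm>-py3 \
  tritonserver --model-repository=/models
```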
Building a Client Application
As we discussed, there are three key points to building a client. The conceptual guide in the Triton repository defines a few helper functions that take care of the pre- and post-processing steps in our pipeline; you can check out its client.py file.
Let's recall three golden points for the client application:
- It should set up a connection with the Triton Inference Server.
- It should specify the names of the input and output layers of our model.
- It should send an inference request to the Triton Inference Server.
We will repeat the process for the text recognition model and finally post-process and print the results, as sketched in the example after this list:
- Process the responses from the detection model.
- Create an input object for the recognition model.
- Query the server.
- Process the response from the recognition model.
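A condensed sketch of these steps follows. The helper functions detection_preprocessing, detection_postprocessing, and recognition_postprocessing are assumed to come from the conceptual guide's client.py (the import path here is hypothetical), the detection tensor names follow OpenCV's EAST model, and the recognition tensor names are placeholders to replace with the values from text_recognition's config.pbtxt:

```python
import cv2
import numpy as np
import tritonclient.http as httpclient

# Assumed helpers from the conceptual guide's client.py (hypothetical import path).
from client_utils import (detection_preprocessing, detection_postprocessing,
                          recognition_postprocessing)

# Tensor names must match text_recognition's config.pbtxt; these are placeholders.
REC_INPUT_NAME = "recognition_input"
REC_OUTPUT_NAME = "recognition_output"

client = httpclient.InferenceServerClient(url="localhost:8000")

# --- Text detection ---
raw_image = cv2.imread("sample.jpg")
preprocessed = detection_preprocessing(raw_image)  # assumed to return a batched float32 array

det_input = httpclient.InferInput("input_images:0", list(preprocessed.shape), "FP32")
det_input.set_data_from_numpy(preprocessed, binary_data=True)
det_response = client.infer(model_name="text_detection", inputs=[det_input])

# Process the responses from the detection model: crop out the detected text regions.
scores = det_response.as_numpy("feature_fusion/Conv_7/Conv2D:0")
geometry = det_response.as_numpy("feature_fusion/concat_3:0")
cropped_images = detection_postprocessing(scores, geometry, preprocessed)

# --- Text recognition ---
rec_input = httpclient.InferInput(REC_INPUT_NAME, list(cropped_images.shape), "FP32")
rec_input.set_data_from_numpy(cropped_images.astype(np.float32), binary_data=True)

# Query the server and post-process the recognized text.
rec_response = client.infer(model_name="text_recognition", inputs=[rec_input])
print(recognition_postprocessing(rec_response.as_numpy(REC_OUTPUT_NAME)))
```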
Voilà! Now we understand how seamless multi-model deployment using the Triton Inference Server is. Do check out the Triton Inference Server GitHub repository to try out the text detection and recognition deployment yourself.