Object Detection Using Triton Inference Server on E2E Cloud

In deep learning, we tend to focus on data preprocessing and model training. In a production machine learning workflow, however, model deployment and inference are equally important. The quality and business impact of AI depend largely on the inference infrastructure. Take a shopping website as an example. Its recommendation engines receive thousands of requests per second. The company delivers the customer experience it wants only when its ML systems are robust and can respond to each request quickly and efficiently. Such companies therefore invest in state-of-the-art infrastructure for deploying and running their AI systems. The same applies to any business seeking to accelerate its work with AI.

Inference is the step that turns a trained AI model into actionable insights from raw data. Depending on the application, it runs in real time or offline. Recommendation and fraud detection systems must respond immediately, so they use real-time inference. Systems such as predictive maintenance do not need immediate output but must process large volumes of data with maximum throughput, so they use offline (batch) inference. Inference environments also vary from model to model, depending on the framework used and the deployment device. Hence, a single generic inference setup cannot be assumed to work for all models, frameworks and devices.

NVIDIA Triton Inference Server

The problems discussed above are largely solved by a scalable AI inference platform that handles many frameworks and model types. Triton Inference Server is one such solution: fast, scalable, open-source inference serving software. It runs inference in both CPU and GPU environments. It supports a wide range of frameworks, including TensorRT, PyTorch, TensorFlow, ONNX and Python, and integrates with Kubernetes and MLOps platforms. It maximizes hardware utilization through ensemble models and concurrent execution.

Triton Server Architecture (Source: Nvidia Blog)

Additionally, it supports dynamic batching: the Triton server groups incoming client requests on the fly and processes them together in larger batches. This increases model throughput, typically at the cost of only a small queueing delay per request. Triton thus makes standardized inference possible on cloud, edge devices or other platforms, with any supported framework.
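To make the idea of batching concrete, here is a toy sketch in plain Python. It is not Triton's scheduler (which also takes queueing delay and preferred batch sizes into account); the function name group_into_batches and the max_batch_size parameter are hypothetical and exist only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_into_batches(pending: Sequence[T], max_batch_size: int) -> List[List[T]]:
    """Group queued requests into batches of at most max_batch_size, keeping arrival order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending[i:i + max_batch_size])
        for i in range(0, len(pending), max_batch_size)
    ]


# Ten queued requests (represented by their ids) with a maximum batch size of 4
batches = group_into_batches(list(range(10)), max_batch_size=4)
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Instead of ten separate model executions, the server in this toy example would run three, which is where the throughput gain comes from.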

In this tutorial, we will explore NVIDIA Triton Inference Server on the E2E Cloud platform. E2E Cloud has native integration with Triton for creating and deploying models. Here we will deploy a YoloV5 model on the Triton Inference Server and run inference against it.

Prerequisites

For this tutorial, you need an account on E2E Cloud with access to the TIR AI platform. Make sure git is installed on your system. We also need a YoloV5 model to deploy, which you can find in the official repository. Clone the repository:

!git clone https://github.com/ultralytics/yolov5

NVIDIA Triton supports model formats such as ONNX, so we must first convert the model to that format. Let’s export a pre-trained YoloV5 model to ONNX (the command below also writes a TorchScript copy).

%cd yolov5
!pip install -r requirements.txt
!python export.py --weights yolov5s.pt --include torchscript onnx
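Triton loads models from a model repository with a fixed directory layout: one folder per model, a numbered version folder inside it, and the model file itself, which for the ONNX backend is named model.onnx by default. The sketch below arranges the exported file in that layout before upload. Whether TIR expects exactly this structure, and whether it also needs a config.pbtxt next to the version folder, are assumptions to check against the TIR documentation; the helper name build_model_repository and the folder name model_repository are hypothetical.

```python
import shutil
from pathlib import Path


def build_model_repository(onnx_file: str, repo_dir: str,
                           model_name: str = "yolo-v5", version: int = 1) -> Path:
    """Copy an exported ONNX file to <repo_dir>/<model_name>/<version>/model.onnx."""
    version_dir = Path(repo_dir) / model_name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    target = version_dir / "model.onnx"
    shutil.copyfile(onnx_file, target)
    return target


# Example call (assumes export.py wrote yolov5s.onnx to the current directory):
# build_model_repository("yolov5s.onnx", "model_repository")
```

The resulting folder is what you would later copy to the bucket as a whole.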

Creating the Model

Before deploying, we have to create the model along with its storage. Models in TIR are containers for sharing and using model weights. On the backend, model weights and other files are stored in E2E Object Storage (EOS) buckets. Navigate to the model storage tab in the inference section of the TIR platform page and click the Create Model button. Under model type, select Triton. After creating the model, you will be able to see the model credentials.


We have created a model with a Triton backend and an object storage bucket. The next step is to upload the weights. You can access E2E Object Storage (EOS) using any S3-compatible CLI or SDK. We recommend the MinIO CLI (mc) and use it here.

To set up the MinIO CLI, run the following commands. You can also refer to the documentation for the steps needed to run them.

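# Set up the host alias (replace the placeholders with the credentials shown for your model)
mc config host add yolo-v5 https://objectstore.e2enetworks.net <access-key> <secret-key>

# Copy the model to the bucket
mc cp -r $FOLDER_NAME yolo-v5/yolo-v5-36cea6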

Here, FOLDER_NAME is the path to the ONNX file, and yolo-v5/yolo-v5-36cea6 identifies the model and the storage bucket created for it. Change the names to match your own model. Once the copy finishes, the model is available in the bucket.

Deploying the Endpoint

Before moving to endpoints, create an authorization token in the API Tokens section. To create a model endpoint for our object detection model, go to the Model Endpoints section and click the Create Endpoint button. Select the GPU configuration required for your model and create an inference endpoint.


You have successfully deployed the YoloV5 model. Now, let’s see how we can get inference results.

Model Inference

We need to install the Python client libraries for the Triton server. Make sure you install the version that matches the Triton backend used to create the model in the cloud. Here I am using version 2.31.0.

!pip install tritonclient[all]==2.31.0

Create a Triton client. This client can be used to send requests to the server and receive responses. The tritonclient.http module is part of the Triton Inference Server client library, which provides a Python API for interacting with Triton servers.

from tritonclient.http import InferenceServerClient
client = InferenceServerClient(url="infer.e2enetworks.net/project//endpoint//")

If you run into SSL issues, try the following snippet instead.

import gevent.ssl as ssl
from tritonclient.http import InferenceServerClient
client = InferenceServerClient(url="infer.e2enetworks.net/project//endpoint//",  ssl=True, ssl_context_factory=ssl.create_default_context)
client.is_server_ready(headers={'Authorization': 'Bearer '})

Replace the url with the endpoint URL from your model endpoints page.

Let’s test the endpoint using a sample image. Preprocess the image and convert it to a format expected by the YOLO model. For YOLOv5, the input size is typically 640x640. You might need to adjust this based on your model configuration.

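import cv2
import numpy as np

# Load the image (OpenCV returns it as BGR, in height x width x channels order)
image = cv2.imread('infer.jpg')

# Resize to the model input size and convert BGR to RGB
resized_image = cv2.resize(image, (640, 640))
rgb_image = cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB)

# Normalize the image. YOLOv5 expects the pixel values to be in the range [0, 1].
normalized_image = rgb_image / 255.0

# Reorder to channels-first and add a batch dimension: (1, 3, 640, 640)
input_data = np.expand_dims(normalized_image.transpose(2, 0, 1), axis=0)

# Convert to float32
input_data = input_data.astype(np.float32)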
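A plain resize to 640x640 stretches images that are not square. YOLOv5's own pipeline instead letterboxes the image: it scales it to fit while preserving the aspect ratio and pads the remainder. The sketch below only computes the scale and padding for such a resize and maps a point from model coordinates back to the original image. It is a simplified illustration, not YOLOv5's exact letterbox routine, and the function names letterbox_params and to_original are hypothetical.

```python
from typing import Tuple


def letterbox_params(width: int, height: int,
                     target: int = 640) -> Tuple[float, int, int, int, int]:
    """Return (scale, new_width, new_height, pad_left, pad_top) for an aspect-preserving resize."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, new_w, new_h, pad_left, pad_top


def to_original(x: float, y: float, scale: float,
                pad_left: int, pad_top: int) -> Tuple[float, float]:
    """Map a point from the padded model input back to original image coordinates."""
    return (x - pad_left) / scale, (y - pad_top) / scale


# A 1280x720 frame is scaled by 0.5 to 640x360 and padded by 140 pixels at the top
scale, new_w, new_h, pad_left, pad_top = letterbox_params(1280, 720)
print(scale, new_w, new_h, pad_left, pad_top)  # 0.5 640 360 0 140
```

The same scale and padding are needed later to draw the predicted boxes on the original image.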

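Now set the model name and create an InferInput object. 'input__0' is the name of the input for the YOLOv5 model; it must match the input name in your model's configuration.

from tritonclient.http import InferInput, InferRequestedOutput

model_name = "yolo-v5"
input = InferInput('input__0', input_data.shape, 'FP32')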

Set the input data for the InferInput object.

input.set_data_from_numpy(input_data)

Now, create an InferRequestedOutput object for each output of the model. 'output__0' is the name of the output for the YOLOv5 model.

output = InferRequestedOutput('output__0')

Finally, send the image to the server as an HTTP request and get the response from the model.

results = client.infer(
    model_name,
    inputs=[input],
    outputs=[output]
)
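The response holds the raw model output, which tritonclient exposes as a NumPy array through results.as_numpy('output__0'). For a standard YOLOv5 export, each row of that array is assumed to hold a box center, width and height, an objectness score and one score per class; verify this against your own model. The sketch below, in plain Python, filters rows by confidence and applies per-class non-maximum suppression. The function names and the thresholds (0.25 and 0.45, common defaults) are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2
Detection = Tuple[Box, float, int]       # box, score, class id


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decode(rows: Sequence[Sequence[float]], conf_threshold: float = 0.25,
           iou_threshold: float = 0.45) -> List[Detection]:
    """Turn rows of [cx, cy, w, h, objectness, class scores...] into final detections."""
    candidates: List[Detection] = []
    for row in rows:
        cx, cy, w, h, objectness = row[:5]
        class_scores = list(row[5:])
        class_id = max(range(len(class_scores)), key=lambda i: class_scores[i])
        score = objectness * class_scores[class_id]
        if score >= conf_threshold:
            box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
            candidates.append((box, score, class_id))
    # Greedy NMS: keep the best box, drop same-class boxes that overlap it too much
    candidates.sort(key=lambda det: det[1], reverse=True)
    kept: List[Detection] = []
    for det in candidates:
        if all(det[2] != k[2] or iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Tiny hand-made example with two classes
rows = [
    [100, 100, 50, 50, 0.9, 0.8, 0.1],  # class 0
    [102, 101, 50, 50, 0.8, 0.7, 0.2],  # near-duplicate of the first box
    [300, 300, 40, 60, 0.9, 0.1, 0.9],  # class 1
    [10, 10, 5, 5, 0.1, 0.5, 0.5],      # below the confidence threshold
]
detections = decode(rows)
print([class_id for _, _, class_id in detections])  # [1, 0]
```

With a real response you would pass the rows of results.as_numpy('output__0')[0] to decode, assuming the output name used above.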

Wrapping Up

You have successfully deployed an object detection model on the NVIDIA Triton server using E2E Cloud. We encourage you to experiment with other models too. For your own deployments, consider using Docker, a popular platform that packages applications in containers, which makes it easier to distribute and deploy applications, including AI models. We hope you enjoyed this tutorial and found it useful for your projects. Thank you for reading and happy coding!
