NVIDIA RTX Server Sizing Guide to run Maya Software with Arnold renderer

September 28, 2020

This specification provides insights on how to deploy NVIDIA® Quadro® Virtual Data Center Workstation (Quadro vDWS) software for modern-day production pipelines within the Media and Entertainment industry. Recommendations are based on actual customer deployments and proof-of-concept (POC) artistic 3D production pipeline workflows and cover three common questions:

Which NVIDIA GPU should I use for a 3D Production pipeline?

How do I select the right profile(s) for the types of users I will have?

Using sample 3D production pipeline workflows, how many users can be supported (user
density) for this server configuration and workflow?

NVIDIA RTX™ Server offers a highly flexible reference design which combines NVIDIA Quadro RTX™ 8000 graphics processing units (GPUs) with NVIDIA virtual GPU software running on OEM server hardware. NVIDIA RTX Server can be configured to accelerate multiple workloads within the data center. IT administrators can provision multiple, easy-to-manage virtual workstations to tackle various artistic workloads. Since user behavior varies and is a critical factor in determining the best GPU and profile size, the recommendations in this reference architecture are meant to be a guide. The most successful customer deployments start with a Proof of Concept (POC) and are “tuned” throughout the lifecycle of the deployment. Beginning with a POC enables customers to understand the expectations and behavior of their users and optimize their deployment for the best user density while maintaining required performance levels. A POC also allows administrators to understand infrastructure conditions, such as network, which is a key component to ensure performance within their specific environment. Continued maintenance is important because user behavior can change over the course of a project, as an individual's role in the organization changes, and as displays are upgraded during refresh cycles. A 3D production artist who was once a light graphics user might become a heavy graphics user when they change teams, are assigned to a different project, or even receive a display upgrade to a higher resolution monitor. NVIDIA virtual GPU management and monitoring tools enable administrators and IT staff to ensure their deployment is optimized for each user.

About Autodesk Maya 2020 and Arnold

Autodesk Maya 2020 is one of the most recognizable applications for 3D computer animation, modelling, simulation, and rendering utilized to create expansive worlds, complex characters, and dazzling effects. Creative professionals bring believable characters to life with engaging
animation tools, shape 3D objects and scenes with intuitive modelling tools and create realistic effects - from explosions to cloth simulation all within the Maya software.

Autodesk Arnold is the built-in interactive renderer for Maya and is an advanced Monte Carlo ray tracing renderer. It is designed for artists and for the demands of modern animation and visual effects (VFX) production.

It is available as a standalone renderer on Linux, Windows, and Mac OS, with supported plug-ins for Maya, 3dsMax, Houdini, Cinema 4D, and Katana.

Autodesk works closely with NVIDIA to ensure that creative innovation is never over. Studio drivers are released throughout the year to supercharge your favourite, most demanding applications. Using the same NVIDIA Studio drivers that are deployed on non-virtualized systems, NVIDIA Quadro vDWS software provides virtual machines (VMs) with the same
breakthrough performance and versatility that the NVIDIA RTX platform offers to a physical environment. VDI eliminates the need to install Autodesk Arnold and Maya on a local client, which can help reduce IT support and maintenance costs and enables greater mobility and
collaboration. This virtual workstation deployment option enhances flexibility and further expands the wide variety of platform choices available to Autodesk customers.

About NVIDIA RTX Servers

NVIDIA RTX Server is a reference design comprising the following components:
Qualified server
NVIDIA Quadro RTX 8000 graphics cards
NVIDIA Quadro vDWS GPU virtualization software
Autodesk Maya 2020 design software - to be installed by the client
Autodesk Arnold 6 rendering software - to be installed by the client
Teradici Cloud access software - to be installed by the client
When combined, this validated NVIDIA RTX Server solution provides unprecedented rendering and compute performance at a fraction of the cost, space, and power consumption of traditional CPU-based render nodes, as well as high-performance virtual workstations enabling designers and artists to arrive at their best work, faster.


NVIDIA RTX Server is a validated reference design for multiple workloads that are accelerated by Quadro RTX 8000 GPUs. When deployed for high performance virtual workstations, the NVIDIA RTX Server solution delivers a native physical workstation experience from the data center, enabling creative professionals to do their best work from anywhere,
using any device. NVIDIA RTX Server can also bring GPU-acceleration and performance to deliver the most efficient end-to-end rendering solution, from interactive sessions in the desktop to final batch rendering in the data center. Content production is undergoing massive growth as render complexity and quality demands increase. Designers and artists across
industries continually strive to produce more visually rich content faster than ever before, yet find their creativity and productivity bound by inefficient CPU-based render solutions. NVIDIA RTX Server delivers the performance that all artists need, by allowing them to take advantage
of key GPU enhancements to increase interactivity and visual quality, while centralizing GPU resources.


The NVIDIA Quadro RTX 8000, powered by the NVIDIA Turing™ architecture and the NVIDIA RTX platform, brings the most significant advancement in computer graphics in over a decade to professional workflows. Designers and artists can now wield the power of hardware-accelerated ray tracing, deep learning, and advanced shading to dramatically boost productivity and create amazing content faster than ever before. The Quadro RTX 8000 has 48 GB of GPU memory to handle larger animations or visualizations. The artistic workflows covered within our testing for this reference architecture used Quadro RTX 6000 GPUs.

NVIDIA Quadro Virtual Data Center Workstation Software

NVIDIA virtual GPU (vGPU) software enables the delivery of graphics-rich virtual desktops and workstations accelerated by NVIDIA GPUs. There are three versions of NVIDIA vGPU software available, one being NVIDIA Quadro Virtual Data Center Workstation (Quadro vDWS). NVIDIA
Quadro vDWS software includes the Quadro graphics driver required to run professional 3D applications. The Quadro vDWS license enables sharing an NVIDIA GPU across multiple virtual machines, or multiple GPUs can be allocated to a single virtual machine to power the most demanding workflows.

NVIDIA Quadro is the world’s preeminent visual computing platform, trusted by millions of creative and technical professionals to accelerate their workflows. With Quadro vDWS software, you can deliver the most powerful virtual workstation from the data center. Designers and artists can work more efficiently, leveraging high performance virtual
workstations that perform just like physical workstations. IT has the flexibility to provision render nodes and virtual workstations, scaling resources up or down as needed. An NVIDIA RTX Server solution can be configured to deliver multiple virtual workstations customized for
specific tasks. This means that utilization of compute resources can be optimized, and virtual machines can be adjusted to handle workflows that may demand more or less memory.

To deploy an NVIDIA vGPU solution for Autodesk Maya 2020 with Arnold, you will need an NVIDIA GPU that is supported with Quadro vDWS software, licensed for each concurrent user.

Teradici Cloud Access Software

Teradici is the creator of the industry-leading PCoIP remoting protocol technology and Cloud Access software. Teradici Cloud Access software enables enterprises to securely deliver high performance graphics-intensive applications and workstations from private data centers, public clouds or hybrid environments with crisp text clarity, true color accuracy and lossless
image quality to any endpoint, anywhere. Teradici PCoIP Ultra with NVIDIA RTX Server can deliver virtual machines to multiple artists that are indistinguishable from physical workstations. Artists can enjoy workspaces set up on the latest hardware, and work with confidence in high fidelity with steady frame rates.

Autodesk Maya and Arnold PoC Testing

To determine the optimal configuration of Quadro vDWS for Autodesk Maya and Arnold, both user performance and scalability were considered. For comparative purposes, we considered the requirements for a configuration optimized for performance only, and this configuration is
based solely on performance using sample artistic workflows. The scenes used within our POC testing focused on a VFX pipeline where a single shot is the result of several artist specialists working on different pieces. The following illustration shows the entire 3D production pipeline and illustrates the areas where our POC testing focused.

Our testing focused on a few of the phases illustrated in the above figure. We executed three GPU-accelerated artistic workflows within four VMs:
VM1 and VM2 - Modeling, Texturing and Shading
VM3 - Animation
VM4 - Lighting and Rendering
The goal of this testing was to show how four artists from three unique parts of the pipeline can all work at the same time using shared, virtualized server resources and be productive. The following paragraphs go into further detail on each of these workflows.

VM1 and VM2 - Modeling, Texturing and Shading

For artists to model effectively, they need fast interaction with their models to see different views, quick material changes, and realistic rendering. This workflow takes advantage of the NVIDIA RT Cores in the NVIDIA RTX Server to accelerate the rendering process, and artists can view their noiseless assets by leveraging NVIDIA OptiX™ AI Denoising. The GPU memory needed to support this artist would be considered small to medium, so a single VM was assigned half of the Quadro RTX 6000 GPU, which equates to a 12Q vGPU profile. Two VMs can share the same GPU on a server. The following screenshot illustrates the artist’s work.

In order to bring characters to life in film, they need to go through a “Look Development” process. In the example illustrated in Figure 4-2, Autodesk’s Arnold GPU Renderer utilizes NVIDIA RTX compatible features for performant ray tracing. Look Development involves the following:
Refining textures and materials, which often results in a time-consuming, back-and-forth process.
Real-time updates with NVIDIA RTX Server allow for artistic interaction to accurately dial in the look of the character, in context with the scene.
NVIDIA RTX AI, employing NVIDIA OptiX Denoiser, provides high-fidelity changes in real time.
Artists can define and deliver higher quality content in a more intuitive workflow providing an overall increase in production value.
Having a full color range without compression is important to make accurate changes with confidence. Teradici PCoIP Ultra, which takes advantage of NVIDIA RTX GPU encoding, ensures that the virtual machines look indistinguishable from a local display.

VM3 - Animation

For artists to animate effectively, they need smooth playback with no pauses or stutters as they make pose changes. Since this artist uses the Maya 2020 GPU animation cache, the GPU memory needed to support this artist would be considered large. Therefore, a single VM was
assigned an entire Quadro RTX 6000 GPU, which equates to a 24Q vGPU profile. The following screenshot illustrates the artist’s work.

Animation production can place extreme demands on compute hardware. Traditional workflows involve artists outputting time-consuming preview videos. Since Autodesk Maya 2019, real time animation playback and preview is now possible. Furthermore, with Viewport 2.0 enhancements, real-time rendering features are also available. In this scene, we are using
the GPU to cache animation, and preview ambient occlusion, shadows, lights and reflections, all in real-time in the viewport. Maya Viewport 2.0 leverages GPU memory to deliver high quality materials, lights, screen space ambient occlusion and more - at interactive speed. Starting in Maya 2019, you can use your GPU to cache animation calculations to memory in a
fraction of the time of a CPU cache. With this feature, you can play back your animations in real time, and continue to tweak and update your shots without having to playblast the timeline. By leveraging NVIDIA RTX GPU encoding with PCoIP Ultra, this VM is able to deliver interactive, real-time animation playback without dropping any frames, which is important to animators who are constantly reviewing their changes. Every frame counts.

VM4 - Lighting and Rendering

Artists who work with lighting and rendering need fast resolution of the full image so they can see the impact of their lighting and camera changes. Since this artist is the user who most intensely uses the NVIDIA RT Cores in the NVIDIA RTX Server (for accelerating the rendering process), the GPU memory needed to support this artist is the largest of all and may
even need acceleration from multiple GPUs. NVIDIA vGPU technology provides administrators the ability to assign up to four shared GPUs to a single VM. The following screenshot
illustrates the artist’s work.

Lighting and rendering are resource intensive processes that are responsible for the final output of a scene. NVIDIA RTX Server enables artists to work and adjust scenes while utilizing leftover GPU resources to render. This provides for an incredibly efficient use of GPU resources, furthering the production pipeline workflow.

Evaluating vGPU Frame Buffer

The GPU Profiler is a tool which can be installed within each of the VMs and used for evaluating GPU and CPU utilization rates while executing the aforementioned artistic workflows. The vGPU frame buffer is allocated out of the physical GPU frame buffer at the time the vGPU is assigned to the VM, and the NVIDIA vGPU retains exclusive use of that frame buffer. All vGPUs resident on a physical GPU share access to the GPU's engines, including the graphics (3D), video decode, and video encode engines. Since user behavior varies and is a critical factor in determining the best GPU and profile size, it is highly recommended to profile your own data and workflows during your PoC to properly size your environment for optimal performance.


Our testing showed that four artists from three unique parts of the pipeline can all effectively do their 3D production work using VMs. To determine the optimal configuration of Quadro vDWS to support these four artists, both user performance and scalability were considered. To further support this conclusion, NVIDIA collected insights from Media and Entertainment
customers as well, to understand how animation studio customers are deploying Quadro vDWS. A dual socket, 2U rack server configured with three Quadro RTX 6000 GPUs provided the necessary resources so that 3D production artists could work more efficiently, leveraging
high-performance virtual workstations which perform just like physical workstations. When sizing a Quadro vDWS deployment for Autodesk Maya and Arnold, NVIDIA recommends conducting your own PoC to fully analyze resource utilization using objective measurements and subjective feedback. It is highly recommended that you install the GPU Profiler within your
artist VMs to properly size your VMs.
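
The GPU count above follows from simple frame buffer arithmetic. The sketch below is illustrative only: the function name `gpus_needed` is hypothetical, the 24Q profile for VM4 is an assumption (the guide states profiles only for VM1 to VM3), and it assumes that all vGPUs placed on one physical GPU use the same profile size.

```python
from collections import Counter
from math import ceil

GPU_FRAME_BUFFER_GB = 24  # Quadro RTX 6000, consistent with the 12Q/24Q profiles above


def gpus_needed(profile_sizes_gb, gpu_fb_gb=GPU_FRAME_BUFFER_GB):
    """Count physical GPUs needed for a list of per-VM vGPU profile sizes (GB).

    Assumption: every vGPU on one physical GPU uses the same profile size,
    so VMs are grouped by profile before being packed onto GPUs.
    """
    total = 0
    for size_gb, vm_count in Counter(profile_sizes_gb).items():
        if size_gb <= 0 or size_gb > gpu_fb_gb:
            raise ValueError(f"a {size_gb} GB profile does not fit a {gpu_fb_gb} GB GPU")
        vms_per_gpu = gpu_fb_gb // size_gb
        total += ceil(vm_count / vms_per_gpu)
    return total


# VM1 and VM2 use 12Q and VM3 uses 24Q; the 24Q profile for VM4 is an assumption.
poc_profiles = {"VM1": 12, "VM2": 12, "VM3": 24, "VM4": 24}
print(gpus_needed(list(poc_profiles.values())))  # 3
```

Under these assumptions the two 12Q VMs share one GPU and each 24Q VM takes a full GPU, which agrees with the three-GPU server used in the testing. Real sizing should still come from profiling during a PoC.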

Deployment Best Practices

Run a Proof of Concept

The most successful deployments are those that balance user density (scalability) with performance. This is achieved when Quadro vDWS-powered virtual machines are used in production while objective measurements and subjective feedback from end users are gathered.
We highly recommend running a PoC prior to a full deployment to provide a better understanding of how your users work and how many GPU resources they really need, analyzing the utilization of all resources, both physical and virtual. Consistently analyzing resource utilization and gathering subjective feedback allows for optimizing the configuration
to meet the performance requirements of end users while optimizing the configuration for best scale.

Leverage Management and Monitoring Tools

Quadro vDWS software provides extensive monitoring features enabling IT to better understand usage of the various engines of an NVIDIA GPU. The utilization of the compute engine, the frame buffer, the encoder, and decoder can all be monitored and logged through a command line interface called the NVIDIA System Management Interface (nvidia-smi), accessed on the hypervisor or within the virtual machine. In addition, NVIDIA vGPU metrics are integrated with Windows Performance Monitor (PerfMon) and through management packs like VMware vRealize Operations. To identify bottlenecks of individual end users or of the physical GPU serving multiple end users, execute the following nvidia-smi commands on the hypervisor.
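
As one hedged illustration of this kind of monitoring, the sketch below reads per-GPU utilization and frame buffer usage through `nvidia-smi --query-gpu` and flags GPUs that are close to full. The helper names, the 90% threshold, and the sample values are assumptions made for the example; per-VM views on a vGPU host come from the separate `nvidia-smi vgpu` subcommand, which is not parsed here.

```python
import csv
import io
import subprocess

# Fields accepted by `nvidia-smi --query-gpu`; this particular selection is illustrative.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def query_gpus():
    """Run nvidia-smi and return its CSV text (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_csv(text):
    """Turn the CSV text into one dict per GPU, adding frame buffer use in percent."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue
        row = dict(zip(FIELDS, (value.strip() for value in record)))
        used, total = float(row["memory.used"]), float(row["memory.total"])
        row["fb_used_pct"] = round(100.0 * used / total, 1)
        rows.append(row)
    return rows


def flag_busy(rows, fb_threshold_pct=90.0):
    """Return indices of GPUs whose frame buffer use meets a hypothetical threshold."""
    return [row["index"] for row in rows if row["fb_used_pct"] >= fb_threshold_pct]


# Sample text in the shape nvidia-smi prints for the fields above (made-up values).
sample = "0, Quadro RTX 6000, 35, 11800, 24576\n1, Quadro RTX 6000, 92, 23900, 24576\n"
rows = parse_gpu_csv(sample)
print(flag_busy(rows))  # ['1']
```

On a real host, `parse_gpu_csv(query_gpus())` would replace the sample string. Logging these rows at a fixed interval during a PoC gives the objective measurements that the sizing recommendations depend on.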

Understand Your Users

Another benefit of performing a PoC prior to deployment is that it enables more accurate categorization of user behavior and GPU requirements for each virtual workstation. Customers often segment their end users into user types for each application and bundle similar user types on a host. Light users can be supported on a smaller GPU and smaller profile size while heavy users require more GPU resources, a large profile size, and may be
best supported on a larger GPU like the Quadro RTX 8000 for example.

Understanding the GPU Scheduler

NVIDIA Quadro vDWS provides three GPU scheduling options to accommodate a variety of QoS requirements of customers.
Fixed share scheduling: Always guarantees the same dedicated quality of service. The fixed share scheduling policies guarantee equal GPU performance across all vGPUs sharing the same physical GPU. Dedicated quality of service simplifies a POC since it allows the use of common benchmarks used to measure physical workstation performance such as SPECviewperf, to compare the performance with current physical or
virtual workstations.
Best effort scheduling: Provides consistent performance at a higher scale and therefore reduces the TCO per user. This is the default scheduler.
The best effort scheduler leverages a round-robin scheduling algorithm which shares GPU resources based on actual demand which results in optimal utilization of resources. This results in consistent performance with optimized user density. The best effort scheduling policy best utilizes the GPU during idle and not fully utilized times, allowing for optimized
density and a good QoS.
Equal share scheduling: Provides equal GPU resources to each running VM. As vGPUs are added or removed, the share of GPU processing cycles allocated to each vGPU changes accordingly, so performance increases when utilization is low and decreases when utilization is high.

Organizations typically leverage the best effort GPU scheduler policy for their deployment to achieve better utilization of the GPU, which usually results in supporting more users per server with a lower quality of service (QoS) and better TCO per user.
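
The difference between the three policies can be made concrete with a toy model. The sketch below is not NVIDIA's scheduler: the function names are hypothetical, each VM's demand is expressed as a fraction of one physical GPU, and best effort is approximated as max-min fair sharing, which is an assumption chosen for illustration.

```python
def fixed_share(demands, max_vgpus_per_gpu):
    """Each vGPU is capped at 1/max_vgpus_per_gpu, however many VMs are running."""
    cap = 1.0 / max_vgpus_per_gpu
    return [min(d, cap) for d in demands]


def equal_share(demands):
    """Each running vGPU is capped at 1/N, where N is the number of running VMs."""
    cap = 1.0 / len(demands)
    return [min(d, cap) for d in demands]


def best_effort(demands, capacity=1.0):
    """Demand-driven sharing: cycles a lightly loaded VM leaves unused go to busier VMs.

    Modeled here as max-min fair (water-filling) allocation, an assumption
    made for this sketch.
    """
    alloc = [0.0] * len(demands)
    pending = sorted(range(len(demands)), key=lambda i: demands[i])
    remaining = capacity
    while pending:
        fair = remaining / len(pending)
        i = pending[0]
        if demands[i] <= fair:
            # The smallest outstanding demand fits within its fair share: grant it in full.
            alloc[i] = demands[i]
            remaining -= demands[i]
            pending.pop(0)
        else:
            # Every remaining VM wants more than the fair share: split what is left evenly.
            for j in pending:
                alloc[j] = fair
            break
    return alloc


demands = [0.1, 0.9]  # one nearly idle artist and one heavy artist on the same GPU
print(fixed_share(demands, max_vgpus_per_gpu=4))
print(equal_share(demands))
print(best_effort(demands))
```

In this toy example the heavy user is held to a quarter of the GPU under fixed share and to half under equal share, while best effort lets that user absorb the cycles the light user leaves idle. That is the utilization benefit the paragraph above describes, at the cost of less predictable per-user performance when everyone is busy at once.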


A qualified OEM server configured with three Quadro RTX 6000 GPUs provided the necessary resources for 3D production artists to work more efficiently, leveraging high performance virtual workstations which perform just like physical workstations. When sizing a Quadro
vDWS deployment for Autodesk Maya and Arnold, NVIDIA recommends conducting your own PoC to fully analyze resource utilization using objective measurements and subjective feedback. NVIDIA RTX Server offers flexibility to IT administrators to size VMs based on
workload or workflow needs.

Server Recommendation: Dual Socket, 2U Rack Server
A 2RU, 2-socket server configured with two Intel Xeon Gold 6154 processors is recommended. With a high frequency of 3.0 GHz combined with 18 cores, this CPU is well-suited for optimal performance for each end user while supporting the highest user scale, making it a cost-effective solution for Autodesk Maya.

Flash Based Storage for Best Performance
The use of flash-based storage, such as solid-state drives (SSDs), is recommended for optimal performance. Flash-based storage is the common choice for users on physical workstations, and similar performance can be achieved in similarly configured virtual environments. A typical configuration for non-persistent virtual machines is to use the direct attached storage (DAS) on the server in a RAID 5 or RAID 10 configuration. For persistent virtual machines, a high performing all-flash storage solution is the preferred option.

Typical Networking Configuration for Quadro vDWS
There is no typical network configuration for a Quadro vDWS powered virtual environment, since this varies based on multiple factors including choice of hypervisor, persistent versus non-persistent virtual machines, and choice of storage solution. Most customers use 10 GbE networking for optimal performance.

Optimizing for Dedicated Quality of Service
For comparative purposes, we considered the requirements for a configuration optimized for performance only. This configuration option does not take into account the need to further optimize for scale, or user density. Additionally, this configuration option is based solely on
performance using the aforementioned sample 3D production artistic workflows.

To run Maya with Arnold renderer workloads on E2E RTX 8000 GPU servers sign up here

Latest Blogs
June 27, 2022

Clustering in deep learning - An acknowledged tool

When learning about something new, such as music, one strategy is to look for meaningful groupings or collections. You may organize your music by genre, but your friend may organize it by singer. The way you group items allows you to learn more about them as distinct pieces of music, which is somewhat similar to what clustering algorithms do.

Let’s look in detail at clustering algorithms, their applications, and how GPUs can be used to accelerate clustering models.

Table of contents:

  1. What is Clustering?
  2. How to do Clustering?
  3. Methods of Clustering
  4. DNN in Clustering
  5. Accelerating analysis with clustering and GPU
  6. Applications of Clustering
  7. Conclusion

What is Clustering?

In machine learning, grouping instances is typically a first step in interpreting a data set. The technique of grouping unlabeled instances is known as clustering. Clustering is a form of unsupervised machine learning since the samples are unlabeled. When the instances are labeled, clustering becomes classification.

Clustering divides a set of data points or populations into groups so that data points in the same group are more similar to one another and dissimilar from data points in other groups. It is simply a collection of elements classified according to their similarity and dissimilarity.

How to do Clustering?

Clustering is critical because it determines the intrinsic grouping of the unlabeled data provided. There is no universal criterion for a good clustering. It is up to the user to decide which criteria will satisfy their needs. For example, we might be interested in locating representatives for homogeneous groups (data reduction), locating "natural clusters" and describing their unknown properties ("natural" data types), locating useful and appropriate groupings ("useful" data classes), or locating unusual data items (outlier detection). A clustering method must make assumptions about what makes points similar, and each assumption results in a different but equally valid clustering.

Methods of Clustering: 

  1. Density-Based Approaches: These methods treat clusters as dense regions of the data space, separated from one another by regions of lower density. These algorithms have high accuracy and can merge two clusters. Examples include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure).

  2. Methods Based on Hierarchy: The clusters created in this approach form a tree-like structure based on the hierarchy. Previously established clusters are used to generate new clusters. It is classified into two types: agglomerative (bottom-up approach) and divisive (top-down approach).

  3. Partitioning Methods: These methods divide the items into “k” clusters, with each partition forming a separate cluster. This approach optimizes an objective criterion, typically a similarity function. Examples include K-means and CLARANS (Clustering Large Applications based upon RANdomized Search).

  4. Grid-based Methods: The data space is divided into a finite number of cells that form a grid-like structure. Clustering operations performed on these grids are fast and independent of the number of data items. Examples include STING (Statistical Information Grid), WaveCluster, and CLIQUE (Clustering In QUEst).
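
To make the partitioning approach concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is an illustration rather than a production implementation: the function name is hypothetical, the sample points are made up, and the first k points are used as initial centroids only to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length tuples.

    Initial centroids are simply the first k points, a simplification
    that keeps the example deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

The assignment step treats every point independently, which is the property that makes this family of algorithms a good fit for parallel hardware.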

DNN in Clustering

In deep learning, DNNs serve as mappings to representations that are better suited to clustering. The features of these representations may be drawn from a single layer of the network or from several. This choice may be divided into two categories:

  • One layer: Refers to the general scenario in which just the output of the network's last layer is used. This method makes use of the representation's low dimensionality. 

  • Several layers: This representation is a composite of the outputs of several layers. As a result, the representation is more detailed and allows for embedded space to convey more sophisticated semantic representations, potentially improving the separation process and aiding in the computation of similarity.

Accelerating analysis with clustering and GPU

Clustering is essential in a wide range of applications and analyses, but it now faces a computational problem as data volumes continue to grow. One of the most promising options for tackling the computational barrier is parallel computing using GPUs. Because of their massive parallelism and memory bandwidth advantages, GPUs are an excellent way to speed up data-intensive analytics, particularly graph analytics. The massively parallel architecture of a GPU, which consists of thousands of small cores built to handle many tasks concurrently, is well suited to this kind of computation. For example, the cores can process groups of vertices or edges of a large graph in parallel.

In data analysis, clustering is a highly parallel task that can be accelerated using GPUs. In the future, GPU-accelerated graph libraries are expected to include spectral and hierarchical clustering/partitioning approaches based on the minimum balanced cut metric.

Applications of Clustering

The clustering approach may be applied to a wide range of areas. The following are some of the most popular applications of this technique: 

  • Segmentation of the Market: Cluster analysis, in the context of market segmentation, is the application of a mathematical model to uncover groups of similar consumers based on the smallest variances among customers within each group. In market segmentation, the purpose of cluster analysis is to precisely categorize customers in order to create more successful customer marketing through personalization. 

  • Recommendation engine: Clustering may be used to solve a number of well-known difficulties in recommendation systems, such as boosting the variety, consistency, and reliability of suggestions; the data sparsity of user-preference matrices; and changes in user preferences over time.

  • Analysis of social networks: Clustering in social network analysis is not the same as traditional clustering. It necessitates classifying items based on their relationships as well as their properties. Traditional clustering algorithms group items only on their similarity and cannot be used for social network research. A social network clustering analysis technique, unlike typical clustering algorithms, can classify items in a social network based on their linkages and detect relationships between classes. 

  • Segmentation of images: Clustering algorithms can be used to perform pixel-wise image segmentation. In this form of segmentation, the clustering algorithm aims to group pixels that are close together. There are two ways to conduct segmentation via clustering: agglomerative (merging) clustering and divisive clustering.

  • Anomaly detection: Clustering may be used to train a model of normal behavior by grouping comparable data points into clusters using a distance function. Clustering is appropriate for anomaly detection since no knowledge of the attack classes is required during training. Outliers in a dataset can be found using clustering and related approaches.
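
As a small illustration of the anomaly detection use case, the sketch below flags a point as an outlier when it lies far from every cluster centroid. The centroids, sample data, and the mean-plus-three-standard-deviations cutoff are assumptions chosen for the example; in practice the centroids would come from a clustering step such as K-means.

```python
import math
import statistics


def fit_threshold(normal_points, centroids, n_sigmas=3.0):
    """Derive a distance cutoff from data assumed to be normal."""
    distances = [min(math.dist(p, c) for c in centroids) for p in normal_points]
    return statistics.mean(distances) + n_sigmas * statistics.pstdev(distances)


def is_anomaly(point, centroids, threshold):
    """A point is flagged when it is farther than the cutoff from every centroid."""
    return min(math.dist(point, c) for c in centroids) > threshold


normal = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
centroids = [(1.0, 1.0), (5.0, 5.0)]  # hypothetical centroids, e.g. from K-means
threshold = fit_threshold(normal, centroids)

print(is_anomaly((1.1, 1.0), centroids, threshold))  # False: close to the first cluster
print(is_anomaly((9.0, 0.0), centroids, threshold))  # True: far from both clusters
```

No labels for the anomalous class are needed to fit the cutoff, which matches the point made in the bullet above.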


Conclusion

Clustering is an excellent method for learning new things from old data. Sometimes the resulting clusters will surprise you, and they may help you make sense of a problem. One of the most interesting aspects of employing clustering for unsupervised learning is that the findings may be used in a supervised learning problem.

Clusters might be the new features that you employ on a different data set! Clustering may be used on almost every unsupervised machine learning problem, but make sure you understand how to examine the results for accuracy.

Clustering is also simple to apply; however, several essential considerations must be made, such as dealing with outliers in your data and ensuring that each cluster has a sufficient population.

June 27, 2022

How GPUs are affecting Deep Learning inference?

The training step of most deep learning systems is the most time-consuming and resource-intensive. This phase may be completed in a fair period of time for models with fewer parameters, but as the number of parameters rises, so does the training time. This has a two-fold cost: your resources will be engaged for longer, and your staff will be left waiting, squandering time. 

We'll go through how GPUs address these issues and improve the performance of deep learning inference tasks such as multiclass classification.

Table of Content:

  1. Graphics Processing Unit (GPU)
  2. Why GPUs?
  3. How GPUs improved the performance of Deep Learning Inferences?
  4. Critical Decision Criteria for Inference 
  5. Which hardware should you use for DL inferences? 
  6. Conclusion

Graphics Processing Unit (GPU)

A graphics processing unit (GPU) is a specialized hardware component capable of performing many fundamental tasks at once. GPUs were created to accelerate graphics rendering for real-time computer graphics, especially gaming applications. The general structure of the GPU is similar to that of the CPU; both are spatial architectures. Unlike CPUs, which have a few ALUs optimized for sequential processing, the GPU contains thousands of ALUs that can perform a huge number of fundamental operations at the same time. Because of this feature, GPUs are a strong candidate for deep learning execution.

Why GPUs?

Graphics processing units (GPUs) can help you save time on model training by allowing you to execute models with a large number of parameters rapidly and efficiently. This is because GPUs allow you to parallelize your training activities, divide them across many processor clusters, and perform multiple computing operations at the same time.

GPUs are also tuned to execute certain kinds of jobs, allowing them to complete calculations more quickly than non-specialized hardware. These processors let you finish jobs faster while freeing up your CPUs for other duties. As a result, bottlenecks caused by computational limits are greatly reduced.

GPUs are capable of doing several calculations at the same time. This allows training procedures to be distributed and can considerably speed up deep learning operations. You can have a lot of cores with GPUs and consume fewer resources without compromising efficiency or power. The decision to integrate GPUs in your deep learning architecture is based on various factors:

Memory bandwidth: GPUs can offer the necessary bandwidth to support big datasets. This is because GPUs have specialized video RAM (VRAM), which allows you to save CPU memory for other operations.

Dataset size: GPUs can scale more readily than CPUs, allowing you to analyze large datasets more quickly. The more data you have, the more advantage you may get from GPUs.

Optimization: one disadvantage of GPUs is that it might be more difficult to optimize long-running individual activities than it is with CPUs.

How GPUs improved the performance of Deep Learning Inferences?

Multiple matrix multiplications make up the computationally costly part of a neural network. So, what can we do to make things go faster? We can perform all of the operations at the same time rather than one after the other. In a nutshell, this is why we use GPUs (graphics processing units) rather than CPUs (central processing units) when training a neural network, and the same parallelism speeds up inference.
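To make the "all at once" idea concrete, here is a minimal sketch in PyTorch. The layer sizes and variable names are made up for illustration, and the code falls back to the CPU when no GPU is visible, so it only shows the shape of the computation, not a benchmark.

```python
import torch

# Use a GPU if PyTorch can see one; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A toy fully connected layer (sizes are illustrative assumptions):
# 64 inputs with 512 features each, mapped to 256 output units.
x = torch.randn(64, 512, device=device)
w = torch.randn(512, 256, device=device)

# One after the other: a separate vector-matrix product per input row.
one_by_one = torch.stack([row @ w for row in x])

# All at once: a single matrix multiplication the device can parallelize.
all_at_once = x @ w

# Both approaches compute the same values (up to floating-point error).
print(all_at_once.shape, torch.allclose(one_by_one, all_at_once, rtol=1e-3, atol=1e-2))
```

The single `x @ w` call expresses the whole batch as one operation, which is the form that lets thousands of GPU ALUs work on it simultaneously.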

Critical Decision Criteria for Inference

The speed, efficiency, and accuracy of a model's predictions are some of the most important decision factors in this phase of development. If a model can't analyze data quickly enough, it becomes a theoretical exercise that can't be used in practice. If it consumes too much energy, it becomes too expensive to run in production. Finally, if the model's accuracy is inadequate, a data science team will be unable to justify its continued use. Inference speed, in particular, can be a bottleneck in some scenarios, such as image classification, which is used in a variety of applications including social media and image search engines. Even though the tasks are basic, timeliness is crucial, especially when it comes to public safety or platform violations.

Other scenarios where inference speed matters include the following.

Self-driving vehicles, commerce site suggestions, and real-time internet traffic routing are all instances of edge computing or real-time computing.

Object recognition inside 24x7 video feeds, and the processing of large volumes of images and videos, are examples of high-volume workloads.

Pathology and medical imaging are examples of complex images or tasks. These are some of the most difficult images to analyze. To achieve incremental speed or accuracy benefits from a GPU, data scientists must partition the images into smaller tiles.

These cases call for faster inference while also increasing accuracy. Because inference is often not as resource-intensive as training, many data scientists working in these contexts start with CPUs. As inference speed becomes a bottleneck, some turn to GPUs or other specialized hardware to obtain the performance or accuracy improvements they seek.

Which hardware should you use for DL inferences? 

There are many online recommendations on how to select DL hardware for training, but fewer on which hardware to select for inference. In terms of hardware, inference and training can be very different jobs. When deciding which hardware to use for inference, you should consider the following questions: How critical is it that my inference performance (latency/throughput) be good? Is it more important for me to minimize latency or to maximize throughput? Is my typical batch size small or large? How much cost am I willing to accept in exchange for better performance? Which neural network am I running?
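The latency and throughput questions above can be made measurable with a small timing harness. The following is only a sketch: `fake_model` is a hypothetical stand-in for a real model, and the batch sizes and repetition count are arbitrary assumptions.

```python
import time

def fake_model(batch):
    """Hypothetical stand-in for a real model: squares each input."""
    return [x * x for x in batch]

def measure(model, batch_size, n_batches=50):
    """Return (average latency per batch in seconds, throughput in items/second)."""
    batch = list(range(batch_size))
    start = time.perf_counter()
    for _ in range(n_batches):
        model(batch)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-12)
    latency = elapsed / n_batches
    throughput = (batch_size * n_batches) / elapsed
    return latency, throughput

for bs in (1, 8, 64):
    latency, throughput = measure(fake_model, bs)
    print(f"batch={bs:3d}  latency={latency * 1e3:.4f} ms  throughput={throughput:,.0f} items/s")
```

Larger batches usually raise throughput but also raise the time each request waits for its batch to finish, which is why the two metrics have to be weighed against each other for a given application.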

So how do we choose inference hardware? We start by assessing throughput performance. The V100 clearly outperforms the competition in terms of throughput, especially when using a large batch size (8 images in this case). Furthermore, because the YOLO model has significant parallelization potential, the GPU outperforms the CPU in this metric.


Conclusion

We looked at the hardware considerations involved in speeding up deep learning inference. We began by explaining what GPUs are and why they are needed, then covered how GPUs improve the performance of deep learning inference, the critical decision criteria for inference, and the hardware that should be used.

There is little question that the field of deep learning hardware will grow in the coming years, particularly when it comes to specialized AI processors and GPUs.

How do you feel about it? 

June 27, 2022

Understanding PyTorch

Every now and then, a library or framework emerges that completely changes the way we think about deep learning and advances deep learning research by making it computationally faster and less costly. Here we will discuss one such library: PyTorch.


PyTorch is a Python library (or framework) that makes deep learning projects easier to create. PyTorch's approachability and ease of use drew a large number of early adopters from the academic, research, and development communities. In the years since its first release, it has developed into one of the most popular deep learning tools across a wide range of applications.

PyTorch has two primary features at its core: an n-dimensional Tensor that works similarly to a NumPy array but can run on GPUs, and automatic differentiation for constructing and training neural networks. Apart from these, PyTorch includes a number of other features, which are detailed below in this blog.

PyTorch Tensor

NumPy is a fantastic framework, but it cannot use GPUs to speed up numerical operations. GPUs can frequently deliver speedups of 50x or more for contemporary deep neural networks, and today's parallel computing methods can take even greater advantage of them.

PyTorch also offers distributed training, which lets researchers and developers parallelize their work. By using many GPUs to process larger batches of input data, distributed training reduces computation time and makes training larger models feasible.

The Tensor, the most fundamental PyTorch concept, is what makes this possible. A PyTorch Tensor is conceptually the same as a NumPy array: a Tensor is an n-dimensional array, and PyTorch has many functions for working with them. Tensors can keep track of a computational graph and gradients behind the scenes, but they can also be used as a general tool for scientific computing. Unlike NumPy arrays, PyTorch Tensors can use GPUs to speed up their numeric operations. You just need to specify the appropriate device to run a PyTorch Tensor on a GPU.
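As a minimal sketch of specifying a device, the snippet below moves a NumPy array into a Tensor and multiplies it on whatever device is available. The array values are arbitrary, and the CPU fallback is an assumption added so the code runs on machines without a GPU.

```python
import numpy as np
import torch

# Use a GPU if PyTorch can see one; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Start from an ordinary NumPy array...
a_np = np.arange(6, dtype=np.float32).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# ...wrap it as a Tensor and place it on the chosen device.
a = torch.from_numpy(a_np).to(device)
b = torch.ones(3, 2, device=device)

# The matrix product runs on the device that holds the Tensors.
c = a @ b

# Copy the result back to the CPU to use it from NumPy again.
print(c.device, c.cpu().numpy())
```

The only GPU-specific parts are the `device` argument and the `.to(device)` call; the arithmetic itself is written exactly as it would be for the CPU.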

Automatic Differentiation

Automatic differentiation is the method PyTorch uses to record all of our operations and then compute gradients by replaying them backward. Without it, developers training neural networks would have to implement both the forward and backward passes by hand. Implementing the forward pass manually is usually easy, but doing the same for the backward pass can become tricky and tedious. Automating the backward pass is exactly what the autograd package in PyTorch does.

When you use autograd, your network's forward pass constructs a computational graph, with nodes being Tensors and edges being functions that produce output Tensors from input Tensors. Because the operations are recorded during the forward pass, you can then compute gradients simply by backpropagating through this graph, with no hand-written backward pass.
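A small example may help here. The function and values below are chosen purely for illustration: the forward pass computes y = (w · x)², and calling `backward()` asks autograd to fill in the gradient of y with respect to `w`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])

# requires_grad=True tells autograd to record operations involving w.
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

# Forward pass: the graph (mul -> sum -> pow) is recorded as this line runs.
y = (w * x).sum() ** 2

# Backward pass: replay the recorded graph in reverse to get dy/dw.
y.backward()

# Analytically, dy/dw = 2 * (w . x) * x = 2 * 4.5 * x = [9, 18, 27].
print(y.item(), w.grad)
```

No backward code was written by hand; the gradient in `w.grad` comes entirely from the graph recorded during the forward pass.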

Flow control and weight sharing

As an example of dynamic graphs and weight sharing, the PyTorch tutorials implement a deliberately strange model: a third-to-fifth order polynomial that, on each forward pass, picks a random integer between 3 and 5 and uses that many orders, reusing the same weights several times to compute the fourth and fifth orders. We can write the loop in this model using standard Python flow control, and we can achieve weight sharing by simply reusing the same parameter multiple times.
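The model described above can be sketched roughly as follows. This is an approximation written for illustration rather than the exact tutorial code; the class name `DynamicNet` and the parameter names `a` through `e` are assumptions.

```python
import random
import torch

class DynamicNet(torch.nn.Module):
    """Polynomial whose highest order (3, 4, or 5) is chosen at random per call."""

    def __init__(self):
        super().__init__()
        # Five scalar weights; e is shared by the 4th- and 5th-order terms.
        self.a = torch.nn.Parameter(torch.randn(()))
        self.b = torch.nn.Parameter(torch.randn(()))
        self.c = torch.nn.Parameter(torch.randn(()))
        self.d = torch.nn.Parameter(torch.randn(()))
        self.e = torch.nn.Parameter(torch.randn(()))

    def forward(self, x):
        # The cubic part is always present.
        y = self.a + self.b * x + self.c * x ** 2 + self.d * x ** 3
        # Ordinary Python flow control decides how many extra terms to add.
        order = random.randint(3, 5)
        for exp in range(4, order + 1):
            # Weight sharing: the same parameter e is reused for each extra term.
            y = y + self.e * x ** exp
        return y

model = DynamicNet()
x = torch.linspace(-1, 1, 100)
y = model(x)
y.sum().backward()   # gradients flow through whichever graph was built this time
print(y.shape, model.a.grad)
```

Because the loop runs a different number of times on different calls, each forward pass can produce a different graph, yet `backward()` works unchanged.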


TorchScript

TorchScript allows you to turn PyTorch code into serializable and optimizable models. Any TorchScript program can be saved from a Python process and loaded into another process that doesn't require, or doesn't have, a Python environment.

PyTorch has tools for converting a model from a pure Python program to a TorchScript program that can be executed in a standalone application, such as a C++ program. This allows users to train models in PyTorch using familiar Python tools before exporting the model to a production environment where Python programs may be a poor fit because of performance and multi-threading limitations.
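A minimal sketch of that workflow, assuming a PyTorch version where `torch.jit.script`, `torch.jit.save`, and `torch.jit.load` are available: the toy module `Scale` is hypothetical, and an in-memory buffer stands in for the file that a separate process would normally load.

```python
import io
import torch

class Scale(torch.nn.Module):
    """Hypothetical toy module with data-dependent control flow."""

    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.tensor(2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scripting keeps this branch as part of the compiled program.
        if x.sum() > 0:
            return x * self.weight
        return -x * self.weight

# Convert the Python module into a TorchScript program.
scripted = torch.jit.script(Scale())

# Serialize it; a real deployment would write to a file path instead.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)

# Load it back, as another process (or a C++ program via libtorch) might.
buffer.seek(0)
loaded = torch.jit.load(buffer)

out = loaded(torch.tensor([1.0, 2.0]))
print(out)
```

The loaded object no longer depends on the Python source of `Scale`, which is what makes it portable to environments without that code.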

Dynamic Computation Graphs

In many deep learning frameworks, you first set up the computational graph and then run it through an execution mechanism that is distinct from the host language. This unusual design is largely motivated by the need for efficiency and optimization. DL frameworks keep track of a computational graph that specifies the order in which calculations must be performed in a model. Researchers have found it difficult to test out more creative ideas because of this inconvenient arrangement.

There are two types of computational graphs: static and dynamic. With a static graph, variable sizes must be fixed at the start: all the variables are created and connected up front, and the graph is then run in a static (unchanging) session. This can be inconvenient for some applications, such as NLP, where dynamic computational graphs are critical because input text can arrive in a variety of lengths.

PyTorch, on the other hand, employs a dynamic graph. That is, the computational graph is constructed on the fly as the code runs, so it is rebuilt on every training iteration. Dynamic graphs are flexible, allowing us to change and inspect the graph's internals at any moment.
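To illustrate why this matters for variable-length input, the sketch below runs the same weight over sequences of different lengths. The recurrence and the numbers are invented for illustration; the point is only that the graph's size follows the input.

```python
import torch

# A single shared weight that we want gradients for.
w = torch.tensor(0.5, requires_grad=True)

def run(sequence):
    """Fold a sequence into one value; the loop length depends on the input."""
    state = torch.tensor(0.0)
    for value in sequence:
        # Each iteration adds new nodes to the graph for this call only.
        state = state * w + value
    return state

for seq in ([1.0, 2.0], [1.0, 2.0, 3.0, 4.0]):
    w.grad = None        # clear the gradient left over from the previous sequence
    out = run(seq)
    out.backward()       # backpropagate through the graph built for this sequence
    print(len(seq), out.item(), w.grad.item())
```

Nothing about the sequence length had to be declared in advance: a two-element input builds a small graph and a four-element input builds a larger one, each discarded after its backward pass.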

Introducing dynamic computational graphs is like introducing the idea of a procedure when all you had before were "goto" statements. The idea of the procedure lets us write our programs in a composable manner. Of course, one may argue that DL architectures do not require a stack. Recent research on Stretcher networks and Hyper networks, however, suggests otherwise, and in some studies context switching, as with a stack, appears to be beneficial in certain networks.

nn Module

Autograd and computational graphs are a powerful paradigm for defining sophisticated operators and automatically computing derivatives; nevertheless, raw autograd can be too low-level for large neural networks. When developing neural networks, we often think of arranging the computation into layers, some of which contain learnable parameters that will be adjusted throughout the learning process.

In such cases, we can make use of PyTorch’s nn module. The nn package defines modules, which are fundamentally equivalent to neural network layers. A Module can contain internal data such as Tensors with learnable parameters in addition to taking input Tensors and computing output Tensors.
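As a minimal sketch of working at the layer level, the snippet below stacks two `Linear` modules with a `ReLU` between them. The layer sizes, the random input, and the all-zeros target are arbitrary choices for illustration.

```python
import torch

# Each Linear module owns its learnable weight and bias Tensors.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 1),
)

x = torch.randn(8, 3)    # a batch of 8 inputs with 3 features each
y = model(x)             # modules take input Tensors and compute output Tensors

# The modules expose their internal parameters for the optimizer to update.
n_params = sum(p.numel() for p in model.parameters())

# Autograd still does the work underneath: one loss, one backward call.
loss = torch.nn.functional.mse_loss(y, torch.zeros(8, 1))
loss.backward()

print(y.shape, n_params)
```

Here `n_params` counts 3*4 + 4 values for the first layer and 4*1 + 1 for the second, all created and tracked by the modules rather than by hand.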


In this blog, we saw how PyTorch differs from libraries like NumPy and what special features it offers, including Tensor computing with substantial GPU acceleration and a tape-based autograd system used to build deep neural networks.

We also looked at other features such as flow control and weight sharing, TorchScript, dynamic computation graphs, and the nn module.

This overview should be enough to give a general idea of what PyTorch is and how academics, researchers, and developers can use it to build better projects.
