What is DCGM Exporter container in NVIDIA GPU Cloud?

April 8, 2022

For infrastructure or site reliability engineering (SRE) teams dealing with large GPU clusters for artificial intelligence (AI) or high-performance computing (HPC) workloads, monitoring GPUs is crucial. GPU metrics let teams quickly identify workload behavior, allowing them to better allocate and utilize resources, troubleshoot anomalies, and improve overall data center performance. Whether you're a researcher focusing on GPU-accelerated machine learning processes or a data center designer interested in GPU utilization and saturation for capacity planning, these measurements are likely to be relevant to you.

As containerized AI/ML workloads scale out with container management technologies like Kubernetes, monitoring these workloads and their patterns becomes even more relevant. In this article, we'll go through how NVIDIA Data Center GPU Manager (DCGM) can be integrated with open-source tools like Prometheus and Grafana to form the foundation of a Kubernetes GPU monitoring solution.


DCGM, at its core, is a smart, lightweight user-space library/agent that performs a variety of tasks on each host system. It facilitates the administration of NVIDIA Tesla GPUs in cluster and data center environments. DCGM includes GPU telemetry APIs that expose GPU utilization measurements for monitoring Tensor Cores, FP64 units, memory metrics, and interconnect traffic metrics. Go bindings built on the DCGM APIs are available for integration into container environments, where Go is a prominent programming language.

A collector, a time-series database for storing metrics, and a display layer are the common components of a monitoring stack. Prometheus is a widely used open-source monitoring stack, and it is commonly paired with Grafana as the visualization tool for building complex dashboards. Alertmanager is the Prometheus component that lets you generate and manage alerts. Prometheus is used in conjunction with kube-state-metrics, which provides cluster-level metrics for Kubernetes API objects, and node exporter, which provides node-level metrics.
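To make the collector step concrete, here is a minimal sketch of how a scraper might turn the plain-text metrics that an exporter serves into structured samples. It is a simplified illustration rather than the Prometheus parser: it ignores comment lines and trailing timestamps and does not unescape label values. The sample payload, including the UUID values, is made up for the example.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# metric_name{label="value",...} value
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text: str) -> List[Sample]:
    """Parse text-format metrics into samples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # not a metric line; a real parser would report this
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append(Sample(name, labels, float(raw_value)))
    return samples


# Made-up payload in the style of a GPU exporter's output.
EXAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
"""

samples = parse_exposition(EXAMPLE)
for sample in samples:
    print(sample.name, sample.labels["gpu"], sample.value)
```

In a real deployment Prometheus does this scraping and parsing itself; the sketch only shows what kind of data flows from the collector into the time-series database.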

Source: Prometheus documentation

The functions performed by DCGM include:

  • Monitoring the GPU's performance - On DGX-2 or HGX-2 systems, NVSwitch is the point of interaction for all GPUs. DCGM's Fabric Manager component configures the switches to form a single memory fabric for all the GPUs involved and supervises the fabric's NVLinks.
  • Management of GPU configuration - Users can tailor the behavior of NVIDIA GPUs to meet the needs of particular environments or applications. This includes clock settings, exclusivity constraints like compute mode, and environmental controls like power limits. DCGM provides enforcement and persistence mechanisms to ensure that associated GPUs behave consistently.
  • Governing GPU health and diagnostics by GPU policy - NVIDIA GPUs have sophisticated capabilities for error detection and containment. Automated policies that govern the GPU's response to certain classes of events, such as recovery from faults and isolation of faulty hardware, ensure higher reliability and a simpler administrative environment. DCGM provides policies for common scenarios that require notification or automatic action.

Another critical requirement is the ability to determine the health of a GPU and its interaction with the surrounding system. This need shows up in a variety of ways, ranging from passive background monitoring to rapid system validation to in-depth hardware diagnostics. In all of these cases, it's crucial to provide these functions with as little impact on the system and as few additional environmental requirements as possible. DCGM has a wide range of health and diagnostic capabilities, both automated and non-automated.

  • Analyzing process statistics and GPU accounting - Schedulers and resource managers need to know how GPUs are used. Combining this data with RAS events, performance data, and other telemetry, particularly at the boundaries of a workload, is extremely helpful in explaining job behavior and identifying the source of any performance or execution problems. At the job level, DCGM provides a framework for gathering, grouping, and analyzing data.
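As a rough illustration of the job-level accounting idea in the last item, the sketch below groups GPU samples that fall inside a job's time window and summarizes them. The data model (GpuSample, JobSummary) and the function name are hypothetical and are not part of the DCGM API; they only show the kind of grouping a scheduler might want at workload boundaries.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, List, Set


@dataclass(frozen=True)
class GpuSample:
    timestamp: float        # seconds since some reference point
    gpu_id: int
    utilization: float      # percent
    memory_used_mib: float


@dataclass(frozen=True)
class JobSummary:
    sample_count: int
    avg_utilization: float
    max_memory_used_mib: float


def summarize_job(samples: Iterable[GpuSample], gpu_ids: Set[int],
                  start: float, end: float) -> JobSummary:
    """Summarize the samples taken on the job's GPUs between start and end."""
    window: List[GpuSample] = [
        s for s in samples
        if s.gpu_id in gpu_ids and start <= s.timestamp <= end
    ]
    if not window:
        raise ValueError("no samples recorded for this job window")
    return JobSummary(
        sample_count=len(window),
        avg_utilization=mean(s.utilization for s in window),
        max_memory_used_mib=max(s.memory_used_mib for s in window),
    )


# Made-up samples: the job ran on GPU 0 from t=10 to t=20.
history = [
    GpuSample(5.0, 0, 5.0, 100.0),      # before the job started
    GpuSample(10.0, 0, 50.0, 4000.0),
    GpuSample(20.0, 0, 90.0, 6000.0),
    GpuSample(15.0, 1, 99.0, 9000.0),   # different GPU, not part of the job
]
summary = summarize_job(history, gpu_ids={0}, start=10.0, end=20.0)
print(summary)
```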

Per-pod GPU metrics in a Kubernetes cluster

dcgm-exporter captures metrics for all of a node's GPUs. However, when a pod asks for GPU resources in Kubernetes, you may not know which GPUs in a node will be assigned to it. With v1.13, kubelet added a device monitoring feature that exposes, through a pod-resources socket, which devices are assigned to each pod (pod name, pod namespace, and device ID). The dcgm-exporter HTTP server connects to the kubelet pod-resources server (/var/lib/kubelet/pod-resources) to identify the GPU devices assigned to a pod and adds the pod information to the metrics it collects for those devices.
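The enrichment step described above can be sketched as follows. This is an illustration of the idea, not dcgm-exporter's implementation: the assignments mapping stands in for what the kubelet pod-resources server would report, and the pod name, namespace, UUID values, and label names are assumptions made up for the example.

```python
from typing import Dict, NamedTuple


class PodInfo(NamedTuple):
    name: str
    namespace: str


def format_metric(name: str, gpu_uuid: str, value: float,
                  assignments: Dict[str, PodInfo]) -> str:
    """Render one metric line, adding pod labels when the GPU is assigned."""
    labels = {"UUID": gpu_uuid}
    pod = assignments.get(gpu_uuid)
    if pod is not None:
        labels["pod"] = pod.name
        labels["namespace"] = pod.namespace
    rendered = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{name}{{{rendered}}} {value}"


# Hypothetical device-to-pod mapping, as the pod-resources socket might report.
assignments = {"GPU-aaaa": PodInfo(name="train-job-0", namespace="ml")}

assigned_line = format_metric("DCGM_FI_DEV_GPU_UTIL", "GPU-aaaa", 87.0, assignments)
idle_line = format_metric("DCGM_FI_DEV_GPU_UTIL", "GPU-bbbb", 0.0, assignments)
print(assigned_line)
print(idle_line)
```

A GPU with no pod assigned keeps only its device labels, so node-wide dashboards still see every GPU while per-pod queries can filter on the pod and namespace labels.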

DCGM offers several user interfaces to cater to different consumers and use cases. C and Python interfaces are intended for integration with third-party software. The Python interfaces are also suited to scripting contexts where the administrator is in charge. CLI-based tools provide end users with an interactive experience right out of the box. Each interface offers nearly the same capabilities. The functionality and design of NVIDIA DCGM mainly target the following users:

  • OEMs and independent software vendors (ISVs) who want to enhance GPU integration in their software.
  • Administrators in charge of their own GPU-enabled infrastructure.
  • Individual users and FAEs who require more information on GPU usage, particularly during problem analysis.
  • All DGX-2 and HGX-2 users, who will use the Fabric Manager to set up and analyze the NVSwitch fabric.
Latest Blogs
June 29, 2022

Project Management for AI-ML-DL Projects

Managing a project properly is one of the factors behind its completion and subsequent success. The same can be said for any artificial intelligence (AI)/machine learning (ML)/deep learning (DL) project. Moreover, efficient management in this segment holds even more prominence as it requires continuous testing before delivering the final product.

An efficient project manager will ensure that there is ample time from the concept to the final product so that a client’s requirements are met without any delays and issues.

How is Project Management Done For AI, ML or DL Projects?

As already established, efficient project management is of great importance in AI/ML/DL projects. So, if you are planning to move into this field as a professional, here are some tips –

  • Identifying the problem-

The first step toward managing an AI project is the identification of the problem. What are we trying to solve, or what outcome do we desire? AI is a means of reaching the outcome that we desire. Several candidate solutions are shortlisted, and the AI solution is built on them.

  • Testing whether the solution matches the problem-

After the problem has been identified, the solution is tested. We try to find out whether we have chosen the right solution for the problem. At this stage, we can understand how to begin an artificial intelligence, machine learning or deep learning project. We also need to understand whether customers will pay for this solution to the problem.

AI and ML engineers test this problem-solution fit through various techniques such as the traditional lean approach or the product design sprint. These techniques help us analyse the solution within the deadline.

  • Preparing the data and managing it-

If you have a stable customer base for your AI, ML or DL solutions, then begin the project by collecting data and managing it. We begin by segregating the available data into unstructured and structured forms. Dividing the data is easy in small and medium companies because the amount of data is smaller. Larger businesses, however, have large amounts of data to work on. Data engineers use a range of tools and techniques to organise and clean up the data.

  • Choosing the algorithm for the problem-

To keep the blog simple, we will try not to go into the technical side of AI algorithms here. The choice of algorithm depends on the type of machine learning technique we employ. In a supervised learning model, classification helps us label the data and regression helps us predict a quantity. A data engineer can choose from any of the popular algorithms like Naïve Bayes classification or the random forest algorithm. If an unsupervised learning model is used, then clustering algorithms are used.

  • Training the algorithm-

For training algorithms, one needs to use various AI techniques, which are applied through software developed by programmers. While most of the work is done in Python, nowadays JavaScript, Java, C++ and Julia are also used. So, a development team is set up at this stage. These developers set a minimum threshold that is able to generate the necessary statistics to train the algorithm.

  • Deployment of the project-

After the project is completed, we come to its deployment. It can be deployed either on a local server or in the cloud. So, data engineers check that the local GPU or the cloud GPU is in order. Then they deploy the code along with the required dashboard to view the analytics.

Final Words-

To sum it up, this is a generic overview of how a project management system should work for AI/ML/DL projects. However, a point to keep in mind here is that this is not a universal process. The particulars will vary according to the specific project.






June 29, 2022

Top 7 AI & ML start-ups in Telecom Industry in India

With the many technological advancements India has witnessed in the last few years, deep learning, machine learning and artificial intelligence have emerged as futuristic technologies that will lead to the improved management of data-hungry workloads.


The availability of artificial intelligence and machine learning in almost all industries today, including the telecom industry in India, has helped change the way of operational management for many existing businesses and startups that are the exclusive service providers in India.


In addition to that, the awareness and popularity of cloud GPU servers and other GPU cloud computing options have encouraged AI and ML startups in the telecom industry in India to take their efficiency a notch higher by combining these technologies with cloud computing GPUs. Let us look at 7 AI and ML startups in the telecom industry in India in 2022 below.


Top AI and ML Startups in Telecom Industry 

With 5G being the top priority for the majority of companies in the telecom industry in India, making networks affordable for everyone around the country has become the central mission. Technologies like artificial intelligence and machine learning are the key digital transformation techniques that can change the way networks operate in the country. The top startups include the following:


Wiom

Founded in 2021, Wiom is a telecom startup using various technologies like deep learning and artificial intelligence to create a blockchain-based working model for internet delivery. It is an affordable, scalable model that might incorporate GPU cloud servers in the future when data flow increases.


TechVantage

As one of the companies that are strongly driven by data and unique state-of-the-art solutions for revenue generation and cost optimization, TechVantage is a startup in the telecom industry that improves the user experience for leading telecom players with improved media generation and reach, using online GPU cloud services.


Manthan

One of the strongest performers in customer analytics solutions, Manthan is a startup supporting the telecom industry in India. It works almost like a business assistant that helps leverage deep analytics for improved efficiency. For denser database management, the NVIDIA A100 80 GB is one of its top choices.


NetraDyne

Just as NVIDIA is known as a top GPU cloud provider, NetraDyne can be named as a telecom startup, even if not directly. It aims to use artificial intelligence and machine learning to increase road safety, which is also a key concern for telecom providers with field teams. It assists with fleet management.

KeyPoint Tech

This AI- and ML-driven startup is all set to combine various technologies to provide improved technology solutions for all devices and platforms. At present, they do not use any available cloud GPU servers but expect to experiment with GPU cloud computing in the future when data inflow increases.



Actively known to resolve customer communication, it is also considered to be a startup in the telecom industry as it facilitates better communication among customers for increased engagement and satisfaction. 


Facilio

An AI startup in Chennai, Facilio is a facility operation and maintenance solution that aims to improve the machine efficiency needed for network tower management, buildings, machines, etc.


In conclusion, the telecom industry in India is actively looking to improve the services provided to customers to ensure maximum customer satisfaction. From top-class networking solutions to better management of growing databases using GPU cloud or other online GPU services to handle data-hungry workloads efficiently, AI- and ML-enabled solutions have taken the telecom industry by storm. Moreover, with the introduction of artificial intelligence and machine learning in this industry, the scope for innovation and improvement is higher than ever before.






June 29, 2022

Top 7 AI Startups in Education Industry

The evolution of the global education system is an interesting thing to watch. The way this whole sector has transformed in the past decade can make a great case study on how modern technology like artificial intelligence (AI) makes a tangible difference in human life. 

In this evolution, edtech startups have played a pivotal role. And, in this write-up, you will get a chance to learn about some of them. So, read on to explore more.

Top AI Startups in the Education Industry-

Following is a list of education startups that are making a difference in the way this sector is transforming –

  1. Miko

Miko started its operations in 2015 in Mumbai, Maharashtra. Miko has made a companion for children. This companion is a bot which is powered by AI technology. The bot is able to perform an array of functions like talking, responding, educating, providing entertainment, and also understanding a child’s requirements. Additionally, the bot can answer what the child asks. It can also carry out a guided discussion for clarifying any topic to the child. Miko bots are integrated with a companion app which allows parents to control them through their Android and iOS devices. 

  2. iNurture

iNurture was founded in 2005 in Bengaluru, Karnataka. It assists universities with job-oriented UG and PG courses. It offers courses in IT, innovation, marketing leadership, business analytics, financial services, and design and new media. One of its popular products is KRACKiN. It is an AI-powered platform which engages students and provides employment and career guidance.

  3. Verzeo

Verzeo started its operations in 2018 in Bengaluru, Karnataka. It is a platform based on AI and ML. It provides academic programmes involving multi-disciplinary learning that can later culminate in getting an internship. These programmes are in subjects like artificial intelligence, machine learning, digital marketing and robotics.

  4. EnglishEdge

EnglishEdge was founded in Noida in 2012. EnglishEdge provides AI-driven courses for building English skills. Its online programmes for polishing your English include professional edge, conversation edge and grammar edge. There is also a portable lab for schools that use smart classes to teach the language.

  5. CollPoll

CollPoll was founded in 2013 in Bengaluru, Karnataka. The platform is mobile- and web-based. CollPoll helps in managing educational institutions. It helps in the management of admission, curriculum, timetable, placement, fees and other features. College or university administrators, faculty and students can share opinions, ideas and information on a central server from their Android and iOS phones.

  6. Thinkster

Thinkster was founded in 2010 in Bengaluru, Karnataka. Thinkster is a program for learning mathematics and it is based on AI. The program is specifically focused on teaching mathematics to K-12 students. Students get a personalised experience as classes are conducted in a one-on-one session with the tutors of mathematics. Teachers can give scores for daily worksheets along with personalised comments for the improvement of students. The platform uses AI to analyse students’ performance. You can access the app through Android and iOS devices.

  7. ByteLearn

ByteLearn was founded in Noida in 2020. ByteLearn is an assistant driven by artificial intelligence which helps mathematics teachers and other coaches to tutor students on its platform. It gives students individual attention in one-on-one sessions. ByteLearn also helps students with personalised practice sessions.

Key Highlights

  • High demand for AI-powered personalised education, adaptive learning and task automation is steering the market.
  • Several AI segments such as speech and image recognition, machine learning algorithms and natural language processing can radically enhance the learning system with automatic performance assessment, 24x7 tutoring and support and personalised lessons.
  • As per the market reports of P&S Intelligence, the worldwide AI-in-education market was valued at $1.1 billion in 2019.
  • In 2030, it is projected to attain $25.7 billion, indicating a 32.9% CAGR from 2020 to 2030.
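As a quick sanity check on the last two figures, the compound annual growth rate can be recomputed from the start and end valuations. The sketch below assumes the $1.1 billion valuation from 2019 as the starting point, which is an assumption: the quoted 32.9% covers 2020 to 2030, and the 2020 base value is not given here, so a small difference from the quoted rate is expected.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: use the 2019 valuation as the base, giving an 11-year span.
implied = cagr(start_value=1.1, end_value=25.7, years=2030 - 2019)
print(f"Implied CAGR 2019-2030: {implied:.1%}")
```

Under that assumption the implied rate comes out at roughly 33% per year, in line with the quoted figure.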

Bottom Line

Rising reliance on smart devices, huge spending on AI technologies and edtech, and highly developed learning infrastructure are the primary contributors to the growth the education sector has witnessed recently. Notably, artificial intelligence in the education sector will expand drastically. However, certain unmapped areas still require innovation.

With experienced, well-coordinated teams and engaging ideas, AI education startups can achieve great success.





