What is Nvidia-Docker?
Simple. Bridging the Gap Between Containers and GPUs
Nvidia created a runtime for Docker called Nvidia-Docker. The goal of this open-source project was to bring the ease and agility of containers to the CUDA programming model.
Since Docker didn’t support GPUs natively, this project instantly became a hit with the CUDA community. Nvidia-Docker is basically a wrapper around the docker CLI that transparently provisions a container with the necessary dependencies to execute code on the GPU. It is only strictly necessary for nvidia-docker run, that is, when executing a container that uses GPUs.
You can run Nvidia-Docker on Linux machines that have a GPU along with the required drivers installed.
All our GPU plans are NVIDIA® CUDA-capable, support cuDNN, and come with Nvidia-Docker installed.
How to verify that a Docker container can access the GPU?
After you create a GPU node, you’ll need to log in to the server via SSH:
From a terminal on your local computer, connect to the server as root. Make sure to substitute the server’s IP address, which you received in your welcome email.
If you did not add an SSH key when you created the server, you will find your root password in your welcome email.
Below are the commands to verify Nvidia-Docker on an Ubuntu 16.04 machine powered by an NVIDIA® GPU.
# nvidia-docker run --rm nvidia/cuda nvidia-smi
The nvidia-smi command runs the NVIDIA System Management Interface (SMI) to confirm that the Docker container is able to access the GPU. Behind the scenes, SMI queries the Nvidia driver, which in turn communicates with the GPU.
We can also verify that CUDA is installed by running the following command.
# nvidia-docker run --rm nvidia/cuda nvcc -V
Where are NGC (NVIDIA® GPU Cloud) container images hosted?
NGC containers are hosted in an nvidia-docker repository called nvcr.io. These containers can be “pulled” from the repository and used for GPU-accelerated applications such as scientific workloads, visualization, and deep learning.
A Docker image is simply a file-system that a developer builds. An nvidia-docker image serves as the template for the container, and is a software stack that consists of several layers. Each layer depends on the layer below it in the stack.
From a Docker image, a container is formed. When creating a container, you add a writable layer on top of the stack. A Docker image with a writable container layer added to it is a container. A container is simply a running instance of that image. All changes and modifications made to the container are made to the writable layer. You can delete the container; however, the Docker image remains untouched.
NGC Container Registry Spaces
The NGC container registry uses spaces to group nvidia-docker image repositories for related applications. These spaces appear in the image URL as nvcr.io/<space>/image-name:tag when pulling, running, or layering additional software on top of NGC container images.
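To make the URL layout concrete, the following sketch splits an image reference into registry, space, image name, and tag using POSIX parameter expansion. The helper name parse_ngc_ref is hypothetical (it is not part of Docker or the NGC CLI), and the sketch assumes the reference has exactly the nvcr.io/<space>/image-name:tag shape with an explicit tag.

```shell
# parse_ngc_ref: hypothetical helper, not a Docker or NGC command.
# Assumes a reference shaped like <registry>/<space>/<image-name>:<tag>.
parse_ngc_ref() {
  ref=$1
  registry=${ref%%/*}      # text before the first "/"
  rest=${ref#*/}           # drop "<registry>/"
  space=${rest%%/*}        # text before the next "/"
  name_tag=${rest#*/}      # "<image-name>:<tag>"
  image=${name_tag%%:*}    # text before ":"
  tag=${name_tag#*:}       # text after ":"
  printf 'registry=%s space=%s image=%s tag=%s\n' \
    "$registry" "$space" "$image" "$tag"
}

parse_ngc_ref nvcr.io/nvidia/pytorch:19.08-py3
# prints: registry=nvcr.io space=nvidia image=pytorch tag=19.08-py3
```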
The nvidia space contains a catalog of fully integrated and optimized deep learning framework containers that take full advantage of NVIDIA GPUs in both single-GPU and multi-GPU configurations. They include CUDA Toolkit, DIGITS workflow, and the following deep learning frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as CUDA runtime, NVIDIA libraries, and an operating system.
Each framework container image also includes the framework source code to enable custom modifications and enhancements, along with the complete software development stack.
NVIDIA updates these deep learning containers monthly to ensure they continue to provide peak performance.
Another space contains a catalog of HPC visualization containers, currently available in beta, featuring the industry’s leading visualization tools, including ParaView with the NVIDIA IndeX volume renderer, the NVIDIA OptiX ray-tracing library, and NVIDIA Holodeck for interactive real-time visualization and high-quality visuals.
A third space contains a catalog of popular third-party GPU-ready HPC application containers provided by partners, including GAMESS, GROMACS, LAMMPS, NAMD, and RELION. All third-party containers conform to NGC container standards and best practices, making it easy to get the latest GPU-optimized HPC software up and running quickly.
Here are the GPU-optimized application container images in the repository as of this writing (using labels as listed in the registry repository).
List of NVIDIA® GPU Cloud (NGC) Components
This document provides an introduction to the following three components of NGC.
Integral to NGC is the NGC Registry which holds a comprehensive catalog of GPU-accelerated containers for AI, machine learning and HPC, pre-trained models for AI tasks, and model-scripts for creating deep learning models.
The NGC website is the portal for browsing the contents of the NGC registry, generating an API key for access to additional features, and for downloading the NGC CLI.
The NGC Catalog CLI is a command-line interface for managing content within the NGC Registry. The CLI operates within a shell and lets you use scripts to automate commands. See the NGC Catalog CLI User Guide for instructions on installing and using the NGC Catalog CLI.
Most of the software is freely available, but some items are ‘locked’ and require an NGC account to access. By signing up for an account through the NGC website, you can access the locked containers in the NGC container registry and run them on a number of accelerated computing environments. The instructions in this document will assist you in getting started with NGC.
How to Access the NGC Software Catalog?
Accessing the NGC Website:
You can access the NGC website and browse the catalog of containers, models, and model scripts even if you do not have an NGC account.
Accessing the NGC Website without an Account
From your browser, go to https://ngc.nvidia.com, then click a category of interest.
Accessing the NGC Website with Your Account
If you have an NGC account and want to sign in, then click Sign In from the top menu and sign in to your account.
Select the organization to associate with your login.
If you do not already have an NGC account, follow the instructions at Signing Up for an NGC Account.
Browsing the NGC Website
The NGC website opens to the catalog of GPU-optimized accelerated software.
Click from the top menu options to specify the type of software to view.
You can also select a different category from the top ribbon to see the associated catalog of software.
Click one of the software cards to view information about the software.
The example images below show information for the PyTorch repository.
Mandatory Sign-Up for an NGC Account To Open Locked Images
The following image shows an example of a framework container image that is locked, as indicated by the lock icon highlighted in the upper right corner.
You need to sign up for an account and then obtain an API key.
You can begin using the containers from the NGC container registry, including locked containers, once you have generated an API key.
Accessing And Pulling A Container From NGC Registry
Before you can pull a container from the NGC container registry, you must have access to the registry and be logged in to it, as explained in this section.
In order to issue the pull and run commands, ensure that you are familiar with the following concepts.
A pull command looks similar to:
# docker pull nvcr.io/nvidia/pytorch:19.08-py3
A run command looks similar to:
# nvidia-docker run -it --rm -v local_dir:container_dir nvcr.io/nvidia/pytorch:<xx.xx>-py3
The following concepts describe the separate attributes that make up both commands.
- nvcr.io: The name of the container registry, which for the NGC container registry and the NVIDIA DGX container registry is nvcr.io.
- nvidia: The name of the space within the registry that contains the container. For containers provided by NVIDIA, the registry space is nvidia. For more information, see the section NGC Container Registry Spaces above.
- -it: You want to run the container in interactive mode.
- --rm: You want to delete the container when finished.
- -v: You want to mount the directory.
- local_dir: The directory or file from your host system (absolute path) that you want to access from inside your container. For example, in -v /home/jsmith/data/mnist:/data/mnist the local_dir is /home/jsmith/data/mnist. If you are inside the container and run ls /data/mnist, you will see the same files as if you ran ls /home/jsmith/data/mnist from outside the container.
- container_dir: The target directory when you are inside your container. For example, /data/mnist is the target directory in -v /home/jsmith/data/mnist:/data/mnist.
- <xx.xx>: The tag. For example, 19.08.
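Putting the attributes together: the sketch below assembles the run command from its parts and prints it rather than executing it, so it can be tried on a machine without Docker. build_run_cmd is a hypothetical helper, and the paths and tag are the example values used above.

```shell
# build_run_cmd: hypothetical helper that only PRINTS the command.
# Arguments: local_dir container_dir image-name tag
build_run_cmd() {
  local_dir=$1        # absolute path on the host
  container_dir=$2    # mount point inside the container
  image=$3            # repository name inside the nvidia space
  tag=$4              # explicit version tag
  printf 'nvidia-docker run -it --rm -v %s:%s nvcr.io/nvidia/%s:%s\n' \
    "$local_dir" "$container_dir" "$image" "$tag"
}

build_run_cmd /home/jsmith/data/mnist /data/mnist pytorch 19.08-py3
# prints: nvidia-docker run -it --rm -v /home/jsmith/data/mnist:/data/mnist nvcr.io/nvidia/pytorch:19.08-py3
```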
Before accessing the NGC container registry, ensure that the following prerequisites are met.
- Your account is activated.
- You have an API key for authenticating your access to the NGC container registry.
- You are logged in to your server with the privileges required to run nvidia-docker containers.
After your account is activated, you can access the NGC container registry in one of two ways:
- Pulling A Container From The NGC container registry Using The Docker CLI
- Pulling A Container Using The NGC container registry Web Interface
A Docker registry is the service that stores Docker images. The service can be on the internet, on the company intranet, or on a local machine. For example, nvcr.io is the location of the NGC container registry for nvidia-docker images.
All nvcr.io Docker images use explicit version tags to avoid the ambiguous versioning that can result from using the latest tag. For example, a locally tagged “latest” version of an image may actually override a different “latest” version in the registry.
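A script can enforce the same rule before pulling. The sketch below is a hypothetical helper (has_explicit_tag is not a Docker or NGC command) that succeeds only when an image reference carries a tag other than latest; a reference with no tag is rejected because Docker would fall back to latest.

```shell
# has_explicit_tag: hypothetical helper, returns 0 only for an explicit,
# non-"latest" tag on the last path component of the reference.
has_explicit_tag() {
  name=${1##*/}            # last path component, e.g. pytorch:19.08-py3
  case $name in
    *:latest) return 1 ;;  # ambiguous tag
    *:?*)     return 0 ;;  # any other non-empty tag
    *)        return 1 ;;  # no tag at all: Docker would assume "latest"
  esac
}

if has_explicit_tag nvcr.io/nvidia/pytorch:19.08-py3; then
  echo "explicit tag"
else
  echo "ambiguous tag"
fi
```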
1. Log in to the NGC container registry.
# docker login nvcr.io
2. When prompted for your user name, enter the following text: $oauthtoken
The $oauthtoken username is a special user name that indicates that you will authenticate with an API key and not a username and password.
3. When prompted for your password, enter your API key.
Tip: When you get your API key, copy it to the clipboard so that you can paste the API key into the command shell when you are prompted for your password.
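For scripted logins, the interactive prompts can be avoided. The sketch below assumes your API key is stored in an environment variable named NGC_API_KEY (an assumption made for this example) and uses Docker's --password-stdin option so the key does not appear on the command line. ngc_login is a hypothetical helper, and the DOCKER variable exists only so the function can be dry-run with a stand-in command.

```shell
# ngc_login: hypothetical helper for a non-interactive registry login.
# NGC_API_KEY is an assumed variable name holding your API key.
ngc_login() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  # The user name is the literal string $oauthtoken, so it is single-quoted
  # to stop the shell from expanding it as a variable.
  printf '%s' "$NGC_API_KEY" |
    "${DOCKER:-docker}" login --username '$oauthtoken' --password-stdin nvcr.io
}
```

Export NGC_API_KEY in your session and then call ngc_login; the key is passed to Docker on standard input.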
Pulling A Container From The NGC container registry Using The Docker CLI
This section applies if you are using NGC via a cloud provider.
Before pulling an nvidia-docker container, ensure that the following prerequisites are met:
- You have read access to the registry space that contains the container.
- You are logged in to the NGC container registry as explained above.
- You are a member of the docker group, which enables you to use Docker commands.
To browse the available containers in the NGC container registry, use a web browser to log in to your NGC container registry account on the website, https://ngc.nvidia.com.
1. Pull the container that you want from the registry. For example, to pull the PyTorch container:
# docker pull nvcr.io/nvidia/pytorch:19.08-py3
2. List the Docker images on your system to confirm that the container was pulled.
# docker images
3. For more information pertaining to your specific container, refer to the /workspace/README.md file inside the container.
After pulling a container, you can run jobs in the container to run scientific workloads, train neural networks, deploy deep learning models, or perform AI analytics.
Pulling A Container Using The NGC container registry Web Interface
This task assumes:
- You have a cloud instance system and it is connected to the Internet.
- Your instance has Docker and nvidia-docker installed.
- You have browser access to the NGC container registry at https://ngc.nvidia.com and your account is activated.
- You want to pull a container onto your cloud instance.
- Log into the NGC container registry at https://ngc.nvidia.com.
- Click Registry in the left navigation. Browse the NGC container registry page to determine which Docker repositories and tags are available to you.
- Click one of the repositories to view information about that container image as well as the available tags that you will use when running the container.
- In the Pull column, click the icon to copy the Docker pull command.
- Open a command prompt and paste the Docker pull command. The pulling of the container image begins. Ensure the pull completes successfully.
- After you have the Docker container file on your local system, load the container into your local Docker registry.
- Verify that the image is loaded into your local Docker registry.
# docker images
For more information pertaining to your specific container, refer to the /workspace/README.md file inside the container.
Running A Container Pulled From The NGC Registry
To run a container, you must issue the nvidia-docker run command, specifying the registry, repository, and tags.
- As a user, run the container interactively.
# nvidia-docker run -it --rm -v local_dir:container_dir nvcr.io/nvidia/<repository>:<xx.xx>
The following example runs the August 2019 release (19.08) of the pytorch container in interactive mode. The container is automatically removed when the user exits the container.
# nvidia-docker run -it --rm nvcr.io/nvidia/pytorch:19.08-py3
Key Concepts – nvidia-docker
When you run the nvidia-docker run command:
- The Docker Engine loads the image into a container which runs the software.
- You define the runtime resources of the container by including additional flags and settings that are used with the command. These flags and settings are described in the following sections.
- The GPUs are explicitly defined for the Docker container (defaults to all GPUs, can be specified using NV_GPU environment variable).
Specifying A User
Unless otherwise specified, the user inside the container is the root user.
When running within the container, files created on the host operating system or network volumes can be accessed by the root user. This is unacceptable for some users and they will want to set the ID of the user in the container. For example, to set the user in the container to be the currently running user, issue the following:
# nvidia-docker run -ti --rm -u $(id -u):$(id -g) nvcr.io/nvidia/<repository>:<tag>
Typically, this results in warnings due to the fact that the specified user and group do not exist in the container. You might see a message similar to the following:
groups: cannot find name for group ID 1000
I have no name!@c177b61e5a93:/workspace$
The warning can usually be ignored.
Setting The Remove Flag
By default, Docker containers remain on the system after being run. Repeated pull or run operations use up more and more space on the local disk, even after exiting the container. Therefore, it is important to clean up the nvidia-docker containers after exiting.
Note: Do not use the --rm flag if you have made changes to the container that you want to save, or if you want to access job logs after the run finishes.
To automatically remove a container when exiting, add the --rm flag to the run command.
# nvidia-docker run --rm nvcr.io/nvidia/<repository>:<tag>
Setting The Interactive Flag
By default, containers run in batch mode; that is, the container is run once and then exited without any user interaction. Containers can also be run in interactive mode as a service.
To run in interactive mode, add the -ti flag to the run command.
# nvidia-docker run -ti --rm nvcr.io/nvidia/<repository>:<tag>
Setting The Volumes Flag
No data sets are included with the containers. If you want to use data sets, you need to mount volumes into the container from the host operating system. For more information, see Manage data in containers.
Typically, you would use either Docker volumes or host data volumes. The primary difference between host data volumes and Docker volumes is that Docker volumes are private to Docker and can only be shared amongst Docker containers. Docker volumes are not visible from the host operating system, and Docker manages the data storage. Host data volumes are any directory that is available from the host operating system. This can be your local disk or network volumes.
Mount a directory /raid/imagedata on the host operating system as /images in the container.
# nvidia-docker run -ti --rm -v /raid/imagedata:/images nvcr.io/nvidia/<repository>:<tag>
Mount a local docker volume named data (must be created if not already present) in the container as /imagedata.
# nvidia-docker run -ti --rm -v data:/imagedata nvcr.io/nvidia/<repository>:<tag>
Setting The Mapping Ports Flag
Applications such as Deep Learning GPU Training System™ (DIGITS) open a port for communications. You can control whether that port is open only on the local system or is available to other computers on the network outside of the local system.
Using DIGITS as an example, in DIGITS 5.0 starting in container image 16.12, the DIGITS server is open on port 5000 by default. However, after the container is started, you may not easily know the IP address of that container. To reach the server without knowing the container’s IP address, you can choose one of the following ways:
- Expose the port using the local system network stack (--net=host) where port 5000 of the container is made available as port 5000 of the local system.
- Map the port (-p 8080:5000) where port 5000 of the container is made available as port 8080 of the local system.
In either case, users outside the local system have no visibility that DIGITS is running in a container. Without publishing the port, the port is still available from the host, however not from the outside.
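The two options can be written out as run commands. The sketch below only prints them; digits_run_cmd is a hypothetical helper, and nvcr.io/nvidia/digits:<tag> is a placeholder reference whose tag you would replace with a real one.

```shell
# digits_run_cmd: hypothetical helper that PRINTS a run command for either
# networking option. <tag> is a placeholder, not a real tag.
digits_run_cmd() {
  case ${1:-} in
    host) net_flag='--net=host' ;;          # share the host network stack
    map)  net_flag="-p ${2:-8080}:5000" ;;  # host port (default 8080) -> container port 5000
    *)    echo "usage: digits_run_cmd host|map [host_port]" >&2; return 1 ;;
  esac
  printf 'nvidia-docker run --rm %s nvcr.io/nvidia/digits:<tag>\n' "$net_flag"
}

digits_run_cmd host
# prints: nvidia-docker run --rm --net=host nvcr.io/nvidia/digits:<tag>
digits_run_cmd map 8080
# prints: nvidia-docker run --rm -p 8080:5000 nvcr.io/nvidia/digits:<tag>
```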
Setting The Shared Memory Flag
Certain applications, such as PyTorch™ and the Microsoft® Cognitive Toolkit™, use shared memory buffers to communicate between processes. Shared memory can also be required by single-process applications, such as MXNet™ and TensorFlow™, which use the NVIDIA® Collective Communications Library™ (NCCL).
By default, Docker containers are allotted 64MB of shared memory. This can be insufficient, particularly when using all 8 GPUs. To increase the shared memory limit to a specified size, for example 1GB, include the --shm-size=1g flag in your docker run command.
Alternatively, you can specify the --ipc=host flag to re-use the host’s shared memory space inside the container. This latter approach has security implications, however, as any data in shared memory buffers could be visible to other containers.
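As a sketch of how the two choices appear on the command line, the hypothetical helper shm_flag below returns either a sized --shm-size flag or --ipc=host, and the final line prints (without running) a command that uses it. The image reference is the same placeholder used elsewhere in this article.

```shell
# shm_flag: hypothetical helper that prints the shared-memory flag.
#   shm_flag host  -> share the host IPC namespace
#   shm_flag <N>   -> allot N gigabytes of shared memory
shm_flag() {
  case ${1:-} in
    host)          printf '%s\n' '--ipc=host' ;;
    ''|*[!0-9]*)   echo "usage: shm_flag host|<size in GB>" >&2; return 1 ;;
    *)             printf '%s\n' "--shm-size=${1}g" ;;
  esac
}

printf 'nvidia-docker run --rm %s nvcr.io/nvidia/<repository>:<tag>\n' "$(shm_flag 1)"
# prints: nvidia-docker run --rm --shm-size=1g nvcr.io/nvidia/<repository>:<tag>
```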
Setting The Restricting Exposure Of GPUs Flag
From inside the container, the scripts and software are written to take advantage of all available GPUs. To coordinate the usage of GPUs at a higher level, you can use this flag to restrict the exposure of GPUs from the host to the container. For example, if you only want GPU 0 and GPU 1 to be seen in the container, you would issue the following:
# NV_GPU=0,1 nvidia-docker run ...
This flag creates a temporary environment variable that restricts which GPUs are used.
Specified GPUs are defined per container using the Docker device-mapping feature, which is currently based on Linux cgroups.
The state of an exited container is preserved indefinitely if you do not pass the --rm flag to the nvidia-docker run command. You can list all of the saved exited containers and their size on the disk with the following command:
# docker ps --all --size --filter status=exited
The container size on the disk depends on the files created during the container execution; therefore, exited containers usually take only a small amount of disk space.
You can permanently remove an exited container by issuing:
# docker rm [CONTAINER ID]
By saving the state of containers after they have exited, you can still interact with them using the standard Docker commands. For example:
You can examine logs from a past execution by issuing the docker logs command.
# docker logs 9489d47a054f
You can extract files using the docker cp command.
# docker cp
You can restart a stopped container using the docker restart command.
# docker restart <container name>
You can save your changes by creating a new image using the docker commit command. For more information, see Example 3: Customizing a Container using docker commit.
Note: Use care when committing Docker container changes, as data files created during use of the container will be added to the resulting image. In particular, core dump files and logs can dramatically increase the size of the resulting image.