How to monitor NVIDIA GPU metrics with Elastic Observability
Graphics processing units, or GPUs, aren’t just for PC gaming. Today, GPUs are used to train neural networks, simulate computational fluid dynamics, mine Bitcoin, and process workloads in data centers. And they are at the heart of most high-performance computing systems, making the monitoring of GPU performance in today's data centers just as important as monitoring CPU performance.
With that in mind, let's take a look at how to use Elastic Observability together with NVIDIA’s GPU monitoring tools to observe and optimize GPU performance.
Dependencies
To get NVIDIA GPU metrics up and running, we will need to build NVIDIA GPU monitoring tools from source code (Go). And, yes, we’ll need an NVIDIA GPU. AMD and other GPU types use different Linux drivers and monitoring tools, so we’ll have to cover them in a separate post.
NVIDIA GPUs are available from many cloud providers like Google Cloud and Amazon Web Services (AWS). For this post, we are using an instance running on Genesis Cloud.
Let’s start by installing the NVIDIA Data Center GPU Manager (DCGM) per the installation section of NVIDIA’s DCGM Getting Started Guide for Ubuntu 18.04. Note: While following the guide, pay special attention to replacing the <architecture> parameter with our own. We can find our architecture using the uname command.
uname -a
The response tells us that x86_64 is our architecture, so step 1 in the Getting Started Guide becomes:
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
There is also a typo in step 2: remove the trailing > from $distribution.
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
After installation, we should be able to see our GPU details by running the nvidia-smi command.
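If you prefer a compact check over the full status table, nvidia-smi can also print selected fields as CSV (a quick sketch; nvidia-smi --help-query-gpu lists all queryable fields):
# Print a few GPU properties as CSV instead of the full status table
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv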
NVIDIA gpu-monitoring-tools
To build NVIDIA’s gpu-monitoring-tools, we’ll need to install Golang. So let’s take care of that now.
cd /tmp
wget https://golang.org/dl/go1.15.7.linux-amd64.tar.gz
sudo mv go1.15.7.linux-amd64.tar.gz /usr/local/
cd /usr/local/
sudo tar -zxf go1.15.7.linux-amd64.tar.gz
sudo rm go1.15.7.linux-amd64.tar.gz
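The tarball unpacks Go under /usr/local/go, which is not on the default PATH; the build step below passes that path explicitly, but a quick sanity check doesn't hurt:
# Confirm the Go toolchain unpacked where we expect it
/usr/local/go/bin/go version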
OK, it’s time to finish up the NVIDIA setup by installing NVIDIA’s gpu-monitoring-tools from GitHub.
cd /tmp
git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
cd gpu-monitoring-tools/
sudo env "PATH=$PATH:/usr/local/go/bin" make install
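If the build succeeds, the dcgm-exporter binary should now be installed on the system PATH. A quick way to confirm before moving on:
# Confirm the dcgm-exporter binary is on the PATH
which dcgm-exporter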
Metricbeat
We are now ready to install Metricbeat. Check elastic.co for the latest Metricbeat version and adjust the version number in the commands below accordingly.
cd /tmp
wget https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.10.2-amd64.deb
sudo dpkg -i metricbeat-7.10.2-amd64.deb # 7.10.2 is the version number
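A quick check that the package installed the version we expect:
# Verify the installed Metricbeat version
metricbeat version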
Elastic Cloud
Okay, let's get the Elastic Stack up and running. We will need a home for our new GPU monitoring data. For this, we will create a new deployment on Elastic Cloud. If you’re not an existing Elastic Cloud customer, you can sign up for a free 14-day trial. Alternatively, you can set up your own deployment locally.
Next, create a new Elastic Observability deployment on Elastic Cloud.
Once your cloud deployment is up and running, note its Cloud ID and authentication credentials — we’ll need them for our upcoming Metricbeat configuration.
Configuration
Metricbeat’s configuration file is located at /etc/metricbeat/metricbeat.yml. Open it in your favorite editor and edit the cloud.id and cloud.auth parameters to match your deployment.
Example Metricbeat configuration changes using the Cloud ID and credentials from the deployment created above:
cloud.id: "staging:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvJDM4ODZkYmUwMWNjODQ2NDM4YjRlNzg5OWEyZDAwNGM5JDBiMTc0YzYyMTVlYTQwYWQ5M2NmMGY4MjVhNzJmOGRk" cloud.auth: "elastic:J7KYiDku2wP7DFr62zV4zL4y"
Metricbeat's input configuration is modular. NVIDIA's gpu-monitoring-tools expose GPU metrics in Prometheus format, so let’s go ahead and enable Metricbeat's Prometheus module now.
sudo metricbeat modules enable prometheus
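The module's collector settings live in /etc/metricbeat/modules.d/prometheus.yml. A minimal sketch of what we want it to contain, assuming dcgm-exporter will listen on localhost:9090 as shown later in this post (the module's defaults are already close to this):
# /etc/metricbeat/modules.d/prometheus.yml
- module: prometheus
  period: 10s
  hosts: ["localhost:9090"]
  metrics_path: /metrics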
We can confirm our Metricbeat configuration is correct using Metricbeat's test and modules commands.
sudo metricbeat test config
sudo metricbeat test output
sudo metricbeat modules list
If your configuration tests aren’t successful, check out our Metricbeat troubleshooting guide.
We finish up the Metricbeat configuration by running its setup command, which will load some default dashboards and set up index mappings. The setup command normally takes a few minutes to finish.
sudo metricbeat setup
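If you later need to reload only the prebuilt dashboards, setup can also be limited to individual pieces, for example:
# Load only the prebuilt dashboards, skipping the other setup steps
sudo metricbeat setup --dashboards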
Exporting metrics
It is time to start exporting metrics. Let’s start NVIDIA’s dcgm-exporter.
dcgm-exporter --address localhost:9090

# Output
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
Note: You can ignore the DCP warning in the output above.
The set of metrics that dcgm-exporter collects is defined in /etc/dcgm-exporter/default-counters.csv, which contains 38 metrics by default. For the complete list of possible values, we can check the DCGM Library API Reference Guide.
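Before starting Metricbeat, we can confirm the exporter is actually serving DCGM metrics. From a second console (dcgm-exporter is running in the foreground), a quick check, assuming the default /metrics path:
# Fetch the Prometheus endpoint and show a few DCGM metric samples
curl -s localhost:9090/metrics | grep "^DCGM_" | head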
In another console, we’ll start Metricbeat.
sudo metricbeat -e
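Running with -e keeps Metricbeat in the foreground and logs to stderr, which is handy for a first run. For a longer-lived setup, the deb package also installs a systemd unit you can use instead:
# Run Metricbeat as a service rather than in the foreground
sudo systemctl enable metricbeat
sudo systemctl start metricbeat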
Now we can head over to our Kibana instance and refresh the metricbeat-* index pattern. To do that, navigate to Stack Management > Kibana > Index Patterns, select the metricbeat-* index pattern from the list, and click Refresh field list.
Our new GPU metrics are now available in Kibana, with field names prefixed by prometheus.metrics.DCGM_. You can browse the new fields in Discover.
Congratulations! We are now ready to analyze our GPU metrics in Elastic Observability.
For example, you can compare GPU and CPU performance side by side in Metrics Explorer.
You can also find GPU utilization hotspots in the Inventory view.
GPU monitoring considerations
We hope you found this post useful. These are just a few monitoring options; Elastic Observability lets you tackle whatever observability goals you have. Per NVIDIA, here are a few other GPU metrics worth monitoring (see the quick nvidia-smi sketch after this list):
- GPU temperature: Check for hot spots
- GPU power usage: Higher than expected power usage => possible HW issues
- Current clock speeds: Lower than expected => power capping or HW problems
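For a quick ad hoc look at these values outside of Kibana, nvidia-smi can query them directly (a sketch; the field names come from nvidia-smi --help-query-gpu):
# Sample temperature, power draw, and SM clock speed every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.sm --format=csv -l 5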
And if you ever need to simulate GPU load, you can use the dcgmproftester10 command.
dcgmproftester10 --no-dcgm-validation -t 1004 -d 30
You can take your monitoring further by using Elastic alerting to automate NVIDIA’s recommendations. Then take it to the next level by finding anomalies in your GPU infrastructure with machine learning. If you’re not an existing Elastic Cloud customer and would like to try out the steps in this blog, you can sign up for a free 14-day trial.