The NVIDIA System Management Interface, ‘nvidia-smi’, is used to monitor the performance and utilization of NVIDIA GPU devices. Users can list the four GPUs on a node and their UUIDs as shown below:
[johnchris@g009 ~]$ nvidia-smi --list-gpus
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-d4562c18-3d01-4110-d707-48b82ba5de9d)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-9c44e4aa-2342-c342-437b-2f435f4aa581)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-5da48514-1b80-9761-08a6-1818f6a51f67)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-81adc8d7-7c25-3e6b-e732-d72679baf578)
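Beyond listing devices, nvidia-smi can also report selected fields in a compact, script-friendly form. As a minimal sketch (the fields chosen here are illustrative), the following query prints each GPU’s index, compute utilization, and memory usage in CSV format:
[johnchris@g009 ~]$ nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv
This prints one CSV line per GPU, which is convenient for logging or for quick checks inside job scripts.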
The command ‘watch -n 1 nvidia-smi’ re-runs the GPU usage report every second. Users can SSH to a GPU node while they have a job running on it. The image below shows the g009 node with four Tesla V100 GPUs installed, each with 16 GB of GPU memory (64 GB total). Memory usage is at about 25% of capacity (16748 MiB of 65356 MiB, roughly 16.35 GB of 64 GB), while GPU utilization (compute activity) on the four GPUs is 98%, 100%, 95%, and 96%, respectively.
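As a sketch of that workflow (assuming the cluster uses the Slurm scheduler, which is not stated above, and using an illustrative ‘login’ hostname), a user could locate the node their job was assigned to and then watch the GPUs from there:
[johnchris@login ~]$ squeue -u $USER
[johnchris@login ~]$ ssh g009
[johnchris@g009 ~]$ watch -n 1 nvidia-smi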
For full details on this command, see the manual page: man nvidia-smi
Note: If GPU utilization is unexpectedly 0%, the user will need to determine why the job is not using the GPU.
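One quick check, offered here as an example rather than a prescribed diagnostic, is to ask nvidia-smi which compute processes are currently attached to the GPUs; if the job’s process does not appear in the list, it is likely running on the CPU only:
[johnchris@g009 ~]$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv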