0/N nodes are available: N Insufficient nvidia.com/gpu

It can happen that you launch a pod that requests a GPU, as documented here, and instead of running the pod remains in the Pending state.
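
For reference, a minimal pod spec that requests a single GPU looks roughly like the sketch below; the pod name, container name, and image are placeholders, not taken from the documentation referenced above:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container          # placeholder name
    image: <cuda_enabled_image>   # any image that ships the CUDA user-space libraries
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1         # ask the scheduler for one GPU

If no node advertises an allocatable nvidia.com/gpu resource, such a pod stays in Pending.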

If you describe the pod as follows:

$kubectl -n <namespace> describe pod <pod_name>

you may see the following error:

0/N nodes are available: N Insufficient nvidia.com/gpu

where N is the number of nodes in your cluster.

In this case, it is recommended to:

  • Make sure you have enabled the device plugin feature gate (DevicePlugins)

  • Check the output of the following command and verify that nvidia.com/gpu appears under the node's Capacity and Allocatable sections (see the snippet right after this list):

    $kubectl describe node <gpu_node_name>
    
  • Check the k8s-device-plugin container logs by logging into the GPU node and running the following command:

    $docker logs <k8s-device-plugin container name>
    
  • Check the output of the following command on the GPU node:

    $nvidia-smi -a
    
  • Check your Docker configuration file on the GPU node (e.g. /etc/docker/daemon.json) and restart the Docker daemon after any change

  • Check the kubelet logs on the node:

    $sudo journalctl -r -u kubelet
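
As a quick sanity check of what the device plugin has advertised, you can look for the GPU resource in the node description; a sketch, where <gpu_node_name> is a placeholder:

    $kubectl describe node <gpu_node_name> | grep -i "nvidia.com/gpu"

If the value is 0 or the resource is missing from Capacity and Allocatable, the kubelet on that node has not registered any GPUs, which usually points to a device plugin or Docker runtime problem such as the one in the example below.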
    

Example

Looking at the k8s-device-plugin container logs, you may find the following output:

Loading NVML
Failed to initialize NVML: could not load NVML library.
If this is a GPU node, did you set the docker default runtime to `nvidia`?
You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

You can solve this by setting the Docker default runtime to nvidia in the Docker configuration file on the GPU node and restarting the Docker daemon.
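
For reference, a daemon.json that makes nvidia the default runtime looks like the sketch below; the runtime path is an assumption and depends on where nvidia-container-runtime is installed on your node:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

After restarting the Docker daemon (for example with $sudo systemctl restart docker), you can confirm with $docker info that the default runtime is now nvidia; the k8s-device-plugin container should then be able to load NVML and the node should advertise nvidia.com/gpu again.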