Could not load NVML library: libnvidia-ml.so.1 in K3S cluster #1011

Open
santurini opened this issue Oct 24, 2024 · 0 comments

I am following @elezar's guide on enabling MPS in a Kubernetes cluster (I am using k3s). After deploying the gpu-operator, the nvidia-device-plugin-ctr container fails to start.

Similar Issues

This looks similar to #478, so I would also ask @klueska to take a look. I checked that libnvidia-ml.so.1 is present on the machine, and it is, at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
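
For reference, the host-side check can be reproduced with something like the sketch below. The helper name `find_nvml` is hypothetical, and `/usr/lib64` is only an assumed alternative location; the first directory is the one reported above. Note that this only shows the library exists on the host, not that it is visible inside the device plugin container.

```shell
# Hypothetical helper: print the first path among the given directories
# that contains libnvidia-ml.so.1; return non-zero if none does.
find_nvml() {
    for dir in "$@"; do
        if [ -e "$dir/libnvidia-ml.so.1" ]; then
            printf '%s\n' "$dir/libnvidia-ml.so.1"
            return 0
        fi
    done
    return 1
}

# Host-side check: the directory reported in this issue, plus an assumed
# alternative used by some distributions.
find_nvml /usr/lib/x86_64-linux-gnu /usr/lib64 \
    || echo "libnvidia-ml.so.1 not found in the checked directories"
```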

Executed commands

helm install \
      -n gpu-operator \
      --generate-name \
      --create-namespace \
      --set devicePlugin.enabled=false \
      --set gfd.enabled=false \
      nvidia/gpu-operator

cat << EOF > /tmp/dp-mps-10.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set config.default=mps10 \
    --set-file config.map.mps10=/tmp/dp-mps-10.yaml

Failing pod logs

I0117 15:43:15.906553 1 main.go:256] Retrieving plugins.
W0117 15:43:15.907261 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0117 15:43:15.907336 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0117 15:43:15.907377 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0117 15:43:15.907388 1 factory.go:115] Incompatible platform detected
E0117 15:43:15.907392 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0117 15:43:15.907396 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0117 15:43:15.907401 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0117 15:43:15.907405 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

NVIDIA libraries

||/ Name                                  Version                     Architecture Description
+++-=====================================-===========================-============-=====================================================
un  libgldispatch0-nvidia                 <none>                      <none>       (no description available)
ii  libnvidia-cfg1-535-server:amd64       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                    <none>                      <none>       (no description available)
un  libnvidia-compute                     <none>                      <none>       (no description available)
ii  libnvidia-compute-535-server:amd64    535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA libcompute package
ii  libnvidia-container-tools             1.16.2-1                    amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.16.2-1                    amd64        NVIDIA container runtime library
un  libnvidia-decode                      <none>                      <none>       (no description available)
ii  libnvidia-decode-535-server:amd64     535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                      <none>                      <none>       (no description available)
ii  libnvidia-encode-535-server:amd64     535.183.06-0ubuntu0.20.04.1 amd64        NVENC Video Encoding runtime library
un  libnvidia-ml1                         <none>                      <none>       (no description available)
un  nvidia-384                            <none>                      <none>       (no description available)
un  nvidia-390                            <none>                      <none>       (no description available)
un  nvidia-compute-utils                  <none>                      <none>       (no description available)
ii  nvidia-compute-utils-535-server       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime              <none>                      <none>       (no description available)
un  nvidia-container-runtime-hook         <none>                      <none>       (no description available)
ii  nvidia-container-toolkit              1.15.0-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.15.0-1                    amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-535-server                535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                    <none>                      <none>       (no description available)
un  nvidia-driver-535-server              <none>                      <none>       (no description available)
un  nvidia-firmware-535-535.183.06        <none>                      <none>       (no description available)
ii  nvidia-firmware-535-server-535.183.06 535.183.06-0ubuntu0.20.04.1 amd64        Firmware files used by the kernel module
un  nvidia-headless                       <none>                      <none>       (no description available)
ii  nvidia-headless-535-server            535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA headless metapackage
ii  nvidia-headless-no-dkms-535-server    535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA headless metapackage - no DKMS
un  nvidia-kernel-common                  <none>                      <none>       (no description available)
ii  nvidia-kernel-common-535-server       535.183.06-0ubuntu0.20.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                  <none>                      <none>       (no description available)
ii  nvidia-kernel-source-535-server       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA kernel source package
un  nvidia-opencl-icd                     <none>                      <none>       (no description available)
un  nvidia-persistenced                   <none>                      <none>       (no description available)
un  nvidia-smi                            <none>                      <none>       (no description available)
un  nvidia-utils                          <none>                      <none>       (no description available)
ii  nvidia-utils-535-server               535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA Server Driver support binaries 

NVIDIA-SMI output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:07:00.0 Off |                  N/A |
| 30%   36C    P0             114W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:08:00.0 Off |                  N/A |
| 30%   29C    P0             108W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:45:00.0 Off |                  N/A |
| 30%   34C    P0             116W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:46:00.0 Off |                  N/A |
| 30%   39C    P0             110W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        Off | 00000000:89:00.0 Off |                  N/A |
| 30%   36C    P0             111W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090        Off | 00000000:8A:00.0 Off |                  N/A |
| 30%   34C    P0             111W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090        Off | 00000000:C5:00.0 Off |                  N/A |
| 30%   42C    P0             110W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090        Off | 00000000:C6:00.0 Off |                  N/A |
| 30%   35C    P0             108W / 350W |      0MiB / 24576MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Docker config

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Containerd config

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
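
Since k3s ships its own embedded containerd, it may be worth confirming which config file is actually in effect. The sketch below prints the `default_runtime_name` declared in each candidate file. The helper name `default_runtime` is hypothetical, and the k3s path is an assumption based on the usual k3s layout; I have not confirmed that it applies to this node.

```shell
# Hypothetical helper: print the default_runtime_name value declared in a
# containerd config file (prints nothing if the key is absent).
default_runtime() {
    sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1"
}

# Candidate config locations: the stock containerd path and the path that
# k3s typically uses for its embedded containerd (assumed, not verified here).
for cfg in /etc/containerd/config.toml \
           /var/lib/rancher/k3s/agent/etc/containerd/config.toml; do
    if [ -r "$cfg" ]; then
        echo "$cfg: $(default_runtime "$cfg")"
    fi
done
```

If only the stock path reports `nvidia`, the config shown above may not be the one the k3s containerd reads.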