I am following @elezar's guide on how to enable MPS in a Kubernetes cluster (I am using k3s). After deploying the gpu-operator, the nvidia-device-plugin-ctr container fails to start.
Similar Issues
This is similar to #478, so I would also ask @klueska to take a look and help. I also checked whether libnvidia-ml.so.1 is present on the machine, and it is, located at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
Failing pod logs
I0117 15:43:15.906553 1 main.go:256] Retrieving plugins.
W0117 15:43:15.907261 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0117 15:43:15.907336 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0117 15:43:15.907377 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0117 15:43:15.907388 1 factory.go:115] Incompatible platform detected
E0117 15:43:15.907392 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0117 15:43:15.907396 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0117 15:43:15.907401 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0117 15:43:15.907405 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
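The factory.go:107 line above reports that the dynamic loader could not open libnvidia-ml.so.1 from inside the plugin container, even though the file exists on the host. A minimal sketch for reproducing the same kind of load attempt in a given environment follows. It assumes a Python interpreter is available there, which may not hold for the plugin image itself, and the helper name `can_load` is hypothetical.

```python
import ctypes


def can_load(library_name="libnvidia-ml.so.1"):
    """Try to load a shared library by name via the dynamic loader.

    Returns (True, "loaded") on success, or (False, <loader error text>)
    if the loader cannot open it. `can_load` is a hypothetical helper,
    not part of any NVIDIA tooling.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError as err:
        return False, str(err)
    return True, "loaded"


if __name__ == "__main__":
    ok, detail = can_load()
    print(f"libnvidia-ml.so.1 loadable: {ok} ({detail})")
```

Running this on the host and then in a container started with the same runtime settings can help show whether the library is visible in one environment but not the other.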
NVIDIA libraries
||/ Name Version Architecture Description
+++-=====================================-===========================-============-=====================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-535-server:amd64 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-compute <none> <none> (no description available)
ii libnvidia-compute-535-server:amd64 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.16.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.16.2-1 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-535-server:amd64 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA Video Decoding runtime libraries
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-535-server:amd64 535.183.06-0ubuntu0.20.04.1 amd64 NVENC Video Encoding runtime library
un libnvidia-ml1 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-compute-utils <none> <none> (no description available)
ii nvidia-compute-utils-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
un nvidia-container-runtime <none> <none> (no description available)
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base
ii nvidia-dkms-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
un nvidia-driver-535-server <none> <none> (no description available)
un nvidia-firmware-535-535.183.06 <none> <none> (no description available)
ii nvidia-firmware-535-server-535.183.06 535.183.06-0ubuntu0.20.04.1 amd64 Firmware files used by the kernel module
un nvidia-headless <none> <none> (no description available)
ii nvidia-headless-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA headless metapackage
ii nvidia-headless-no-dkms-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA headless metapackage - no DKMS
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-535-server 535.183.06-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-535-server 535.183.06-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver support binaries
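In the listing above, the installed libnvidia-container packages are at 1.16.2-1 while the nvidia-container-toolkit packages are at 1.15.0-1. Whether that difference is related to the failure is not established here. The following sketch only groups installed (`ii`) packages by version so that such differences are easier to spot. The function name `installed_versions` is hypothetical, and the sample lines are copied from the listing.

```python
from collections import defaultdict


def installed_versions(dpkg_lines):
    """Group installed ('ii') package names by version from `dpkg -l` style lines."""
    versions = defaultdict(list)
    for line in dpkg_lines:
        fields = line.split()
        # Expected fields: status, name[:arch], version, architecture, description...
        if len(fields) >= 3 and fields[0] == "ii":
            name = fields[1].split(":")[0]  # drop an ":amd64" style suffix
            versions[fields[2]].append(name)
    return dict(versions)


sample = [
    "ii libnvidia-container-tools 1.16.2-1 amd64 NVIDIA container runtime library (command-line tools)",
    "ii libnvidia-container1:amd64 1.16.2-1 amd64 NVIDIA container runtime library",
    "un libnvidia-ml1 <none> <none> (no description available)",
    "ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit",
    "ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base",
]

for version, names in installed_versions(sample).items():
    print(version, names)
```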