I am having a similar issue. The plugin installs and my node has the capacity listed. However, no pod running on the GPU node can detect the device. Running kubectl exec into the plugin pod and then nvidia-smi returns "Failed to initialize NVML: Unknown Error". Running a pod with Python and attempting to use torch results in a similar issue.
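For anyone trying to reproduce this from inside a pod, the two checks described above (nvidia-smi and torch) could be scripted roughly as follows. This is only a minimal sketch: the function names and the status labels such as "nvml-init-failed" are hypothetical, and it assumes nvidia-smi is on the container's PATH and that torch may or may not be installed in the image.

```python
import shutil
import subprocess


def classify_nvidia_smi(returncode: int, output: str) -> str:
    """Turn an nvidia-smi result into a short status label (labels are hypothetical)."""
    # This is the message reported in this thread.
    if "Failed to initialize NVML" in output:
        return "nvml-init-failed"
    if returncode != 0:
        return "nvidia-smi-error"
    return "ok"


def check_gpu() -> str:
    """Run nvidia-smi in the current container and classify what it printed."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi-missing"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return classify_nvidia_smi(result.returncode, result.stdout + result.stderr)


def torch_sees_gpu():
    """Return torch.cuda.is_available(), or None when torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()
```

Calling check_gpu() and torch_sees_gpu() in the affected pod and posting both results would make it easier to tell whether the failure is at the NVML level or only in the framework.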
Here is the setup of my AKS cluster:
AKS version: 1.29.2
Node pools: 3 (system pool, general node pool, and GPU pool)
NVIDIA plugins tried: NVIDIA device plugin and GPU Operator
OS image: Ubuntu 22.04.4 LTS
Kernel version: 5.15.0-1068-azure
Container runtime: containerd://1.7.15-1
NVIDIA device plugin images tried: nvcr.io/nvidia/k8s-device-plugin:v0.15.0 and nvcr.io/nvidia/k8s-device-plugin:v0.16.0
Documentation used to create the GPU node pool: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#install-nvidia-device-plugin
Here is the issue:
As per the above document, if the NVIDIA device plugin is installed successfully, the Capacity section of the node should list the GPU as nvidia.com/gpu: 1. However, I did not see that when I described my GPU-enabled node.
I also tried the GPU Operator, but that did not help either.
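The capacity check described above can also be scripted instead of reading the kubectl describe output by eye. The following is a rough sketch, not an official tool: the helper names gpu_counts and fetch_nodes are hypothetical, and it assumes kubectl is installed and pointed at the AKS cluster.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(node_list: dict) -> dict:
    """Map node name -> (capacity, allocatable) for nvidia.com/gpu.

    A node that does not advertise the resource is reported as (0, 0),
    which is the symptom described in this issue.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        status = node.get("status", {})
        capacity = int(status.get("capacity", {}).get(GPU_RESOURCE, "0"))
        allocatable = int(status.get("allocatable", {}).get(GPU_RESOURCE, "0"))
        counts[name] = (capacity, allocatable)
    return counts


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json` against the current kube context."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Running gpu_counts(fetch_nodes()) should show a non-zero pair for the GPU node once the device plugin has registered the resource; (0, 0) on that node matches the behaviour reported here.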