I'm trying to prepare GPU worker nodes and enable GPU support on Kubernetes. I followed the steps in the README file (link), but the pods always remain Pending and never run. I tried the CUDA 10 sample image as in the tutorial and also switched to CUDA 12, but neither works.
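For context, the Helm-based install path in that README boils down to the following. This is a sketch assuming the NVIDIA k8s-device-plugin README's Helm instructions; the chart version shown is illustrative:

```sh
# Add the NVIDIA device plugin chart repo and install the plugin.
# The plugin runs as a DaemonSet and registers GPUs with each kubelet.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.16.1
```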
1. Quick Debug Information
OS/Version: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CUDA version: 12.2 (NVIDIA-SMI 535.183.01, Driver Version 535.183.01)
Server type: NVIDIA L40S (link)
kubectl version:
  Client Version: v1.30.3
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.30.0
minikube version: v1.33.1
Helm version: v3.15.3
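One environment detail worth flagging: a minikube cluster only sees the host GPU if it was started with GPU passthrough. A minimal sketch, assuming the docker driver and a host Docker daemon already configured with the NVIDIA Container Toolkit (the `--gpus` flag landed in minikube v1.32, so v1.33.1 should have it):

```sh
# Recreate the cluster with the host GPUs passed through to the
# minikube node; without this, the node has no GPU to advertise.
minikube delete
minikube start --driver=docker --container-runtime=docker --gpus=all
```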
2. Issue or feature description
```
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ---                  ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
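The `Insufficient nvidia.com/gpu` message means the scheduler sees a GPU capacity of 0 on the node, so the first check is whether the node advertises the resource at all. A quick sketch using standard kubectl (the backslash escaping in the jsonpath is required because the resource name contains dots):

```sh
# If the device plugin registered correctly, allocatable should show
# nvidia.com/gpu: 1 (or more); an empty result means it never registered.
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
kubectl describe node | grep -A 10 -i allocatable
```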
3. Information to attach
```
$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
gpu-demo-vectoradd   0/1     Pending   0          12h
gpu-operator-test    0/1     Pending   0          13h
gpu-operator-test1   0/1     Pending   0          13h
gpu-pod              0/1     Pending   0          13h
```
```
$ kubectl describe pod gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ww9jw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ww9jw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ---                  ----               -------
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```
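Given the capacity of 0 above, the next thing to inspect is the device plugin (or GPU operator) pods themselves. A sketch, where the namespace and label selector assume the README's default Helm install:

```sh
# The plugin DaemonSet must be Running on the GPU node, and its logs
# should list the GPUs it found; CrashLoopBackOff or "no devices found"
# here explains why the node advertises zero GPUs.
kubectl get pods -A | grep -i nvidia
kubectl logs -n nvidia-device-plugin \
  -l app.kubernetes.io/name=nvidia-device-plugin --tail=50
```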
Did you deploy the nvidia-device-plugin via Helm? If so, which Helm chart version are you using? I am currently facing the same problem after upgrading from 0.14.0 to 0.16.1.
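For reference, a quick way to see which chart version is actually deployed, and to pin back to the known-good one while debugging (the release and repo names are the common defaults from the README; adjust to your install):

```sh
# Show the deployed release and its chart version.
helm list -A | grep -i nvidia-device-plugin
# Roll back to the chart version that worked before the upgrade.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --version 0.14.0
```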
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.