
Is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu? #239

Open
tingweiwu opened this issue Jun 6, 2022 · 2 comments

Comments

@tingweiwu

When I use NVIDIA/k8s-device-plugin in my k8s cluster, I set NVIDIA_VISIBLE_DEVICES=all in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - args:
    - -c
    - top -b
    command:
    - /bin/sh
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    image: cuda:10.2-cudnn7-devel-ubuntu18.04
    name: test
    resources:
      limits:
        cpu: 150m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi

The devices.list file under /sys/fs/cgroup/devices/kubepods/burstable/podxxxxxx/xxxxxx/devices.list lists all GPU devices on this node.

I noticed that this GCE container-engine-accelerators plugin doesn't require nvidia-docker, so NVIDIA_VISIBLE_DEVICES may not work here.
So, is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu?


DavraYoung commented Apr 14, 2023

Check how GKE time-slicing works. I was able to share a single GPU across multiple workloads.

Here is my Terraform:

resource "google_container_node_pool" "gpu" {
  name     = "gpu"
  location = var.zone
  cluster  = var.cluster_name
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  initial_node_count = 1

  management {
    auto_repair  = "true"
    auto_upgrade = "true"
  }


  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
    ]
    # One GPU per node, time-shared so that up to two containers can be
    # scheduled onto the same physical GPU.
    guest_accelerator {
      type  = var.gpu_type
      count = 1
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
    image_type = "UBUNTU_CONTAINERD"

    labels = {
      env        = var.project
      node-group = "gpu"
      "cloud.google.com/gke-max-shared-clients-per-node" = "2"
    }

    preemptible  = true
    machine_type = "n1-standard-4"
    tags         = ["gke-node", "${var.cluster_name}-gke"]
    metadata     = {
      disable-legacy-endpoints = "true"
    }
  }
}

Notice the cloud.google.com/gke-max-shared-clients-per-node label.
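
For completeness, here is a minimal sketch of a pod that consumes a GPU from this time-shared pool (the pod name and image are placeholders, not from this thread; the node-group selector matches the label set on the node pool above). With max_shared_clients_per_gpu = 2, two such pods can land on the same physical GPU:

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-demo            # placeholder name
spec:
  nodeSelector:
    node-group: gpu                # node pool label from the Terraform above
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # each client still requests one GPU;
                                   # time-sharing lets two of them share it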


VelorumS commented Oct 5, 2023

@DavraYoung but how do you do it without time-sharing or multi-instance GPUs?

We were able to have all GPUs visible to all Docker containers running on the instance.

And it seems that in k8s setting nvidia.com/gpu: 0 works: http://www.bytefold.com/sharing-gpu-in-kubernetes/

You can set the nvidia.com/gpu value to 0 and the workload will still be able to see all the GPUs available on the instance. It also does not block the GPU in Kubernetes, so more workloads can be scheduled on that node.

resources:
  limits:
    nvidia.com/gpu: 0 # This will work fine and will not block your GPU for other workloads.
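
Combining that with the env var from the original question, a rough sketch of a pod that reserves no GPU but still sees them all might look like this (assuming the node runs the NVIDIA container runtime, which honours NVIDIA_VISIBLE_DEVICES; the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-observer               # placeholder name
spec:
  containers:
  - name: gpu-observer
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["/bin/sh", "-c", "nvidia-smi; sleep 3600"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES # honoured by the NVIDIA container runtime
      value: all
    resources:
      limits:
        nvidia.com/gpu: 0          # reserves nothing, so other workloads can
                                   # still be scheduled on this node's GPUs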
