
Is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu? #239

Open
tingweiwu opened this issue Jun 6, 2022 · 2 comments

Comments

@tingweiwu

When I use NVIDIA/k8s-device-plugin in my k8s cluster, I set NVIDIA_VISIBLE_DEVICES=all in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - args:
    - -c
    - top -b
    command:
    - /bin/sh
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    image: cuda:10.2-cudnn7-devel-ubuntu18.04
    name: test
    resources:
      limits:
        cpu: 150m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 200Mi

The devices.list file under /sys/fs/cgroup/devices/kubepods/burstable/podxxxxxx/xxxxxx/devices.list lists all GPU devices on this node.

I noticed that this GCE container-engine-accelerators plugin doesn't require nvidia-docker, so NVIDIA_VISIBLE_DEVICES may not work here.
So, is there a solution to make all GPU devices visible to a pod that does not request nvidia.com/gpu?


DavraYoung commented Apr 14, 2023

Check how GKE time-slicing works. I was able to share a single GPU across multiple workloads.

Here is my Terraform:

resource "google_container_node_pool" "gpu" {
  name     = "gpu"
  location = var.zone
  cluster  = var.cluster_name
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
  initial_node_count = 1

  management {
    auto_repair  = "true"
    auto_upgrade = "true"
  }


  node_config {
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
    ]
    # One GPU per node, time-shared so that up to two containers can be
    # scheduled onto the same physical GPU.
    guest_accelerator {
      type  = var.gpu_type
      count = 1
      gpu_sharing_config {
        gpu_sharing_strategy       = "TIME_SHARING"
        max_shared_clients_per_gpu = 2
      }
    }
    image_type = "UBUNTU_CONTAINERD"

    labels = {
      env        = var.project
      node-group = "gpu"
      "cloud.google.com/gke-max-shared-clients-per-node" = "2"
    }

    preemptible  = true
    machine_type = "n1-standard-4"
    tags         = ["gke-node", "${var.cluster_name}-gke"]
    metadata     = {
      disable-legacy-endpoints = "true"
    }
  }
}

Notice the cloud.google.com/gke-max-shared-clients-per-node label.
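
For completeness, here is a minimal sketch of a pod that consumes a GPU from this time-shared pool (the pod name and image are placeholders, not from this thread; the node-group selector matches the label set on the node pool above). With max_shared_clients_per_gpu = 2, two such pods can land on the same physical GPU:

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-demo            # placeholder name
spec:
  nodeSelector:
    node-group: gpu                # node pool label from the Terraform above
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1          # each client still requests one GPU;
                                   # time-sharing lets two of them share it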


VelorumS commented Oct 5, 2023

@DavraYoung but how do you do it without time-sharing or multi-instance GPUs?

We were able to have all GPUs visible to all Docker containers running on the instance.

And it seems that in k8s setting nvidia.com/gpu: 0 works: http://www.bytefold.com/sharing-gpu-in-kubernetes/

You can set the nvidia.com/gpu value to 0 and the workload will still be able to see all the GPUs available on the instance. It also does not block the GPU in Kubernetes, so more workloads can be scheduled on that node.

resources:
  limits:
    nvidia.com/gpu: 0 # This will work fine and will not block your GPU for other workloads.
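
Combining that with the env var from the original question, a rough sketch of a pod that reserves no GPU but still sees them all might look like this (assuming the node runs the NVIDIA container runtime, which honours NVIDIA_VISIBLE_DEVICES; the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-observer               # placeholder name
spec:
  containers:
  - name: gpu-observer
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["/bin/sh", "-c", "nvidia-smi; sleep 3600"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES # honoured by the NVIDIA container runtime
      value: all
    resources:
      limits:
        nvidia.com/gpu: 0          # reserves nothing, so other workloads can
                                   # still be scheduled on this node's GPUs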
