Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security Context Misconfiguration with vGPU Nodes in NVIDIA Device Plugin Helm Chart #854

Open
sbathgate opened this issue Jul 31, 2024 · 1 comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@sbathgate
Copy link

Description

In the NVIDIA Device Plugin Helm chart (v0.16.1), when using a ConfigMap strategy, all nodes are incorrectly assigned elevated privileges, regardless of their MIG strategy configuration. This is due to a flaw in the template logic that prevents the actual content of the ConfigMap from being evaluated.

Current Behavior

  1. When deploying the chart with migStrategy: none and no ConfigMap, the correct security context is applied.
  2. However, when using a ConfigMap strategy, even with a default migStrategy: none:
    config:
      map:
        default: |-
          version: v1
          flags:
            migStrategy: none
        mig-single: |-
          version: v1
          flags:
            migStrategy: single
      default: default
    All nodes receive elevated privileges, regardless of their actual MIG strategy.

Expected Behavior

Nodes should receive the appropriate security context based on their actual MIG strategy configuration, especially those using the default migStrategy: none.

Impact

This issue unnecessarily elevates privileges on all nodes when using a ConfigMap strategy, potentially compromising security, particularly in mixed GPU environments.

Steps to Reproduce

  1. Set up a cluster with both MIG and vGPU nodes.
  2. Deploy the NVIDIA Device Plugin using any ConfigMap strategy.
  3. Observe that all nodes receive elevated privileges.

Code Analysis

The issue begins in the nvidia-device-plugin.allPossibleMigStrategiesAreNone template. Here's the relevant part:

{{- if .Values.migStrategy -}}
  {{- if ne .Values.migStrategy "none" -}}
    {{- $result = false -}}
  {{- end -}}
{{- else if eq (include "nvidia-device-plugin.hasConfigMap" .) "true" -}}
    {{- $result = false -}}
{{- else -}}
  {{- range $name, $contents := $.Values.config.map -}}
    {{- $config := $contents | fromYaml -}}
    {{- if $config.flags -}}
      {{- if ne $config.flags.migStrategy "none" -}}
        {{- $result = false -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

The initial issue is in this section:

{{- else if eq (include "nvidia-device-plugin.hasConfigMap" .) "true" -}}
    {{- $result = false -}}

The hasConfigMap function is defined as:

{{/*
Check if there is a ConfigMap in use or not
*/}}
{{- define "nvidia-device-plugin.hasConfigMap" -}}
{{- $result := false -}}
{{- if ne (include "nvidia-device-plugin.configMapName" .) "" -}}
  {{- $result = true -}}
{{- end -}}
{{- $result -}}
{{- end }}

where the configMapName function is defined as:

{{- define "nvidia-device-plugin.configMapName" -}}
{{- $result := "" -}}
{{- if .Values.config.name -}}
  {{- $result = .Values.config.name -}}
{{- else if not (empty .Values.config.map) -}}
  {{- $result = printf "%s-%s" (include "nvidia-device-plugin.fullname" .) "configs" -}}
{{- end -}}
{{- $result -}}
{{- end -}}

Root Cause

  1. The hasConfigMap function always returns true if there's any content in config.map within the values.yaml, without examining its contents.
  2. This causes the template to set $result = false whenever a ConfigMap is present, regardless of its content.
  3. The actual checking of the ConfigMap's content (the range loop) is never reached when a ConfigMap is defined.

Additional Context

  • This issue affects all deployments using the ConfigMap strategy, regardless of the actual MIG configurations.
  • The current implementation makes it impossible to use the multi-configmap strategy without applying elevated privileges to all nodes.
  • There is also some misconfigured logic in the nvidia-device-plugin.allPossibleMigStrategiesAreNone template. Here's the relevant part:
{{- else -}}
  {{- range $name, $contents := $.Values.config.map -}}
    {{- $config := $contents | fromYaml -}}
    {{- if $config.flags -}}
      {{- if ne $config.flags.migStrategy "none" -}}
        {{- $result = false -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

This code:

  • Iterates over all configurations in the ConfigMap.
  • Parses each configuration from YAML.
  • Checks if the flags key exists.
  • If any configuration has migStrategy not equal to "none", it sets $result to false.

The problem is that this logic doesn't distinguish between the default configuration and others. It treats all configurations equally. As a result:

  • Even if the default configuration has migStrategy: none,
  • The presence of any other configuration with a different migStrategy (like mig-single in the example)
  • Causes $result to be set to false, leading to elevated privileges for all nodes.

Potential action items

Fundamentally, the issue begins with the described logical misconfiguration, however there remains an underlying issue due to a single Daemonset being generated for all configurations. This subsequently applies the most permissive set of contexts regardless if the node only contains vGPU's. A method to mitigate this would include a unique Daemonset per configuration, that applies the appropriate permissions based on the gpu type. However, this would require a non trivial overhaul to the helm chart.

Copy link

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

1 participant