Security Context Misconfiguration with vGPU Nodes in NVIDIA Device Plugin Helm Chart #854

sbathgate · 2024-07-31T20:21:28Z

Description

In the NVIDIA Device Plugin Helm chart (v0.16.1), when the configuration is supplied through a ConfigMap, the device plugin is deployed with an elevated security context on all nodes, regardless of the MIG strategy each configuration sets. The cause is a flaw in the template logic that prevents the ConfigMap's contents from ever being evaluated.

Current Behavior

When deploying the chart with migStrategy: none and no ConfigMap, the correct security context is applied.

However, when using a ConfigMap strategy, even with a default migStrategy: none:

config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
    mig-single: |-
      version: v1
      flags:
        migStrategy: single
  default: default

All nodes receive elevated privileges, regardless of their actual MIG strategy.

Expected Behavior

Nodes should receive the appropriate security context based on their actual MIG strategy configuration, especially those using the default migStrategy: none.

Impact

This issue unnecessarily elevates privileges on all nodes when using a ConfigMap strategy, potentially compromising security, particularly in mixed GPU environments.

Steps to Reproduce

1. Set up a cluster with both MIG and vGPU nodes.
2. Deploy the NVIDIA Device Plugin using any ConfigMap strategy.
3. Observe that all nodes receive elevated privileges.

Code Analysis

The issue begins in the nvidia-device-plugin.allPossibleMigStrategiesAreNone template. Here's the relevant part:

{{- if .Values.migStrategy -}}
  {{- if ne .Values.migStrategy "none" -}}
    {{- $result = false -}}
  {{- end -}}
{{- else if eq (include "nvidia-device-plugin.hasConfigMap" .) "true" -}}
    {{- $result = false -}}
{{- else -}}
  {{- range $name, $contents := $.Values.config.map -}}
    {{- $config := $contents | fromYaml -}}
    {{- if $config.flags -}}
      {{- if ne $config.flags.migStrategy "none" -}}
        {{- $result = false -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

The initial issue is in this section:

{{- else if eq (include "nvidia-device-plugin.hasConfigMap" .) "true" -}}
    {{- $result = false -}}

The hasConfigMap function is defined as:

{{/*
Check if there is a ConfigMap in use or not
*/}}
{{- define "nvidia-device-plugin.hasConfigMap" -}}
{{- $result := false -}}
{{- if ne (include "nvidia-device-plugin.configMapName" .) "" -}}
  {{- $result = true -}}
{{- end -}}
{{- $result -}}
{{- end }}

where the configMapName function is defined as:

{{- define "nvidia-device-plugin.configMapName" -}}
{{- $result := "" -}}
{{- if .Values.config.name -}}
  {{- $result = .Values.config.name -}}
{{- else if not (empty .Values.config.map) -}}
  {{- $result = printf "%s-%s" (include "nvidia-device-plugin.fullname" .) "configs" -}}
{{- end -}}
{{- $result -}}
{{- end -}}

Root Cause

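The hasConfigMap function returns true whenever config.name is set or config.map in values.yaml is non-empty; it never examines the ConfigMap's contents.
As a result, whenever migStrategy is unset and a ConfigMap is present, the template sets $result = false, regardless of what the ConfigMap contains.
The range loop that actually inspects the ConfigMap's contents sits in the final else branch, so it is never reached when a ConfigMap is defined. When that branch is reached, config.map is empty, so the loop has nothing to iterate over.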
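One possible direction, sketched here as an assumption rather than as the maintainers' fix: keep the conservative result only when an external ConfigMap is referenced through config.name (its contents are not part of values.yaml), and otherwise inspect the embedded config.map entries. The use of dig and the "none" fallback for configs that omit flags.migStrategy are choices made for this sketch, not behavior confirmed by the chart.

```gotemplate
{{- define "nvidia-device-plugin.allPossibleMigStrategiesAreNone" -}}
{{- $result := true -}}
{{- if .Values.migStrategy -}}
  {{- if ne .Values.migStrategy "none" -}}
    {{- $result = false -}}
  {{- end -}}
{{- else if .Values.config.name -}}
  {{- /* External ConfigMap: contents unknown here, so stay conservative. */ -}}
  {{- $result = false -}}
{{- else -}}
  {{- /* Embedded configs: inspect each one (an empty map leaves $result true). */ -}}
  {{- range $name, $contents := .Values.config.map -}}
    {{- $config := $contents | fromYaml -}}
    {{- /* Assumption: a config without flags.migStrategy counts as "none". */ -}}
    {{- if ne (dig "flags" "migStrategy" "none" $config) "none" -}}
      {{- $result = false -}}
    {{- end -}}
  {{- end -}}
{{- end -}}
{{- $result -}}
{{- end -}}
```

With the example values above this still renders false, because mig-single sets migStrategy: single. It only changes the outcome when every embedded config uses none.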

Additional Context

This issue affects all deployments using the ConfigMap strategy, regardless of the actual MIG configurations.
The current implementation makes it impossible to use the multi-configmap strategy without applying elevated privileges to all nodes.
There is also a second logic flaw in the range branch of the nvidia-device-plugin.allPossibleMigStrategiesAreNone template. Here's the relevant part:

{{- else -}}
  {{- range $name, $contents := $.Values.config.map -}}
    {{- $config := $contents | fromYaml -}}
    {{- if $config.flags -}}
      {{- if ne $config.flags.migStrategy "none" -}}
        {{- $result = false -}}
      {{- end -}}
    {{- end -}}
  {{- end -}}
{{- end -}}

This code:

1. Iterates over all configurations in the ConfigMap.
2. Parses each configuration from YAML.
3. Checks if the flags key exists.
4. If any configuration has migStrategy not equal to "none", it sets $result to false.

The problem is that this logic does not distinguish between the default configuration and the others; it treats all configurations equally. As a result, even if the default configuration has migStrategy: none, the presence of any other configuration with a different migStrategy (such as mig-single in the example) sets $result to false, which leads to elevated privileges for all nodes.

Potential action items

Fundamentally, the issue begins with the logic flaws described above. However, an underlying problem remains: a single DaemonSet is generated for all configurations, so the most permissive security context is applied to every node, even nodes that contain only vGPUs. One way to mitigate this would be to generate a separate DaemonSet per configuration, each applying the permissions appropriate to its GPU type. However, this would require a non-trivial overhaul of the Helm chart.
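As a starting point for a per-configuration approach, a helper could report the MIG strategy of a single embedded config instead of aggregating over all of them. The template name below is hypothetical and not part of the chart. The sketch assumes each config.map entry is a YAML string like the ones in the example above, and that a config without flags.migStrategy should count as none.

```gotemplate
{{/*
Hypothetical helper (not part of the chart): renders "true" if a single
embedded config uses migStrategy none, or does not set it at all.
Call it with one value from config.map as the context.
*/}}
{{- define "nvidia-device-plugin.configMigStrategyIsNone" -}}
{{- $config := . | fromYaml -}}
{{- eq (dig "flags" "migStrategy" "none" $config) "none" -}}
{{- end -}}
```

A per-configuration DaemonSet template could then call include "nvidia-device-plugin.configMigStrategyIsNone" $contents inside a range over config.map to choose the security context for that configuration only.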


github-actions · 2024-10-30T04:28:22Z

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions bot added the lifecycle/stale label on Oct 30, 2024
