A3 Ultra GKE deployment fails due to kubectl misconfig when regional/zonal cluster settings are mixed up #3427

Open
chajath opened this issue Dec 18, 2024 · 4 comments
Labels: bug, stale

Comments

@chajath
Contributor

chajath commented Dec 18, 2024

Describe the bug

I'm trying to provision an A3 Ultra cluster with a blueprint modeled after https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/a3ultra-preview/examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml, but without the kubectl-apply block, as we want to manage k8s resources outside of the toolkit.

Steps to reproduce

Steps to reproduce the behavior:

  1. Have an existing GKE cluster in Zonal mode.
  2. Try to deploy it in Regional mode.

Expected behavior

terraform plan (and subsequent steps) shows a complete cluster replacement.

Actual behavior

Error: exit status 1

Error: failed to create kubernetes rest client for read of resource: Get "http://localhost/api?timeout=32s": dial tcp [::1]:80: connect: connection refused

  with module.a3ultra-benchmark-pr.module.kubectl_apply.module.kubectl_apply_manifests["9"].kubectl_manifest.apply_doc["0"],
  on .terraform/modules/a3ultra-benchmark-pr/modules/management/kubectl-apply/kubectl/main.tf line 60, in resource "kubectl_manifest" "apply_doc":
  60: resource "kubectl_manifest" "apply_doc" {
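
For readers triaging similar logs: the request in the error above targets http://localhost, which usually suggests the Kubernetes client was created without a cluster endpoint and fell back to a default address. The following is a minimal sketch of a check for that signature, assuming the error text has the shape shown above; the function name `unconfigured_endpoint` is hypothetical and not part of the toolkit or any provider.

```python
import re
from urllib.parse import urlparse

# Hosts that suggest the client fell back to a default, unconfigured endpoint.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def unconfigured_endpoint(error_text: str) -> bool:
    """Return True if the error shows an API request sent to a local address."""
    # Collect every quoted URL that follows an HTTP verb, e.g. Get "http://...".
    urls = re.findall(r'(?:Get|Post|Put|Patch|Delete) "([^"]+)"', error_text)
    return any(urlparse(u).hostname in LOCAL_HOSTS for u in urls)


# Error text copied from the report above.
sample = (
    "Error: failed to create kubernetes rest client for read of resource: "
    'Get "http://localhost/api?timeout=32s": dial tcp [::1]:80: '
    "connect: connection refused"
)
print(unconfigured_endpoint(sample))
```

A positive result only indicates that no remote endpoint was used; it does not by itself say why the endpoint was missing.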

Version (gcluster --version)

Tested with both v1.43 and experimental.

chajath added the bug label on Dec 18, 2024
@chajath
Contributor Author

chajath commented Dec 18, 2024

I've added the kubectl-apply block, but that didn't help.

chajath changed the title from "A3 Ultra GKE deployment fails due to kubectl misconfig" to "A3 Ultra GKE deployment fails due to kubectl misconfig when regional/zonal cluster settings are mixed up" on Dec 19, 2024
@ighosh98
Contributor

ighosh98 commented Dec 26, 2024

This is not possible today, but we have an open PR, #3406, to address the feature request.
Once that PR is merged, you can pull develop and run ./gcluster deploy examples/gke-a3-ultragpu/gke-a3-ultragpu.yaml --force to resolve your issue, or wait for the release in which it is pushed to main.

@ankitkinra
Contributor

I think the issue here is different: it relates to the cluster being provisioned as ZONAL rather than REGIONAL, and the kubectl-apply module might be making an assumption about which of the two is in use.
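
To illustrate the kind of mismatch described in this comment, here is a minimal sketch of a pre-deploy check. It assumes the usual naming convention in which a zone name is a region name plus a single-letter suffix (for example `us-central1-a` versus `us-central1`; both values are illustrative, not taken from the report). The helpers `location_kind` and `location_mismatch` are hypothetical and are not part of the toolkit or the kubectl-apply module.

```python
def location_kind(location: str) -> str:
    """Classify a location name as 'zonal' or 'regional' (naming heuristic)."""
    parts = location.split("-")
    # A zone appends a single-letter suffix to the region name.
    if len(parts) == 3 and len(parts[2]) == 1 and parts[2].isalpha():
        return "zonal"
    # A region has the form <area>-<name><number>, i.e. exactly one hyphen.
    if len(parts) == 2 and all(parts):
        return "regional"
    raise ValueError(f"unrecognized location: {location!r}")


def location_mismatch(existing: str, requested: str) -> bool:
    """Return True if the existing and requested cluster locations differ in kind."""
    return location_kind(existing) != location_kind(requested)


# Existing cluster is zonal, the new deployment asks for a regional one.
print(location_mismatch("us-central1-a", "us-central1"))
```

Running a check like this before deploying would flag the zonal-to-regional switch early, instead of leaving it to surface as a client connection error.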


This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Jan 26, 2025