Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

E2E K8s Context Protection #4391

Open
reedjosh opened this issue Nov 25, 2024 · 19 comments
Open

E2E K8s Context Protection #4391

reedjosh opened this issue Nov 25, 2024 · 19 comments
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@reedjosh
Copy link

reedjosh commented Nov 25, 2024

What do you want to happen?

The initial E2E suite created via kubebuilder does not protect against using a production kubernetes context.

This can result in removing monitoring and cert manager from production clusters.

Tilt for example only allows the user to run against a context matching kind-* by default, and requires the user to intentionally add other contexts.

🙏 please add this protection. It is hard to otherwise ensure it doesn't crop up across a company.

Extra Labels

No response

@reedjosh reedjosh added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 25, 2024
@camilamacedo86
Copy link
Member

camilamacedo86 commented Nov 26, 2024

Hi @reedjosh,

Thank you for raise this issue.

Could you please let me know if you are using a project with the latest scaffold? I'm just asking because it was already addressed.

  • See that in the Makefile, we check if kind is up and running:

# TODO(user): To use a different vendor for e2e tests, modify the setup under 'tests/e2e'.
# The default setup assumes Kind is pre-installed and builds/loads the Manager Docker image locally.
# Prometheus and CertManager are installed by default; skip with:
# - PROMETHEUS_INSTALL_SKIP=true
# - CERT_MANAGER_INSTALL_SKIP=true
.PHONY: test-e2e
test-e2e: manifests generate fmt vet ## Run the e2e tests. Expected an isolated environment using Kind.
@command -v kind >/dev/null 2>&1 || { \
echo "Kind is not installed. Please install Kind manually."; \
exit 1; \
}
@kind get clusters | grep -q 'kind' || { \
echo "No Kind cluster is running. Please start a Kind cluster before running the e2e tests."; \
exit 1; \
}
go test ./test/e2e/ -v -ginkgo.v

  • Then, see the suite test scaffold that before all we try to load the image. If the tests are executed against a cluster that is not kind it should fail. Kind is not intended to be used in production so, it will not run against a production cluster without intentional changes

// TODO(user): If you want to change the e2e test vendor from Kind, ensure the image is
// built and available before running the tests. Also, remove the following block.
By("loading the manager(Operator) image on Kind")
err = utils.LoadImageToKindClusterWithName(projectImage)
ExpectWithOffset(1, err).NotTo(HaveOccurred(), "Failed to load the manager(Operator) image into Kind")
// The tests-e2e are intended to run on a temporary cluster that is created and destroyed for testing.
// To prevent errors when tests run in environments with Prometheus or CertManager already installed,
// we check for their presence before execution.
// Setup Prometheus and CertManager before the suite if not skipped and if not already installed
if !skipPrometheusInstall {
By("checking if prometheus is installed already")
isPrometheusOperatorAlreadyInstalled = utils.IsPrometheusCRDsInstalled()
if !isPrometheusOperatorAlreadyInstalled {
_, _ = fmt.Fprintf(GinkgoWriter, "Installing Prometheus Operator...\n")
Expect(utils.InstallPrometheusOperator()).To(Succeed(), "Failed to install Prometheus Operator")
} else {
_, _ = fmt.Fprintf(GinkgoWriter, "WARNING: Prometheus Operator is already installed. Skipping installation...\n")
}
}
if !skipCertManagerInstall {
By("checking if cert manager is installed already")
isCertManagerAlreadyInstalled = utils.IsCertManagerCRDsInstalled()
if !isCertManagerAlreadyInstalled {
_, _ = fmt.Fprintf(GinkgoWriter, "Installing CertManager...\n")
Expect(utils.InstallCertManager()).To(Succeed(), "Failed to install CertManager")
} else {
_, _ = fmt.Fprintf(GinkgoWriter, "WARNING: CertManager is already installed. Skipping installation...\n")
}
}
})

  • Moreover, we added checks to try to avoid re-install Prometheus and CertManager, as you can see more comments to clarify how it works and help users.

If you are using this version, then we might add a logic to exit(1) when it is not possible to load the image.

We are looking forward to your reply.

@camilamacedo86 camilamacedo86 added the triage/needs-information Indicates an issue needs more information in order to work on it. label Nov 26, 2024
@damsien
Copy link
Contributor

damsien commented Nov 26, 2024

What if the KUBECONFIG point to a production cluster and the kind cluster exists? I think the image will still be loaded in the cluster (since we set kind load docker-image --name <CLUSTER_NAME> to tell on which cluster we want to load the image) while having the KUBECONFIG pointing another cluster.

@monteiro-renato
Copy link
Contributor

Slight tangent, but this check is probably not working as intended.
When you create a cluster with for example kind create cluster -n my-cluster, the result of kind get clusters will not have the kind- prefix; it will display the same name that was passed to the -n flag. e.g.

$ kind get clusters
linkerd-cluster
my-cluster

The prefix kind- is, however, added to the kubeconfig file. e.g.

cat ~/.kube/config | yq '.clusters.[].name'
"kind-linkerd-cluster"
"kind-my-cluster"

@damsien
Copy link
Contributor

damsien commented Nov 26, 2024

@monteiro-renato I think that's how they ensure that the cluster is not a production cluster. Because usually production clusters are not prefixed with kind-. If you want to run the tests, you must have a kind- prefixed local cluster.

@monteiro-renato
Copy link
Contributor

Yea, I can see it now. When I got that error a while back I found it weird since I knew that I had a kind cluster running and with the correct context configured.
I'm usually quite paranoid when it comes to tools that default to w.e is configured on the kubeconfig so I prefer to create separate envs (container or VM) so that I can have peace of mind knowing that I will never target a cluster I'm not expecting.
But yea, I guess it's just that the error message could be a bit more explicit about it's intent.

@monteiro-renato
Copy link
Contributor

But yea, I think OP's concern is still valid. The validation should probably be done against the context configured in the kubeconfig instead of relying on the output of a kind command.

@camilamacedo86
Copy link
Member

camilamacedo86 commented Nov 26, 2024

Hi @monteiro-renato, @damsien, @reedjosh,

How can we validate the context or name to determine whether it’s a production environment for the user?
This goes beyond what we can reasonably enforce. Moreover, if users execute any Makefile target against a production environment—such as make deploy, make install, or make undeploy—they may encounter serious issues.

Additionally, frameworks generally don’t handle this kind of validation. For example, tools like Terraform, Helm, etc., don’t verify whether a command targets a production environment. At some level, the user must understand the commands they’re running and the clusters they’re targeting.

On top of that:

  • Cluster admins should safeguard production environments by applying RBAC to restrict access for developers.
  • Developers should avoid configuring production context in their local dev environments an just leave it there. They should ensure that actions targeting production are separated from the development environment to prevent accidental operations. This is not a critical need specifically for e2e tests but applies to all possible actions that developers might execute using kubectl.

It was already discussed. We cannot validate the context configured by developers when they run the commands and say, "Ah, it's production, and it is not for the e2e tests," just as we do not perform such validation for any Makefile targets or features.

I hope that clarifies. I am looking forward to knowing if @reedjosh is or is not using a project with all changes related to this request. (as described in #4391 (comment)) I really would appreciate it if @reedjosh could share this information with us.

Thank you.

@monteiro-renato
Copy link
Contributor

Hey @camilamacedo86,

We could create a kind cluster based on the project's name and then use kind's get kubeconfig subcommand to generate the kubeconfig file to use in the e2e tests. The tests would set the KUBECONFIG env var at the start of the tests to point to the location of the kubeconfig file generated by kind.

@Sijoma
Copy link
Contributor

Sijoma commented Feb 6, 2025

Hi @camilamacedo86 , I agree that its pretty hard to prevent this & kubebuilder does not need to protect against all of this.

I think one improvement that is not too hard to achieve would be to extend the utils.Kubectl to also have the --context=value only set once and added to all kubectl calls.

I've personally ran my tests against the wrong context as it was not specified.

How did this happen? I've started a long running e2e test and then switched context in my shell.
This cannot happen if the utils.Kubectl always specifies the --context .

With other clients (controller-runtime / e2e-testframework clients) this is not so much an issue as we create them before the tests with a hardcoded context.

@reedjosh
Copy link
Author

reedjosh commented Feb 12, 2025

Could you please let me know if you are using a project with the latest scaffold? I'm just asking because it was already addressed.

I'm sorry for the delayed response. I believe we were using 4.30 at the time. We started the project and saw the issue around the middle of Nov. And we were using controllergen v0.16.x.

Without digging into the code I think that the current method is still broken, and implementing protections and workarounds like Tilt does would be ideal.

allow_k8s_context

Tilt checks k8s context names for the kind- prefix. If it's missing, it won't run.

The allow_k8s_contexts is then provided in case someone is running minikube or something else without the kind- prefix

@reedjosh
Copy link
Author

reedjosh commented Feb 12, 2025

This goes beyond what we can reasonably enforce. Moreover, if users execute any Makefile target against a production environment—such as make deploy, make install, or make undeploy—they may encounter serious issues.

I understand the thinking here, but make deploy does seem like something that would modify a cluster.

make test or make e2e for a tool that nearly every k8s operator builder uses (including those brand new to the ecosystem) does not make it obvious that if you're accidentally logged into a meaningful cluster, things will get wiped out.

@jravetch
Copy link

Had a similar experience when running make test and make test-e2e, wasn't clear the context wasn't kind and the namespace was deleted.

@camilamacedo86
Copy link
Member

camilamacedo86 commented Feb 13, 2025

Hi folks,

As discussed in Slack and another issue, adding --context=value and does not seem like a good approach

However, due to the complexities involved in avoiding this scenario and the fact that it still creates confusion for users, I was thinking about how to simplify it and solve the concerns raised. Additionally, since we now scaffold GitHub Actions to run E2E tests, we can improve this by:

  1. Removing the Golang code that installs CertManager from the E2E tests. (ref: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4/test/e2e/e2e_suite_test.go#L66-L81)
  2. Updating the E2E tests to validate the sample projects and docs tutorials for installing CertManager. (https://github.com/kubernetes-sigs/kubebuilder/blob/master/.github/workflows/test-e2e-book.yml and https://github.com/kubernetes-sigs/kubebuilder/blob/master/.github/workflows/test-e2e-samples.yml)
  3. Modifying the GitHub Actions scaffold to install CertManager explicitly before running tests. (ref: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4/.github/workflows/test-e2e.yml)
  4. Removing unnecessary complexities from the Makefile targets to simplify the test setup. (here: https://github.com/kubernetes-sigs/kubebuilder/blob/master/testdata/project-v4/Makefile#L64-L78)

This approach would make the test framework more predictable, reduce complexity, and align with best practices for managing dependencies in CI/CD pipelines as address all your concerns.

If someone would like to work on this one, please feel free !!!

@monteiro-renato, @damsien, @reedjosh, @reedjosh. WDYT? Please feel free to contribute if you wish.

@camilamacedo86 camilamacedo86 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed triage/needs-information Indicates an issue needs more information in order to work on it. kind/feature Categorizes issue or PR as related to a new feature. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Feb 13, 2025
@reedjosh
Copy link
Author

I think that sounds great! Only final concern is that this also seems to install and thus remove metrics. Would that be subject to the same changes?

Thank you for your continued work in making kubebuilder an excellent project!

@camilamacedo86
Copy link
Member

camilamacedo86 commented Feb 13, 2025

Hi @reedjosh,

Following the approach outlined in the comment above (#4391 (comment)), external dependencies are no longer installed via the e2e tests. Instead, third-party dependencies are now handled through GitHub Actions, which seems to be the best approach.

This means that when you run make test-e2e, the project will be built, loaded with Kind, and the tests will be executed. Any required dependencies, such as CertManager or Prometheus, are managed within the GitHub Actions workflow.

Let me know if you have any questions and if the above answers your concerns.

@Sijoma
Copy link
Contributor

Sijoma commented Feb 13, 2025

As discussed in Slack and another issue, adding --context=value and does not seem like a good approach

Do you have a link to that? I would like to understand the reasoning.

The protection is really important to also have at runtime. E2e tests are often quite slow - so there is a risk someone switches the context. We for example already protect against many cases at startup (this one is easy).

As far as removing CertManager setup etc. - I personally like this setup although we so far setup everything before testing with kind/make.
It would still mean that you potentially run our E2e tests in the wrong context (production!) which might do something undesirable on the resources your tests modify.

@camilamacedo86
Copy link
Member

camilamacedo86 commented Feb 13, 2025

Hi @Sijoma,

Let’s go step by step to clarify the situation.


What is the problem?

When running make test-e2e, it installs Prometheus and CertManager.
If the current context is pointing to a cluster where these components are already installed, it may cause conflicts.


Why is this issue already resolved in the latest scaffold?

The latest scaffold has addressed this issue with the following changes:

Prometheus is no longer installed.
CertManager is only installed if:

Given these updates, I don’t see how the original issue persists.
The scaffold now ensures that CertManager is only installed when necessary, and no kind-prefix configuration is required.
That’s why I suggested checking if you’re using the latest version.


Why don’t we set --context=value by default in the Makefile?

Following is a summary:

This has been discussed on other places like slack, Issue 4116 - Comment,Issue 4391 - Comment

1. How Do We Determine a Production Environment?

  • There is no universal way to validate whether a cluster is a production environment.
  • If we enforce a context check here, we would also need to check other Makefile targets (make deploy, make install, make undeploy), which could also impact production environments.
  • The same accidental scenario can occur with ANY target. Why was this never requested for other targets? Why should we add it only for make test-e2e?

2. Other Tools Do Not Enforce This

  • Terraform, Helm, and Kubernetes (kubectl) do NOT prevent users from running commands against production environments.
  • It is ultimately the user’s responsibility to ensure they are targeting the correct cluster.

3. Checking Context Does Not Solve the Issue

  • Developers can still accidentally run tests against a production cluster by triggering e2e tests directly, without using make test-e2e.
  • Using the Makefile is not mandatory. Many developers (including myself) trigger tests directly from an IDE.
  • In CI workflows, the Makefile target is typically used, but developers often run tests locally in different ways. Therefore, the suggestion of --context=value does NOT solve the issue described here.

"This change would negatively impact users, as it would introduce unnecessary complexities into Kubebuilder that all users do not commonly require. It would also create inconsistent behaviour compared to other targets and deviate from standard practices followed by other tools. Additionally, it offers NO little real protection against misconfiguration and does not effectively address the underlying issue."


Alternative Approach – GitHub Actions for Managing Dependencies

If some users still think this is an issue, an alternative approach has already been proposed:
Ensure that GitHub Actions install the required dependencies instead.

📌 Discussed here: Issue 4391 - Comment

How This Works:

  • External dependencies (e.g., CertManager, Prometheus) are no longer installed via e2e tests.
  • Instead, they are managed within GitHub Actions.
  • When running make test-e2e:
    • The project is built
    • It is loaded with Kind
    • The tests are executed
  • All required dependencies are handled by GitHub Actions, preventing unnecessary installations in local environments.

Pros of This Approach

  • Simplifies the overall setup in the Makefile and Go test code

Cons of This Approach

  • Make harder for devs run the e2e tests locally and debug they will need to have a pre-setup. Then alternatively, we would need to study the possibility to have a target setup-e2e test instead of install Prometheus via GitHub Actions

I hope this clarifies everything. Looking forward to your thoughts!

Thank you! 🚀

@camilamacedo86 camilamacedo86 added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 13, 2025
@camilamacedo86 camilamacedo86 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Feb 15, 2025
@camilamacedo86
Copy link
Member

camilamacedo86 commented Feb 15, 2025

Hi @Sijoma, @reedjosh, @damsien, @jravetch, @monteiro-renato,

After further consideration, I’ve updated the pros and cons of the alternative solution proposed in the comment above: link to comment, where we discuss centralizing everything.

Honestly, I believe the best course of action here is:

  • Closing this issue as resolved – The problem originally raised can no longer occur in the latest versions.

  • Rejecting the proposed solution to modify the Makefile target to add context – This does not address the problem at all. As outlined earlier, E2E tests can be triggered via an IDE (which is often the case), meaning the same issue would still occur. Additionally, this approach would introduce unnecessary complexity, deviate from common standards, and not provide significant benefits to general users. However, end users are always free to modify their own scaffold if needed. (more info: E2E K8s Context Protection #4391 (comment))

  • Considering an enhancement properly – If we want to improve this, I think the best approach is first to analyze the pros and cons carefully. So far, it seems that the best way is for a proposal (PEP) to be required. Referencing the PEP template would be helpful. Frequent back-and-forth changes to the scaffold can lead to a poor user experience—these changes only affect new projects, but users still need to update, and modifying it impacts workflows and pipeline configurations. Any potential change here, I think, would be the best approach; we have a design doc, consider all pros and cons, and ask for a community feedback process before moving forward.

So, I would suggest we close this issue based on the above considerations.
That said, I believe the best way forward would be:

  • (A) Open a new issue with clear steps to reproduce the scenario based on the testdata samples (master branch), OR/AND
  • (B) Open a PR proposing changes using the PEP template.

Let me know your thoughts!

@Sijoma
Copy link
Contributor

Sijoma commented Feb 15, 2025

Totally fine for me 👍 . It's a never ending problem. Also agree that adding the context to the Makefile does not help at all.

Rejecting the proposed solution to modify the Makefile target to add context

Just to clarify - I personally meant adding the --context on the Code by extending the utils.Kubectl to specify a context for the whole test suite. Same for the Kubernetes Context Config. This could then be loaded from the environment once on startup of the test suite.

This would also protect you when you use the IDE to run tests (like I do too).

We wrap the utils.Kubectl from Kubebuilder like so.
Image

Thank you @camilamacedo86 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triage/needs-information Indicates an issue needs more information in order to work on it.
Projects
None yet
Development

No branches or pull requests

6 participants