
test: Retina e2e scale test #720

Open

alexcastilio wants to merge 6 commits into main from create-scale-test-cluster

Conversation

alexcastilio (Contributor)

Description

Create Retina E2E Scale tests using test/e2e framework.

Related Issue

If this pull request is related to any issue, please mention it here. Additionally, make sure that the issue is assigned to you before submitting this pull request.

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes made.

Additional Notes

Add any additional notes or context about the pull request here.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

@alexcastilio alexcastilio added the area/infra Test, Release, or CI Infrastructure label Sep 10, 2024
@alexcastilio alexcastilio self-assigned this Sep 10, 2024
@alexcastilio alexcastilio requested a review from a team as a code owner September 10, 2024 16:09
@alexcastilio alexcastilio marked this pull request as draft September 10, 2024 16:10
@timraymond (Member) left a comment:

I realize that this is a draft, and I ordinarily don't review drafts since things are subject to change... but I figured these would be good suggestions early-on if you're planning to write some more here.

Resolved (outdated) review threads on: test/e2e/framework/types/runner.go (2), test/e2e/scale_test.go (2)
@@ -62,6 +62,20 @@ func DeleteTestInfra(subID, clusterName, location string) *types.Job {
return job
}

func InstallRetina(kubeConfigFilePath, chartPath string) *types.Job {
Contributor:

Why do you need this? Can you refactor InstallAndTestRetinaBasicMetrics?

Contributor Author:

I preferred not to change InstallAndTestRetinaBasicMetrics, since it might be in use by someone else and its scope seems to be different.

The plan for the new pipeline is to Create Infra, Install Retina, Scale Up, and Run Tests, so my thinking is to create a different job for each stage so that each job can be reused elsewhere. In the future, instead of using InstallAndTestRetinaBasicMetrics, one could [re-]use InstallRetina, Test Basic Metrics, and perhaps add some other set of tests (see the rough sketch below). Does that make sense?
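
To illustrate the staging idea in isolation, a minimal, framework-agnostic sketch in Go follows; the stage type, names, and no-op run functions are purely illustrative and are not the e2e framework's actual Job/Step API:

package main

import "log"

// stage is a reusable unit of work; several stages compose into a pipeline
// (illustrative only, not the e2e framework's actual Job/Step types).
type stage struct {
	name string
	run  func() error
}

// runPipeline executes each stage in order and stops on the first failure.
func runPipeline(stages []stage) error {
	for _, s := range stages {
		log.Printf("running stage: %s", s.name)
		if err := s.run(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The scale pipeline reuses an InstallRetina-style stage instead of the
	// monolithic InstallAndTestRetinaBasicMetrics, so the same stage can be
	// shared with other pipelines later.
	scalePipeline := []stage{
		{name: "create infra", run: func() error { return nil }},
		{name: "install retina", run: func() error { return nil }},
		{name: "scale up", run: func() error { return nil }},
		{name: "run tests", run: func() error { return nil }},
	}
	if err := runPipeline(scalePipeline); err != nil {
		log.Fatal(err)
	}
}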

Resolved (outdated) review threads on: test/e2e/framework/types/background_test.go, test/e2e/framework/types/runner.go, test/e2e/framework/types/scenarios_test.go
@alexcastilio alexcastilio force-pushed the create-scale-test-cluster branch 5 times, most recently from 6662faf to 69189ce Compare September 11, 2024 09:39

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added the meta/waiting-for-author Blocked and waiting on the author label Oct 12, 2024
@alexcastilio alexcastilio force-pushed the create-scale-test-cluster branch 7 times, most recently from 83abb41 to 6ef0d3b Compare October 16, 2024 13:24
@alexcastilio alexcastilio removed the meta/waiting-for-author Blocked and waiting on the author label Oct 16, 2024
@alexcastilio alexcastilio marked this pull request as ready for review October 16, 2024 14:12
@alexcastilio alexcastilio force-pushed the create-scale-test-cluster branch 2 times, most recently from 2904623 to 84b6bd2 Compare October 16, 2024 14:30
@alexcastilio alexcastilio force-pushed the create-scale-test-cluster branch 3 times, most recently from 695817d to dce97a3 Compare October 16, 2024 16:03
"k8s.io/client-go/tools/clientcmd"
)

type CreateNetworkPolicies struct {
Member:

There is a create-network-policy step in the stdlib; can we either move this there or converge on one way of creating netpols?
https://github.com/microsoft/retina/blob/main/test/e2e/framework/kubernetes/create-network-policy.go

Contributor Author:

Yes, that could be done with some refactoring.

The kubernetes.CreateDenyAllNetworkPolicy step creates a specific NetworkPolicy (NP) and is encapsulated so that only the Namespace and a single podSelector label can be provided.

The scaletest.CreateNetworkPolicies step creates a number of NPs that are also specific, with the podSelector labels needed for the scale tests.

The first one would need more fields to allow more customization of the generated NP, but that would also require checking and changing the code everywhere it is currently used.

If it were done that way, how should this step that generates a single NP be reused to generate many NPs? Should the ScaleTest function that generates the job (file test/e2e/jobs/scale.go) add a loop that adds this same Step several times, as in the sketch below? Or should a new step be created that encapsulates these multiple NP creations? (I'm not sure whether the framework currently allows creation/reuse of steps inside of steps.)

Anyway, if this is the best approach, maybe it could be done in a separate refactoring task. Do you agree?
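
For concreteness, a minimal sketch of the loop option, building N single-label deny-all policies from one helper; the helper name, namespace, labels, and count are illustrative and are not the framework's actual steps:

package main

import (
	"fmt"

	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// denyAllForLabel builds one deny-all NetworkPolicy scoped to a single
// podSelector label, similar in spirit to the existing
// kubernetes.CreateDenyAllNetworkPolicy step.
func denyAllForLabel(namespace, labelKey, labelValue string) netv1.NetworkPolicy {
	return netv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("deny-all-%s", labelValue),
			Namespace: namespace,
		},
		Spec: netv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{labelKey: labelValue},
			},
			PolicyTypes: []netv1.PolicyType{
				netv1.PolicyTypeIngress,
				netv1.PolicyTypeEgress,
			},
		},
	}
}

func main() {
	// The "loop" option: reuse the single-NP builder N times with the
	// scale-test labels instead of adding a dedicated multi-NP step.
	var policies []netv1.NetworkPolicy
	for i := 0; i < 10; i++ {
		policies = append(policies, denyAllForLabel("scale-test", "app", fmt.Sprintf("kapinger-%d", i)))
	}
	fmt.Printf("generated %d network policies\n", len(policies))
}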

Member:

I see the reasoning. While there might be a way to accommodate both scenarios, that may be out of scope for this PR given current limitations. The existing cases verify granular drop behavior, unlike this suite, so we can shelve this for later.

"k8s.io/client-go/tools/clientcmd"
)

type CreateResources struct {
Member:

Is it possible to reuse something like this?

https://github.com/microsoft/retina/blob/main/test/e2e/framework/kubernetes/create-resource.go#L23

Pigeonholing these relatively common actions into specific tests is going to result in a lot of redundant code.

Contributor Author:

It is being reused inside the Run() method (line 66). This step puts together the generation of all the objects required for this scale test and then reuses the CreateResource function you mention to apply them to the cluster.

How would you suggest reusing it in a better way?

Member:

My thought stems from the fact that creating/destroying Kubernetes resources is a core component of these suites, and I'm wondering whether there is a way of making this CreateResources step shareable amongst other suites.

In other such cases we've seen people create resources that require a pod/job/daemonset to be ready, but the resource-creation part of the test just creates the resource in the apiserver and moves on without waiting for the job or pod, which then results in faulty tests. It is then necessary to add a wait-for-pod-ready step, like the one added in this PR, and/or verify the cluster is stable, etc. (see the create-then-wait sketch below).

I don't have a clear solution for this at the moment, so I won't block, but it seems everybody has a different way of creating test resources, so I'm just starting a dialogue around the topic.
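
For reference, a sketch of the create-then-wait pattern using client-go; the package name, function name, and timeouts are illustrative, and this is not the framework's actual step implementation:

package scaletest

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// createAndWait creates a Deployment and then blocks until its replicas
// report ready, so later steps do not run against a half-started workload.
func createAndWait(ctx context.Context, client kubernetes.Interface, dep *appsv1.Deployment) error {
	if _, err := client.AppsV1().Deployments(dep.Namespace).Create(ctx, dep, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("create deployment %q: %w", dep.Name, err)
	}
	// Poll until the observed ready replicas match the desired count.
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			d, err := client.AppsV1().Deployments(dep.Namespace).Get(ctx, dep.Name, metav1.GetOptions{})
			if err != nil {
				return false, err
			}
			want := int32(1)
			if d.Spec.Replicas != nil {
				want = *d.Spec.Replicas
			}
			return d.Status.ReadyReplicas == want, nil
		})
}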


var (
kapingerPort = int32(8080)
KapingerDeployment = appsv1.Deployment{
Member:

Is there a reason this is a dupe of this?
https://github.com/microsoft/retina/blob/main/test/e2e/framework/kubernetes/create-kapinger-deployment.go

It has already proven difficult, when we tweak kapinger, to track down all of the places that use it; having another place where a DeploymentSpec is defined makes that more difficult.

Contributor Author:

I think I just had not seen it before. Reusing the previous implementation.

// Returning an error will cause the test to fail
func (po *ValidateAndPrintOptions) Run() error {

log.Printf(`Starting to scale with folowing options:
Member:

Use the %+v formatting directive to print struct keys and values.

Suggested change:
- log.Printf(`Starting to scale with folowing options:
+ log.Printf("Starting to scale with following options %+v", po.Options)

Contributor Author:

ok

KubeConfigFilePath string
}

// Useful when wanting to do parameter checking, for example
Member:

This comment is a bit redundant when copy-pasted onto every struct :)

Contributor Author:

This comment comes from the interface. I use a plugin to automatically create the methods from the interface that I want to implement, so it also gets the comment from it :)

)

var (
NetworkPolicy = netv1.NetworkPolicy{
Member:

+1, this doesn't necessarily seem scale-test specific; let's put it in a more common directory and/or reuse it with the existing create-deny-netpol steps.

Contributor Author:

I can put it in the folder test/e2e/framework/kubernetes/templates, and the existing create-deny-netpol step can be changed in another task to reuse it and make the changes needed. Is that OK?

)

var (
KapingerClusterRole = rbacv1.ClusterRole{
Member:

Same for this: let's stick to one way of deploying kapinger and tweak that if we need to.

Contributor Author:

Done

type CreateNamespace struct {
Namespace string
KubeConfigFilePath string
DryRun bool
Member:

Do we have any strong use cases for dry run? In a lot of these cases so far, all the core step code is gated by this conditional, and when dry run is set to true the step just pulls the kubeconfig.

For what it's worth, the prevalidate step is effectively a dry run.

Contributor Author:

The script this code is based on has a flag to stop execution after generating the objects and before applying them to the cluster, to allow inspection of the generated objects. The idea here is to print the manifests to the logs without applying them (sketched below). Perhaps this could be removed. @vipul-21, what do you think about this?
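
For context, the DryRun gate under discussion looks roughly like this; the struct mirrors the fields from the diff above, while the package name and the client-go/yaml calls are just one illustrative way to do it, not the final step code:

package scaletest

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

// CreateNamespace mirrors the struct in the diff above. Run gates the apply
// behind DryRun so the generated manifest is only printed for inspection.
type CreateNamespace struct {
	Namespace          string
	KubeConfigFilePath string
	DryRun             bool
}

func (c *CreateNamespace) Run() error {
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: c.Namespace}}

	if c.DryRun {
		out, err := yaml.Marshal(ns)
		if err != nil {
			return fmt.Errorf("marshal namespace: %w", err)
		}
		log.Printf("dry run, generated manifest:\n%s", out)
		return nil // stop before touching the cluster
	}

	config, err := clientcmd.BuildConfigFromFlags("", c.KubeConfigFilePath)
	if err != nil {
		return fmt.Errorf("build kubeconfig: %w", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return fmt.Errorf("create clientset: %w", err)
	}
	_, err = client.CoreV1().Namespaces().Create(context.TODO(), ns, metav1.CreateOptions{})
	return err
}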

Member:

cc: @huntergregory for his thoughts

Contributor:

I think we can remove this for simplicity

Contributor:

This was primarily used to calculate how many IP sets, et cetera, NPM will create.

Contributor Author:

DryRun removed

Labels
area/infra Test, Release, or CI Infrastructure
5 participants