
aws-janitor-boskos: add clean time and process time metrics #75

Merged

Conversation

@cpanato (Member) commented on Feb 24, 2021:

This adds metrics to collect the time taken to clean each resource and the duration of the entire janitor process.

The metrics are histograms with exponential buckets. I spoke with Frederic Branczyk (@brancz) about this in the past, and he suggested this approach since it is very similar to what I've implemented before, so I used the same pattern 😄
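
As a rough illustration of this approach (not the exact code in this PR), a duration histogram with exponential buckets registered via prometheus/client_golang could look like the sketch below. The metric name, label names, and bucket parameters are assumptions, not the values used in the PR.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: a duration histogram with exponential buckets.
// Metric name, labels, and bucket parameters are illustrative assumptions.
var cleanTimeSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "aws_janitor_boskos_duration_seconds", // hypothetical name
		Help: "Time spent cleaning a resource or running the full janitor process.",
		// Exponential buckets: start at 1s and double 15 times (1s .. ~16384s).
		Buckets: prometheus.ExponentialBuckets(1, 2, 15),
	},
	[]string{"resource", "phase"}, // e.g. phase = "clean" or "process"
)

func init() {
	prometheus.MustRegister(cleanTimeSeconds)
}
```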

This is part of #13.

I'm opening this PR for feedback to see whether this approach sounds good; if it does, I'll do the same for the other janitors.

/assign @ixdy

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 24, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 24, 2021
@ixdy (Contributor) left a comment:

Thanks! This generally LGTM, just one question.


-    logrus.WithFields(logrus.Fields{"name": res.Name, "duration": duration.Seconds()}).Info("Finished cleaning")
+    collectMetric(start, res.Name, "clean")
+    logrus.WithFields(logrus.Fields{"name": res.Name, "duration": time.Since(start).Seconds()}).Info("Finished cleaning")
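
The collectMetric helper itself isn't shown in this hunk. A minimal sketch of what such a helper might do, assuming it observes elapsed time on a histogram like the one sketched above (and a `time` import), is:

```go
// Hypothetical sketch of the collectMetric helper referenced in the diff;
// the actual implementation in this PR may differ.
func collectMetric(start time.Time, resourceName, phase string) {
	cleanTimeSeconds.WithLabelValues(resourceName, phase).Observe(time.Since(start).Seconds())
}
```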
@ixdy (Contributor) commented on the diff:

Do you think we should include sweepCount somehow as well? It might be interesting to understand whether a longer duration is due to multiple sweeps.

@cpanato (Member, Author) replied:

That sounds like a great idea; I will update this PR over the weekend. Thanks for your review and feedback!

@cpanato (Member, Author) replied:

Added, but I'm not 100% sure it is correct. When you have time, can you please take a look and let me know? Thanks!

cmd/aws-janitor-boskos/main.go (resolved conversation)
@cpanato cpanato force-pushed the GH-13-aws-janitor-boskos branch from 95e6d25 to 41c5115 Compare February 28, 2021 16:00
-    logrus.WithFields(logrus.Fields{"name": res.Name, "duration": duration.Seconds()}).Info("Finished cleaning")
+    sweepsGauge.WithLabelValues(res.Name).Set(float64(countSweeps))
+    collectMetric(start, res.Name, "clean")
+    logrus.WithFields(logrus.Fields{"name": res.Name, "duration": time.Since(start).Seconds(), "sweeps": countSweeps}).Info("Finished cleaning")
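
For reference, the sweepsGauge used above could be declared as a GaugeVec along these lines; the metric name and help text are assumptions, not necessarily what the PR uses:

```go
// Hypothetical declaration of sweepsGauge; name and help text are assumptions.
var sweepsGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "aws_janitor_boskos_sweeps",
		Help: "Number of sweep passes performed for a resource in the last cleaning run.",
	},
	[]string{"resource"},
)
```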
@ixdy (Contributor) commented on the diff:

Looking at this code again, I now realize I was misunderstanding it before - I thought that the specified sweepCount was the maximum number of attempted sweeps (i.e. retries if there was an error), but actually, I see that it always sweeps that many times, regardless of error or success.

We could keep the metric here, though it's not nearly as meaningful as I thought it was (since it will basically remain the same for a given configuration). If we keep it, though, we can just pass sweepCount instead of using the separate countSweeps variable.

@cpanato (Member, Author) replied:

I can drop that for now, do a small refactor in a follow-up so this part only retries when it fails, and then add the metric back. WDYT?

@ixdy (Contributor) replied:

I think part of the reason it does multiple passes is that the AWS janitor typically does not return errors if it fails to delete resources, under the assumption that a later pass might successfully delete the resource (e.g. if there is some lingering dependency that gets cleaned up at a later stage).

This probably isn't ideal, but we'd need to investigate further and fix things to achieve the desired "retry only on error" behavior.

TL;DR we probably want to leave the existing behavior of always sweeping multiple times. The metrics you're adding can help us determine if this is taking too long and we should look into optimizing this process. :)
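
In other words, the current behavior is roughly the following simplified sketch; the names and error handling here are illustrative, not the actual aws-janitor code:

```go
// Simplified illustration of the fixed-count sweep behavior described above:
// every pass runs regardless of whether earlier passes deleted everything,
// because a later pass may succeed once lingering dependencies are gone.
for i := 0; i < sweepCount; i++ {
	if err := sweepOnce(); err != nil {
		// The AWS janitor typically does not surface per-resource deletion
		// failures here; this error handling is illustrative only.
		logrus.WithError(err).Warn("sweep pass failed; continuing")
	}
}
```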

@cpanato (Member, Author) replied:

@ixdy sounds good :)

I've updated the code to reflect your feedback. Thanks again!

@cpanato cpanato force-pushed the GH-13-aws-janitor-boskos branch from 41c5115 to b8a8098 Compare March 3, 2021 13:45
to collect the response time when cleaning and for the entire janitor process
@cpanato cpanato force-pushed the GH-13-aws-janitor-boskos branch from b8a8098 to d040320 Compare March 3, 2021 13:47
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2021
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cpanato, ixdy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021
@k8s-ci-robot k8s-ci-robot merged commit 7fe7571 into kubernetes-sigs:master Mar 3, 2021
@cpanato cpanato deleted the GH-13-aws-janitor-boskos branch March 4, 2021 08:52