aws-janitor-boskos: add clean time and process time metrics #75
Conversation
Thanks! This generally LGTM, just one question.
cmd/aws-janitor-boskos/main.go (outdated)

```diff
-logrus.WithFields(logrus.Fields{"name": res.Name, "duration": duration.Seconds()}).Info("Finished cleaning")
+collectMetric(start, res.Name, "clean")
+logrus.WithFields(logrus.Fields{"name": res.Name, "duration": time.Since(start).Seconds()}).Info("Finished cleaning")
```
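For context, a minimal sketch of what the `collectMetric` helper and its backing histogram might look like. The helper's body, metric name, buckets, and labels are not shown in this diff, so everything below is an assumption:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram backing collectMetric; the real metric name, help
// text, buckets, and labels in the PR may differ.
var janitorDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "aws_janitor_duration_seconds",
		Help:    "Duration of janitor clean/process phases in seconds.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // assumed bucket layout
	},
	[]string{"name", "phase"},
)

func init() {
	prometheus.MustRegister(janitorDuration)
}

// collectMetric records how long a phase ("clean" or "process") took for a
// given resource, measured from start.
func collectMetric(start time.Time, name, phase string) {
	janitorDuration.WithLabelValues(name, phase).Observe(time.Since(start).Seconds())
}

func main() {
	start := time.Now()
	// ... cleaning work would happen here ...
	collectMetric(start, "aws-account-1", "clean")
}
```

With something like this, `collectMetric(start, res.Name, "clean")` records the per-resource clean duration, and the same helper can be reused with a "process" label for the overall janitor run.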
Do you think we should include `sweepCount` somehow as well? It might be interesting to understand if a longer duration is due to multiple sweeps.
That sounds like a great idea, I will update this PR over the weekend. Thanks for your review and feedback!
Added, but I'm not 100% sure it is correct. When you have time, can you please take a look and let me know? Thanks!
Force-pushed from 95e6d25 to 41c5115.
cmd/aws-janitor-boskos/main.go (outdated)

```diff
-logrus.WithFields(logrus.Fields{"name": res.Name, "duration": duration.Seconds()}).Info("Finished cleaning")
+sweepsGauge.WithLabelValues(res.Name).Set(float64(countSweeps))
+collectMetric(start, res.Name, "clean")
+logrus.WithFields(logrus.Fields{"name": res.Name, "duration": time.Since(start).Seconds(), "sweeps": countSweeps}).Info("Finished cleaning")
```
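The definition of `sweepsGauge` isn't part of this snippet, so here is a hypothetical declaration for it; the actual metric name, help text, and registration in the PR may differ:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical declaration backing the sweepsGauge call above; the metric
// name and help text are assumptions.
var sweepsGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "aws_janitor_sweeps",
		Help: "Number of sweep passes performed while cleaning a resource.",
	},
	[]string{"name"},
)

func main() {
	prometheus.MustRegister(sweepsGauge)
	sweepsGauge.WithLabelValues("aws-account-1").Set(3) // example usage only
}
```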
Looking at this code again, I now realize I was misunderstanding it before: I thought that the specified `sweepCount` was the maximum number of attempted sweeps (i.e. retries if there was an error), but actually it always sweeps that many times, regardless of error or success.

We could keep the metric here, though it's not nearly as meaningful as I thought (since it'll basically remain the same for a given configuration). If we keep it, we can just pass `sweepCount` instead of using the separate `countSweeps` variable.
I can drop that for now and, in a follow-up, do a small refactor in this part to only retry when a sweep fails, and then add the metric back. WDYT?
I think part of the reason it does multiple passes is that the AWS janitor typically does not return errors if it fails to delete resources, under the assumption that a later pass might successfully delete the resource (e.g. if there is some lingering dependency that gets cleaned up at a later stage).
This probably isn't ideal, but we'd need to investigate further and fix things to achieve the desired "retry only on error" behavior.
TL;DR: we probably want to keep the existing behavior of always sweeping multiple times. The metrics you're adding can help us determine if this is taking too long and whether we should look into optimizing the process. :)
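To make the behavior described above concrete, here is an illustrative sketch of a loop that always performs `sweepCount` passes regardless of per-pass errors; the function and parameter names are assumptions, not the janitor's actual API:

```go
package main

import (
	"errors"

	"github.com/sirupsen/logrus"
)

// sweepAll mirrors the existing behavior described above: it always runs
// sweepCount passes, even when an individual pass fails, because a later pass
// may succeed once lingering dependencies have been cleaned up.
func sweepAll(sweepCount int, sweep func() error) {
	for i := 0; i < sweepCount; i++ {
		if err := sweep(); err != nil {
			// Log and keep going; no early exit and no retry-on-error logic.
			logrus.WithError(err).Warn("sweep pass failed")
		}
	}
}

func main() {
	// Example: three passes are always attempted, even though the first fails.
	calls := 0
	sweepAll(3, func() error {
		calls++
		if calls == 1 {
			return errors.New("lingering dependency not yet deleted")
		}
		return nil
	})
}
```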
@ixdy sounds good :) I've updated the code to reflect your feedback, thanks again!
Force-pushed from 41c5115 to b8a8098.
Force-pushed from b8a8098 to d040320.
[APPROVALNOTIFIER] This PR is APPROVED.

This pull request has been approved by: cpanato, ixdy. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
...to collect the time taken when cleaning each resource and for the entire janitor process.
I made the metric a histogram and created exponential buckets. I've spoken in the past with Frederic Branczyk (@brancz) and he suggested following this approach, since it is very similar to what I've implemented before, so I used the same here 😄. This is part of #13.

Opening this PR for feedback to see if this sounds good; if so, I will do the same for the other janitors.
/assign @ixdy
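As a footnote on the exponential buckets mentioned in the description: `prometheus.ExponentialBuckets(start, factor, count)` returns `count` upper bounds, each `factor` times the previous one. The arguments below are only an example, not necessarily the values used in this PR:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Example bucket layout; the PR's actual start/factor/count are not shown here.
	fmt.Println(prometheus.ExponentialBuckets(1, 2, 10))
	// Output: [1 2 4 8 16 32 64 128 256 512]
}
```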