
Add recommended Prometheus dashboards for Go. #809

Merged
2 commits merged on Aug 14, 2024

Conversation

@bwplotka (Contributor) commented Aug 12, 2024

This work was done in prep for our talk at GopherCon UK about Go Runtime Metrics.

[screenshot]

[screenshot]

Feedback welcome on the dashboard data, layout, and style!

Essentially it has all the metrics we maintain in client_golang (the most popular Go metrics SDK). The exposed metrics also align with the Go Team's recommendations: golang/go#67120


Signed-off-by: bwplotka <[email protected]>
@bwplotka (Contributor, Author):

Open questions:

  • This is based on GKE, but should technically work on GCE too? I didn't check, though.
  • Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?
  • I applied some grouping that makes sense to me; it kind of works with the auto-grouping feature (which I don't know how works technically), so I'm guessing a bit here about what the recommended grouping is.
  • Similarly, I used a mix of "global" filters vs. filters as vars. I'm not really sure when I should use vars vs. global filters, so I guessed a bit; feedback welcome (:

@yqlu yqlu self-requested a review August 12, 2024 13:47
@yqlu (Collaborator) left a comment


This is based on GKE, but should technically work on GCE too? I didn't check, though.

Cool, do you mean via the Ops Agent writing to prometheus_target? I'm not sure off the top of my head how the labels (e.g. cluster_name, namespace_name) are populated in that case...

Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.

[screenshot]

I applied some grouping that makes sense to me; it kind of works with the auto-grouping feature (which I don't know how works technically), so I'm guessing a bit here about what the recommended grouping is.

Can you elaborate on what you mean by grouping? Do you mean grouping the charts into the collapsible group widgets ("Version", "Memory", etc.), or the sum by (X, Y, Z) in the PromQL queries?

Similarly, I used a mix of "global" filters vs. filters as vars. I'm not really sure when I should use vars vs. global filters, so I guessed a bit; feedback welcome (:

Given that all of your charts are on prometheus_target metrics, I expect both to behave similarly. Here's how you would pick between the two:

  • template vars are nice when you want the filter to be opt-in, applying to chart A but not chart B
  • template vars are nice when not all the labels line up (e.g. you have a cluster_name template variable, but you want to apply it to a GKE system metric whose label is called cluster, or to a log).
  • the expansion of template vars is explicit when you inspect the query, but global vars are more implicit
    [screenshot]
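A minimal sketch of the contrast in the last bullet (the metric, label, and variable names here are illustrative, not taken from the actual dashboard, and the exact interpolation syntax may differ):

```promql
# Template variable: the chart stores a placeholder, and the UI
# interpolates it into the query before sending it, so the filter
# is visible when you inspect the final PromQL:
sum by (instance) (go_goroutines{cluster_name="${cluster}"})

# Global filter: the stored query stays unchanged; the filter is
# layered on top by the UI, so inspecting the query shows only:
sum by (instance) (go_goroutines)
```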

Hope this helps! I left a handful of formatting comments!

dashboards/go/go-runtime-view-prometheus.json (review threads outdated, resolved)
"widget": {
  "title": "Runtime Configuration",
  "collapsibleGroup": {
    "collapsed": true
  }
}
Collaborator:

Just double-checking whether it's intentional which sections are collapsed or uncollapsed by default on page load.

Contributor (Author):

Correct, those are useful only in certain, less common cases (but still important enough to have on the dashboard, per golang/go#67120).

Collaborator:

OK, sounds good. Consider whether you should keep it where it is or move it to the bottom (below Concurrency and Memory, which are open by default and presumably more general / widely applicable). I'll leave it up to you!

@johnbryan (Collaborator):

Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.

Another option (your call if this is right in this situation!) is to add an explanatory text widget (like this concept):

[screenshot: redis-mock]
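A hedged sketch of what such an explanatory text widget could look like in Cloud Monitoring dashboard JSON (the title and content are illustrative, and the exact schema should be checked against the Dashboard API reference):

```json
{
  "widget": {
    "title": "Note",
    "text": {
      "content": "The Sched Latency and Runtime Configuration charts rely on newer client_golang runtime metrics and may be empty for older library versions.",
      "format": "MARKDOWN"
    }
  }
}
```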

Signed-off-by: bwplotka <[email protected]>
@bwplotka (Contributor, Author):

Thanks all for the prompt review! I pushed changes to address the small nits, but for our discussion:

A) On GKE vs GCE:

Cool, do you mean via Ops Agent writing to prometheus_target? Not sure off the top of my head how the labels (e.g. cluster_name, namespace_name) are populated in that case...

Yes, or really using the GMP fork or anything (even OSS Prometheus). Cluster and namespace wouldn't be set (or would be fake), but that doesn't mean this dashboard wouldn't work, no?

B) On new or optional metrics

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.
Another option (your call if this is right in this situation!) is to add an explanatory text widget (like this concept)

Great! Added some widgets.

C) On grouping (detail)

By auto-grouping I mean this feature:

[screenshot]

I assume this allows me to have fine-grained grouping, e.g. per instance (see the resulting graph, and the PromQL with sum by (project_id, location, cluster, job, namespace, instance)):

[screenshot]

But users can quickly change this to a more high-level view with Group by per, e.g., job (useful if you have thousands of instances across a dozen jobs):

[screenshot]

And it kind of works, but it's implicit (the query is unchanged, yet clearly another "sum" group by was applied on top):
[screenshot]

I think this makes me OK with having instance grouping everywhere and letting the auto-grouping/aggregation do the magic for the higher levels 👍🏽 I was just curious what the practice is there (e.g. sometimes it does not make sense to aggregate).
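The re-grouping described above can be sketched in PromQL (the metric name go_goroutines is illustrative; any dashboard query would behave the same way). The chart stores a fine-grained per-instance aggregation, and a job-level Group by effectively wraps another aggregation around it:

```promql
# Fine-grained query as authored in the dashboard: one series per instance.
sum by (project_id, location, cluster, job, namespace, instance) (go_goroutines)

# What a job-level Group by effectively computes: an extra aggregation
# layered on top of the unchanged stored query.
sum by (job) (
  sum by (project_id, location, cluster, job, namespace, instance) (go_goroutines)
)
```

This is also why the behavior feels implicit: the stored PromQL never changes, only the outer aggregation applied by the UI does.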

D) Template vars

Have you experienced issues with those "untemplated" global vars? They seem buggy, e.g. with them set to some value I can see filtering work on some graphs but not on others.

On the other hand, they are handy, as they correctly filter GKE workload deployments 🙈

I can reliably repro this bug on my dashboard here:

[screenshot]

Cluster is set, yet I see Instances by Version filtered by cluster correctly, but the table is not (literally the same PromQL is used!). Then if you edit Instances by Version, the filter is gone and you see all clusters again... I assume it's a bug, so maybe let's keep those cluster and location as "global" vars, but I would love to learn the specifics of how it's implemented 🤔

@yqlu (Collaborator) commented Aug 14, 2024

(C) / (D):

You can actually open the network tab to inspect the structure of the network requests being sent to the GCM API :)

As you can see, when you set template variables, the PromQL is expanded / interpolated browser-side, which is why the application is visible. When you set global "group bys" and "filters", they are applied as a drilldownState on top of the PromQL. I'm not an expert on that part, but you can ask the CMP team for more details!

[screenshot]

(D): Strange, I just did this and the cluster applied correctly to both the chart and the table side by side. If this reproduces reliably for you, can you record a screencast and file it against buganizer component 133331? We can figure out internally what is going on (whether it's an experiment flag, etc.).

[screenshot]

Collaborator:

Can you refresh the screenshots?

Contributor (Author):

Ah, forgot, thanks!

@yqlu yqlu merged commit 2217845 into GoogleCloudPlatform:master Aug 14, 2024
2 checks passed
@bwplotka bwplotka deleted the go-dashboard branch August 14, 2024 20:35
3 participants