
Add recommended Prometheus dashboards for Go. #809

Merged
2 commits merged on Aug 14, 2024

Conversation

@bwplotka (Contributor) commented Aug 12, 2024

This work was done in prep for our talk at GopherCon UK about Go Runtime Metrics.

[screenshot]

[screenshot]

Feedback welcome on the dashboard data, layout, and style!

Essentially it has all the metrics we maintain in client_golang (the most popular Go metrics SDK). The exposed metrics also align with the Go Team's recommendations: golang/go#67120


Signed-off-by: bwplotka <[email protected]>
@bwplotka (Contributor, Author):

Open questions:

  • This is based on GKE, but should technically work on GCE too? I didn't check, though.
  • Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?
  • I applied some grouping that makes sense to me; it kind of works with the auto-grouping feature (which I don't know how works technically), so I'm guessing a bit here about what the recommended grouping is.
  • Similarly, I used a mix of "global" filters vs. filters as vars. I'm not really sure when I should use vars vs. global filters, so I guessed a bit; feedback welcome (:

@yqlu yqlu self-requested a review August 12, 2024 13:47
@yqlu (Collaborator) left a comment


This is based on GKE, but should technically work on GCE too? I didn't check, though.

Cool, do you mean via the Ops Agent writing to prometheus_target? I'm not sure off the top of my head how the labels (e.g. cluster_name, namespace_name) are populated in that case...

Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.

[screenshot]

I applied some grouping that makes sense to me; it kind of works with the auto-grouping feature (which I don't know how works technically), so I'm guessing a bit here about what the recommended grouping is.

Can you elaborate on what you mean by grouping? Do you mean grouping the charts into the collapsible group widgets ("Version", "Memory", etc.), or the sum by (X, Y, Z) in the PromQL queries?

Similarly, I used a mix of "global" filters vs. filters as vars. I'm not really sure when I should use vars vs. global filters, so I guessed a bit; feedback welcome (:

Given that all of your charts are on prometheus_target metrics, I expect both to behave similarly. Here's how you would pick between the two:

  • template vars are nice when you want the filter to be opt-in, applying to chart A but not chart B
  • template vars are nice when not all the labels line up (e.g. you have a cluster_name template variable, but you want to apply it to a GKE system metric whose label is called cluster, or to a log).
  • the expansion of template vars is explicit when you inspect the query, but global vars are more implicit
    [screenshot]
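A minimal sketch of the contrast in the last bullet (the metric, label, and variable names here are illustrative, not taken from the actual dashboard, and the exact interpolation syntax may differ):

```promql
# Template variable: the chart stores a placeholder, and the UI
# interpolates it into the query before sending it, so the filter
# is visible when you inspect the final PromQL:
sum by (instance) (go_goroutines{cluster_name="${cluster}"})

# Global filter: the stored query stays unchanged; the filter is
# layered on top by the UI, so inspecting the query shows only:
sum by (instance) (go_goroutines)
```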

Hope this helps! I left a handful of formatting comments!

dashboards/go/go-runtime-view-prometheus.json (review threads outdated, resolved)
"widget": {
  "title": "Runtime Configuration",
  "collapsibleGroup": {
    "collapsed": true
  }
}
Collaborator:

Just double-checking whether it's intentional which sections are collapsed or uncollapsed by default on page load.

Contributor (Author):

Correct, those are useful only in certain, less common cases (but still important enough to have on the dashboard, per golang/go#67120).

Collaborator:

OK, sounds good. Consider whether you should keep it where it is or move it to the bottom (below Concurrency and Memory, which are open by default and presumably more general / widely applicable). I'll leave it up to you!

@johnbryan (Collaborator):

Metrics for Sched Latency and Runtime Configuration options are not yet very common (we are working on adopting them in OSS as we speak). This means those graphs likely won't work out of the box. I think that's fine, since those are new metrics, but maybe it would create a support burden?

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.

Another option (your call if this is right in this situation!) is to add an explanatory text widget (like this concept):

[screenshot: redis-mock]
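A hedged sketch of what such an explanatory text widget could look like in Cloud Monitoring dashboard JSON (the title and content are illustrative, and the exact schema should be checked against the Dashboard API reference):

```json
{
  "widget": {
    "title": "Note",
    "text": {
      "content": "The Sched Latency and Runtime Configuration charts rely on newer client_golang runtime metrics and may be empty for older library versions.",
      "format": "MARKDOWN"
    }
  }
}
```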

Signed-off-by: bwplotka <[email protected]>
@bwplotka (Contributor, Author):

Thanks all for the prompt review! I pushed changes to address the small nits, but for our discussion:

A) On GKE vs GCE:

Cool, do you mean via Ops Agent writing to prometheus_target? Not sure off the top of my head how the labels (e.g. cluster_name, namespace_name) are populated in that case...

Yes, or really using the GMP fork or anything (even OSS Prometheus). Cluster and namespace wouldn't be set (or would be fake), but that doesn't mean this dashboard wouldn't work, no?

B) On new or optional metrics

We have some precedent for this, e.g. for NVIDIA DCGM. The best practice here is to try to be explicit in the section or chart titles whenever some graphs may not be populated.
Another option (your call if this is right in this situation!) is to add an explanatory text widget (like this concept)

Great! Added some widgets.

C) On grouping (detail)

By auto-grouping I mean this feature:

[screenshot]

I assume this allows me to have fine-grained grouping, e.g. per instance (see the resulting graph, and the PromQL with sum by (project_id, location, cluster, job, namespace, instance)):

[screenshot]

But users can quickly change this to a more high-level view with Group by per, e.g., job (useful if you have thousands of instances across a dozen jobs):

[screenshot]

And it kind of works, but it's implicit (the query is unchanged, yet clearly another "sum" group by was applied on top):
[screenshot]

I think this makes me OK with having instance grouping everywhere and letting the auto-grouping/aggregation do the magic for the higher levels 👍🏽 I was just curious what the practice is there (e.g. sometimes it does not make sense to aggregate).
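The re-grouping described above can be sketched in PromQL (the metric name go_goroutines is illustrative; any dashboard query would behave the same way). The chart stores a fine-grained per-instance aggregation, and a job-level Group by effectively wraps another aggregation around it:

```promql
# Fine-grained query as authored in the dashboard: one series per instance.
sum by (project_id, location, cluster, job, namespace, instance) (go_goroutines)

# What a job-level Group by effectively computes: an extra aggregation
# layered on top of the unchanged stored query.
sum by (job) (
  sum by (project_id, location, cluster, job, namespace, instance) (go_goroutines)
)
```

This is also why the behavior feels implicit: the stored PromQL never changes, only the outer aggregation applied by the UI does.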

D) Template vars

Have you experienced issues with those "untemplated" global vars? They seem buggy, e.g. with them set to some value I can see filtering work on some graphs but not on others.

On the other hand, they are handy, as they correctly filter GKE workload deployments 🙈

I can reliably repro this bug on my dashboard here:

[screenshot]

Cluster is set, yet I see Instances by Version filtered by cluster correctly, but the table is not (literally the same PromQL is used!). Then if you edit Instances by Version, the filter is gone and you see all clusters again... I assume it's a bug, so maybe let's keep those cluster and location as "global" vars, but I would love to learn the specifics of how it's implemented 🤔

@yqlu (Collaborator) commented Aug 14, 2024

(C) / (D):

You can actually open the network tab to inspect the structure of the network requests being sent to the GCM API :)

As you can see, when you set template variables, the PromQL is expanded / interpolated browser-side, which is why the application is visible. When you set global "group bys" and "filters", they are applied as a drilldownState on top of the PromQL. I'm not an expert on that part, but you can ask the CMP team for more details!

[screenshot]

(D): Strange, I just did this and the cluster applied correctly to both the chart and the table side by side. If this reproduces reliably for you, can you record a screencast and file it against buganizer component 133331? We can figure out internally what is going on (whether it's an experiment flag, etc.).

[screenshot]

Collaborator:

Can you refresh the screenshots?

Contributor (Author):

Ah, forgot, thanks!

@yqlu yqlu merged commit 2217845 into GoogleCloudPlatform:master Aug 14, 2024
2 checks passed
@bwplotka bwplotka deleted the go-dashboard branch August 14, 2024 20:35
3 participants