[META] Query profiler support in Query Insights Dashboards #104

ansjcy · 2025-02-07T23:01:21Z

Is your feature request related to a problem?

Currently, OpenSearch users lack native UI tooling to analyze query performance bottlenecks, as outlined in OpenSearch-Dashboards Issue #571. This gap forces developers to rely on the profiling API and manual log analysis when debugging slow queries. The absence of integrated profiling capabilities within Query Insights (metadata, similar historical bad queries, recommendations, etc.) also prevents users from connecting Top N queries with more granular, real-time execution details and with guidance on how to improve the profiled queries.

What solution would you like?

We propose a Profiling UI integrated with the Query Insights dashboard to create an end-to-end performance analysis experience. We also want to support displaying historical similar rogue queries in the profiler utilizing the Top N Queries data. Furthermore, we intend to bridge the gap between understanding an issue and knowing what to do to resolve it in the profiler - the profiler should surface actionable recommendations to users on how to resolve and improve problematic queries, such as rewriting queries, adjusting underlying index configurations, or enabling specific OpenSearch features.

More specifically, with the below mock profiler page,

we want to enable users to:

Deep Inspection of any query
- Profile queries across multiple execution layers with easy-to-use interfaces.
- Visualize time distribution across query execution phases, with easy-to-use visualization designs (to be discussed further in the comment threads).
- Add more metrics to the profiling API to enable analyzing resource utilization (CPU, memory, I/O) at different component levels.
Context-Aware Navigation from the Top N Queries page
- One-click profiling from the Top Queries list to the profiling page for further root-cause analysis.
- Correlate profiled queries with similar queries in the historical Top N queries, as shown under "Similar Queries In Top N Historical Queries" in the above mock screen. Arguably, we should also provide visualizations for these comparisons, such as a time series heatmap or a parallel coordinates plot for closely comparing different phases/dimensions, similar to the chart shown below.
Intelligent Recommendations
- Surface optimization suggestions from the Recommendations Engine alongside profiling data
- Display actionable insights with confidence scoring and expected impact metrics
- Enable direct application of recommendations through UI controls (future scope)
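
As a rough illustration of the "time distribution across query execution phases" item, here is a minimal sketch that turns one `breakdown` object from a profile response into percentage shares. It assumes the breakdown is a flat mapping of timing keys (such as `create_weight`, `build_scorer`, `next_doc`, `score`) plus `*_count` invocation counters; the function name `phase_time_share` and the sample numbers are made up for illustration.

```python
from typing import Dict


def phase_time_share(breakdown: Dict[str, int]) -> Dict[str, float]:
    """Return each timed phase's share of the total time, as a percentage.

    Keys ending in "_count" are invocation counters, not timings, so they
    are skipped before computing the shares.
    """
    timings = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
    total = sum(timings.values())
    if total == 0:
        # Avoid dividing by zero when nothing was timed.
        return {k: 0.0 for k in timings}
    return {k: round(100.0 * v / total, 1) for k, v in timings.items()}


# Made-up numbers, for illustration only (not real profiler output).
sample_breakdown = {
    "create_weight": 2_000,
    "build_scorer": 6_000,
    "next_doc": 10_000,
    "score": 2_000,
    "next_doc_count": 5,
    "score_count": 5,
}
phase_shares = phase_time_share(sample_breakdown)
```

A UI could feed these shares into a stacked bar or pie per query node; which chart works best is exactly what the comment threads are meant to discuss.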

Subtasks

TO BE ADDED

What alternatives have you considered?

External Profiling Tools
While third-party APM solutions exist, they require complex instrumentation and lack native integration with OpenSearch query metadata.

Do you have any additional context?

ansjcy · 2025-02-11T01:57:30Z

There are also several other possible visualizations we can use for the profiler output besides the flame graph. I'll explain them in the comment threads below, and we can discuss further from there!

Note: I'll use the below fake OpenSearch profiler output to create the mock visualizations :)
fake_opensearch_profile_output.json

ansjcy · 2025-02-11T02:08:33Z

One possibility is to use a Gantt chart to visualize the profiler results. Example Gantt chart: https://shybovycha.github.io/2020/08/02/gantt-chart-part2.html.

The motivation for using a Gantt chart is that the flame graph is fine for showing nested execution, but it’s not great for time-based analysis, which is what we actually care about when profiling query performance - we want to know how the search flows across different shards. If we can attach a timestamp to each phase of the profiler output, we can structure the phases as a timeline, and in that case a Gantt chart makes far more sense than a flame graph. The benefits are:

Clear execution breakdown per shard – Instead of showing just nested execution, a Gantt chart maps each shard’s phases (query, fetch, aggregation) to a timeline, making it easy to compare performance across shards.
Better at spotting bottlenecks – If one phase is running longer than expected, it’ll stand out immediately on the chart. This is hard to see in a flame graph, where time is not explicitly mapped on an axis.
More intuitive parallel execution view – a profiled query can run across multiple shards at the same time, but the flame graph doesn’t show this well. A Gantt chart will show when each shard starts and finishes, so we can visually inspect query parallelism.

Here’s a rough idea of how it could look:

In the above mock chart:

X-axis → Time (milliseconds)
Y-axis → Shard IDs
Bars → Each phase (query, fetch, aggregation) as a separate segment
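
To make the axis mapping above concrete, here is a minimal sketch of the data a Gantt renderer would need. It assumes each phase carries a start offset and a duration (`start_ms`, `duration_ms`), which the profiler output does not provide today; those field names, the `GanttBar` type, and the shard IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GanttBar:
    shard_id: str    # Y-axis row
    phase: str       # bar label, e.g. "query" or "fetch"
    start_ms: float  # X-axis start
    end_ms: float    # X-axis end


def to_gantt_bars(shards: List[Dict]) -> List[GanttBar]:
    """Flatten per-shard phases into one bar per (shard, phase)."""
    bars = []
    for shard in shards:
        for phase in shard["phases"]:
            start = phase["start_ms"]
            end = start + phase["duration_ms"]
            bars.append(GanttBar(shard["id"], phase["name"], start, end))
    # Group rows by shard, then order bars left to right within a row.
    bars.sort(key=lambda b: (b.shard_id, b.start_ms))
    return bars


def slowest_bar(bars: List[GanttBar]) -> GanttBar:
    """Return the longest bar, i.e. the most obvious bottleneck."""
    return max(bars, key=lambda b: b.end_ms - b.start_ms)


# Hypothetical input with made-up timings.
mock_shards = [
    {"id": "shard-0", "phases": [
        {"name": "query", "start_ms": 0.0, "duration_ms": 12.0},
        {"name": "fetch", "start_ms": 12.0, "duration_ms": 3.0},
    ]},
    {"id": "shard-1", "phases": [
        {"name": "query", "start_ms": 1.0, "duration_ms": 30.0},
        {"name": "fetch", "start_ms": 31.0, "duration_ms": 4.0},
    ]},
]
gantt_bars = to_gantt_bars(mock_shards)
bottleneck = slowest_bar(gantt_bars)
```

Each `GanttBar` maps directly to one rectangle in the chart, so the same list could be handed to whatever rendering library is chosen.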

As for the implementation, we can use D3.js to render the Gantt chart. One potential challenge is that we would also need to enhance the profiler API to expose execution timestamps.
We should also consider adding:

Hover tooltips – Show CPU/memory usage and time taken for each phase, like this mock:

Click to expand – Expand a phase to see its child tasks, or simply use a tooltip to show information about the child tasks.

Interactive filtering – Toggle different phases to analyze performance.

If everyone’s on board, we can work on a quick POC using real profiler data and run user studies with it. I really think this will improve how we analyze profiler results and make it easier to debug slow queries.
Thoughts? 🚀

ansjcy · 2025-02-11T02:28:31Z

HeatMap with TreeMap:

If we want to focus on analyzing and comparing resource usage (i.e. CPU usage, memory usage) across shards and potentially identify hotspots in a profiled query, we can also use a heatmap with drill-down supported by treemaps. With other visualizations, such as the flame graph or the proposed Gantt chart, we can see execution times, but we don’t have a great way to compare resource consumption across shards. A heatmap would make it easy to spot shards that are consuming excessive resources.
Example heatmap: https://d3-graph-gallery.com/heatmap

Here’s a mock example of what it could look like for the profiler (with CPU as the metric):

In the above example:

X-axis → Shard IDs
Y-axis → Resource type (CPU usage, Memory usage)
We can also add a tooltip on hover to show exact CPU/memory usage and further details for each shard.
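
The grid behind such a heatmap could be built roughly as follows. This is only a sketch: it assumes per-shard resource metrics are available, which would require the profiling API additions proposed in this issue, and the metric names (`cpu_ms`, `memory_mb`), shard IDs, and `build_heatmap` helper are hypothetical. Each metric row is scaled to its own peak so that CPU and memory can share one color scale.

```python
from typing import Dict, List, Tuple


def build_heatmap(
    usage: Dict[str, Dict[str, float]], metrics: List[str]
) -> Tuple[List[str], List[List[float]]]:
    """Return (shard_ids, grid) where grid[row][col] is in the range 0..1.

    Rows follow `metrics` (Y-axis); columns follow sorted shard IDs (X-axis).
    """
    shard_ids = sorted(usage)
    grid = []
    for metric in metrics:
        row = [usage[shard][metric] for shard in shard_ids]
        peak = max(row) if row else 0
        # Normalize per metric so the hottest shard in each row is 1.0.
        grid.append([round(v / peak, 2) if peak else 0.0 for v in row])
    return shard_ids, grid


# Made-up usage numbers, for illustration only.
mock_usage = {
    "shard-0": {"cpu_ms": 20, "memory_mb": 50},
    "shard-1": {"cpu_ms": 80, "memory_mb": 100},
    "shard-2": {"cpu_ms": 40, "memory_mb": 25},
}
heat_shards, heat_grid = build_heatmap(mock_usage, ["cpu_ms", "memory_mb"])
```

The raw values would still be needed for the hover tooltip; the normalized grid only drives cell color.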

To drill down into a shard, we can support click-to-zoom from each cell into a treemap like the one below:

The benefits of using a heatmap and treemap include:

Clear hotspot detection – If some shards are using way more CPU or memory, they’ll be instantly visible in a heatmap.
Better at identifying imbalanced workloads – Right now, we don’t have a good way to compare CPU/memory usage across shards in a profiled query. A heatmap will highlight shards that are underperforming due to resource constraints.
Scales well – Unlike bar charts, a heatmap can handle dozens of shards without cluttering the UI.

Again, to decide whether this is the ideal visualization, we need to do some user studies with a POC.

kkhatua · 2025-02-11T06:58:03Z

@ansjcy
I like the idea of Gantt charts, but I don't believe the profiler output JSON actually provides wall-clock timestamps.

We would need to assume that some operations happen in sequence (e.g. can_match followed by query followed by fetch), while others, such as shard-level operations, happen in parallel. This might not hold if a search request was run with a custom value for max_concurrent_shard_requests.
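
A minimal sketch of that assumption, using only durations: phases within a shard are laid end to end in a fixed order, and every shard is assumed to start at time zero. The input shape, the `layout_sequential` name, and the numbers are hypothetical, and the "all shards start together" simplification is exactly what breaks when shard requests are throttled.

```python
from typing import Dict, List, Tuple

# Assumed execution order within a single shard.
PHASE_ORDER = ["can_match", "query", "fetch"]


def layout_sequential(
    durations_ms: Dict[str, Dict[str, float]]
) -> Dict[str, List[Tuple[str, float, float]]]:
    """Derive (phase, start, end) offsets per shard from durations alone."""
    timeline = {}
    for shard_id, phases in durations_ms.items():
        cursor = 0.0  # assumption: every shard starts at t=0
        bars = []
        for name in PHASE_ORDER:
            if name not in phases:
                continue  # phase did not run or was not profiled
            bars.append((name, cursor, cursor + phases[name]))
            cursor += phases[name]
        timeline[shard_id] = bars
    return timeline


# Made-up durations, for illustration only.
mock_durations = {
    "shard-0": {"can_match": 1.0, "query": 10.0, "fetch": 2.0},
    "shard-1": {"query": 25.0, "fetch": 5.0},
}
relative_timeline = layout_sequential(mock_durations)
```

A chart built this way shows relative ordering and proportions rather than true wall-clock alignment across shards.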

We should probably look at what fields are relevant to graphing and the rest can probably be embedded in the hover tooltip or some other mechanism to bubble up the details on demand.

ansjcy added enhancement New feature or request untriaged labels Feb 7, 2025

ansjcy changed the title ~~[FEATURE] Support query profiler in query insights dashboards.~~ [FEATURE] Query profiler support in Query Insights Dashboards Feb 7, 2025

ansjcy changed the title ~~[FEATURE] Query profiler support in Query Insights Dashboards~~ [META] Query profiler support in Query Insights Dashboards Feb 11, 2025

ansjcy removed the untriaged label Feb 11, 2025

kkhatua assigned dzane17 Feb 13, 2025
