Add support for total samples calculation in query stats. #173
Thanks for the PR and for reaching out early with a draft implementation. Pointing out the change to the core Can we instead inject a |
Thanks a lot for the guidance @fpetkovski. Injecting the stats object sounds like a good idea. We would need a mechanism to collect these stats and populate them in the compatibilityQuery object. As I understand it, we will inject the Stats object when we build operators, and then increment it using atomic operations as we collect samples. To confirm, is the suggestion to share the same stats object between the compatibility query and the underlying operators that perform the calculation? Another issue I observed is that concurrencyOperator does not have a
That's a good question, and I am not sure what the best approach would be here. The advantage of sharing the object is that it keeps the code simpler, and I don't expect the atomic increments to have a significant performance impact. But, yes, that would be the idea, and it's worth keeping an eye on performance regressions with the benchmarks we have in the repo.
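To make the shared-object idea concrete, here is a minimal sketch of a stats container that several operators increment with atomic operations. The names `QueryStats`, `AddSamples`, `TotalSamples` and `countConcurrently` are hypothetical and are not taken from the PR; the goroutines only stand in for operators running concurrently.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// QueryStats is a hypothetical stats container shared by a query and all of
// its operators. It is not the type used in the PR.
type QueryStats struct {
	totalSamples atomic.Int64
}

// AddSamples records n samples. It is safe for concurrent use.
func (s *QueryStats) AddSamples(n int64) { s.totalSamples.Add(n) }

// TotalSamples returns the number of samples recorded so far.
func (s *QueryStats) TotalSamples() int64 { return s.totalSamples.Load() }

// countConcurrently simulates `workers` operators that each record
// `perWorker` samples, one at a time, into the same shared stats object.
func countConcurrently(stats *QueryStats, workers, perWorker int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				stats.AddSamples(1)
			}
		}()
	}
	wg.Wait()
}

func main() {
	stats := &QueryStats{}
	countConcurrently(stats, 8, 1000)
	fmt.Println("total samples:", stats.TotalSamples())
}
```

Because every increment goes through one atomic counter, the total does not depend on how the goroutines interleave. Whether the contention is noticeable in practice is what the repository benchmarks would have to show.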
In theory each finished query will make sure the context is cancelled. Once the context is cancelled, all operators will terminate and I would expect memory to get released. The concurrency operator will also drain the channel before it terminates: https://github.com/thanos-community/promql-engine/blob/main/execution/exchange/concurrent.go#L93. Have you seen issues with the current termination mechanism in practice? |
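The termination pattern described above (cancel the context, then drain the channel) can be sketched roughly as follows. This is a simplified illustration and not the code in concurrent.go; `produce`, `drain` and `cancelAndDrain` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// produce sends increasing integers on a buffered channel until ctx is
// cancelled, then closes the channel.
func produce(ctx context.Context, buffer int) <-chan int {
	out := make(chan int, buffer)
	go func() {
		defer close(out)
		for i := 0; ; i++ {
			select {
			case <-ctx.Done():
				return
			case out <- i:
			}
		}
	}()
	return out
}

// drain reads until the channel is closed, so the producer is never left
// blocked on a send. It returns how many values were discarded.
func drain(ch <-chan int) int {
	n := 0
	for range ch {
		n++
	}
	return n
}

// cancelAndDrain starts a producer, consumes one value, cancels the context
// and drains the rest. It returns the first value and whether the channel
// was observed closed afterwards.
func cancelAndDrain(buffer int) (first int, closed bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ch := produce(ctx, buffer)
	first = <-ch
	cancel()
	drain(ch)
	_, ok := <-ch
	return first, !ok
}

func main() {
	first, closed := cancelAndDrain(4)
	fmt.Println("first value:", first, "closed:", closed)
}
```

Note that values still sitting in the buffer at cancellation time are discarded here, which is one way a sample count taken on the consumer side can come out lower than the number of samples actually read.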
I looked further into the suggestion of incrementing stats using an atomic operation @fpetkovski but ran into an issue with incrementing stats in vector and matrix object, specifically with the Prometheus One mechanism for us to solve this is to return samples considered in the I do wonder if this calculation will get a bit dicey when we consider distributed execution, as we will need to subtract duplicated samples in the dedup operator. I would like thanos/promql-engine to support query samples for distributed execution in the future. (I have not completely thought the distributed path through and am not sure how easy it will be to support the sample calculation in distributed mode, tbh.)
In my testing, I saw that tests were calculating fewer samples compared to vanilla Prometheus. Further debugging suggested that goroutines in
Thanks for the elaboration @sahnib. I am out sick this week but I will revisit this again on Monday. |
@sahnib would you mind adding some failing tests as part of the PR so I can get a better insight into the problem? I am not familiar with how query stats work in the Prometheus engine and getting some concrete examples would help quite a bit. |
(force-pushed from 82d7f49 to c4cffba)
Yeah, no worries @fpetkovski. I updated the PR with changes based on atomic counters. If I exclude the changes in step_invariant.go file (https://github.com/thanos-community/promql-engine/pull/173/files#diff-15039040719a84e49e402377c62ed1606af8f83e9d278e3e2b695b8eacfa90c2) and run the |
I took a stab at this to get an insight into the mismatch we get with the Prometheus engine: https://github.com/thanos-community/promql-engine/compare/main...fpetkovski:engine-query-stats?expand=1. I also took a look at the documentation for the
Looking at the test failures, I see that most of them are caused by the step invariant optimization, or a short-circuit in the binary operator. This in our case produces lower values for Here are my thoughts on it:
So to summarize, I would focus on making sure we count decoded samples correctly instead of trying to match what Prometheus does in this case. Let me know what you think. |
Thanks for your input @fpetkovski. I was a bit tied up last week and could not get to this. The feedback makes sense to me. However, I looked further into Prometheus, and it seems like there was a decision made to not account for optimizations that happen inside the query engine specifically. (The comment for
We can choose to diverge from Prometheus here, but I think that the decision to exclude optimizations was made to allow users to rely on total samples regardless of the query engine used. Let me know what you think.
Given that with this implementation we want to focus on performance, adding a drain method would likely go against that goal and defeat the purpose of adding various optimizations. I am okay with counting samples from the step invariant operator if we can do it by injecting a flag when the downstream operator is a vector/matrix selector. We should be able to do that here: https://github.com/thanos-community/promql-engine/blob/main/execution/execution.go#L225-L234 |
Thanks @fpetkovski. Your comments make sense to me. The PR got accidentally closed due to GitHub workflow syncing my main branch (apologies for that).
I am aligned here. One of the primary motivations for the new engine is to improve performance. Unless we can come up with some way to add these samples without evaluation, we should skip it. I will exclude the change to run these operators, and modify tests accordingly.
Thanks for your input. I will try this and get back to you. This will help us avoid the interface change, so I like the idea. |
Perhaps a simpler approach would be the 2nd option #106 (comment) here to avoid extending the interface? |
@fpetkovski I have removed the changes to drain the operators in favor of performance. However, I tried the proposal to
@GiedriusS thanks for your input. Currently, we are handling this by passing down queryStats object inside the vector/matrix selector objects. Passing around In order to help us move in some direction, I have removed the changes in stepInvariant Operator for now (until we settle on a mechanism to aggregate these samples in query Stats). @fpetkovski @GiedriusS Let me know if the subset of changes look good. Would it be possible for us to jump on a call to get to a resolution on options between passing around |
(force-pushed from 00c5c6d to 9fbbbe2)
@sahnib We spoke with Giedrius in Slack about this. Let's extend the interface methods by adding an As we discussed during community hours, we can omit the step invariant operator for now and deal with it in a separate issue.
Thanks @fpetkovski I will make these changes in the next couple days. |
(force-pushed from bb4a6ec to 10e53ba)
Thanks, looks good to me!
execution/scan/matrix_selector.go (outdated)
@@ -299,6 +321,7 @@ loop:
    }
    // Values in the buffer are guaranteed to be smaller than maxt.
    if t >= mint {
        currSamples += 1
We are just adding 1 to this variable but not using it anywhere? Also, what about selectExtPoints? 🤔 I'm surprised that the linter hasn't caught this 😄
Thanks for catching this. It's a remnant of my refactoring :). I have removed it.
The actual code to count samples is abstracted out into the func (o *matrixSelector) countSamples(points []promql.Point) int64 function (called at line 156 in the matrix_selector.go file). This avoids duplicating the logic in both selectExtPoints and selectPoints.
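As an illustration of why a single counting helper avoids the duplication, here is a rough sketch with stand-in types. The `point` struct is not the real promql.Point, the bodies of `selectPoints` and `selectExtPoints` are simplified guesses (in particular, keeping the last point before the window in the extended variant is an assumption), and `selectAndCount` is a hypothetical wrapper.

```go
package main

import "fmt"

// point is a stand-in for a sample; it is not the real promql.Point.
type point struct {
	T int64
	V float64
}

// selectPoints keeps the points with mint <= T <= maxt.
// The input is assumed to be sorted by timestamp.
func selectPoints(in []point, mint, maxt int64) []point {
	var out []point
	for _, p := range in {
		if p.T >= mint && p.T <= maxt {
			out = append(out, p)
		}
	}
	return out
}

// selectExtPoints is assumed here to additionally keep the last point
// before mint. The real function may behave differently.
func selectExtPoints(in []point, mint, maxt int64) []point {
	var out []point
	for i, p := range in {
		if p.T > maxt {
			break
		}
		if p.T >= mint {
			out = append(out, p)
			continue
		}
		// p lies before the window: keep it only if it is the last such point.
		if i+1 == len(in) || in[i+1].T >= mint {
			out = append(out, p)
		}
	}
	return out
}

// countSamples is the single place where selected points become a count.
func countSamples(points []point) int64 { return int64(len(points)) }

// selectAndCount picks a selection path and counts its result once, so
// neither selection function needs its own counting logic.
func selectAndCount(in []point, mint, maxt int64, extended bool) ([]point, int64) {
	var selected []point
	if extended {
		selected = selectExtPoints(in, mint, maxt)
	} else {
		selected = selectPoints(in, mint, maxt)
	}
	return selected, countSamples(selected)
}

func main() {
	in := []point{{0, 1}, {10, 2}, {20, 3}, {30, 4}, {40, 5}}
	_, plain := selectAndCount(in, 15, 35, false)
	_, ext := selectAndCount(in, 15, 35, true)
	fmt.Println("plain:", plain, "extended:", ext)
}
```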
…m vector operators during query execution. The OperatorTracer is passed to the VectorOperator during Series() and Next() operations, and is a container to aggregate o11y information. Currently, the vector selector and matrix selector operators use this tracer for calculating query samples.
Thanks for your PR, we will discuss this topic a bit more with @PradyumnaKrishna @saswatamcode as this closely relates to our operator tracing project and come back to you. |
We are working on exposing Explain() first through the Thanos UI so will come to your PR in a week or so, even if we won't merge this as-is it is definitely a very good example that we will start our work from! |
This PR adds support for calculating total samples during a query, and returning this result as query statistics.
Issue: #148
Prometheus defines totalQueryableSamples as the total number of samples read out of the underlying Queryable instance, inclusive of any samples that are buffered between range steps. The test cases have been modified to compare the totalQueryableSamples value between the Prometheus engine and the Thanos promql-engine, and to ensure they are the same.
I am still working on ensuring the samples are calculated correctly in the distributed engine; however, filing the PR early allows us to make sure we are aligned on the interface changes.
There are 2 methods added to the model.VectorOperator interface:
Stats(): This method provides the query stats after this operator has been evaluated.
Drain(): Drains the operator so that all running goroutines are finished. This is required to be able to calculate total samples in the query.
The base operators like vectorSelector and matrixSelector calculate query samples in the Next() function. Higher-level operators rely on the operators used inside them as necessary. Some higher-level operators like dedup need to re-calculate total samples because the underlying operators can potentially double count some series.
We can extend this logic to enforce the maxSamples limit at the operator level in the future.
Signed-off-by: sahnib [email protected]
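The split between base operators that count in Next() and higher-level operators that rely on their children could look roughly like this. It is a much-reduced sketch: `operator` stands in for model.VectorOperator, `selector`, `sum` and `run` are hypothetical, Stats() returns a bare count, and Drain() and the dedup re-calculation are left out.

```go
package main

import "fmt"

// operator is a hypothetical, much-reduced stand-in for model.VectorOperator.
type operator interface {
	// Next returns the next batch of values, or nil when exhausted.
	Next() []float64
	// Stats returns the samples counted so far by this operator and its children.
	Stats() int64
}

// selector is a leaf operator. It serves pre-loaded batches and counts
// every sample it hands out in Next().
type selector struct {
	batches [][]float64
	samples int64
}

func (s *selector) Next() []float64 {
	if len(s.batches) == 0 {
		return nil
	}
	batch := s.batches[0]
	s.batches = s.batches[1:]
	s.samples += int64(len(batch))
	return batch
}

func (s *selector) Stats() int64 { return s.samples }

// sum is a higher-level operator. It reads no samples from storage itself,
// so its Stats() simply reports what its child counted.
type sum struct{ child operator }

func (a *sum) Next() []float64 {
	batch := a.child.Next()
	if batch == nil {
		return nil
	}
	total := 0.0
	for _, v := range batch {
		total += v
	}
	return []float64{total}
}

func (a *sum) Stats() int64 { return a.child.Stats() }

// run evaluates op to exhaustion and returns its final sample count.
func run(op operator) int64 {
	for op.Next() != nil {
	}
	return op.Stats()
}

func main() {
	op := &sum{child: &selector{batches: [][]float64{{1, 2}, {3}}}}
	fmt.Println("total samples:", run(op))
}
```

An operator such as dedup would not be able to delegate like `sum` does, since it would have to subtract the samples its children counted more than once.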