-
-
Notifications
You must be signed in to change notification settings - Fork 672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intersection Performance #2531
Comments
There is also the two phase approach of Lucene. |
Yes, I had a look at that.
This should work good for PhraseQueries, and it seems this was designed with them in mind, since they have kind of two phases (match on the docset + match on the positions). The change and handling of the API would also probably have more overhead. |
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 *Future Work* Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 *Future Work* Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity
The current intersection code looks like this (simplified)
left
is the doc_set with the smallestsize_hint
in the group. It is advanced first then we seek onright
.seek
is not only going to the passed candidate, but searches for the next hit in that docset.The problem with this approach is that we have Docset types with vastly different cost for seek which is not reflected here.
E.g. fast field range queries may do a full scan, if there is no next hit. phrase queries load the positions to check if a we have a hit.
Ideally we would use the "cheapest" docset as the driver for the intersection, and all others just check if they have a hit for the passed candidate.
Some docsets, like a precomputed bitset are excellent drivers, but some, like fast field based range, queries aren't.
Cheapness is typically related to number of docs in the docset only for same type of docsets. Between different docsets, we may have something like cost category (can be covered by e.g. using the high bits in u64).
To reflect that a cost based metric would make sense and an api that doesn't advance past
target
.Proposal: Seek Exact + Cost Based
Related: #2266
The text was updated successfully, but these errors were encountered: