Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intersection Performance #2531

Open
PSeitz opened this issue Oct 25, 2024 · 2 comments · May be fixed by #2538
Open

Intersection Performance #2531

PSeitz opened this issue Oct 25, 2024 · 2 comments · May be fixed by #2538
Assignees

Comments

@PSeitz
Copy link
Contributor

PSeitz commented Oct 25, 2024

The current intersection code looks like this (simplified)

fn advance(&mut self) -> DocId {
    let (left, right) = (&mut self.left, &mut self.right);
    let mut candidate = left.advance();

    loop {
        let right_doc = right.seek(candidate);
        candidate = left.seek(right_doc);
        if candidate == right_doc {
            return candidate;
        }
    }
}

left is the doc_set with the smallest size_hint in the group. It is advanced first then we seek on right.
seek is not only going to the passed candidate, but searches for the next hit in that docset.

The problem with this approach is that we have Docset types with vastly different cost for seek which is not reflected here.

E.g. fast field range queries may do a full scan, if there is no next hit. phrase queries load the positions to check if a we have a hit.

Ideally we would use the "cheapest" docset as the driver for the intersection, and all others just check if they have a hit for the passed candidate.
Some docsets, like a precomputed bitset are excellent drivers, but some, like fast field based range, queries aren't.
Cheapness is typically related to number of docs in the docset only for same type of docsets. Between different docsets, we may have something like cost category (can be covered by e.g. using the high bits in u64).

To reflect that a cost based metric would make sense and an api that doesn't advance past target.

Proposal: Seek Exact + Cost Based

pub trait DocSet: Send {
    ...
    /// Returns a best-effort hint of the
    /// length of the docset.
    fn size_hint(&self) -> u32;

    /// Returns a best-effort hint of the
    /// cost to drive the docset.
    fn cost(&self) -> u64{
        self.size_hint()
    }

    /// Seeks to the target if necessary and checks if the target is an exact match.
    ///
    /// Some implementations may choose to advance past the target if beneficial for performance.
    /// The return value is `true` if the target is in the docset.
    fn seek_exact(&mut self, target: DocId) -> bool {
        let current_doc = self.doc();
        if current_doc < target {
            self.seek(target);
        }
        self.doc() == target
    }
   ...
}

Related: #2266

@PSeitz PSeitz self-assigned this Oct 25, 2024
@fulmicoton
Copy link
Collaborator

There is also the two phase approach of Lucene.

@PSeitz
Copy link
Contributor Author

PSeitz commented Oct 25, 2024

Yes, I had a look at that. TwoPhaseIterator::approximation returns a DocIdSetIterator where

The returned DocIdSetIterator is a superset of the matching documents, and each match 
needs to be confirmed with matches() in order to know whether it matches or not.

This should work good for PhraseQueries, and it seems this was designed with them in mind, since they have kind of two phases (match on the docset + match on the positions).
But for the FastFieldRangeQuery this would still forward the docset and produce the same issue as we have now.

The change and handling of the API would also probably have more overhead.

PSeitz added a commit that referenced this issue Nov 7, 2024
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.

`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.

Refactor range query benchmark.

Closes #2531

*Future Work*

Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity
@PSeitz PSeitz linked a pull request Nov 7, 2024 that will close this issue
PSeitz added a commit that referenced this issue Nov 7, 2024
Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.

`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.

Refactor range query benchmark.

Closes #2531

*Future Work*

Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants