Intersection Performance #2531

PSeitz · 2024-10-25T11:15:00Z

The current intersection code looks like this (simplified)

fn advance(&mut self) -> DocId {
    let (left, right) = (&mut self.left, &mut self.right);
    let mut candidate = left.advance();

    loop {
        let right_doc = right.seek(candidate);
        candidate = left.seek(right_doc);
        if candidate == right_doc {
            return candidate;
        }
    }
}

left is the doc_set with the smallest size_hint in the group. It is advanced first then we seek on right.
seek is not only going to the passed candidate, but searches for the next hit in that docset.

The problem with this approach is that we have Docset types with vastly different cost for seek which is not reflected here.

E.g. fast field range queries may do a full scan, if there is no next hit. phrase queries load the positions to check if a we have a hit.

Ideally we would use the "cheapest" docset as the driver for the intersection, and all others just check if they have a hit for the passed candidate.
Some docsets, like a precomputed bitset are excellent drivers, but some, like fast field based range, queries aren't.
Cheapness is typically related to number of docs in the docset only for same type of docsets. Between different docsets, we may have something like cost category (can be covered by e.g. using the high bits in u64).

To reflect that a cost based metric would make sense and an api that doesn't advance past target.

Proposal: Seek Exact + Cost Based

pub trait DocSet: Send {
    ...
    /// Returns a best-effort hint of the
    /// length of the docset.
    fn size_hint(&self) -> u32;

    /// Returns a best-effort hint of the
    /// cost to drive the docset.
    fn cost(&self) -> u64{
        self.size_hint()
    }

    /// Seeks to the target if necessary and checks if the target is an exact match.
    ///
    /// Some implementations may choose to advance past the target if beneficial for performance.
    /// The return value is `true` if the target is in the docset.
    fn seek_exact(&mut self, target: DocId) -> bool {
        let current_doc = self.doc();
        if current_doc < target {
            self.seek(target);
        }
        self.doc() == target
    }
   ...
}

Related: #2266

The text was updated successfully, but these errors were encountered:

fulmicoton · 2024-10-25T12:06:32Z

There is also the two phase approach of Lucene.

PSeitz · 2024-10-25T12:56:22Z

Yes, I had a look at that. TwoPhaseIterator::approximation returns a DocIdSetIterator where

The returned DocIdSetIterator is a superset of the matching documents, and each match 
needs to be confirmed with matches() in order to know whether it matches or not.

This should work good for PhraseQueries, and it seems this was designed with them in mind, since they have kind of two phases (match on the docset + match on the positions).
But for the FastFieldRangeQuery this would still forward the docset and produce the same issue as we have now.

The change and handling of the API would also probably have more overhead.

Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection. Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist. `cost` allows to address the different DocSet types and their cost model and is used to determine the DocSet that drives the intersection. E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit. They both have a higher cost than their size_hint would suggest. Improves `size_hint` estimation for intersection and union, by having a estimation based on random distribution with a co-location factor. Refactor range query benchmark. Closes #2531 *Future Work* Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries) Evaluate replacing `seek` with `seek_exact` to reduce code complexity

PSeitz self-assigned this Oct 25, 2024

PSeitz mentioned this issue Oct 28, 2024

Range query optim quickwit-oss/quickwit#3194

Open

PSeitz linked a pull request Nov 7, 2024 that will close this issue

seek_exact + cost based intersection #2538

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intersection Performance #2531

Intersection Performance #2531

PSeitz commented Oct 25, 2024

fulmicoton commented Oct 25, 2024

PSeitz commented Oct 25, 2024

Intersection Performance #2531

Intersection Performance #2531

Comments

PSeitz commented Oct 25, 2024

Proposal: Seek Exact + Cost Based

fulmicoton commented Oct 25, 2024

PSeitz commented Oct 25, 2024