
Reverse reverse index #2505

Open
NickDarvey opened this issue Oct 4, 2024 · 6 comments

NickDarvey commented Oct 4, 2024

I want to look up the terms for a given document, that is, Document -> Field -> Terms, similar to what term vectors provide in Lucene. (However, I see positions are already stored a little differently in Tantivy.)

The use cases are things like analysing the term distribution in a document (for text classification, summarization, or highlighting query terms) and copying individual indexed documents to another index.

I'm thinking of this like a HashMap<DocId, Vec<Term>>, where Term (somehow) is a reference to the Term in the termdict. There would be one of these reverse reverse indexes per inverted-index segment, so we (somehow) need to participate in the merge process. I notice that Lucene.NET has an interface 'InvertedDocConsumer', which is how term vectors (and something called 'freqprox') hook into the indexing chain, so maybe that's a place to draw inspiration.
Edit: It looks like Recorder might be the right interface in Tantivy for writing to this new index. For example, TermFrequencyRecorder.
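For what it's worth, the HashMap<DocId, Vec<Term>> idea can be sketched with plain std collections. Everything here (TermVectors, TermOrd, the method names) is made up for illustration, not a Tantivy type; terms are stored as ordinals into a toy per-segment dictionary:

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical ordinal referencing a term in the segment's termdict.
type TermOrd = u64;
type DocId = u32;

/// A toy per-segment "reverse reverse index": doc -> term ordinals.
/// In a real implementation the ordinals would reference the segment's
/// termdict and would need remapping when segments are merged.
#[derive(Default)]
struct TermVectors {
    /// Term dictionary for this segment (ordinals assigned in arrival order).
    dict: BTreeMap<String, TermOrd>,
    /// Document -> ordinals of the terms it contains.
    per_doc: HashMap<DocId, Vec<TermOrd>>,
}

impl TermVectors {
    fn record(&mut self, doc: DocId, term: &str) {
        let next = self.dict.len() as TermOrd;
        let ord = *self.dict.entry(term.to_string()).or_insert(next);
        self.per_doc.entry(doc).or_default().push(ord);
    }

    /// Resolve the terms recorded for a document back to strings.
    fn terms(&self, doc: DocId) -> Vec<&str> {
        let ord_to_term: HashMap<TermOrd, &str> =
            self.dict.iter().map(|(t, &o)| (o, t.as_str())).collect();
        self.per_doc
            .get(&doc)
            .map(|ords| ords.iter().map(|o| ord_to_term[o]).collect())
            .unwrap_or_default()
    }
}
```

The interesting part is really the merge question: because ordinals are per-segment, merging two of these requires remapping every stored ordinal against the merged dictionary.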

Can you share any initial thoughts on how you might approach this? Even the very first things that come to your mind will likely greatly accelerate me if I am to try and extend Tantivy to support this kind of index.


NickDarvey commented Oct 7, 2024

Some notes as I poke around Tantivy.

Storing terms as a fastfield

I thought this might be implemented with fastfields (columnar storage), which are "designed for the fast random access of some document fields given a document id". However, I can't see a way to actually read a fastfield for a given document id. Maybe I want something row-oriented instead?

Edit: Oh!

Perhaps as a minimal first pass, I could just mark my existing text field as a fast field and specify a tokenizer. Then, wrap a column reader like FacetReader does, to allow fetching by a document id.

```rust
ReferenceValueLeaf::Str(val) => {
    if let Some(tokenizer) =
        &mut self.per_field_tokenizer[field.field_id() as usize]
    {
        let mut token_stream = tokenizer.token_stream(val);
        token_stream.process(&mut |token: &Token| {
            self.columnar_writer
                .record_str(doc_id, field_name, &token.text);
        })
    } else {
        self.columnar_writer.record_str(doc_id, field_name, val);
    }
}
```

This method would mean my text is getting tokenized twice, though, and I'd be storing the whole term(?) rather than just a term ordinal. #1325, the implementation of fastfields for strings, might be relevant here. (However, it looks like the codebase has changed a lot since this PR. For example, the postings writer no longer seems to pass an 'unordered_term_id' to the fastfield module.)
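To make the "wrap a column reader like FacetReader does" idea concrete, here is a reader-side sketch using plain Vecs as stand-ins for the ordinal column and its dictionary. StrColumnReader and its methods are hypothetical, not the actual tantivy API; they just mirror the FacetReader pattern of an ordinal column plus an ord -> value dictionary:

```rust
/// Toy stand-in for a columnar str column: for each doc, the ordinals of
/// the values recorded against it, plus an ordinal -> value dictionary.
/// Mirrors the FacetReader pattern, not the actual tantivy API.
struct StrColumnReader {
    ords_per_doc: Vec<Vec<u64>>, // indexed by DocId
    dictionary: Vec<String>,     // indexed by ordinal
}

impl StrColumnReader {
    /// Iterate the ordinals recorded for `doc`.
    fn term_ords(&self, doc: u32) -> impl Iterator<Item = u64> + '_ {
        self.ords_per_doc[doc as usize].iter().copied()
    }

    /// Resolve an ordinal back to its string value.
    fn ord_to_str(&self, ord: u64) -> Option<&str> {
        self.dictionary.get(ord as usize).map(String::as_str)
    }

    /// The FacetReader-style convenience: all values for a doc.
    fn values_for_doc(&self, doc: u32) -> Vec<&str> {
        self.term_ords(doc)
            .filter_map(|ord| self.ord_to_str(ord))
            .collect()
    }
}
```

With the tokenized-fast-field approach above, `values_for_doc` would return the tokens of the document's text field, which is exactly the Document -> Field -> Terms lookup, at the cost of a second dictionary separate from the termdict.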

Getting the terms per document

The token stream for a document is processed into terms in PostingsWriter::index_text

```rust
token_stream.process(&mut |token: &Token| {
    // We skip all tokens with a len greater than u16.
    if token.text.len() > MAX_TOKEN_LEN {
        warn!(
            "A token exceeding MAX_TOKEN_LEN ({}>{}) was dropped. Search for \
             MAX_TOKEN_LEN in the documentation for more information.",
            token.text.len(),
            MAX_TOKEN_LEN
        );
        return;
    }
    term_buffer.truncate_value_bytes(end_of_path_idx);
    term_buffer.append_bytes(token.text.as_bytes());
    let start_position = indexing_position.end_position + token.position as u32;
    end_position = end_position.max(start_position + token.position_length as u32);
    self.subscribe(doc_id, start_position, term_buffer, ctx);
    num_tokens += 1;
});
```

PostingsWriters must implement a subscribe function to handle the doc and term.

```rust
fn subscribe(&mut self, doc: DocId, position: u32, term: &Term, ctx: &mut IndexingContext) {
    debug_assert!(term.serialized_term().len() >= 4);
    self.total_num_tokens += 1;
    let (term_index, arena) = (&mut ctx.term_index, &mut ctx.arena);
    term_index.mutate_or_create(term.serialized_term(), |opt_recorder: Option<Rec>| {
        if let Some(mut recorder) = opt_recorder {
            let current_doc = recorder.current_doc();
            if current_doc != doc {
                recorder.close_doc(arena);
                recorder.new_doc(doc, arena);
            }
            recorder.record_position(position, arena);
            recorder
        } else {
            let mut recorder = Rec::default();
            recorder.new_doc(doc, arena);
            recorder.record_position(position, arena);
            recorder
        }
    });
}
```
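To check my understanding of the mutate_or_create pattern, here is a toy, self-contained model of it (ToyRecorder and ToyTermIndex are invented names; the real code uses a MemoryArena and serialized term bytes):

```rust
use std::collections::HashMap;

/// Toy recorder: tracks the current doc and a per-doc term frequency.
#[derive(Clone, Copy, Default)]
struct ToyRecorder {
    current_doc: u32,
    term_freq: u32,
}

/// Toy model of tantivy's term index with `mutate_or_create` semantics:
/// the closure receives the existing recorder for the term (if any) and
/// returns the recorder to store back. (No MemoryArena here.)
#[derive(Default)]
struct ToyTermIndex {
    map: HashMap<Vec<u8>, ToyRecorder>,
}

impl ToyTermIndex {
    fn mutate_or_create<F>(&mut self, term: &[u8], f: F)
    where
        F: FnOnce(Option<ToyRecorder>) -> ToyRecorder,
    {
        let existing = self.map.get(term).copied();
        self.map.insert(term.to_vec(), f(existing));
    }
}

/// The shape of `subscribe` from the snippet above, minus positions.
fn subscribe(index: &mut ToyTermIndex, doc: u32, term: &[u8]) {
    index.mutate_or_create(term, |opt| match opt {
        Some(mut rec) => {
            if rec.current_doc != doc {
                // Close the previous doc and start a new one.
                rec.current_doc = doc;
                rec.term_freq = 0;
            }
            rec.term_freq += 1;
            rec
        }
        None => ToyRecorder { current_doc: doc, term_freq: 1 },
    });
}
```

The key observation for this issue: the map is keyed by term, and the closure only sees doc ids and positions, which is why the recorder never sees the term itself.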

SpecializedPostingsWriter<>, for example, instantiates and calls a recorder for each term. However, a recorder does not (currently) have access to the specific term.

```rust
/// `Recorder` is in charge of recording relevant information about
/// the presence of a term in a document.
///
/// Depending on the [`TextOptions`](crate::schema::TextOptions) associated
/// with the field, the recorder may record:
/// * the document frequency
/// * the document id
/// * the term frequency
/// * the term positions
pub(crate) trait Recorder: Copy + Default + Send + Sync + 'static {
    /// Returns the current document
    fn current_doc(&self) -> u32;
    /// Starts recording information about a new document
    /// This method shall only be called if the term is within the document.
    fn new_doc(&mut self, doc: DocId, arena: &mut MemoryArena);
    /// Record the position of a term. For each document,
    /// this method will be called `term_freq` times.
    fn record_position(&mut self, position: u32, arena: &mut MemoryArena);
    /// Close the document. It will help record the term frequency.
    fn close_doc(&mut self, arena: &mut MemoryArena);
    /// Pushes the postings information to the serializer.
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    );
    /// Returns the number of document containing this term.
    ///
    /// Returns `None` if not available.
    fn term_doc_freq(&self) -> Option<u32>;
    #[inline]
    fn has_term_freq(&self) -> bool {
        true
    }
}
```
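Since the existing Recorder never receives the term, one hypothetical direction (this trait does not exist in tantivy; the names are invented) would be a recorder-like hook that also receives the serialized term bytes, so a doc -> terms mapping can be accumulated during indexing:

```rust
use std::collections::HashMap;

/// Hypothetical extension of the recorder idea: unlike tantivy's
/// `Recorder::new_doc`/`record_position`, this hook also receives the
/// serialized term bytes, so a doc -> terms mapping can be built.
trait TermVectorRecorder {
    fn record_term(&mut self, doc: u32, serialized_term: &[u8]);
    fn doc_terms(&self, doc: u32) -> Vec<Vec<u8>>;
}

/// Simplest possible implementation: buffer everything in memory.
/// A real one would write into the MemoryArena and serialize per segment.
#[derive(Default)]
struct InMemoryTermVectors {
    per_doc: HashMap<u32, Vec<Vec<u8>>>,
}

impl TermVectorRecorder for InMemoryTermVectors {
    fn record_term(&mut self, doc: u32, serialized_term: &[u8]) {
        self.per_doc
            .entry(doc)
            .or_default()
            .push(serialized_term.to_vec());
    }

    fn doc_terms(&self, doc: u32) -> Vec<Vec<u8>> {
        self.per_doc.get(&doc).cloned().unwrap_or_default()
    }
}
```

The natural call site would be alongside `self.subscribe(...)` in index_text, where both the doc id and the filled term_buffer are in scope.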

Exposing as an option

This could be exposed as another (or a different kind of) IndexRecordOption, which should work whether we implement a new recorder or need to implement a whole different PostingsWriter.

```rust
FieldType::Str(ref text_options) => text_options
    .get_indexing_options()
    .map(|indexing_options| match indexing_options.index_option() {
        IndexRecordOption::Basic => {
            SpecializedPostingsWriter::<DocIdRecorder>::default().into()
        }
        IndexRecordOption::WithFreqs => {
            SpecializedPostingsWriter::<TermFrequencyRecorder>::default().into()
        }
        IndexRecordOption::WithFreqsAndPositions => {
            SpecializedPostingsWriter::<TfAndPositionRecorder>::default().into()
        }
    })
    .unwrap_or_else(|| SpecializedPostingsWriter::<DocIdRecorder>::default().into()),
```
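A sketch of where a new option could slot into that dispatch. The extra variant and the TermVectorRecorder it names are purely hypothetical (they do not exist in tantivy); a string return stands in for the writer instantiation:

```rust
/// Hypothetical extension of `IndexRecordOption`: the fourth variant is
/// invented for illustration and does not exist in tantivy.
#[derive(Clone, Copy, PartialEq, Debug)]
enum IndexRecordOptionExt {
    Basic,
    WithFreqs,
    WithFreqsAndPositions,
    WithFreqsPositionsAndTermVectors, // hypothetical
}

/// Stand-in for the dispatch in the snippet above; returns the name of
/// the postings writer that would be instantiated for each option.
fn postings_writer_for(option: IndexRecordOptionExt) -> &'static str {
    match option {
        IndexRecordOptionExt::Basic => "SpecializedPostingsWriter<DocIdRecorder>",
        IndexRecordOptionExt::WithFreqs => {
            "SpecializedPostingsWriter<TermFrequencyRecorder>"
        }
        IndexRecordOptionExt::WithFreqsAndPositions => {
            "SpecializedPostingsWriter<TfAndPositionRecorder>"
        }
        IndexRecordOptionExt::WithFreqsPositionsAndTermVectors => {
            // hypothetical recorder that also captures per-doc terms
            "SpecializedPostingsWriter<TermVectorRecorder>"
        }
    }
}
```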

Alternatively, it might be nice to enable this per document, for example, so I can keep this kind of index for only the latest ~20% of documents. In that case, maybe this could be implemented as a new field type.


PSeitz commented Oct 7, 2024

You could load the document from the docstore and tokenize the text to get the terms

@NickDarvey

> You could load the document from the docstore and tokenize the text to get the terms

Ah, but I am not storing this (quite large) text field


PSeitz commented Oct 7, 2024

The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage


NickDarvey commented Oct 7, 2024

> The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage

Was it shared previously? I see your contribution here passes an UnorderedTermId to a column writer.

Edit: Maybe that's what this is about #1705 (comment)


PSeitz commented Oct 7, 2024

It can't be shared anymore since a different tokenizer can be defined now
