
Reverse reverse index #2505

Open
NickDarvey opened this issue Oct 4, 2024 · 6 comments

NickDarvey commented Oct 4, 2024

I want to look up the terms for a given document, that is, Document -> Field -> Terms, similar to what term vectors provide in Lucene. (However, I see positions are already stored a little differently in Tantivy.)

The use cases are things like analysing the term distribution in a document (for text classification, summarization, or highlighting query terms) and copying individual indexed documents to another index.

I'm thinking of this like a HashMap<DocId, Vec<Term>>, where Term (somehow) is a reference to the Term in the termdict. There would be one of these reverse reverse indexes per inverted-index segment, so we (somehow) need to participate in the merge process. I notice that Lucene.NET has an interface 'InvertedDocConsumer', which is how term vectors (and something called 'freqprox') hook into the indexing chain, so maybe that's a place to draw inspiration.
Edit: It looks like Recorder might be the right interface in Tantivy for writing to this new index. For example, TermFrequencyRecorder.
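For what it's worth, the HashMap<DocId, Vec<Term>> idea can be sketched with plain std collections. Everything here (TermVectors, TermOrd, the method names) is made up for illustration, not a Tantivy type; terms are stored as ordinals into a toy per-segment dictionary:

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical ordinal referencing a term in the segment's termdict.
type TermOrd = u64;
type DocId = u32;

/// A toy per-segment "reverse reverse index": doc -> term ordinals.
/// In a real implementation the ordinals would reference the segment's
/// termdict and would need remapping when segments are merged.
#[derive(Default)]
struct TermVectors {
    /// Term dictionary for this segment (ordinals assigned in arrival order).
    dict: BTreeMap<String, TermOrd>,
    /// Document -> ordinals of the terms it contains.
    per_doc: HashMap<DocId, Vec<TermOrd>>,
}

impl TermVectors {
    fn record(&mut self, doc: DocId, term: &str) {
        let next = self.dict.len() as TermOrd;
        let ord = *self.dict.entry(term.to_string()).or_insert(next);
        self.per_doc.entry(doc).or_default().push(ord);
    }

    /// Resolve the terms recorded for a document back to strings.
    fn terms(&self, doc: DocId) -> Vec<&str> {
        let ord_to_term: HashMap<TermOrd, &str> =
            self.dict.iter().map(|(t, &o)| (o, t.as_str())).collect();
        self.per_doc
            .get(&doc)
            .map(|ords| ords.iter().map(|o| ord_to_term[o]).collect())
            .unwrap_or_default()
    }
}
```

The interesting part is really the merge question: because ordinals are per-segment, merging two of these requires remapping every stored ordinal against the merged dictionary.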

Can you share any initial thoughts on how you might approach this? Even the very first things that come to your mind will likely greatly accelerate me if I am to try and extend Tantivy to support this kind of index.


NickDarvey commented Oct 7, 2024

Some notes as I poke around Tantivy.

Storing terms as a fastfield

I thought this might be implemented with fastfields (columnar storage), which are "designed for the fast random access of some document fields given a document id". However, I can't see a way to actually read a fastfield for a given document id. Maybe I want something row-oriented instead?

Edit: Oh!

Perhaps as a minimal first pass, I could just mark my existing text field as a fast field and specify a tokenizer. Then, wrap a column reader like FacetReader does, to allow fetching by a document id.

```rust
ReferenceValueLeaf::Str(val) => {
    if let Some(tokenizer) =
        &mut self.per_field_tokenizer[field.field_id() as usize]
    {
        let mut token_stream = tokenizer.token_stream(val);
        token_stream.process(&mut |token: &Token| {
            self.columnar_writer
                .record_str(doc_id, field_name, &token.text);
        })
    } else {
        self.columnar_writer.record_str(doc_id, field_name, val);
    }
}
```

This method would mean my text is getting tokenized twice, though, and I'd be storing the whole term(?) rather than just a term ordinal. #1325, the implementation of fastfields for strings, might be relevant here. (However, it looks like the codebase has changed a lot since this PR. For example, the postings writer no longer seems to pass an 'unordered_term_id' to the fastfield module.)
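To make the "wrap a column reader like FacetReader does" idea concrete, here is a reader-side sketch using plain Vecs as stand-ins for the ordinal column and its dictionary. StrColumnReader and its methods are hypothetical, not the actual tantivy API; they just mirror the FacetReader pattern of an ordinal column plus an ord -> value dictionary:

```rust
/// Toy stand-in for a columnar str column: for each doc, the ordinals of
/// the values recorded against it, plus an ordinal -> value dictionary.
/// Mirrors the FacetReader pattern, not the actual tantivy API.
struct StrColumnReader {
    ords_per_doc: Vec<Vec<u64>>, // indexed by DocId
    dictionary: Vec<String>,     // indexed by ordinal
}

impl StrColumnReader {
    /// Iterate the ordinals recorded for `doc`.
    fn term_ords(&self, doc: u32) -> impl Iterator<Item = u64> + '_ {
        self.ords_per_doc[doc as usize].iter().copied()
    }

    /// Resolve an ordinal back to its string value.
    fn ord_to_str(&self, ord: u64) -> Option<&str> {
        self.dictionary.get(ord as usize).map(String::as_str)
    }

    /// The FacetReader-style convenience: all values for a doc.
    fn values_for_doc(&self, doc: u32) -> Vec<&str> {
        self.term_ords(doc)
            .filter_map(|ord| self.ord_to_str(ord))
            .collect()
    }
}
```

With the tokenized-fast-field approach above, `values_for_doc` would return the tokens of the document's text field, which is exactly the Document -> Field -> Terms lookup, at the cost of a second dictionary separate from the termdict.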

Getting the terms per document

The token stream for a document is processed into terms in PostingsWriter::index_text

```rust
token_stream.process(&mut |token: &Token| {
    // We skip all tokens with a len greater than u16.
    if token.text.len() > MAX_TOKEN_LEN {
        warn!(
            "A token exceeding MAX_TOKEN_LEN ({}>{}) was dropped. Search for \
             MAX_TOKEN_LEN in the documentation for more information.",
            token.text.len(),
            MAX_TOKEN_LEN
        );
        return;
    }
    term_buffer.truncate_value_bytes(end_of_path_idx);
    term_buffer.append_bytes(token.text.as_bytes());
    let start_position = indexing_position.end_position + token.position as u32;
    end_position = end_position.max(start_position + token.position_length as u32);
    self.subscribe(doc_id, start_position, term_buffer, ctx);
    num_tokens += 1;
});
```

PostingsWriters must implement a subscribe function to handle the doc and term.

```rust
fn subscribe(&mut self, doc: DocId, position: u32, term: &Term, ctx: &mut IndexingContext) {
    debug_assert!(term.serialized_term().len() >= 4);
    self.total_num_tokens += 1;
    let (term_index, arena) = (&mut ctx.term_index, &mut ctx.arena);
    term_index.mutate_or_create(term.serialized_term(), |opt_recorder: Option<Rec>| {
        if let Some(mut recorder) = opt_recorder {
            let current_doc = recorder.current_doc();
            if current_doc != doc {
                recorder.close_doc(arena);
                recorder.new_doc(doc, arena);
            }
            recorder.record_position(position, arena);
            recorder
        } else {
            let mut recorder = Rec::default();
            recorder.new_doc(doc, arena);
            recorder.record_position(position, arena);
            recorder
        }
    });
}
```
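To check my understanding of the mutate_or_create pattern, here is a toy, self-contained model of it (ToyRecorder and ToyTermIndex are invented names; the real code uses a MemoryArena and serialized term bytes):

```rust
use std::collections::HashMap;

/// Toy recorder: tracks the current doc and a per-doc term frequency.
#[derive(Clone, Copy, Default)]
struct ToyRecorder {
    current_doc: u32,
    term_freq: u32,
}

/// Toy model of tantivy's term index with `mutate_or_create` semantics:
/// the closure receives the existing recorder for the term (if any) and
/// returns the recorder to store back. (No MemoryArena here.)
#[derive(Default)]
struct ToyTermIndex {
    map: HashMap<Vec<u8>, ToyRecorder>,
}

impl ToyTermIndex {
    fn mutate_or_create<F>(&mut self, term: &[u8], f: F)
    where
        F: FnOnce(Option<ToyRecorder>) -> ToyRecorder,
    {
        let existing = self.map.get(term).copied();
        self.map.insert(term.to_vec(), f(existing));
    }
}

/// The shape of `subscribe` from the snippet above, minus positions.
fn subscribe(index: &mut ToyTermIndex, doc: u32, term: &[u8]) {
    index.mutate_or_create(term, |opt| match opt {
        Some(mut rec) => {
            if rec.current_doc != doc {
                // Close the previous doc and start a new one.
                rec.current_doc = doc;
                rec.term_freq = 0;
            }
            rec.term_freq += 1;
            rec
        }
        None => ToyRecorder { current_doc: doc, term_freq: 1 },
    });
}
```

The key observation for this issue: the map is keyed by term, and the closure only sees doc ids and positions, which is why the recorder never sees the term itself.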

SpecializedPostingsWriter<>, for example, instantiates and calls a recorder for each term. However, a recorder does not (currently) have access to the specific term.

```rust
/// `Recorder` is in charge of recording relevant information about
/// the presence of a term in a document.
///
/// Depending on the [`TextOptions`](crate::schema::TextOptions) associated
/// with the field, the recorder may record:
/// * the document frequency
/// * the document id
/// * the term frequency
/// * the term positions
pub(crate) trait Recorder: Copy + Default + Send + Sync + 'static {
    /// Returns the current document
    fn current_doc(&self) -> u32;
    /// Starts recording information about a new document
    /// This method shall only be called if the term is within the document.
    fn new_doc(&mut self, doc: DocId, arena: &mut MemoryArena);
    /// Record the position of a term. For each document,
    /// this method will be called `term_freq` times.
    fn record_position(&mut self, position: u32, arena: &mut MemoryArena);
    /// Close the document. It will help record the term frequency.
    fn close_doc(&mut self, arena: &mut MemoryArena);
    /// Pushes the postings information to the serializer.
    fn serialize(
        &self,
        arena: &MemoryArena,
        serializer: &mut FieldSerializer<'_>,
        buffer_lender: &mut BufferLender,
    );
    /// Returns the number of document containing this term.
    ///
    /// Returns `None` if not available.
    fn term_doc_freq(&self) -> Option<u32>;
    #[inline]
    fn has_term_freq(&self) -> bool {
        true
    }
}
```
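Since the existing Recorder never receives the term, one hypothetical direction (this trait does not exist in tantivy; the names are invented) would be a recorder-like hook that also receives the serialized term bytes, so a doc -> terms mapping can be accumulated during indexing:

```rust
use std::collections::HashMap;

/// Hypothetical extension of the recorder idea: unlike tantivy's
/// `Recorder::new_doc`/`record_position`, this hook also receives the
/// serialized term bytes, so a doc -> terms mapping can be built.
trait TermVectorRecorder {
    fn record_term(&mut self, doc: u32, serialized_term: &[u8]);
    fn doc_terms(&self, doc: u32) -> Vec<Vec<u8>>;
}

/// Simplest possible implementation: buffer everything in memory.
/// A real one would write into the MemoryArena and serialize per segment.
#[derive(Default)]
struct InMemoryTermVectors {
    per_doc: HashMap<u32, Vec<Vec<u8>>>,
}

impl TermVectorRecorder for InMemoryTermVectors {
    fn record_term(&mut self, doc: u32, serialized_term: &[u8]) {
        self.per_doc
            .entry(doc)
            .or_default()
            .push(serialized_term.to_vec());
    }

    fn doc_terms(&self, doc: u32) -> Vec<Vec<u8>> {
        self.per_doc.get(&doc).cloned().unwrap_or_default()
    }
}
```

The natural call site would be alongside `self.subscribe(...)` in index_text, where both the doc id and the filled term_buffer are in scope.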

Exposing as an option

This could be exposed as another (or a different kind of) IndexRecordOption, which should work whether we implement a new recorder or need to implement a whole different PostingsWriter.

```rust
FieldType::Str(ref text_options) => text_options
    .get_indexing_options()
    .map(|indexing_options| match indexing_options.index_option() {
        IndexRecordOption::Basic => {
            SpecializedPostingsWriter::<DocIdRecorder>::default().into()
        }
        IndexRecordOption::WithFreqs => {
            SpecializedPostingsWriter::<TermFrequencyRecorder>::default().into()
        }
        IndexRecordOption::WithFreqsAndPositions => {
            SpecializedPostingsWriter::<TfAndPositionRecorder>::default().into()
        }
    })
    .unwrap_or_else(|| SpecializedPostingsWriter::<DocIdRecorder>::default().into()),
```
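A sketch of where a new option could slot into that dispatch. The extra variant and the TermVectorRecorder it names are purely hypothetical (they do not exist in tantivy); a string return stands in for the writer instantiation:

```rust
/// Hypothetical extension of `IndexRecordOption`: the fourth variant is
/// invented for illustration and does not exist in tantivy.
#[derive(Clone, Copy, PartialEq, Debug)]
enum IndexRecordOptionExt {
    Basic,
    WithFreqs,
    WithFreqsAndPositions,
    WithFreqsPositionsAndTermVectors, // hypothetical
}

/// Stand-in for the dispatch in the snippet above; returns the name of
/// the postings writer that would be instantiated for each option.
fn postings_writer_for(option: IndexRecordOptionExt) -> &'static str {
    match option {
        IndexRecordOptionExt::Basic => "SpecializedPostingsWriter<DocIdRecorder>",
        IndexRecordOptionExt::WithFreqs => {
            "SpecializedPostingsWriter<TermFrequencyRecorder>"
        }
        IndexRecordOptionExt::WithFreqsAndPositions => {
            "SpecializedPostingsWriter<TfAndPositionRecorder>"
        }
        IndexRecordOptionExt::WithFreqsPositionsAndTermVectors => {
            // hypothetical recorder that also captures per-doc terms
            "SpecializedPostingsWriter<TermVectorRecorder>"
        }
    }
}
```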

Alternatively, it might be nice to enable this per document, for example, so I can keep this kind of index for only the latest ~20% of documents. In that case, maybe this could be implemented as a new field type.


PSeitz commented Oct 7, 2024

You could load the document from the docstore and tokenize the text to get the terms

@NickDarvey

> You could load the document from the docstore and tokenize the text to get the terms

Ah, but I am not storing this (quite large) text field


PSeitz commented Oct 7, 2024

The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage


NickDarvey commented Oct 7, 2024

> The fast field (columnar) version should work too I think, but the dictionary is not shared between the inverted index and the columnar storage

Was it shared previously? I see your contribution here passes an UnorderedTermId to a column writer.

Edit: Maybe that's what this is about #1705 (comment)


PSeitz commented Oct 7, 2024

It can't be shared anymore since a different tokenizer can be defined now
