bug: `bm25_query_to_svector` has an additional norm #16

kemingy · 2024-07-12T04:21:33Z

Lines 216 to 221 in 3126173

    
    // https://github.com/pinecone-io/pinecone-text/issues/69
    let sum = x.values().copied().sum::<f32>();
    let mut result = "{".to_string();
    for (index, value) in x.into_iter() {
        result.push_str(&format!("{}:{}, ", index + 1, value / sum));
    }

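This normalization is meant for the following score, where the IDF weights are normalized over the query tokens: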

$$score(D, Q) = \sum^{D\land Q}_t (TF_{norm} * IDF /\sum^Q_t IDF)$$

However, if the BM25 score is pre-computed at the document level, it becomes:

$$score(D, Q) = \sum^{D\land Q}_t (TF_{norm} * IDF /\sum^D_t IDF)$$

This means that for each query, the score is affected not only by the tokens $t \in D \land Q$, but also by the tokens $t \in D$ with $t \notin Q$.
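
To make the difference concrete, here is a minimal Rust sketch of the two formulas above. The names (`Sparse`, `overlap`, `score_query_norm`, `score_doc_norm`) and the toy numbers are assumptions for illustration only, not code from this repository.

```rust
use std::collections::BTreeMap;

/// Token id -> weight (hypothetical representation for this sketch).
type Sparse = BTreeMap<u32, f32>;

/// Sum of TF_norm * IDF over tokens present in both the document and the query.
/// Assumes `query` has no duplicate token ids.
fn overlap(tf_norm: &Sparse, idf: &Sparse, query: &[u32]) -> f32 {
    query
        .iter()
        .filter_map(|t| Some(tf_norm.get(t)? * idf.get(t)?))
        .sum()
}

/// First formula: normalize by the IDF sum over the query tokens.
fn score_query_norm(tf_norm: &Sparse, idf: &Sparse, query: &[u32]) -> f32 {
    let denom: f32 = query.iter().filter_map(|t| idf.get(t)).sum();
    overlap(tf_norm, idf, query) / denom
}

/// Second formula: normalize by the IDF sum over the document tokens.
fn score_doc_norm(tf_norm: &Sparse, idf: &Sparse, query: &[u32]) -> f32 {
    let denom: f32 = tf_norm.keys().filter_map(|t| idf.get(t)).sum();
    overlap(tf_norm, idf, query) / denom
}
```

With IDF values `{1: 1.0, 2: 2.0, 3: 3.0}` and query `[1, 2]`, a document with `TF_norm = 0.5` on tokens 1 and 2 scores 0.5 under both formulas. Adding token 3 (which is not in the query) to the document leaves the first score at 0.5 but halves the second to 0.25.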

Solutions

- create another function that computes the BM25 score directly instead of using the element-wise multiplication of the document/query sparse vectors
- or delete the query IDF norm
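
As a rough illustration of the first option, a direct scorer might look like the sketch below. It uses the standard BM25 term weight with the usual `k1` and `b` parameters and takes precomputed IDF values as input. The function name `bm25_score`, its argument layout, and the map-based representation are hypothetical and not part of this repository's API.

```rust
use std::collections::BTreeMap;

/// Hypothetical direct BM25 scorer (illustration only).
/// `doc_tf` maps token id -> raw term count in the document,
/// `idf` maps token id -> precomputed IDF weight.
/// Assumes `query` contains no duplicate token ids.
fn bm25_score(
    query: &[u32],
    doc_tf: &BTreeMap<u32, u32>,
    idf: &BTreeMap<u32, f32>,
    doc_len: f32,
    avg_doc_len: f32,
    k1: f32,
    b: f32,
) -> f32 {
    // Document-length normalization term shared by every token.
    let length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    query
        .iter()
        .filter_map(|t| {
            // Tokens missing from the document or the IDF table contribute nothing.
            let tf = *doc_tf.get(t)? as f32;
            let w = *idf.get(t)?;
            Some(w * tf * (k1 + 1.0) / (tf + length_norm))
        })
        .sum()
}
```

Because the sum runs only over query tokens that also occur in the document, no extra IDF normalization is needed, and document tokens outside the query influence the score only through `doc_len`.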

kemingy mentioned this issue Jul 12, 2024

feat: support inverted index for sparse vector tensorchord/pgvecto.rs#517

Merged
