Support disabling compression for SSTable based dictionaries #2475

Open
rfairfax opened this issue Aug 12, 2024 · 3 comments

rfairfax commented Aug 12, 2024

Is your feature request related to a problem? Please describe.
We have a setup where we use a bytes fast field to store serialized data. As part of querying, a collector retrieves this data via BytesColumn::ord_to_bytes. We are seeing limited performance, and flame graphs show that query time is dominated (>90%) by this path:

tantivy_sstable::dictionary::Dictionary<TSSTable>::ord_to_term
tantivy_sstable::delta::DeltaReader<TValueReader>::advance
tantivy_sstable::block_reader::BlockReader::read_block
zstd::bulk::decompressor::Decompressor::decompress_to_buffer

From a quick look at the code, it appears that random access to the fast field requires decompressing the block holding the term on each ord lookup, which is why the CPU time is dominated by decompression. As the number of documents being collected increases, efficiency drops quickly because the same block is decompressed again and again.
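
For context, here is a rough sketch of the access pattern that hits this path (not our exact collector). The BytesColumn::ord_to_bytes call and its import path are written from memory and may differ slightly between tantivy versions:

```rust
// Sketch of the per-ord lookup loop that dominates our flame graphs.
// Import path and exact signature are assumptions; adjust for your tantivy version.
use tantivy::columnar::BytesColumn;

fn collect_payloads(column: &BytesColumn, ords: &[u64]) -> std::io::Result<Vec<Vec<u8>>> {
    let mut payloads = Vec::with_capacity(ords.len());
    let mut buf = Vec::new();
    for &ord in ords {
        buf.clear();
        // Each call walks the SSTable dictionary to the block containing `ord`,
        // decompressing that block even if the previous ord hit the same block.
        column.ord_to_bytes(ord, &mut buf)?;
        payloads.push(buf.clone());
    }
    Ok(payloads)
}
```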

Describe the solution you'd like
Ideally there would be a setting that disables SSTable compression for a given fast field, for callers willing to pay more space in exchange for faster random access. This could be the default for dictionaries built as part of fast fields, where fast random access is the desired outcome, but an opt-in would also work.
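
To make the ask concrete, here is a hypothetical sketch of what an opt-out could look like at the schema level. The set_dictionary_compression knob does not exist today and is only meant to illustrate the shape of the option; the existing BytesOptions builder calls are from memory and may differ slightly by version:

```rust
use tantivy::schema::{BytesOptions, Schema};

fn main() {
    let mut builder = Schema::builder();
    builder.add_bytes_field(
        "payload",
        BytesOptions::default()
            .set_fast()
            // Hypothetical knob proposed in this issue, not an existing API:
            // .set_dictionary_compression(DictionaryCompression::None)
    );
    let _schema: Schema = builder.build();
}
```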

[Optional] describe alternatives you've considered
We experimented with walking the entire dictionary so that each block is decompressed only once, but there are simply too many unique terms for this to be faster. Even with range limits on the dictionary stream, this is slower than the random access today.

We also considered storing this field and using the doc store, where we can control compression via index settings, but we need the fast field for some range filtering, so we're hoping to avoid duplicate storage and the overhead of retrieving the entire document.

One final alternative that may help is caching blocks, like the StoreReader does. In a synthetic test where the only stored field is the single BytesColumn, going through the store reader was 3x faster than the fast field column, largely because the cache reduces the amount of decompression needed.
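
For illustration, a bare-bones version of that caching idea applied to dictionary blocks might look like the following. The BlockCache type and offset-keyed map are stand-ins, not actual tantivy_sstable types, and a real implementation would want proper LRU eviction like the StoreReader's cache:

```rust
use std::collections::HashMap;

// Caches decompressed dictionary blocks keyed by their start offset,
// so repeated ord lookups into the same block skip the zstd work.
struct BlockCache {
    capacity: usize,
    blocks: HashMap<u64, Vec<u8>>,
}

impl BlockCache {
    fn get_or_load(
        &mut self,
        block_offset: u64,
        load: impl FnOnce() -> std::io::Result<Vec<u8>>, // reads + decompresses the block
    ) -> std::io::Result<&Vec<u8>> {
        if !self.blocks.contains_key(&block_offset) {
            if self.blocks.len() >= self.capacity {
                self.blocks.clear(); // crude eviction; a real cache would evict LRU entries
            }
            self.blocks.insert(block_offset, load()?);
        }
        Ok(&self.blocks[&block_offset])
    }
}
```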

RNabel commented Aug 25, 2024

One other option that may help is zstd's skippable frames, which could also be used to speed up random access at the cost of (presumably) a worse compression ratio.

@fulmicoton (Collaborator) commented

I think it makes sense to add an option in the schema to disable compression.

@trinity-1686a (Contributor) commented

The sstable block codec already has an 8-bit field encoding whether a block is compressed, so it should be a matter of passing a bool to DeltaWriter and modifying this heuristic. We could also experiment with a negative compression level here; this would disable zstd's entropy coding, which would likely improve decompression speed at the cost of less compression.
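
As a rough illustration of the negative-level idea, the zstd crate accepts levels below zero directly; the level value and the resulting size/speed trade-off here are things to measure, not results:

```rust
fn main() -> std::io::Result<()> {
    let data = vec![42u8; 16 * 1024];
    // Negative levels trade compression ratio for speed.
    let fast = zstd::bulk::compress(&data, -5)?;
    let normal = zstd::bulk::compress(&data, 3)?;
    println!("level -5: {} bytes, level 3: {} bytes", fast.len(), normal.len());
    // Round-trip to confirm the data is intact.
    let roundtrip = zstd::bulk::decompress(&fast, data.len())?;
    assert_eq!(roundtrip, data);
    Ok(())
}
```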
