Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Open
PSeitz opened this issue Jun 18, 2024 · 2 comments
Open

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

PSeitz opened this issue Jun 18, 2024 · 2 comments

Comments

@PSeitz
Copy link
Contributor

PSeitz commented Jun 18, 2024

Issue Outline

In the threshold between DENSE and SPARSE (5120 elements), compression is not great and select queries on SPARSE are relatively slow due to the binary_search.

Here is what a codec between sparse and dense could look like to increase performance and compression.

CHIMERA codec

Conceptually it is similar of what we do top level by creating blocks of size u16::MAX, but one level lower on the block level. It combines the index part of the DENSE (vals before) codec with the encoding of the SPARSE codec, therefore the name CHIMERA.
On a side note this actually a variation of datastructure I'm testing to speed up term aggregations.

A block is u16::MAX elements. From this value space we create 256 Blocks each with up to 256 elements.

Create an index with an entry for each block [VALS BEFORE;u16; ..x256].
Since we have only 256 elements in a chimera block we can use u8, as apposed to u16 in the SPARSE codec, to encode elements.
The binary search would have at most 8 steps as opposed to 13 steps with the SPARSE codec. We can also assume that a binary_search can on average be executed on ~2 cachelines, in contrast to ~11 cachelines (number of binary_search steps).

VALS BEFORE is the same as in the dense codec to speed up lookups. It also is used to compute the number of elements in a block and where they are encoded as every element is exactly 1 byte.

rank
direct access block + binary search in block

select
direct access value in data vec + binary_search on the metadata to find the block. We could have a select cursor similar to the existing one.

CHIMERA codec cost

2 bytes metadata per block: 256 * 2 = 512 bytes fixed codec + 1byte per element.

Thresholds

The current threshold is 5120 elements to switch between SPARSE and DENSE.

512 elements in a block before SPARSE is more compact than CHIMERA.

10240 bytes fixed cost for the DENSE codec (based on the threshold of 5120).
=> 9728 Elements threshold to DENSE codec after which DENSE is more compact than CHIMERA.

Selection range for maximum compression is therefore 512 - 9728 elements.

Conclusion

CHIMERA codec should provide higher compression and better performance compared to SPARSE codec.
The performance is probably lower compared to DENSE codec.

Further Considerations

A block of 256 could have different formats: FULL, BIZARRO, BITVEC. But this would require extra metadata additional to VALS BEFORE

Related issue: #2431

@fulmicoton
Copy link
Collaborator

fulmicoton commented Jun 19, 2024

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

@PSeitz
Copy link
Contributor Author

PSeitz commented Jul 1, 2024

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Yes, linear search would probably be better here.

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

In places where we fetch data in blocks (aggregation, sort in quickwit) we could do some optimization. The enum would have 3 instead of 2, so I think it should be fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants