Possible Codec Between SPARSE and DENSE: CHIMERA #2440

PSeitz · 2024-06-18T08:05:19Z

Issue Outline

In the threshold between DENSE and SPARSE (5120 elements), compression is not great and select queries on SPARSE are relatively slow due to the binary_search.

Here is what a codec between sparse and dense could look like to increase performance and compression.

CHIMERA codec

Conceptually it is similar of what we do top level by creating blocks of size u16::MAX, but one level lower on the block level. It combines the index part of the DENSE (vals before) codec with the encoding of the SPARSE codec, therefore the name CHIMERA.
On a side note this actually a variation of datastructure I'm testing to speed up term aggregations.

A block is u16::MAX elements. From this value space we create 256 Blocks each with up to 256 elements.

Create an index with an entry for each block [VALS BEFORE;u16; ..x256].
Since we have only 256 elements in a chimera block we can use u8, as apposed to u16 in the SPARSE codec, to encode elements.
The binary search would have at most 8 steps as opposed to 13 steps with the SPARSE codec. We can also assume that a binary_search can on average be executed on ~2 cachelines, in contrast to ~11 cachelines (number of binary_search steps).

VALS BEFORE is the same as in the dense codec to speed up lookups. It also is used to compute the number of elements in a block and where they are encoded as every element is exactly 1 byte.

rank
direct access block + binary search in block

select
direct access value in data vec + binary_search on the metadata to find the block. We could have a select cursor similar to the existing one.

CHIMERA codec cost

2 bytes metadata per block: 256 * 2 = 512 bytes fixed codec + 1byte per element.

Thresholds

The current threshold is 5120 elements to switch between SPARSE and DENSE.

512 elements in a block before SPARSE is more compact than CHIMERA.

10240 bytes fixed cost for the DENSE codec (based on the threshold of 5120).
=> 9728 Elements threshold to DENSE codec after which DENSE is more compact than CHIMERA.

Selection range for maximum compression is therefore 512 - 9728 elements.

Conclusion

CHIMERA codec should provide higher compression and better performance compared to SPARSE codec.
The performance is probably lower compared to DENSE codec.

Further Considerations

A block of 256 could have different formats: FULL, BIZARRO, BITVEC. But this would require extra metadata additional to VALS BEFORE

Related issue: #2431

The text was updated successfully, but these errors were encountered:

fulmicoton · 2024-06-19T06:29:58Z

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

PSeitz · 2024-07-01T13:48:03Z

At the threshold we would have an average of 20 elements per chimera block. Maybe linear search should be considered too?

Yes, linear search would probably be better here.

Adding codecs could have some weird hidden switch-dispatch cost. It might be worth double checking how it goes.

In places where we fetch data in blocks (aggregation, sort in quickwit) we could do some optimization. The enum would have 3 instead of 2, so I think it should be fine.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

PSeitz commented Jun 18, 2024 •

edited

Loading

fulmicoton commented Jun 19, 2024 •

edited

Loading

PSeitz commented Jul 1, 2024

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Possible Codec Between SPARSE and DENSE: CHIMERA #2440

Comments

PSeitz commented Jun 18, 2024 • edited Loading

Issue Outline

CHIMERA codec

CHIMERA codec cost

Thresholds

Conclusion

Further Considerations

fulmicoton commented Jun 19, 2024 • edited Loading

PSeitz commented Jul 1, 2024

PSeitz commented Jun 18, 2024 •

edited

Loading

fulmicoton commented Jun 19, 2024 •

edited

Loading