
Reduce the size of the lookup tables #14

Open
lemire opened this issue Dec 4, 2017 · 17 comments

lemire commented Dec 4, 2017

The current lookup tables are quite large. Finding a way to substantially reduce their memory usage without adversely affecting performance would be a worthy goal.

aqrit commented Jul 17, 2018

The table for the encoding shuffle can be reduced by 75%.
This would cost 1 cycle per loop, I believe.

Same idea as the small tables in simd_prune/despacer projects.
Bit-6 and bit-7 of the key byte can be ignored if the corresponding input bytes (the last four bytes) are always written to the output. Unwanted output byte(s) will be overwritten by the next chunk, anyways.

Thus the table only needs 64x16 bytes instead of 256x16 bytes.
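The overwrite trick can be illustrated in scalar C (a sketch only; `encode_group` is a hypothetical name, and this stands in for the actual SIMD shuffle, assuming a little-endian target):

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the overwrite trick. Every value is written as a full
 * 4-byte little-endian word, but the output pointer advances only by the
 * value's significant length, so surplus bytes are overwritten by the next
 * value. Because the group's last value is thus always safe to emit as a
 * full 4 bytes, the SIMD encode shuffle can ignore the key's top two bits
 * (index with `key & 0x3F`), shrinking the table from 256 to 64 entries.
 * The output buffer needs up to 3 bytes of slack past the end. */
static size_t encode_group(const uint32_t in[4], uint8_t *out, uint8_t *keyOut) {
    uint8_t key = 0;
    uint8_t *p = out;
    for (int i = 0; i < 4; i++) {
        uint32_t v = in[i];
        int len = (v > 0xFFFFFF) ? 4 : (v > 0xFFFF) ? 3 : (v > 0xFF) ? 2 : 1;
        key |= (uint8_t)((len - 1) << (2 * i));  /* 2-bit length code per value */
        memcpy(p, &v, 4);  /* always store 4 bytes... */
        p += len;          /* ...but only claim `len` of them */
    }
    *keyOut = key;
    return (size_t)(p - out);  /* true compressed length of the group */
}
```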

aqrit commented Jul 19, 2018

Just mask off the top 2 bits of the key byte.

https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102#file-svb_alt-c-L398

lemire commented Jul 19, 2018

Damn!!! You are right.

lemire commented Jul 20, 2018

Update: I was thinking about the encoder in my reply.

Coming back to it, I don't see how it can work. The last two bits give us how many bytes are non-null in the last of 4 integers.

Suppose you don't have this information... then you will fill the last four bytes with data. It is not true that "unwanted output byte(s) will be overwritten by the next chunk, anyways." You need some way to set to zero the unwanted bytes.

If you have some way to make it work, please share.

Your general idea is directly applicable to other problems, however. But I don't see it here.

aqrit commented Jul 20, 2018

You need some way to set to zero the unwanted bytes.

The encoder never needs to zero bytes. Ever.

Suppose you don't have this information...

The info is still saved for use by the decoder.

Here is a working drop-in replacement:
https://gist.github.com/aqrit/746d2f5e4ad1909230e2283272333dc1

please test.

lemire commented Jul 20, 2018

Oh! OH ! For the encoder. I see. Yes.

lemire commented Jul 20, 2018

Good point. I was thinking about the decoder.

lemire commented Jul 20, 2018

@KWillets @vkazanov : It seems that @aqrit has a fast vectorized encoder that uses smaller tables (and thus less cache).

aqrit commented Aug 5, 2018

The decoding shuffle table size could be reduced by 75% ... (256x16 vs 256x4)
at a cost of 1 cycle per 128-bits of output (AFAIK)

The shuffle control masks get compressed to 32-bit integers.
Masks are decompressed (32->128) by two instructions.
However, the table index now scales for "free",
saving 1 instruction.

gist
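The compressed-table idea can be sketched in scalar C (an illustration only; the entry layout here, the first-byte offset of each value, is an assumption and need not match the gist, and the loops stand in for the two SIMD decompression instructions):

```c
#include <stdint.h>
#include <string.h>

/* Scalar illustration of a compressed decode shuffle table: instead of a
 * 16-byte pshufb control per key (256x16 bytes), store one 4-byte entry
 * per key (256x4). Here byte i of an entry holds the offset of value i's
 * first byte inside the compressed chunk; the per-value lengths come from
 * the key itself, so the full control can be rebuilt at run time.
 * 0x80 marks a byte that pshufb would zero. */
static void expand_mask(uint8_t key, const uint8_t entry[4], uint8_t mask[16]) {
    for (int i = 0; i < 4; i++) {
        int len = ((key >> (2 * i)) & 3) + 1;  /* value i occupies 1..4 bytes */
        for (int j = 0; j < 4; j++)
            mask[4 * i + j] = (j < len) ? (uint8_t)(entry[i] + j) : 0x80;
    }
}

/* pshufb semantics in scalar form: high bit set => zero the output byte. */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}
```

With 4-byte entries the table can be indexed as an array of `uint32_t`, which is why the index now scales for "free": `table[key]` replaces a `key * 16` address computation.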

lemire commented Aug 5, 2018

Looks clever. I'll investigate soon.

lemire commented Aug 22, 2018

Completed with efb310d

Your name has been added as author.

lemire closed this as completed Aug 22, 2018
aqrit commented Aug 22, 2018

Neat.
Though, I've still got things to try on my todo list:

  1. Transpose elements during delta encoding to boost delta decode speed?
    Back-of-the-envelope calculations (for a block of 14 xmmwords) indicate that would shave 30 instructions off decoding (per block)... but one would have to wait until all 14 xmmwords were unpacked to begin the delta decode step, so...?

  2. Using AVX2, check the speed of generating the 'compact' (32-bit) decode shuffle control on-the-fly? Obviously this requires a format change, the 2-bit keys would have to be interleaved (1 per byte) instead of packed (all 4 in one byte).

  3. New tail loop on SSSE3 decoder if original count was 16 or greater.

dataPtr -= 4;                 // compensate for the first advance inside the loop
for(...){
    size_t k = keys & 3;      // 2-bit length code: this value occupies k+1 bytes
    keys >>= 2;
    dataPtr += k + 1;         // now points 3 bytes before the value's last byte
    uint32_t dw = *((uint32_t*)dataPtr);  // unaligned load of the 4 bytes ending at the value's last byte
    dw >>= ((k ^ 3) * 8);     // (k ^ 3) == 3 - k: shift out the bytes that precede the value
    *out++ = dw;              // store the decoded value
}
The loads reach up to 3 bytes before the current value, which is why this tail requires that at least 16 values were decoded ahead of it.
  4. Add zigzag delta transform
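The zigzag transform mentioned in the last item is the standard mapping (as used by, e.g., Protocol Buffers); a minimal sketch:

```c
#include <stdint.h>

/* Zigzag maps small signed deltas to small unsigned values so they pack
 * into fewer bytes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... */
static uint32_t zigzag_encode(int32_t d) {
    /* relies on arithmetic right shift of a negative int32_t,
     * which is implementation-defined but universal in practice */
    return ((uint32_t)d << 1) ^ (uint32_t)(d >> 31);
}

static int32_t zigzag_decode(uint32_t z) {
    return (int32_t)((z >> 1) ^ (0u - (z & 1)));
}
```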

lemire commented Aug 22, 2018

+1

KWillets commented Dec 9, 2018

I had a thought related to this for the decoder which might be interesting. It adds a few steps vs. a raw lookup, but it may allow scaling to larger register sizes with only logarithmic time growth, and sqrt-sized shuffle tables vs. the original LUT.

The idea is to split the blob of compressed bytes into 8-byte "lanes" based only on the length of the first two elements, and then into 4-byte words based on the lengths of the first and third. We use zero-filling to allow the second halves to be processed as fixed 8 or 4-byte fields.

Basic Algorithm: Given a stream of bytes and a control byte {L1,L2,L3,L4}, deposit L1 bytes in the first word, L2 in the 2nd, etc., zero-filling the remaining bytes in each word.

  1. Load L1+L2+L3+L4 bytes into a 16-byte register and zero-fill the remaining bytes.

  2. Look up and apply a shuffle from L1+L2, which keeps the first L1+L2 bytes in place and moves the following 8 bytes to [8..15] (zero-filling the first 8 as needed).

  3. Lookup and apply a shuffle from (L1,L3) which is similar in each 8-byte lane: Keep the first L1 and L3 bytes in place and move the following 4 bytes to [4..7] and [12..15] respectively, zero-filling the first and third words.

The lookup tables in steps 2 and 3 are size 7 and 16, respectively, although the lookups based on L1+L2 and (L1, L3) may need an intermediate table to extract those values quickly.

Step 1 may also be skipped by making the shuffle in step 2 include the length of the second half (49 rather than 7 shuffles), so that the shuffle can work directly on the byte stream without pre-masking.

For SSE this is slower, but for AVX etc. it's O(log(register width)) steps, and e.g. the shuffle tables for 8-at-a-time max out at 256 elements rather than 64k (sqrt of the original method).
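The bookkeeping of the three steps can be checked with a scalar simulation (a sketch under assumed names; `pshufb16` models the byte shuffle in a loop, with the high bit of a control byte meaning "zero this output byte"):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of pshufb: high bit set => zero, else pick src[m & 15]. */
static void pshufb16(const uint8_t src[16], const uint8_t m[16], uint8_t dst[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (m[i] & 0x80) ? 0 : src[m[i] & 0x0F];
}

/* Deposit L1..L4 input bytes into the four output words, zero-filled,
 * using only shuffles keyed by (L1+L2) and (L1, L3) as described above. */
static void split(const uint8_t *in, const int L[4], uint8_t out[16]) {
    uint8_t reg[16] = {0}, m[16], tmp[16];
    int total = L[0] + L[1] + L[2] + L[3];
    memcpy(reg, in, (size_t)total);              /* step 1: load + zero-fill */

    int s = L[0] + L[1];                         /* step 2: split into 8-byte lanes */
    for (int p = 0; p < 8; p++)  m[p] = (p < s) ? (uint8_t)p : 0x80;
    for (int p = 8; p < 16; p++) m[p] = (uint8_t)(s + p - 8);
    pshufb16(reg, m, tmp);

    for (int lane = 0; lane < 2; lane++) {       /* step 3: split lanes into words */
        int base = 8 * lane;
        int k = L[2 * lane];                     /* L1 for lane 0, L3 for lane 1 */
        for (int p = 0; p < 4; p++) m[base + p] = (p < k) ? (uint8_t)(base + p) : 0x80;
        for (int p = 4; p < 8; p++) m[base + p] = (uint8_t)(base + k + p - 4);
    }
    pshufb16(tmp, m, out);
}
```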

KWillets reopened this Dec 9, 2018
lemire commented Dec 10, 2018

Interesting.

@KWillets

I was looking at this for UTF-8 conversion as well, but it seems easier when the control byte is already available. utf-8 needs a lot of pmovmskb/pdep/tzcnt's to get the lengths.

aqrit commented Dec 11, 2018

Step #1 is expensive w/AVX2. AVX2 can't shuffle bytes across 128-bit lanes, only dwords/qwords/owords.
It makes me doubt that this approach would be better than the "compressed shuffle table" I posted earlier in this thread.
