
Reduce the size of the lookup tables #14

Open
lemire opened this issue Dec 4, 2017 · 17 comments

lemire commented Dec 4, 2017

The current lookup tables are quite large. Finding a way to substantially reduce their memory usage without adversely affecting performance would be a worthy goal.

aqrit commented Jul 17, 2018

The table for the encoding shuffle can be reduced by 75%.
This would cost 1 cycle per loop, I believe.

Same idea as the small tables in simd_prune/despacer projects.
Bit-6 and bit-7 of the key byte can be ignored if the corresponding input bytes (the last four bytes) are always written to the output. Unwanted output byte(s) will be overwritten by the next chunk, anyways.

Thus the table only needs 64x16 bytes instead of 256x16 bytes.
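The overwrite trick can be illustrated in scalar C (a sketch only; `encode_group` is a hypothetical name, and this stands in for the actual SIMD shuffle, assuming a little-endian target):

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the overwrite trick. Every value is written as a full
 * 4-byte little-endian word, but the output pointer advances only by the
 * value's significant length, so surplus bytes are overwritten by the next
 * value. Because the group's last value is thus always safe to emit as a
 * full 4 bytes, the SIMD encode shuffle can ignore the key's top two bits
 * (index with `key & 0x3F`), shrinking the table from 256 to 64 entries.
 * The output buffer needs up to 3 bytes of slack past the end. */
static size_t encode_group(const uint32_t in[4], uint8_t *out, uint8_t *keyOut) {
    uint8_t key = 0;
    uint8_t *p = out;
    for (int i = 0; i < 4; i++) {
        uint32_t v = in[i];
        int len = (v > 0xFFFFFF) ? 4 : (v > 0xFFFF) ? 3 : (v > 0xFF) ? 2 : 1;
        key |= (uint8_t)((len - 1) << (2 * i));  /* 2-bit length code per value */
        memcpy(p, &v, 4);  /* always store 4 bytes... */
        p += len;          /* ...but only claim `len` of them */
    }
    *keyOut = key;
    return (size_t)(p - out);  /* true compressed length of the group */
}
```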

aqrit commented Jul 19, 2018

Just mask off the top 2 bits of the key byte.

https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102#file-svb_alt-c-L398

lemire commented Jul 19, 2018

Damn!!! You are right.

lemire commented Jul 20, 2018

Update: I was thinking about the encoder in my reply.

Coming back to it, I don't see how it can work. The last two bits give us how many bytes are non-null in the last of 4 integers.

Suppose you don't have this information... then you will fill the last four bytes with data. It is not true that "unwanted output byte(s) will be overwritten by the next chunk, anyways." You need some way to set to zero the unwanted bytes.

If you have some way to make it work, please share.

Your general idea is directly applicable to other problems, however. But I don't see it here.

aqrit commented Jul 20, 2018

You need some way to set to zero the unwanted bytes.

The encoder never needs to zero bytes. Ever.

Suppose you don't have this information...

The info is still saved for use by the decoder.

Here is a working drop-in replacement:
https://gist.github.com/aqrit/746d2f5e4ad1909230e2283272333dc1

please test.

lemire commented Jul 20, 2018

Oh! OH ! For the encoder. I see. Yes.

lemire commented Jul 20, 2018

Good point. I was thinking about the decoder.

lemire commented Jul 20, 2018

@KWillets @vkazanov : It seems that @aqrit has a fast vectorized encoder that uses smaller tables (and thus less cache).

aqrit commented Aug 5, 2018

The decoding shuffle table size could be reduced by 75% ... (256x16 vs 256x4)
at a cost of 1 cycle per 128-bits of output (AFAIK)

The shuffle control masks get compressed to 32-bit integers.
Masks are decompressed (32->128) by two instructions.
However, the table index now scales for "free",
saving 1 instruction.

gist
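The compressed-table idea can be sketched in scalar C (an illustration only; the entry layout here, the first-byte offset of each value, is an assumption and need not match the gist, and the loops stand in for the two SIMD decompression instructions):

```c
#include <stdint.h>
#include <string.h>

/* Scalar illustration of a compressed decode shuffle table: instead of a
 * 16-byte pshufb control per key (256x16 bytes), store one 4-byte entry
 * per key (256x4). Here byte i of an entry holds the offset of value i's
 * first byte inside the compressed chunk; the per-value lengths come from
 * the key itself, so the full control can be rebuilt at run time.
 * 0x80 marks a byte that pshufb would zero. */
static void expand_mask(uint8_t key, const uint8_t entry[4], uint8_t mask[16]) {
    for (int i = 0; i < 4; i++) {
        int len = ((key >> (2 * i)) & 3) + 1;  /* value i occupies 1..4 bytes */
        for (int j = 0; j < 4; j++)
            mask[4 * i + j] = (j < len) ? (uint8_t)(entry[i] + j) : 0x80;
    }
}

/* pshufb semantics in scalar form: high bit set => zero the output byte. */
static void shuffle16(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}
```

With 4-byte entries the table can be indexed as an array of `uint32_t`, which is why the index now scales for "free": `table[key]` replaces a `key * 16` address computation.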

lemire commented Aug 5, 2018

Looks clever. I'll investigate soon.

lemire commented Aug 22, 2018

Completed with efb310d

Your name has been added as author.

lemire closed this as completed Aug 22, 2018
aqrit commented Aug 22, 2018

Neat.
Though, I've still got things to try on my todo list:

  1. Transpose elements during delta encoding to boost delta decode speed?
    Back-of-the-envelope calculations (for a block of 14 xmmwords) indicate that would shave 30 instructions off decoding (per block)... but one would have to wait until all 14 xmmwords were unpacked to begin the delta decode step, so...?

  2. Using AVX2, check the speed of generating the 'compact' (32-bit) decode shuffle control on-the-fly? Obviously this requires a format change, the 2-bit keys would have to be interleaved (1 per byte) instead of packed (all 4 in one byte).

  3. New tail loop on SSSE3 decoder if original count was 16 or greater.

dataPtr -= 4;                 // compensate for the first advance inside the loop
for(...){
    size_t k = keys & 3;      // 2-bit length code: this value occupies k+1 bytes
    keys >>= 2;
    dataPtr += k + 1;         // now points 3 bytes before the value's last byte
    uint32_t dw = *((uint32_t*)dataPtr);  // unaligned load of the 4 bytes ending at the value's last byte
    dw >>= ((k ^ 3) * 8);     // (k ^ 3) == 3 - k: shift out the bytes that precede the value
    *out++ = dw;              // store the decoded value
}
The loads reach up to 3 bytes before the current value, which is why this tail requires that at least 16 values were decoded ahead of it.
  4. Add zigzag delta transform
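The zigzag transform mentioned in the last item is the standard mapping (as used by, e.g., Protocol Buffers); a minimal sketch:

```c
#include <stdint.h>

/* Zigzag maps small signed deltas to small unsigned values so they pack
 * into fewer bytes: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ... */
static uint32_t zigzag_encode(int32_t d) {
    /* relies on arithmetic right shift of a negative int32_t,
     * which is implementation-defined but universal in practice */
    return ((uint32_t)d << 1) ^ (uint32_t)(d >> 31);
}

static int32_t zigzag_decode(uint32_t z) {
    return (int32_t)((z >> 1) ^ (0u - (z & 1)));
}
```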

lemire commented Aug 22, 2018

+1

KWillets commented Dec 9, 2018

I had a thought related to this for the decoder which might be interesting. It adds a few steps vs. a raw lookup, but it may allow scaling to larger register sizes with only logarithmic time growth, and sqrt-sized shuffle tables vs. the original LUT.

The idea is to split the blob of compressed bytes into 8-byte "lanes" based only on the length of the first two elements, and then into 4-byte words based on the lengths of the first and third. We use zero-filling to allow the second halves to be processed as fixed 8 or 4-byte fields.

Basic Algorithm: Given a stream of bytes and a control byte {L1,L2,L3,L4}, deposit L1 bytes in the first word, L2 in the 2nd, etc., zero-filling the remaining bytes in each word.

  1. Load L1+L2+L3+L4 bytes into a 16-byte register and zero-fill the remaining bytes.

  2. Look up and apply a shuffle from L1+L2, which keeps the first L1+L2 bytes in place and moves the following 8 bytes to [8..15] (zero-filling the first 8 as needed).

  3. Lookup and apply a shuffle from (L1,L3) which is similar in each 8-byte lane: Keep the first L1 and L3 bytes in place and move the following 4 bytes to [4..7] and [12..15] respectively, zero-filling the first and third words.

The lookup tables in steps 2 and 3 are size 7 and 16, respectively, although the lookups based on L1+L2 and (L1, L3) may need an intermediate table to extract those values quickly.

Step 1 may also be skipped by making the shuffle in step 2 include the length of the second half (49 rather than 7 shuffles), so that the shuffle can work directly on the byte stream without pre-masking.

For SSE this is slower, but for AVX etc. it's O(log(register width)) steps, and e.g. the shuffle tables for 8-at-a-time max out at 256 elements rather than 64k (sqrt of the original method).
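The bookkeeping of the three steps can be checked with a scalar simulation (a sketch under assumed names; `pshufb16` models the byte shuffle in a loop, with the high bit of a control byte meaning "zero this output byte"):

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of pshufb: high bit set => zero, else pick src[m & 15]. */
static void pshufb16(const uint8_t src[16], const uint8_t m[16], uint8_t dst[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (m[i] & 0x80) ? 0 : src[m[i] & 0x0F];
}

/* Deposit L1..L4 input bytes into the four output words, zero-filled,
 * using only shuffles keyed by (L1+L2) and (L1, L3) as described above. */
static void split(const uint8_t *in, const int L[4], uint8_t out[16]) {
    uint8_t reg[16] = {0}, m[16], tmp[16];
    int total = L[0] + L[1] + L[2] + L[3];
    memcpy(reg, in, (size_t)total);              /* step 1: load + zero-fill */

    int s = L[0] + L[1];                         /* step 2: split into 8-byte lanes */
    for (int p = 0; p < 8; p++)  m[p] = (p < s) ? (uint8_t)p : 0x80;
    for (int p = 8; p < 16; p++) m[p] = (uint8_t)(s + p - 8);
    pshufb16(reg, m, tmp);

    for (int lane = 0; lane < 2; lane++) {       /* step 3: split lanes into words */
        int base = 8 * lane;
        int k = L[2 * lane];                     /* L1 for lane 0, L3 for lane 1 */
        for (int p = 0; p < 4; p++) m[base + p] = (p < k) ? (uint8_t)(base + p) : 0x80;
        for (int p = 4; p < 8; p++) m[base + p] = (uint8_t)(base + k + p - 4);
    }
    pshufb16(tmp, m, out);
}
```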

KWillets reopened this Dec 9, 2018
lemire commented Dec 10, 2018

Interesting.

@KWillets

I was looking at this for UTF-8 conversion as well, but it seems easier when the control byte is already available. utf-8 needs a lot of pmovmskb/pdep/tzcnt's to get the lengths.

aqrit commented Dec 11, 2018

Step #1 is expensive w/AVX2. AVX2 can't shuffle bytes across 128-bit lanes, only dwords/qwords/owords.
It makes me doubt that this approach would be better than the "compressed shuffle table" I posted earlier in this thread.
