Reduce the size of the lookup tables #14
Comments
The table for the encoding shuffle can be reduced by 75%, using the same idea as the small tables in the simd_prune/despacer projects. The table then only needs 64x16 bytes instead of 256x16 bytes.
Just mask off the top 2 bits of the key byte: https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102#file-svb_alt-c-L398
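
A minimal sketch of the idea (the table layout and all names here are my own and may differ from the gist and from what was eventually committed): the encoder's shuffle depends only on the lengths of the first three integers, so the table can be indexed by `key & 0x3F` (64 entries). Each entry simply emits all 4 bytes of the fourth integer; any surplus bytes past the true encoded length are overwritten when the next quad is stored.

```c
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

static uint8_t enc_shuffle[64][16]; /* 64x16 bytes instead of 256x16 */
static uint8_t enc_length[256];     /* true encoded length per key byte */

static void init_enc_tables(void) {
    for (int key = 0; key < 256; key++) {
        unsigned total = 0;
        for (int i = 0; i < 4; i++) total += ((key >> (2 * i)) & 3) + 1;
        enc_length[key] = (uint8_t)total;
    }
    for (int k6 = 0; k6 < 64; k6++) {
        uint8_t *s = enc_shuffle[k6];
        unsigned j = 0;
        for (int i = 0; i < 3; i++) {          /* lengths of values 0..2 */
            unsigned len = ((k6 >> (2 * i)) & 3) + 1;
            for (unsigned b = 0; b < len; b++) s[j++] = (uint8_t)(4 * i + b);
        }
        for (unsigned b = 0; b < 4; b++)       /* always all 4 bytes of value 3 */
            s[j++] = (uint8_t)(12 + b);
        while (j < 16) s[j++] = 0xFF;          /* unused output lanes */
    }
}

/* in: four uint32 values; key: their control byte; returns advanced out ptr */
static uint8_t *encode_quad(__m128i in, uint8_t key, uint8_t *out) {
    __m128i shuf = _mm_loadu_si128((const __m128i *)enc_shuffle[key & 0x3F]);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(in, shuf));
    return out + enc_length[key]; /* junk tail is overwritten by the next store */
}
```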
Damn!!! You are right.
Update: I was thinking about the encoder in my reply. Coming back to it, I don't see how it can work. The last two bits tell us how many bytes are non-null in the last of the 4 integers. Suppose you don't have this information... then you will fill the last four bytes with data. It is not true that "unwanted output byte(s) will be overwritten by the next chunk, anyways." You need some way to set the unwanted bytes to zero. If you have a way to make it work, please share. Your general idea is directly applicable to other problems, but I don't see it here.
The encoder never needs to zero bytes. Ever.
The info is still saved for use by the decoder. Here is a working drop-in replacement; please test.
Oh! OH! For the encoder. I see. Yes.
Good point. I was thinking about the decoder.
The decoding shuffle table size could also be reduced by 75% (256x16 bytes vs. 256x4 bytes): the shuffle control masks get compressed to 32-bit integers.
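
One way the compression to 32-bit integers could work (this packing is an assumption for illustration; the actual encoding in the gist may differ): store, for each of the 4 output dwords, one descriptor byte holding the source byte offset in the low nibble and the byte length in the high nibble, then expand that descriptor into a full pshufb mask at runtime:

```c
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

static uint32_t dec_packed[256]; /* 256x4 bytes instead of 256x16 */

static void init_dec_table(void) {
    for (int key = 0; key < 256; key++) {
        uint32_t entry = 0;
        unsigned off = 0;
        for (int i = 0; i < 4; i++) {
            unsigned len = ((key >> (2 * i)) & 3) + 1;
            entry |= (uint32_t)((len << 4) | off) << (8 * i); /* fits: off <= 12 */
            off += len;
        }
        dec_packed[key] = entry;
    }
}

static __m128i expand_shuffle(uint32_t packed) {
    __m128i desc = _mm_cvtsi32_si128((int)packed);
    /* replicate each descriptor byte across its 4-byte output lane */
    desc = _mm_shuffle_epi8(desc, _mm_set_epi8(3,3,3,3, 2,2,2,2,
                                               1,1,1,1, 0,0,0,0));
    __m128i off = _mm_and_si128(desc, _mm_set1_epi8(0x0F));
    __m128i len = _mm_and_si128(_mm_srli_epi16(desc, 4), _mm_set1_epi8(0x0F));
    __m128i pos = _mm_set_epi8(3,2,1,0, 3,2,1,0, 3,2,1,0, 3,2,1,0);
    /* lane index >= length -> force 0xFF so pshufb writes a zero byte */
    __m128i oob = _mm_cmpgt_epi8(pos, _mm_sub_epi8(len, _mm_set1_epi8(1)));
    return _mm_or_si128(_mm_add_epi8(off, pos), oob);
}

/* reads 16 bytes from src, like the usual full-table kernel */
static __m128i decode_quad(const uint8_t *src, uint8_t key) {
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    return _mm_shuffle_epi8(data, expand_shuffle(dec_packed[key]));
}
```

The trade is three or four extra cheap SIMD ops per quad in exchange for a 4x smaller, more cache-friendly table.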
Looks clever. I'll investigate soon.
Completed with efb310d. Your name has been added as an author.
Neat.
+1
I had a thought related to this for the decoder which might be interesting. It adds a few steps vs. a raw lookup, but it may allow scaling to larger register sizes with only logarithmic time growth, and sqrt-sized shuffle tables vs. the original LUT. The idea is to split the blob of compressed bytes into 8-byte "lanes" based only on the lengths of the first two elements, and then into 4-byte words based on the lengths of the first and third. We use zero-filling to allow the second halves to be processed as fixed 8- or 4-byte fields. Basic algorithm: given a stream of bytes and a control byte {L1,L2,L3,L4}, deposit L1 bytes in the first word, L2 in the second, etc., zero-filling the remaining bytes in each word.
The lookup tables in steps 1 and 2 are size 7 and 16, respectively, although the lookups based on L1+L2 and (L1, L3) may need an intermediate table to extract those values quickly. Step 1 may also be skipped by making the shuffle in step 2 include the length of the second half (49 rather than 7 shuffles), so that the shuffle can work directly on the byte stream without pre-masking. For SSE this is slower, but for AVX etc. it's O(log(register width)) steps, and e.g. the shuffle tables for 8-at-a-time max out at 256 entries rather than 64K (the square root of the original method's table size).
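
A scalar model of the two split steps, as I understand them (hypothetical helper, little-endian assumed; a SIMD version would replace each memcpy pair with one masked shuffle picked from the small tables described above):

```c
#include <stdint.h>
#include <string.h>

static void split_decode(const uint8_t *src, uint8_t key, uint32_t out[4]) {
    unsigned len[4];
    for (int i = 0; i < 4; i++) len[i] = ((key >> (2 * i)) & 3) + 1;

    /* step 1: split the byte stream into two zero-filled 8-byte lanes
       using only L1+L2 (7 possible split points: 2..8) */
    uint8_t lane[2][8] = {{0}};
    unsigned half = len[0] + len[1];
    memcpy(lane[0], src, half);
    memcpy(lane[1], src + half, len[2] + len[3]);

    /* step 2: split each lane into two zero-filled 4-byte words,
       using L1 for the first lane and L3 for the second */
    uint8_t w[4][4] = {{0}};
    memcpy(w[0], lane[0], len[0]);
    memcpy(w[1], lane[0] + len[0], len[1]);
    memcpy(w[2], lane[1], len[2]);
    memcpy(w[3], lane[1] + len[2], len[3]);
    for (int i = 0; i < 4; i++) memcpy(&out[i], w[i], 4);
}
```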
Interesting.
I was looking at this for UTF-8 conversion as well, but it seems easier when the control byte is already available. UTF-8 needs a lot of pmovmskb/pdep/tzcnt calls to get the lengths.
Step #1 is expensive with AVX2: AVX2 can't shuffle bytes across 128-bit lanes, only dwords/qwords/owords.
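
For reference, the limitation being pointed out (illustrative snippet, not from the project): `_mm256_shuffle_epi8` (vpshufb) only moves bytes within each 128-bit half of the register, so a byte-granular split across the middle needs an extra cross-lane step such as `_mm256_permutevar8x32_epi32` (vpermd), which permutes whole dwords only.

```c
#include <immintrin.h>

static __m256i cross_lane_dword_permute(__m256i v) {
    /* reverse the 8 dwords across the whole 256-bit register: possible
       with a single vpermd, impossible with a single vpshufb */
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_epi32(v, idx);
}
```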
The current lookup tables are quite large. Finding a way to substantially reduce their memory usage without adversely affecting performance would be a worthy goal.