Speed up JPEG Decoder color conversion #1121
Comments
@brianpopow is packing useful for other codecs?
You can get a nice easy win just by using the intrinsics for your float->byte conversions, because the float->int instructions include rounding, and the packing instructions are saturating so you can skip the clamping step. That's as simple as this: You can speed up Y'CbCr->RGB conversion with intrinsics as well, but I was only able to make the SIMD version faster than scalar with AVX. Any matrix multiplication can be accelerated with
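The float->byte idea in the comment above can be modelled in scalar form. The following is a Python sketch of the semantics only, not the C# intrinsics code the comment points to: it assumes the float->int32 conversion rounds to nearest-even (the default rounding mode of instructions such as `cvtps2dq`) and that narrowing uses a signed-saturating pack followed by an unsigned-saturating pack (as `packssdw` and `packuswb` do). The helper names are hypothetical.

```python
def pack_i32_to_i16_saturate(v: int) -> int:
    """Model of a signed-saturating int32 -> int16 pack."""
    return max(-32768, min(32767, v))


def pack_i16_to_u8_saturate(v: int) -> int:
    """Model of an unsigned-saturating int16 -> uint8 pack."""
    return max(0, min(255, v))


def floats_to_bytes(values):
    """Convert floats in the nominal 0-255 range to bytes.

    No explicit clamp is written: rounding comes from the conversion and
    clamping falls out of the two saturating packs. Inputs outside the
    int32 range are not modelled here.
    """
    out = bytearray()
    for v in values:
        as_int = round(v)  # Python's round() also rounds ties to even
        out.append(pack_i16_to_u8_saturate(pack_i32_to_i16_saturate(as_int)))
    return bytes(out)
```

For example, `floats_to_bytes([-3.2, 254.6, 300.0])` yields the bytes 0, 255, 255: negative and overflowing inputs are clamped by saturation alone.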
Unrelated: @saucecontrol your repo is seriously impressive, incredible work there! 👏 I have a couple questions:
Thanks! I did those benchmarks with the current dev at the time (1.0.0-dev003181). I did a quick run yesterday with 1.0.0-unstable0806, and GC allocations were way down but speed was a bit worse. I see @JimBobSquarePants has a new PR that improves perf. Hopefully that gets merged this weekend? I plan on re-running the whole suite since I've got a new MagicScaler version releasing today. Would be good to have all the ImageSharp improvements in for that.
Those numbers will always be misleading because System.Drawing is just a thin wrapper around native code (GDI+). That's why I threw in a more comprehensive test that also measured unmanaged memory allocations.
@saucecontrol thanks for the suggestions! Introducing a For Y'CbCr -> RGB, we already have a SIMD path that compiles to AVX2 under the hood, and it outperforms our previous scalar approaches by a lot. It's the slow scalar packing that hurts the most at the moment. It will be much better to do the
You're certainly welcome to borrow any of my code you find useful. My Y'CbCr conversion is here: You'll notice the scalar float version reads/writes directly to/from the buffers rather than going through a
Just realized in looking at the code again, I was doing an extra round of shuffles just to be able to use I see another quick win possibility in your SIMD Y'CbCr->RGB path if you just hoist those constant
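For readers following the Y'CbCr->RGB discussion in the comments above, here is a scalar Python model of that colour conversion on planar data. It is a sketch under stated assumptions, not the ImageSharp or MagicScaler code: the coefficients are the usual full-range JFIF ones (the thread itself does not state them), the function name is hypothetical, and the constants are simply defined once outside the loop.

```python
# Full-range JFIF coefficients (assumed; not quoted in the thread).
CR_TO_R = 1.402
CB_TO_G = -0.344136
CR_TO_G = -0.714136
CB_TO_B = 1.772
CHROMA_OFFSET = 128.0


def ycbcr_planes_to_rgb_planes(y, cb, cr):
    """Convert three planar float sequences (0-255 range) to planar R, G, B."""
    if not (len(y) == len(cb) == len(cr)):
        raise ValueError("planes must have equal length")
    r, g, b = [], [], []
    for luma, cb_v, cr_v in zip(y, cb, cr):
        cb_c = cb_v - CHROMA_OFFSET  # centre chroma around zero
        cr_c = cr_v - CHROMA_OFFSET
        r.append(luma + CR_TO_R * cr_c)
        g.append(luma + CB_TO_G * cb_c + CR_TO_G * cr_c)
        b.append(luma + CB_TO_B * cb_c)
    return r, g, b
```

The outputs are still floats and may fall outside 0-255; rounding and clamping to bytes is a separate step.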
The WebP lossy format is similar to JPEG in that the image is also represented with YUV and split up into macroblocks. The difference is that the DCT coefficients are represented as byte rather than float, and all the decoding is done with integer operations. For the other formats we currently have, I also cannot see this being applied. Unfortunately, this optimization looks rather specific to JPEG at the moment.
I'll give this a shot. A couple of questions:
Awesome! Not sure I understand what you mean by the first question but
@JimBobSquarePants I get the 32/24 behaviour, but I'm not sure what is meant by the default impl
Ah right.... So we have a "default" PixelOperations class which we then override with implementations specific to individual pixel formats. The default version will use vectors and the various members of Whereas the explicit versions contain optimizations that take advantage of the size and layout of the individual pixel formats.
In other words: the default implementation contains generic code that utilizes the However, instead of extending that interface (more code bloat for low benefit), I suggest to convert
Considering implementing this. One question: how was (and is) it possible to implement this without immediate cast from Right now we have support for: We either need to allocate 2 extra buffers for the grayscale or we can cast to
Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!
As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.
/cc @Sergio0694 @saucecontrol
Current pipeline
Summary of steps currently done by ConvertColorsInto:
- [D]: Data representation
- (T): Bulk transformation between data representations

(case a) Y+Cb+Cr planes --> Single Rgba32 buffer
- float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
- Vector4 buffer
- Memory<Vector4>
- Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer could be handled as a homogeneous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
- Rgba32 buffer

(case b) Y+Cb+Cr planes --> Single Rgb24 buffer
- float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
- Vector4 buffer
- Memory<Vector4>
- Convert the Vector4 buffer to an Rgba32 buffer with BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
- Rgba32 buffer
- PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!)
- Rgb24 buffer

Optimized pipeline
(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer
- float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
- (3 x Buffer2D<float>, R+G+B)
- Convert the float buffers to byte buffers using SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows
- Buffer2D<byte>, R+G+B
- Rgb24 buffer
- Rgb24 buffer

(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer
- float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
- Rgb24 case
- Memory<Rgb24>
- Convert the Rgb24 buffer to a TPixel buffer using PixelOperations<T>
- TPixel buffer

The magic is mostly in the D3->D4 transition, because we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b) we can omit an unnecessary step.

API proposal for packing
The best thing is that we can handle this big task incrementally:
- Extend PixelOperations<T> by new packing operations
- JpegImagePostProcessor as described in the Optimized pipeline paragraph

The packing API is pretty straightforward:
We can define default implementations in the base PixelOperations<TPixel> class, and specialize them for Rgba32 and Rgb24. An optional hardcore task is to T4 a SIMD implementation of it for all the RGB(A)-like formats.

Note
It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task, both implementation- and architecture-wise, and prefer incremental evolution instead.
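The data movement behind the proposed packing step (three planar byte buffers interleaved into one Rgb24 buffer, optionally followed by a conversion to another pixel type) can be modelled in a few lines. This is a Python sketch of the byte layout only. The function names are hypothetical and are not the proposed PixelOperations signatures, which are not reproduced in this text. Rgba32 is used as one example target for the TPixel case.

```python
def pack_rgb_planes(r: bytes, g: bytes, b: bytes) -> bytes:
    """Interleave planar R, G, B bytes into packed RGB24: R0 G0 B0 R1 G1 B1 ..."""
    if not (len(r) == len(g) == len(b)):
        raise ValueError("planes must have equal length")
    packed = bytearray(3 * len(r))
    packed[0::3] = r  # every third byte starting at 0 is red
    packed[1::3] = g
    packed[2::3] = b
    return bytes(packed)


def rgb24_to_rgba32(rgb: bytes, alpha: int = 255) -> bytes:
    """Expand packed RGB24 to RGBA32 by appending an alpha byte per pixel."""
    if len(rgb) % 3 != 0:
        raise ValueError("RGB24 buffer length must be a multiple of 3")
    pixel_count = len(rgb) // 3
    out = bytearray(4 * pixel_count)
    out[0::4] = rgb[0::3]
    out[1::4] = rgb[1::3]
    out[2::4] = rgb[2::3]
    out[3::4] = bytes([alpha]) * pixel_count
    return bytes(out)
```

A SIMD version would do the same interleaving with shuffle and permute instructions on whole vectors; the strided slices here only stand in for that.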