Speed up JPEG Decoder color conversion #1121

antonfirsov · 2020-02-22T15:22:37Z

Let's finally beat System.Drawing on the JPEG Load->Resize->Save scenario!

As discussed in #1064, it's finally possible thanks to the Intel SIMD intrinsics in .NET Core 3.1. Opening an issue so we can track this work, and hopefully get some help & feedback from the community.

/cc @Sergio0694 @saucecontrol

Current pipeline

Summary of steps currently done by ConvertColorsInto:

[D]: Data representation
(T): Bulk transformation between data representations

(case a) Y+Cb+Cr planes --> Single `Rgba32` buffer


[D]	*3 Planes of WH sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)**
(T)	Color convert and pack into a single `Vector4` buffer
[D]	Floating point RGBA data as `Memory<Vector4>`
(T)	Convert the `Vector4` buffer to an `Rgba32` buffer. In the `Rgba32` case case, the input buffer could be handled as homogenous `float` buffer, where all individual `float` values should be converted to `byte`-s. The conversion is implemented in `BulkConvertNormalizedFloatToByteClampOverflows`, utilizing AVX2 conversion and narrowing operations through `Vector<T>`
[D]	The result image as an `Rgba32` buffer

(case b) Y+Cb+Cr planes --> Single `Rgb24` buffer


[D]	*3 Planes of WH sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)**
(T)	Color convert and pack into a single `Vector4` buffer
[D]	Floating point RGBA data as `Memory<Vector4>`
(T)	Convert the `Vector4` buffer to an `Rgba32` buffer, utilizing `BulkConvertNormalizedFloatToByteClampOverflows`, utilizing AVX2 conversion and narrow operations through `Vector<T>`
[D]	Temporary `Rgba32` buffer
(T)	`PixelOperations<Rgb24>.FromRgba32()` (sub-optimal, extra transformation!)
[D]	The result image as an `Rgb24` buffer

Optimized pipeline

(default `Rgb24` case) Y+Cb+Cr planes --> Single `Rgb24` buffer


D1	*3 Planes of WH sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)**
(T)	Color convert, the 3 planes, and write them back to the originating buffers
D2	3 Planes of `Buffer2D<float>`, R+G+B)
(T)	Narrow the `float` buffers to `byte` buffers using `SimdUtils.BulkConvertNormalizedFloatToByteClampOverflows`
D3	3 Planes of `Buffer2D<byte>`, R+G+B
(T)	PACK the separate image planes (color channels) into a single `Rgb24` buffer
D4	The result image as an `Rgb24` buffer

(`TPixel` case) Y+Cb+Cr planes --> Single `TPixel` buffer


D1	*3 Planes of WH sized `float` jpeg color channels normalized to 0-255 (3 x `Buffer2D<float>`, Y+Cb+Cr)**
(T)	All the steps from the default `Rgb24` case
D4	`Memory<Rgb24>`
(T)	Convert the `Rgb24` buffer to `TPixel` buffer using `PixelOperations<T>`
D5	The result image as an `TPixel` buffer

The magic is mostly in the D3->D4 transition, because of the fact that we can now do the pixel packing with shuffle and permute intrinsics when those are available. The other fun thing is that if we decode to Image<Rgb24> (case b) we can omit an unnecessary step.

API proposal for packing

The best thing is that we can handle this big task incrementally:

First, extend PixelOperations<T> by new packing operations
Then, adapt the changes in JpegImagePostProcessor as described in the Optimized pipeline paragraph

The packing API is pretty straightforward:

public class PixelOperations<TPixel>
{
    // ...
    
    public void PackFromRgbPlanes(
           Configuration configuration,
	   ReadOnlySpan<byte> redChannel, 
	   ReadOnlySpan<byte> greenChannel, 
	   ReadOnlySpan<byte> blueChannel,
	   Span<TPixel> destination);
}

We can define a default implementations in the base PixelOperations<TPixel> class, and specialize it for Rgba32 and Rgb24. Optional hardcore task is to T4 a SIMD implementation it for all the RGB(A)-like formats.

Note

It is possible to optimize the conversion even further by doing D1->D3 in a single step, but I consider it a very hard task both implementation and architecture-wise, and prefer incremental evolution instead.

The text was updated successfully, but these errors were encountered:

antonfirsov · 2020-02-22T15:35:19Z

@brianpopow is packing useful for other codecs?

saucecontrol · 2020-02-22T19:04:50Z

You can get a nice easy win just by using the intrinsics for your float->byte conversions, because the float->int instructions include rounding, and the packing instructions are saturating so you can skip the clamping step. That's as simple as this:

https://github.com/saucecontrol/PhotoSauce/blob/a9bd6e5162d2160419f0cf743fd4f536c079170b/src/MagicScaler/Magic/Processors/ConvertersFloat.cs#L451-L508

You can speed up Y'CbCr->RGB conversion with intrinsics as well, but I was only able to make the SIMD version faster than scalar with AVX. Any matrix multiplication can be accelerated with DPPS (dot product), but since you really only need 4 of the coefficients from the Y'CbCr->RGB matrix, it doesn't work out any faster unless you can operate on the wider vector.

Sergio0694 · 2020-02-22T19:14:26Z

Unrelated: @saucecontrol your repo is seriously impressive, incredible work there! 👏

I have a couple questions:

What version of ImageSharp did you use to make those benchmarks? I'm particularly impressed by the very high memory usage it has compared to other libs, especially in the parallel benchmark (207x compared to System.Drawing? Am I missing something here? 😶).
If you have time and are interested, I'd be curious to see your updated results for the benchmark where ImageSharp fails with the OOM error - @antonfirsov's recent PR should've fixed that.

saucecontrol · 2020-02-22T19:22:36Z

Thanks! I did those benchmarks with the current dev at the time (1.0.0-dev003181). I did a quick run yesterday with 1.0.0-unstable0806, and GC allocations were way down but speed was a bit worse. I see @JimBobSquarePants has a new PR that improves perf. Hopefully that gets merged this weekend? I plan on re-running the whole suite since I've got a new MagicScaler version releasing today. Would be good to have all the ImageSharp improvements in for that.

saucecontrol · 2020-02-22T19:34:00Z

207x compared to System.Drawing? Am I missing something here?

Those numbers will always be misleading because System.Drawing is just a thin wrapper around native code (GDI+). That's why I threw in a more comprehensive test that also measured unmanaged memory allocations.

antonfirsov · 2020-02-22T23:51:07Z

@saucecontrol thanks for the suggestions! Introducing a System.Runtime.InteropServices AVX2 path for our float -> byte conversion is definitely a long hanging fruit. If we could base our implementation on your code it would be super easy.

For Y'CbCr -> RGB, we already have a SIMD path that compiles to AVX2 under the hood, and it outperforms our previous scalar approaches a lot. It's the slow scalar packing that hurts the most at the moment. It will be way much better to do the float -> byte conversion first, and pack 32 bytes at a single step with AVX2.

saucecontrol · 2020-02-23T00:35:26Z

You're certainly welcome to borrow any of my code you find useful. My Y'CbCr conversion is here:

https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/PlanarConversionTransform.cs#L148-L236

You'll notice the scalar float version reads/writes directly to/from the buffers rather than going through a Vector4. Setting the individual fields of Vector4 is always slow because it means a minimum of three shuffle ops, and Intel processors only have a single port that can service those shuffles. It ends up being faster to do the memory read and write since there are more ports to service those and they'll be in cache anyway.

With AVX, it's possible to overcome the extra shuffle overhead by using a DotProduct that operates on two pixels at once, but even that is only profitable if you have access to the full AVX instruction set. I couldn't outdo the scalar code with System.Numerics.

Just realized in looking at the code again, I was doing an extra round of shuffles just to be able to use DotProduct when it's only a gain on the Green output. And with FMA, it's not even a gain there. I simplified the code, and added an SSE version which ends up just over twice as fast as scalar. I haven't had time to draw out ascii diagrams of how that code works, but I'm happy to answer questions if you have any.

I see another quick win possibility in your SIMD Y'CbCr->RGB path if you just hoist those constant Vector<float> values that hold the coefficients out of the loop. RyuJIT has never been smart enough to do that for you, and each of those requires a load/broadcast instruction pair.

brianpopow · 2020-02-25T18:36:28Z

@brianpopow is packing useful for other codecs?

WebP lossy format is similar to JPEG as in the image is also represented with YUV and split up into macroblocks. The difference is, that the DCT coefficients are represented as byte rather than float. All the decoding is done with integer operations.

For the other formats we currently have, i also can not see this applied. Unfortunately this optimization looks rather specific to JPEG at the moment.

john-h-k · 2020-05-15T10:14:01Z

I'll give this a shot. A couple of questions:

what is the default behaviour?
For RGB32, does it just assume alpha value of 1?

JimBobSquarePants · 2020-05-15T12:06:38Z

Awesome! Not sure I understand what you mean by the first question but Rgba32 would use an alpha of 255.

john-h-k · 2020-05-15T13:00:09Z

We can define a default implementations in the base PixelOperations class, and specialize it for Rgba32 and Rgb24

@JimBobSquarePants i get the 32/24 behaviour but not sure what it means by default impl

Rgba32 would use an alpha of 255.
Silly me, was thinking of floating point max alpha 😁

JimBobSquarePants · 2020-05-15T13:06:21Z

Ah right.... So we have a "default" PixelOperations class which we then override with implementations specific to individual pixel formats.

The default version will use vectors and the various members of IPixel interface
https://github.com/SixLabors/ImageSharp/tree/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats

Whereas the explicit versions contain optimizations that take advantage of the size and layout of the individual pixel formats.
https://github.com/SixLabors/ImageSharp/tree/master/src/ImageSharp/PixelFormats/PixelImplementations/Generated

antonfirsov · 2020-05-15T16:44:36Z

In other words: the default implementation contains generic code, that utilizes the IPixel<T> interface thanks to the constraints:
https://github.com/SixLabors/ImageSharp/blob/b720219bb5d01f8443408819574444cc19baf595/src/ImageSharp/PixelFormats/PixelOperations%7BTPixel%7D.Generated.cs

However instead of extending that interface (more codebloat for low benefits), I suggest to convert TPixel from Rgb24.

br3aker · 2021-07-15T14:03:49Z

Considering implementing this. One question: how was (and is) it possible to implement this without immediate cast from float to byte?

Right now we have support for:
YCbCr (3 channels) - rgb compatible
YccK (4 channels) - rgb compatible
Cmyk (4 channels) - rgb compatible
Grayscale (1 channel) - not rgb compatible
Rgb (3 channels) - rgb compatible

We either need to allocate 2 extra buffers for the grayscale or we can cast to byte immediately after colorspace -> rgb conversion. Second option would yield 'pixels' with size of 3 bytes which ideally fit to the single color buffer of float(s).

antonfirsov mentioned this issue Feb 22, 2020

Analysis & Discussion: Jpeg & Resize processing pipelines, improvement opportunities #1064

Open

antonfirsov mentioned this issue Mar 10, 2020

Undo jpeg perf regression, add various optimizations #1143

Merged

4 tasks

JimBobSquarePants added formats:jpeg area:performance labels Apr 24, 2020

JimBobSquarePants added this to the 1.0.0 milestone Apr 24, 2020

JimBobSquarePants added the help needed label Jun 16, 2020

JimBobSquarePants modified the milestones: 1.0.0, Future Jul 16, 2020

antonfirsov mentioned this issue Nov 6, 2020

Vectorize (AVX2) JPEG Color Converter #1411

Merged

saucecontrol mentioned this issue Nov 12, 2020

Add PackFromRgbPlanes AVX2 vectorised implementation for Rgba32 and Rgba24 pixels #1242

Closed

antonfirsov mentioned this issue Dec 5, 2020

Implement PackFromRgbPlanes for Rgba32 and Rgb24 #1462

Merged

4 tasks

antonfirsov mentioned this issue Jun 21, 2021

JpegDecoder: post-process baseline spectral data per MCU-row #1597

Closed

antonfirsov mentioned this issue Oct 3, 2021

Decode Jpeg into Rgb24 when pixel type not specified #1773

Merged

4 tasks

JimBobSquarePants closed this as completed in #1773 Oct 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up JPEG Decoder color conversion #1121

Speed up JPEG Decoder color conversion #1121

antonfirsov commented Feb 22, 2020 •

edited

Loading

antonfirsov commented Feb 22, 2020 •

edited

Loading

saucecontrol commented Feb 22, 2020 •

edited

Loading

Sergio0694 commented Feb 22, 2020

saucecontrol commented Feb 22, 2020

saucecontrol commented Feb 22, 2020

antonfirsov commented Feb 22, 2020

saucecontrol commented Feb 23, 2020 •

edited

Loading

brianpopow commented Feb 25, 2020

john-h-k commented May 15, 2020

JimBobSquarePants commented May 15, 2020

john-h-k commented May 15, 2020

JimBobSquarePants commented May 15, 2020

antonfirsov commented May 15, 2020

br3aker commented Jul 15, 2021 •

edited

Loading

Speed up JPEG Decoder color conversion #1121

Speed up JPEG Decoder color conversion #1121

Comments

antonfirsov commented Feb 22, 2020 • edited Loading

Current pipeline

(case a) Y+Cb+Cr planes --> Single Rgba32 buffer

(case b) Y+Cb+Cr planes --> Single Rgb24 buffer

Optimized pipeline

(default Rgb24 case) Y+Cb+Cr planes --> Single Rgb24 buffer

(TPixel case) Y+Cb+Cr planes --> Single TPixel buffer

API proposal for packing

Note

antonfirsov commented Feb 22, 2020 • edited Loading

saucecontrol commented Feb 22, 2020 • edited Loading

Sergio0694 commented Feb 22, 2020

saucecontrol commented Feb 22, 2020

saucecontrol commented Feb 22, 2020

antonfirsov commented Feb 22, 2020

saucecontrol commented Feb 23, 2020 • edited Loading

brianpopow commented Feb 25, 2020

john-h-k commented May 15, 2020

JimBobSquarePants commented May 15, 2020

john-h-k commented May 15, 2020

JimBobSquarePants commented May 15, 2020

antonfirsov commented May 15, 2020

br3aker commented Jul 15, 2021 • edited Loading

antonfirsov commented Feb 22, 2020 •

edited

Loading

(case a) Y+Cb+Cr planes --> Single `Rgba32` buffer

(case b) Y+Cb+Cr planes --> Single `Rgb24` buffer

(default `Rgb24` case) Y+Cb+Cr planes --> Single `Rgb24` buffer

(`TPixel` case) Y+Cb+Cr planes --> Single `TPixel` buffer

antonfirsov commented Feb 22, 2020 •

edited

Loading

saucecontrol commented Feb 22, 2020 •

edited

Loading

saucecontrol commented Feb 23, 2020 •

edited

Loading

br3aker commented Jul 15, 2021 •

edited

Loading