
Analysis & Discussion: Jpeg & Resize processing pipelines, improvement opportunities #1064

antonfirsov opened this issue Dec 15, 2019 · 3 comments


antonfirsov commented Dec 15, 2019

Introduction

Apart from the API simplification, the main intent of #907 was to enable new optimizations: it's possible to eliminate a bunch of unnecessary processing steps from the most common YCbCr Jpeg thumbnail making use-case. As it turned out in #1062, simply changing the pixel type to Rgb24 is not sufficient; we need to implement the processing pipeline optimizations enabled by the .NET Core 3.0 Hardware Intrinsic API, especially the shuffle and permutation intrinsics, which allow fast conversion between different pixel type representations and component orders (e.g. Rgba32 <--> Rgb24), as well as fast conversion between Planar/SOA and Packed/AOS pixel representations. The latter is important because raw Jpeg data consists of 3 planes representing the YCbCr data, while an ImageSharp Image is always packed.
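For readers unfamiliar with the Planar/SOA vs Packed/AOS terminology, here is a minimal sketch (in Python for brevity; `interleave_planes` is a hypothetical illustration, not an ImageSharp API) of what packing three planes into one interleaved buffer means:

```python
# Toy illustration of Planar/SOA vs Packed/AOS pixel layouts.
# Three separate Y, Cb, Cr planes are interleaved into one packed buffer,
# mirroring how an ImageSharp Image stores its components contiguously per pixel.

def interleave_planes(y, cb, cr):
    """Pack three equal-length planes into one [y0, cb0, cr0, y1, ...] buffer."""
    packed = []
    for a, b, c in zip(y, cb, cr):
        packed.extend((a, b, c))
    return packed

y_plane  = [16, 32, 48]    # planar: each component lives in its own buffer
cb_plane = [128, 128, 128]
cr_plane = [130, 131, 132]

packed = interleave_planes(y_plane, cb_plane, cr_plane)
# packed == [16, 128, 130, 32, 128, 131, 48, 128, 132]
```

The shuffle/permutation intrinsics mentioned above make this kind of re-layout fast by operating on a whole SIMD register at once instead of element by element.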

This analysis:

  1. Kicks off by explaining the causes of the Rgb24 slowdown reported in #1062 (Add La16 and La32 IPixel formats)
  2. Defines Processing Pipelines as chains of Data States and Transformations
  3. Presents a deep overview of the current floating point Jpeg and Resize pipelines, showing incremental improvement opportunities. Note: the Resize pipeline is still TODO, and it will remain so for a couple of days/weeks. This should not prevent you from getting the big picture though.
  4. Roughly explains the challenges of adding integer SIMD operations to the Jpeg pipeline

Please let me know if some pieces are still hard to follow. It's worth checking out all the URL-s while reading.

TLDR
If you want to hear some good news before reading through the whole thing, jump to the Conclusion part 😄

Why is Rgb24 post processing slow in our current code?

YCbCr -> TPixel conversions, the generic case

JpegImagePostprocessor is processing the YCbCr data in two steps:

  1. Color convert AND pack the Y + Cb + Cr image planes to Vector4 RGBA buffers. The two operations are carried out together by the matching JpegColorConverter. With the YCbCr colorspace, which has only 3 components, this is already sub-optimal, since the 4th alpha component (Vector4.W) is redundant. Vector4 packing is done with non-vectorized code.
  2. Convert the Vector4 buffer to pixel buffer, using the pixel specific implementation.
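A scalar sketch of these two steps (Python for brevity; the function names are illustrative stand-ins, not the actual ImageSharp API, and the constants are the standard JFIF YCbCr -> RGB coefficients):

```python
# Scalar sketch of the two JpegImagePostprocessor steps described above:
# 1) color convert + pack Y/Cb/Cr samples into RGBA float 4-tuples,
# 2) convert those to the destination pixel format.

def ycbcr_to_rgba_vector4(y, cb, cr):
    """Step 1: JFIF YCbCr -> RGB; the 4th (alpha) slot is redundant padding."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return (r, g, b, 255.0)  # the W component carries no information for YCbCr

def vector4_to_rgba32(v):
    """Step 2: round and clamp each float component to a byte."""
    return tuple(min(255, max(0, round(c))) for c in v)

# A neutral gray sample round-trips unchanged:
vector4_to_rgba32(ycbcr_to_rgba_vector4(128.0, 128.0, 128.0))
# -> (128, 128, 128, 255)
```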

Rgba32 vs Rgb24

The difference is that PixelOperations<Rgba32>.FromVector4() does not need to do any component shuffling, only narrowing float values to byte-s, while in PixelOperations<Rgb24>.FromVector4() we first convert the float buffers to Rgba32 buffers (fast), which is followed by an Rgba32 -> Rgb24 conversion using the sub-optimal default conversion implementation. This operation:

  • Could be significantly optimized by utilizing byte shuffling SIMD intrinsics.
  • Is in fact unnecessary. By extending JpegColorConverter with a method to pack data into Vector3 buffers, we could convert Vector3 data into Rgb24 data exactly the same way we do the Vector4 -> Rgba32 conversion.
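A toy sketch of the proposed Vector3-based path next to the current detour (Python for brevity; the helper names are hypothetical, and real code operates on whole buffers with SIMD):

```python
# Sketch of the current Vector4 detour vs the proposed direct Vector3 path.
# Both produce identical Rgb24 output; the current path does strictly more work.

def current_path(rgb_floats):
    """Today: pad to RGBA, convert to Rgba32 bytes, then strip alpha again."""
    rgba = [(r, g, b, 255.0) for (r, g, b) in rgb_floats]
    rgba32 = [tuple(int(c) for c in px) for px in rgba]
    return [px[:3] for px in rgba32]  # the extra Rgba32 -> Rgb24 pass

def proposed_path(rgb_floats):
    """Proposed: pack Vector3 data and convert straight to Rgb24."""
    return [tuple(int(c) for c in px) for px in rgb_floats]

pixels = [(10.0, 20.0, 30.0), (40.0, 50.0, 60.0)]
assert current_path(pixels) == proposed_path(pixels)
```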

Definition of Processing Pipelines

Personally, my memory is terrible and I always need to reverse engineer my own code when I want to understand what's happening and make decisions. The lack of comments and confusing terminology is also misleading. To get a good overview, it's really important to step back and abstract away implementation details, by thinking about our algorithms as PIPELINES composed of Data States and Transformations, where

  • [D] Data States (nodes) are representations of pixel data buffers in a specific form
  • (T) Transformations (edges) are specific SIMD or scalar implementations of algorithms

This representation is only good for analyzing data flow for a specific configuration, e.g. a well-defined input image format + decoder configuration + output pixel type. To visualize the junctions, we need DAG-s 🤓.

Current floating point YCbCr Jpeg Color Processing & Resize pipelines, improvement opportunities

Presumptions:

  • The executing runtime is > netcoreapp2.1 (enables Vector.Widen)
  • The executing CPU supports the AVX2 instruction set, implying that Vector<T>-s are in fact AVX2 registers and Vector<T> intrinsics are JIT-ed to AVX2 instructions
  • Vector4 operations are JIT-ed to SSE2 instructions

(I.) Converting raw jpeg spectral data to YCbCr planes

Converting raw jpeg spectral data to YCbCr planes, done by CopyBlocksToColorBuffer
[D] 3 planes of quantized spectral Int16 jpeg components (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) AVX2 Int16 -> Int32 widening and Int32 -> float conversion, both using Vector<T>, implemented in Block8x8F.LoadFrom(Block8x8)
[D] 3 planes of quantized spectral float jpeg components (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) Dequantization by SSE2 multiplication: Block8x8F.MultiplyInplace(DequantizationTable)
[D] 3 planes of DEquantized spectral float jpeg components (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) SSE2 floating point IDCT
[D] 3 Planes of float jpeg color channels (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) AVX2 normalization and rounding using Vector<T>. Rounding is needed for better libjpeg compatibility
[D] 3 Planes of SUBSAMPLED float jpeg color channels normalized to 0-255 (3 x Buffer2D<Block8x8>, Y+Cb+Cr)
(T) Chroma supersampling. No SIMD, fully scalar code, full of ugly optimizations to make it at least cache friendly. Done by Block8x8.CopyTo() (a super misleading name!)
[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
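For reference, the dequantization and IDCT steps above boil down to the following scalar math (a Python sketch of the textbook algorithms; ImageSharp's Block8x8F implementations are SIMD-vectorized equivalents of this, and real decoders use fast IDCT factorizations rather than the naive sum):

```python
import math

# Scalar reference math for two steps of pipeline (I.):
# dequantize an 8x8 block of quantized coefficients, then apply a 2D inverse DCT.

N = 8

def dequantize(block, qtable):
    """Analog of Block8x8F.MultiplyInplace: element-wise multiply by the table."""
    return [[block[v][u] * qtable[v][u] for u in range(N)] for v in range(N)]

def idct_2d(coeffs):
    """Textbook O(N^4) 2D inverse DCT-II (for clarity, not speed)."""
    def c(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            s = 0.0
            for v in range(N):
                for u in range(N):
                    s += (c(u) * c(v) * coeffs[v][u]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = s / 4.0
    return out
```

A useful sanity check: a block containing only a DC coefficient decodes to a constant 8x8 tile, which is why DC-only scaling shortcuts exist in fast decoders.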

(II. a) Converting the Y+Cb+Cr planes to an Rgba32 buffer

Y+Cb+Cr planes -> Rgba32 buffer, done by ConvertColorsInto
[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer. In the Rgba32 case, the input buffer can be handled as a homogeneous float buffer, where all individual float values should be converted to byte-s. The conversion is implemented in BulkConvertNormalizedFloatToByteClampOverflows, utilizing AVX2 conversion and narrowing operations through Vector<T>
[D] The result image as an Rgba32 buffer
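The final float -> byte step can be sketched in scalar form as follows (Python; `convert_normalized_floats_to_bytes` is an illustrative stand-in for what BulkConvertNormalizedFloatToByteClampOverflows does with AVX2 convert + narrow over Vector<T>):

```python
# Scalar analog of the "normalized float -> byte, clamping overflows" step:
# scale [0, 1] floats to [0, 255] and clamp anything out of range.

def convert_normalized_floats_to_bytes(values):
    out = bytearray()
    for v in values:
        scaled = round(v * 255.0)
        out.append(min(255, max(0, scaled)))  # clamp overflows/underflows
    return bytes(out)

convert_normalized_floats_to_bytes([0.0, 1.0, 1.5, -0.2])
# clamps 1.5 -> 255 and -0.2 -> 0
```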

(II. b) Converting the Y+Cb+Cr planes to an Rgb24 buffer, current sub-optimal pipeline

Y+Cb+Cr planes -> Rgb24 buffer, done by ConvertColorsInto
[D] 3 Planes of W*H sized float jpeg color channels normalized to 0-255 (3 x Buffer2D<float>, Y+Cb+Cr)
(T) Color convert and pack into a single Vector4 buffer
[D] Floating point RGBA data as Memory<Vector4>
(T) Convert the Vector4 buffer to an Rgba32 buffer using BulkConvertNormalizedFloatToByteClampOverflows, which utilizes AVX2 conversion and narrowing operations through Vector<T>
[D] Temporary Rgba32 buffer
(T) PixelOperations<Rgb24>.FromRgba32() (sub-optimal, extra transformation!)
[D] The result image as an Rgb24 buffer
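The extra transformation flagged above amounts to dropping every 4th (alpha) byte. This is exactly the kind of re-layout a SIMD byte-shuffle intrinsic can do one register at a time; a scalar Python sketch of the operation (illustrative, not the actual PixelOperations code):

```python
# Sketch of the extra Rgba32 -> Rgb24 pass in the current pipeline:
# every 4th (alpha) byte of the interleaved buffer is discarded.

def rgba32_to_rgb24(rgba: bytes) -> bytes:
    assert len(rgba) % 4 == 0, "input must be whole RGBA pixels"
    return bytes(b for i, b in enumerate(rgba) if i % 4 != 3)

rgba32_to_rgb24(bytes([10, 20, 30, 255, 40, 50, 60, 255]))
# -> bytes([10, 20, 30, 40, 50, 60])
```

With Vector3 packing in the converter, this pass disappears entirely rather than merely getting faster.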

(II. b++) Converting the Y+Cb+Cr planes to an Rgb24 buffer, IMPROVEMENT PROPOSAL

See #1121

(III. a) Resize Image<Rgba32>, current pipeline

TODO

(III. b) Resize Image<Rgb24>, current pipeline

TODO.
Without any change, the current code should run faster than for Image<Rgba32>.

(III. b++) Resize Image<Rgb24>, IMPROVEMENT PROPOSAL

TODO

Integer-based SIMD pipelines

Although the Hardware Intrinsic API removes all theoretical boundaries to achieving a 1:1 match with other high-performance imaging libraries, for both the Jpeg decoder and Resize, by utilizing AVX2 and SSE2 integer algorithms, there is a big practical challenge: it's very hard to introduce these improvements in an iterative manner.

It's not possible to exchange the elements of the Jpeg pipeline at arbitrary points, because that would lead to the insertion of extra float <-> Int16/32 conversions. To overcome this, we should start introducing integer transformations and data states at the beginning and/or at the end of the pipeline. This could be done by replacing the transformations and data states in subsequent PR-s, moving the Int16 -> float conversion towards the bottom (when starting from the beginning) and the float -> byte conversion towards the top (when starting from the end). E.g.:

  • At the beginning of the pipeline first replace dequantization, then IDCT, then normalization etc..
  • At the end of the pipeline, we shall implement a full integer YCbCr24 -> Rgb24 SIMD conversion first
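As a sketch of what a full-integer YCbCr -> Rgb24 conversion could look like, here is a scalar fixed-point version in Python (16.16 fixed-point constants in the style libjpeg-turbo uses; the names and exact rounding here are my assumptions for illustration, not the proposed implementation):

```python
# Integer-only YCbCr -> RGB using 16.16 fixed-point arithmetic.
# The constants are the JFIF conversion coefficients scaled by 2^16.

FIX_1_402 = 91881    # round(1.402    * 65536)
FIX_0_344 = 22554    # round(0.344136 * 65536)
FIX_0_714 = 46802    # round(0.714136 * 65536)
FIX_1_772 = 116130   # round(1.772    * 65536)
HALF = 1 << 15       # rounding bias for the >> 16

def clamp(v):
    return 0 if v < 0 else 255 if v > 255 else v

def ycbcr_to_rgb24_int(y, cb, cr):
    """All-integer color conversion: multiplies, adds, and shifts only."""
    cb -= 128
    cr -= 128
    r = y + ((FIX_1_402 * cr + HALF) >> 16)
    g = y - ((FIX_0_344 * cb + FIX_0_714 * cr + HALF) >> 16)
    b = y + ((FIX_1_772 * cb + HALF) >> 16)
    return clamp(r), clamp(g), clamp(b)
```

In a SIMD version, the multiplies and shifts map directly onto AVX2/SSE2 integer instructions, which is why the end of the pipeline is a natural place to start.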

Conclusion

If we aim for low hanging fruits, I would start by implementing (II. b++) and (III. b++). After that, we can continue by introducing integer SIMD operations starting at the beginning or at the end of the Jpeg pipeline.

I would also suggest to keep the current floating point pipeline in the codebase as is, to avoid perf regressions for pre-3.0 users. I believe those platforms will be still relevant for many customers for a couple of other years.

@JimBobSquarePants

Thanks for taking the time writing all of the above, it's very informative. 🤯

Focusing on Jpeg decoding optimization for now I would advocate for being as radical as possible.

I propose redefining the entire JpegPostProcessor pipeline as three separate implementations based on an integer pipeline. This includes Dequantization, IDCT, Subsampling, and Colorspace transforms.

  1. Scalar. (Old framework, odd devices, edge cases)
  2. Limited Intrinsics. (NET Core 2.1 with good migration path)
  3. Full Intrinsics (NET Core 3+, the future)

I would focus on 1 and 3 as a priority and refactor our current floating point implementation to fit via scaling at the beginning as you describe.

> I would also suggest to keep the current floating point pipeline in the codebase as is, to avoid perf regressions for pre-3.0 users. I believe those platforms will be still relevant for many customers for a couple of other years.

The upgrade path from NET Core 2 to 3 is surprisingly simple for the most part. I recently ported 5 quite complex libraries in a matter of days with very little refactoring required, so while 2.1 has LTS support until August 2021, I believe many customers will have moved on by then.

The benefits I see from cleanly sliced implementations are the following:

  • You get to write each implementation without compromise. Scalar does not suffer from the possible performance deficit of floating point, and the full intrinsic pipeline can be optimized to maximum capacity.
  • Cleaner code that is easier to understand, debug, and document.

I appreciate that this is a lot of work but together I think we can do it. I'm also thinking V1 not RC as a milestone since all the APIs are internal.


antonfirsov commented Dec 15, 2019

> Focusing on Jpeg decoding optimization for now I would advocate for being as radical as possible.
>
> I propose redefining the entire JpegPostProcessor pipeline as three separate implementations based on an integer pipeline. This includes Dequantization, IDCT, Subsampling, and Colorspace transforms.

Now the bitter pill: the total amount of work for replacing the entire pipeline is huge. Think of at least 2-3 man-weeks of full-time work, assuming that we know exactly what we are doing (I wouldn't dare to say so about myself). I make these estimations based on my own experience and by lurking in @dcommander's comments in libjpeg-turbo, especially in issues marked with "funding-needed".

> I'm also thinking V1 not RC as a milestone since all the APIs are internal.

Because of the amount of work, I don't think it makes sense to talk about a full scale rewrite within the V1 timeframe. We want to get the library released before 2021. Also: everything being internal doesn't mean that a significant rewrite is an option during a pre-release bugfix cycle. (Regressions, behavioral breaking changes.)

There are other problems with aiming for a full-integer pipeline everywhere:

> 1. Scalar. (Old framework, odd devices, edge cases).

Expect a regression of an order of magnitude. This is basically the stuff we started with in 2016. And it's a big amount of extra work to do it properly.

> 2. Limited Intrinsics. (NET Core 2.1 with good migration path)

This is not possible with integers because of missing intrinsics as fundamental as division.

Even if we had started ImageSharp in 2019, I would say that FP pipelines are valuable and worth implementing:

  • Performance does not suck on stuff != netcoreapp3.*
  • Individual transformation code is straightforward and readable: it shows the calculation you are actually doing. No arbitrary bit magic. It can be used as a reference to understand what we need to calculate in other pipelines.
  • Other libs also have junctions in their pipeline because of historical reasons and platform-specific optimizations. For me this is just a small part of the overall complexity.

Summary/TLDR
My opinion is exactly the opposite: we should be incremental and conservative when it comes to refactoring. The low hanging fruits (II. b++ and III. b++) would bring visible and significant improvements. And by "low hanging" I mean: they can be implemented in a couple of man-days instead of weeks. There is no silver bullet for removing complexity in this stuff.

While doing the optimizations, we can improve the understandability of the code by cleaning it up, adding comments, and introducing simplifications in the pipeline where the perf impact is limited. E.g. at a certain point, we can consider dropping all the Vector<T> code, since Vector4 is fast enough and way easier to read. (=> Eliminates a huge part of the #if-s and other complex pipeline junctions.)

EDIT
Typos, small additions.

@JimBobSquarePants

> The low hanging fruits (II. b++ and III. b++) would bring visible and significant improvements. And by "low hanging" I mean: can be implemented in a couple of man-days, instead of weeks. There is no silver bullet for removing complexity in this stuff.

If you truly believe this then I'm with you all the way and we'll do it your way. You are, by far and away, the performance expert here.
