[WIP] [PREVIEW] use ByteBuddy code generation to write proto to parquet faster #3121

gowa · 2025-01-12T22:54:27Z

Note: this pull request is not ready to merge. It is a request to check whether the repository owners are interested in using code generation to solve some performance issues caused by protobuf reflection when writing or reading parquet files. I decided to verify the concept at a rather early stage because the change requires significant effort to implement. Should the approach, and a new optional dependency on ByteBuddy, be found satisfactory and acceptable for inclusion in parquet-java, I will attempt to properly finish first the 'write' part and then the 'read' part (in terms of code quality and tests). Any feedback is appreciated.

Rationale for this change

We read and write a lot of parquet data defined by protobuf schemas from Java, and we have seen that this can be done faster than what is offered out of the box now.
This change improves proto-to-parquet file writing performance by means of code generation (by around 50% in my synthetic tests with SNAPPY compression, especially when structures have a lot of primitive-type fields).

What changes are included in this PR?

an extension point in MessageWriter that redirects writing to a class generated on the fly, which calls the getters of protobuf-generated classes directly instead of going through Protobuf Java reflection methods.
a separate class that holds all the code generation logic.
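To illustrate the idea behind the first item, here is a minimal, self-contained sketch. It is not the code in this PR and uses no protobuf or ByteBuddy classes: `Sample`, `Sink`, `ReflectiveWriter`, and `DirectWriter` are hypothetical names, and `java.lang.reflect` is used only as an analogy for a generic accessor that returns `Object` (as protobuf's `Message.getField(FieldDescriptor)` does), which forces primitives to be boxed. `DirectWriter` is hand-written here to show roughly what a generated class could look like.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class DirectGetterSketch {

    /** Hypothetical stand-in for a protobuf-generated message class. */
    public static final class Sample {
        private final int id;
        private final long count;
        public Sample(int id, long count) { this.id = id; this.count = count; }
        public int getId() { return id; }
        public long getCount() { return count; }
    }

    /** Hypothetical sink standing in for the consumer that receives column values. */
    interface Sink {
        void addInteger(int value);
        void addLong(long value);
    }

    interface Writer {
        void write(Sample message, Sink sink);
    }

    /** Generic path: every value comes back as Object, so primitives are boxed and then unboxed. */
    static final class ReflectiveWriter implements Writer {
        private final Method getId;
        private final Method getCount;

        ReflectiveWriter() {
            try {
                getId = Sample.class.getMethod("getId");
                getCount = Sample.class.getMethod("getCount");
            } catch (NoSuchMethodException e) {
                throw new IllegalStateException(e);
            }
        }

        @Override
        public void write(Sample message, Sink sink) {
            try {
                Object id = getId.invoke(message);       // boxed Integer
                Object count = getCount.invoke(message); // boxed Long
                sink.addInteger((Integer) id);
                sink.addLong((Long) count);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Direct path: plain getter calls, no boxing. A generated class would aim for this shape. */
    static final class DirectWriter implements Writer {
        @Override
        public void write(Sample message, Sink sink) {
            sink.addInteger(message.getId());
            sink.addLong(message.getCount());
        }
    }

    /** Runs a writer against a recording sink and returns what was written. */
    static List<String> run(Writer writer, Sample message) {
        List<String> events = new ArrayList<>();
        writer.write(message, new Sink() {
            @Override public void addInteger(int value) { events.add("int:" + value); }
            @Override public void addLong(long value) { events.add("long:" + value); }
        });
        return events;
    }

    public static void main(String[] args) {
        Sample sample = new Sample(7, 42L);
        System.out.println(run(new ReflectiveWriter(), sample));
        System.out.println(run(new DirectWriter(), sample));
    }
}
```

Both writers produce the same values; the difference is only in how they reach the getters, which is where the per-field overhead comes from.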

Are these changes tested?

current unit tests work fine.

Are there any user-facing changes?

a configuration to disable code generation logic.

wgtmac · 2025-01-14T02:03:07Z

Thanks for your interest in contributing this! This seems to be a large feature and the performance gain is promising! However, I'm afraid that this PR may not get prompt review due to the lack of active parquet-protobuf maintainers. I do not have any knowledge of ByteBuddy, so it might take a long time to wrap it up. Is it possible to make it pluggable so that the large portion of codegen logic does not have to exist in the parquet-java repo?

cc @gszadovszky @julienledem if you know someone who can help review this.

gszadovszky · 2025-01-14T08:00:42Z

The noted performance gain is promising indeed. However, it would be nice to see actual numbers for different scenarios (flat columns, general nested columns, deeply nested columns) of read/write. You might even implement your performance tests in the module parquet-benchmarks.

Unfortunately, I'm not an expert in parquet-protobuf either, not to mention ByteBuddy. For the final PR review, it would be a great help to have someone with ByteBuddy experience, even if they are not a Parquet committer.

gowa · 2025-01-23T02:48:35Z

Hi @gszadovszky, @wgtmac. Thank you for your feedback.
Yes, I see that it is a big feature and the implementation is far from being a simple fix. And maybe it should be a pluggable thing instead of being a first-class resident in the code. However, if you feel the changes can be incorporated into the main codebase, I could try to find someone to review the ByteBuddy part, and I could implement the reader part as well.

As for benchmarks: I've implemented some and committed them. I attempted to replicate the original org.apache.parquet.benchmarks.WriteBenchmarks for proto messages in org.apache.parquet.benchmarks.ProtoWriteBenchmarks.

The results are as follows: the larger the number of fields (especially primitives), the bigger the gain, as this fix only optimizes getters and boxing/unboxing.

E.g. for 100 int32 fields:

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.171 ± 1.206   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.075 ± 1.258   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.304 ± 1.497   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.235 ± 0.617   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.450 ± 3.429   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   5.947 ± 0.430   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.433 ± 3.879   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.523 ± 2.831   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test100Int32    ss    5  13.288 ± 0.429   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test100Int32    ss    5   6.333 ± 0.444   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test100Int32    ss    5  13.197 ± 1.396   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test100Int32    ss    5   6.855 ± 2.689   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test100Int32    ss    5  13.473 ± 1.930   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test100Int32    ss    5   6.006 ± 0.285   s/op

For 30 int32 fields:

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt  Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF   Test30Int32    ss    5  3.421 ± 1.303   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.410 ± 0.357   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF   Test30Int32    ss    5  3.396 ± 0.708   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.362 ± 0.174   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF   Test30Int32    ss    5  3.250 ± 0.721   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.310 ± 0.168   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF   Test30Int32    ss    5  3.447 ± 0.884   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.416 ± 0.387   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF   Test30Int32    ss    5  3.156 ± 0.276   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL   Test30Int32    ss    5  2.514 ± 0.687   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF   Test30Int32    ss    5  3.398 ± 0.853   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL   Test30Int32    ss    5  2.501 ± 0.323   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF   Test30Int32    ss    5  3.644 ± 3.423   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL   Test30Int32    ss    5  2.384 ± 0.203   s/op

For 30 strings ("fieldXX:XX"):

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test30String    ss    5   9.426 ± 3.621   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.257 ± 1.113   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test30String    ss    5   9.848 ± 1.141   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.302 ± 1.910   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test30String    ss    5  10.216 ± 1.843   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.173 ± 1.419   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test30String    ss    5   9.940 ± 1.680   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.242 ± 1.270   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test30String    ss    5   9.833 ± 1.010   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test30String    ss    5   8.247 ± 1.284   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test30String    ss    5   9.638 ± 0.502   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test30String    ss    5   7.935 ± 0.889   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test30String    ss    5   9.968 ± 1.651   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test30String    ss    5   8.356 ± 1.319   s/op

For 5-7 fields the gain is negligible.
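As a reading aid for the tables above, the scores can be turned into speedup ratios. The sketch below uses the write1MRowsBS256MPS4MUncompressed rows and gives roughly 2.2x for 100 int32 fields, 1.4x for 30 int32 fields, and 1.1x for 30 strings. The class and method names are hypothetical, and with only 5 samples per row and wide error bars (the string rows overlap), these ratios are indicative rather than precise.

```java
final class SpeedupSketch {

    /** Baseline time divided by codegen time; values above 1.0 mean codegen is faster. */
    static double speedup(double baselineSecondsPerOp, double codegenSecondsPerOp) {
        return baselineSecondsPerOp / codegenSecondsPerOp;
    }

    /** Share of the baseline time that codegen saves, in percent. */
    static double timeSavedPercent(double baselineSecondsPerOp, double codegenSecondsPerOp) {
        return 100.0 * (baselineSecondsPerOp - codegenSecondsPerOp) / baselineSecondsPerOp;
    }

    public static void main(String[] args) {
        // Scores (s/op) copied from the write1MRowsBS256MPS4MUncompressed rows: {OFF, REQUIRED_ALL}.
        String[] labels = {"Test100Int32", "Test30Int32", "Test30String"};
        double[][] scores = {{13.171, 6.075}, {3.421, 2.410}, {9.426, 8.257}};
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%s: speedup %.2fx, time saved %.1f%%%n",
                    labels[i],
                    speedup(scores[i][0], scores[i][1]),
                    timeSavedPercent(scores[i][0], scores[i][1]));
        }
    }
}
```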

gszadovszky · 2025-01-23T08:55:04Z

Thank you for the results, @gowa. I'm convinced.

Do I understand your code correctly that if CodeGenMode.OFF is used, it is the original behavior without ByteBuddy? If that's the case, we might choose to have it as the default for the upcoming release to give users time to test it. Then we can change the default in the next release.

Another question is the dependency. What transitive dependencies would ByteBuddy bring into parquet-protobuf? Does it have transitive dependencies that may conflict with parquet-java itself or with the ecosystems that use parquet-protobuf?

gowa · 2025-01-23T10:37:19Z

Hi @gszadovszky!
Yes, OFF gives the old behavior. The whole code generation feature is currently implemented as an interceptor (a Predicate). When the mode is OFF, the predicate is just x -> false, which adds little overhead to the original implementation and does not even try to generate classes.

I will check the dependencies in detail, but I've declared ByteBuddy as an optional one. If it is not on the classpath, codegen won't happen in any mode and writing safely falls back to the original implementation, except in REQUIRED_ALL mode, where we insist on code generation.
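The mode handling described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: only OFF and REQUIRED_ALL are taken from the benchmark output, while the `SUPPORTED` mode, the class and method names, the meaning of the predicate's return value (true when the generated writer handled the message, false to fall back), and throwing an exception for REQUIRED_ALL without ByteBuddy are all assumptions. The availability check probes for `net.bytebuddy.ByteBuddy` without initializing it.

```java
import java.util.function.Predicate;

final class CodeGenModeSketch {

    /** OFF and REQUIRED_ALL appear in the benchmarks; SUPPORTED is a hypothetical "use codegen if possible" mode. */
    enum CodeGenMode { OFF, SUPPORTED, REQUIRED_ALL }

    /** Returns true if the named class can be loaded, without running its static initializers. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodeGenModeSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    static boolean isByteBuddyAvailable() {
        return isOnClasspath("net.bytebuddy.ByteBuddy");
    }

    /**
     * Picks the interceptor for a mode. The returned predicate is assumed to return true
     * when it wrote the message itself and false when the original writer should run.
     */
    static Predicate<Object> interceptorFor(
            CodeGenMode mode, boolean byteBuddyAvailable, Predicate<Object> generatedWriter) {
        switch (mode) {
            case OFF:
                // Never intercept: the original implementation always runs.
                return x -> false;
            case REQUIRED_ALL:
                if (!byteBuddyAvailable) {
                    // Assumption: insisting on codegen means failing fast when it is impossible.
                    throw new IllegalStateException("REQUIRED_ALL needs ByteBuddy on the classpath");
                }
                return generatedWriter;
            default:
                // Optional dependency missing: silently fall back to the original implementation.
                return byteBuddyAvailable ? generatedWriter : x -> false;
        }
    }

    public static void main(String[] args) {
        Predicate<Object> generated = x -> true; // placeholder for a generated writer
        System.out.println("ByteBuddy available: " + isByteBuddyAvailable());
        System.out.println(interceptorFor(CodeGenMode.OFF, true, generated).test("message"));
        System.out.println(interceptorFor(CodeGenMode.SUPPORTED, false, generated).test("message"));
    }
}
```

Passing availability in as a parameter keeps the selection logic testable without changing the classpath.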

gszadovszky · 2025-01-23T11:14:36Z

Thanks, @gowa, for the clarification.

It would be nice to add a description for users in the README: if one would like to use ByteBuddy, what should they do and what options do they have? How will it behave by default (if ByteBuddy is not on the classpath)?

Overall, I'm good with this approach and your PR. If you can find somebody who can do a deeper review of the ByteBuddy-related code, I'll be more than happy to approve it.

gowa · 2025-01-23T11:55:11Z

@gszadovszky, all right, thanks. There is still a lot to do, like implementing the read part. At least I now know that this can all go live. I will surely update the docs, etc. to make it a good-quality commit.

gowa added 2 commits January 10, 2025 02:55

use ByteBuddy code generation to write proto to parquet faster

bd3724e

run unit tests in different codegen modes

ad6822c

gowa force-pushed the parquet-protobuf-codegen-v1 branch 2 times, most recently from 1d14b84 to 0461411 Compare January 22, 2025 23:08

fix CICD errors: making some methods public to make code generation compatible with java8 without hacks

1d780f6

gowa force-pushed the parquet-protobuf-codegen-v1 branch from 0461411 to 1d780f6 Compare January 22, 2025 23:09

benchmark ProtoWriteSupport with ByteBuddy

5b3b747
