[WIP] [PREVIEW] use ByteBuddy code generation to write proto to parquet faster #3121

Open
wants to merge 4 commits into
base: master
from



@gowa gowa commented Jan 12, 2025

Note: this pull request is not ready to merge. It is a request to check whether the repository owners are interested in the concept of using code generation to solve some performance issues associated with protobuf reflection when writing or reading Parquet files. I decided to verify the concept at a rather early stage because implementing the change requires significant effort. Should the approach, and a new optional dependency on ByteBuddy, be found satisfactory and potentially acceptable for inclusion in parquet-java, I will attempt to properly finish first the 'write' part and then the 'read' part (in terms of code quality and tests). Any feedback is therefore appreciated.

Rationale for this change

We read and write a lot of Parquet data defined by protobuf schemas from Java. We have seen that this can be done faster than what is offered out of the box today.
This change improves proto-to-parquet write performance by means of code generation (by around 50% with SNAPPY compression in my synthetic tests, especially when structures have many primitive-type fields).

What changes are included in this PR?

  1. An extension point in MessageWriter that redirects writing to a class generated on the fly, which calls the getters of the protobuf-generated classes directly rather than going through Protobuf Java reflection methods.
  2. A separate class that contains all of the code generation logic.
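
To illustrate the difference between the two write paths, here is a minimal sketch that uses neither protobuf nor ByteBuddy APIs. `SampleMessage`, `IntSink`, and `WriterSketch` are hypothetical stand-ins, not classes from this PR: the generic path looks fields up by name and receives boxed `Object` values, much as a reflection-style accessor would, while the direct path calls typed getters, which is roughly the shape a generated writer class could take.

```java
// Hypothetical stand-in for a protobuf-generated message class.
final class SampleMessage {
    private final int id;
    private final int count;

    SampleMessage(int id, int count) {
        this.id = id;
        this.count = count;
    }

    int getId() { return id; }
    int getCount() { return count; }

    // Reflection-style access: the field is chosen by name and returned boxed.
    Object getField(String name) {
        switch (name) {
            case "id": return id;       // autoboxed to Integer
            case "count": return count; // autoboxed to Integer
            default: throw new IllegalArgumentException("unknown field: " + name);
        }
    }
}

// Hypothetical sink standing in for a Parquet record consumer.
final class IntSink {
    long sum;
    int calls;

    void addInteger(int value) {
        sum += value;
        calls++;
    }
}

final class WriterSketch {
    // Generic path: iterate over field names, then cast and unbox each value.
    static void writeGeneric(SampleMessage m, String[] fields, IntSink sink) {
        for (String f : fields) {
            sink.addInteger((Integer) m.getField(f));
        }
    }

    // Direct path: typed getter calls with no lookup and no boxing.
    static void writeDirect(SampleMessage m, IntSink sink) {
        sink.addInteger(m.getId());
        sink.addInteger(m.getCount());
    }

    public static void main(String[] args) {
        SampleMessage m = new SampleMessage(7, 35);
        IntSink generic = new IntSink();
        IntSink direct = new IntSink();
        writeGeneric(m, new String[] {"id", "count"}, generic);
        writeDirect(m, direct);
        System.out.println("generic sum = " + generic.sum + ", direct sum = " + direct.sum);
    }
}
```

Both paths write the same values; the direct one simply avoids the per-field lookup, cast, and unboxing, which is where a message with many primitive fields would be expected to benefit most.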

Are these changes tested?

The current unit tests pass.

Are there any user-facing changes?

A configuration option to disable the code generation logic.

@wgtmac
Member

wgtmac commented Jan 14, 2025

Thanks for your interest in contributing this! This seems to be a large feature and the performance gain is promising! However, I'm afraid that this PR may not get a prompt review due to the lack of active parquet-protobuf maintainers. I do not have any knowledge of ByteBuddy, so it might take a long time to wrap it up. Is it possible to make it pluggable so that the bulk of the codegen logic does not have to live in the parquet-java repo?

cc @gszadovszky @julienledem in case you know someone who can help review this.

@gszadovszky
Contributor

The noted performance gain is promising indeed. However, it would be nice to see actual numbers for different scenarios (flat columns, general nested columns, deeply nested columns) of read/write. You might even implement your performance tests in the module parquet-benchmarks.

Unfortunately, I'm not an expert in parquet-protobuf either, let alone ByteBuddy. For the final PR review, it would be a great help to have someone who has some experience with ByteBuddy, even if they are not a Parquet committer.

@gowa gowa force-pushed the parquet-protobuf-codegen-v1 branch 2 times, most recently from 1d14b84 to 0461411 Compare January 22, 2025 23:08
@gowa gowa force-pushed the parquet-protobuf-codegen-v1 branch from 0461411 to 1d780f6 Compare January 22, 2025 23:09
@gowa
Author

gowa commented Jan 23, 2025

Hi @gszadovszky, @wgtmac. Thank you for your feedback.
Yes, I see that it is a big feature and the implementation is far from a simple fix. Maybe it should be pluggable rather than a first-class resident of the codebase. However, if you feel the changes can be incorporated into the main codebase, I could try to find someone to review the ByteBuddy part, and I could implement the reader part as well.

As for benchmarks, I've implemented and committed some. I attempted to replicate the original org.apache.parquet.benchmarks.WriteBenchmarks with protobuf messages in org.apache.parquet.benchmarks.ProtoWriteBenchmarks.

The results are as follows: the larger the number of fields (especially primitives), the bigger the gain, as this change only optimizes getters and boxing/unboxing.

E.g. for 100 int32 fields:

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.171 ± 1.206   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.075 ± 1.258   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.304 ± 1.497   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.235 ± 0.617   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test100Int32    ss    5  13.450 ± 3.429   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   5.947 ± 0.430   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test100Int32    ss    5  13.433 ± 3.879   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test100Int32    ss    5   6.523 ± 2.831   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test100Int32    ss    5  13.288 ± 0.429   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test100Int32    ss    5   6.333 ± 0.444   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test100Int32    ss    5  13.197 ± 1.396   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test100Int32    ss    5   6.855 ± 2.689   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test100Int32    ss    5  13.473 ± 1.930   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test100Int32    ss    5   6.006 ± 0.285   s/op

For 30 int32 fields:

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt  Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF   Test30Int32    ss    5  3.421 ± 1.303   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.410 ± 0.357   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF   Test30Int32    ss    5  3.396 ± 0.708   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.362 ± 0.174   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF   Test30Int32    ss    5  3.250 ± 0.721   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.310 ± 0.168   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF   Test30Int32    ss    5  3.447 ± 0.884   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL   Test30Int32    ss    5  2.416 ± 0.387   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF   Test30Int32    ss    5  3.156 ± 0.276   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL   Test30Int32    ss    5  2.514 ± 0.687   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF   Test30Int32    ss    5  3.398 ± 0.853   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL   Test30Int32    ss    5  2.501 ± 0.323   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF   Test30Int32    ss    5  3.644 ± 3.423   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL   Test30Int32    ss    5  2.384 ± 0.203   s/op

For 30 strings ("fieldXX:XX"):

Benchmark                                                            (codegenMode)  (protoClass)  Mode  Cnt   Score   Error  Units
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                         OFF  Test30String    ss    5   9.426 ± 3.621   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.257 ± 1.113   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                         OFF  Test30String    ss    5   9.848 ± 1.141   s/op
ProtoWriteBenchmarks.write1MRowsBS256MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.302 ± 1.910   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                         OFF  Test30String    ss    5  10.216 ± 1.843   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS4MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.173 ± 1.419   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                         OFF  Test30String    ss    5   9.940 ± 1.680   s/op
ProtoWriteBenchmarks.write1MRowsBS512MPS8MUncompressed                REQUIRED_ALL  Test30String    ss    5   8.242 ± 1.270   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP                    OFF  Test30String    ss    5   9.833 ± 1.010   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeGZIP           REQUIRED_ALL  Test30String    ss    5   8.247 ± 1.284   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY                  OFF  Test30String    ss    5   9.638 ± 0.502   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeSNAPPY         REQUIRED_ALL  Test30String    ss    5   7.935 ± 0.889   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed            OFF  Test30String    ss    5   9.968 ± 1.651   s/op
ProtoWriteBenchmarks.write1MRowsDefaultBlockAndPageSizeUncompressed   REQUIRED_ALL  Test30String    ss    5   8.356 ± 1.319   s/op

For 5-7 fields the gain is negligible.
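
As a quick way to read the tables, the sketch below computes the ratio of the OFF score to the REQUIRED_ALL score for the first row pair of each table (write1MRowsBS256MPS4MUncompressed). The class and method names are hypothetical; only the scores are taken from the results above. The error margins are ignored here, so the ratios are rough indications only.

```java
final class SpeedupSketch {
    // Ratio of baseline time to optimized time; a value above 1 means
    // the optimized run took less time per operation.
    static double speedup(double baselineSecondsPerOp, double optimizedSecondsPerOp) {
        return baselineSecondsPerOp / optimizedSecondsPerOp;
    }

    public static void main(String[] args) {
        // Scores in s/op, copied from the write1MRowsBS256MPS4MUncompressed rows.
        String[] protoClasses = {"Test100Int32", "Test30Int32", "Test30String"};
        double[] off = {13.171, 3.421, 9.426};
        double[] requiredAll = {6.075, 2.410, 8.257};

        for (int i = 0; i < protoClasses.length; i++) {
            System.out.printf("%s: %.2fx%n", protoClasses[i], speedup(off[i], requiredAll[i]));
        }
    }
}
```

On these rows the ratio comes out at roughly 2.2x for 100 int32 fields, 1.4x for 30 int32 fields, and 1.1x for 30 string fields, which matches the observation that the gain grows with the number of primitive fields.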

@gszadovszky
Contributor

Thank you for the results, @gowa. I'm convinced.

Do I understand your code correctly that if CodeGenMode.OFF is used, the behavior is the original one, without ByteBuddy? If that's the case, we might choose to make it the default for the upcoming release to give users time to test it. Then we can change the default in the next release.

Another question is the dependency. What transitive dependencies would ByteBuddy bring into parquet-protobuf? Does it have transitive dependencies that may conflict with parquet-java itself or with the ecosystems that use parquet-protobuf?

@gowa
Author

gowa commented Jan 23, 2025

Hi @gszadovszky!
Yes, OFF gives the old behavior. The whole code generation feature is currently implemented as an interceptor (a Predicate). When the mode is OFF, the predicate is just x -> false, which adds little overhead to the original implementation and does not even attempt to generate classes.

I will check the dependencies in detail, but I've declared ByteBuddy as an optional one. If it is not on the classpath, code generation won't happen in any mode and the writer safely falls back to the original implementation, except in REQUIRED_ALL mode, where we insist on code generation.
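
The gating described in this comment could look roughly like the sketch below. This is an assumption-laden illustration rather than the code in the PR: `CodeGenModeSketch`, `CodeGenGate`, and the `BEST_EFFORT` mode are hypothetical names (only OFF and REQUIRED_ALL appear in this thread), and the classpath probe simply tries to load `net.bytebuddy.ByteBuddy` without initializing it.

```java
import java.util.function.Predicate;

// Hypothetical mode enum; only OFF and REQUIRED_ALL are named in this thread.
enum CodeGenModeSketch { OFF, BEST_EFFORT, REQUIRED_ALL }

final class CodeGenGate {
    // Returns true if the named class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, CodeGenGate.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Builds the interceptor predicate deciding whether a message class
    // should be written through generated code.
    static Predicate<Class<?>> interceptorFor(CodeGenModeSketch mode, boolean byteBuddyAvailable) {
        if (mode == CodeGenModeSketch.OFF) {
            return x -> false; // original behavior, no generation attempted
        }
        if (!byteBuddyAvailable) {
            if (mode == CodeGenModeSketch.REQUIRED_ALL) {
                throw new IllegalStateException("code generation required but ByteBuddy is missing");
            }
            return x -> false; // safe fallback to the original implementation
        }
        return x -> true; // placeholder: a real check would inspect the message type
    }

    public static void main(String[] args) {
        boolean available = isOnClasspath("net.bytebuddy.ByteBuddy");
        Predicate<Class<?>> p = interceptorFor(CodeGenModeSketch.BEST_EFFORT, available);
        System.out.println("ByteBuddy available: " + available
                + ", use codegen for String: " + p.test(String.class));
    }
}
```

Passing availability in as a parameter keeps the decision logic testable without ByteBuddy being present; whether the PR structures it this way is not stated here.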

@gszadovszky
Contributor

Thanks, @gowa for the clarification.

It would be nice to add a description for users in the README: if one would like to use ByteBuddy, what they need to do and what options they have, and how it will behave by default (if ByteBuddy is not on the classpath).

Overall, I'm good with this approach and your PR. If you can find somebody who can do a deeper review of the ByteBuddy-related code, I'll be more than happy to approve it.

@gowa
Author

gowa commented Jan 23, 2025

@gszadovszky, all right, thanks. There is still a lot to do, such as implementing the read part. At least I now know that this can all go live. I will certainly update the docs, etc., to make it a good-quality commit.
