-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] [PREVIEW] use ByteBuddy code generation to write proto to parquet faster #3121
base: master
Are you sure you want to change the base?
Conversation
Thanks for your interest in contributing this! This seems to be a large feature and the performance gain is promising! However, I'm afraid that this PR may not get prompt review due to lack of active cc @gszadovszky @julienledem if you know someone can help review this. |
The noted performance gain is promising indeed. However, it would be nice to see actual numbers for different scenarios (flat columns, general nested columns, deeply nested columns) of read/write. You might even implement your performance tests in the module Unfortunately, I'm not expert in |
1d14b84
to
0461411
Compare
…ompatible with java8 without hacks
0461411
to
1d780f6
Compare
Hi @gszadovszky , @wgtmac . Thank you for your feedback. As for benchmarks. I've implemented some and committed. I attempted to replicate the original org.apache.parquet.benchmarks.WriteBenchmarks with some proto stuff in org.apache.parquet.benchmarks.ProtoWriteBenchmarks. The result are as follows: the bigger number of fields (especially primitives), the bigger the gain, as this fix only optimize getters and boxing/unboxing. E.g. for 100 int32 fields:
For 30 int32 fields:
For 30 strings ("fieldXX:XX"):
For 5-7 fields the gain is negligeable. |
Thank you for the results, @gowa. I'm convinced. I am understand you code well that if Another question is the dependency. What transitive dependencies ByteBuddy would get into |
Hi @gszadovszky ! I will check dependencies in details, but I've put bytebuddy as an optional one. If it is not in classpath, then for all modes codegen won't happen, safely falling back to the original implementation, except for REQUIRE_ALL mode, when we insist on codegeneration. |
Thanks, @gowa for the clarification. It would be nice to add a description for the users in the README. If one would like to use ByteBuddy, what to do, what options do they have. How it will behave by default (if ByteBuddy is not on the classpath). Overall, I'm good with this approach and your PR. If you would find somebody who can do a deeper review on the ByteBuddy related codes, I'll be more than happy to approve it. |
@gszadovszky , all right. thanks. there is still a lot to do. like, implementing the read part. at least, I know that this all can make live. I will surely update the docs, etc. to make it a good quality commit. |
Note: it is not a ready to merge pull request, but a request to check if the concept of using code generation for solving some performance issues, associated with the usage of protobuf reflection when writing or reading parquet files, is of potential interest of repository owners. I decided to verify the concept at a rather early stage due to a significant effort required to implement the change. Should the approach and a new optional dependency on ByteBuddy is found satisfactorily and potentially acceptable to be included into parquet-java, I will attempt to properly finish first the 'write' part and then the 'read' part (in terms of code quality and tests). Therefore, any feedback is appreciated.
Rationale for this change
We read and write a lot of parquet data, defined by protobuf schemas from Java. It is seen that this can be done faster than what is offered out of the box now.
The change introduced improves proto-to-parquet file writing performance by means of code generation (in my synthetic tests by around 50% with SNAPPY compression, especially, when structures have a lot of primitive type fields).
What changes are included in this PR?
Are these changes tested?
current unit tests work fine.
Are there any user-facing changes?
a configuration to disable code generation logic.