Dataframe with_column and with_column_renamed performance improvements #14653
After spending more time reviewing the DataFrame and logical plan code, I now suspect that my assumption is incorrect: a DataFrame can have a plan that is not normalized/columnized before with_column is called. Plans involving joins, windows, or aggregates are possible examples.
I suspect you're right that the assumption is incorrect. I've dug through the code a bit, but I'd probably need to write a unit test to verify.
I've made some changes locally that check whether the existing plan is a projection, but I realized I can't rely on that alone: the plan could have been built manually, wrapped in a DataFrame, and then had with_column called on it. For my approach to work, I would need a strong guarantee that the last projection was created via the project(..) function in the builder, where normalization/columnization is guaranteed to have happened. I'm not sure yet how to do that.
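One way to picture the guarantee being asked for is a marker that only the builder can set. The sketch below is a toy, standard-library-only model and not DataFusion's actual types or API: `Plan`, `Projection`, `new_unchecked`, `is_normalized`, and `can_use_fast_path` are hypothetical names, and the `trim()` call is only a placeholder for real normalization/columnization. It assumes a private field is an acceptable way to record that a projection came from `project(..)`.

```rust
// Toy model only: these are NOT DataFusion's real types.
mod plan {
    #[derive(Debug, Clone)]
    pub enum Plan {
        Scan(Vec<String>),
        Projection(Projection),
    }

    #[derive(Debug, Clone)]
    pub struct Projection {
        input: Box<Plan>,
        exprs: Vec<String>,
        // Private marker: code outside this module cannot set it to true.
        normalized: bool,
    }

    impl Projection {
        /// Hypothetical "manual" constructor that performs no normalization.
        pub fn new_unchecked(input: Plan, exprs: Vec<String>) -> Self {
            Projection { input: Box::new(input), exprs, normalized: false }
        }

        pub fn is_normalized(&self) -> bool {
            self.normalized
        }
    }

    /// Hypothetical stand-in for the builder's `project(..)`.
    /// `trim()` is a placeholder for real normalization/columnization.
    pub fn project(input: Plan, exprs: Vec<String>) -> Plan {
        let exprs = exprs.into_iter().map(|e| e.trim().to_string()).collect();
        Plan::Projection(Projection { input: Box::new(input), exprs, normalized: true })
    }
}

use plan::{project, Plan, Projection};

/// True only when the top of the plan is a projection that the builder produced.
fn can_use_fast_path(plan: &Plan) -> bool {
    matches!(plan, Plan::Projection(p) if p.is_normalized())
}
```

In this model a manually constructed projection always falls back to the slow path, because nothing outside the module can claim to be normalized. Whether a comparable marker would fit DataFusion's real logical plan types is an open question that this sketch does not answer.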
# Conflicts: # datafusion/core/src/dataframe/mod.rs
Which issue does this PR close?
If there are any DataFrame experts here, I would love a review of my assumptions, as noted in #14563 (comment).
Rationale for this change
Improve the performance of the with_column and with_column_renamed DataFrame functions.
What changes are included in this PR?
Code
Are these changes tested?
Existing tests
Are there any user-facing changes?
No.