Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataframe with_column and with_column_renamed performance improvements #14653

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Feb 13, 2025

Which issue does this PR close?

with_column_10          time:   [3.6602 ms 3.8457 ms 4.2130 ms]
with_column_100         time:   [34.874 ms 35.979 ms 38.018 ms]
with_column_200         time:   [183.49 ms 187.29 ms 191.65 ms]

If there is any dataframe experts here I would love a review of my assumptions. As noted in #14563 (comment)

My thinking for the changes in my current branch was that any 'new' Expr (parameter to with_column, with_column_renamed, etc) would go through the normalization, everything else I would like to think would already have been normalized or else how would it be in the DataFrame? What worries me is that I don't know if that assumption is correct or not.

Rationale for this change

Improve performance for with_column and with_column_renamed dataframe functions.

What changes are included in this PR?

Code

Are these changes tested?

Existing tests

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Feb 13, 2025
@Omega359
Copy link
Contributor Author

After spending more time reviewing the dataframe and logical plan code I have a feeling that my assumption is in fact not correct and that a dataframe can indeed have a plan that is not normalized/columnized prior to with_column being called. Joins, window, aggregate, are possible examples.

@timsaucer timsaucer self-requested a review February 13, 2025 21:38
@timsaucer
Copy link
Contributor

I suspect you're right about that assumption not being correct. I've dug through a bit, but I'd probably need to write up a unit test to verify.

@Omega359 Omega359 closed this Feb 16, 2025
@Omega359 Omega359 reopened this Feb 16, 2025
@Omega359
Copy link
Contributor Author

I've made some changes locally where I test to see if the existing plan is a projection but I realized that I can't just rely on that either as the plan could possibly have been manually made then a DataFrame wrapped around it and the with_column function called.

For my approach to work I would need a way to strongly guarantee that the last projection that was made was done via the project(..) function in the builder where the normalization/columnization is guaranteed to have happened. I'm not sure right now how to do that

@Omega359 Omega359 marked this pull request as ready for review February 17, 2025 18:54
@Omega359
Copy link
Contributor Author

with_column_10          time:   [6.1112 ms 6.2616 ms 6.4226 ms]
                        change: [+18.276% +23.739% +29.703%] (p = 0.00 < 0.05)
with_column_100         time:   [41.379 ms 54.353 ms 66.683 ms]
                        change: [+18.326% +33.842% +55.521%] (p = 0.00 < 0.05)
with_column_200         time:   [198.59 ms 209.45 ms 224.33 ms]
                        change: [-1.0194% +5.1906% +12.830%] (p = 0.20 > 0.05)
with_column_500         time:   [3.5914 s 3.7758 s 4.0800 s]
                        change: [-13.580% -6.5454% +2.3214%] (p = 0.16 > 0.05)
                        No change in performance detected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants