Early exit on column normalisation to improve DataFrame performance #14636

Merged: 3 commits, Feb 17, 2025

Conversation

blaginin
Contributor

Which issue does this PR close?

Related to #14563 (probably more PRs to come)

Rationale for this change

Currently, when normalizing a column, we always call plan.using_columns(), which is recursive and very expensive, and which may not be needed if the column is already normalized.
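To illustrate why a recursive walk on every normalization adds up, here is a minimal sketch using a toy plan type. `ToyPlan`, `walk`, and `total_visits` are hypothetical names invented for this example; they are not DataFusion APIs, and the sketch assumes each added column stacks one more node on the plan and triggers one full walk.

```rust
use std::cell::Cell;

/// Toy stand-in for a logical plan node; not DataFusion's `LogicalPlan`.
struct ToyPlan {
    input: Option<Box<ToyPlan>>,
}

impl ToyPlan {
    /// Hypothetical analogue of a recursive `using_columns()`: it visits
    /// this node and every node below it, counting each visit in `visits`.
    fn walk(&self, visits: &Cell<usize>) {
        visits.set(visits.get() + 1);
        if let Some(child) = &self.input {
            child.walk(visits);
        }
    }
}

/// Stack `n` nodes one at a time, doing one full recursive walk before
/// each step, and return the total number of node visits.
fn total_visits(n: usize) -> usize {
    let visits = Cell::new(0);
    let mut plan = ToyPlan { input: None };
    for _ in 0..n {
        plan.walk(&visits); // one walk per added column (assumption)
        plan = ToyPlan { input: Some(Box::new(plan)) };
    }
    visits.get()
}

fn main() {
    for n in [10, 100, 200] {
        println!("{n} columns -> {} node visits", total_visits(n));
    }
}
```

Under these assumptions the number of visits grows quadratically with the number of columns. The real cost profile in DataFusion may differ; the point is only that skipping the walk when it is not needed removes work that compounds as the plan deepens.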

What changes are included in this PR?

Exit early if the column already has a relation set. Also, set the relation when with_column_renamed is called.

Are these changes tested?

Extended a test to assert references

Are there any user-facing changes?

No

@github-actions github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels Feb 12, 2025
@blaginin
Contributor Author

Got a +38% improvement in the dataframe benchmark:

                    before      after
with_column_10      769.06 µs   673.32 µs
with_column_100     1.4952 s    978.43 ms
with_column_200     36.544 s    26.682 s
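The table mixes units (µs, s, ms), so comparing rows is easier after converting everything to one unit. The sketch below does that and prints two common ways of expressing the change. The helper names `reduction_pct` and `speedup` are made up for this example, and the thread does not say which formula produced the headline figure.

```rust
/// Percentage by which `after` is lower than `before` (same unit).
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// How many times faster `after` is than `before` (same unit).
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

fn main() {
    // (benchmark, before, after), all converted to milliseconds.
    let rows = [
        ("with_column_10", 0.76906, 0.67332),
        ("with_column_100", 1495.2, 978.43),
        ("with_column_200", 36544.0, 26682.0),
    ];
    for (name, before, after) in rows {
        println!(
            "{name}: {:.1}% less time, {:.2}x speedup",
            reduction_pct(before, after),
            speedup(before, after)
        );
    }
}
```

By these formulas the three rows take roughly 12%, 35%, and 27% less time, which corresponds to speedups of about 1.14x, 1.53x, and 1.37x.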

@alamb
Contributor

alamb commented Feb 12, 2025

FYI @Omega359

@Omega359
Contributor

I'll check this out tomorrow. We've been chatting about our approaches on #14563.

@Omega359
Contributor

I think this is a reasonable change. It could be incorporated into my upcoming PR to enhance things a bit, especially since it may help other dataframe functions such as select(exprs).

@blaginin blaginin marked this pull request as ready for review February 13, 2025 15:40
@blaginin
Contributor Author

Nice, thank you! But let's maybe keep the PRs atomic? I plan to do one more (the one I described as the third in the issue); I don't think they overlap with each other.

@Omega359
Contributor

Of course we can have them atomic :)

datafusion/core/src/dataframe/mod.rs
Comment on lines +837 to +842
    let column = column.into();
    if column.relation.is_some() {
        // column is already normalized
        return Ok(column);
    }
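For context, here is a self-contained sketch of how an early return like this behaves inside a normalization routine. `ToyColumn`, `ToyPlan`, `resolve_relation`, and `normalize` are hypothetical stand-ins written for this example, not the actual DataFusion types or the surrounding code in `mod.rs`; the counter exists only to make the skipped work visible.

```rust
use std::cell::Cell;

/// Toy column; `relation` plays the role of the table qualifier.
#[derive(Debug, Clone, PartialEq)]
struct ToyColumn {
    relation: Option<String>,
    name: String,
}

/// Toy plan that records how often the expensive lookup runs.
struct ToyPlan {
    table: String,
    expensive_lookups: Cell<usize>,
}

impl ToyPlan {
    /// Hypothetical stand-in for the costly resolution work.
    fn resolve_relation(&self, _name: &str) -> String {
        self.expensive_lookups.set(self.expensive_lookups.get() + 1);
        self.table.clone()
    }

    /// Normalize a column, returning early when it is already qualified.
    fn normalize(&self, column: ToyColumn) -> ToyColumn {
        if column.relation.is_some() {
            // column is already normalized
            return column;
        }
        let relation = Some(self.resolve_relation(&column.name));
        ToyColumn { relation, name: column.name }
    }
}

fn main() {
    let plan = ToyPlan { table: "t".to_string(), expensive_lookups: Cell::new(0) };
    let bare = ToyColumn { relation: None, name: "a".to_string() };
    let qualified = plan.normalize(bare); // pays for one lookup
    let again = plan.normalize(qualified.clone()); // early exit, no lookup
    assert_eq!(again, qualified);
    println!("expensive lookups: {}", plan.expensive_lookups.get());
}
```

In this sketch the second call never reaches the expensive path. That is presumably also why the PR sets the relation in with_column_renamed: columns that already carry a relation can take the early return later.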

Contributor

This looks like a no brainer performance boost. Thanks!

@alamb alamb changed the title from "Early exit on column normalisation" to "Early exit on column normalisation to improve DataFrame performance" Feb 14, 2025
Contributor

@alamb alamb left a comment

> This looks like a no brainer performance boost. Thanks!

I agree -- thank you @blaginin @timsaucer and @Omega359

🚀

@alamb alamb merged commit 580e622 into apache:main Feb 17, 2025
25 checks passed
Labels
core (Core DataFusion crate), logical-expr (Logical plan and expressions)

4 participants