Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuse last projection layer when renaming columns #14684

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

blaginin
Copy link
Contributor

Which issue does this PR close?

Related to #14563

Rationale for this change

Currently, when with_column_renamed is called, a new Projection layer is always created. For example, in the dataframe benchmark, logical plan looks like this:

Zen Browser 2025-02-15 18 38 03

These layers do not affect the query itself (as they are removed by optimize_projections), but they make renaming new columns (and optimization itself) quite slow.

What changes are included in this PR?

I added an optimization for one edge case - when there's already a projection on top and we're adding a new one. This is a common scenario, especially when renaming many columns (as in the dataframe benchmark). Instead of adding a new projection layer on top, we replace the existing one if possible.

Are these changes tested?

Extended test case

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Feb 15, 2025
@blaginin
Copy link
Contributor Author

group                                                main                                    one-layer
with_column_10                                       1.39   764.9±10.37µs        ? ?/sec     1.00    550.0±4.71µs        ? ?/sec
with_column_100                                      3.88  1490.1±22.27ms        ? ?/sec     1.00    384.5±3.24ms        ? ?/sec
with_column_200                                      5.63      36.6±0.43s        ? ?/sec     1.00       6.5±0.63s        ? ?/sec

@Omega359
Copy link
Contributor

Quite a speed improvement! I'll look at the code in a bit

.with_column_renamed("c4", "boom")?;

assert_eq!("\
Projection: aggregate_test_100.c1 AS one, aggregate_test_100.c2 AS two, aggregate_test_100.c3, aggregate_test_100.c2 + aggregate_test_100.c3 AS sum AS total\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AS sum AS total is valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thorough reading 🙏 AS sum AS total happens because there's an alias over an alias - I believe it's harmless since it just wraps the original expression.

There's a function unalias_nested that can resolve this, but I don't want to call it because it's recursive and expensive. A potentially better option IMO is to reuse the existing alias in alias_qualified and alias if it's already there (similar to this PR). This feels like a separate improvement from this PR though

SchemaError::FieldNotFound { .. },
_,
)) => {
self.plan = LogicalPlan::Projection(
Copy link
Contributor

@Omega359 Omega359 Feb 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the field is not found why are we not just returning Ok(self)? I am unsure why we would need to add a projection for the plan in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do that because self was destructed and so I need to put plan back. try_new_with_schema is quite cheap and this shouldn't be happening often anyway

})
.collect();
LogicalPlan::Projection(Projection::try_new(expr, input)?)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a lot of duplication here. I wonder if there is a way we could reduce that? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very fair! I also dislike repeating qualifier_rename, field_rename twice, but it's happening because they reference plan and must come after its destruction (hence needed twice)

@blaginin blaginin marked this pull request as ready for review February 17, 2025 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants