Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Analyzer within LogicalPlan building stage #14618

Open
6 tasks
jayzhan211 opened this issue Feb 12, 2025 · 7 comments
Open
6 tasks

Integrate Analyzer within LogicalPlan building stage #14618

jayzhan211 opened this issue Feb 12, 2025 · 7 comments
Labels
enhancement New feature or request

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 12, 2025

Is your feature request related to a problem or challenge?

The steps in Logical Layer is Sql->LogicalPlan->Analyzer->Optimizer.

These 5 rules are in Analyzer

            Arc::new(InlineTableScan::new()),
            // Every rule that will generate [Expr::Wildcard] should be placed in front of [ExpandWildcardRule].
            Arc::new(ExpandWildcardRule::new()),
            // [Expr::Wildcard] should be expanded before [TypeCoercion]
            Arc::new(ResolveGroupingFunction::new()),
            Arc::new(TypeCoercion::new()),
            Arc::new(CountWildcardRule::new()),

The role of the Analyzer is unclear to me. Having two types of "optimization" after the plan is completed doesn’t seem necessary. Instead, we should have one optimization step during plan construction and another after the plan is finalized. I believe these rules can be placed either in the SQL → LogicalPlan building stage or in the optimizer.

Comments of Analyzer

/// [`AnalyzerRule`]s transform [`LogicalPlan`]s in some way to make
/// the plan valid prior to the rest of the DataFusion optimization process.
///
/// `AnalyzerRule`s are different than an [`OptimizerRule`](crate::OptimizerRule)s
/// which must preserve the semantics of the `LogicalPlan`, while computing
/// results in a more optimal way.
///
/// For example, an `AnalyzerRule` may resolve [`Expr`](datafusion_expr::Expr)s into more specific
/// forms such as a subquery reference, or do type coercion to ensure the types
/// of operands are correct.

If a rule MUST be executed for plan validity (i.e. TypeCoercion), it should be applied during the plan creation stage, not after the plan is completed. However, if the rule is OPTIONAL for plan completion, it should be applied in the optimizer.

I propose removing the concept of the Analyzer and integrating it into the SQL → LogicalPlan stage. Specifically, TypeCoercion should be applied before the plan is finalized (#14380).

Before moving TypeCoercion into the builder, ExpandWildcardRule needs to be relocated first. The remaining three rules can be moved either into the builder or the optimize

Describe the solution you'd like

Requirement

Rules in the Analyzer are optional, allowing users to choose whether to apply them or add custom rules. This flexibility should be preserved, ensuring that the rule remains optional and customizable even after being moved out of the Analyzer.

Tasks

Describe alternatives you've considered

No response

Additional context

No response

@jayzhan211 jayzhan211 added the enhancement New feature or request label Feb 12, 2025
@xudong963
Copy link
Member

+1 for the change

@rkrishn7
Copy link
Contributor

rkrishn7 commented Feb 12, 2025

+1

I'd be happy to help here. I can start by taking on moving ExpandWildcardRule

I opened #14634 to track this

@findepi
Copy link
Member

findepi commented Feb 12, 2025

The role of the Analyzer is unclear to me. Having two types of "optimization" after the plan is completed doesn’t seem necessary. Instead, we should have one optimization step during plan construction and another after the plan is finalized. I believe these rules can be placed either in the SQL → LogicalPlan building stage or in the optimizer.

I would not call them "optimizations".
They are simply logical ingredients towards building fully resolved Logical Plan (LP).

If we do this issue (+1 from me), we should be able to do

Even without introducing new IR, just making current LP the IR.

So if LP is to be the real IR, we should also plan on making the Expr a real expression:

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

I propose removing the concept of the Analyzer and integrating it into the SQL → LogicalPlan stage. Specifically, TypeCoercion should be applied before the plan is finalized (#14380).

Another important consideration I think is that DataFusion is used from non SQL contexts (for example the DataFrame API and when other languages are compiled into LogicalPlans)

The type coercion logic specifically, and likely the other analyzer rules, need to be applied to LogicalPlans that also come from these non SQL sources

@findepi
Copy link
Member

findepi commented Feb 15, 2025

Good point @alamb . This would mean that the LogicalPlan, being the public API for syntactial query building, is inherently not "fully resolved". Which suggests creating new IR might actually be an easier option that "healing" LP to become fully resolved IR itself.

@jayzhan211
Copy link
Contributor Author

I agree.

In SQL, the transformation follows this path:

SQL → Expr → LogicalPlan

For DataFrame operations, the transformation is:

Expr → LogicalPlan

The same principles that apply to the Expr → LogicalPlan transformation in the SQL case should also be considered for DataFrame operations. We need to ensure consistency in handling this transformation.

@alamb
Copy link
Contributor

alamb commented Feb 15, 2025

Good point @alamb . This would mean that the LogicalPlan, being the public API for syntactial query building, is inherently not "fully resolved". Which suggests creating new IR might actually be an easier option that "healing" LP to become fully resolved IR itself.

A new IR might work

I think @wiedld and I were trying to do the "healing" (well really checking if the LogicalPlan isn't healthy) via the `InvariantLevel idea here:

https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html#method.check_invariants

It isn't yet sidely used, but maybe could be more and more formalized 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants