Integrate `Analyzer` within LogicalPlan building stage #14618

jayzhan211 · 2025-02-12T00:51:51Z

Is your feature request related to a problem or challenge?

The steps in Logical Layer is Sql->LogicalPlan->Analyzer->Optimizer.

These 5 rules are in Analyzer

            Arc::new(InlineTableScan::new()),
            // Every rule that will generate [Expr::Wildcard] should be placed in front of [ExpandWildcardRule].
            Arc::new(ExpandWildcardRule::new()),
            // [Expr::Wildcard] should be expanded before [TypeCoercion]
            Arc::new(ResolveGroupingFunction::new()),
            Arc::new(TypeCoercion::new()),
            Arc::new(CountWildcardRule::new()),

The role of the Analyzer is unclear to me. Having two types of "optimization" after the plan is completed doesn’t seem necessary. Instead, we should have one optimization step during plan construction and another after the plan is finalized. I believe these rules can be placed either in the SQL → LogicalPlan building stage or in the optimizer.

Comments of Analyzer

/// [`AnalyzerRule`]s transform [`LogicalPlan`]s in some way to make
/// the plan valid prior to the rest of the DataFusion optimization process.
///
/// `AnalyzerRule`s are different than an [`OptimizerRule`](crate::OptimizerRule)s
/// which must preserve the semantics of the `LogicalPlan`, while computing
/// results in a more optimal way.
///
/// For example, an `AnalyzerRule` may resolve [`Expr`](datafusion_expr::Expr)s into more specific
/// forms such as a subquery reference, or do type coercion to ensure the types
/// of operands are correct.

If a rule MUST be executed for plan validity (i.e. TypeCoercion), it should be applied during the plan creation stage, not after the plan is completed. However, if the rule is OPTIONAL for plan completion, it should be applied in the optimizer.

I propose removing the concept of the Analyzer and integrating it into the SQL → LogicalPlan stage. Specifically, TypeCoercion should be applied before the plan is finalized (#14380).

Before moving TypeCoercion into the builder, ExpandWildcardRule needs to be relocated first. The remaining three rules can be moved either into the builder or the optimize

Describe the solution you'd like

Requirement

Rules in the Analyzer are optional, allowing users to choose whether to apply them or add custom rules. This flexibility should be preserved, ensuring that the rule remains optional and customizable even after being moved out of the Analyzer.

Tasks

Move ExpandWildcardRule in SQL->LogicalPlan stage (Move ExpandWildcardRule into Logical Plan construction #14634)
User Defined Coercion Rules #14296 We could make TypeCoercion customizable before moving it out of the Analyzer.
Move TypeCoercion in SQL->LogicalPlan stage
Revisit Create UNION plan node with correct schema #14380
Investigate other 3 rules and remove Analyzer internally.
Find a way to migrate existing customize analyzer rule away

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

xudong963 · 2025-02-12T09:48:18Z

+1 for the change

rkrishn7 · 2025-02-12T20:59:53Z

+1

I'd be happy to help here. I can start by taking on moving ExpandWildcardRule

I opened #14634 to track this

findepi · 2025-02-12T21:18:05Z

The role of the Analyzer is unclear to me. Having two types of "optimization" after the plan is completed doesn’t seem necessary. Instead, we should have one optimization step during plan construction and another after the plan is finalized. I believe these rules can be placed either in the SQL → LogicalPlan building stage or in the optimizer.

I would not call them "optimizations".
They are simply logical ingredients towards building fully resolved Logical Plan (LP).

If we do this issue (+1 from me), we should be able to do

Proposal: introduced typed expressions, separate AST and IR #12604
move all the coercion logic out of optimizer and thus achieve [Epic] Make DataFusion a reliable foundation for building query engines #12723

Even without introducing new IR, just making current LP the IR.

So if LP is to be the real IR, we should also plan on making the Expr a real expression:

alamb · 2025-02-15T10:53:17Z

I propose removing the concept of the Analyzer and integrating it into the SQL → LogicalPlan stage. Specifically, TypeCoercion should be applied before the plan is finalized (#14380).

Another important consideration I think is that DataFusion is used from non SQL contexts (for example the DataFrame API and when other languages are compiled into LogicalPlans)

The type coercion logic specifically, and likely the other analyzer rules, need to be applied to LogicalPlans that also come from these non SQL sources

findepi · 2025-02-15T11:55:44Z

Good point @alamb . This would mean that the LogicalPlan, being the public API for syntactial query building, is inherently not "fully resolved". Which suggests creating new IR might actually be an easier option that "healing" LP to become fully resolved IR itself.

jayzhan211 · 2025-02-15T12:59:17Z

I agree.

In SQL, the transformation follows this path:

SQL → Expr → LogicalPlan

For DataFrame operations, the transformation is:

Expr → LogicalPlan

The same principles that apply to the Expr → LogicalPlan transformation in the SQL case should also be considered for DataFrame operations. We need to ensure consistency in handling this transformation.

alamb · 2025-02-15T14:18:26Z

Good point @alamb . This would mean that the LogicalPlan, being the public API for syntactial query building, is inherently not "fully resolved". Which suggests creating new IR might actually be an easier option that "healing" LP to become fully resolved IR itself.

A new IR might work

I think @wiedld and I were trying to do the "healing" (well really checking if the LogicalPlan isn't healthy) via the `InvariantLevel idea here:

https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html#method.check_invariants

It isn't yet sidely used, but maybe could be more and more formalized 🤔

jayzhan211 added the enhancement New feature or request label Feb 12, 2025

rkrishn7 mentioned this issue Feb 12, 2025

Move ExpandWildcardRule into Logical Plan construction #14634

Open

rkrishn7 mentioned this issue Feb 12, 2025

Remove Wildcard from Expr #7765

Open

jayzhan211 mentioned this issue Feb 16, 2025

Remove CountWildcardRule in Analyzer and move the functionality in ExprPlanner, add plan_aggregate and plan_window to planner #14689

Draft

findepi mentioned this issue Feb 17, 2025

[EPIC] Decouple logical from physical types #12622

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate `Analyzer` within LogicalPlan building stage #14618

Integrate `Analyzer` within LogicalPlan building stage #14618

jayzhan211 commented Feb 12, 2025 •

edited by findepi

Loading

xudong963 commented Feb 12, 2025

rkrishn7 commented Feb 12, 2025 •

edited

Loading

findepi commented Feb 12, 2025

alamb commented Feb 15, 2025

findepi commented Feb 15, 2025

jayzhan211 commented Feb 15, 2025

alamb commented Feb 15, 2025

Integrate Analyzer within LogicalPlan building stage #14618

Integrate Analyzer within LogicalPlan building stage #14618

Comments

jayzhan211 commented Feb 12, 2025 • edited by findepi Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Requirement

Tasks

Describe alternatives you've considered

Additional context

xudong963 commented Feb 12, 2025

rkrishn7 commented Feb 12, 2025 • edited Loading

findepi commented Feb 12, 2025

alamb commented Feb 15, 2025

findepi commented Feb 15, 2025

jayzhan211 commented Feb 15, 2025

alamb commented Feb 15, 2025

Integrate `Analyzer` within LogicalPlan building stage #14618

Integrate `Analyzer` within LogicalPlan building stage #14618

jayzhan211 commented Feb 12, 2025 •

edited by findepi

Loading

rkrishn7 commented Feb 12, 2025 •

edited

Loading