Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax physical schema validation #14519

Conversation

findepi
Copy link
Member

@findepi findepi commented Feb 5, 2025

Physical plan can be further optimized. In particular, an expression can be determined as never null even if it wasn't known at the time of logical planning. Thus, the final schema check needs to be relax, allowing now-non-null data where nullable data was expected. This replaces schema equality check, with asymmetric "is satisfied by" relation.

@findepi
Copy link
Member Author

findepi commented Feb 5, 2025

@findepi findepi force-pushed the findepi/relax-physical-schema-validation-dc64f7 branch from e33e84a to 387b568 Compare February 5, 2025 21:59
Physical plan can be further optimized. In particular, an expression can
be determined as never null even if it wasn't known at the time of
logical planning. Thus, the final schema check needs to be relax,
allowing now-non-null data where nullable data was expected. This
replaces schema equality check, with asymmetric "is satisfied by"
relation.
@findepi findepi force-pushed the findepi/relax-physical-schema-validation-dc64f7 branch from 387b568 to fa37ee5 Compare February 5, 2025 22:08
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but I don't have all of the context to give it an actual +1. I just had some free time and decided to take up @alamb's call for more folks reviewing PRs.

datafusion/core/src/schema_equivalence.rs Show resolved Hide resolved
// TODO (DataType::Union(, _), DataType::Union(_, _)) => {}
// TODO (DataType::Dictionary(_, _), DataType::Dictionary(_, _)) => {}
// TODO (DataType::Map(_, _), DataType::Map(_, _)) => {}
// TODO (DataType::RunEndEncoded(_, _), DataType::RunEndEncoded(_, _)) => {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not add these as part of this PR that I'm missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

laziness and avoiding pr scope creep. i wanted to get structure clear and decided upon first

for example, it's not totally obvious we should be recursing into types at all. i think we should, but that's the decision being made.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough!

@@ -689,7 +693,7 @@ impl DefaultPhysicalPlanner {
if physical_field.data_type() != logical_field.data_type() {
differences.push(format!("field data type at index {} [{}]: (physical) {} vs (logical) {}", i, physical_field.name(), physical_field.data_type(), logical_field.data_type()));
}
if physical_field.is_nullable() != logical_field.is_nullable() {
if physical_field.is_nullable() && !logical_field.is_nullable() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like it! I still don't get why we still check nullability in schemas equivalence, 🤔 logical a physical schema can be derived differently and nullable sometimes derived in different way as well.

Nullability checks was a source of dozens problems on schema mismatch especially for UNION

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely only a few case like Union is exception, most of the case doesn't change nullability

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logical a physical schema can be derived differently and nullable sometimes derived in different way as well.

agreed, but the earlier delivered schema acts as a contract (promise) for a later delivered schema
if we told the world that expr won't contain null values, we can't change the mind at physical planning time. it violates the constraint (promise / contract)
if we told the world that expr may contain null values, we didn't promise that it will contain null values, and we may happen to produce no null values (and even be aware of that)

/// schemas except that original schema can have nullable fields where candidate
/// is constrained to not provide null data.
pub(crate) fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
original.metadata() == candidate.metadata()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering, do we really need to compare metadata? if it works for now we can have it, but since metadata is not strongly typed in fact just a HashMap<String, String> it might be an issue if from logical/physical schema someone decides to store something in there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree.
note that the bottom line, aka the original behavior, is the schema Eq check, which includes metadata equality check.
in this PR i wanted to relax nullability checks only.

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

@comphead comphead merged commit 479a277 into apache:main Feb 7, 2025
25 checks passed
@findepi findepi deleted the findepi/relax-physical-schema-validation-dc64f7 branch February 7, 2025 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nullable Expr being constant fold to value can cause schema change and internal error
4 participants