Relax physical schema validation #14519
Conversation
Force-pushed from e33e84a to 387b568
The physical plan can be further optimized. In particular, an expression can be determined to be never null even if that wasn't known at logical planning time. Thus, the final schema check needs to be relaxed, allowing now-non-null data where nullable data was expected. This replaces the schema equality check with an asymmetric "is satisfied by" relation.
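To make the asymmetry concrete, here is a minimal sketch of the per-field nullability rule using arrow-schema types; the helper name nullability_satisfied_by is hypothetical and not the PR's actual code:

use arrow_schema::{DataType, Field};

// Hypothetical helper: a candidate (physical) field satisfies the original
// (logical) field's nullability if it only tightens the promise, i.e. it may
// drop nullability but never introduce it.
fn nullability_satisfied_by(original: &Field, candidate: &Field) -> bool {
    original.is_nullable() || !candidate.is_nullable()
}

fn main() {
    let logical = Field::new("expr", DataType::Int64, true);   // declared nullable
    let physical = Field::new("expr", DataType::Int64, false); // proven non-null later
    assert!(nullability_satisfied_by(&logical, &physical));    // relaxation is accepted
    assert!(!nullability_satisfied_by(&physical, &logical));   // the reverse is not
}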
Force-pushed from 387b568 to fa37ee5
This looks good to me, but I don't have all of the context to give it an actual +1. I just had some free time and decided to take up @alamb's call for more folks reviewing PRs.
// TODO (DataType::Union(_, _), DataType::Union(_, _)) => {}
// TODO (DataType::Dictionary(_, _), DataType::Dictionary(_, _)) => {}
// TODO (DataType::Map(_, _), DataType::Map(_, _)) => {}
// TODO (DataType::RunEndEncoded(_, _), DataType::RunEndEncoded(_, _)) => {}
Is there a reason to not add these as part of this PR that I'm missing?
Laziness and avoiding PR scope creep. I wanted to get the structure clear and decided upon first.
For example, it's not totally obvious we should be recursing into types at all. I think we should, but that's the decision being made.
Fair enough!
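For context on the "recursing into types" question above, a hedged sketch of what recursion into one nested type could look like, assuming the same nullability rule is applied to inner fields; the function name data_type_satisfied_by is hypothetical:

use arrow_schema::DataType;

// Sketch only: for List types, recurse into the inner field and require that
// its nullability may only tighten; unhandled nested types fall back to equality.
fn data_type_satisfied_by(original: &DataType, candidate: &DataType) -> bool {
    match (original, candidate) {
        (DataType::List(original_inner), DataType::List(candidate_inner)) => {
            data_type_satisfied_by(original_inner.data_type(), candidate_inner.data_type())
                && (original_inner.is_nullable() || !candidate_inner.is_nullable())
        }
        // TODO: Union, Dictionary, Map, RunEndEncoded (see the comments above)
        _ => original == candidate,
    }
}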
@@ -689,7 +693,7 @@ impl DefaultPhysicalPlanner {
      if physical_field.data_type() != logical_field.data_type() {
          differences.push(format!("field data type at index {} [{}]: (physical) {} vs (logical) {}", i, physical_field.name(), physical_field.data_type(), logical_field.data_type()));
      }
-     if physical_field.is_nullable() != logical_field.is_nullable() {
+     if physical_field.is_nullable() && !logical_field.is_nullable() {
Like it! I still don't get why we check nullability in schema equivalence at all 🤔 Logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well.
Nullability checks were a source of dozens of problems with schema mismatches, especially for UNION.
Likely only a few cases like Union are exceptions; most cases don't change nullability.
logical and physical schemas can be derived differently, and nullability is sometimes derived in different ways as well
Agreed, but the earlier delivered schema acts as a contract (promise) for the later delivered schema.
If we told the world that expr won't contain null values, we can't change our mind at physical planning time; that violates the constraint (promise / contract).
If we told the world that expr may contain null values, we didn't promise that it will contain null values, and we may happen to produce no null values (and even be aware of that).
/// schemas except that original schema can have nullable fields where candidate
/// is constrained to not provide null data.
pub(crate) fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.metadata() == candidate.metadata()
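Only the first line of the function body is shown above. As a rough sketch, the rest of the check might look like the following; the per-field comparison below is an assumption for illustration, not the PR's actual implementation:

use arrow_schema::Schema;

// Minimal sketch, assuming the metadata comparison shown above is followed by a
// per-field check in which a nullable original field may be satisfied by a
// non-null candidate field, but not the other way around.
pub(crate) fn schema_satisfied_by(original: &Schema, candidate: &Schema) -> bool {
    original.metadata() == candidate.metadata()
        && original.fields().len() == candidate.fields().len()
        && original
            .fields()
            .iter()
            .zip(candidate.fields().iter())
            .all(|(original_field, candidate_field)| {
                original_field.name() == candidate_field.name()
                    && original_field.data_type() == candidate_field.data_type()
                    && (original_field.is_nullable() || !candidate_field.is_nullable())
            })
}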
Wondering, do we really need to compare metadata? If it works for now we can keep it, but since metadata is not strongly typed, in fact just a HashMap<String, String>, it might be an issue if someone decides to store something in there from the logical or physical schema.
I agree.
Note that the bottom line, aka the original behavior, is the schema Eq check, which includes a metadata equality check.
In this PR I wanted to relax nullability checks only.
👍🏻