bug: improve schema checking for insert into cases #14572

Conversation
Can you explain the reason for the change in test.slt? Thanks.
datafusion/common/src/dfschema.rs
Outdated
// 1. The len of the schema of the plan and the schema of the table should be the same
// 2. The nullable flag of the schema of the plan and the schema of the table should be the same
// 3. The datatype of the schema of the plan and the schema of the table should be the same
fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<(), String> {
Why not Result<bool>?
Originally I used Result<bool>, but I want to return three different error messages for the different cases, so I changed to Result<(), String>.
You can also define different messages with internal_err!("msg1"), internal_err!("msg2")
Thank you @jayzhan211 for this good suggestion. I changed my code to use Result<()>; it seems sufficient for this case, similar to many other cases, for example:
/// Check if the schema have some fields with the same name
pub fn check_names(&self) -> Result<()> {
}
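For illustration, here is a minimal sketch of the pattern being discussed: return Result<()> and give each failure mode its own error via datafusion_common's plan_err! macro. check_field_count and its message text are illustrative, not the PR's exact code.

use datafusion_common::{plan_err, Result};

// Return Result<()> and produce a distinct error per failure mode,
// instead of collapsing every case into a single bool.
fn check_field_count(expected: usize, actual: usize) -> Result<()> {
    if expected != actual {
        return plan_err!(
            "Inserting query must have the same schema length as the table. \
             Expected {expected} fields, got {actual}"
        );
    }
    Ok(())
}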
datafusion/common/src/dfschema.rs
Outdated
f1.name() == f2.name()
    && DFSchema::datatype_is_logically_equal(
.try_for_each(|(f1, f2)| {
    if f1.is_nullable() != f2.is_nullable() {
If the field is nullable, we can still insert a non-null column. Similar to #14519.
This seems like a regression to me 🤔. Even though the schema of a source is nullable, all of its data can be non-null, and in such cases it can still be inserted into a non-nullable sink. When inserting, we currently validate against the actual data rather than the schema. See check_not_null_constraints.
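For context, a minimal sketch of that idea using arrow's RecordBatch API (check_batch_not_null is a hypothetical helper, not DataFusion's actual check_not_null_constraints):

use arrow::array::Array;
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;

// Validate the data in each batch rather than the declared schema, so a
// nullable source that happens to contain no nulls can still be inserted
// into a non-nullable sink.
fn check_batch_not_null(batch: &RecordBatch, sink_schema: &Schema) -> Result<(), String> {
    for (i, field) in sink_schema.fields().iter().enumerate() {
        if !field.is_nullable() && batch.column(i).null_count() > 0 {
            return Err(format!(
                "non-null constraint violated for column '{}'",
                field.name()
            ));
        }
    }
    Ok(())
}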
If 'DataSink receiving different schemas' is an issue, we can rewrite the schema of the batches emitted by DataSinkExec.
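A sketch of that alternative using arrow's RecordBatch::try_new (rewrite_batch_schema is a hypothetical helper, not existing DataFusion code):

use std::sync::Arc;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Re-wrap each batch with the sink's declared schema; try_new re-validates
// the columns (count, row length, and data types) against the target fields.
fn rewrite_batch_schema(
    batch: RecordBatch,
    sink_schema: Arc<Schema>,
) -> Result<RecordBatch, ArrowError> {
    RecordBatch::try_new(sink_schema, batch.columns().to_vec())
}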
Thank you @jayzhan211 and @jonahgao for the review, this is a good point. I changed it to the only error case for the nullability check:
// only check the case when the table field is not nullable and the insert data field is nullable
@@ -78,7 +104,7 @@ physical_plan
query I
INSERT INTO table_without_values SELECT
SUM(c4) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
NULLIF(COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 0)
Why do we need NULLIF? Does its use indicate a potential issue?
This regression no longer happens after the above code changes.
Force-pushed from 0e7dac6 to af496fb
@@ -81,11 +77,9 @@ STORED AS arrow
LOCATION 'test_files/scratch/insert_to_external/arrow_dict_partitioned/'
PARTITIONED BY (b);

query I
query error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'b' nullability: false, got field: 'b', nullability: true
This is strange; it means PARTITIONED BY (b) will make the field 'b' nullability: false? This is the only case that differs when PARTITIONED BY is involved.
@@ -228,7 +228,7 @@ CREATE TABLE aggregate_test_100_null (
c11 FLOAT
);

statement ok
statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
This is the only regression in the slt, I think. cc @jayzhan211 @jonahgao
# Setup test data table
statement ok
CREATE EXTERNAL TABLE aggregate_test_100 (
c1 VARCHAR NOT NULL,
c2 TINYINT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT,
c5 INT,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 INT UNSIGNED NOT NULL,
c10 BIGINT UNSIGNED NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '../../testing/data/csv/aggregate_test_100.csv'
OPTIONS ('format.has_header' 'true');
statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT NOT NULL,
c3 SMALLINT,
c11 FLOAT
);
statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
INSERT INTO aggregate_test_100_null
SELECT
c2,
c5,
CASE WHEN c1 = 'e' THEN NULL ELSE c3 END as c3,
CASE WHEN c1 = 'a' THEN NULL ELSE c11 END as c11
FROM aggregate_test_100;
I think the original behaviour is wrong, because the target table column is not nullable.
statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT,
Noted. I also added the successful case, where the table field c5 is nullable.
Thank you for the review @jayzhan211. I have updated the slt now and added notes for the only 2 different results for the SQL.
Force-pushed from 0071247 to f648838
insert into table_without_values(field2) values(300);
----
This is because we now have the check in the insert-into plan; before this we only had the check at the execution-plan level. The CI error is caused by:
insert into dictionary_encoded_parquet_partitioned
select * from dictionary_encoded_values
----
2
DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table. Expected table field 'b' nullability: false, got field: 'b', nullability: true
This is also expected, because PARTITIONED BY (b) will set b's nullability to false. We shouldn't support inserting nullable values for a partition key, I think.
Error should be placed after query error to pass CI.
"because PARTITIONED BY (b) will make b's nullability false" can be added as a comment.
Added comments in the latest PR, thanks! @jayzhan211
Regarding "Error should be after query error to pass CI": I think it was auto-generated after the PR here, for example:
It was auto-generated a long time ago; we need to move it manually.
But why is CI green 🤔
@jayzhan211 This https://github.com/apache/datafusion/pull/14439/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196R273 was merged less than a week ago. I think it starts from there. I also ran the local sqllogictest with -- --complete and got the same result; I guess the CI also uses it to generate and verify?
We may need a follow-up issue to investigate it.
This is the multiline error feature of sqllogictest-rs.
https://github.com/risinglightdb/sqllogictest-rs/blob/7ee44cd995fb65175bb647d07f63b557dbaa22c7/CHANGELOG.md#0180---2023-11-08
datafusion/common/src/dfschema.rs
Outdated
.zip(other.fields().iter())
.try_for_each(|(f1, f2)| {
    // only check the case when the table field is not nullable and the insert data field is nullable
    if !f1.is_nullable() && f2.is_nullable() {
This condition would prevent the following query from executing, but it works on both the main branch and Postgres.
create table t1(a int not null);
create table t2(a int);
insert into t2 values(100);
insert into t1 select * from t2;
As I mentioned earlier, we already have a check during execution called check_not_null_constraints, so I think we should not add this restriction here.
Thanks @jonahgao, got it now; this is a good example to explain it.
I'm not sure if it's necessary to ensure that the schema of output batches has the same nullability. This issue exists not only with inserts but also with other queries like UNION.
DataFusion CLI v45.0.0
> create table t1(a int not null) as values(1);
0 row(s) fetched.
Elapsed 0.009 seconds.
> create table t2(a int) as values(2);
0 row(s) fetched.
Elapsed 0.011 seconds.
> select * from t1 union all select * from t2;
batch schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
batch schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
+---+
| a |
+---+
| 1 |
| 2 |
+---+
2 row(s) fetched.
Elapsed 0.007 seconds.
If it is necessary, perhaps we should rewrite the nullability of the output batches instead of restricting the input schemas to have the same nullability, as the latter could prevent some queries from being executed.
DataFusion CLI v45.0.0
> create table t1(a int not null) as values(1);
0 row(s) fetched.
Elapsed 0.012 seconds.
> create table t2(a int) as values(null);
0 row(s) fetched.
Elapsed 0.003 seconds.
> select * from t1 union all select * from t2;
+------+
| a |
+------+
| NULL |
| 1 |
+------+
2 row(s) fetched.
Elapsed 0.004 seconds.
Thanks @jonahgao, got it now. We may need to unify the output schema when any input has a nullable field; for example, in the above case, we need to make sure the output schema is nullable.
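A sketch of that unification, assuming a UNION whose inputs already agree on field count and order (unify_nullability is a hypothetical helper, not an existing DataFusion API):

use arrow::datatypes::{Field, Schema};

// An output field is nullable if the corresponding field of any input is
// nullable, so batches from every input fit the unified schema.
fn unify_nullability(inputs: &[Schema]) -> Schema {
    let fields: Vec<Field> = inputs[0]
        .fields()
        .iter()
        .enumerate()
        .map(|(i, f)| {
            let any_nullable = inputs.iter().any(|s| s.field(i).is_nullable());
            f.as_ref().clone().with_nullable(any_nullable)
        })
        .collect();
    Schema::new(fields)
}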
Updated. It seems several union issues are also related to nullability for UNION:
#14352
.try_for_each(|(f1, f2)| {
    if f1.name() != f2.name() || !DFSchema::datatype_is_logically_equal(f1.data_type(), f2.data_type()) {
        _plan_err!(
            "Inserting query schema mismatch: Expected table field '{}' with type {:?}, \
Based on the function name, it seems that we haven't restricted it to only be used for insertion 🤔.
Checked: the function is only called by insert-into cases.
Perhaps we could update the comments to reflect this
Added comments in latest PR, thanks all.
This looks like an improvement to me -- thank you @zhuqi-lucas and @jonahgao
match write_df
    .write_table("t", DataFrameWriteOptions::new())
    .await
{
    Ok(_) => {}
    Err(e) => {
        assert_contains!(
            e.to_string(),
            "Inserting query must have the same schema length as the table."
        );
    }
}
I think you can write this much more concisely using unwrap_err
Suggested change:
let e = write_df
    .write_table("t", DataFrameWriteOptions::new())
    .await
    .unwrap_err();
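Together with the existing assert_contains!, the test body would then read:

let e = write_df
    .write_table("t", DataFrameWriteOptions::new())
    .await
    .unwrap_err();
assert_contains!(
    e.to_string(),
    "Inserting query must have the same schema length as the table."
);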
The same comment applies to the code below as well
Good idea, thanks @alamb, addressed in the latest PR.
let initial_table = Arc::new(MemTable::try_new(schema.clone(), vec![vec![]])?);
session_ctx.register_table("t", initial_table.clone())?;

// There are three cases we need to check
Suggested change:
// There are two cases we need to check
Good catch! Addressed in the latest PR.
Thanks @alamb for the review; addressed the comments in the latest PR. For the potential union schema checking improvement, #14572 (comment), we may create a new issue to discuss it.
🚀
Which issue does this PR close?
Describe the bug
In #14394, it was reported that, while attempting to implement a DataSink, the record batches being delivered had schemas different from the one declared by the RecordBatchStream.
A fix for the given example, an INSERT INTO ... VALUES query, was merged (#14472). However, this issue likely arises whenever the schema of the source of an INSERT statement contains fields that differ from the table schema in terms of nullability. That is, the problem is not limited to INSERT INTO ... VALUES statements.
What changes are included in this PR?
Add a separate nullability check besides the original check, which only covered the name and datatype.
Improve the error message to include more info about the error.
We improve the checking for the 3 cases and also improve the error message.
There are three cases we need to check; case 2 (the nullable flag of the plan schema and the table schema should be the same) is not needed at plan time, since we already check it during execution.
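Putting the pieces together, a hedged sketch of the plan-time check this PR converges on, written as if inside dfschema.rs (check_insert_schema is an illustrative name; see logically_equivalent_names_and_types for the real code). Nullability is deliberately left to the execution-time check_not_null_constraints:

use datafusion_common::{plan_err, DFSchema, Result};

// Plan-time validation for INSERT: field counts must match, and each field
// must match by name and logically-equal datatype.
fn check_insert_schema(table: &DFSchema, input: &DFSchema) -> Result<()> {
    if table.fields().len() != input.fields().len() {
        return plan_err!(
            "Inserting query must have the same schema length as the table."
        );
    }
    for (f1, f2) in table.fields().iter().zip(input.fields().iter()) {
        if f1.name() != f2.name()
            || !DFSchema::datatype_is_logically_equal(f1.data_type(), f2.data_type())
        {
            return plan_err!(
                "Inserting query schema mismatch: Expected table field '{}' with type {:?}, \
                 but got '{}' with type {:?}",
                f1.name(),
                f1.data_type(),
                f2.name(),
                f2.data_type()
            );
        }
    }
    Ok(())
}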
Are these changes tested?
Yes
Are there any user-facing changes?
No