
DataSink::write_all given invalid RecordBatchStream #14394

Closed
Tracked by #14123
gatesn opened this issue Jan 31, 2025 · 11 comments · Fixed by #14472
Labels
bug Something isn't working

Comments

@gatesn
Contributor

gatesn commented Jan 31, 2025

Describe the bug

We were implementing a DataSink and found that the record batches we were given had a different schema than the one reported by the RecordBatchStream.

    STREAM DTYPE Schema { fields: [Field { name: "c1", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
    RB SCHEMA: Schema { fields: [Field { name: "c1", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }

    STREAM DTYPE Schema { fields: [Field { name: "c1", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
    RB SCHEMA: Schema { fields: [Field { name: "c1", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "c2", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }

The first (correct) invocation comes from assembling a logical plan with LogicalPlanBuilder::insert_into

The second (incorrect) invocation comes from session.sql("INSERT INTO my_tbl VALUES ('hello', 42::INT);")

I tried adding an assertion to the RecordBatchStreamAdapter, and it looks like ~12 tests fail on main right now with mismatched schemas. I wonder if it's worth adding that as a debug assertion?

https://github.com/gatesn/datafusion/pull/new/ngates/record-batch-stream-schema
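Roughly what such a per-batch debug assertion could look like (a sketch; the branch above has the actual change):

    fn debug_check_batch(
        batch: &arrow::record_batch::RecordBatch,
        stream_schema: &arrow::datatypes::SchemaRef,
    ) {
        // Compiled out in release builds; fires whenever a stream yields a
        // batch whose schema disagrees with the schema the stream advertises.
        debug_assert_eq!(
            batch.schema().as_ref(),
            stream_schema.as_ref(),
            "RecordBatch schema does not match RecordBatchStream schema"
        );
    }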

cc @AdamGS

To Reproduce

No response

Expected behavior

No response

Additional context

No response

gatesn added the bug label on Jan 31, 2025
@ozankabak
Contributor

Is this a bug in DataSink code, or is it a downstream bug that becomes somewhat hard to notice because there is no debug_assert?

@gatesn
Contributor Author

gatesn commented Feb 2, 2025

I imagine it's downstream, and of course the debug assert only catches bad RecordBatchStream impls that use the Adapter.

This specific bug may not even be caught by the debug assert if it doesn't use the Adapter somewhere.

@zhuqi-lucas
Contributor

Thank you @gatesn for the report.

Can you provide the full code or SQL to reproduce it, so we can solve it more quickly?

@gatesn
Contributor Author

gatesn commented Feb 2, 2025

Yes, this code fails the schema assertion:

https://github.com/apache/datafusion/compare/main...gatesn:datafusion:ngates/record-batch-stream-schema?expand=1#diff-9b1672adeba35025e24d21f8d7da2f0e87487231775c5e470edf56886f90c837R811-R833
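For reference, a minimal standalone reproduction along the same lines (a sketch assuming a MemTable-backed table; the table name follows the SQL in the report):

    use std::sync::Arc;
    use arrow::datatypes::{DataType, Field, Schema};
    use datafusion::datasource::MemTable;
    use datafusion::prelude::SessionContext;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        // Table schema with non-nullable columns, as in the report.
        let schema = Arc::new(Schema::new(vec![
            Field::new("c1", DataType::Utf8, false),
            Field::new("c2", DataType::Int32, false),
        ]));
        let table = MemTable::try_new(schema.clone(), vec![vec![]])?;

        let ctx = SessionContext::new();
        ctx.register_table("my_tbl", Arc::new(table))?;

        // The VALUES source is planned with all-nullable fields, so the
        // batches handed to the sink disagree with the stream's schema.
        ctx.sql("INSERT INTO my_tbl VALUES ('hello', 42)")
            .await?
            .collect()
            .await?;
        Ok(())
    }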

@alamb
Contributor

alamb commented Feb 3, 2025

I wonder if it's worth adding that as a debug assertion?

I agree this would be good to add -- most similar checks in DataFusion raise a DataFusionError::Internal if hit.
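In that style, the check might look something like this (a hedged sketch, not existing DataFusion code):

    use datafusion_common::{internal_err, Result};

    fn check_batch_schema(
        batch: &arrow::record_batch::RecordBatch,
        expected: &arrow::datatypes::SchemaRef,
    ) -> Result<()> {
        // Unlike a debug_assert, this runs in release builds and surfaces
        // the mismatch as DataFusionError::Internal.
        if batch.schema().as_ref() != expected.as_ref() {
            return internal_err!(
                "batch schema {:?} does not match stream schema {:?}",
                batch.schema(),
                expected
            );
        }
        Ok(())
    }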

@ozankabak
Contributor

The only thing I would be worried about is the potential overhead. I get the feeling that this is something that should be done "once", but I'm not sure how we can do it. Maybe we can utilize the invariant and sanity checkers to check the contract somehow. Let's think about this.

@zhuqi-lucas
Contributor

zhuqi-lucas commented Feb 3, 2025

The second (incorrect) invocation comes from session.sql("INSERT INTO my_tbl VALUES ('hello', 42::INT);")

    LogicalPlan::Dml(DmlStatement {
        table_name,
        op: WriteOp::Insert(insert_op),
        ..
    }) => {
        let name = table_name.table();
        let schema = session_state.schema_for_ref(table_name.clone())?;
        if let Some(provider) = schema.table(name).await? {
            let input_exec = children.one()?;
            provider
                .insert_into(session_state, input_exec, *insert_op)
                .await?
        } else {
            return exec_err!("Table '{table_name}' does not exist");
        }
    }

I think we could add a check during physical plan generation: we have the table schema and we have the input_exec to insert, so we could compare them here?
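For illustration, the check might take a shape like this (a hedged sketch against the Dml(Insert) arm above, not actual planner code):

    use datafusion_common::{internal_err, Result};

    fn check_insert_schemas(
        table_schema: &arrow::datatypes::Schema,
        input_schema: &arrow::datatypes::Schema,
    ) -> Result<()> {
        // Compare nullability field-by-field before calling insert_into.
        for (table_field, input_field) in
            table_schema.fields().iter().zip(input_schema.fields())
        {
            if table_field.is_nullable() != input_field.is_nullable() {
                return internal_err!(
                    "field '{}' nullability mismatch: table declares {}, input produces {}",
                    table_field.name(),
                    table_field.is_nullable(),
                    input_field.is_nullable()
                );
            }
        }
        Ok(())
    }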

@gatesn
Contributor Author

gatesn commented Feb 3, 2025

Is this overhead a problem if it's only a debug_assertion? It will be compiled out in release builds. Or are you looking for a release assertion to put somewhere?

@rkrishn7
Contributor

rkrishn7 commented Feb 4, 2025

I poked around and I think the specific issue here is that the schema assigned to LogicalPlan::Values during planning defaults all of its fields to nullable.

I've opened a PR to capture nullability from the table schema for the specified columns.
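One way to observe this directly (a hypothetical snippet, reusing the ctx from the reproduction sketch above):

    // Every field of a bare VALUES plan reports nullable = true, even for
    // plain literals, which is where the mismatched batch schema originates.
    let df = ctx.sql("VALUES ('hello', 42)").await?;
    for field in df.schema().fields() {
        println!("{}: nullable = {}", field.name(), field.is_nullable());
    }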

@jonahgao
Member

jonahgao commented Feb 6, 2025

I am concerned that union may also lead to similar behavior when the nullability of the two inputs differs.
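To make the concern concrete, a hypothetical example (table names invented):

    // If non_nullable_src has non-nullable columns and nullable_src does
    // not, the unioned schema is planned as nullable, so an INSERT into a
    // non-nullable table would hit the same mismatch as the VALUES case.
    ctx.sql(
        "INSERT INTO my_tbl \
         SELECT c1, c2 FROM non_nullable_src \
         UNION ALL \
         SELECT c1, c2 FROM nullable_src",
    )
    .await?;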

@rkrishn7
Contributor

rkrishn7 commented Feb 7, 2025

@jonahgao Thank you for calling this out. I think you're right!

In fact, more generally, this issue arises whenever the source schema of an INSERT statement contains fields that differ from the table schema in nullability.

I think a better approach may be to map the source plan's schema to a new schema whose field nullability matches the table schema. We could do this directly when planning the insert, after constructing the source LogicalPlan, in insert_to_plan. A sketch of that mapping follows below.
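A minimal sketch of that mapping (names hypothetical; with_nullable is the arrow-rs Field builder):

    use std::sync::Arc;
    use arrow::datatypes::{Field, Schema, SchemaRef};

    // Rebuild the source schema so each field inherits the nullability of
    // the corresponding table field; the insert plan can then project/cast
    // its input to this schema.
    fn align_nullability(source: &Schema, table: &Schema) -> SchemaRef {
        let fields: Vec<Field> = source
            .fields()
            .iter()
            .zip(table.fields())
            .map(|(src, tbl)| src.as_ref().clone().with_nullable(tbl.is_nullable()))
            .collect();
        Arc::new(Schema::new(fields))
    }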

I opened a new issue for this here - #14550

Would love a sanity check from you to verify the above makes sense!
