DataSink::write_all given invalid RecordBatchStream #14394
Is this a bug in DataFusion itself, or in downstream code?
I imagine it's downstream. And of course, the debug assert only catches bad RecordBatchStream impls that use the Adapter; this specific bug may not even be caught by the debug assert if the stream doesn't go through the Adapter somewhere.
Thank you @gatesn for the report. Can you provide the full code or SQL to reproduce it, so we can solve it more quickly?
I agree this would be good to add -- most similar checks in DataFusion do the check and raise an internal error.
The only thing I would be worried about is the potential overhead. I get the feeling that this is something that should be done "once", but I'm not sure how we can do it. Maybe we can utilize the invariant and sanity checkers to check the contract somehow. Let's think about this.
> The second (incorrect) invocation comes from `session.sql("INSERT INTO my_tbl VALUES ('hello', 42::INT);")`

The relevant match arm in the physical planner:

```rust
LogicalPlan::Dml(DmlStatement {
    table_name,
    op: WriteOp::Insert(insert_op),
    ..
}) => {
    let name = table_name.table();
    let schema = session_state.schema_for_ref(table_name.clone())?;
    if let Some(provider) = schema.table(name).await? {
        let input_exec = children.one()?;
        provider
            .insert_into(session_state, input_exec, *insert_op)
            .await?
    } else {
        return exec_err!("Table '{table_name}' does not exist");
    }
}
```

I think we may be able to add a check during physical plan generation: we have the table schema here, and we also have the `input_exec` to insert, so we could check at this point?
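A minimal sketch of what such a planning-time check might look like. `check_insert_schema` is a hypothetical helper name, not an existing DataFusion API; it only illustrates the comparison proposed above:

```rust
use arrow_schema::Schema;
use datafusion_common::{exec_err, Result};

/// Hypothetical helper: verify that the schema produced by the insert's
/// source plan is compatible with the target table's schema before calling
/// `insert_into`. A sketch of the idea only, not DataFusion's actual code.
fn check_insert_schema(table_schema: &Schema, input_schema: &Schema) -> Result<()> {
    if table_schema.fields().len() != input_schema.fields().len() {
        return exec_err!(
            "Insert source has {} columns but the table has {}",
            input_schema.fields().len(),
            table_schema.fields().len()
        );
    }
    for (table_field, input_field) in
        table_schema.fields().iter().zip(input_schema.fields().iter())
    {
        // Types must match exactly, and a nullable source column cannot be
        // written into a non-nullable table column.
        if table_field.data_type() != input_field.data_type()
            || (!table_field.is_nullable() && input_field.is_nullable())
        {
            return exec_err!(
                "Field '{}' of the insert source is incompatible with the table schema",
                input_field.name()
            );
        }
    }
    Ok(())
}
```

The same comparison could presumably live in the invariant and sanity checkers mentioned above if a one-time check is preferred.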
Is this overhead a problem if it's only a `debug_assert`? It will be compiled out in release builds. Or are you looking for a release assertion to put somewhere?
Poked around, and I think the specific issue here is due to the fact that the schema assigned to the source of the INSERT marks its columns as nullable regardless of the table schema. I've opened a PR to capture nullability from the table schema for the specified columns.
I am concerned that `UNION` may also lead to similar behavior when the nullability of its two inputs differs.
@jonahgao Thank you for calling this out. I think you're right! In fact, I think we can say more generally that this issue arises whenever the schema of the source of an INSERT disagrees with the table schema on field nullability.

I think a better approach may be to map the schema of the source plan to a new schema that has parity in field nullability with the table schema. We could do this directly when planning the insert, after constructing the source plan.

I opened a new issue for this here - #14550

Would love a sanity check from you to verify the above makes sense!
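As a rough illustration of that mapping (`align_nullability` is a hypothetical helper; #14550 tracks the real design):

```rust
use std::sync::Arc;
use arrow_schema::{Field, Schema, SchemaRef};

/// Hypothetical sketch of the approach described above: build a new schema
/// whose fields keep the source plan's names and types but take their
/// nullability from the table schema. Assumes both schemas have the same
/// number of columns in the same order.
fn align_nullability(source: &Schema, table: &Schema) -> SchemaRef {
    let fields: Vec<Field> = source
        .fields()
        .iter()
        .zip(table.fields().iter())
        .map(|(src, tbl)| src.as_ref().clone().with_nullable(tbl.is_nullable()))
        .collect();
    Arc::new(Schema::new(fields))
}
```

Because the mapping happens once at plan time, it would avoid any per-batch overhead at execution.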
Describe the bug
We were trying to implement a `DataSink` and found that we were being given record batches with a different schema than the one reported by the `RecordBatchStream`.

The first (correct) invocation comes from assembling a logical plan with `LogicalPlanBuilder::insert_into`.

The second (incorrect) invocation comes from `session.sql("INSERT INTO my_tbl VALUES ('hello', 42::INT);")`.
I figured I'd add an assertion into the RecordBatchStreamAdapter, and it looks like ~12 tests fail on main right now with mismatched schemas. I wonder if it's worth adding that as a debug assertion?
https://github.com/gatesn/datafusion/pull/new/ngates/record-batch-stream-schema
cc @AdamGS
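For reference, a minimal sketch of the kind of schema-checked stream the branch above experiments with. `SchemaCheckedStream` is a hypothetical type, not the actual `RecordBatchStreamAdapter` code:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use datafusion_common::Result;
use futures::{Stream, StreamExt};

/// Hypothetical wrapper stream: in debug builds, assert that every batch
/// yielded by the inner stream matches the schema the stream advertises.
/// The `debug_assert_eq!` compiles out of release builds, so it adds no
/// overhead in production.
struct SchemaCheckedStream<S> {
    schema: SchemaRef,
    inner: S,
}

impl<S> Stream for SchemaCheckedStream<S>
where
    S: Stream<Item = Result<RecordBatch>> + Unpin,
{
    type Item = Result<RecordBatch>;

    fn poll_next(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Self::Item>> {
        let poll = self.inner.poll_next_unpin(cx);
        if let Poll::Ready(Some(Ok(batch))) = &poll {
            debug_assert_eq!(
                batch.schema(),
                self.schema,
                "RecordBatch schema does not match the stream's advertised schema"
            );
        }
        poll
    }
}
```

A real adapter would also implement DataFusion's `RecordBatchStream` trait so it can expose its `schema()`; this sketch only shows where the assertion would sit.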
To Reproduce
No response
Expected behavior
No response
Additional context
No response