Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper NULL handling in array functions #14451

Open
jkosh44 opened this issue Feb 3, 2025 · 7 comments
Open

Proper NULL handling in array functions #14451

jkosh44 opened this issue Feb 3, 2025 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@jkosh44
Copy link
Contributor

jkosh44 commented Feb 3, 2025

Describe the bug

The following array functions do not properly handle NULL input, they either return an error or panic (non-exhaustive):

  • array_sort
  • array_replace
  • array_replace_all
  • array_replace_n
  • array_resize

Discovered in #14289

To Reproduce

Scalar input

> SELECT array_sort([1,2,3], NULL);
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

> SELECT array_resize([1,2,3], NULL, 0);
Internal error: could not cast value to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Int64Type>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

> SELECT array_replace([1,2,3], NULL, NULL);
thread 'main' panicked at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-54.0.0/src/transform/mod.rs:418:13:
assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Int64
 right: Null
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:373:5
   4: arrow_data::transform::MutableArrayData::with_capacities
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-54.0.0/src/transform/mod.rs:418:13
   5: datafusion_functions_nested::replace::general_replace
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:303:23
   6: datafusion_functions_nested::replace::array_replace_inner
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:394:13
   7: core::ops::function::Fn::call
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:79:5
   8: datafusion_functions_nested::utils::make_scalar_function::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/utils.rs:72:22
   9: <datafusion_functions_nested::replace::ArrayReplace as datafusion_expr::udf::ScalarUDFImpl>::invoke_batch
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:123:9
  10: datafusion_expr::udf::ScalarUDFImpl::invoke_with_args
             at /home/joe/Projects/datafusion/datafusion/expr/src/udf.rs:643:9
  11: datafusion_expr::udf::ScalarUDF::invoke_with_args
             at /home/joe/Projects/datafusion/datafusion/expr/src/udf.rs:237:9
  12: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate
             at /home/joe/Projects/datafusion/datafusion/physical-expr/src/scalar_function.rs:195:22
  13: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:639:29
  14: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::f_up
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:529:30
  15: datafusion_common::tree_node::TreeNode::rewrite::{{closure}}::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:183:13
  16: datafusion_common::tree_node::Transformed<T>::transform_parent
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:763:44
  17: datafusion_common::tree_node::TreeNode::rewrite::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:28:9
  18: stacker::maybe_grow
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/stacker-0.1.17/src/lib.rs:55:9
  19: datafusion_common::tree_node::TreeNode::rewrite
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:177:50
  20: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify_with_cycle_count
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:211:17
  ...
  71: tokio::runtime::runtime::Runtime::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:340:13
  72: datafusion_cli::main
             at ./src/main.rs:131:5
  73: core::ops::function::FnOnce::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
killed

Process finished with exit code 9

Array Input

> CREATE TABLE t (a int[], b int, c int);
0 row(s) fetched. 
Elapsed 0.004 seconds.

> INSERT INTO t VALUES ([1,2,3], 1, 2), (NULL, 1, 2), ([1,2,3], NULL, 2), ([1,2,3], 1, NULL);
+-------+
| count |
+-------+
| 4     |
+-------+
1 row(s) fetched. 
Elapsed 0.006 seconds.

> SELECT array_sort(a, b, c) FROM t;
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
> SELECT array_resize(a, b, c) FROM t;
Internal error: could not cast value to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Int64Type>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Expected behavior

The functions should either return NULL or correctly handle the NULL input without panicking.

Additional context

I briefly investigated the functions and found the following:

array_sort

To fix this function we can add NullHandling::Propagate to theimpl ScalarUDFImpl for ArraySort. Then we also have to update array_sort_inner to return NULL on NULL inputs.

array_resize

I haven't looked into this one yet, but I think the solution is probably similar to array_sort.

array_replace

Note this applies to array_replace_all and array_replace_n.

This function actually does not want to propagate NULLs, it should be able to find NULL elements and replace with NULL elements. When the inputs are ArrayRefs then actually everything works correctly, it's only when the inputs are Scalar that things break. The issue is that the functions accepts DataType::Null instead of a null value for the list element type. Then we get type errors when trying to use the DataType::Null with a typed list.

I believe if we updated the signature from Signature::any(3, Volatility::Immutable) to something that includes type information then we would fix the panic. However, we face a familiar problem that there is no way to represent (list, element-type, element-type, i64) with the current Signature struct.

@jkosh44 jkosh44 added the bug Something isn't working label Feb 3, 2025
@alan910127
Copy link
Contributor

take

@alan910127
Copy link
Contributor

alan910127 commented Feb 6, 2025

Hello @jkosh44, what do you think about this?

array_resize should return NULL only if the second argument (new length) is NULL. If the third argument (new element) is NULL, the new slots (if any) should be set to NULL. This behavior would make array_resize(array, new_len) equivalent to array_resize(array, new_len, NULL), which I think makes more sense.

Also, since null_handling makes array_sort return NULL for NULL inputs, should we still handle NULL cases in array_sort_inner?

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 6, 2025

Hello @jkosh44, what do you think about this?

array_resize should return NULL only if the second argument (new length) is NULL. If the third argument (new element) is NULL, the new slots (if any) should be set to NULL. This behavior would make array_resize(array, new_len) equivalent to array_resize(array, new_len, NULL), which I think makes more sense.

Good point, I don't think we can use NullHandling here and will just have to handle the NULL manually. The DuckDB behavior matches this as well.

duckdb> select array_resize([1,2,3], 4, NULL);
┌─────────────────────────────────────────────────┐
│ array_resize(main.list_value(1, 2, 3), 4, NULL) │
╞═════════════════════════════════════════════════╡
│ [1, 2, 3, ]                                     │
└─────────────────────────────────────────────────┘

I think if we update the signature of array_resize to be more descriptive, then the error will go away. EDIT: Actually I'm not sure if this is true or not.

Also, since null_handling makes array_sort return NULL for NULL inputs, should we still handle NULL cases in array_sort_inner?

Yes, we should make sure that null handling is consistent for ColumnarValue::Scalar(_) and ColumnarValue::Array(_) inputs. As an FYI, I'm thinking that we should actually remove the NullHandling enum. I wrote up my rationale here: #14289 (comment)

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 6, 2025

@alan910127 I have a very rough proposal, that might help you with this issue. When working with ArrayFunctionSignature I found it very difficult to express my desired signature. Currently it looks like this:

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub enum ArrayFunctionSignature {
/// Specialized Signature for ArrayAppend and similar functions
/// The first argument should be List/LargeList/FixedSizedList, and the second argument should be non-list or list.
/// The second argument's list dimension should be one dimension less than the first argument's list dimension.
/// List dimension of the List/LargeList is equivalent to the number of List.
/// List dimension of the non-list is 0.
ArrayAndElement,
/// Specialized Signature for ArrayPrepend and similar functions
/// The first argument should be non-list or list, and the second argument should be List/LargeList.
/// The first argument's list dimension should be one dimension less than the second argument's list dimension.
ElementAndArray,
/// Specialized Signature for Array functions of the form (List/LargeList, Index+)
/// The first argument should be List/LargeList/FixedSizedList, and the next n arguments should be Int64.
ArrayAndIndexes(NonZeroUsize),
/// Specialized Signature for Array functions of the form (List/LargeList, Element, Optional Index)
ArrayAndElementAndOptionalIndex,
/// Specialized Signature for ArrayEmpty and similar functions
/// The function takes a single argument that must be a List/LargeList/FixedSizeList
/// or something that can be coerced to one of those types.
Array,
/// A function takes a single argument that must be a List/LargeList/FixedSizeList
/// which gets coerced to List, with element type recursively coerced to List too if it is list-like.
RecursiveArray,
/// Specialized Signature for MapArray
/// The function takes a single argument that must be a MapArray
MapArray,
}

However, what if we modified it to look like this:

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub struct ArrayFunctionSignature {
    args: Vec<ArrayFunctionArg>,
}

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
pub enum ArrayFunctionArg {
    Element,
    Index,
    Array,
    RecursiveArray,
    MapArray,
    DataType{
        data_type: DataType,
        name: &'static str
    },
}

I may try to put a PoC together today.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 6, 2025

Also FWIW, it seems like the issue with some of these functions is with passing in any incorrect type, not just NULL. For example,

> select array_resize([1,2,3,4], 'hello');
Internal error: could not cast value to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Int64Type>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

which makes me think that fixing the signatures is the correct way to address this.

jkosh44 added a commit to jkosh44/datafusion that referenced this issue Feb 6, 2025
This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants work, so they use
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

Works towards resolving apache#14451
jkosh44 added a commit to jkosh44/datafusion that referenced this issue Feb 6, 2025
This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants work, so they use
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

Works towards resolving apache#14451
@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 6, 2025

@alan910127 I have a very rough proposal, that might help you with this issue

OK I put this together here: #14532, if you're interested in help reviewing. It fixes array_resize, but doesn't touch any of the other array functions. If people are happy with that change, then you should be able to build off of that to fix the rest of the array functions.

EDIT: Or if you're interested you can help debug the test failure in the PR because I'm done working on it for the day.

jkosh44 added a commit to jkosh44/datafusion that referenced this issue Feb 6, 2025
This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants worked, so they used
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of
all, there's no way to describe a function that has multiple different
arrays of different type or dimension. Additionally, there isn't
support for functions with map arrays and recursive arrays that have
more than one argument.

Works towards resolving apache#14451
@alan910127
Copy link
Contributor

@jkosh44 Thanks for all this information! I’ll take a look at the PR later and likely review it. If I have any thoughts or suggestions, I’ll leave comments there.

jkosh44 added a commit to jkosh44/datafusion that referenced this issue Feb 10, 2025
This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants worked, so they used
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of
all, there's no way to describe a function that has multiple different
arrays of different type or dimension. Additionally, there isn't
support for functions with map arrays and recursive arrays that have
more than one argument.

Works towards resolving apache#14451
jkosh44 added a commit to jkosh44/datafusion that referenced this issue Feb 11, 2025
This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants worked, so they used
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of
all, there's no way to describe a function that has multiple different
arrays of different type or dimension. Additionally, there isn't
support for functions with map arrays and recursive arrays that have
more than one argument.

Works towards resolving apache#14451
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants