function: Allow more expressive array signatures #14532

jkosh44 · 2025-02-06T21:34:15Z

This commit allows for more expressive array function signatures. Previously, ArrayFunctionSignature was an enum of potential argument combinations and orders. For many array functions, none of the ArrayFunctionSignature variants work, so they use TypeSignature::VariadicAny instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.

As an example, this commit also updates the signature of the array_replace family of functions to use a new expressive signature, which removes a panic that existed previously.

Works towards resolving #14451

Which issue does this PR close?

Works towards closing, but doesn't fully close, #14451

Are these changes tested?

Yes

Are there any user-facing changes?

No, other than removing some panics.

jkosh44 · 2025-02-07T02:59:26Z

This is failing CI for the following reason.

Previously, in get_valid_types(), functions with the array signature ArrayAndIndexes or Array would convert only top-level FixedSizeLists to Lists; nested FixedSizeLists would not be changed. However, functions with the array signature ArrayAndElement, ElementAndArray, or ArrayAndElementAndOptionalIndex would recursively convert FixedSizeLists to Lists.

So for example, if we had the type FixedSizeList(FixedSizeList(Int64)), ArrayAndIndexes would convert this to List(FixedSizeList(Int64)) but ArrayAndElement would convert this to List(List(Int64)).

This PR modifies the logic of get_valid_types() so that we always recursively convert FixedSizeLists to Lists. So FixedSizeList(FixedSizeList(Int64)) is always converted to List(List(Int64)).
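To make the difference between the two conversions concrete, here is a minimal sketch. It uses a simplified stand-in `DataType` enum rather than Arrow's real one, and the helper names `top_level_to_list` and `recursive_to_list` are hypothetical, chosen only for illustration.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption: only the
/// variants needed for this illustration are modelled).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, i32),
}

/// Hypothetical helper: converts only the outermost `FixedSizeList` to `List`.
fn top_level_to_list(dt: &DataType) -> DataType {
    match dt {
        DataType::FixedSizeList(inner, _) => DataType::List(inner.clone()),
        other => other.clone(),
    }
}

/// Hypothetical helper: converts every `FixedSizeList`, at any depth, to `List`.
fn recursive_to_list(dt: &DataType) -> DataType {
    match dt {
        DataType::FixedSizeList(inner, _) | DataType::List(inner) => {
            DataType::List(Box::new(recursive_to_list(inner)))
        }
        other => other.clone(),
    }
}
```

With FixedSizeList(FixedSizeList(Int64)) as input, the first helper yields List(FixedSizeList(Int64)) and the second yields List(List(Int64)), matching the two behaviors described above.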

There's a test, test_array_element_return_type_fixed_size_list(), that essentially asserts that get_valid_types() returns List(FixedSizeList(Int32)) for the array_element function, which used to have the ArrayAndIndexes signature. This test now fails because the FixedSizeList is converted to a List recursively.

I don't understand the old behavior, or why it was different for different function signatures, so I'm having trouble figuring out what the new behavior should be. We could of course sniff out the function's arguments to see if we have [array, index+] and if so only replace the top-level FixedSizeList with List, but I wouldn't understand why that's correct.

jayzhan211 · 2025-02-07T03:16:37Z

See this, #13819 (comment).
We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

jkosh44 · 2025-02-07T03:18:33Z

datafusion/expr-common/src/signature.rs

+    Array {
+        /// A full list of the arguments accepted by this function.
+        arguments: Vec<ArrayFunctionArgument>,
+    },


One drawback of this structure is that someone can technically create a signature with no array, only for it to get rejected at runtime in get_valid_types(). An alternative structure could be:

    /// A list of arguments that come before the array.
    pre_array_arguments: Vec<ArrayFunctionArgument>,
    /// A list of arguments that come after the array.
    post_array_arguments: Vec<ArrayFunctionArgument>,

Then we would remove the ArrayFunctionArgument::Array variant. This would have the drawback of only allowing a single array argument in the signature, which might be fine.

Another option is to force people to create this through a constructor that returns an error if there's no array argument.
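A rough sketch of the constructor option, assuming a simplified `ArrayFunctionArgument` enum and a hypothetical `ArraySignature` wrapper struct (neither is the actual DataFusion definition):

```rust
/// Simplified stand-in for the argument kinds discussed in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrayFunctionArgument {
    Element,
    Index,
    Array,
}

/// Hypothetical wrapper whose field is private, so the only way to build
/// one from outside the module is through the validating constructor.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ArraySignature {
    arguments: Vec<ArrayFunctionArgument>,
}

impl ArraySignature {
    /// Rejects argument lists that contain no `Array` argument.
    fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, String> {
        if arguments.contains(&ArrayFunctionArgument::Array) {
            Ok(Self { arguments })
        } else {
            Err("an array signature needs at least one Array argument".to_string())
        }
    }

    /// Read-only access to the validated argument list.
    fn arguments(&self) -> &[ArrayFunctionArgument] {
        &self.arguments
    }
}
```

The trade-off is that an invalid signature fails when it is constructed rather than later in get_valid_types(), at the cost of a fallible constructor at every call site.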

jkosh44 · 2025-02-07T03:19:22Z

datafusion/expr-common/src/signature.rs

+    /// An Int64 index argument.
+    Index,


Offset might be a better name for this? It can technically be used for sizes in functions like array_resize or counts for functions like array_replace_n.

jkosh44 · 2025-02-07T05:18:53Z

See this, #13819 (comment). We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

Ah ok, so the idea was that if the function changed the size of the list, then it would recursively convert FixedSizeList to List, but if it didn't change the size then it would only change the top-level FixedSizeList to a List. However, it looks like the code was assuming that no ArrayAndIndexes function modified the list, which was no longer true after 3dfce7d.

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

jayzhan211 · 2025-02-07T05:28:24Z

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

yes

jkosh44 · 2025-02-07T14:08:31Z

datafusion/common/src/utils/mod.rs

 /// assert_eq!(coerced_type, DataType::List(Arc::new(Field::new_list_field(DataType::Float64, true))));
 pub fn coerced_type_with_base_type_only(
    data_type: &DataType,
    base_type: &DataType,
+    mutable: bool,


We couldn't use the enum here because this crate doesn't have access to the enum.

Then it means the enum should be moved here.

Done. Another option would be to move this function, which is only used in the get_valid_types function, to function.rs.

jkosh44 · 2025-02-07T14:12:59Z

datafusion/expr-common/src/signature.rs

+        /// Whether any of the input arrays are modified.
+        mutability: ArrayFunctionMutability,


I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.

I still think that I don't fully understand why the conversion is happening in get_valid_types. Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.

Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.

The current code functions as expected: both Fixed and List are accepted regardless of the specified mutability, and the return type is determined by the mutability setting. Isn't it 🤔 ?

I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.

Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly

and the return type is determined by the mutability setting. Isn't it 🤔 ?

I'm still unfamiliar with much of the code base, so please take everything I'm saying with a grain of salt. The mutability of the function is only ever looked at by the get_valid_types() function, which is described in the code as "Returns a Vec of all possible valid argument types for the given signature.".

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/expr/src/type_coercion/functions.rs#L352-L363

From that description, I would conclude that mutability determines the accepted argument types, not the return type.

However, ScalarUDFImpl has a couple of trait functions, like fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>, that determine the return type as a function of the argument types.

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/expr/src/udf.rs#L540-L596

If we take a look at some of the implementations of return_type() for array functions, many of them blindly pass through the argument type of the input array.

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L396-L398
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L704-L706
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L811-L813
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L108-L110
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L196-L198

So by modifying the accepted argument types we are indirectly modifying the return types.
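The indirect effect can be illustrated with a small sketch. It assumes a simplified stand-in `DataType` enum, a hypothetical `coerce_args` function standing in for the coercion step, and a pass-through `return_type` shaped like the linked implementations; none of this is the real DataFusion code.

```rust
/// Simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, i32),
}

/// Hypothetical stand-in for the coercion step: optionally rewrites a
/// top-level `FixedSizeList` argument to `List`.
fn coerce_args(arg_types: &[DataType], fixed_to_list: bool) -> Vec<DataType> {
    arg_types
        .iter()
        .map(|dt| match dt {
            DataType::FixedSizeList(inner, _) if fixed_to_list => {
                DataType::List(inner.clone())
            }
            other => other.clone(),
        })
        .collect()
}

/// A pass-through `return_type`: the result type is whatever the first
/// (already coerced) argument type is.
fn return_type(arg_types: &[DataType]) -> DataType {
    arg_types[0].clone()
}
```

Because `return_type` only ever sees the coerced argument types, switching the coercion on or off changes the reported return type even though `return_type` itself is unchanged.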

Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly

It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList to List in get_valid_types()), and instead move the logic to return_type(). Then functions can be explicit about not returning FixedSizeLists.

It might be confusing, but the arguments to return_type are already coerced. get_valid_types returns coerced types and passes them to return_type, which is why we don't need to deal with coercion in return_type again.

FixedSizeList to List is part of the coercion, so that logic belongs in type coercion and get_valid_types.

get_valid_types is used in data_types_with_scalar_udf

    /// Performs type coercion for scalar function arguments.
    ///
    /// Returns the data types to which each argument must be coerced to
    /// match `signature`.
    ///
    /// For more details on coercion in general, please see the
    /// [`type_coercion`](crate::type_coercion) module.
    pub fn data_types_with_scalar_udf(

Returns a Vec of all possible valid argument types for the given signature

Returns a Vec of all possible valid (coerced) argument types for the given signature

Ok, I think I'm starting to understand this better. We coerce the argument types in get_valid_types in case we might use that argument type in the return type. It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types. For example, array_has accepts an array but just returns a bool.

One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList and returned List, or vice versa? Is that possible? Does it even make sense?

I agree with your other comment though, requiring function implementers to coerce types in return_type is probably a footgun and will likely be forgotten.

I think the best way forward is to revert 3a7c0e6 and add back an enum that controls the coercing of arguments. I'll have a think about the best way of doing that. It feels a little bad that we'll require function implementers to implement return_type() and provide an enum that describes how we should coerce types for the return type, since they both are doing similar things. It seems like it may be a source of confusion.

Also I noticed through some manual testing that this PR broke some function calls with nested FixedSizeList so I'll add some test cases with that. For example, SELECT array_prepend(arrow_cast([1], 'FixedSizeList(1, Int64)'), [arrow_cast([2], 'FixedSizeList(1, Int64)'), arrow_cast([3], 'FixedSizeList(1, Int64)')]); is broken in this PR.

It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types

At the very least, the coerced types should be inferred from the signature you provided. There may be a better design for this, but I haven’t come up with an alternative idea yet.

One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList and returned List, or vice versa? Is that possible? Does it even make sense?

I think both should be possible in DataFusion, and both make sense depending on what you want to do. FixedSizeList to List is common; List to FixedSizeList makes sense if your function always outputs a fixed size. Most of the existing functions in DataFusion don't care about fixed-size lists, so converting them to lists simplifies things.

I think the best way forward is to revert 3a7c0e6

Instead of defining mutability, letting the user specify directly whether they want to coerce a fixed-size list to a list is probably a better idea. It is also more flexible. We can add such a flag within ArraySignature.

jkosh44 · 2025-02-10T15:05:24Z

It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList to List in get_valid_types()), and instead move the logic to return_type(). Then functions can be explicit about not returning FixedSizeLists.

I pushed a commit for this to see how CI would like it. Happy to revert it if people don't like it.

I also pushed a commit to add some validations around ArrayFunctionSignature.

jayzhan211 · 2025-02-11T01:47:32Z

datafusion/functions-nested/src/remove.rs

@@ -98,7 +99,7 @@ impl ScalarUDFImpl for ArrayRemove {
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
-        Ok(arg_types[0].clone())
+        Ok(coerced_fixed_size_list_to_list(&arg_types[0]))


If we coerce types in return_type, then a question arises: which coercions should be handled before return_type, and which should be handled inside return_type?

I don't think we should do this in this PR (or maybe ever), but if we ever did want to do this in the future, then I think the natural split would be to do coercion necessary for input arguments in get_valid_types() and coercion necessary for return types in return_type(). For example, converting an array argument base type and an element type to a common type would happen in get_valid_types(), but converting a FixedSizeList to a List would happen in return_type().

If you take a look at array_dims as an example, it is doing something similar already in return_type(),

datafusion/datafusion/functions-nested/src/dimension.rs

Lines 98 to 107 in 2c73fcd

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        Ok(match arg_types[0] {
            List(_) | LargeList(_) | FixedSizeList(_, _) => {
                List(Arc::new(Field::new_list_field(UInt64, true)))
            }
            _ => {
                return plan_err!("The array_dims function can only accept List/LargeList/FixedSizeList.");
            }
        })
    }

We already provide a trait function that allows functions to describe their return type as a function of their argument types, so it seems like a good place to put coercions that are needed specifically for the return type. On the other hand it adds a lot of complexity as you mentioned. The same code needs to be duplicated across all functions and the function implementer needs to actually remember to add the coercions. Additionally, the function implementation would actually need to handle both List and FixedSizeList, which most of them currently don't.

I just wanted to get some of my thoughts written down, but I agree that we shouldn't do any coercion in return_type in this PR.

jkosh44 · 2025-02-11T18:15:02Z

@jayzhan211 I just pushed a commit with your idea about adding a flag for array coercion. I'm feeling pretty good about the current state of this PR, but I have the following two open questions:

1. get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it, so I just wanted to double check with you that it's correct.
2. Can you double check that I correctly assigned each function Some(ListCoercion::FixedSizedListToList) vs None for the flag? I'm not 100% confident that I did it correctly.

This commit allows for more expressive array function signatures. Previously, `ArrayFunctionSignature` was an enum of potential argument combinations and orders. For many array functions, none of the `ArrayFunctionSignature` variants worked, so they used `TypeSignature::VariadicAny` instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.

As an example, this commit also updates the signature of the `array_replace` family of functions to use a new expressive signature, which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of all, there's no way to describe a function that has multiple different arrays of different type or dimension. Additionally, there isn't support for functions with map arrays and recursive arrays that have more than one argument.

Works towards resolving apache#14451

jayzhan211 · 2025-02-12T00:08:22Z

get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it. So I just wanted to double check with you that it's correct.

It is because of the function below. I think we now only coerce to List if the flag is set.

    fn array(array_type: &DataType) -> Option<DataType> {
        match array_type {
            DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
            DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
            _ => None,
        }
    }
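One possible reading of that comment is a flag-aware variant of the helper above. The sketch below is an assumption, not the merged implementation: it uses a simplified stand-in `DataType` enum, and the extra `coercion` parameter is hypothetical. The `ListCoercion::FixedSizedListToList` name follows the spelling used elsewhere in this thread.

```rust
/// Simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
    LargeList(Box<DataType>),
    FixedSizeList(Box<DataType>, i32),
}

/// Stand-in for the coercion flag discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ListCoercion {
    FixedSizedListToList,
}

/// Hypothetical flag-aware variant: the outermost `FixedSizeList` becomes a
/// `List` only when the signature asks for it; non-array types are rejected.
fn array(array_type: &DataType, coercion: Option<ListCoercion>) -> Option<DataType> {
    match array_type {
        DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
        DataType::FixedSizeList(inner, _) => match coercion {
            Some(ListCoercion::FixedSizedListToList) => Some(DataType::List(inner.clone())),
            None => Some(array_type.clone()),
        },
        _ => None,
    }
}
```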

jayzhan211 · 2025-02-12T00:12:02Z

datafusion/expr-common/src/signature.rs

+    pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> {
+        if !arguments
+            .iter()
+            .any(|arg| *arg == ArrayFunctionArgument::Array)


Instead of checking here, I think we can verify the validity in get_valid_types, so the definition of the signature can be simplified.

jayzhan211 · 2025-02-12T00:18:47Z

datafusion/functions-nested/src/array_has.rs

@@ -94,7 +94,7 @@ impl Default for ArrayHas {
 impl ArrayHas {
    pub fn new() -> Self {
        Self {
-            signature: Signature::array_and_element(Volatility::Immutable),
+            signature: Signature::array_and_element(Volatility::Immutable, None),


How about something like this? We don't necessarily need to wrap it in a util function, which is less helpful when there are many fields:

    signature: Signature {
        type_signature: TypeSignature::ArraySignature(
            ArrayFunctionSignature::Array {
                arguments: vec![
                    ArrayFunctionArgument::Array,
                    ArrayFunctionArgument::Element,
                ],
                array_coercion: Some(ListCoercion::FixedSizedListToList),
            },
        ),
        volatility: Volatility::Immutable,
    },

But to avoid breaking Signature::array_and_element, we can keep a version of it without the additional array_coercion field.
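A sketch of how that could look, using simplified stand-in types (the flattened `ArraySignature` struct is hypothetical and collapses the real TypeSignature nesting). The default chosen for the helper is an assumption; it mirrors the earlier observation that ArrayAndElement used to convert FixedSizeLists recursively.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Volatility {
    Immutable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrayFunctionArgument {
    Array,
    Element,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ListCoercion {
    FixedSizedListToList,
}

/// Hypothetical, flattened stand-in for the array type signature.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ArraySignature {
    arguments: Vec<ArrayFunctionArgument>,
    array_coercion: Option<ListCoercion>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Signature {
    type_signature: ArraySignature,
    volatility: Volatility,
}

impl Signature {
    /// Keeps the existing one-argument shape so current callers compile
    /// unchanged. The coercion default here is an assumption.
    fn array_and_element(volatility: Volatility) -> Self {
        Signature {
            type_signature: ArraySignature {
                arguments: vec![
                    ArrayFunctionArgument::Array,
                    ArrayFunctionArgument::Element,
                ],
                array_coercion: Some(ListCoercion::FixedSizedListToList),
            },
            volatility,
        }
    }
}
```

Callers that need a different coercion setting would write the struct literal directly instead of going through the helper.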

github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Feb 6, 2025

This was referenced Feb 6, 2025

Proper NULL handling in array functions #14451 (Open)
functions: Remove NullHandling from scalar funcs #14531 (Merged)

jkosh44 force-pushed the array-signatures branch from b7fc773 to d4b74db Compare February 6, 2025 23:04

jkosh44 marked this pull request as ready for review February 7, 2025 02:59

github-actions bot added the common Related to common crate label Feb 7, 2025

jkosh44 force-pushed the array-signatures branch from 96b95e1 to 3a7c0e6 Compare February 10, 2025 15:05

jayzhan211 reviewed Feb 11, 2025

jkosh44 added 11 commits February 11, 2025 13:22

Add mutability (fa57013)
Move mutability enum (8f9fd38)
fmt (70e6f28)
Fix doctest (d9fccdc)
Add validation to array args (f5b6491)
Remove mutability and update return types (54b5c4f)
fmt (f885ac8)
Fix clippy (3c651be)
Fix imports (bf69ec8)
Add list coercion flag (9ba3fe3)

jkosh44 force-pushed the array-signatures branch from 6798ca2 to 9ba3fe3 Compare February 11, 2025 18:27

jkosh44 added 2 commits February 11, 2025 13:37

Some formatting fixes (b302914)
Some formatting fixes (6c74609)

jayzhan211 reviewed Feb 12, 2025
