Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function: Allow more expressive array signatures #14532

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jkosh44
Copy link
Contributor

@jkosh44 jkosh44 commented Feb 6, 2025

This commit allows for more expressive array function signatures. Previously, ArrayFunctionSignature was an enum of potential argument combinations and orders. For many array functions, none of the ArrayFunctionSignature variants work, so they use TypeSignature::VariadicAny instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.

As an example, this commit also updates the signature of the array_replace family of functions to use a new expressive signature, which removes a panic that existed previously.

Works towards resolving #14451

Which issue does this PR close?

Works towards closing, but doesn't fully close, #14451

Are these changes tested?

Yes

Are there any user-facing changes?

No, other than removing some panics.

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Feb 6, 2025
@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 7, 2025

This is failing CI for the following reason.

Previously, in get_valid_types() functions with the array signature ArrayAndIndexes and Array would convert top level FixedSizeLists to Lists; nested FixedSizeLists would not be changed. However, functions with the array signature ArrayAndElement, ElementAndArray, and ArrayAndElementAndOptionalIndex would recursively convert FixedSizeLists to Lists.

So for example, if we had the type FixedSizeList(FixedSizeList(Int64)), ArrayAndIndexes would convert this to List(FixedSizeList(Int64)) but ArrayAndElement would convert this to List(List(Int64)).

This PR modifies the logic of get_valid_types() so that we always recursively convert FixedSizeLists to Lists. So FixedSizeList(FixedSizeList(Int64)) is always converted to List(List(Int64)).

There's a test, test_array_element_return_type_fixed_size_list(), that essentially asserts that get_valid_types() returns List(FixedSizedList(Int32)) for the array_element function, which used to have the ArrayAndIndexes signature. This test now fails because the FixedSizedList is converted to a List recursively.

I don't understand the old behavior, and why it was different for different function signatures. So I'm having trouble figuring out what the new behavior should be. We could of course sniff out the functions arguments to see if we have [array, index+] and if so only replace top level FixedSizedList with List, but I wouldn't understand why that's correct.

@jkosh44 jkosh44 marked this pull request as ready for review February 7, 2025 02:59
@jayzhan211
Copy link
Contributor

See this, #13819 (comment).
We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

Comment on lines 230 to 236
Array {
/// A full list of the arguments accepted by this function.
arguments: Vec<ArrayFunctionArgument>,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One drawback of this structure is that someone can technically create a signature with no array, only for it to get rejected at runtime in get_valid_types(). An alternative structure could be:

/// A list of arguments that come before the array.
pre_array_aguments: Vec<ArrayFunctionArgument>,
/// A list of arguments that come after the array.
post_array_aguments: Vec<ArrayFunctionArgument>,

Then we would remove the ArrayFunctionArgument::Array variant. This would have the draw back of only allowing a single array argument in the signature, which might be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to force people to create this through a constructor that returns an error if there's no array argument.

Comment on lines +269 to +305
/// An Int64 index argument.
Index,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset might be a better name for this? It can technically be used for sizes in functions like array_resize or counts for functions like array_replace_n.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 7, 2025

See this, #13819 (comment). We only convert an inner fixed-size list to a regular list when the function performs a mutation operation, such as array_append. For non-mutation operations, like array_element, we keep T as a fixed-size list

Ah ok, so the idea was that if the function changed the size of the list, then it would recursively convert FixedSizedList to List, but if it didn't change the size then it would only change the top level FixedSizedList to a List. However, it looks like the code was making the assumption that all ArrayAndIndexes functions did not modify the list which was actually no longer true after 3dfce7d.

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

@jayzhan211
Copy link
Contributor

So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information.

yes

@github-actions github-actions bot added the common Related to common crate label Feb 7, 2025
/// assert_eq!(coerced_type, DataType::List(Arc::new(Field::new_list_field(DataType::Float64, true))));
pub fn coerced_type_with_base_type_only(
data_type: &DataType,
base_type: &DataType,
mutable: bool,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We couldn't use the enum here because this crate doesn't have access to the enum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it means the enum should be move to here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Another option would be to move this function, which is only used in the get_valid_types function, to function.rs.

Comment on lines 233 to 234
/// Whether any of the input arrays are modified.
mutability: ArrayFunctionMutability,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.

I still think that I don't fully understand why the conversion is happening in get_valid_types. Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.

The current code functions as expected: both Fixed and List are accepted regardless of the specified mutability, and the return type is determined by the mutability setting. Isn't it 🤔 ?

I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.

Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly

Copy link
Contributor Author

@jkosh44 jkosh44 Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the return type is determined by the mutability setting. Isn't it 🤔 ?

I'm still unfamiliar with much of the code base, so please take everything I'm saying with a grain of salt. The mutability of the function is only ever looked at by the get_valid_types() function, which is described in the code as "Returns a Vec of all possible valid argument types for the given signature.".

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/expr/src/type_coercion/functions.rs#L352-L363

From that description, I would conclude that mutability determines the accepted argument types, not the return type.

However, ScalarUDFImpl has a couple of trait functions, like fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>, that determine the return type as a function of the argument types.

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/expr/src/udf.rs#L540-L596

If we take a look at some of the implementations of return_type() for array functions, many of them blindly pass through the argument type of the input array.

https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L396-L398
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L704-L706
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L811-L813
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L108-L110
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L196-L198

So by modifying the accepted argument types we are indirectly modifying the return types.

Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly

It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList to List in get_valid_types()), and instead move the logic to return_type(). Then functions can be explicit about not returning FixedSizeLists.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be confusing but return_type arguments are coerced already. get_valid_types returned coerced types and pass these to return_type, that is why we don't need to deal with coercion in return_type again.

FixedSizeList to List is part of the coercion so it should be the logic in Typecoercion and get_valid_types

get_valid_types is used in data_types_with_scalar_udf

/// Performs type coercion for scalar function arguments.
///
/// Returns the data types to which each argument must be coerced to
/// match `signature`.
///
/// For more details on coercion in general, please see the
/// [`type_coercion`](crate::type_coercion) module.
pub fn data_types_with_scalar_udf(

Returns a Vec of all possible valid argument types for the given signature

Returns a Vec of all possible valid (coerced) argument types for the given signature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I'm starting to understand this better. We coerce the argument types in get_valid_types in case we might use that argument type in the return type. It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types For example array_has accepts an array but just returns a bool.

One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList and returned List, or vice versa? Is that possible? Does it even make sense?

I agree with your other comment though, requiring function implementers to coerce types in return_type is probably a footgun and will likely be forgotten.

I think the best way forward is to revert 3a7c0e6 and add back an enum that controls the coercing of arguments. I'll have a think about the best way of doing that. It feels a little bad that we'll require function implementers to implement return_type() and provide an enum that describes how we should coerce types for the return type, since they both are doing similar things. It seems like it may be a source of confusion.

Also I noticed through some manual testing that this PR broke some function calls with nested FixedSizeList so I'll add some test cases with that. For example, SELECT array_prepend(arrow_cast([1], 'FixedSizeList(1, Int64)'), [arrow_cast([2], 'FixedSizeList(1, Int64)'), arrow_cast([3], 'FixedSizeList(1, Int64)')]); is broken in this PR.

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types

At the very least, the coerced types should be inferred from the signature you provided. There may be a better design for this, but I haven’t come up with an alternative idea yet.

One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList and returned List, or vice versa? Is that possible? Does it even make sense?

I think both should be possible for datafusion and makes sense dependent on what you want to do. FixedSizeList to List is common, List to FixedSizeList if your function always output fixed size. Most of the existing function in datafusion doesn't care about fixed size list, so convert it to list simplies things

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way forward is to revert 3a7c0e6

Instead of define mutability, let user define whether they want to coerce fixed size list to list directly is probably a better idea. It is also more flexible. We can add such flag within ArraySignature

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 10, 2025

It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList to List in get_valid_types(), and instead move the logic to return_type(). Then functions can be explicit about not returning FixedSizeLists.

I pushed a commit for this to see how CI would like it. Happy to revert it if people don't like it.

I also pushed a commit to add some validations around ArrayFunctionSignature.

@@ -98,7 +99,7 @@ impl ScalarUDFImpl for ArrayRemove {
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Ok(arg_types[0].clone())
Ok(coerced_fixed_size_list_to_list(&arg_types[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we coerce types in return_type then a question comes out, what coercion should be handled before return_type and what should be handled inside return_type

Copy link
Contributor Author

@jkosh44 jkosh44 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do this in this PR (or maybe ever), but if we ever did want to do this in the future, then I think the natural split would be to do coercion necessary for input arguments in get_valid_types() and coercion necessary for return types in return_type(). For example converting an array argument base type and an element type to a common type would happen in get_valid_types(), but converting a FixedSizedList to a List would happen in return_type().

If you take a look at array_dims as an example, it is doing something similar already in return_type(),

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Ok(match arg_types[0] {
List(_) | LargeList(_) | FixedSizeList(_, _) => {
List(Arc::new(Field::new_list_field(UInt64, true)))
}
_ => {
return plan_err!("The array_dims function can only accept List/LargeList/FixedSizeList.");
}
})
}

We already provide a trait function that allows functions to describe their return type as a function of their argument types, so it seems like a good place to put coercions that are needed specifically for the return type. On the other hand it adds a lot of complexity as you mentioned. The same code needs to be duplicated across all functions and the function implementer needs to actually remember to add the coercions. Additionally, the function implementation would actually need to handle both List and FixedSizeList, which most of them currently don't.

I just wanted to get some of my thoughts written down, but I agree that we shouldn't do any coercion in return_type in this PR.

@jkosh44
Copy link
Contributor Author

jkosh44 commented Feb 11, 2025

@jayzhan211 I just pushed a commit with your idea about adding a flag for array coercion. I'm feeling pretty good about the current state of this PR, but I have the following two open questions:

  • get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it. So I just wanted to double check with you that's it's correct.
  • Can you double check that I correctly assigned a function with Some(ListCoercion::FixedSizedListToList) vs None for the flag? I'm not 100% confident that I did it correctly.

This commit allows for more expressive array function signatures.
Previously, `ArrayFunctionSignature` was an enum of potential argument
combinations and orders. For many array functions, none of the
`ArrayFunctionSignature` variants worked, so they used
`TypeSignature::VariadicAny` instead. This commit will allow those
functions to use more descriptive signatures which will prevent them
from having to perform manual type checking in the function
implementation.

As an example, this commit also updates the signature of the
`array_replace` family of functions to use a new expressive signature,
which removes a panic that existed previously.

There are still a couple of limitations with this approach. First of
all, there's no way to describe a function that has multiple different
arrays of different type or dimension. Additionally, there isn't
support for functions with map arrays and recursive arrays that have
more than one argument.

Works towards resolving apache#14451
@jayzhan211
Copy link
Contributor

get_valid_types() always coerces the outermost FixedSizeList to List, no matter what the value of the flag is. I'm pretty sure that matches existing behavior, but I don't fully understand it. So I just wanted to double check with you that's it's correct

It is because of this, I think we now only coerce to list if the flag is set

    fn array(array_type: &DataType) -> Option<DataType> {
        match array_type {
            DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
            DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
            _ => None,
        }
    }

pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> {
if !arguments
.iter()
.any(|arg| *arg == ArrayFunctionArgument::Array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of checking here, I think we can verify the validity in get_valid_types, so the definition of signature can be simplified.

@@ -94,7 +94,7 @@ impl Default for ArrayHas {
impl ArrayHas {
pub fn new() -> Self {
Self {
signature: Signature::array_and_element(Volatility::Immutable),
signature: Signature::array_and_element(Volatility::Immutable, None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this, we don't necessary need to wrap into a util function, it is less helpful when there are many fields

            signature: Signature {
                type_signature: TypeSignature::ArraySignature(
                    ArrayFunctionSignature::Array {
                        arguments: vec![
                            ArrayFunctionArgument::Array,
                            ArrayFunctionArgument::Element,
                        ],
                        array_coercion: Some(ListCoercion::FixedSizedListToList),
                    },
                ),
                volatility: Volatility::Immutable,
            },

Copy link
Contributor

@jayzhan211 jayzhan211 Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but to avoid breaking Signature::array_and_element we can keep one without additional array_coercion field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants