-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
function: Allow more expressive array signatures #14532
base: main
Are you sure you want to change the base?
Conversation
b7fc773
to
d4b74db
Compare
This is failing CI for the following reason. Previously, in So for example, if we had the type This PR modifies the logic of There's a test, I don't understand the old behavior, and why it was different for different function signatures. So I'm having trouble figuring out what the new behavior should be. We could of course sniff out the functions arguments to see if we have |
See this, #13819 (comment). |
Array { | ||
/// A full list of the arguments accepted by this function. | ||
arguments: Vec<ArrayFunctionArgument>, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One drawback of this structure is that someone can technically create a signature with no array, only for it to get rejected at runtime in get_valid_types()
. An alternative structure could be:
/// A list of arguments that come before the array.
pre_array_aguments: Vec<ArrayFunctionArgument>,
/// A list of arguments that come after the array.
post_array_aguments: Vec<ArrayFunctionArgument>,
Then we would remove the ArrayFunctionArgument::Array
variant. This would have the draw back of only allowing a single array argument in the signature, which might be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option is to force people to create this through a constructor that returns an error if there's no array argument.
/// An Int64 index argument. | ||
Index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offset might be a better name for this? It can technically be used for sizes in functions like array_resize
or counts for functions like array_replace_n
.
Ah ok, so the idea was that if the function changed the size of the list, then it would recursively convert So it sounds like we need to include in the array signature whether or not the function might change the size of the list and use that information. |
yes |
datafusion/common/src/utils/mod.rs
Outdated
/// assert_eq!(coerced_type, DataType::List(Arc::new(Field::new_list_field(DataType::Float64, true)))); | ||
pub fn coerced_type_with_base_type_only( | ||
data_type: &DataType, | ||
base_type: &DataType, | ||
mutable: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We couldn't use the enum here because this crate doesn't have access to the enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it means the enum should be move to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Another option would be to move this function, which is only used in the get_valid_types
function, to function.rs
.
/// Whether any of the input arrays are modified. | ||
mutability: ArrayFunctionMutability, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions
mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.
I still think that I don't fully understand why the conversion is happening in get_valid_types
. Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the mutability of a function affect what types are accepted as arguments? It seems like the mutability of the function should affect the return types of the function not the argument types.
The current code functions as expected: both Fixed and List are accepted regardless of the specified mutability, and the return type is determined by the mutability setting. Isn't it 🤔 ?
I'm still not super confident that I did this correctly or if this is the correct thing to be modelling. For example is array_positions mutable? It returns an array of potentially different sizes but doesn't actually mutate the input array.
Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the return type is determined by the mutability setting. Isn't it 🤔 ?
I'm still unfamiliar with much of the code base, so please take everything I'm saying with a grain of salt. The mutability of the function is only ever looked at by the get_valid_types()
function, which is described in the code as "Returns a Vec of all possible valid argument types for the given signature.".
From that description, I would conclude that mutability determines the accepted argument types, not the return type.
However, ScalarUDFImpl
has a couple of trait functions, like fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>
, that determine the return type as a function of the argument types.
If we take a look at some of the implementations of return_type()
for array functions, many of them blindly pass through the argument type of the input array.
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L396-L398
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L704-L706
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/extract.rs#L811-L813
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L108-L110
https://github.com/jkosh44/datafusion/blob/53b7ae53af30cc7b8734a6c292cc3e04a993afdc/datafusion/functions-nested/src/concat.rs#L196-L198
So by modifying the accepted argument types we are indirectly modifying the return types.
Instead of modeling "mutability," we can explicitly define the desired type in the function signature. This type can be either List or FixedSizeList, and we coerce the input accordingly
It might be a better approach to not modify the accepted argument types (i.e. don't convert FixedSizeList
to List
in get_valid_types()
), and instead move the logic to return_type()
. Then functions can be explicit about not returning FixedSizeList
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be confusing but return_type
arguments are coerced already. get_valid_types
returned coerced types and pass these to return_type
, that is why we don't need to deal with coercion in return_type
again.
FixedSizeList to List is part of the coercion so it should be the logic in Typecoercion
and get_valid_types
get_valid_types
is used in data_types_with_scalar_udf
/// Performs type coercion for scalar function arguments.
///
/// Returns the data types to which each argument must be coerced to
/// match `signature`.
///
/// For more details on coercion in general, please see the
/// [`type_coercion`](crate::type_coercion) module.
pub fn data_types_with_scalar_udf(
Returns a Vec of all possible valid argument types for the given signature
Returns a Vec of all possible valid (coerced) argument types for the given signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I'm starting to understand this better. We coerce the argument types in get_valid_types
in case we might use that argument type in the return type. It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types For example array_has
accepts an array but just returns a bool.
One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList
and returned List
, or vice versa? Is that possible? Does it even make sense?
I agree with your other comment though, requiring function implementers to coerce types in return_type
is probably a footgun and will likely be forgotten.
I think the best way forward is to revert 3a7c0e6 and add back an enum that controls the coercing of arguments. I'll have a think about the best way of doing that. It feels a little bad that we'll require function implementers to implement return_type()
and provide an enum that describes how we should coerce types for the return type, since they both are doing similar things. It seems like it may be a source of confusion.
Also I noticed through some manual testing that this PR broke some function calls with nested FixedSizeList
so I'll add some test cases with that. For example, SELECT array_prepend(arrow_cast([1], 'FixedSizeList(1, Int64)'), [arrow_cast([2], 'FixedSizeList(1, Int64)'), arrow_cast([3], 'FixedSizeList(1, Int64)')]);
is broken in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit confusing/unintuitive because we don't actually know anything about the return type at this point, and the return type may have nothing to do with the argument types
At the very least, the coerced types should be inferred from the signature you provided. There may be a better design for this, but I haven’t come up with an alternative idea yet.
One thing I'm still confused about is, what if we wanted to define a function that accepted FixedSizeList and returned List, or vice versa? Is that possible? Does it even make sense?
I think both should be possible for datafusion and makes sense dependent on what you want to do. FixedSizeList to List is common, List to FixedSizeList if your function always output fixed size. Most of the existing function in datafusion doesn't care about fixed size list, so convert it to list simplies things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best way forward is to revert 3a7c0e6
Instead of define mutability, let user define whether they want to coerce fixed size list to list directly is probably a better idea. It is also more flexible. We can add such flag within ArraySignature
I pushed a commit for this to see how CI would like it. Happy to revert it if people don't like it. I also pushed a commit to add some validations around |
96b95e1
to
3a7c0e6
Compare
@@ -98,7 +99,7 @@ impl ScalarUDFImpl for ArrayRemove { | |||
} | |||
|
|||
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | |||
Ok(arg_types[0].clone()) | |||
Ok(coerced_fixed_size_list_to_list(&arg_types[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we coerce types in return_type
then a question comes out, what coercion should be handled before return_type
and what should be handled inside return_type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do this in this PR (or maybe ever), but if we ever did want to do this in the future, then I think the natural split would be to do coercion necessary for input arguments in get_valid_types()
and coercion necessary for return types in return_type()
. For example converting an array argument base type and an element type to a common type would happen in get_valid_types()
, but converting a FixedSizedList
to a List
would happen in return_type()
.
If you take a look at array_dims
as an example, it is doing something similar already in return_type()
,
datafusion/datafusion/functions-nested/src/dimension.rs
Lines 98 to 107 in 2c73fcd
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | |
Ok(match arg_types[0] { | |
List(_) | LargeList(_) | FixedSizeList(_, _) => { | |
List(Arc::new(Field::new_list_field(UInt64, true))) | |
} | |
_ => { | |
return plan_err!("The array_dims function can only accept List/LargeList/FixedSizeList."); | |
} | |
}) | |
} |
We already provide a trait function that allows functions to describe their return type as a function of their argument types, so it seems like a good place to put coercions that are needed specifically for the return type. On the other hand it adds a lot of complexity as you mentioned. The same code needs to be duplicated across all functions and the function implementer needs to actually remember to add the coercions. Additionally, the function implementation would actually need to handle both
List
and FixedSizeList
, which most of them currently don't.
I just wanted to get some of my thoughts written down, but I agree that we shouldn't do any coercion in return_type
in this PR.
@jayzhan211 I just pushed a commit with your idea about adding a flag for array coercion. I'm feeling pretty good about the current state of this PR, but I have the following two open questions:
|
This commit allows for more expressive array function signatures. Previously, `ArrayFunctionSignature` was an enum of potential argument combinations and orders. For many array functions, none of the `ArrayFunctionSignature` variants worked, so they used `TypeSignature::VariadicAny` instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation. As an example, this commit also updates the signature of the `array_replace` family of functions to use a new expressive signature, which removes a panic that existed previously. There are still a couple of limitations with this approach. First of all, there's no way to describe a function that has multiple different arrays of different type or dimension. Additionally, there isn't support for functions with map arrays and recursive arrays that have more than one argument. Works towards resolving apache#14451
6798ca2
to
9ba3fe3
Compare
It is because of this, I think we now only coerce to list if the flag is set fn array(array_type: &DataType) -> Option<DataType> {
match array_type {
DataType::List(_) | DataType::LargeList(_) => Some(array_type.clone()),
DataType::FixedSizeList(field, _) => Some(DataType::List(Arc::clone(field))),
_ => None,
}
} |
pub fn new(arguments: Vec<ArrayFunctionArgument>) -> Result<Self, &'static str> { | ||
if !arguments | ||
.iter() | ||
.any(|arg| *arg == ArrayFunctionArgument::Array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of checking here, I think we can verify the validity in get_valid_types
, so the definition of signature can be simplified.
@@ -94,7 +94,7 @@ impl Default for ArrayHas { | |||
impl ArrayHas { | |||
pub fn new() -> Self { | |||
Self { | |||
signature: Signature::array_and_element(Volatility::Immutable), | |||
signature: Signature::array_and_element(Volatility::Immutable, None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about something like this, we don't necessary need to wrap into a util function, it is less helpful when there are many fields
signature: Signature {
type_signature: TypeSignature::ArraySignature(
ArrayFunctionSignature::Array {
arguments: vec![
ArrayFunctionArgument::Array,
ArrayFunctionArgument::Element,
],
array_coercion: Some(ListCoercion::FixedSizedListToList),
},
),
volatility: Volatility::Immutable,
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but to avoid breaking Signature::array_and_element
we can keep one without additional array_coercion field
This commit allows for more expressive array function signatures. Previously,
ArrayFunctionSignature
was an enum of potential argument combinations and orders. For many array functions, none of theArrayFunctionSignature
variants work, so they useTypeSignature::VariadicAny
instead. This commit will allow those functions to use more descriptive signatures which will prevent them from having to perform manual type checking in the function implementation.As an example, this commit also updates the signature of the
array_replace
family of functions to use a new expressive signature, which removes a panic that existed previously.Works towards resolving #14451
Which issue does this PR close?
Works towards closing, but doesn't fully close, #14451
Are these changes tested?
Yes
Are there any user-facing changes?
No, other than removing some panics.