Signature::Coercible with user defined implicit casting #14440
#[derive(Debug, Clone, Eq, PartialOrd, Hash)]
pub struct ParameterType {
    pub param_type: LogicalTypeRef,
param_type: target type of the function signature.
allowed_casts: implicit coercions allowed to cast to the target type.
For example,
param_type: string
allowed_casts: binary, int
Valid (all are cast to string):
func(string)
func(binary)
func(int)
Invalid:
func(float or other types)
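The rule in this example can be sketched with toy types. This is only an illustration under assumptions: `ToyType`, `ToyParameter`, `coerce`, and `string_param` are hypothetical names invented here, not part of DataFusion, and the real `ParameterType` holds a `LogicalTypeRef` rather than a plain enum.

```rust
/// Toy stand-in for a logical type (hypothetical, not DataFusion's `LogicalTypeRef`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyType {
    String,
    Binary,
    Int,
    Float,
}

/// Hypothetical mirror of the `ParameterType` idea described above.
#[derive(Debug, Clone)]
pub struct ToyParameter {
    /// Target type of the function signature.
    pub param_type: ToyType,
    /// Source types that may be implicitly cast to `param_type`.
    pub allowed_casts: Vec<ToyType>,
}

impl ToyParameter {
    /// Returns the type the argument is coerced to, or `None` if it is rejected.
    pub fn coerce(&self, arg: ToyType) -> Option<ToyType> {
        if arg == self.param_type || self.allowed_casts.contains(&arg) {
            Some(self.param_type)
        } else {
            None
        }
    }
}

/// The `func(string)` parameter from the example: binary and int are allowed sources.
pub fn string_param() -> ToyParameter {
    ToyParameter {
        param_type: ToyType::String,
        allowed_casts: vec![ToyType::Binary, ToyType::Int],
    }
}
```

With `string_param()`, string, binary, and int arguments all resolve to `ToyType::String`, while a float argument yields `None`.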
I am worried that Vec<Vec<ParameterType>> might become challenging to reason about. It would also be unclear when one should use this new signature rather than Signature::Coercible 🤔
Given this seems very similar to Signature::Coercible, and Signature::Coercible mirrors what we want pretty well, could we add some new information there on the allowed coercions rather than an entirely new type signature? Something like extending Coercible with rules that could be used to coerce to the target type:
pub enum TypeSignature {
...
Coercible(Vec<Coercion>),
...
}
Where Coercion looks like
struct Coercion {
desired_type: TypeSignatureClass,
allowed_casts: ... // also includes an option for User Defined?
}
🤔
I would like to make a breaking change to Signature::Coercible too; the only concern is whether this causes a regression or has a large impact downstream.
@shehabgamin If we replace Signature::Coercible with CoercibleV2, would it be a large change we should be concerned about?
If not, I'll replace Signature::Coercible with Signature::CoercibleV2.
23953dd
to
806c6a6
Compare
Signed-off-by: Jay Zhan <[email protected]>
806c6a6
to
c99e986
Compare
vec![
    TypeSignatureClass::Native(logical_string()),
    TypeSignatureClass::Native(logical_int64()),
In the v1 version, any integer is cast to the default cast type of the defined NativeType, which is i64 in this case.
// Accept all integer types but cast them to i64
Coercion {
    desired_type: TypeSignatureClass::Native(logical_int64()),
    allowed_casts: vec![TypeSignatureClass::Integer],
Without this, i32 is rejected.
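To make the difference concrete, here is a rough sketch of why the `Integer` entry in `allowed_casts` matters. It assumes, as the comment above states, that a `Native(int64)` desired type on its own matches only Int64. `ToyDataType`, `ToyClass`, and `IntCoercion` are hypothetical stand-ins, not the types from this PR.

```rust
/// Toy physical types (hypothetical stand-ins for Arrow `DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyDataType {
    Int8,
    Int16,
    Int32,
    Int64,
    Utf8,
}

/// Toy version of a signature class: either one exact type or the integer family.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyClass {
    Native(ToyDataType),
    Integer,
}

impl ToyClass {
    /// True when `t` belongs to this class.
    pub fn matches(&self, t: ToyDataType) -> bool {
        match self {
            ToyClass::Native(n) => *n == t,
            ToyClass::Integer => matches!(
                t,
                ToyDataType::Int8 | ToyDataType::Int16 | ToyDataType::Int32 | ToyDataType::Int64
            ),
        }
    }
}

/// Hypothetical coercion rule: a desired class plus the classes that may be cast to it.
pub struct IntCoercion {
    pub desired_type: ToyClass,
    pub allowed_casts: Vec<ToyClass>,
}

impl IntCoercion {
    /// An argument is accepted if it already matches or if its class is an allowed source.
    pub fn accepts(&self, t: ToyDataType) -> bool {
        self.desired_type.matches(t) || self.allowed_casts.iter().any(|c| c.matches(t))
    }
}
```

In this model, a rule with an empty `allowed_casts` rejects `Int32`, and adding `ToyClass::Integer` accepts it.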
signature: Signature::coercible_v2(
    vec![Coercion {
        desired_type: TypeSignatureClass::Native(logical_string()),
        allowed_casts: vec![TypeSignatureClass::Native(logical_binary())],
Coercing binary to string is now easily customizable
I will try and review this carefully over the weekend. Maybe @shehabgamin has some time to take a look too.
I will review over the weekend as well!
Thank you @jayzhan211. I really like where this is heading. I think the Coercion idea is the concept that DataFusion has been missing.
It seems like there are three kinds of coercion I know about:
- Coercing to a (single) specific data type
- Coercing to one of a specific logical type (e.g. one of the String types)
- Coercing two types to a "comparable" / compatible one (where comparison is differently defined)
It seems in all cases, the operation that is desired is:
Given an existing Arrow::DataType and a coercion (that describes a desired type), what cast, if any, can be applied so the coercion is satisfied?
If we could encapsulate this into Coercion somehow and update all our operations in terms of Coercion, I think we would be able to stop churning this code.
Also, once it is in Coercion, there is a natural place for other systems to provide their own coercion rules.
Questions:
- Can we make Signature::Coercible general enough to represent the same thing as Signature::Numeric and Signature::String? (If so, perhaps we can deprecate them in a follow-on PR.)
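The operation described above ("given an existing type and a coercion, what cast, if any, can be applied") could be sketched roughly as follows. Everything here is a toy model under assumptions: `PhysicalType`, `LogicalKind`, `SketchCoercion`, `Resolution`, and `resolve` are invented names, and the single `cast_target` field is a simplification of however the real code picks a default cast type.

```rust
/// Toy physical types (hypothetical stand-ins for Arrow `DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Binary,
    Int32,
}

/// Toy logical groupings of the physical types above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogicalKind {
    String,
    Binary,
    Integer,
}

/// Maps a physical type to the logical kind it belongs to.
pub fn logical_kind(t: PhysicalType) -> LogicalKind {
    match t {
        PhysicalType::Utf8 | PhysicalType::LargeUtf8 | PhysicalType::Utf8View => {
            LogicalKind::String
        }
        PhysicalType::Binary => LogicalKind::Binary,
        PhysicalType::Int32 => LogicalKind::Integer,
    }
}

/// Hypothetical coercion: the desired kind, the physical type to cast to,
/// and the kinds that are allowed as cast sources.
pub struct SketchCoercion {
    pub desired: LogicalKind,
    pub cast_target: PhysicalType,
    pub allowed_sources: Vec<LogicalKind>,
}

/// Outcome of checking one argument against one coercion.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// The argument already satisfies the coercion; no cast is needed.
    Keep,
    /// The argument must be cast to the given type.
    Cast(PhysicalType),
    /// The argument cannot satisfy the coercion.
    Reject,
}

/// Answers: which cast, if any, makes `source` satisfy `coercion`?
pub fn resolve(source: PhysicalType, coercion: &SketchCoercion) -> Resolution {
    let kind = logical_kind(source);
    if kind == coercion.desired {
        Resolution::Keep
    } else if coercion.allowed_sources.contains(&kind) {
        Resolution::Cast(coercion.cast_target)
    } else {
        Resolution::Reject
    }
}
```

In this sketch a `Utf8View` argument satisfies a string coercion without any cast, a `Binary` argument is cast to the configured target, and an `Int32` argument is rejected unless integers are added to `allowed_sources`. The third kind of coercion listed above (two types to a comparable one) takes two inputs and is not covered here.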
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
/// [`Coercion`] contains not only the desired type but also the allowed casts.
This API makes a lot of sense to me. In fact, I think it is pretty close to being able to express most other signatures.
@@ -431,6 +463,35 @@ impl TypeSignature {
    }
}

fn get_possible_types_from_signature_classes(
this would make more sense to me as a method on TypeSignatureClass
This function is used in information.schema to list all possible signature combinations, but it is more granular than necessary; sometimes we can use NativeType or a new struct that represents a set of DataType, and that should be enough.
@goldmedal
If a function requires Integer, we don't need to list all possible i8, i16, i32 or i64, but integer instead. I think we need an output other than Vec<DataType>, something that could combine both DataType and a set of types similar to NativeType. Otherwise we will generate tons of DataType combinations for the Coercible signature, and I guess that is not readable for information.schema.
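A small sketch of the size difference being discussed, assuming a toy `ArgClass` with only a few concrete members per class. The names `ArgClass`, `expanded_row_count`, and `compact_row` are hypothetical and the member lists are illustrative, not the actual contents of any DataFusion class.

```rust
/// Toy argument classes (hypothetical; not `TypeSignatureClass`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ArgClass {
    Integer,
    String,
}

impl ArgClass {
    /// Concrete types this class stands for in the toy model.
    pub fn members(&self) -> &'static [&'static str] {
        match self {
            ArgClass::Integer => &["Int8", "Int16", "Int32", "Int64"],
            ArgClass::String => &["Utf8", "LargeUtf8", "Utf8View"],
        }
    }

    /// Display name used when the class itself is reported.
    pub fn name(&self) -> &'static str {
        match self {
            ArgClass::Integer => "Integer",
            ArgClass::String => "String",
        }
    }
}

/// Number of rows produced if every combination of concrete types is listed.
pub fn expanded_row_count(args: &[ArgClass]) -> usize {
    args.iter().map(|a| a.members().len()).product()
}

/// A single compact row that names classes instead of concrete types.
pub fn compact_row(args: &[ArgClass]) -> String {
    let names: Vec<&str> = args.iter().map(|a| a.name()).collect();
    format!("({})", names.join(", "))
}
```

For a three-argument signature of `[Integer, Integer, String]`, the expanded listing needs 4 × 4 × 3 = 48 rows in this model, while the compact form is the single row `(Integer, Integer, String)`.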
I think it is "simplified" for a reason: if we add all possible types, the number of combinations will be huge. This function is used in information.schema, which shows the possible signatures a function has. Is listing all possible DataType combinations helpful?
@jayzhan211 I'm moving the conversation to this thread to consolidate.
I totally understand your concern, here are my thoughts:
- I think we should rename this function and add a comment for clarity.
- information.schema is not reporting accurate information if we only list a subset of possible types. I like your idea of using an output type that isn't Vec<DataType>.
@@ -225,6 +229,45 @@ impl Display for TypeSignatureClass {
    }
}

impl TypeSignatureClass {
    /// Returns the default cast type for the given `TypeSignatureClass`.
I am not familiar with what a "default cast type" is.
I looked at the definition of LogicalType::default_cast_for and that didn't help me either (no comments)
https://github.com/apache/datafusion/blob/08d3b656d761a8e0d9e7f7c35c7b76cfec00e095/datafusion/common/src/types/native.rs#L201-L200
🤔
#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
    pub desired_type: TypeSignatureClass,
    pub allowed_casts: Vec<TypeSignatureClass>,
What is allowed_casts? Does that represent the source types that this coercion applies to?
If so, perhaps we could name this field allowed_source_types
+1 for allowed_source_types
@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
    }
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
I really like the idea of Coercion.
Can this idea be used for user defined coercions or coercions to specific (Arrow) types?
Also, is there any way to allow users to provide their own coercion rules? For example, if Sail / @shehabgamin wants to support converting numeric values to strings automatically, would he be able to express that?
+1 this would be really great to have!
Similar to ascii, where binary to string is supported, you can add numeric types to allowed_source_types.
@jayzhan211, I believe @alamb's question (please correct me if I'm wrong) is about creating functionality for a downstream user to override the default signature of a UDF in order to provide their own coercion rules.
For example, something like this:
let scalar_expr = ScalarExprBuilder::new(AsciiFunc::new(), args)
.with_signature(Signature::any(1, Volatility::Immutable))
.build()
.map(Arc::new)?;
This is the conversation we were having here as well:
#14296 (comment)
#14296 (comment)
@jayzhan211 Thanks for taking on this initiative, I like the direction of where this is going!
My biggest takeaway is that is_matched_type + default_casted_type together seem too ambiguous/restrictive as is. Together they add assumptions tightly coupled to the implementation, which makes it hard for users to tailor logic according to their needs. I also worry that there isn't clarity into what's going on unless you carefully dig into the code. IMO the logic should be very simple and as follows:
- If source_type is desired_type OR source_type is one of allowed_casts, then if needed: desired_type.default_cast_for(source_type).
It would also be really great if we could find an elegant way to handle DataType::Timestamp with a wildcard for both TimeUnit and TZ (where the TZ wildcard also matches None). Some effort has already been made regarding the signature (which might be sufficient for now), but it would also be useful to be able to use the wildcard in the return_type function as well.
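One way the requested timestamp wildcard could look is sketched below, purely as an illustration. `ToyTimeUnit`, `UnitPattern`, `TzPattern`, and `TimestampPattern` are hypothetical names; nothing like them is claimed to exist in DataFusion or Arrow. The point of the sketch is the `TzPattern::Any` arm, which matches the absence of a timezone as well as any named one.

```rust
/// Toy time unit (hypothetical stand-in for Arrow's `TimeUnit`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyTimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Pattern over time units: a wildcard or one exact unit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UnitPattern {
    Any,
    Exactly(ToyTimeUnit),
}

/// Pattern over timezones.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TzPattern {
    /// Matches every timezone, including the absence of one.
    Any,
    /// Matches only this exact setting (`None` means "no timezone").
    Exactly(Option<String>),
}

/// Hypothetical pattern for a timestamp type.
pub struct TimestampPattern {
    pub unit: UnitPattern,
    pub tz: TzPattern,
}

impl TimestampPattern {
    /// True when a timestamp with the given unit and timezone fits the pattern.
    pub fn matches(&self, unit: ToyTimeUnit, tz: Option<&str>) -> bool {
        let unit_ok = match &self.unit {
            UnitPattern::Any => true,
            UnitPattern::Exactly(u) => *u == unit,
        };
        let tz_ok = match &self.tz {
            TzPattern::Any => true,
            TzPattern::Exactly(expected) => expected.as_deref() == tz,
        };
        unit_ok && tz_ok
    }
}
```

A pattern with `UnitPattern::Any` and `TzPattern::Any` accepts every timestamp in this model, whether or not a timezone is set. Using such a pattern in a return type is a separate problem, since a concrete type still has to be chosen there.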
}
// This is an existing use case for casting string to timestamp, since we don't have specific unit and timezone from string,
// so we use the default timestamp type with nanosecond precision and no timezone
// TODO: Consider allowing the user to specify the default timestamp type instead of having it predefined in DataFusion when we have more use cases
The user should be allowed to specify the default timestamp type.
match signature_classes {
    TypeSignatureClass::Native(l) => get_data_types(l.native()),
    TypeSignatureClass::Timestamp => {
        vec![
The rest of the TimeUnits are missing. It would be great if TIMEZONE_WILDCARD encompassed None, and it would also be great if there was a wildcard for TimeUnit.
I think it is "simplified" for a reason: if we add all possible types, the number of combinations will be huge. This function is used in information.schema, which shows the possible signatures a function has. Is listing all possible DataType combinations helpful?
    vec![DataType::Date64]
}
TypeSignatureClass::Time => {
    vec![DataType::Time64(TimeUnit::Nanosecond)]
This should encompass all possible DataType::Time32 and DataType::Time64. Also same comment as above regarding the TimeUnit wildcard.
    ]
}
TypeSignatureClass::Date => {
    vec![DataType::Date64]
This should be DataType::Date32 and DataType::Date64
    vec![DataType::Time64(TimeUnit::Nanosecond)]
}
TypeSignatureClass::Interval => {
    vec![DataType::Interval(IntervalUnit::DayTime)]
This should encompass all possible DataType::Interval
    vec![DataType::Interval(IntervalUnit::DayTime)]
}
TypeSignatureClass::Duration => {
    vec![DataType::Duration(TimeUnit::Nanosecond)]
This should encompass all possible DataType::Duration. Also same comment as above regarding the TimeUnit wildcard.
    vec![DataType::Duration(TimeUnit::Nanosecond)]
}
TypeSignatureClass::Integer => {
    vec![DataType::Int64]
This should encompass all possible ints.
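For reference, a sketch of what an exhaustive listing per class might look like, using toy enums so it stays self-contained. `Unit`, `IntervalKind`, `ToyDataType`, `ToyClass`, and `all_types` are hypothetical names. Two assumptions are baked in: the integer class is taken to include the unsigned types, and Timestamp is left out because its timezone component makes the set unbounded.

```rust
/// Toy time unit (hypothetical stand-in for Arrow's `TimeUnit`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Toy interval unit (hypothetical stand-in for Arrow's `IntervalUnit`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IntervalKind {
    YearMonth,
    DayTime,
    MonthDayNano,
}

/// Toy subset of data types relevant to the classes discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyDataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Date32, Date64,
    Time32(Unit), Time64(Unit),
    Duration(Unit),
    Interval(IntervalKind),
}

/// Toy signature classes (Timestamp intentionally omitted).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyClass {
    Date,
    Time,
    Interval,
    Duration,
    Integer,
}

/// Lists every concrete toy type that belongs to `class`.
pub fn all_types(class: ToyClass) -> Vec<ToyDataType> {
    use ToyDataType::*;
    match class {
        ToyClass::Date => vec![Date32, Date64],
        // Arrow's Time32 uses second/millisecond; Time64 uses microsecond/nanosecond.
        ToyClass::Time => vec![
            Time32(Unit::Second),
            Time32(Unit::Millisecond),
            Time64(Unit::Microsecond),
            Time64(Unit::Nanosecond),
        ],
        ToyClass::Interval => vec![
            Interval(IntervalKind::YearMonth),
            Interval(IntervalKind::DayTime),
            Interval(IntervalKind::MonthDayNano),
        ],
        ToyClass::Duration => vec![
            Duration(Unit::Second),
            Duration(Unit::Millisecond),
            Duration(Unit::Microsecond),
            Duration(Unit::Nanosecond),
        ],
        // Assumption: the integer class covers signed and unsigned widths.
        ToyClass::Integer => vec![Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64],
    }
}
```

Even in this reduced model, a two-argument function over `Integer` and `Duration` would expand to 8 × 4 = 32 concrete combinations, which is the readability trade-off raised elsewhere in this thread.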
@@ -47,3 +49,11 @@ singleton!(LOGICAL_FLOAT64, logical_float64, Float64);
singleton!(LOGICAL_DATE, logical_date, Date);
singleton!(LOGICAL_BINARY, logical_binary, Binary);
singleton!(LOGICAL_STRING, logical_string, String);

// TODO: Extend macro
// TODO: Should we use LOGICAL_TIMESTAMP_NANO to distinguish unit and timezone?
It would be great if there was a wildcard to match all the units and timezones.
TypeSignatureClass::Timestamp acts as a wildcard, covering all possible units and timezones. If you're considering introducing a similar concept in NativeType, we should also discuss whether we need to include Numeric, Integer, or Float as part of it.
@@ -209,14 +210,13 @@ impl TypeSignature {
#[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)]
pub enum TypeSignatureClass {
    Timestamp,
    Date,
We can use NativeType::Date instead; there is no need for TypeSignatureClass::Date.
Using TypeSignatureClass::Timestamp can represent a wildcard of timestamp. Defining a timestamp wildcard for the return type might not be possible, since we return an Arrow type and it doesn't have a wildcard type.
.any(|t| is_matched_type(t, &current_logical_type)) {
    // If the condition is met, `implicit coercion` is provided, so we can safely unwrap
    let default_casted_type = param.default_casted_type().unwrap();
    let casted_type = default_casted_type.default_cast_for(current_type)?;
I didn't find a simpler logic.
@jayzhan211 I will re-review by tomorrow EOD!
I revisited the binary-to-string coercion for ascii, but it seems either Postgres and DuckDB correctly support it.
///
/// Get all possible types for `information_schema` from the given `TypeSignature`
//
// TODO: Make this function private
pub fn get_possible_types(&self) -> Vec<Vec<DataType>> {
Nice comment, can we rename the function too?
@jayzhan211 Should we port any relevant tests from the old PR?
Or close this first, then revisit #14268. BTW, most of the binary-to-string conversions mentioned in #14268 might not be ideal for DataFusion. We should reconsider them.