Signature::Coercible with user defined implicit casting #14440

Open
wants to merge 24 commits into
base: main

@jayzhan211 jayzhan211 commented Feb 3, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions functions labels Feb 3, 2025

#[derive(Debug, Clone, Eq, PartialOrd, Hash)]
pub struct ParameterType {
pub param_type: LogicalTypeRef,
Contributor Author

param_type: target type of function signature
allowed_casts: implicit coercion allowed to cast to target type.

For example,
param_type: string
allowed_casts: binary, int

Valid: all are cast to string
func(string) 
func(binary)
func(int)
Invalid:
func(float or other types)
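
A minimal sketch of the rule described above may help. `LogicalType` here is a hypothetical, simplified stand-in (not DataFusion's `NativeType`), and `coerce` and `string_param` are illustrative names rather than anything in this PR:

```rust
/// Hypothetical, simplified logical types used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    String,
    Binary,
    Int,
    Float,
}

/// A target type plus the source types that may be implicitly cast to it.
struct ParameterType {
    param_type: LogicalType,
    allowed_casts: Vec<LogicalType>,
}

impl ParameterType {
    /// Returns the type the argument ends up as, or `None` if it is rejected.
    fn coerce(&self, arg: LogicalType) -> Option<LogicalType> {
        if arg == self.param_type || self.allowed_casts.contains(&arg) {
            Some(self.param_type)
        } else {
            None
        }
    }
}

/// The example from the comment: target `string`, allowed casts `binary` and `int`.
fn string_param() -> ParameterType {
    ParameterType {
        param_type: LogicalType::String,
        allowed_casts: vec![LogicalType::Binary, LogicalType::Int],
    }
}
```

With `string_param()`, string, binary and int arguments all resolve to `String`, while a float argument is rejected.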

Contributor

I am worried that Vec<Vec<ParameterType>> might become challenging to reason about. It would also be unclear when one should use this new signature rather than Signature::Coercible 🤔

Given this seems very similar to Signature::Coercible, and Signature::Coercible mirrors what we want pretty well, could we add some new information there on the allowed coercions rather than an entirely new type signature? Something like extending Coercible with rules that describe what can be coerced to the target type:

pub enum TypeSignature {
...
    Coercible(Vec<Coercion>),
...
}

Where Coercion looks like

struct Coercion {
  desired_type: TypeSignatureClass,
  allowed_casts: ... // also includes an option for User Defined?
}

🤔

Contributor Author

I would like to make a breaking change to Signature::Coercible too; my only concern is whether this would cause regressions or have a large impact downstream.

Contributor Author

@shehabgamin If we replace CoercibleV2 with Signature::Coercible, would it be a large change that we should be concerned about?

Contributor Author

If not, I'll replace Signature::Coercible with Signature::CoercibleV2.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 6, 2025
@jayzhan211 jayzhan211 marked this pull request as draft February 7, 2025 00:38
@github-actions github-actions bot added the common Related to common crate label Feb 7, 2025
@jayzhan211 jayzhan211 changed the title Draft: coercible signature Signature::Coercible with user defined implicit casting Feb 7, 2025
vec![
TypeSignatureClass::Native(logical_string()),
TypeSignatureClass::Native(logical_int64()),
Contributor Author

In the v1 version, any integer is cast to the default cast type of the defined NativeType, which is i64 in this case.

// Accept all integer types but cast them to i64
Coercion {
desired_type: TypeSignatureClass::Native(logical_int64()),
allowed_casts: vec![TypeSignatureClass::Integer],
Contributor Author

Without this, i32 is rejected.
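
To illustrate why the `Integer` entry matters, here is a rough sketch under simplifying assumptions: `DataType` and `TypeClass` are hypothetical stand-ins for the Arrow type and `TypeSignatureClass`, and `desired_type` is a concrete type rather than a class so that no default-cast lookup is needed:

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int8,
    Int32,
    Int64,
    Float64,
}

/// Hypothetical stand-in for a type class: either one exact type or "any integer".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TypeClass {
    Native(DataType),
    Integer,
}

impl TypeClass {
    /// Does the concrete type belong to this class?
    fn matches(self, t: DataType) -> bool {
        match self {
            TypeClass::Native(native) => native == t,
            TypeClass::Integer => {
                matches!(t, DataType::Int8 | DataType::Int32 | DataType::Int64)
            }
        }
    }
}

struct Coercion {
    desired_type: DataType,
    allowed_casts: Vec<TypeClass>,
}

/// Returns the type to cast to, or `None` when the source type is rejected.
fn coerce(c: &Coercion, source: DataType) -> Option<DataType> {
    let allowed = c.allowed_casts.iter().any(|class| class.matches(source));
    if source == c.desired_type || allowed {
        Some(c.desired_type)
    } else {
        None
    }
}
```

In this sketch, a `Coercion` with `allowed_casts: vec![TypeClass::Integer]` maps `Int32` to `Int64`, while the same coercion with an empty `allowed_casts` rejects `Int32` and accepts only `Int64`.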

signature: Signature::coercible_v2(
vec![Coercion {
desired_type: TypeSignatureClass::Native(logical_string()),
allowed_casts: vec![TypeSignatureClass::Native(logical_binary())],
Contributor Author

Coercing binary to string is now easily customizable

@jayzhan211 jayzhan211 marked this pull request as ready for review February 7, 2025 03:57
@jayzhan211 jayzhan211 added the api change Changes the API exposed to users of the crate label Feb 7, 2025
@jayzhan211 jayzhan211 requested a review from alamb February 7, 2025 13:04

alamb commented Feb 7, 2025

I will try and review this carefully over the weekend

Maybe @shehabgamin has some time to take a look too

@shehabgamin
Contributor

I will try and review this carefully over the weekend

Maybe @shehabgamin has some time to take a look too

I will review over the weekend as well!

Contributor

@alamb alamb left a comment

Thank you @jayzhan211 -- I really like where this is heading. I think the Coercion idea is the concept that DataFusion has been missing

It seems like there are three kinds of Coercion I know about:

  1. Coercing to a (single) specific data type
  2. Coercing to one of a specific logical type (eg. one of the String types)
  3. Coercing two types to a "comparable" / compatible one (where comparison is differently defined)

It seems in all cases, the operation that is desired is:

Given an existing Arrow::DataType and a coercion (that describes a desired type), what cast, if any, can be applied so the coercion is satisfied?

If we could encapsulate this into Coercion somehow and update all our operations in terms of Coercion I think we would be able to stop churning this code.

Also once it is in Coercion then there is a natural place for other systems to provide their own coercion rules
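
One way to picture that operation across a whole argument list is the sketch below. It is only an illustration under stated assumptions: `DataType`, `Cast`, `cast_for` and `plan_casts` are hypothetical names, and the real Arrow and DataFusion types are considerably richer:

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Float32,
    Float64,
    Utf8,
    Binary,
}

/// What has to happen to one argument so that the coercion is satisfied.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Cast {
    NotNeeded,
    To(DataType),
}

struct Coercion {
    desired_type: DataType,
    allowed_casts: Vec<DataType>,
}

impl Coercion {
    /// Answers "which cast, if any, satisfies this coercion for `source`?"
    fn cast_for(&self, source: DataType) -> Option<Cast> {
        if source == self.desired_type {
            Some(Cast::NotNeeded)
        } else if self.allowed_casts.contains(&source) {
            Some(Cast::To(self.desired_type))
        } else {
            None
        }
    }
}

/// Applies one coercion per argument; fails on an arity mismatch or on the
/// first argument that cannot be coerced.
fn plan_casts(signature: &[Coercion], args: &[DataType]) -> Result<Vec<Cast>, String> {
    if signature.len() != args.len() {
        return Err(format!(
            "expected {} arguments, got {}",
            signature.len(),
            args.len()
        ));
    }
    signature
        .iter()
        .zip(args)
        .map(|(coercion, &arg)| {
            coercion.cast_for(arg).ok_or_else(|| {
                format!("cannot coerce {:?} to {:?}", arg, coercion.desired_type)
            })
        })
        .collect()
}
```

Keeping the decision in a single method like `cast_for` is what would give downstream systems one place to plug in their own rules, which is the point raised above.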

Questions:

  1. Can we make Signature::Coercible general enough to represent the same thing as Signature::Numeric and Signature::String ? (if so perhaps we can deprecate them as a follow on PR)

/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
/// [`Coercion`] contains not only the desired type but also the allowed casts.
Contributor

this API makes a lot of sense to me-- in fact I think it is pretty close to being able to express most other signatures.

@@ -431,6 +463,35 @@ impl TypeSignature {
}
}

fn get_possible_types_from_signature_classes(
Contributor

this would make more sense to me as a method on TypeSignatureClass

Contributor Author

@jayzhan211 jayzhan211 Feb 9, 2025

This function is used in information.schema to list all possible signature combinations, but it is more granular than necessary; sometimes a NativeType, or a new struct that represents a set of DataTypes, should be enough.

@goldmedal
If a function requires Integer, we don't need to list every possible i8, i16, i32 or i64, just integer. I think we need an output other than Vec<DataType>: something that can combine a DataType with a set of types, similar to NativeType. Otherwise we will generate tons of DataType combinations for the Coercible signature, and I guess that is not readable in information.schema
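
The size concern can be made concrete with a small sketch. Everything here is hypothetical and simplified (the `Integer` class lists only four signed types, and `expand` / `summarize` are illustrative names), but it shows how enumerating concrete types multiplies while a class-level summary stays one entry per argument:

```rust
/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int8,
    Int16,
    Int32,
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical type classes; the real classes cover more types than this.
#[derive(Debug, Clone, Copy)]
enum TypeClass {
    Integer,
    String,
}

impl TypeClass {
    fn members(self) -> Vec<DataType> {
        match self {
            TypeClass::Integer => vec![
                DataType::Int8,
                DataType::Int16,
                DataType::Int32,
                DataType::Int64,
            ],
            TypeClass::String => {
                vec![DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View]
            }
        }
    }

    fn name(self) -> &'static str {
        match self {
            TypeClass::Integer => "Integer",
            TypeClass::String => "String",
        }
    }
}

/// Cartesian product of every concrete type for every argument position.
fn expand(classes: &[TypeClass]) -> Vec<Vec<DataType>> {
    let mut combos: Vec<Vec<DataType>> = vec![vec![]];
    for class in classes {
        let mut next = Vec::new();
        for prefix in &combos {
            for member in class.members() {
                let mut row = prefix.clone();
                row.push(member);
                next.push(row);
            }
        }
        combos = next;
    }
    combos
}

/// One readable entry per argument instead of one row per combination.
fn summarize(classes: &[TypeClass]) -> String {
    classes
        .iter()
        .map(|c| c.name())
        .collect::<Vec<_>>()
        .join(", ")
}
```

For a three-argument signature of `Integer, String, Integer`, `expand` yields 4 × 3 × 4 = 48 rows even with these reduced class sizes, whereas `summarize` yields a single line.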

Contributor

I think it is "simplified" for a reason: if we add all possible types, the number of combinations will be huge. This function is used in information.schema to show the possible signatures a function has. Is listing every possible DataType combination helpful?

@jayzhan211 I'm moving the conversation to this thread to consolidate.

I totally understand your concern, here are my thoughts:

  • I think we should rename this function and add a comment for clarity.
  • information.schema is not reporting accurate information if we only list a subset of possible types. I like your idea of using other output that isn't Vec<DataType>.

@@ -225,6 +229,45 @@ impl Display for TypeSignatureClass {
}
}

impl TypeSignatureClass {
/// Returns the default cast type for the given `TypeSignatureClass`.
Contributor

I am not familiar with what a "default cast type" is

I looked at the definition of LogicalType::default_cast_for and that didn't help me either (no comments)
https://github.com/apache/datafusion/blob/08d3b656d761a8e0d9e7f7c35c7b76cfec00e095/datafusion/common/src/types/native.rs#L201-L200

🤔

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
pub desired_type: TypeSignatureClass,
pub allowed_casts: Vec<TypeSignatureClass>,
Contributor

What is allowed_casts? Does that represent the source types that this coercion applies to?

If so, perhaps we could name this field allowed_source_types

Contributor

+1 for allowed_source_types

@@ -460,6 +521,44 @@ fn get_data_types(native_type: &NativeType) -> Vec<DataType> {
}
}

#[derive(Debug, Clone, Eq, PartialOrd)]
pub struct Coercion {
Contributor

I really like the idea of Coercion.

Can this idea be used for user defined coercions or coercions to specific (Arrow) types?

Also, is there any way to allow users to provide their own coercion rules? For example, if Sail / @shehabgamin wants to support converting numeric values to strings automatically, would he be able to express that?

Contributor

+1 this would be really great to have!

Contributor Author

Similar to ascii, where binary to string is supported, you can add numeric types to allowed_source_types

Contributor

@jayzhan211 , I believe @alamb 's question (please correct me if I'm wrong) is about creating functionality for a downstream user to override the default signature of a UDF in order to provide their own coercion rules.

For example, something like this:

let scalar_expr = ScalarExprBuilder::new(AsciiFunc::new(), args)
                        .with_signature(Signature::any(1, Volatility::Immutable))
                        .build()
                        .map(Arc::new)?;

This is the conversation we were having here as well:
#14296 (comment)
#14296 (comment)

Contributor

@shehabgamin shehabgamin left a comment

@jayzhan211 Thanks for taking on this initiative, I like the direction of where this is going!

My biggest takeaway is that is_matched_type + default_casted_type together seem too ambiguous/restrictive as is. Together they add assumptions tightly coupled to the implementation, which makes it hard for users to tailor logic according to their needs. I also worry that there isn't clarity into what's going on unless you carefully dig into the code. IMO the logic should be very simple and as follows:

  • If source_type is desired_type OR source_type is one of allowed_casts, then if needed: desired_type.default_cast_for(source_type).

It would also be really great if we could find an elegant way to handle DataType::Timestamp with a wildcard for both TimeUnit and TZ (where the TZ wildcard also matches None). Some effort has already been made regarding the signature (which might be sufficient for now), but it would also be useful to be able to use the wildcard in the return_type function as well.
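
As a rough illustration of the wildcard idea, the sketch below matches a timestamp against a pattern whose unit and timezone can each be "any", with the timezone wildcard also matching `None`. All names are hypothetical, and the timezone is modelled as `Option<String>` purely for simplicity:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Hypothetical, simplified stand-in for Arrow's `DataType`.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DataType {
    Timestamp(TimeUnit, Option<String>),
    Utf8,
}

/// Either any time unit or one specific unit.
enum UnitPattern {
    Any,
    Exact(TimeUnit),
}

/// `Any` matches every timezone, including the absence of one (`None`).
enum TzPattern {
    Any,
    Exact(Option<String>),
}

struct TimestampPattern {
    unit: UnitPattern,
    tz: TzPattern,
}

impl TimestampPattern {
    fn matches(&self, data_type: &DataType) -> bool {
        // Anything that is not a timestamp never matches.
        let DataType::Timestamp(unit, tz) = data_type else {
            return false;
        };
        let unit_ok = match &self.unit {
            UnitPattern::Any => true,
            UnitPattern::Exact(expected) => expected == unit,
        };
        let tz_ok = match &self.tz {
            TzPattern::Any => true,
            TzPattern::Exact(expected) => expected == tz,
        };
        unit_ok && tz_ok
    }
}
```

A pattern with both fields set to `Any` accepts every timestamp regardless of unit or timezone, while `Exact(TimeUnit::Nanosecond)` with `Exact(None)` accepts only nanosecond timestamps that carry no timezone.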

}
// This is an existing use case for casting string to timestamp, since we don't have specific unit and timezone from string,
// so we use the default timestamp type with nanosecond precision and no timezone
// TODO: Consider allowing the user to specify the default timestamp type instead of having it predefined in DataFusion when we have more use cases
Contributor

The user should be allowed to specify the default timestamp type.

match signature_classes {
TypeSignatureClass::Native(l) => get_data_types(l.native()),
TypeSignatureClass::Timestamp => {
vec![
Contributor

The rest of the TimeUnits are missing. It would be great if TIMEZONE_WILDCARD encompassed None and it would also be great if there was a wildcard for TimeUnit.

Contributor Author

I think it is "simplified" for a reason: if we add all possible types, the number of combinations will be huge. This function is used in information.schema to show the possible signatures a function has. Is listing every possible DataType combination helpful?

vec![DataType::Date64]
}
TypeSignatureClass::Time => {
vec![DataType::Time64(TimeUnit::Nanosecond)]
Contributor

This should encompass all possible DataType::Time32 and DataType::Time64. Also same comment as above regarding TimeUnit wildcard.

]
}
TypeSignatureClass::Date => {
vec![DataType::Date64]
Contributor

This should be DataType::Date32 and DataType::Date64

vec![DataType::Time64(TimeUnit::Nanosecond)]
}
TypeSignatureClass::Interval => {
vec![DataType::Interval(IntervalUnit::DayTime)]
Contributor

This should encompass all possible DataType::Interval

vec![DataType::Interval(IntervalUnit::DayTime)]
}
TypeSignatureClass::Duration => {
vec![DataType::Duration(TimeUnit::Nanosecond)]
Contributor

This should encompass all possible DataType::Duration. Also same comment as above regarding TimeUnit wildcard.

vec![DataType::Duration(TimeUnit::Nanosecond)]
}
TypeSignatureClass::Integer => {
vec![DataType::Int64]
Contributor

This should encompass all possible ints.

@@ -47,3 +49,11 @@ singleton!(LOGICAL_FLOAT64, logical_float64, Float64);
singleton!(LOGICAL_DATE, logical_date, Date);
singleton!(LOGICAL_BINARY, logical_binary, Binary);
singleton!(LOGICAL_STRING, logical_string, String);

// TODO: Extend macro
// TODO: Should we use LOGICAL_TIMESTAMP_NANO to distinguish unit and timezone?
Contributor

It would be great if there was a wildcard to match all the units and timezones.

Contributor Author

@jayzhan211 jayzhan211 Feb 9, 2025

TypeSignatureClass::Timestamp acts as a wildcard, covering all possible units and timezones. If you're considering introducing a similar concept in NativeType, we should also discuss whether we need to include Numeric, Integer, or Float as part of it.

@@ -209,14 +210,13 @@ impl TypeSignature {
#[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)]
pub enum TypeSignatureClass {
Timestamp,
Date,
Contributor Author

We can use NativeType::Date instead; there is no need for TypeSignatureClass::Date

@jayzhan211
Contributor Author

but it would also be useful to be able to use the wildcard in the return_type function as well.

TypeSignatureClass::Timestamp can represent a timestamp wildcard. Defining a timestamp wildcard for the return type might not be possible, since we return an Arrow type and Arrow doesn't have a wildcard type

.any(|t| is_matched_type(t, &current_logical_type)) {
// If the condition is met, an `implicit coercion` is provided, so we can safely unwrap
let default_casted_type = param.default_casted_type().unwrap();
let casted_type = default_casted_type.default_cast_for(current_type)?;
Contributor Author

I couldn't find a simpler way to express this logic

@shehabgamin
Contributor

@jayzhan211 I will re-review by tomorrow EOD!


jayzhan211 commented Feb 11, 2025

I revisited the binary-to-string coercion for ascii, but it seems neither Postgres nor DuckDB actually supports it.

postgres=# select ascii('0xa');
 ascii 
-------
    48
(1 row)

postgres=# select ascii(X'a');
ERROR:  function ascii(bit) does not exist
LINE 1: select ascii(X'a');
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii(x'a');
ERROR:  function ascii(bit) does not exist
LINE 1: select ascii(x'a');
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii('xa');
 ascii 
-------
   120
(1 row)
D select ascii('0xa');
┌──────────────┐
│ ascii('0xa') │
│    int32     │
├──────────────┤
│           48 │
└──────────────┘
D select ascii(x'a');
┌─────────────┐
│ ascii('xa') │
│    int32    │
├─────────────┤
│         120 │
└─────────────┘
D select ascii('xa');
┌─────────────┐
│ ascii('xa') │
│    int32    │
├─────────────┤
│         120 │
└─────────────┘

Note that DuckDB processes ascii(x'a') as ascii('xa'), which is not binary

///
/// Get all possible types for `information_schema` from the given `TypeSignature`
//
// TODO: Make this function private
pub fn get_possible_types(&self) -> Vec<Vec<DataType>> {
Contributor

Nice comment, can we rename the function too?

@shehabgamin
Contributor

@jayzhan211 Should we port any relevant tests from the old PR?
#14268


jayzhan211 commented Feb 12, 2025

@jayzhan211 Should we port any relevant tests from the old PR? #14268

or close this first then revisit #14268

BTW, most of the binary-to-string conversions mentioned in #14268 might not be ideal for DataFusion. We should reconsider them.

Labels
api change Changes the API exposed to users of the crate common Related to common crate functions logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
3 participants