Runtime-adaptive data representation #12720
Prior discussion: #7421

The issue example is about plain, dictionary-encoded, and REE-encoded data (also covered by #7421).
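For concreteness, here is a minimal sketch of the same logical string column in the three representations mentioned, using the arrow crate's builder APIs (the exact builder names and re-export paths are my assumption about the current arrow-rs API surface):

```rust
use arrow::array::{Array, StringArray, StringDictionaryBuilder, StringRunBuilder};
use arrow::datatypes::Int32Type;

fn main() {
    // Plain: every value stored inline.
    let plain = StringArray::from(vec!["a", "a", "a", "b"]);

    // Dictionary: distinct values stored once; rows store integer keys.
    let mut dict_builder = StringDictionaryBuilder::<Int32Type>::new();
    for v in ["a", "a", "a", "b"] {
        dict_builder.append_value(v);
    }
    let dict = dict_builder.finish();

    // Run-end encoded (REE): consecutive repeats collapse into (run end, value) pairs.
    let mut ree_builder = StringRunBuilder::<Int32Type>::new();
    for v in ["a", "a", "a", "b"] {
        ree_builder.append_value(v);
    }
    let ree = ree_builder.finish();

    // Same logical contents, three different DataTypes.
    println!("{:?}", plain.data_type()); // Utf8
    println!("{:?}", dict.data_type());  // Dictionary(Int32, Utf8)
    println!("{:?}", ree.data_type());   // RunEndEncoded(...)
}
```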
Why is it forced to be the same type? Maybe we should address this issue first.
We need arrow::kernel support for various data types. For instance, operations like comparison currently don't support execution across different types. It's unclear whether this is due to a lack of implementation or because it's not considered a good approach. @tustvold, do you think implementing arrow::kernel for types like …
As a general rule we don't support operations on heterogeneous types, to avoid the combinatorial explosion of codegen that would result and the corresponding impact on build times and binary size. There are some exceptions to this, though.
I don't think as a general rule it makes sense to support heterogeneous operations, but rather only where there is a compelling justification for why coercion isn't appropriate. For example, supporting mixed StringArray, LargeStringArray, and StringViewArray seems hard to justify, given coercion is extremely cheap. Widening casts for decimals would likely be similar.

Where/when this coercion takes place is a question for DF, but IMO I would expect the physical plan to be in terms of the physical arrow types, with the logical plan potentially in terms of some higher-level logical datatype (although I do wonder if this type grouping might be operator-specific and therefore make more sense as an internal utility). This allows the physical optimizer to make decisions about when and how coercion takes place, as opposed to this being an implicit behaviour of the kernels (potentially performing the same conversion multiple times redundantly).
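To illustrate how cheaply this kind of up-front coercion can be expressed, here is a sketch using arrow's cast kernel. Choosing Utf8View as the common type is my assumption (and the Utf8/LargeUtf8 to Utf8View casts require a recent arrow-rs version); a real planner would pick the target per operator:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, LargeStringArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Coerce both inputs to one common string representation once, up front,
// instead of teaching every kernel about every (lhs, rhs) type pair.
fn coerce_strings(lhs: ArrayRef, rhs: ArrayRef) -> Result<(ArrayRef, ArrayRef), ArrowError> {
    let target = DataType::Utf8View; // assumed choice of common type
    Ok((cast(&lhs, &target)?, cast(&rhs, &target)?))
}

fn main() -> Result<(), ArrowError> {
    let a: ArrayRef = Arc::new(StringArray::from(vec!["x", "y"]));
    let b: ArrayRef = Arc::new(LargeStringArray::from(vec!["x", "z"]));
    let (a, b) = coerce_strings(a, b)?;
    // Both sides now share one DataType; any Utf8View kernel can run on them.
    assert_eq!(a.data_type(), b.data_type());
    Ok(())
}
```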
@findepi I think we need to apply some coercion in the physical optimizer for the physical arrow types, like the String family, or for types that don't make sense to support in kernel functions. @tustvold DataFusion keeps ScalarValue …
IMO ScalarValue shouldn't ever be on the hot path; if it is, it indicates an issue with the way that kernel has been implemented. It has been a while since I looked at DF, but it used to implement a lot of the windowed aggregates and array functions using ScalarValue when they probably shouldn't have. IMO unless it is arrow kernels that are bottlenecked on ScalarValue::Utf8, it wouldn't make sense to push this into arrow-rs.
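For context, arrow-rs's comparison kernels take Datum arguments, so a scalar is passed as a one-row array wrapped in Scalar rather than materialized per row. A minimal sketch (the re-export path arrow::compute::kernels::cmp is my assumption about where the kernel lives in current versions):

```rust
use arrow::array::{BooleanArray, Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    let column = StringArray::from(vec!["a", "b", "b"]);
    // A scalar is just a one-row array wrapped in `Scalar`, so the kernel
    // sees a `Datum` and never touches a per-row ScalarValue-style enum.
    let needle = Scalar::new(StringArray::from(vec!["b"]));
    let mask: BooleanArray = eq(&column, &needle)?;
    assert_eq!(mask, BooleanArray::from(vec![false, true, true]));
    Ok(())
}
```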
I guess we're in agreement that this pertains only to the lowest-level operations exposed by arrow.
This is definitely an option. This is what is intended by #12622 (cc @notfilippo, @tobixdev). Maybe it is a premature idea, given that we're not done with #12622 yet.
Originally posted by @andygrove in #11513 (comment)
DataFusion physical planning result mandates a particular Arrow type (DataType) for each of the processed columns. This doesn't reflect the reality of modern systems though. Data may come in various representations (different DataTypes), and forcing a single common representation is not efficient.

Example 1:
A plain table scan reading Parquet files. The same column may be represented differently in individual files (plain array vs RLE/REE vs Dictionary), and it is not optimally efficient to force a particular data layout on the output of the table scan.
Example 2:
A UNION ALL query may union data from multiple sources, which can naturally produce data in different data types.
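To make the problem behind both examples concrete, here is a sketch of what a UNION ALL style concatenation runs into today. It assumes plain Utf8 is chosen as the unified representation, and relies on arrow's documented behaviour that concat rejects mismatched DataTypes:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray, StringDictionaryBuilder};
use arrow::compute::{cast, concat};
use arrow::datatypes::{DataType, Int32Type};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Source 1 produces a plain string column...
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));

    // ...while source 2 produces the same logical column dictionary-encoded.
    let mut builder = StringDictionaryBuilder::<Int32Type>::new();
    builder.append_value("b");
    builder.append_value("b");
    let dict: ArrayRef = Arc::new(builder.finish());

    // concat requires a single DataType, so the two sources cannot be
    // combined as-is...
    assert!(concat(&[&plain, &dict]).is_err());

    // ...and today the plan must pick one representation up front and
    // force every source to it.
    let dict_as_plain = cast(&dict, &DataType::Utf8)?;
    let unified = concat(&[&plain, &dict_as_plain])?;
    assert_eq!(unified.len(), 4);
    Ok(())
}
```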
Context

Relaxing the physical plan's fixed typing (DataType) requires some higher-level notion of types in DataFusion, which is to be delivered in [Proposal] Decouple logical from physical types #11513 and [EPIC] Decouple logical from physical types #12622.