
Runtime-adaptive data representation #12720

Open
findepi opened this issue Oct 2, 2024 · 8 comments

@findepi
Member

findepi commented Oct 2, 2024

Originally posted by @andygrove in #11513 (comment)

We are running into the "RecordBatches with the same logical type but different physical types" issue in DataFusion Comet. For a single table, a column may be dictionary-encoded in some Parquet files and not in others, so we are forced to cast them all to the same type, which introduces unnecessary dictionary encoding (or decoding) overhead.

The result of DataFusion's physical planning mandates a particular Arrow type (DataType) for each of the processed columns.
This doesn't reflect the reality of modern systems, though.

  • source data may be naturally representable in different Arrow types (DataTypes), and forcing a single common representation is not efficient
  • adaptive execution of certain operations (like pre-aggregation) would benefit from being able to adjust data processing in response to incoming data characteristics observed at runtime

Example 1:
A plain table scan reading Parquet files. The same column may be represented differently in individual files (plain array vs RLE/REE vs Dictionary), and it is not optimally efficient to force a particular data layout on the output of the table scan.

Example 2:
A UNION ALL query may combine data from multiple sources, each of which may naturally produce data in different data types.
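To make Example 1 concrete, here is a minimal sketch (plain arrow-rs, not DataFusion's actual code path) of the coercion forced today: two batches carry the same logical column, but one file produced it dictionary-encoded and another produced it as a plain Utf8 array, so one side has to be cast to the single DataType declared by the plan.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() -> Result<(), arrow::error::ArrowError> {
    // Column "city" as produced by one Parquet file: plain Utf8.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));

    // The same logical column from another file: dictionary-encoded.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let dict: ArrayRef = Arc::new(dict);

    // Same logical type (String), different physical types.
    assert_ne!(plain.data_type(), dict.data_type());

    // Because the plan fixes one DataType, one side must be cast, paying the
    // dictionary decoding (or encoding) cost described in the Comet comment above.
    let unified = cast(&dict, &DataType::Utf8)?;
    assert_eq!(unified.data_type(), plain.data_type());
    Ok(())
}
```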

Context

@findepi
Member Author

findepi commented Feb 17, 2025

Prior discussion #7421

@findepi
Member Author

findepi commented Feb 17, 2025

The issue example is about plain, dictionary and REE encoded data (also covered by #7421).
However, we could do more.

  • For example, for the decimal(1,0) type we currently use 128 bits per value, where 8 would suffice. Plenty of waste. Support new Arrow types decimal32 and decimal64 (arrow-rs#6661) will help a bit, but more can be done.
  • For a column with the decimal(38,0) type, we may still use 128 bits per value even once the Decimal32 type is added. But what if the column actually contains only the numbers 0..9? The runtime representation could be a smaller-size integer type (see the sketch below).
    • judging by how Snowflake returns data over their Arrow interface, this is likely what they do internally
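As a rough illustration of the width gap in the second bullet (plain Rust against arrow-rs, not an existing DataFusion utility): storing the digits 0..=9 as decimal(38,0) costs 16 bytes per value in a Decimal128Array, while an Int8Array holding the same logical content needs one byte per value.

```rust
use arrow::array::{Array, Decimal128Array, Int8Array};

fn main() {
    // Ten small values stored as decimal(38, 0): 16 bytes each.
    let wide = Decimal128Array::from((0..10).collect::<Vec<i128>>())
        .with_precision_and_scale(38, 0)
        .unwrap();
    // The same logical content as 8-bit integers: 1 byte each.
    let narrow = Int8Array::from((0..10).collect::<Vec<i8>>());

    let wide_bytes = wide.len() * std::mem::size_of::<i128>(); // 160 bytes of values
    let narrow_bytes = narrow.len() * std::mem::size_of::<i8>(); // 10 bytes of values
    println!("decimal128 value buffer: {wide_bytes} B, int8 value buffer: {narrow_bytes} B");
}
```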

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

so we are forced to cast them all to the same type

Why is it forced to be the same type? Maybe we should address this issue first.

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

We would need arrow kernels that support operating across these various data types.

For instance, operations like comparison currently don’t support execution across different types. It's unclear whether this is due to a lack of implementation or if it's not considered a good approach.

https://github.com/apache/arrow-rs/blob/38d6e691f4ee1b356f28d77b6820de67166c51c3/arrow-ord/src/ord.rs#L355-L392

@tustvold, do you think implementing arrow kernels for types like Utf8 vs LargeUtf8 or Utf8 vs Dictionary(_, Utf8) is a good idea? These types are different but similar (they share the same logical type, i.e. String).
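For illustration, a hedged sketch of the status quo being described (using existing arrow-rs public APIs): comparing Utf8 against LargeUtf8 today is done by casting one side to a common physical type first, rather than by a heterogeneous kernel.

```rust
use arrow::array::{LargeStringArray, StringArray};
use arrow::compute::{cast, kernels::cmp};
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    let small = StringArray::from(vec!["a", "b", "c"]);
    let large = LargeStringArray::from(vec!["a", "x", "c"]);

    // Coerce LargeUtf8 -> Utf8 so both sides share one physical type...
    let coerced = cast(&large, &DataType::Utf8)?;

    // ...then the homogeneous comparison kernel applies.
    let eq = cmp::eq(&small, &coerced)?;
    assert_eq!(eq.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```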

@tustvold
Contributor

tustvold commented Feb 17, 2025

As a general rule we don't support operations on heterogeneous types, to avoid the combinatorial explosion of codegen that would result and the corresponding impact on build times and binary size.

There are some exceptions to this though:

  • Some arithmetic, e.g. intervals with dates and timestamps
  • Dictionaries with their non-dictionary encoded counterparts
  • Metadata only differences, e.g. timezones, decimal precision, etc...

I don't think as a general rule it makes sense to support heterogeneous operations, but rather only where there is a compelling justification for why coercion isn't appropriate. For example, supporting mixed StringArray, LargeStringArray, StringViewArray seems hard to justify, given coercion is extremely cheap. Widening casts for decimals would likely be similar.

Where/when this coercion takes place is a question for DF, but IMO I would expect the physical plan to be in terms of the physical arrow types, with the logical plan potentially in terms of some higher-level logical datatype (although I do wonder if this type grouping might be operator-specific and therefore make more sense as an internal utility). This allows the physical optimizer to make decisions about when and how coercion takes place, as opposed to this being an implicit behaviour of the kernels (potentially performing the same conversion multiple times redundantly).
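As a small illustration of the second exception listed above (a sketch assuming the current behaviour of arrow-rs's Datum-based cmp kernels, which compare on the dictionary's value type; not DataFusion code): a dictionary array can be compared directly against an array of its value type, without an up-front cast.

```rust
use arrow::array::{DictionaryArray, StringArray};
use arrow::compute::kernels::cmp;
use arrow::datatypes::Int32Type;

fn main() -> Result<(), arrow::error::ArrowError> {
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let plain = StringArray::from(vec!["a", "a", "a"]);

    // Dictionary(Int32, Utf8) vs Utf8: one of the supported heterogeneous pairings
    // (assumed here; most other heterogeneous pairs would return an error instead).
    let eq = cmp::eq(&dict, &plain)?;
    assert_eq!(eq.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```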

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

@findepi I think we need to apply some coercion in the physical optimizer when dealing with physical arrow types like the String family, or with types that don't make sense to support in kernel functions.

@tustvold DataFusion keeps ScalarValue::Utf8(String) for performance reasons, given it is more lightweight compared to Scalar<ArrayRef>. If we need kernels for (String/LargeString/StringView Array, rust::String), do you think it makes sense to upstream them into arrow, or is it better to keep them in DF?
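For context on the arrow side of that question, a minimal sketch using the existing arrow-rs Datum machinery (not a proposed new kernel): a single value wrapped in arrow::array::Scalar is accepted by the Datum-based kernels as a length-1 array, which is roughly what a (String/LargeString/StringView Array, rust String) kernel would have to build internally; DataFusion's ScalarValue is a separate enum layered on top of this.

```rust
use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp;

fn main() -> Result<(), arrow::error::ArrowError> {
    let column = StringArray::from(vec!["a", "b", "a"]);
    // Scalar is a length-1 array flagged as a scalar Datum.
    let needle = Scalar::new(StringArray::from(vec!["a"]));

    let matches = cmp::eq(&column, &needle)?;
    assert_eq!(matches.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```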

@tustvold
Contributor

tustvold commented Feb 17, 2025

keeps ScalarValue::Utf8(String) for performance reasons, given it is more lightweight compared to Scalar

IMO ScalarValue shouldn't ever be on the hot path; if it is, that indicates an issue with the way the kernel has been implemented. It has been a while since I looked at DF, but it used to implement a lot of the windowed aggregates and array functions using ScalarValue when they probably shouldn't have been.

IMO, unless it is the arrow kernels that are bottlenecked on ScalarValue::Utf8, it wouldn't make sense to push this into arrow-rs.

@findepi
Member Author

findepi commented Feb 19, 2025

As a general rule we don't support operations on heterogeneous types, to avoid the combinatorial explosion of codegen that would result and the corresponding impact on build times and binary size.

I guess we're in agreement that this pertains only to the lowest-level operations exposed by arrow.
Exploding codegen is not the only way to support runtime-adaptive data representation, but the runtime adaptivity needs to terminate somewhere, and we can decide where. If it is terminated inside the arrow kernels, we should expect binary code bloat.

Where/when this coercion takes is a question for DF, but IMO I would expect the physical plan to be in terms of the physical arrow types, with the logical plan potentially in terms of some higher level logical datatype

This is definitely an option. This is what is intended by #12622 (cc @notfilippo, @tobixdev).
When creating this issue as a separate one, I intended to go further and have adaptivity at runtime.
Often, data flowing from two different branches of a UNION ALL doesn't need to be unified at all.

Maybe it is a premature idea, given that we're not done with #12622 yet.
