
Runtime-adaptive data representation #12720

Open
findepi opened this issue Oct 2, 2024 · 8 comments

@findepi
Member

findepi commented Oct 2, 2024

Originally posted by @andygrove in #11513 (comment)

We are running into the "RecordBatches with the same logical type but different physical types" issue in DataFusion Comet. For a single table, a column may be dictionary-encoded in some Parquet files and not in others, so we are forced to cast them all to the same type, which introduces unnecessary dictionary encoding (or decoding) overhead.

The result of DataFusion's physical planning mandates a particular Arrow type (DataType) for each of the processed columns.
This doesn't reflect the reality of modern systems, though.

  • source data may be naturally representable in different Arrow types (DataTypes), and forcing a single common representation is not efficient
  • adaptive execution of certain operations (like pre-aggregation) would benefit from being able to adjust data processing in response to incoming data characteristics observed at runtime

Example 1:
A plain table scan reading Parquet files. The same column may be represented differently in individual files (plain array vs RLE/REE vs Dictionary), and it is not optimally efficient to force a particular data layout on the output of the table scan.

Example 2:
A UNION ALL query may combine data from multiple sources, each of which may naturally produce data in different data types.
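To make Example 1 concrete, here is a minimal sketch (plain arrow-rs, not DataFusion's actual code path) of the coercion forced today: two batches carry the same logical column, but one file produced it dictionary-encoded and another produced it as a plain Utf8 array, so one side has to be cast to the single DataType declared by the plan.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Int32Type};

fn main() -> Result<(), arrow::error::ArrowError> {
    // Column "city" as produced by one Parquet file: plain Utf8.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));

    // The same logical column from another file: dictionary-encoded.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let dict: ArrayRef = Arc::new(dict);

    // Same logical type (String), different physical types.
    assert_ne!(plain.data_type(), dict.data_type());

    // Because the plan fixes one DataType, one side must be cast, paying the
    // dictionary decoding (or encoding) cost described in the Comet comment above.
    let unified = cast(&dict, &DataType::Utf8)?;
    assert_eq!(unified.data_type(), plain.data_type());
    Ok(())
}
```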

Context

@findepi
Member Author

findepi commented Feb 17, 2025

Prior discussion #7421

@findepi
Member Author

findepi commented Feb 17, 2025

The issue example is about plain, dictionary and REE encoded data (also covered by #7421).
However, we could do more.

  • For example, for the decimal(1,0) type we currently use 128 bits per value, where 8 would suffice. Plenty of waste. Support new Arrow types decimal32 and decimal64 (arrow-rs#6661) will help a bit, but more can be done.
  • For a column with the decimal(38,0) type, we may still use 128 bits per value even once the Decimal32 type is added. But what if the column actually contains only the numbers 0..9? The runtime representation could be a smaller-size integer type (see the sketch below).
    • judging by how Snowflake returns data over their Arrow interface, this is likely what they do internally
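As a rough illustration of the width gap in the second bullet (plain Rust against arrow-rs, not an existing DataFusion utility): storing the digits 0..=9 as decimal(38,0) costs 16 bytes per value in a Decimal128Array, while an Int8Array holding the same logical content needs one byte per value.

```rust
use arrow::array::{Array, Decimal128Array, Int8Array};

fn main() {
    // Ten small values stored as decimal(38, 0): 16 bytes each.
    let wide = Decimal128Array::from((0..10).collect::<Vec<i128>>())
        .with_precision_and_scale(38, 0)
        .unwrap();
    // The same logical content as 8-bit integers: 1 byte each.
    let narrow = Int8Array::from((0..10).collect::<Vec<i8>>());

    let wide_bytes = wide.len() * std::mem::size_of::<i128>(); // 160 bytes of values
    let narrow_bytes = narrow.len() * std::mem::size_of::<i8>(); // 10 bytes of values
    println!("decimal128 value buffer: {wide_bytes} B, int8 value buffer: {narrow_bytes} B");
}
```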

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

so we are forced to cast them all to the same type

Why is it forced to be the same type? Maybe we should address this issue first.

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

We would need arrow kernels that support operating across these various data types.

For instance, operations like comparison currently don’t support execution across different types. It's unclear whether this is due to a lack of implementation or if it's not considered a good approach.

https://github.com/apache/arrow-rs/blob/38d6e691f4ee1b356f28d77b6820de67166c51c3/arrow-ord/src/ord.rs#L355-L392

@tustvold, do you think implementing arrow kernels for types like Utf8 vs LargeUtf8 or Utf8 vs Dictionary(_, Utf8) is a good idea? These types are different but similar (they share the same logical type, i.e. String).
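For illustration, a hedged sketch of the status quo being described (using existing arrow-rs public APIs): comparing Utf8 against LargeUtf8 today is done by casting one side to a common physical type first, rather than by a heterogeneous kernel.

```rust
use arrow::array::{LargeStringArray, StringArray};
use arrow::compute::{cast, kernels::cmp};
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    let small = StringArray::from(vec!["a", "b", "c"]);
    let large = LargeStringArray::from(vec!["a", "x", "c"]);

    // Coerce LargeUtf8 -> Utf8 so both sides share one physical type...
    let coerced = cast(&large, &DataType::Utf8)?;

    // ...then the homogeneous comparison kernel applies.
    let eq = cmp::eq(&small, &coerced)?;
    assert_eq!(eq.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```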

@tustvold
Contributor

tustvold commented Feb 17, 2025

As a general rule we don't support operations on heterogeneous types, to avoid the combinatorial explosion of codegen that would result and the corresponding impact on build times and binary size.

There are some exceptions to this though:

  • Some arithmetic, e.g. intervals with dates and timestamps
  • Dictionaries with their non-dictionary encoded counterparts
  • Metadata only differences, e.g. timezones, decimal precision, etc...

I don't think as a general rule it makes sense to support heterogeneous operations, but rather only where there is a compelling justification for why coercion isn't appropriate. For example, supporting mixed StringArray, LargeStringArray, StringViewArray seems hard to justify, given coercion is extremely cheap. Widening casts for decimals would likely be similar.

Where/when this coercion takes place is a question for DF, but IMO I would expect the physical plan to be in terms of the physical arrow types, with the logical plan potentially in terms of some higher-level logical datatype (although I do wonder if this type grouping might be operator-specific and therefore make more sense as an internal utility). This allows the physical optimizer to make decisions about when and how coercion takes place, as opposed to this being an implicit behaviour of the kernels (potentially performing the same conversion multiple times redundantly).
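As a small illustration of the second exception listed above (a sketch assuming the current behaviour of arrow-rs's Datum-based cmp kernels, which compare on the dictionary's value type; not DataFusion code): a dictionary array can be compared directly against an array of its value type, without an up-front cast.

```rust
use arrow::array::{DictionaryArray, StringArray};
use arrow::compute::kernels::cmp;
use arrow::datatypes::Int32Type;

fn main() -> Result<(), arrow::error::ArrowError> {
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let plain = StringArray::from(vec!["a", "a", "a"]);

    // Dictionary(Int32, Utf8) vs Utf8: one of the supported heterogeneous pairings
    // (assumed here; most other heterogeneous pairs would return an error instead).
    let eq = cmp::eq(&dict, &plain)?;
    assert_eq!(eq.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```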

@jayzhan211
Contributor

jayzhan211 commented Feb 17, 2025

@findepi I think we need to apply some coercion in the physical optimizer when dealing with physical arrow types like the String family, or with types that don't make sense to support in kernel functions.

@tustvold DataFusion keeps ScalarValue::Utf8(String) for performance reasons, given it is more lightweight compared to Scalar<ArrayRef>. If we need kernels for (String/LargeString/StringView Array, rust::String), do you think it makes sense to upstream them into arrow, or is it better to keep them in DF?
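For context on the arrow side of that question, a minimal sketch using the existing arrow-rs Datum machinery (not a proposed new kernel): a single value wrapped in arrow::array::Scalar is accepted by the Datum-based kernels as a length-1 array, which is roughly what a (String/LargeString/StringView Array, rust String) kernel would have to build internally; DataFusion's ScalarValue is a separate enum layered on top of this.

```rust
use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp;

fn main() -> Result<(), arrow::error::ArrowError> {
    let column = StringArray::from(vec!["a", "b", "a"]);
    // Scalar is a length-1 array flagged as a scalar Datum.
    let needle = Scalar::new(StringArray::from(vec!["a"]));

    let matches = cmp::eq(&column, &needle)?;
    assert_eq!(matches.values().iter().collect::<Vec<_>>(), vec![true, false, true]);
    Ok(())
}
```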

@tustvold
Contributor

tustvold commented Feb 17, 2025

keeps ScalarValue::Utf8(String) for performance reasons, given it is more lightweight compared to Scalar

IMO ScalarValue shouldn't ever be on the hot path; if it is, that indicates an issue with the way the kernel has been implemented. It has been a while since I looked at DF, but it used to implement a lot of the windowed aggregates and array functions using ScalarValue when they probably shouldn't have been.

IMO, unless it is the arrow kernels that are bottlenecked on ScalarValue::Utf8, it wouldn't make sense to push this into arrow-rs.

@findepi
Member Author

findepi commented Feb 19, 2025

As a general rule we don't support operations on heterogeneous types, to avoid the combinatorial explosion of codegen that would result and the corresponding impact on build times and binary size.

I guess we're in agreement that this pertains only to the lowest-level operations exposed by arrow.
Exploding codegen is not the only way to support runtime-adaptive data representation, but the runtime adaptivity needs to terminate somewhere, and we can decide where. If it is terminated inside the arrow kernels, we should expect binary code bloat.

Where/when this coercion takes is a question for DF, but IMO I would expect the physical plan to be in terms of the physical arrow types, with the logical plan potentially in terms of some higher level logical datatype

This is definitely an option. This is what is intended by #12622 (cc @notfilippo, @tobixdev).
When creating this issue as a separate one, I intended to go further and have adaptivity at runtime.
Often, data flowing from two different branches of a UNION ALL doesn't need to be unified at all.

Maybe it is a premature idea, given that we're not done with #12622 yet.
