
ListingTable cannot handle partition evolution #13270

Open
adriangb opened this issue Nov 6, 2024 · 11 comments
Labels
bug Something isn't working

Comments
@adriangb
Contributor

adriangb commented Nov 6, 2024

Describe the bug

With CSV:

```
# printf rather than `echo` so that \n expands to a real newline in any shell
printf 'a,b\n1,2\n' > data1.csv
mkdir 'a=2'
printf 'b\n3\n' > 'a=2/data2.csv'
datafusion-cli
> SELECT * FROM '**/*.csv';
Arrow error: Csv error: incorrect number of fields for line 1, expected 2 got 1
```

With Parquet:

```python
import os
import polars as pl

pl.DataFrame({'a': [1], 'b': [2]}).write_parquet('data1.parquet')
os.mkdir('a=2')
pl.DataFrame({'b': [3]}).write_parquet('a=2/data2.parquet')
```

```
datafusion-cli
> SELECT * FROM '**/*.parquet';
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 |   |
+---+---+
2 row(s) fetched.
Elapsed 0.055 seconds.
```

To Reproduce

No response

Expected behavior

Partition evolution is handled and both cases return:

```
+---+---+
| b | a |
+---+---+
| 2 | 1 |
| 3 | 2 |
+---+---+
```

Additional context

Having played around quite a bit with ParquetExec and the SchemaAdapter machinery, I think what should happen is:

  • Partition values live on a per-file basis, specifically on each PartitionedFile rather than on the FileScanConfig
  • Partition values are passed into the SchemaAdapter machinery, which decides for each file whether it needs to add a column generated from partition values (see the sketch below)
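
To make this concrete, here is a minimal, self-contained sketch of the proposed shape. The types below are simplified stand-ins (plain strings instead of Arrow schemas and ScalarValues), not DataFusion's actual PartitionedFile or SchemaAdapter APIs:

```rust
/// Stand-in for DataFusion's `PartitionedFile`: partition values live on the
/// file itself rather than on a shared `FileScanConfig`.
struct PartitionedFile {
    path: String,
    /// Column name -> value parsed from the file's path, e.g. `a=2/...` -> ("a", "2").
    partition_values: Vec<(String, String)>,
}

/// Stand-in table schema: just a list of expected output column names.
type Schema = Vec<String>;

/// Per-file adaptation: for each expected output column the file does not
/// physically contain, fall back to that file's own partition values.
fn adapt_file(file: &PartitionedFile, file_columns: &[&str], table_schema: &Schema) {
    for col in table_schema {
        if file_columns.contains(&col.as_str()) {
            println!("{}: read `{}` from the file", file.path, col);
        } else if let Some((_, v)) = file.partition_values.iter().find(|(k, _)| k == col) {
            println!("{}: generate `{}` = {} from the path", file.path, col, v);
        } else {
            println!("{}: `{}` missing, fill with NULL", file.path, col);
        }
    }
}

fn main() {
    let table_schema: Schema = vec!["a".to_string(), "b".to_string()];
    let f1 = PartitionedFile { path: "data1.parquet".into(), partition_values: vec![] };
    let f2 = PartitionedFile {
        path: "a=2/data2.parquet".into(),
        partition_values: vec![("a".into(), "2".into())],
    };
    adapt_file(&f1, &["a", "b"], &table_schema); // has both columns physically
    adapt_file(&f2, &["b"], &table_schema);      // `a` comes from the path
}
```

Because adapt_file consults only the partition values of the file it is currently adapting, files partitioned at different depths (like data1.parquet and a=2/data2.parquet above) can coexist in a single scan.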
@adriangb adriangb added the bug Something isn't working label Nov 6, 2024
@adriangb
Contributor Author

adriangb commented Nov 6, 2024

cc @alamb, I had promised you this a long time ago but only got around to it now

@alamb
Contributor

alamb commented Nov 6, 2024

Thanks @adriangb

@zhuqi-lucas
Contributor

take

@adriangb
Contributor Author

@logan-keede I see you're doing some work on FileScanConfig. Would it be relevant to consider what needs to be changed to fix this?

@logan-keede
Contributor

@adriangb my focus has been on refactoring FileScanConfig to move it out of core. I can't say I understand the internals that well, but I will look into it and mention it here if I find something relevant.

@zhuqi-lucas
Contributor

@adriangb Sorry for the delay, I am starting to investigate this issue this week.

@zhuqi-lucas
Contributor

zhuqi-lucas commented Feb 11, 2025

First round of investigation:

We need to do partition evolution at runtime, and the inferred partition result needs to overwrite the empty table_partition_cols on FileScanConfig. I haven't found a good way to do this so far, because many places in the code use FileScanConfig's table_partition_cols to pass the parameters.

@adriangb Do you have any suggestions for how we can do this in the current architecture?

Updated:

Maybe we can have a runtime cache to store the partition evolution result, so we can use it when FileScanConfig's table_partition_cols is empty? (Sketched below.)
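
A minimal sketch of what such a runtime cache might look like, assuming Hive-style `key=value` path segments; `PartitionCache` is a hypothetical type invented for illustration, not an existing DataFusion API:

```rust
use std::collections::HashMap;

/// Hypothetical cache: file path -> partition columns inferred from that path.
#[derive(Default)]
struct PartitionCache {
    inferred: HashMap<String, Vec<(String, String)>>,
}

impl PartitionCache {
    /// Parse Hive-style `key=value` path segments once per file and memoize,
    /// so repeated scans don't re-infer partition values for the same paths.
    fn partition_values(&mut self, path: &str) -> &Vec<(String, String)> {
        self.inferred.entry(path.to_string()).or_insert_with(|| {
            path.split('/')
                .filter_map(|seg| seg.split_once('='))
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect()
        })
    }
}

fn main() {
    let mut cache = PartitionCache::default();
    // Consulted only when FileScanConfig's table_partition_cols is empty.
    println!("{:?}", cache.partition_values("a=2/data2.parquet")); // [("a", "2")]
    println!("{:?}", cache.partition_values("data1.parquet"));     // []
}
```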

@adriangb
Contributor Author

I think the fundamental issue is that the partition columns are specified on a per-exec basis via FileScanConfig. The only solutions I can think of are:

  • Change the APIs to allow multiple FileScanConfigs to be supplied. This brings up issues such as making sure the output schemas all match so they can be unioned, etc.
  • Move partition column generation into SchemaAdapter. The issue with this is that SchemaAdapter exists at a lower level than the concept of partition columns, and it might be inappropriate to put that logic in there directly. At the same time, FileScanConfig and ParquetExec (recently folded into ParquetDataSource?) exist above the level of a single file, while partitioning can be as granular as a single file. I think the solution here would be to add hooks into SchemaAdapter for handling missing columns, so that the exec can inject information on how to generate partition columns from file paths (see the sketch after this list). It could do that fully dynamically on a per-file basis with no config, or we could require passing the union of all columns that might be partition columns along with their field types; a file having only a subset of those would be okay, but we would error or fill in nulls if we encounter a missing column that was not declared as a partition column.
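
A hedged sketch of that hook idea. `MissingColumnGenerator` and `HivePathGenerator` are hypothetical names invented for illustration, not existing DataFusion APIs, and the real SchemaAdapter works with Arrow schemas and RecordBatches rather than plain strings:

```rust
/// Hypothetical hook invoked by the schema-adapter layer whenever a file
/// lacks a column required by the table schema. The exec layer installs an
/// implementation that knows how to derive partition columns from file paths.
trait MissingColumnGenerator {
    /// Return a value for `column` derived from `file_path`, or `None` if this
    /// generator cannot supply it (the adapter then errors or fills NULL,
    /// depending on whether the column was declared as a partition column).
    fn generate(&self, file_path: &str, column: &str) -> Option<String>;
}

/// One possible generator: Hive-style `key=value` path segments.
struct HivePathGenerator;

impl MissingColumnGenerator for HivePathGenerator {
    fn generate(&self, file_path: &str, column: &str) -> Option<String> {
        file_path
            .split('/')
            .filter_map(|seg| seg.split_once('='))
            .find(|(k, _)| *k == column)
            .map(|(_, v)| v.to_string())
    }
}

fn main() {
    let generator = HivePathGenerator;
    assert_eq!(generator.generate("a=2/data2.parquet", "a"), Some("2".to_string()));
    assert_eq!(generator.generate("data1.parquet", "a"), None); // NULL or error, per config
}
```

The design point is that the hook sees one file at a time, which is what lets partitioning be as granular as a single file while keeping SchemaAdapter ignorant of the partitioning scheme itself.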

@TheBuilderJR
Contributor

+1 I'm also blocked on this. It'd be nice if schema evolution could be a first-class citizen in DataFusion. It's been pretty painful/stressful running into schema evolution bugs with https://telemetry.sh. It feels like a ticking time bomb before a schema gets corrupted :(

@zhuqi-lucas
Contributor

zhuqi-lucas commented Feb 18, 2025

Just noticed we have a solution for partition evolution for the dynamic file catalog, see the PR below; maybe we need some improvement based on it?

https://github.com/apache/datafusion/pull/12683/files

I still can't find a good solution on the code side; feel free to take it.

@TheBuilderJR
Contributor

@zhuqi-lucas here's one current failure scenario with evolution: #14755
