WIP Parquet: Support reading/writing geometry and geography columns #12347

Kontinuation · 2025-02-20T07:30:49Z

This PR depends on #12346. It implements part of the Iceberg geo spec: #10981.

The Iceberg spec requires geometry and geography types to be mapped to the BINARY physical type with GEOMETRY or GEOGRAPHY logical type annotations. These two spatial logical types were introduced to the Parquet format in apache/parquet-format#240, and there is an ongoing effort to implement them in parquet-java: apache/parquet-java#2971. The parquet-java implementation is not finished yet, so this work-in-progress PR depends on a locally built SNAPSHOT version of parquet-java.
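
To make the mapping described above concrete, here is a minimal sketch. All of the enums and classes in it (`IcebergGeoType`, `PhysicalType`, `LogicalAnnotation`, `ParquetColumnType`) are hypothetical stand-ins invented for illustration; they are not the Iceberg or parquet-java classes, whose geo APIs are still in flux.

```java
/** Illustrative only: every type here is a hypothetical stand-in, not an Iceberg or parquet-java class. */
final class GeoTypeMapping {
  private GeoTypeMapping() {}

  enum IcebergGeoType { GEOMETRY, GEOGRAPHY }

  enum PhysicalType { BINARY }

  enum LogicalAnnotation { GEOMETRY, GEOGRAPHY }

  /** A physical type paired with the logical annotation placed on it. */
  static final class ParquetColumnType {
    final PhysicalType physical;
    final LogicalAnnotation logical;

    ParquetColumnType(PhysicalType physical, LogicalAnnotation logical) {
      this.physical = physical;
      this.logical = logical;
    }

    @Override
    public String toString() {
      return physical + " (" + logical + ")";
    }
  }

  /** Both spatial types are stored as BINARY; only the logical annotation differs. */
  static ParquetColumnType toParquet(IcebergGeoType type) {
    switch (type) {
      case GEOMETRY:
        return new ParquetColumnType(PhysicalType.BINARY, LogicalAnnotation.GEOMETRY);
      case GEOGRAPHY:
        return new ParquetColumnType(PhysicalType.BINARY, LogicalAnnotation.GEOGRAPHY);
      default:
        throw new IllegalArgumentException("Not a spatial type: " + type);
    }
  }
}
```

The point of the sketch is only that the physical storage is the same for both types, so a reader has to look at the logical annotation to tell them apart.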

Kontinuation · 2025-02-20T07:49:12Z

I found that it is not easy to upgrade the parquet dependency to the (not-released-yet) next version, because parquet-hadoop now uses a FileSystem API introduced in Hadoop 3: apache/parquet-java#3079. Upgrading parquet dependencies to the latest SNAPSHOT version results in the following failure when running tests in iceberg-data:

'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
java.lang.NoSuchMethodError: 'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:114)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
	at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
	at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)

We have to remove Hadoop 2 support and migrate to Hadoop 3 for all submodules. There is a stale PR working on this: #10932. I found that #10940 was closed as completed, but there are still lots of submodules depending on Hadoop 2. I'd like to know how we should proceed with upgrading the parquet package. Should we upgrade dependencies from Hadoop 2 to Hadoop 3 to unblock the parquet upgrade? @szehon-ho @rdblue
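
For anyone collecting the affected modules, a quick way to see whether a given module's test classpath would hit the `NoSuchMethodError` above is to probe for `FileSystem.openFile(Path)` reflectively. This is only a sketch: `ClasspathProbe` and `hasMethod` are hypothetical names, not part of Iceberg, Hadoop, or parquet-java, and the probe only checks for a public method with the given parameter types (it does not check the return type).

```java
/** Hypothetical helper for illustration; not part of Iceberg, Hadoop, or parquet-java. */
final class ClasspathProbe {
  private ClasspathProbe() {}

  /**
   * Returns true if the named class can be loaded and exposes a public method
   * with the given name and parameter types; returns false otherwise.
   */
  static boolean hasMethod(String className, String methodName, String... paramTypeNames) {
    try {
      ClassLoader loader = ClasspathProbe.class.getClassLoader();
      // Load without initializing, so static initializers are not run by the probe.
      Class<?> owner = Class.forName(className, false, loader);
      Class<?>[] paramTypes = new Class<?>[paramTypeNames.length];
      for (int i = 0; i < paramTypeNames.length; i++) {
        paramTypes[i] = Class.forName(paramTypeNames[i], false, loader);
      }
      owner.getMethod(methodName, paramTypes);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Class missing, method missing, or class not linkable on this classpath.
      return false;
    }
  }

  public static void main(String[] args) {
    boolean available =
        hasMethod("org.apache.hadoop.fs.FileSystem", "openFile", "org.apache.hadoop.fs.Path");
    System.out.println("FileSystem.openFile(Path) available: " + available);
  }
}
```

Running `main` on a module's test runtime classpath prints whether the method is present; a `false` result means either Hadoop is absent or the Hadoop version on that classpath predates `openFile`.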

pvary · 2025-02-20T09:35:13Z

I would raise this question on the dev list to get a wider audience for the issue after collecting the affected modules.

Kontinuation added 2 commits February 20, 2025 15:08

add geometry and geography types to iceberg-api and iceberg-core

904f503

Add geometry and geography support for iceberg-parquet and iceberg-data

20c391a

github-actions bot added API parquet core data build labels Feb 20, 2025

Kontinuation mentioned this pull request Feb 20, 2025

Build: remove Hadoop 2 dependency #12348

Open
