WIP Parquet: Support reading/writing geometry and geography columns #12347

Draft
wants to merge 2 commits into
base: main

Kontinuation
Member

This PR depends on #12346. It implements part of the Iceberg geo spec: #10981.

The Iceberg spec requires that geometry and geography types be mapped to the Parquet BINARY physical type with a GEOMETRY or GEOGRAPHY logical type annotation. These two spatial logical types were added to the Parquet format in apache/parquet-format#240, and work to implement them in parquet-java is still ongoing: apache/parquet-java#2971. Because the parquet-java implementation is not finished yet, this work-in-progress PR depends on a locally built SNAPSHOT version of parquet-java.

@Kontinuation
Member Author

Kontinuation commented Feb 20, 2025

I found that it is not easy to upgrade the Parquet dependency to the next (not yet released) version, because parquet-hadoop now uses a FileSystem API introduced in Hadoop 3: apache/parquet-java#3079. Upgrading the Parquet dependencies to the latest SNAPSHOT version results in the following failure when running tests in iceberg-data:

'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
java.lang.NoSuchMethodError: 'org.apache.hadoop.fs.FutureDataInputStreamBuilder org.apache.hadoop.fs.FileSystem.openFile(org.apache.hadoop.fs.Path)'
	at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:114)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
	at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
	at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)

We have to remove Hadoop 2 support and migrate to Hadoop 3 for all submodules. There is a stale PR working on this: #10932. I found that #10940 was closed as completed, but many submodules still depend on Hadoop 2. How should we proceed with upgrading the Parquet package? Should we upgrade the dependencies from Hadoop 2 to Hadoop 3 to unblock the Parquet upgrade? @szehon-ho @rdblue
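
To help find which modules hit this, here is a minimal sketch of a classpath probe. It is not part of this PR: the class name `HadoopApiProbe` and the helper `hasPublicMethod` are hypothetical, and the only names taken from the stack trace above are `org.apache.hadoop.fs.FileSystem`, `openFile`, and `org.apache.hadoop.fs.Path`. It uses plain reflection to check whether the method that parquet-hadoop calls is present on the classpath of the module it runs in, and reports `false` instead of throwing when Hadoop is missing or too old.

```java
public class HadoopApiProbe {

    /**
     * Returns true if the named class can be loaded and exposes a public method
     * with the given name and parameter types (given as fully qualified class names).
     * Returns false if the class, a parameter type, or the method is missing.
     */
    public static boolean hasPublicMethod(
            String className, String methodName, String... paramTypeNames) {
        try {
            ClassLoader loader = HadoopApiProbe.class.getClassLoader();
            // Load without running static initializers; we only inspect signatures.
            Class<?> target = Class.forName(className, false, loader);
            Class<?>[] params = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < params.length; i++) {
                params[i] = Class.forName(paramTypeNames[i], false, loader);
            }
            target.getMethod(methodName, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Names copied from the NoSuchMethodError in the stack trace.
        boolean available = hasPublicMethod(
                "org.apache.hadoop.fs.FileSystem",
                "openFile",
                "org.apache.hadoop.fs.Path");
        System.out.println("FileSystem.openFile(Path) available: " + available);
    }
}
```

Running `main` with a given module's test runtime classpath should indicate whether that module would fail the same way. The check only covers this one method, so it is a rough indicator rather than a full compatibility test.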

@pvary
Contributor

pvary commented Feb 20, 2025

I would raise this question on the dev list to get a wider audience for the issue, after collecting the list of affected modules.
