PARQUET-2417: Add support for geometry logical type #2971

Open
wants to merge 38 commits into
base: master

zhangfengcdt

This PR provides a proof of concept (POC) for the proposed parquet-format changes that add a geometry logical type to Parquet.

Here is the proposal: apache/parquet-format#240

Jira

Tests

  • My PR adds the following unit tests: TestGeometryTypeRoundTrip

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All public functions and classes in the PR have Javadoc that explains what they do

@zhangfengcdt zhangfengcdt marked this pull request as draft July 26, 2024 15:39
@zhangfengcdt
Author

CC: @jiayuasu @Kontinuation @wgtmac

parquet-column/pom.xml (outdated review thread, resolved)
@zhangfengcdt zhangfengcdt marked this pull request as ready for review August 12, 2024 15:57
Member

@wgtmac wgtmac left a comment

Thanks for the update! I have left some comments. I think we are reaching the finish line!

parquet-hadoop/pom.xml (outdated review thread, resolved)
}

@Test
public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Member

Thanks for adding these tests!

I think we are missing tests for the following cases:

  • verify geometry type metadata is well preserved.
  • verify all kinds of geometry stats are preserved, including bbox, covering and geometry types.
  • verify geo stats in the column index have been generated.

I can do these later.

Member

@wgtmac wgtmac left a comment

Sorry for the delay. I have left a few comments regarding the statistics. Please take a look. Thanks!

@jiayuasu
Member

@wgtmac please take a look :-)

@wgtmac
Member

wgtmac commented Feb 13, 2025

Sure, I will take a look. Thanks!

@Kontinuation
Member

I am depending on this PR to build geo support for Iceberg. I got many test failures when building this branch locally:

java.lang.NullPointerException
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:965)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.buildColumnChunkMetaData(ParquetMetadataConverter.java:1750)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1848)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1728)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:629)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:698)

An NPE is thrown when reading Parquet files that have no geo columns. Can we apply the following patch to resolve this problem?

diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
index 3efc9345..22e51783 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
@@ -961,6 +961,9 @@ public class ParquetMetadataConverter {
 
   static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
       GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
     org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
     if (formatGeomStats.isSetBbox()) {
       BoundingBox formatBbox = formatGeomStats.getBbox();
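
The guard in this patch can be illustrated in isolation. The following is a minimal sketch, not the PR's code: `FormatGeoStats`, `ModelGeoStats`, and `GeoStatsConverter` are hypothetical stand-ins for the Thrift-generated statistics, the in-memory statistics, and the converter. It assumes that a file without geo columns simply yields a null statistics object.

```java
// Hypothetical stand-in for the Thrift-generated statistics object.
final class FormatGeoStats {
    final double[] bbox; // {xmin, xmax, ymin, ymax}, or null when unset

    FormatGeoStats(double[] bbox) {
        this.bbox = bbox;
    }

    boolean isSetBbox() {
        return bbox != null;
    }
}

// Hypothetical stand-in for the in-memory statistics object.
final class ModelGeoStats {
    final double[] bbox;

    ModelGeoStats(double[] bbox) {
        this.bbox = bbox;
    }
}

final class GeoStatsConverter {
    // Returns null when the column chunk carries no geospatial statistics,
    // which is assumed to be the case for files without geo columns.
    static ModelGeoStats fromFormat(FormatGeoStats formatStats) {
        if (formatStats == null) {
            return null;
        }
        // Copy the bounding box only when it is present.
        double[] bbox = formatStats.isSetBbox() ? formatStats.bbox.clone() : null;
        return new ModelGeoStats(bbox);
    }
}
```

Without the early return, the first call on the null argument (here `isSetBbox()`) would throw, which matches the top frame of the stack trace.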

Comment on lines +962 to +964
static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
GeospatialStatistics formatGeomStats, PrimitiveType type) {
org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
Member

Suggested change
-  static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
-      GeospatialStatistics formatGeomStats, PrimitiveType type) {
-    org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
+  static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
+      GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
+    org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;

@@ -1091,6 +1139,156 @@ public int hashCode() {
}
}

public static class GeometryLogicalTypeAnnotation extends LogicalTypeAnnotation {
private final String crs;
private final ByteBuffer metadata;
Member

metadata is not present in the latest spec. We should remove it.


@Override
LogicalTypeToken getType() {
return LogicalTypeToken.GEOMETRY;
Member

Suggested change
-  return LogicalTypeToken.GEOMETRY;
+  return LogicalTypeToken.GEOGRAPHY;

@@ -190,4 +209,12 @@ public void setMinMax(Binary min, Binary max) {
public BinaryStatistics copy() {
return new BinaryStatistics(this);
}

public void setGeometryStatistics(GeospatialStatistics geospatialStatistics) {
Member

We should name it setGeospatialStatistics because we already have getGeospatialStatistics.

Comment on lines +1272 to +1279
@Override
public boolean equals(Object obj) {
if (!(obj instanceof GeometryLogicalTypeAnnotation)) {
return false;
}
GeometryLogicalTypeAnnotation other = (GeometryLogicalTypeAnnotation) obj;
return crs.equals(other.crs);
}
Member

This should be updated to compare against GeographyLogicalTypeAnnotation objects.
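
One way to write such an `equals` is sketched below. This is a simplified, hypothetical stand-in class (`GeographyAnnotationSketch`), not the PR's `GeographyLogicalTypeAnnotation`. It assumes that both `crs` and `edgeAlgorithm` take part in equality and that either may be null, so it uses `Objects.equals` rather than calling `equals` on the field directly.

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the annotation class under review.
final class GeographyAnnotationSketch {
    private final String crs;           // may be null when no CRS is set
    private final String edgeAlgorithm; // may be null when unset

    GeographyAnnotationSketch(String crs, String edgeAlgorithm) {
        this.crs = crs;
        this.edgeAlgorithm = edgeAlgorithm;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        // Compare against the same annotation class, not the geometry one.
        if (!(obj instanceof GeographyAnnotationSketch)) {
            return false;
        }
        GeographyAnnotationSketch other = (GeographyAnnotationSketch) obj;
        // Objects.equals is null-safe, unlike crs.equals(other.crs).
        return Objects.equals(crs, other.crs) && Objects.equals(edgeAlgorithm, other.edgeAlgorithm);
    }

    @Override
    public int hashCode() {
        // Keep hashCode consistent with equals.
        return Objects.hash(crs, edgeAlgorithm);
    }
}
```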

Comment on lines +275 to +281
@Override
public Optional<PrimitiveComparator> visit(
LogicalTypeAnnotation.GeometryLogicalTypeAnnotation geometryLogicalType) {
// ColumnOrder is undefined for GEOMETRY logical type. Use the default comparator for
// now.
return of(PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR);
}
Member

We should implement visit for GeographyLogicalTypeAnnotation as well.

Member

This raises another problem: currently this PR generates min and max statistics for geometry and geography columns, but that is explicitly prohibited by the latest Parquet geo spec:

The sort order used for GEOMETRY is undefined. When writing data, no min/max statistics should be saved for this type and if such non-compliant statistics are found during reading, they must be ignored.
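
The rule quoted above could be enforced along these lines. This is only a sketch with hypothetical types (`LogicalKind`, `ColumnStatsSketch`, `GeoStatsFilter`), none of which exist in parquet-java. Treating GEOGRAPHY the same way as GEOMETRY follows the comment above rather than the quoted sentence, which names GEOMETRY only.

```java
// Hypothetical stand-in for the logical type of a column.
enum LogicalKind { STRING, GEOMETRY, GEOGRAPHY }

// Hypothetical stand-in for per-column statistics.
final class ColumnStatsSketch {
    final byte[] min;
    final byte[] max;
    final long nullCount;

    ColumnStatsSketch(byte[] min, byte[] max, long nullCount) {
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
    }
}

final class GeoStatsFilter {
    // Geospatial types have no defined sort order in this sketch.
    static boolean hasDefinedSortOrder(LogicalKind kind) {
        return kind != LogicalKind.GEOMETRY && kind != LogicalKind.GEOGRAPHY;
    }

    // Drops min/max for geospatial columns and keeps the other statistics.
    // The same check could guard the write path and the read path.
    static ColumnStatsSketch sanitize(LogicalKind kind, ColumnStatsSketch stats) {
        if (stats == null || hasDefinedSortOrder(kind)) {
            return stats;
        }
        return new ColumnStatsSketch(null, null, stats.nullCount);
    }
}
```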

if (geographyLogicalType.getCrs() != null) {
geographyType.setCrs(geographyLogicalType.getCrs());
}
if (geographyType.getAlgorithm() != null) {
Member

Suggested change
-  if (geographyType.getAlgorithm() != null) {
+  if (geographyLogicalType.getEdgeAlgorithm() != null) {


public static class GeographyLogicalTypeAnnotation extends LogicalTypeAnnotation {
private final String crs;
private final String edgeAlgorithm;
Member

Should we define an enum for edge interpolation algorithms?

Author

They are already auto-generated from the Thrift protocol in EdgeInterpolationAlgorithm.
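
If the annotation stored an enum instead of a `String`, the conversion from text might look like the sketch below. `EdgeAlgorithmSketch` is a hypothetical local enum, not the generated `EdgeInterpolationAlgorithm` class. Its constant names are an assumption about the spec's list and should be checked against the generated code.

```java
import java.util.Locale;

// Hypothetical local enum; constant names are assumed, not taken from this PR.
enum EdgeAlgorithmSketch {
    SPHERICAL,
    VINCENTY,
    THOMAS,
    ANDOYER,
    KARNEY;

    // Parses a case-insensitive name. Returns null for null input and
    // throws IllegalArgumentException for an unknown name.
    static EdgeAlgorithmSketch parse(String name) {
        if (name == null) {
            return null;
        }
        return valueOf(name.trim().toUpperCase(Locale.ROOT));
    }
}
```

Holding the enum in the annotation would reject unknown algorithm names at construction time instead of carrying an arbitrary string through to the writer.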

5 participants