PARQUET-2417: Add support for geometry logical type #2971
This PR is copied from apache#1379.
Thanks for the update! I have left some comments. I think we are reaching the finish line!
parquet-hadoop/src/test/java/org/apache/parquet/statistics/TestGeometryTypeRoundTrip.java
  }

  @Test
  public void testEPSG4326BasicReadWriteGeometryValue() throws Exception {
Thanks for adding these tests!
I think we are still missing tests for the following cases:
- verify that the geometry type metadata is preserved.
- verify that all kinds of geometry stats are preserved, including bbox, covering and geometry types.
- verify that geo stats have been generated in the column index.
I can add these later.
Sorry for the delay. I have left a few comments regarding the statistics. Please take a look. Thanks!
@wgtmac please take a look :-)
Sure, I will take a look. Thanks!
I am depending on this PR to build geo support for Iceberg. I got lots of test failures when building this branch locally:
An NPE is thrown when reading Parquet files without geo columns. Can we apply the following patch to resolve this problem?
diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
index 3efc9345..22e51783 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
@@ -961,6 +961,9 @@ public class ParquetMetadataConverter {
   static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
       GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
     org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
     if (formatGeomStats.isSetBbox()) {
       BoundingBox formatBbox = formatGeomStats.getBbox();
-  static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
-      GeospatialStatistics formatGeomStats, PrimitiveType type) {
-    org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
+  static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
+      GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
+    org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
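The null check proposed in this thread can be illustrated in isolation. This is only a minimal sketch, not the actual `ParquetMetadataConverter` code: `FormatGeoStats` and `GeoStats` are hypothetical stand-ins for the Thrift-generated and the parquet-column `GeospatialStatistics` classes, and it assumes the format-level object is simply null for columns without geospatial statistics.

```java
// Hypothetical stand-in for the Thrift-generated GeospatialStatistics.
final class FormatGeoStats {
    final double[] bbox; // {xmin, xmax, ymin, ymax}, or null when unset

    FormatGeoStats(double[] bbox) {
        this.bbox = bbox;
    }

    boolean isSetBbox() {
        return bbox != null;
    }
}

// Hypothetical stand-in for the parquet-column GeospatialStatistics.
final class GeoStats {
    final double[] bbox;

    GeoStats(double[] bbox) {
        this.bbox = bbox;
    }
}

final class GeoStatsConverterSketch {
    // Returns null for columns without geospatial statistics
    // instead of dereferencing a null format object.
    static GeoStats fromFormat(FormatGeoStats formatStats) {
        if (formatStats == null) {
            return null;
        }
        double[] bbox = formatStats.isSetBbox() ? formatStats.bbox.clone() : null;
        return new GeoStats(bbox);
    }

    public static void main(String[] args) {
        // A non-geo column: no statistics object at all.
        System.out.println(fromFormat(null) == null);
        // A geo column with a bounding box.
        System.out.println(fromFormat(new FormatGeoStats(new double[] {0, 1, 2, 3})).bbox.length);
    }
}
```

Callers of such a converter then need to tolerate a null result, which is the usual cost of returning null rather than an empty statistics object.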
@@ -1091,6 +1139,156 @@ public int hashCode() {
     }
   }

   public static class GeometryLogicalTypeAnnotation extends LogicalTypeAnnotation {
     private final String crs;
     private final ByteBuffer metadata;
`metadata` is not present in the latest spec. We should remove it.
    @Override
    LogicalTypeToken getType() {
      return LogicalTypeToken.GEOMETRY;
-      return LogicalTypeToken.GEOMETRY;
+      return LogicalTypeToken.GEOGRAPHY;
@@ -190,4 +209,12 @@ public void setMinMax(Binary min, Binary max) {
   public BinaryStatistics copy() {
     return new BinaryStatistics(this);
   }

   public void setGeometryStatistics(GeospatialStatistics geospatialStatistics) {
We should name it `setGeospatialStatistics` because we already have `getGeospatialStatistics`.
    @Override
    public boolean equals(Object obj) {
      if (!(obj instanceof GeometryLogicalTypeAnnotation)) {
        return false;
      }
      GeometryLogicalTypeAnnotation other = (GeometryLogicalTypeAnnotation) obj;
      return crs.equals(other.crs);
    }
This should be updated to compare against `GeographyLogicalTypeAnnotation` objects.
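As a rough illustration of the fix being asked for, here is a sketch of an `equals`/`hashCode` pair that checks against its own class. `GeographyAnnotationSketch` is a hypothetical stand-in, not the class in this PR. Including `edgeAlgorithm` in the comparison and making both fields null-safe are assumptions on my side; the comment above only asks for the `instanceof` and cast to target the geography type.

```java
import java.util.Objects;

// Hypothetical stand-in for GeographyLogicalTypeAnnotation.
final class GeographyAnnotationSketch {
    private final String crs;           // may be null when unset (assumption)
    private final String edgeAlgorithm; // may be null when unset (assumption)

    GeographyAnnotationSketch(String crs, String edgeAlgorithm) {
        this.crs = crs;
        this.edgeAlgorithm = edgeAlgorithm;
    }

    @Override
    public boolean equals(Object obj) {
        // Compare against this class, not the geometry annotation.
        if (!(obj instanceof GeographyAnnotationSketch)) {
            return false;
        }
        GeographyAnnotationSketch other = (GeographyAnnotationSketch) obj;
        // Objects.equals avoids an NPE when either field is null.
        return Objects.equals(crs, other.crs)
            && Objects.equals(edgeAlgorithm, other.edgeAlgorithm);
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals: same fields, same null handling.
        return Objects.hash(crs, edgeAlgorithm);
    }
}
```

Whether two annotations with the same CRS but different edge algorithms should be considered equal is a design decision for the PR; the sketch treats them as different.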
        @Override
        public Optional<PrimitiveComparator> visit(
            LogicalTypeAnnotation.GeometryLogicalTypeAnnotation geometryLogicalType) {
          // ColumnOrder is undefined for GEOMETRY logical type. Use the default comparator for
          // now.
          return of(PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR);
        }
We should implement `visit` for `GeographyLogicalTypeAnnotation` as well.
This raises another problem: currently this PR generates min and max statistics for geometry and geography columns, but that is explicitly prohibited by the latest Parquet geo spec:
> The sort order used for `GEOMETRY` is undefined. When writing data, no min/max statistics should be saved for this type and if such non-compliant statistics are found during reading, they must be ignored.
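One way to read the quoted rule is as a single filter applied on both the write and the read path. The sketch below is only an illustration under assumptions: `LogicalKindSketch`, `MinMaxSketch` and `MinMaxPolicySketch` are hypothetical names, not parquet-java classes, and GEOGRAPHY is treated the same way as GEOMETRY because the comment above mentions both column kinds.

```java
import java.util.Optional;

// Hypothetical stand-in for the logical type of a column.
enum LogicalKindSketch { STRING, GEOMETRY, GEOGRAPHY }

// Hypothetical holder for a min/max pair.
final class MinMaxSketch {
    final byte[] min;
    final byte[] max;

    MinMaxSketch(byte[] min, byte[] max) {
        this.min = min;
        this.max = max;
    }
}

final class MinMaxPolicySketch {
    // Assumption: GEOGRAPHY has no defined sort order either.
    static boolean hasDefinedSortOrder(LogicalKindSketch kind) {
        return kind != LogicalKindSketch.GEOMETRY && kind != LogicalKindSketch.GEOGRAPHY;
    }

    // Writer side: call before saving statistics, so nothing is written for geo columns.
    // Reader side: call on statistics found in the file, so non-compliant values are ignored.
    static Optional<MinMaxSketch> usableMinMax(LogicalKindSketch kind, MinMaxSketch candidate) {
        if (!hasDefinedSortOrder(kind)) {
            return Optional.empty();
        }
        return Optional.ofNullable(candidate);
    }
}
```

With a filter like this, the comparator returned by `visit` would no longer influence geo columns, since no min/max values would be produced or consumed for them.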
      if (geographyLogicalType.getCrs() != null) {
        geographyType.setCrs(geographyLogicalType.getCrs());
      }
      if (geographyType.getAlgorithm() != null) {
-      if (geographyType.getAlgorithm() != null) {
+      if (geographyLogicalType.getEdgeAlgorithm() != null) {
  public static class GeographyLogicalTypeAnnotation extends LogicalTypeAnnotation {
    private final String crs;
    private final String edgeAlgorithm;
Should we define an enum for edge interpolation algorithms?
They are already auto-generated from the Thrift definition as `EdgeInterpolationAlgorithm`.
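To make the trade-off concrete, here is a sketch of what a typed field could look like in place of the `String edgeAlgorithm`. `EdgeAlgorithmSketch` is a hypothetical, hand-written stand-in for the generated `EdgeInterpolationAlgorithm`. The constant names and the SPHERICAL default are assumptions based on my reading of the parquet-format geospatial spec and should be checked against the generated code.

```java
import java.util.Locale;

// Hypothetical stand-in for the Thrift-generated EdgeInterpolationAlgorithm.
enum EdgeAlgorithmSketch {
    SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY;

    // Parses a user-supplied name, case-insensitively.
    // Assumption: an unset algorithm falls back to SPHERICAL.
    static EdgeAlgorithmSketch fromString(String name) {
        if (name == null) {
            return SPHERICAL;
        }
        // valueOf throws IllegalArgumentException for unknown names,
        // so typos surface when the type is built rather than at write time.
        return valueOf(name.trim().toUpperCase(Locale.ROOT));
    }
}
```

Storing the enum rather than a raw string would move validation to construction time, at the cost of exposing a generated type (or a mirror of it) in the public annotation API.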
This PR provides a proof of concept (POC) for the proposed parquet-format changes that add a geometry logical type to Parquet.
Here is the proposal: apache/parquet-format#240