1. Introduction
1.1 Background
When using OpenSearch as a data source for Apache Spark, special handling is required for OpenSearch-specific data types, such as geo_point. Spark does not have a native equivalent for geo_point, so a standard approach must be established for efficient querying and geospatial analysis.
Related Issue: #1047
1.2 Objective
This RFC proposes an approach for integrating OpenSearch geo_point fields into Spark DataFrames, ensuring proper data representation and efficient querying without relying on external geospatial libraries.
2. Problem Statement
2.1 OpenSearch geo_point Representation
OpenSearch stores geo_point fields in multiple formats, including:
- Array format: [longitude, latitude]
- String format: "lat,lon" or "POINT(longitude latitude)"
- Object format: { "lat": value, "lon": value }
(ref: https://opensearch.org/docs/latest/field-types/supported-field-types/geo-point/)
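The two string formats list the coordinates in opposite order: "lat,lon" puts latitude first, while WKT puts longitude first. The following is a minimal sketch of a parser for these two formats only, written in plain Python for illustration. The function name `parse_geo_point_string` and the regular expression are assumptions made for this example, not part of OpenSearch or Spark, and any other string encoding is not handled.

```python
import re

# Hypothetical pattern for WKT points such as "POINT(-71.34 41.12)".
_WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.IGNORECASE
)


def parse_geo_point_string(text):
    """Return (lat, lon) from a "lat,lon" or "POINT(lon lat)" string."""
    match = _WKT_POINT.match(text)
    if match:
        # WKT order is longitude first, then latitude.
        lon, lat = float(match.group(1)), float(match.group(2))
        return lat, lon
    # Plain string order is latitude first, then longitude.
    lat_text, lon_text = text.split(",")
    return float(lat_text), float(lon_text)


# Both strings describe the same point.
print(parse_geo_point_string("41.12,-71.34"))
print(parse_geo_point_string("POINT(-71.34 41.12)"))
```

Returning a fixed (lat, lon) order from one place keeps the ordering difference from leaking into later steps.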
Spark does not natively support geo_point, so an appropriate transformation is needed when retrieving data.
2.2 Challenges
- OpenSearch accepts several representations of geo_point, making data retrieval non-trivial.
- Querying geo_point fields efficiently is critical for large datasets.
3. Proposed Solution
3.1 Mapping OpenSearch geo_point to Spark DataFrames
We propose normalizing geo_point fields into one of the following Spark representations:

| OpenSearch format | Kind | Spark representation |
|---|---|---|
| [lon, lat] | Array | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| {"lat": x, "lon": y} | Object | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| "lat,lon" | String | STRUCT<lon: DOUBLE, lat: DOUBLE> |
| "POINT(lon lat)" | WKT | STRING (WKT format) |
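The mapping above can be sketched as a single normalization function. This is an illustrative plain-Python version, not the connector's implementation: the name `normalize_geo_point` is hypothetical, a dict with `lon` and `lat` keys stands in for STRUCT<lon: DOUBLE, lat: DOUBLE>, and WKT values are passed through unchanged as strings, as the mapping proposes.

```python
def normalize_geo_point(value):
    """Map a raw geo_point value to the proposed representation.

    Returns a {"lon", "lat"} dict (standing in for the STRUCT) for the
    array, object, and "lat,lon" formats, and the original string for WKT.
    """
    if isinstance(value, (list, tuple)) and len(value) == 2:
        lon, lat = value  # array format is [lon, lat]
        return {"lon": float(lon), "lat": float(lat)}
    if isinstance(value, dict) and "lat" in value and "lon" in value:
        return {"lon": float(value["lon"]), "lat": float(value["lat"])}
    if isinstance(value, str):
        if value.strip().upper().startswith("POINT"):
            return value  # WKT stays a STRING
        lat_text, lon_text = value.split(",")  # string format is "lat,lon"
        return {"lon": float(lon_text), "lat": float(lat_text)}
    raise ValueError(f"unsupported geo_point value: {value!r}")


samples = [
    [-71.34, 41.12],
    {"lat": 41.12, "lon": -71.34},
    "41.12,-71.34",
    "POINT(-71.34 41.12)",
]
for sample in samples:
    print(normalize_geo_point(sample))
```

The first three samples yield the same lon/lat pair, which is what lets downstream queries treat them uniformly. Only the WKT sample keeps its original form.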
4. Alternatives Considered
4.1 Keeping Original Representation
4.2 Keeping geo_point as JSON String
4.3 Using WKT Representation
5. Open Questions
[...] geo_point data?
Discussion & Feedback
Please provide feedback by commenting on this RFC or suggesting changes via a pull request.