[RFC] Handling OpenSearch geo_point Data in Spark #1053

Open
ykmr1224 opened this issue Feb 14, 2025 · 0 comments

1. Introduction

1.1 Background

When using OpenSearch as a data source for Apache Spark, special handling is required for OpenSearch-specific data types, such as geo_point. Spark does not have a native equivalent for geo_point, so a standard approach must be established for efficient querying and geospatial analysis.
Related Issue: #1047

1.2 Objective

This RFC proposes an approach for integrating OpenSearch geo_point fields into Spark DataFrames, ensuring proper data representation and efficient querying without relying on external geospatial libraries.

2. Problem Statement

2.1 OpenSearch geo_point Representation

OpenSearch stores geo_point fields in multiple formats, including:

  • An object with explicit keys, e.g. {"lat": 41.12, "lon": -71.34}
  • An array in [lon, lat] order, e.g. [-71.34, 41.12]
  • A string in "lat,lon" order, e.g. "41.12,-71.34"
  • A Well-Known Text (WKT) point, e.g. "POINT(-71.34 41.12)"

Spark does not natively support geo_point, so an appropriate transformation is needed when retrieving data.

2.2 Challenges

  • Consistency: OpenSearch allows multiple formats for geo_point, making data retrieval non-trivial.
  • Query Compatibility: Efficient querying in Spark requires a structured geospatial format.
  • Performance Considerations: Converting and processing geo_point fields efficiently is critical for large datasets.

3. Proposed Solution

3.1 Mapping OpenSearch geo_point to Spark DataFrames

We propose normalizing geo_point fields into one of the following Spark representations:

OpenSearch Format               | Proposed Spark Representation
--------------------------------|---------------------------------
[lon, lat] (Array)              | STRUCT<lon: DOUBLE, lat: DOUBLE>
{"lat": x, "lon": y} (Object)   | STRUCT<lon: DOUBLE, lat: DOUBLE>
"lat,lon" (String)              | STRUCT<lon: DOUBLE, lat: DOUBLE>
"POINT(lon lat)" (WKT)          | STRING (WKT format)

4. Alternatives Considered

4.1 Keeping the Original Representation

  • Pros: Retains full fidelity of the OpenSearch data without transformation; output stays in the original data format.
  • Cons: Requires additional processing in Spark for querying and analysis, leading to potential inefficiencies.
  • Use Case: If all geospatial operations are pushed down to OpenSearch, this approach might be preferable.

4.2 Keeping geo_point as a JSON String

  • Pros: Requires minimal transformation.
  • Cons: Harder to use in queries; requires custom parsing (see the sketch below).
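
To illustrate the parsing cost, the following sketch parses a JSON-string geo_point into a struct with from_json before it can be used in a predicate. It assumes an active SparkSession named spark with spark.implicits._ imported; the column name location_json and the sample value are illustrative.

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumes: val spark: SparkSession and import spark.implicits._ are in scope.
// Every query over a JSON-string column must first parse it into a struct.
val geoSchema = StructType(Seq(
  StructField("lon", DoubleType),
  StructField("lat", DoubleType)))

val parsed = Seq("""{"lat": 41.12, "lon": -71.34}""")
  .toDF("location_json")
  .select(from_json($"location_json", geoSchema).as("location"))

parsed.where($"location.lat" > 40.0).show()
```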

4.3 Using the WKT Representation

  • Pros: Compatible with SQL-based geospatial processing.
  • Cons: Requires conversion before analysis, increasing processing overhead (see the sketch below).
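
As a rough sketch of that conversion, a WKT string column has to be decomposed with string functions before numeric filtering is possible. As above, this assumes an active SparkSession with spark.implicits._ imported; the column name location_wkt and the sample value are illustrative.

```scala
import org.apache.spark.sql.functions.{regexp_extract, struct}

// Assumes: val spark: SparkSession and import spark.implicits._ are in scope.
// Decompose "POINT(lon lat)" into numeric fields before any geospatial filtering.
val pointPattern = """POINT\(([-\d.]+) ([-\d.]+)\)"""

val converted = Seq("POINT(-71.34 41.12)")
  .toDF("location_wkt")
  .select(struct(
    regexp_extract($"location_wkt", pointPattern, 1).cast("double").as("lon"),
    regexp_extract($"location_wkt", pointPattern, 2).cast("double").as("lat")
  ).as("location"))

converted.where($"location.lon" < 0.0).show()
```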

5. Open Questions

  1. Should we support multiple representations (e.g., WKT and Struct) or standardize on one?
  2. What optimizations should be applied when filtering OpenSearch geo_point data?

Discussion & Feedback

Please provide feedback by commenting on this RFC or suggesting changes via a pull request.
