Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PPL: Add json_extract function #3262

Open
wants to merge 86 commits into
base: main
Choose a base branch
from

Conversation

kenrickyap
Copy link
Contributor

@kenrickyap kenrickyap commented Jan 23, 2025

Description

Implements json_extract function that would be useful to extract specific objects and scalars from an existing json_object.

Related Issues

Resolves #3211
Follow up issue for #3027

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

14yapkc1 and others added 30 commits January 3, 2025 10:54
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: kenrickyap <[email protected]>
….java

Co-authored-by: Andrew Carbonetto <[email protected]>
Signed-off-by: kenrickyap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
YANG-DB
YANG-DB previously approved these changes Jan 30, 2025
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
@kenrickyap kenrickyap force-pushed the feature/json-extract branch from dfe5d3a to 75e9cc3 Compare January 31, 2025 14:57
Signed-off-by: Kenrick Yap <[email protected]>
YANG-DB
YANG-DB previously approved these changes Feb 3, 2025
Copy link
Collaborator

@acarbonetto acarbonetto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few comments to address

Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>

# Conflicts:
#	core/src/main/java/org/opensearch/sql/expression/DSL.java
#	core/src/main/java/org/opensearch/sql/expression/function/BuiltinFunctionName.java
#	core/src/main/java/org/opensearch/sql/expression/json/JsonFunctions.java
#	core/src/main/java/org/opensearch/sql/utils/JsonUtils.java
#	core/src/test/java/org/opensearch/sql/expression/json/JsonFunctionsTest.java
#	docs/user/ppl/functions/json.rst
#	integ-test/src/test/java/org/opensearch/sql/ppl/JsonFunctionsIT.java
#	integ-test/src/test/resources/json_test.json
#	ppl/src/main/antlr/OpenSearchPPLLexer.g4
#	ppl/src/main/antlr/OpenSearchPPLParser.g4
Signed-off-by: Kenrick Yap <[email protected]>
Signed-off-by: Kenrick Yap <[email protected]>
Comment on lines 72 to 82
Usage: `json_extract(doc, path[, path])` Extracts a JSON value from a json document based on the path specified.

Argument type: STRING, STRING

Return type: STRING/BOOLEAN/DOUBLE/INTEGER/NULL/STRUCT/ARRAY

- Up to 3 paths can be provided, and results of each `path` with be returned in an ARRAY.
- Returns an ARRAY if `path` points to multiple results (e.g. $.a[*]) or if the `path` points to an array.
- Return null if `path` is not valid, or if JSON `doc` is MISSING or NULL.
- Throws SemanticCheckException if `doc` or `path` is malformed.
- Throws ExpressionEvaluationException if `path` is missing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this documentation might need some updates for multiple paths?

  • Usage: json_extract(doc, path[, path[, path[[) (or another way to indicate that up to 3 paths can be provided)
  • Usage: "based on the path(s) specified"
  • Argument type: "STRING, STRING[, STRING[, STRING]] (or another way to indicate that up to 3 paths can be provided)
  • What about the case where only one of the paths is invalid? Should this read "Returns null if any path is not valid"? Similarly, should the last two bullets be updated to specify that they apply to "any path"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

Comment on lines +194 to +219
@Test
public void test_json_extract() throws IOException {
JSONObject result;
result =
executeQuery(
String.format(
"source=%s | where json_valid(json_string) | eval"
+ " extracted=json_extract(json_string, '$.b') | fields test_name, extracted",
TEST_INDEX_JSON_TEST));
verifySchema(
result, schema("test_name", null, "string"), schema("extracted", null, "undefined"));
verifyDataRows(
result,
rows("json nested object", new JSONObject(Map.of("c", "3"))),
rows("json object", "2"),
rows("json array", null),
rows("json nested array", null),
rows("json scalar string", null),
rows("json scalar int", null),
rows("json scalar float", null),
rows("json scalar double", null),
rows("json scalar boolean true", null),
rows("json scalar boolean false", null),
rows("json empty string", null),
rows("json nested list", new JSONArray(List.of(Map.of("c", "2"), Map.of("c", "3")))));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed during yesterday's review session, we certainly don't need to test all edge cases in IT, but I think we probably want to have more than one test. At least, we should probably have one with two path strings and one with three path strings?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and maybe test the 4 path case where we get a parsing error

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to test a couple of other error scenarios mentioned: such as null returns and Semantic errors for malformed json.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test IT test cases for use of multiple paths and for the cases when semantic errors are thrown. Though I am curious why we need to test for errors in this case as I cannot find any other IT test testing for errors.

In addition by the time the IT test receive errors they are received as ResponseErrors so had to check if error contains reference to semantic error.

List<ExprValue> resultList = new ArrayList<>();

for (ExprValue path : paths) {
System.out.println("Processing path: " + path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an errant debug statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

* @return ExprValue of value at given path of json string.
*/
public static ExprValue extractJson(ExprValue json, ExprValue... paths) {
List<ExprValue> resultList = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since know that the array list is always going to be the same since as paths, can we just initialize it with a specified initialCapacity to save some memory space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

}

private static ExprValue extractJsonPath(String json, String path) {
if (json.isEmpty() || json.equals("null")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate case for either of these? Seems like this would just throw a PathNotFoundException and returns LITERAL_NULL below if not (at least for "null", I'm not sure about the empty string). Probably makes sense to reduce branching here if possible. 🤷

Moreover, do we actually want to return LITERAL_NULL for the empty string? Is that an invalid JSON string? If so, does it make sense to raise the corresponding exception instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both "null" and "" are considered valid json strings by json_valid, since we expect that users will use json_valid to first filter for valid json then use json extract to avoid errors, we need these specific cases to be consistent with json_valid

assertEquals(expected, actual);
}

private static void execute_extract_json(ExprValue expected, String json, String path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be generalized to allow multiple paths, so you can use it for json_extract_search_list_of_paths as well?

Suggested change
private static void execute_extract_json(ExprValue expected, String json, String path) {
private static void execute_extract_json(ExprValue expected, String json, String... paths) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

}

@Test
void json_extract_throws_ExpressionEvaluationException() {
Copy link
Contributor

@currantw currantw Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could split execute_extract_json into two methods: execute_extract_json (which would just actually run jsonExtract) and test_extract_json (or assert_extract_json), which would call execute_extract_json and then actually do the comparison.

That way, test methods like this could call execute_extract_json, and no have to worry about calling DSL.literal 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}

@Test
void json_extract_search_arrays() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have tests for the following:

  • Multiple paths where one or more paths match an array
  • Multiple paths where one or more paths match more than one value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Contributor

@currantw currantw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed to ec6ff5e.


Argument type: STRING, STRING

Return type: STRING/BOOLEAN/DOUBLE/INTEGER/NULL/STRUCT/ARRAY
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use JSON_STRING instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we always return an ARRAY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case that only one string is provided, we return the associated type which will not be a json string. e.g. if path "$.a" is provided from json_string "{"a": 1}: then a integer is returned


Return type: STRING/BOOLEAN/DOUBLE/INTEGER/NULL/STRUCT/ARRAY

- Up to 3 paths can be provided, and results of each `path` with be returned in an ARRAY.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

phrasing is a little confusing.

Suggested change
- Up to 3 paths can be provided, and results of each `path` with be returned in an ARRAY.
- Up to 3 paths can be provided, and results of all possible `path`s will be returned in an ARRAY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved


- Up to 3 paths can be provided, and results of each `path` with be returned in an ARRAY.
- Returns an ARRAY if `path` points to multiple results (e.g. $.a[*]) or if the `path` points to an array.
- Return null if `path` is not valid, or if JSON `doc` is MISSING or NULL.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null or empty array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case of multiple paths an array of nulls would be returned

- Returns an ARRAY if `path` points to multiple results (e.g. $.a[*]) or if the `path` points to an array.
- Return null if `path` is not valid, or if JSON `doc` is MISSING or NULL.
- Throws SemanticCheckException if `doc` or `path` is malformed.
- Throws ExpressionEvaluationException if `path` is missing.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not handled in code in this PR, this is the default error thrown if provided function does not matched defined schema for functions. Thrown by DefaultFunctionResolver.

Comment on lines +194 to +219
@Test
public void test_json_extract() throws IOException {
JSONObject result;
result =
executeQuery(
String.format(
"source=%s | where json_valid(json_string) | eval"
+ " extracted=json_extract(json_string, '$.b') | fields test_name, extracted",
TEST_INDEX_JSON_TEST));
verifySchema(
result, schema("test_name", null, "string"), schema("extracted", null, "undefined"));
verifyDataRows(
result,
rows("json nested object", new JSONObject(Map.of("c", "3"))),
rows("json object", "2"),
rows("json array", null),
rows("json nested array", null),
rows("json scalar string", null),
rows("json scalar int", null),
rows("json scalar float", null),
rows("json scalar double", null),
rows("json scalar boolean true", null),
rows("json scalar boolean false", null),
rows("json empty string", null),
rows("json nested list", new JSONArray(List.of(Map.of("c", "2"), Map.of("c", "3")))));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and maybe test the 4 path case where we get a parsing error

Comment on lines +194 to +219
@Test
public void test_json_extract() throws IOException {
JSONObject result;
result =
executeQuery(
String.format(
"source=%s | where json_valid(json_string) | eval"
+ " extracted=json_extract(json_string, '$.b') | fields test_name, extracted",
TEST_INDEX_JSON_TEST));
verifySchema(
result, schema("test_name", null, "string"), schema("extracted", null, "undefined"));
verifyDataRows(
result,
rows("json nested object", new JSONObject(Map.of("c", "3"))),
rows("json object", "2"),
rows("json array", null),
rows("json nested array", null),
rows("json scalar string", null),
rows("json scalar int", null),
rows("json scalar float", null),
rows("json scalar double", null),
rows("json scalar boolean true", null),
rows("json scalar boolean false", null),
rows("json empty string", null),
rows("json nested list", new JSONArray(List.of(Map.of("c", "2"), Map.of("c", "3")))));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to test a couple of other error scenarios mentioned: such as null returns and Semantic errors for malformed json.

Signed-off-by: Kenrick Yap <[email protected]>
| json empty string | | null |
+--------------------+--------------------------------------+-------------------------+

os> source=json_test | where test_name="json nested list" | eval json_extract=json_extract(json_string, '$.b[1].c') | fields test_name, json_string, json_extract
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be kind of you if you post examples and description to the PR

Description
>>>>>>>>>>>

Usage: `json_extract(doc, path[, path[, path])` Extracts a JSON value from a json document based on the path(s) specified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it accepts 1-3 paths. Why not array of paths then?


Example::

os> source=json_test | where json_valid(json_string) | eval json_extract=json_extract(json_string, '$.b') | fields test_name, json_string, json_extract
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it accept "legacy" path (path which starts with ".")? if yes, this should be tested and documented. If no, this should be documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add json_extract function to PPL
8 participants