
PPL: Add json_set and json_extend command to spark-ppl #1038

Merged

Conversation


@acarbonetto acarbonetto commented Feb 7, 2025

Description

Adds the json_set and json_extend functions to the Spark PPL UDFs.

Related Issues

Resolves #996

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
@YANG-DB YANG-DB enabled auto-merge (squash) February 7, 2025 04:35
@YANG-DB YANG-DB disabled auto-merge February 7, 2025 04:35
@acarbonetto acarbonetto changed the title PPL: Add json_extend command to spark-ppl PPL: Add json_set and json_extend command to spark-ppl Feb 7, 2025
Signed-off-by: Andrew Carbonetto <[email protected]>
Member

@LantaoJin LantaoJin left a comment

Could you add some ITs that test integration with multiple JSON functions?

@@ -223,10 +223,6 @@ public enum BuiltinFunctionName {
JSON_EXTRACT(FunctionName.of("json_extract")),
JSON_KEYS(FunctionName.of("json_keys")),
JSON_VALID(FunctionName.of("json_valid")),
// JSON_DELETE(FunctionName.of("json_delete")),
Member

Hmm, how did json_delete work if we didn't include it here?

Contributor Author

It's not a built-in function, but a user-defined function.

Member

"Built-in" here means that the function is built into PPL-on-Spark, not that it is built into Spark itself. Therefore, as long as an implementation is provided and a user does not have to supply their own, the function is considered "built-in" (we currently do not provide a public interface for user-defined functions, so the functions we add are all built-in functions). It seems that, because of this code:

if (BuiltinFunctionName.of(function.getFuncName()).isEmpty()) {
    ScalaUDF udf = SerializableUdf.visit(function.getFuncName(), args);
    if (udf == null) {
        throw new UnsupportedOperationException(function.getFuncName() + " is not a builtin function of PPL");
    }
    return udf;
}

there is no need to declare them in this class. I think we should unify them in BuiltinFunctionName. (Not a blocker for this PR.)
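The two-tier lookup described in this comment can be sketched as follows. This is a hypothetical, self-contained illustration, not the plugin's actual code: FunctionResolver, the placeholder registry values, and the simplified enum are all invented for the demo; only the control flow mirrors the quoted snippet (enum lookup first, then fall through to the serializable-UDF registry, else throw).

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the dispatch discussed above: names declared in
// the BuiltinFunctionName enum resolve first; any other name falls through
// to a registry of serializable UDFs (the role SerializableUdf.visit plays
// for json_delete, json_set, and json_extend).
public class FunctionResolver {

    enum BuiltinFunctionName {
        JSON_EXTRACT, JSON_KEYS, JSON_VALID;

        static Optional<BuiltinFunctionName> of(String name) {
            try {
                return Optional.of(valueOf(name.toUpperCase()));
            } catch (IllegalArgumentException e) {
                return Optional.empty();
            }
        }
    }

    // Stand-in for the SerializableUdf registry; the values are placeholders.
    static final Map<String, String> UDF_REGISTRY = Map.of(
            "json_delete", "delete-udf",
            "json_set", "set-udf",
            "json_extend", "extend-udf");

    static String resolve(String funcName) {
        if (BuiltinFunctionName.of(funcName).isEmpty()) {
            String udf = UDF_REGISTRY.get(funcName);
            if (udf == null) {
                throw new UnsupportedOperationException(
                        funcName + " is not a builtin function of PPL");
            }
            return udf; // resolved via the UDF fallback
        }
        return "builtin:" + funcName; // resolved via the enum
    }

    public static void main(String[] args) {
        System.out.println(resolve("json_extract")); // enum path
        System.out.println(resolve("json_set"));     // UDF fallback path
    }
}
```

This shows why the UDFs work even when commented out of the enum: the fallback path never consults the enum entry, which is the reviewer's point about unifying them later.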

@LantaoJin
Member

cc @qianheng-aws

Contributor

@qianheng-aws qianheng-aws left a comment

Please add ITs for this PR and make sure the examples added in the documentation are all correct.

@@ -278,12 +301,57 @@ Example:
|{"teacher":["Alice","Tom","Walt"],"student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]} |
+-----------------------------------------------------------------------------------------------------------------------------------+


os> source=people | eval append = json_append(`{"school":{"teacher":["Alice"],"student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}}`,array('school.teacher', 'Tom', 'Walt')) | head 1 | fields append
os> source=people | eval append = json_append(`{"school":{"teacher":["Alice"],"student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}}`,array('school.teacher', array('Tom', 'Walt'))) | head 1 | fields append
Contributor

Have you checked this PPL? Does it work?

As far as I know, Spark's array function won't accept elements of different types.

Contributor Author

Please take a look at the latest example/IT test.
We can use a JSON-encoded string of arrays to do this. But the array we pass in does have to abide by the rules of Spark's array type.


Example:

os> source=people | eval extend = json_extend(`{"teacher":["Alice"],"student":[{"name":"Bob","rank":1},{"name":"Charlie","rank":2}]}`, 'student', '{"name":"Tommy","rank":5}') | head 1 | fields extend
Contributor

The same problem as found in json_set: these two functions take only two parameters as defined in the code, but are given three here.

Member

@qianheng-aws the second parameter is actually a list of key/value pairs, if I'm not mistaken - @acarbonetto, is this correct?

Contributor Author

Correct.

docs/ppl-lang/functions/ppl-json.md (outdated, resolved)
* @param depth - current traversal depth
* @param valueToUpdate - value to update
*/
static void updateNestedValue(Object currentObj, String[] pathParts, int depth, Object valueToUpdate) {
Contributor

Can we reuse the code from appendNestedValue? These two functions seem to have mostly duplicated code, differing only in their final operation: set or append.

Contributor Author

I was thinking the same thing. I will try to consolidate today - maybe with another functional argument.

Contributor

Same thought. I reckon we will end up with something similar in terms of class structure:

JsonUtils {

    updateNestedValue() {
        // Invoke the traverse method with a lambda for the handling (update).
    }

    appendNestedValue() {
        // Invoke the traverse method with a lambda for the handling (append).
    }

    private static traverse(FunctionalInterface handler) {
    }

}



Contributor Author

Please take a look at the latest iteration, @qianheng-aws.
I've added traversal and update functions, as well as passing in a lambda that does the actual implementation.
Let me know if you find any more candidates for optimization.
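The consolidation discussed in this thread might look roughly like the following self-contained sketch. Names such as JsonUtils and traverse are illustrative, and the real implementation works on parsed JSON rather than bare maps; only the shape is the point: one shared traversal, with the leaf operation (set vs. append) passed in as a lambda.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch: one traversal helper shared by set and append.
public class JsonUtils {

    // Walks pathParts through nested maps, then hands the parent map and
    // the final key to the leaf handler.
    @SuppressWarnings("unchecked")
    private static void traverse(Map<String, Object> root, String[] pathParts,
                                 BiConsumer<Map<String, Object>, String> leafHandler) {
        Map<String, Object> current = root;
        for (int depth = 0; depth < pathParts.length - 1; depth++) {
            Object next = current.get(pathParts[depth]);
            if (!(next instanceof Map)) {
                return; // path not found; leave the object unchanged
            }
            current = (Map<String, Object>) next;
        }
        leafHandler.accept(current, pathParts[pathParts.length - 1]);
    }

    static void updateNestedValue(Map<String, Object> obj, String[] pathParts, Object value) {
        traverse(obj, pathParts, (parent, key) -> parent.put(key, value));
    }

    @SuppressWarnings("unchecked")
    static void appendNestedValue(Map<String, Object> obj, String[] pathParts, Object value) {
        traverse(obj, pathParts, (parent, key) -> {
            Object existing = parent.get(key);
            if (existing instanceof List) {
                ((List<Object>) existing).add(value);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, Object> school = new HashMap<>();
        school.put("teacher", new ArrayList<>(List.of("Alice")));
        Map<String, Object> root = new HashMap<>();
        root.put("school", school);

        appendNestedValue(root, new String[] {"school", "teacher"}, "Tom");
        updateNestedValue(root, new String[] {"school", "principal"}, "Walt");

        System.out.println(school.get("teacher"));   // list now holds Alice and Tom
        System.out.println(school.get("principal")); // Walt
    }
}
```

Only the lambdas differ between the two public methods, which is exactly the duplication the review asked to remove.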

@YANG-DB
Member

YANG-DB commented Feb 10, 2025

@acarbonetto can you please check why the CI failed?

@acarbonetto
Contributor Author

@acarbonetto can you please check why the CI failed?

Sure. The IT job failed, but it doesn't look like it's due to a failing test.

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Contributor

@qianheng-aws qianheng-aws left a comment

The CI build failed due to a Scala style formatting issue. Please run sbt scalafmtAll before submitting your code.

Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
docs/ppl-lang/functions/ppl-json.md (outdated, resolved)
@@ -245,9 +270,12 @@ Example:

**Description**

`json_append(json_string, [path_key, list of values to add ])` appends values to end of an array within the json elements. Return the updated json object after appending .
`json_append(json_string, array(key1, value1, key2, value2, ...))` appends values to end of an array at key within the json elements. Returns the updated json object after appending.
Contributor Author

Updated the json_append syntax to align with SP2 syntax, which takes key/value pairs. Tests also updated.

Signed-off-by: Andrew Carbonetto <[email protected]>
@YANG-DB YANG-DB self-requested a review February 13, 2025 19:17
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Signed-off-by: Andrew Carbonetto <[email protected]>
Contributor

@qianheng-aws qianheng-aws left a comment

LGTM, thanks for the contribution!

test("test json_extend() function: add single value key not found") {
val frame = sql(s"""
| source = $testTable
| | eval result = json_extend('$validJson7',array('headmaster', 'Tom')) | head 1 | fields result
Member

@LantaoJin LantaoJin Feb 14, 2025

Why does this 'Tom' work?
I see three different formats in this IT:

  1. '\"Foobar\"'
  2. '"Tom"'
  3. 'Tom'

This is a bit confusing.

Contributor

1 is completely equal to 2 in scala.StringContext;

and 2 ("Tom" with the quotes) will be transformed to 3 (Tom) after parsing by Jackson.

Member

1 is completely equal to 2 in scala.StringContext;

I see.

and 2 (""Tom"") will be transformed to 3 ("Tom") after parsing by jackson.

Do you mean they are all equal?

Contributor

Yeah, at least equal in our code implementation.
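The equivalences from this thread can be shown in a small hypothetical Java demo (the QuotingDemo class is invented for illustration). Scala's triple-quoted interpolator is approximated with a Java text block, and Jackson's handling of a JSON string value is approximated by stripping the outer quotes; the real test suite uses Scala and Jackson, but the relationships are the same.

```java
// Hypothetical demo of the three string forms from the review thread.
public class QuotingDemo {
    public static void main(String[] args) {
        // Form 1: quotes written with backslash escapes
        String escaped = "\"Tom\"";

        // Form 2: quotes written literally inside a text block (Java 15+),
        // analogous to writing "Tom" inside Scala's triple-quoted string
        String literal = """
                "Tom"
                """.strip();

        // Forms 1 and 2 are the same five-character string: "Tom"
        System.out.println(escaped.equals(literal));

        // Form 3: what a JSON parser like Jackson yields for that JSON
        // string value -- the outer quotes are consumed during parsing
        String parsed = literal.substring(1, literal.length() - 1);
        System.out.println(parsed); // Tom
    }
}
```

So forms 1 and 2 are identical source-level spellings of one string, and form 3 is what that string becomes after JSON parsing, which is why all three behave the same in the ITs.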

@YANG-DB YANG-DB merged commit 8d0e591 into opensearch-project:main Feb 14, 2025
4 checks passed
Development

Successfully merging this pull request may close these issues.

[PPL-Lang] PPL support json_set, json_extend functions