Added end-to-end PPL tests for Spark #1028

Merged
merged 2 commits into from
Feb 6, 2025

normanj-bitquill
Contributor

Description

Adding end-to-end PPL tests for Spark. The test queries are run on the Spark master container using Spark Connect. OpenSearch is not used.

Tests include PPL queries from the Python test suite and the TPC-H queries.

Related Issues

#1022

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Newly added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Member

@YANG-DB YANG-DB left a comment

@normanj-bitquill this is great !!!

@penghuo penghuo added the enhancement New feature or request label Feb 5, 2025
@penghuo
Collaborator

penghuo commented Feb 5, 2025

Thanks for adding these. I have a few questions:
1. Could you explain how the queries and results are generated?
2. If a new command is added, how should it be incorporated into the existing test sets?
3. Are these tests already integrated with GitHub Actions?

@normanj-bitquill
Contributor Author

> Thanks for adding these. I have a few questions:
> 1. Could you explain how the queries and results are generated?
> 2. If a new command is added, how should it be incorporated into the existing test sets?
> 3. Are these tests already integrated with GitHub Actions?

@penghuo

  1. The queries were taken from here and here

    The results were generated by running the queries on Spark and saving the results in CSV format. This creates a baseline that enables changes to the results to be detected.

  2. To add new tests, create the PPL files in e2e-test/src/test/resources/spark/queries/ppl

    To generate the results, start the integration test Docker cluster, run the queries on the Spark container, and save the output in CSV format. The query and results files share the same base filename: the query file ends with .ppl and the results file ends with .results.

  3. This is not yet integrated with GitHub Actions, but that work is planned. After this PR, there will be another to add Async Query API tests. Once all of those tests are in, I will work on running these tests in GitHub Actions.
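
The file layout described in points 1 and 2 can be illustrated with a minimal Python sketch. This is not the test harness from this PR; the function names (find_test_cases, parse_csv, matches_baseline) are hypothetical, and comparing rows in strict order is an assumption about how a baseline check might work. It only shows how a .ppl file could be paired with the .results file that shares its base name, and how a CSV result could be checked against the saved baseline.

```python
import csv
import io
from pathlib import Path


def find_test_cases(query_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each .ppl query file with the .results file sharing its base name.

    Hypothetical helper: raises if a query has no saved baseline yet.
    """
    cases = []
    for query_file in sorted(query_dir.glob("*.ppl")):
        results_file = query_file.with_suffix(".results")
        if not results_file.exists():
            raise FileNotFoundError(f"no baseline results for {query_file.name}")
        cases.append((query_file, results_file))
    return cases


def parse_csv(text: str) -> list[list[str]]:
    """Parse CSV text into rows so a missing trailing newline does not matter."""
    return list(csv.reader(io.StringIO(text)))


def matches_baseline(actual_csv: str, results_file: Path) -> bool:
    """Return True when the actual CSV equals the baseline, row for row.

    Assumption: row order is significant. A query without a defined ordering
    would need its rows sorted before comparison.
    """
    return parse_csv(actual_csv) == parse_csv(results_file.read_text())
```

Under these assumptions, adding a test means dropping a new .ppl file next to a .results file with the same base name; find_test_cases would then pick it up without further registration.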

@YANG-DB
Member

YANG-DB commented Feb 6, 2025

@normanj-bitquill do you think we can add another paragraph (in our E2E document) on how to add a new query to this baseline?

Collaborator

@penghuo penghuo left a comment

Thx!

@normanj-bitquill
Contributor Author

@YANG-DB I'll update the README.md

@normanj-bitquill
Contributor Author

@YANG-DB I have updated the README.md.

@YANG-DB YANG-DB merged commit 4d6ba7d into opensearch-project:main Feb 6, 2025
4 checks passed
@normanj-bitquill normanj-bitquill deleted the e2e-spark-ppl-tests branch February 11, 2025 18:05
Labels
enhancement New feature or request
3 participants