diff --git a/docker/integ-test/.env b/docker/integ-test/.env
index 7ab2cf5c2..7d8995956 100644
--- a/docker/integ-test/.env
+++ b/docker/integ-test/.env
@@ -11,6 +11,7 @@ SQL_APP_JAR=./spark-sql-application/target/scala-2.12/sql-job-assembly-0.7.0-SNA
 OPENSEARCH_NODE_MEMORY=512m
 OPENSEARCH_ADMIN_PASSWORD=C0rrecthorsebatterystaple.
 OPENSEARCH_PORT=9200
+OPENSEARCH_PA_PORT=9600
 OPENSEARCH_DASHBOARDS_PORT=5601
 S3_ACCESS_KEY=Vt7jnvi5BICr1rkfsheT
 S3_SECRET_KEY=5NK3StGvoGCLUWvbaGN0LBUf9N6sjE94PEzLdqwO
diff --git a/docker/integ-test/docker-compose.yml b/docker/integ-test/docker-compose.yml
index 8ff92c94a..9fe79dc22 100644
--- a/docker/integ-test/docker-compose.yml
+++ b/docker/integ-test/docker-compose.yml
@@ -137,7 +137,7 @@ services:
         target: /var/run/docker.sock
     ports:
       - ${OPENSEARCH_PORT:-9200}:9200
-      - 9600:9600
+      - ${OPENSEARCH_PA_PORT:-9600}:9600
     expose:
       - "${OPENSEARCH_PORT:-9200}"
       - "9300"
diff --git a/docker/integ-test/spark/spark-master-entrypoint.sh b/docker/integ-test/spark/spark-master-entrypoint.sh
index a21c20643..51caa787b 100755
--- a/docker/integ-test/spark/spark-master-entrypoint.sh
+++ b/docker/integ-test/spark/spark-master-entrypoint.sh
@@ -1,5 +1,8 @@
 #!/bin/bash
 
+# Copyright OpenSearch Contributors
+# SPDX-License-Identifier: Apache-2.0
+
 function start_spark_connect() {
     sc_version=$(ls -1 /opt/bitnami/spark/jars/spark-core_*.jar | sed -e 's/^.*\/spark-core_//' -e 's/\.jar$//' -e 's/-/:/')
diff --git a/docs/docker/integ-test/README.md b/docs/docker/integ-test/README.md
new file mode 100644
index 000000000..47b303abb
--- /dev/null
+++ b/docs/docker/integ-test/README.md
@@ -0,0 +1,166 @@
# Docker Cluster for Integration Testing

## Introduction

The docker cluster in `docker/integ-test` is designed for integration testing. It supports the following
use cases:
1. Submitting queries directly to Spark in order to test the PPL extension for Spark.
2. Submitting queries directly to Spark that use the OpenSearch datasource. Useful for testing the Flint
   extension for Spark.
3. Using the Async API to submit queries to the OpenSearch server. Useful for testing the EMR workflow and
   querying S3/Glue datasources. A local container is run rather than using the AWS EMR service.

The cluster consists of several containers, and the compose setup handles configuring them. No tables are
created.

## Overview

![Docker Containers](images/integ-test-containers.png "Docker Containers")

All containers run in a dedicated docker network.

### OpenSearch Dashboards

An OpenSearch Dashboards server that is connected to the OpenSearch server. It is exposed to the host OS,
so it can be accessed with a browser.

### OpenSearch

An OpenSearch server running in standalone mode. It is exposed to the host OS. It is configured to have
an S3/Glue datasource with the name `mys3`. System indices and system indices permissions are disabled.

This container also has a docker volume used to persist data such as local indices.

### Spark

The Spark master node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark master also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

Spark Connect also runs in this container and provides a convenient way to issue queries. The port for
Spark Connect is exposed to the host OS.
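For example, queries can be issued from the host OS through the mapped Spark Connect port. The sketch below
is an illustration, not part of the cluster setup; it assumes the Spark Connect Scala client (the
`spark-connect-client-jvm` artifact, available for recent Spark releases) is on the classpath and that the
default `SPARK_CONNECT_PORT` mapping of 15002 is used:
```scala
import org.apache.spark.sql.SparkSession

// Connect to the Spark Connect server exposed by the Spark master container.
// "sc://localhost:15002" assumes the default SPARK_CONNECT_PORT mapping.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```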
### Spark Worker

The Spark worker node. It is configured to use an external Hive metastore in the container `metastore`. The
Spark worker also has the Flint and PPL extensions installed. It can use locally built Jar files when building
the docker image.

### Spark Submit

A temporary container that runs queries for an Async API session. It is started by the OpenSearch container. It
does not connect to the Spark cluster and instead runs the queries locally. It keeps looking for more queries
to run until it reaches its timeout (3 minutes by default).

The Spark submit container is configured to use an external Hive metastore in the container `metastore`. The
Flint and PPL extensions are installed. When building the docker image, locally built Jar files can be used.

### Metastore (Hive)

A Hive server that is used as a metastore for the Spark containers. It is configured to store its data in the
`integ-test` bucket on the Minio container.

This container also has a docker volume used to persist the metastore.

### Minio (S3)

A Minio server that acts as an S3 server. It is used as part of the workflow of executing an S3/Glue query and
contains the data for the S3 tables.

This container also has a docker volume used to persist the S3 data.

### Configuration Updater

A temporary container that is used to configure the OpenSearch and Minio containers. It runs after both of
those have started up. For Minio, it adds the `integ-test` bucket and creates an access key. For OpenSearch,
it creates the S3/Glue datasource and applies a cluster configuration.

## Running the Cluster

To start the cluster, go to the directory `docker/integ-test` and use Docker Compose. When starting the
cluster, wait for the `spark-worker` container to finish starting up; it is the last container to start.

Start the cluster in the foreground:
```shell
docker compose up
```

Start the cluster in the background:
```shell
docker compose up -d
```

Stop the cluster:
```shell
docker compose down
```

## Creating Tables in S3

Tables need to be created in Spark as external tables with their location set to a path under
`s3a://integ-test/`. You can use `spark-shell` on the Spark master container to do this:
```shell
docker exec -it spark spark-shell
```

Example of creating a table and adding data:
```scala
spark.sql("CREATE EXTERNAL TABLE foo (id int, name varchar(100)) location 's3a://integ-test/foo'")
spark.sql("INSERT INTO foo (id, name) VALUES(1, 'Foo')")
```
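Since the Spark master has the PPL extension installed, the same `spark-shell` session can also verify the
data with a PPL query. A small sketch using the `foo` table created above, following the same pattern as the
PPL example in `docs/docker/spark-docker.md`:
```scala
// PPL query against the table created above; returns the inserted row.
spark.sql("source=foo | where id = 1 | fields id, name").show()
```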
## Configuration of the Cluster

There are several settings that can be adjusted for the cluster:

* SPARK_VERSION - the tag of the `bitnami/spark` docker image to use
* OPENSEARCH_VERSION - the tag of the `opensearchproject/opensearch` docker image to use
* DASHBOARDS_VERSION - the tag of the `opensearchproject/opensearch-dashboards` docker image to use
* MASTER_UI_PORT - port on the host OS to map to the master UI port (8080) of the Spark master
* MASTER_PORT - port on the host OS to map to the master port (7077) on the Spark master
* UI_PORT - port on the host OS to map to the UI port (4040) on the Spark master
* SPARK_CONNECT_PORT - port on the host OS to map to the Spark Connect port (15002) on the Spark master
* PPL_JAR - the relative path to the PPL extension Jar file; must be within the base directory of this
  repository
* FLINT_JAR - the relative path to the Flint extension Jar file; must be within the base directory of this
  repository
* SQL_APP_JAR - the relative path to the SQL application Jar file; must be within the base directory of this
  repository
* OPENSEARCH_NODE_MEMORY - amount of memory to allocate for the OpenSearch server
* OPENSEARCH_ADMIN_PASSWORD - password for the admin user of the OpenSearch server
* OPENSEARCH_PORT - port on the host OS to map to port 9200 on the OpenSearch server
* OPENSEARCH_PA_PORT - port on the host OS to map to the performance analyzer port (9600) on the OpenSearch
  server
* OPENSEARCH_DASHBOARDS_PORT - port on the host OS to map to port 5601 on the OpenSearch Dashboards server
* S3_ACCESS_KEY - access key to create on the Minio container
* S3_SECRET_KEY - secret key to create on the Minio container

## Async API Overview

[Async API Interfaces](https://github.com/opensearch-project/sql/blob/main/docs/user/interfaces/asyncqueryinterface.rst)

[Async API Documentation](https://opensearch.org/docs/latest/search-plugins/async/index/)

The Async API is able to query S3/Glue datasources. Normally this is done by calling the AWS EMR service to
run the query in a docker container. The container uses Spark and is able to access the Glue catalog and
retrieve data from S3.

For the docker cluster, Minio is used in place of S3, and Docker itself is used in place of AWS EMR.

![OpenSearch Async API Sequence Diagram](images/OpenSearch_Async_API.png "Sequence Diagram")

1. Client submits a request to the async_search API endpoint.
2. OpenSearch server creates a special index (if it doesn't exist). This index is used to store async API
   requests along with some state information.
3. OpenSearch server checks if the query is for an S3/Glue datasource. If it is not, then OpenSearch can
   handle the request on its own.
4. OpenSearch uses docker to start a new container to process queries for the current async API session.
5. OpenSearch returns the queryId and sessionId to the client.
6. Spark submit docker container starts up.
7. Spark submit docker container searches the index from step 2 for a query in the current session to run.
8. Spark submit docker container creates a special OpenSearch index (if it doesn't exist). This index is used
   to store the results of the async API queries.
9. Spark submit docker container looks up the table metadata from the `metastore` container.
10. Spark submit docker container retrieves the data from the Minio container.
11. Spark submit docker container writes the results to the OpenSearch index from step 8.
12. Client submits a request to the async_search results API endpoint using the queryId from step 5.
13. OpenSearch returns the results to the client.
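To exercise this flow end to end from the host OS, a query can be submitted with curl against the async query
API described in the interface document linked above. This is a sketch assuming the default port mapping and
the admin password from `.env`; the table name is illustrative, and `<queryId>` must be replaced with the
value returned by the submit call:
```shell
# Submit a PPL query to the mys3 datasource (steps 1-5 in the diagram)
curl -k -u 'admin:C0rrecthorsebatterystaple.' \
  -H 'Content-Type: application/json' \
  -X POST 'https://localhost:9200/_plugins/_async_query' \
  -d '{"datasource": "mys3", "lang": "ppl", "query": "source=mys3.default.foo | fields id, name"}'

# Fetch the results using the queryId from the submit response (steps 12-13)
curl -k -u 'admin:C0rrecthorsebatterystaple.' \
  'https://localhost:9200/_plugins/_async_query/<queryId>'
```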
diff --git a/docs/docker/integ-test/images/OpenSearch_Async_API.png b/docs/docker/integ-test/images/OpenSearch_Async_API.png
new file mode 100644
index 000000000..25887bbcf
Binary files /dev/null and b/docs/docker/integ-test/images/OpenSearch_Async_API.png differ
diff --git a/docs/docker/integ-test/images/integ-test-containers.png b/docs/docker/integ-test/images/integ-test-containers.png
new file mode 100644
index 000000000..92878e187
Binary files /dev/null and b/docs/docker/integ-test/images/integ-test-containers.png differ
diff --git a/docs/spark-docker.md b/docs/docker/spark-docker.md
similarity index 96%
rename from docs/spark-docker.md
rename to docs/docker/spark-docker.md
index d1200e2b3..9c505f1cd 100644
--- a/docs/spark-docker.md
+++ b/docs/docker/spark-docker.md
@@ -19,7 +19,7 @@ sbt clean
 sbt assembly
 ```
 
-Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
+Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.
 
 ## Using Docker Compose
 
@@ -65,7 +65,7 @@ spark.sql("INSERT INTO test_table (id, name) VALUES(2, 'Bar')")
 spark.sql("source=test_table | eval x = id + 5 | fields x, name").show()
 ```
 
-For further information, see the [Spark PPL Test Instructions](ppl-lang/local-spark-ppl-test-instruction.md)
+For further information, see the [Spark PPL Test Instructions](../ppl-lang/local-spark-ppl-test-instruction.md)
 
 ## Manual Setup
diff --git a/docs/spark-emr-docker.md b/docs/docker/spark-emr-docker.md
similarity index 98%
rename from docs/spark-emr-docker.md
rename to docs/docker/spark-emr-docker.md
index 7eef4d250..e56295736 100644
--- a/docs/spark-emr-docker.md
+++ b/docs/docker/spark-emr-docker.md
@@ -18,7 +18,7 @@ sbt clean
 sbt assembly
 ```
 
-Refer to the [Developer Guide](../DEVELOPER_GUIDE.md) for more information.
+Refer to the [Developer Guide](../../DEVELOPER_GUIDE.md) for more information.
 
 ## Using Docker Compose