Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create new CI workflow and run unit tests using it #17700

Merged

Conversation

Akshat-Jain
Copy link
Contributor

@Akshat-Jain Akshat-Jain commented Feb 6, 2025

Description

This PR has 4 parts:

  1. Creation of a new workflow
  2. Moving unit tests (phase 1) to the new workflow
  3. Moving unit tests (phase 2) to a weekly cron workflow
  4. New reporting mechanism

I'll describe the above 4 parts separately below.

Creation of a new workflow

This PR adds 2 new workflow yml files:

  • ci.yml
  • worker.yml

Eventually, ci.yml is expected to be the new starting point for our entire CI workflow. However, right now, since we're migrating only the unit tests (phase 1) to use ci.yml, it's not the starting point for now (during this transition of individual parts one-by-one to the new workflow).

ci.yml is responsible for orchestrating the workflows with the help of a worker.yml.

worker.yml is responsible for doing the actual work, as per requested by ci.yml via the input arguments.

Motivation behind this change:

  • Currently, our workflows are spread across a bunch of yml files, making it difficult to understand the flows.
  • Having a central file ci.yml will make it much easier to understand the dependency graph between all the different workflows.
  • worker.yml takes a script as one of the input args. The idea behind this design is to make it easy for folks to be able to run the same scripts locally, whenever they require to run the workflows locally (during development or debugging). This becomes difficult to figure out when we have a bunch of yml files and we're passing around input args through them, hence we wanted to keep it as simple as possible to understand and reproduce locally.

Moving unit tests (phase 1) to the new workflow

The unit tests (phase 1) that was being run via unit-tests.yml, reusable-unit-tests.yml and unit_tests_script.sh - has been updated to run via ci.yml, worker.yml and run-unit-tests.sh. As a result, unit-tests.yml, reusable-unit-tests.yml and unit_tests_script.sh have also been removed since they're unused now (the same has been done for unit tests phase 2, as can be read in the next section).

Currently, we were running unit tests for 4 different set of modules: indexing, processing, server, and "everything else". This resulted in longer duration of the job, with processing modules taking 50+ minutes, for example. Also, the design with "everything else" wouldn't scale in the long term, since it's everything that isn't covered in the other 3 modules.

This design has been updated to run using regex patterns, on the starting character of the test class. We have the following in ci.yml:

  matrix:
    Dtest: [ "A*,B*", "C*", "D*,E*,F*", "G*,H*", "I*,J*", "K*,L*", "M*,N*", "O*,P*", "Q*", "R*", "S*", "T*,U*", "V*,W*,X*,Y*,Z*" ]

Hence, all unit test classes starting with A and B gets run as part of a single job, across all modules. Similar thing happens for C, similar for D,E,F, and so on.

The character groupings have been done to try and avoid skewness across them. This also allows scaling in the long term, as well as the flexibility to tune these groupings, if and when needed.

With this change, most groups finish in around 15-18 minutes, and a few of them finishing in around 25 minutes.

The reporting mechanism for this design of running unit tests is being covered in the New reporting mechanism section below.

Moving unit tests (phase 2) to a weekly cron workflow

For context, the unit tests phase 1 are being run against JDK 17, whereas the unit tests phase 2 are being run against JDKs 11 and 21.

It's very rare for a test to have different result in JDK 17 versus JDK 11/21, hence it is redundant to run unit tests phase 2 on every commit on every PR.

Hence, the unit tests phase 2 have been removed from the regular PR workflow, and instead have been added as a weekly cron workflow. This ensures that we are still running them once a week, and allows us to track failures, if any (which we ideally don't expect to have).

Unit tests phase 2 have also been updated to use the new workflow files and the new regex pattern approach, as done for unit tests phase 1.

New reporting mechanism

There are a few aspects to this:

  1. The unit test jobs are called as success when they were able to run all the tests as per the input parameters, even if some of the tests fail. This can be seen in ci.yml where we are passing -Dmaven.test.failure.ignore=true while running unit tests.
  2. All unit test jobs upload the surefire reports to individual artifact, with the same prefix.
  3. We have a reporting-unit-test-failures job, which downloads all the artifacts uploaded in point (2) - hence gathering all the individual surefire reports in the same place/hierarchy, and then runs a reporter (mikepenz/action-junit-report@v5) against those gathered surefire reports. I'll add some sample run links below for how the report looks like.
  4. We have a reporting-jacoco-coverage-failures job, which does something similar, and then uses create-jacoco-coverage-report.sh to create a jacoco report. This new script is mostly derived (with some simplifications) from the existing unit_tests_script.sh we had, but the overall logic and the way of reporting (that is, the final output) is the same as what we had currently.

Some examples for the unit test report:

  1. The failed tests can be seen in the Summary section of the workflow run: https://github.com/Akshat-Jain/druid/actions/runs/13152426940/attempts/1#summary-36703177495
  2. The failed tests can also be seen in the report-unit-test-failures job run itself: https://github.com/Akshat-Jain/druid/actions/runs/13152426940/job/36703177495 (you need to expand the Publish results dropdown).

Attaching sample images for both of the above examples below, for quicker reference:

image

Image for (1)

image

Image for (2)


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

.github/workflows/worker.yml Outdated Show resolved Hide resolved
.github/workflows/worker.yml Outdated Show resolved Hide resolved
.github/scripts/run-unit-tests.sh Outdated Show resolved Hide resolved
.github/scripts/make-hash-for-artifact-name.sh Outdated Show resolved Hide resolved
@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 7, 2025
Copy link
Member

@kgyrtkirk kgyrtkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a recent run produce the test report in <25m!
cool!

@kgyrtkirk kgyrtkirk merged commit 11eecb4 into apache:master Feb 7, 2025
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Dependencies Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 GHA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants