Create new CI workflow and run unit tests using it #17700

Akshat-Jain · 2025-02-06T06:04:07Z

Description

This PR has 4 parts:

Creation of a new workflow
Moving unit tests (phase 1) to the new workflow
Moving unit tests (phase 2) to a weekly cron workflow
New reporting mechanism

I'll describe the above 4 parts separately below.

Creation of a new workflow

This PR adds 2 new workflow yml files:

ci.yml
worker.yml

Eventually, ci.yml is expected to be the new starting point for our entire CI workflow. However, right now, since we're migrating only the unit tests (phase 1) to use ci.yml, it's not the starting point for now (during this transition of individual parts one-by-one to the new workflow).

ci.yml is responsible for orchestrating the workflows with the help of a worker.yml.

worker.yml is responsible for doing the actual work, as per requested by ci.yml via the input arguments.

Motivation behind this change:

Currently, our workflows are spread across a bunch of yml files, making it difficult to understand the flows.
Having a central file ci.yml will make it much easier to understand the dependency graph between all the different workflows.
worker.yml takes a script as one of the input args. The idea behind this design is to make it easy for folks to be able to run the same scripts locally, whenever they require to run the workflows locally (during development or debugging). This becomes difficult to figure out when we have a bunch of yml files and we're passing around input args through them, hence we wanted to keep it as simple as possible to understand and reproduce locally.

Moving unit tests (phase 1) to the new workflow

The unit tests (phase 1) that was being run via unit-tests.yml, reusable-unit-tests.yml and unit_tests_script.sh - has been updated to run via ci.yml, worker.yml and run-unit-tests.sh. As a result, unit-tests.yml, reusable-unit-tests.yml and unit_tests_script.sh have also been removed since they're unused now (the same has been done for unit tests phase 2, as can be read in the next section).

Currently, we were running unit tests for 4 different set of modules: indexing, processing, server, and "everything else". This resulted in longer duration of the job, with processing modules taking 50+ minutes, for example. Also, the design with "everything else" wouldn't scale in the long term, since it's everything that isn't covered in the other 3 modules.

This design has been updated to run using regex patterns, on the starting character of the test class. We have the following in ci.yml:

  matrix:
    Dtest: [ "A*,B*", "C*", "D*,E*,F*", "G*,H*", "I*,J*", "K*,L*", "M*,N*", "O*,P*", "Q*", "R*", "S*", "T*,U*", "V*,W*,X*,Y*,Z*" ]

Hence, all unit test classes starting with A and B gets run as part of a single job, across all modules. Similar thing happens for C, similar for D,E,F, and so on.

The character groupings have been done to try and avoid skewness across them. This also allows scaling in the long term, as well as the flexibility to tune these groupings, if and when needed.

With this change, most groups finish in around 15-18 minutes, and a few of them finishing in around 25 minutes.

The reporting mechanism for this design of running unit tests is being covered in the New reporting mechanism section below.

Moving unit tests (phase 2) to a weekly cron workflow

For context, the unit tests phase 1 are being run against JDK 17, whereas the unit tests phase 2 are being run against JDKs 11 and 21.

It's very rare for a test to have different result in JDK 17 versus JDK 11/21, hence it is redundant to run unit tests phase 2 on every commit on every PR.

Hence, the unit tests phase 2 have been removed from the regular PR workflow, and instead have been added as a weekly cron workflow. This ensures that we are still running them once a week, and allows us to track failures, if any (which we ideally don't expect to have).

Unit tests phase 2 have also been updated to use the new workflow files and the new regex pattern approach, as done for unit tests phase 1.

New reporting mechanism

There are a few aspects to this:

The unit test jobs are called as success when they were able to run all the tests as per the input parameters, even if some of the tests fail. This can be seen in ci.yml where we are passing -Dmaven.test.failure.ignore=true while running unit tests.
All unit test jobs upload the surefire reports to individual artifact, with the same prefix.
We have a reporting-unit-test-failures job, which downloads all the artifacts uploaded in point (2) - hence gathering all the individual surefire reports in the same place/hierarchy, and then runs a reporter (mikepenz/action-junit-report@v5) against those gathered surefire reports. I'll add some sample run links below for how the report looks like.
We have a reporting-jacoco-coverage-failures job, which does something similar, and then uses create-jacoco-coverage-report.sh to create a jacoco report. This new script is mostly derived (with some simplifications) from the existing unit_tests_script.sh we had, but the overall logic and the way of reporting (that is, the final output) is the same as what we had currently.

Some examples for the unit test report:

The failed tests can be seen in the Summary section of the workflow run: https://github.com/Akshat-Jain/druid/actions/runs/13152426940/attempts/1#summary-36703177495
The failed tests can also be seen in the report-unit-test-failures job run itself: https://github.com/Akshat-Jain/druid/actions/runs/13152426940/job/36703177495 (you need to expand the Publish results dropdown).

Attaching sample images for both of the above examples below, for quicker reference:

Image for (1)

Image for (2)

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

.github/scripts/create-jacoco-coverage-report.sh

.github/workflows/worker.yml

.github/scripts/run-unit-tests.sh

.github/scripts/make-hash-for-artifact-name.sh

kgyrtkirk

a recent run produce the test report in <25m!
cool!

Create new CI workflow and run unit tests using it

4f63cfe

github-actions bot added Area - Dependencies GHA labels Feb 6, 2025

Akshat-Jain added 2 commits February 6, 2025 12:04

Try removing actions-timeline

fe582b8

Add github.repository check in the cron workflow

25f17bb

kgyrtkirk reviewed Feb 6, 2025

View reviewed changes

Akshat-Jain added 2 commits February 7, 2025 10:41

Address review comments

0a3bd19

Re-group into a more optimal grouping

08db049

github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Feb 7, 2025

Akshat-Jain added 2 commits February 7, 2025 13:24

Add cache:maven

20bd1d4

Update groupings in the cron workflow as well

4584a99

kgyrtkirk approved these changes Feb 7, 2025

View reviewed changes

kgyrtkirk merged commit 11eecb4 into apache:master Feb 7, 2025
75 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create new CI workflow and run unit tests using it #17700

Create new CI workflow and run unit tests using it #17700

Akshat-Jain commented Feb 6, 2025 •

edited

Loading

kgyrtkirk left a comment

Create new CI workflow and run unit tests using it #17700

Create new CI workflow and run unit tests using it #17700

Conversation

Akshat-Jain commented Feb 6, 2025 • edited Loading

Description

Creation of a new workflow

Moving unit tests (phase 1) to the new workflow

Moving unit tests (phase 2) to a weekly cron workflow

New reporting mechanism

kgyrtkirk left a comment

Choose a reason for hiding this comment

Akshat-Jain commented Feb 6, 2025 •

edited

Loading