Handle Case Where Recently Launched Worker Does Not Immediately Heartbeat #741

Merged

Conversation

@kmg-stripe (Collaborator) commented Jan 9, 2025

It seems like there is a race condition where a recently launched worker has not sent a heartbeat, the elapsed time is still within the missed-heartbeat threshold, and the JobActor responds to the missing heartbeat with a resubmit.


Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Commit: Handle Case Where Recently Launched Worker Does Not Immediately Heartbeat

It seems like there is a race condition where a recently launched worker has not sent a heartbeat, the elapsed time is still within the missed-heartbeat threshold, and the JobActor responds to the missing heartbeat with a resubmit. If this is true, this can lead to mass resubmits when a new leader is elected.

github-actions bot commented Jan 9, 2025

Test Results

620 tests  ±0   610 ✅ ±0   7m 44s ⏱️ -7s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit 5d8d6ec. ± Comparison against base commit 292c419.

♻️ This comment has been updated with latest results.

@kmg-stripe (Collaborator, Author) commented

Hey @Andyz26! When releasing the latest version of Mantis in our QA environment, we started noticing excessive resubmits. The problem goes away when we roll back. I "think" the issue could be related to the fix in #734.

It looks like a recently "launched" worker that has not reported a heartbeat could be resubmitted by the master. I am wondering if we want to check whether the duration now() - launchedAt exceeds the heartbeat threshold before resubmitting.

If this seems reasonable, I can create another test (or add to the existing one). If not, can you think of a reason we would see excessive resubmits?

cc: @crioux-stripe
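
A minimal sketch of the guard being proposed, assuming hypothetical names (HeartbeatResubmitGuard, launchedAt, lastHeartbeatAt, missedHeartbeatThreshold); the real JobActor state and heartbeat bookkeeping may differ:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;

    // Sketch only: decide whether a worker should be resubmitted for a missed heartbeat.
    final class HeartbeatResubmitGuard {

        static boolean shouldResubmit(Instant now,
                                      Instant launchedAt,
                                      Optional<Instant> lastHeartbeatAt,
                                      Duration missedHeartbeatThreshold) {
            if (!lastHeartbeatAt.isPresent()) {
                // No heartbeat yet: only resubmit once the worker has been in the
                // launched state for longer than the missed-heartbeat threshold.
                return Duration.between(launchedAt, now).compareTo(missedHeartbeatThreshold) > 0;
            }
            // Otherwise apply the usual staleness check on the last heartbeat.
            return Duration.between(lastHeartbeatAt.get(), now).compareTo(missedHeartbeatThreshold) > 0;
        }
    }

The idea is simply that a worker with no heartbeat yet is treated as missing only after now() - launchedAt exceeds the threshold, rather than being resubmitted immediately.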

@@ -48,7 +48,7 @@ jobs:
           CI_BRANCH: ${{ github.ref }}
           COVERALLS_REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
       - name: Upload Test Results
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
@kmg-stripe (Collaborator, Author) commented on the diff

This is unrelated, but it was breaking CI because GitHub deprecated upload-artifact@v3.

@@ -823,9 +823,9 @@ public void testNoHeartBeatAfterLaunchResubmit() {
         assertEquals(JobState.Accepted, resp4.getJobMetadata().get().getState());

         // 1 original submissions and 0 resubmits because of worker not in launched state with HB timeouts
-        verify(schedulerMock, times(2)).scheduleWorkers(any());
+        verify(schedulerMock, times(1)).scheduleWorkers(any());
@kmg-stripe (Collaborator, Author) commented on the diff

Just temporary to get the tests passing. I'll update the tests if we think the above logic is reasonable.
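
As an illustration of what the updated coverage might assert, here is a small JUnit sketch written against the hypothetical guard above (not the actual testNoHeartBeatAfterLaunchResubmit, which drives the real JobActor and schedulerMock):

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;
    import org.junit.Test;

    public class HeartbeatResubmitGuardTest {

        private final Duration threshold = Duration.ofSeconds(60);
        private final Instant launchedAt = Instant.parse("2025-01-09T00:00:00Z");

        @Test
        public void noResubmitWhileLaunchedWorkerIsWithinThreshold() {
            Instant now = launchedAt.plusSeconds(30); // still inside the first heartbeat window
            assertFalse(HeartbeatResubmitGuard.shouldResubmit(now, launchedAt, Optional.empty(), threshold));
        }

        @Test
        public void resubmitOnceLaunchedWorkerExceedsThresholdWithoutHeartbeat() {
            Instant now = launchedAt.plusSeconds(90); // past the threshold, still no heartbeat
            assertTrue(HeartbeatResubmitGuard.shouldResubmit(now, launchedAt, Optional.empty(), threshold));
        }
    }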

@kmg-stripe (Collaborator, Author) commented

@Andyz26 - if you get a chance, could you approve the snapshot build? I was hoping to see if this fixes the issue we are seeing on our end. Thanks!

kmg-stripe had a problem deploying to Integrate Pull Request on January 10, 2025 19:27 with GitHub Actions (Failure)
@Andyz26 (Collaborator) commented Jan 10, 2025

@kmg-stripe I think I understand the problem now, and your solution in the PR makes sense to me.
I would also like to use this chance to better understand how you are dealing with the lifecycle events. In our setup, the launch event is emitted when the worker is basically "ready to work", which is probably why we don't see this issue.

It would be great if we could consolidate on the lifecycle event expectations for future changes.

@kmg-stripe (Collaborator, Author) commented

> @kmg-stripe I think I understand the problem now, and your solution in the PR makes sense to me. I would also like to use this chance to better understand how you are dealing with the lifecycle events. In our setup, the launch event is emitted when the worker is basically "ready to work", which is probably why we don't see this issue.
>
> It would be great if we could consolidate on the lifecycle event expectations for future changes.

@Andyz26 I believe we are not doing anything special regarding the worker lifecycle. That is, I think we are relying on whatever the "default" behavior is. My impression is that "launched" means a worker has landed on an instance but hasn't started yet. I'd say the time between "launched" and "started" can range from tens of seconds up to a few minutes. @crioux-stripe can keep me honest here if we are doing anything out of the ordinary.

Also - it looks like Gradle failed when publishing a snapshot. I cannot tell why based on the error:

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

Not sure if this is a transient issue... Could you approve the latest snapshot job when you get a chance? Thanks!

@crioux-stripe (Collaborator) commented

Approving to see if I can get a snapshot triggered.

kmg-stripe had a problem deploying to Integrate Pull Request on January 13, 2025 18:50 with GitHub Actions (Failure)
@crioux-stripe (Collaborator) commented

Tracing backwards, our workers are getting set to launched in the JobWorker#processEvent(WorkerEvent) method, as expected. I'll dig into what actually triggers that event...

The WorkerLaunched event is created in our ResourceClusterAwareSchedulerActor#onSubmittedScheduleRequestEvent(SubmittedScheduleRequestEvent). I'm guessing the preceding line, final TaskExecutorRegistration info = resourceCluster.getTaskExecutorInfo(taskExecutorID).join();, is much more sophisticated in the Netflix Titus implementation compared to our thin ASG wrapper. Created here.

@Andyz26 (Collaborator) commented Jan 13, 2025

Maybe some operations during worker startup were taking longer? I think we have a shorter window from submitted/launched to the first HB being sent, so this is not causing issues for us. I think the logic here is sound, and we can merge and get an RC started.

@crioux-stripe (Collaborator) commented

Hrmm. Not really sure what is going on with this snapshot. I'm of the opinion that we can probably merge this as is, but I want to wait on @Andyz26's approval.

On the subject of the lifecycle, @Andyz26, we're just relying on the default behavior I linked to above. Does the resource provider used internally at Netflix do something smarter with the agent states?

@crioux-stripe (Collaborator) commented

Oops! Missed @Andyz26's intervening comment. Will merge!

crioux-stripe merged commit 86a0916 into Netflix:master on Jan 13, 2025 (4 of 5 checks passed).