Handle Case Where Recently Launched Worker Does Not Immediately Heartbeat #741

Merged

Conversation

@kmg-stripe (Collaborator) commented Jan 9, 2025

It seems like there is a race condition where a recently launched worker has not sent a heartbeat, the elapsed time is still within the missed-heartbeat threshold, and the JobActor responds to the missing heartbeat with a resubmit.


Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Commit: Handle Case Where Recently Launched Worker Does Not Immediately Heartbeat

It seems like there is a race condition where a recently launched worker has not sent a heartbeat, the elapsed time is still within the missed-heartbeat threshold, and the JobActor responds to the missing heartbeat with a resubmit. If this is true, this can lead to mass resubmits when a new leader is elected.

github-actions bot commented Jan 9, 2025

Test Results

620 tests  ±0   610 ✅ ±0   7m 44s ⏱️ -7s
142 suites ±0    10 💤 ±0 
142 files   ±0     0 ❌ ±0 

Results for commit 5d8d6ec. ± Comparison against base commit 292c419.

♻️ This comment has been updated with latest results.

@kmg-stripe (Collaborator, Author) commented

Hey @Andyz26! When releasing the latest version of Mantis in our QA environment, we started noticing excessive resubmits. The problem goes away when we roll back. I "think" the issue could be related to the fix in #734.

It looks like a recently "launched" worker that has not reported a heartbeat could be resubmitted by the master. I am wondering if we want to check whether the duration now() - launchedAt exceeds the heartbeat threshold before resubmitting.

If this seems reasonable, I can create another test (or add to the existing one). If not, can you think of a reason we would see excessive resubmits?

cc: @crioux-stripe
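
A minimal sketch of the guard being proposed, assuming hypothetical names (HeartbeatResubmitGuard, launchedAt, lastHeartbeatAt, missedHeartbeatThreshold); the real JobActor state and heartbeat bookkeeping may differ:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;

    // Sketch only: decide whether a worker should be resubmitted for a missed heartbeat.
    final class HeartbeatResubmitGuard {

        static boolean shouldResubmit(Instant now,
                                      Instant launchedAt,
                                      Optional<Instant> lastHeartbeatAt,
                                      Duration missedHeartbeatThreshold) {
            if (!lastHeartbeatAt.isPresent()) {
                // No heartbeat yet: only resubmit once the worker has been in the
                // launched state for longer than the missed-heartbeat threshold.
                return Duration.between(launchedAt, now).compareTo(missedHeartbeatThreshold) > 0;
            }
            // Otherwise apply the usual staleness check on the last heartbeat.
            return Duration.between(lastHeartbeatAt.get(), now).compareTo(missedHeartbeatThreshold) > 0;
        }
    }

The idea is simply that a worker with no heartbeat yet is treated as missing only after now() - launchedAt exceeds the threshold, rather than being resubmitted immediately.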

@@ -48,7 +48,7 @@ jobs:
           CI_BRANCH: ${{ github.ref }}
           COVERALLS_REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
       - name: Upload Test Results
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
@kmg-stripe (Collaborator, Author) commented on the diff

This is unrelated, but it was breaking CI because GitHub deprecated upload-artifact@v3.

@@ -823,9 +823,9 @@ public void testNoHeartBeatAfterLaunchResubmit() {
         assertEquals(JobState.Accepted, resp4.getJobMetadata().get().getState());

         // 1 original submissions and 0 resubmits because of worker not in launched state with HB timeouts
-        verify(schedulerMock, times(2)).scheduleWorkers(any());
+        verify(schedulerMock, times(1)).scheduleWorkers(any());
@kmg-stripe (Collaborator, Author) commented on the diff

Just temporary to get the tests passing. I'll update the tests if we think the above logic is reasonable.
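
As an illustration of what the updated coverage might assert, here is a small JUnit sketch written against the hypothetical guard above (not the actual testNoHeartBeatAfterLaunchResubmit, which drives the real JobActor and schedulerMock):

    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Optional;
    import org.junit.Test;

    public class HeartbeatResubmitGuardTest {

        private final Duration threshold = Duration.ofSeconds(60);
        private final Instant launchedAt = Instant.parse("2025-01-09T00:00:00Z");

        @Test
        public void noResubmitWhileLaunchedWorkerIsWithinThreshold() {
            Instant now = launchedAt.plusSeconds(30); // still inside the first heartbeat window
            assertFalse(HeartbeatResubmitGuard.shouldResubmit(now, launchedAt, Optional.empty(), threshold));
        }

        @Test
        public void resubmitOnceLaunchedWorkerExceedsThresholdWithoutHeartbeat() {
            Instant now = launchedAt.plusSeconds(90); // past the threshold, still no heartbeat
            assertTrue(HeartbeatResubmitGuard.shouldResubmit(now, launchedAt, Optional.empty(), threshold));
        }
    }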

@kmg-stripe (Collaborator, Author) commented

@Andyz26 - if you get a chance, could you approve the snapshot build? I was hoping to see if this fixes the issue we are seeing on our end. Thanks!

kmg-stripe had a problem deploying to Integrate Pull Request on January 10, 2025 19:27 with GitHub Actions (Failure)
@Andyz26 (Collaborator) commented Jan 10, 2025

@kmg-stripe I think I understand the problem now, and your solution in the PR makes sense to me.
I would also like to use this chance to better understand how you are dealing with the lifecycle events. In our setup, the launch event is emitted when the worker is basically "ready to work", which is probably why we don't see this issue.

It would be great if we could consolidate on the lifecycle event expectations for future changes.

@kmg-stripe (Collaborator, Author) commented

> @kmg-stripe I think I understand the problem now, and your solution in the PR makes sense to me. I would also like to use this chance to better understand how you are dealing with the lifecycle events. In our setup, the launch event is emitted when the worker is basically "ready to work", which is probably why we don't see this issue.
>
> It would be great if we could consolidate on the lifecycle event expectations for future changes.

@Andyz26 I believe we are not doing anything special regarding the worker lifecycle. That is, I think we are relying on whatever the "default" behavior is. My impression is that "launched" means a worker has landed on an instance but hasn't started yet. I'd say the time between "launched" and "started" can range from tens of seconds up to a few minutes. @crioux-stripe can keep me honest here if we are doing anything out of the ordinary.

Also - it looks like Gradle failed when publishing a snapshot. I cannot tell why based on the error:

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

Not sure if this is a transient issue... Could you approve the latest snapshot job when you get a chance? Thanks!

@crioux-stripe (Collaborator) commented

Approving to see if I can get a snapshot triggered.

kmg-stripe had a problem deploying to Integrate Pull Request on January 13, 2025 18:50 with GitHub Actions (Failure)
@crioux-stripe (Collaborator) commented

Tracing backwards, our workers are getting set to launched in the JobWorker#processEvent(WorkerEvent) method, as expected. I'll dig into what actually triggers that event...

The WorkerLaunched event is created in our ResourceClusterAwareSchedulerActor#onSubmittedScheduleRequestEvent(SubmittedScheduleRequestEvent). I'm guessing the preceding line, final TaskExecutorRegistration info = resourceCluster.getTaskExecutorInfo(taskExecutorID).join();, is much more sophisticated in the Netflix Titus implementation compared to our thin ASG wrapper. Created here.

@Andyz26 (Collaborator) commented Jan 13, 2025

Maybe some operations during worker startup were taking longer? I think we have a shorter window from submitted/launched to the first HB being sent, so this is not causing issues for us. I think the logic here is sound, and we can merge and get an RC started.

@crioux-stripe (Collaborator) commented

Hrmm. Not really sure what is going on with this snapshot. I'm of the opinion that we can probably merge this as is, but I want to wait on @Andyz26's approval.

On the subject of the lifecycle, @Andyz26, we're just relying on the default behavior I linked to above. Does the resource provider used internally at Netflix do something smarter with the agent states?

@crioux-stripe (Collaborator) commented

Oops! Missed @Andyz26's intervening comment. Will merge!

crioux-stripe merged commit 86a0916 into Netflix:master on Jan 13, 2025 (4 of 5 checks passed).