Handle Case Where Recently Launched Worker Does Not Immediately Heartbeat #741
Conversation
It seems like there is a race condition where a recently launched worker has not yet sent a heartbeat and the elapsed time is still within the missed-heartbeat threshold, yet the JobActor treats the missing heartbeat as grounds for a resubmit. If this is true, it can lead to mass resubmits when a new leader is elected.
Hey @Andyz26! When releasing the latest version of Mantis in our QA environment, we started noticing excessive resubmits. The problem goes away when we roll back. I "think" the issue could be related to the fix in #734. It looks like a recently "launched" worker that has not reported a heartbeat could be resubmitted by the master. I am wondering if we want to check the duration since launch against the missed-heartbeat threshold before resubmitting, as in the sketch below. If this seems reasonable, I can create another test (or add to the existing one). If not, can you think of a reason we would see excessive resubmits? cc: @crioux-stripe
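To make the idea concrete, here is a minimal sketch of the kind of guard being proposed; the class, enum, and parameter names are hypothetical stand-ins rather than the actual Mantis types:

import java.time.Duration;
import java.time.Instant;

// Minimal sketch of the proposed guard (hypothetical names, not the actual Mantis code).
class HeartbeatResubmitGuard {
    enum WorkerState { Launched, Started }

    // Skip the resubmit when the worker was launched recently and the elapsed time is still
    // within the missed-heartbeat threshold: it may simply not have sent its first heartbeat yet.
    static boolean shouldResubmit(WorkerState state, Instant launchedAt,
                                  Duration missedHeartbeatThreshold, Instant now) {
        boolean recentlyLaunched = state == WorkerState.Launched
            && Duration.between(launchedAt, now).compareTo(missedHeartbeatThreshold) < 0;
        return !recentlyLaunched;
    }
}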
@@ -48,7 +48,7 @@ jobs:
           CI_BRANCH: ${{ github.ref }}
           COVERALLS_REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
       - name: Upload Test Results
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
this is unrelated, but was breaking CI because GH deprecated upload-artifact@v3
@@ -823,9 +823,9 @@ public void testNoHeartBeatAfterLaunchResubmit() {
        assertEquals(JobState.Accepted, resp4.getJobMetadata().get().getState());

        // 1 original submissions and 0 resubmits because of worker not in launched state with HB timeouts
-       verify(schedulerMock, times(2)).scheduleWorkers(any());
+       verify(schedulerMock, times(1)).scheduleWorkers(any());
just temporary to get the tests passing. i'll update the tests if we think the above logic is reasonable.
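If we keep the guard, a rough sketch of what the updated assertions might look like; schedulerMock comes from the existing test, while advanceTimeBy and missedHeartbeatThreshold are hypothetical placeholders for whatever the test harness actually provides:

import static org.mockito.Mockito.any;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

// Sketch only: inside the threshold there is just the original submission; once the
// threshold elapses with no heartbeat, the resubmit should happen.
verify(schedulerMock, times(1)).scheduleWorkers(any());   // no resubmit yet
advanceTimeBy(missedHeartbeatThreshold.plusSeconds(1));   // hypothetical clock helper
verify(schedulerMock, times(2)).scheduleWorkers(any());   // original submission + one resubmit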
@Andyz26 - if you get a chance, could you approve the snapshot build? i was hoping to see if this fixes the issue we are seeing on our end. thanks!
@kmg-stripe i think i understand the problem now and i think your solution in the PR makes sense to me. It would be great if we could consolidate on lifecycle event expectations for future changes.
@Andyz26 I believe we are not doing anything special regarding worker lifecycle. That is, I think we are relying on whatever the "default" behavior is. My impression is that "launched" means a worker has landed on an instance but hasn't started yet. I'd say the time between "launched" and "started" can range from tens of seconds up to a few minutes. @crioux-stripe can keep me honest here if we are doing anything out of the ordinary. Also, it looks like Gradle failed when publishing a snapshot. I cannot tell why based on the error:
Not sure if this is a transient issue... Could you approve the latest snapshot job when you get a chance? Thanks!
Approving to see if I can get a snapshot triggered.
Tracing backwards, our workers are getting set to launched in the JobWorker#processEvent(WorkerEvent) method as expected. I'll dig into what actually triggers that event...
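For reference, a simplified sketch of the kind of transition being traced; the event handlers and fields here are hypothetical stand-ins, not the actual JobWorker#processEvent(WorkerEvent) implementation:

import java.time.Instant;

// Simplified stand-in for the launched transition (hypothetical types and event names).
class WorkerLifecycleSketch {
    enum State { Accepted, Launched, Started }

    private State state = State.Accepted;
    private Instant launchedAt;

    // The resource provider reports the worker has landed on an instance; it has not
    // started or sent a heartbeat yet, so this launch timestamp is what the resubmit
    // guard would compare against the missed-heartbeat threshold.
    void onWorkerLaunched(Instant now) {
        state = State.Launched;
        launchedAt = now;
    }

    void onWorkerStarted() {
        state = State.Started;
    }
}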
Maybe some operations during the worker startup were taking longer? I think we have a shorter window from submitted/launched to first HB sent, so this is not causing issues for us. I think the logic here is sound and we can merge and get an RC started.
Hrmm. Not really sure what is going on with this snapshot. I'm of the opinion we can probably merge this as is, but want to wait on @Andyz26 's approval. On the subject of the lifecycle, @Andyz26 we're just relying on the default behavior I linked to above. Does the resource provider used internally at Netflix do something smarter with the agent states?
Oops! Missed @Andyz26 's intervening comment. Will merge!
It seems like there is a race condition where a recently launched worker has not yet sent a heartbeat and the elapsed time is still within the missed-heartbeat threshold, yet the JobActor treats the missing heartbeat as grounds for a resubmit.
Context
Checklist
./gradlew build compiles code correctly
./gradlew test passes all tests