Self hosted runner are going offline suddenly and it is not connected. #3501

Open
akhilp6 opened this issue Oct 10, 2024 · 13 comments
Labels
bug Something isn't working

Comments

akhilp6 (Author) commented Oct 10, 2024

Describe the bug
We are using self-hosted macOS machines for our GitHub Actions. They were working fine until yesterday, but they are now suddenly getting disconnected and going offline. When I checked the runner logs in _diag, I found this:

What's not working?

[2024-10-10 21:30:29Z INFO Terminal] WRITE LINE:
[2024-10-10 21:30:29Z INFO RSAFileKeyManager] Loading RSA key parameters from file /System/Volumes/Data/export/home/tester/actions-runner/.credentials_rsaparams
[2024-10-10 21:30:30Z ERR  GitHubActionsService] POST request to https://tokenghub.actions.githubusercontent.com/_apis/oauth2/token/****** failed. HTTP Status: BadRequest
[2024-10-10 21:30:30Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2024-10-10 21:30:30Z ERR  MessageListener] Catch exception during create session.
[2024-10-10 21:30:30Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration was not found or is not medium trust. ClientType:
   at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
   at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
   at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[2024-10-10 21:30:30Z ERR  MessageListener] Test oauth app registration.
[2024-10-10 21:30:30Z INFO RSAFileKeyManager] Loading RSA key parameters from file /System/Volumes/Data/export/home/tester/actions-runner/.credentials_rsaparams
[2024-10-10 21:30:30Z ERR  GitHubActionsService] POST request to https://tokenghub.actions.githubusercontent.com/_apis/oauth2/token/***** failed. HTTP Status: BadRequest
[2024-10-10 21:30:30Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2024-10-10 21:30:30Z INFO MessageListener] Retriable exception: Registration was not found or is not medium trust. ClientType:
[2024-10-10 21:30:30Z INFO MessageListener] Sleeping for 30 seconds before retrying.

I have been seeing this issue for a couple of days, and it is happening on all our hosts.

I did see some online threads; in some cases the server's clock was off (which is not true in our case).
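To check many hosts for the same failure, the `_diag` logs can be scanned for the error text shown above. The following is a minimal sketch, assuming every log line follows the `[timestamp LEVEL Source] message` layout seen in the excerpt; the function name `find_registration_errors` and the regular expression are illustrative and are not part of the runner.

```python
import re

# Pattern for one runner diagnostic line, for example:
# [2024-10-10 21:30:30Z ERR  MessageListener] Catch exception during create session.
# The layout is assumed from the excerpt in this issue.
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+"
    r"(?P<level>\w+)\s+(?P<source>\w+)\]\s?(?P<msg>.*)$"
)

# Message text copied from the log excerpt above.
REGISTRATION_ERROR = "Registration was not found or is not medium trust"


def find_registration_errors(lines):
    """Return (timestamp, source) for each line that reports the registration error."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and REGISTRATION_ERROR in match.group("msg"):
            hits.append((match.group("ts"), match.group("source")))
    return hits


sample = [
    "[2024-10-10 21:30:30Z ERR  MessageListener] Catch exception during create session.",
    "[2024-10-10 21:30:30Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: "
    "Registration was not found or is not medium trust. ClientType:",
    "[2024-10-10 21:30:30Z INFO MessageListener] Retriable exception: "
    "Registration was not found or is not medium trust. ClientType:",
]
errors = find_registration_errors(sample)
print(errors)
```

Stack-trace continuation lines do not match the pattern and are skipped, so only the timestamped lines that carry the error are counted.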

Runner Version and Platform

Runner version we are using is the latest 2.319.1
OS of the machine running the runner? OSX

Runner and Worker's Diagnostic Logs

Added the runner logs above.

akhilp6 added the bug label Oct 10, 2024

luketomlinson (Contributor) commented

@akhilp6 do you have a session Id for that runner?

ktaggart commented Oct 15, 2024

@luketomlinson: I am having the same issue with all of our hosts, and coincidentally it started at approximately the same time as @akhilp6.
Runner version: 2.319.1
OS: Ubuntu 24.04 LTS

Any updates on this issue?

ktaggart commented

@luketomlinson, how do I find the session ID in order to get this issue moving forward? Our runners have been down for too long and I'd like to help resolve this issue ASAP.

luketomlinson (Contributor) commented

Hey @ktaggart, typically it would be from the runner logs as a queryParameter.

We identified a bug on our side that was incorrectly deleting some runners. We've fixed this issue, but you'll need to re-register to get them back online.

Let me know if this still appears to be happening.
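For readers looking for the session ID in their logs, here is a minimal sketch that pulls a query parameter out of any URL found in log text. The parameter name `sessionId` is an assumption (the thread only says the value appears as a query parameter), and the sample URL is made up for illustration.

```python
import re
from urllib.parse import urlparse, parse_qs

# Matches any http(s) URL up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_session_ids(log_text, param="sessionId"):
    """Collect distinct values of `param` from every URL found in log_text.

    The default parameter name is an assumption, not confirmed by the thread.
    """
    found = []
    for url in URL_PATTERN.findall(log_text):
        for value in parse_qs(urlparse(url).query).get(param, []):
            if value not in found:
                found.append(value)
    return found


# Hypothetical log line: the host, path, and ID are invented for this example.
sample = "GET https://example.invalid/_apis/messages?sessionId=1234-abcd&lastMessageId=7 finished"
print(extract_session_ids(sample))
```

If the logs use a different parameter name, pass it through the `param` argument.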

ktaggart commented Oct 16, 2024

> Hey @ktaggart, typically it would be from the runner logs as a queryParameter.
>
> We identified a bug on our side that was incorrectly deleting some runners. We've fixed this issue, but you'll need to re-register to get them back online.
>
> Let me know if this still appears to be happening.

@luketomlinson, thanks for the response. I am pretty new to this and inherited our cluster. Can you point me to the registration docs?

luketomlinson (Contributor) commented

akhilp6 (Author) commented Oct 16, 2024

@luketomlinson Is there a new runner version that we need to use to re-register them?

ktaggart commented

@luketomlinson

> Sure thing @ktaggart. https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners#adding-a-self-hosted-runner-to-a-repository.

I followed the instructions, created all the runners, added our labels, but they now sit idle and do not accept new jobs:
"Waiting for a runner to pick up this job..."

I compared the new .runner file to the old one and noted some rather stark differences:
old_runner

{
  "agentId": 642353,
  "agentName": "group.node_1",
  "poolId": 458,
  "poolName": "group",
  "serverUrl": "https://pipelines.actions.githubusercontent.com/<string>",
  "gitHubUrl": "https://github.com/enterprises/corp",
  "workFolder": "_work"
}

new_runner

{
  "agentId": 1,
  "agentName": "group.node_1",
  "poolId": 1,
  "poolName": "Default",
  "serverUrl": "https://pipelinesghubeus10.actions.githubusercontent.com/<string>",
  "gitHubUrl": "https://github.com/corp-repos/repo",
  "workFolder": "_work"
}

When I attempted to use the poolName "group" I received an error indicating that the poolName "group" was not found, and I could only use "Default."

Also, the old_runner gitHubUrl is completely invalid - I have no idea how that could have worked previously, but it did. When I tried using that url, I received a "404 not found" error, which made sense.

I am wondering what the significance of poolId and agentId is.

Any suggestions on how I can get my runner working/accepting jobs?
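The comparison of the two .runner files can also be done mechanically. This is a minimal sketch using the values quoted above (with the `<string>` placeholders left as they appear); the helper name `diff_settings` is illustrative.

```python
import json

# Contents copied from the old and new .runner files shown above.
old_runner = json.loads("""
{
  "agentId": 642353,
  "agentName": "group.node_1",
  "poolId": 458,
  "poolName": "group",
  "serverUrl": "https://pipelines.actions.githubusercontent.com/<string>",
  "gitHubUrl": "https://github.com/enterprises/corp",
  "workFolder": "_work"
}
""")
new_runner = json.loads("""
{
  "agentId": 1,
  "agentName": "group.node_1",
  "poolId": 1,
  "poolName": "Default",
  "serverUrl": "https://pipelinesghubeus10.actions.githubusercontent.com/<string>",
  "gitHubUrl": "https://github.com/corp-repos/repo",
  "workFolder": "_work"
}
""")


def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


changes = diff_settings(old_runner, new_runner)
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

Only `agentName` and `workFolder` are unchanged; every field tied to the registration itself differs between the two files.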

ktaggart commented Oct 21, 2024

@luketomlinson, some more info.

After examining the runner log, I see the following:

Loading RSA key parameters from file /github_runner/runners/node_1/corp/001/.credentials_rsaparams
POST request to https://vstoken.actions.githubusercontent.com/_apis/oauth2/token/<token> failed. HTTP Status: BadRequest

I am not sure how the token that GitHub provided is bad, but the token shown above is not the same as the token in the .credentials file. Both the .credentials and .credentials_rsaparams files were created on the same day at the same time, but they are somehow out of sync.

ETA:
It appears that I was incorrect about the creation dates of the various .credentials_rsaparams files:

Oct 17 13:05 /github_runner/runners/node_1/corp/001/.credentials
Oct 17 13:03 /github_runner/runners/node_1/corp/001/.credentials_rsaparams
Oct 17 13:07 /github_runner/runners/node_2/corp/001/.credentials
Oct 17 11:03 /github_runner/runners/node_2/corp/001/.credentials_rsaparams
Oct 17 13:10 /github_runner/runners/node_3/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_3/corp/001/.credentials_rsaparams
Oct 17 13:12 /github_runner/runners/node_4/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_4/corp/001/.credentials_rsaparams
Oct 17 13:14 /github_runner/runners/node_5/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_5/corp/001/.credentials_rsaparams
Oct 17 13:16 /github_runner/runners/node_6/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_6/corp/001/.credentials_rsaparams
Oct 17 13:18 /github_runner/runners/node_7/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_7/corp/001/.credentials_rsaparams
Oct 17 13:21 /github_runner/runners/node_8/corp/001/.credentials
Dec  5  2022 /github_runner/runners/node_8/corp/001/.credentials_rsaparams
Oct 17 13:23 /github_runner/runners/node_9/corp/001/.credentials
Nov 23  2023 /github_runner/runners/node_9/corp/001/.credentials_rsaparams
Oct 17 13:24 /github_runner/runners/node_10/corp/001/.credentials
Dec 30  2023 /github_runner/runners/node_10/corp/001/.credentials_rsaparams
Oct 17 13:26 /github_runner/runners/node_11/corp/001/.credentials
Dec 28  2023 /github_runner/runners/node_11/corp/001/.credentials_rsaparams
Oct 17 13:27 /github_runner/runners/node_12/corp/001/.credentials
Nov 23  2023 /github_runner/runners/node_12/corp/001/.credentials_rsaparams
Oct 17 13:29 /github_runner/runners/node_13/corp/001/.credentials
Dec 28  2023 /github_runner/runners/node_13/corp/001/.credentials_rsaparams

All of them were created before the corresponding .credentials file, by anywhere from minutes to years. I do not understand what is going on here, but I would have assumed that creating the new runners would clear out old or invalid credentials.
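The same check can be scripted across runner directories. The following is a minimal sketch, assuming each directory holds a `.credentials` and a `.credentials_rsaparams` file as in the listing above. It compares modification times (which is what a listing like the one above typically shows, rather than creation times), and the one-hour threshold is an arbitrary assumption. An old key file does not by itself prove the key is invalid.

```python
import os
import tempfile
from pathlib import Path


def stale_rsaparams(runner_dirs, max_gap_seconds=3600):
    """Return (directory, gap_in_seconds) where .credentials_rsaparams is
    older than .credentials by more than max_gap_seconds (assumed threshold)."""
    stale = []
    for directory in map(Path, runner_dirs):
        creds = directory / ".credentials"
        rsa = directory / ".credentials_rsaparams"
        if not (creds.exists() and rsa.exists()):
            continue  # skip directories that lack either file
        gap = creds.stat().st_mtime - rsa.stat().st_mtime
        if gap > max_gap_seconds:
            stale.append((str(directory), gap))
    return stale


# Demo on a throwaway directory with modification times set by hand.
with tempfile.TemporaryDirectory() as tmp:
    creds = Path(tmp) / ".credentials"
    rsa = Path(tmp) / ".credentials_rsaparams"
    creds.write_text("{}")
    rsa.write_text("{}")
    now = creds.stat().st_mtime
    os.utime(rsa, (now - 7200, now - 7200))  # pretend the key file is two hours older
    print(stale_rsaparams([tmp]))
```

With a one-hour threshold, a pair written two minutes apart (like node_1 above) would not be flagged, while pairs that are hours or years apart would be.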

Please comment.

luketomlinson (Contributor) commented

Hi @ktaggart,

Where are you trying to register the runner? Runners can be registered at the repo, org, or enterprise level. That is determined by the gitHubUrl you provide to the config script.

During that registration process, you choose a Runner Group (aka Pool) so PoolName == Runner Group Name.

Once runners are registered, you'll need to make sure that runner group has access to the repository in question if the runner is registered at the org or enterprise level.

Since you are re-registering runners, I would recommend deleting all of those .credentials files. They will be re-created when you re-register.
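To illustrate how the URL determines the registration level, here is a minimal sketch that classifies a `gitHubUrl` value by its path. The URL shapes (`/enterprises/<name>`, `/<org>`, `/<owner>/<repo>`) are assumed for github.com, and the function name `registration_level` is illustrative, not part of the runner.

```python
from urllib.parse import urlparse


def registration_level(github_url):
    """Guess the registration level from the URL path (assumed github.com shapes)."""
    parts = [p for p in urlparse(github_url).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "enterprises":
        return "enterprise"
    if len(parts) == 1:
        return "organization"
    if len(parts) == 2:
        return "repository"
    return "unknown"


# The first two URLs come from the .runner files earlier in this thread.
print(registration_level("https://github.com/enterprises/corp"))
print(registration_level("https://github.com/corp-repos/repo"))
# Hypothetical organization-level URL for comparison.
print(registration_level("https://github.com/corp-repos"))
```

Applied to the two .runner files posted earlier, the old URL reads as an enterprise-level registration and the new one as a repository-level registration.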

ktaggart commented Oct 21, 2024

@luketomlinson, thanks for the response.

I followed the instructions you linked, and used the gitHubUrl that was provided via that process, e.g.,
./config.sh --url https://github.com/corp/repo --token <TOKEN>

In the above example, the runner is being registered at the repo level, in our internal corp GitHub repo.

For poolName, I was only able to use 'Default' as when I tried using our previous group name, I received an error indicating that group name was not available, and I could only use 'Default'. I noted that in a previous message.

Q: Can I change that to our old group name manually by editing the .runner files, or will that cause an issue?

The runners do have access to the repo, as all the runners appear in the repo runners section and show as idle. After the config was completed, I received a 'Connected to GitHub' message.

As far as deleting the .credentials files, before running config.sh, I had to delete the existing .runner file in each runner directory in order to proceed, which resulted in a 'Removed .credentials' message while creating each new runner.

After that, a new .credentials file was in place, with some old .credentials_rsaparams.

Q: Are you suggesting that I delete the .runner, .credentials and .credentials_rsaparams files and reinstall each runner?

Please clarify.

Thx.

luketomlinson (Contributor) commented Oct 22, 2024

Hi @ktaggart,

From your previous comment, it appears the old runners were registered at the enterprise level with a different group.

To answer your question briefly, changing the .runner file group field will have no effect. It's just a representation of how the runner is registered.

If you are still having issues, I'd recommend going through our support channels, rather than a public issue.

cesarjorgemartinez commented

+1, same problem on macOS 14.7 with runner versions 2.319.1 and 2.320.0.
The clock is correct, the runner has not been touched in a long time, and it is used frequently every day.

Suddenly broken:

[2024-10-25 09:50:05Z INFO Terminal] WRITE LINE: 
[2024-10-25 09:50:05Z INFO Terminal] WRITE LINE: 
[2024-10-25 09:50:05Z INFO RSAFileKeyManager] Loading RSA key parameters from file .../.credentials_rsaparams
[2024-10-25 09:50:05Z INFO RSAFileKeyManager] Loading RSA key parameters from file .../.credentials_rsaparams
[2024-10-25 09:50:06Z ERR  GitHubActionsService] POST request to https://vstoken.actions.githubusercontent.com/_apis/oauth2/token/xxxxxx failed. HTTP Status: BadRequest, AFD Ref: Ref A: xxxxx Ref B: xxxxxxx Ref C: 2024-10-25T09:50:05Z
[2024-10-25 09:50:06Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2024-10-25 09:50:06Z ERR  MessageListener] Catch exception during create session.
[2024-10-25 09:50:06Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration was not found or is not medium trust. ClientType: 
   at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
   at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
   at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
   at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
   at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[2024-10-25 09:50:06Z ERR  MessageListener] Test oauth app registration.
[2024-10-25 09:50:06Z INFO RSAFileKeyManager] Loading RSA key parameters from file .../.credentials_rsaparams
[2024-10-25 09:50:07Z ERR  GitHubActionsService] POST request to https://vstoken.actions.githubusercontent.com/_apis/oauth2/token/xxx failed. HTTP Status: BadRequest, AFD Ref: Ref A: xxx Ref B: xxx Ref C: 2024-10-25T09:50:06Z
[2024-10-25 09:50:07Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[2024-10-25 09:50:07Z INFO MessageListener] Retriable exception: Registration was not found or is not medium trust. ClientType: 
[2024-10-25 09:50:07Z ERR  Terminal] WRITE ERROR: 2024-10-25 09:50:07Z: Runner connect error: Registration was not found or is not medium trust. ClientType: . Retrying until reconnected.
[2024-10-25 09:50:07Z INFO MessageListener] Sleeping for 30 seconds before retrying.
