Panic on WaitAllPodVolumesProcessed in velero 1.15.2 #8657

Open
Gui13 opened this issue Jan 27, 2025 · 1 comment
Comments

@Gui13

Gui13 commented Jan 27, 2025

What steps did you take and what happened:

During a scheduled backup, velero rebooted due to a panic waiting for all podVolumeBackups to be finished.

What did you expect to happen:

No panic ;-)

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Here are the logs:

2025-01-27T04:01:42+01:00    {"time":"2025-01-27T03:01:42.51626203Z","_p":"F","log":"time=\"2025-01-27T03:01:42Z\" level=info msg=\"auth with the storage account access key\" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync logSource=\"/go/pkg/mod/github.com/vmware-tanzu/[email protected]/pkg/util/azure/storage.go:101\" pluginName=velero-plugin-for-microsoft-azure"}
2025-01-27T04:01:42+01:00    {"time":"2025-01-27T03:01:42.591304615Z","_p":"F","log":"time=\"2025-01-27T03:01:42Z\" level=info msg=\"plugin process exited\" backupLocation=velero/default cmd=/plugins/velero-plugin-for-microsoft-azure controller=backup-sync id=26381 logSource=\"pkg/plugin/clientmgmt/process/logrus_adapter.go:80\" plugin=/plugins/velero-plugin-for-microsoft-azure"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.658807814Z","_p":"F","log":"panic: sync: WaitGroup is reused before previous Wait has returned"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.658841385Z","_p":"F","log":""}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.658845486Z","_p":"F","log":"goroutine 180044 [running]:"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.659631212Z","_p":"F","log":"sync.(*WaitGroup).Wait(0xc0012a0fd0?)"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.659640691Z","_p":"F","log":"\t/usr/local/go/src/sync/waitgroup.go:118 +0x74"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.659644046Z","_p":"F","log":"github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).WaitAllPodVolumesProcessed.func2()"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.659647509Z","_p":"F","log":"\t/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:332 +0x53"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.6596526Z","_p":"F","log":"created by github.com/vmware-tanzu/velero/pkg/podvolume.(*backupper).WaitAllPodVolumesProcessed in goroutine 484"}
2025-01-27T04:01:47+01:00    {"time":"2025-01-27T03:01:47.659663401Z","_p":"F","log":"\t/go/src/github.com/vmware-tanzu/velero/pkg/podvolume/backupper.go:330 +0x113"}
2025-01-27T04:01:48+01:00    {"time":"2025-01-27T03:01:48.28284558Z","_p":"F","log":"time=\"2025-01-27T03:01:48Z\" level=info msg=\"setting log-level to INFO\" logSource=\"pkg/cmd/server/server.go:110\""}
2025-01-27T04:01:48+01:00    {"time":"2025-01-27T03:01:48.282876929Z","_p":"F","log":"time=\"2025-01-27T03:01:48Z\" level=info msg=\"Starting Velero server v1.15.1 (32499fc287815058802c1bc46ef620799cca7392-dirty)\" logSource=\"pkg/cmd/server/server.go:112\""}
2025-01-27T04:01:48+01:00    {"time":"2025-01-27T03:01:48.282880497Z","_p":"F","log":"time=\"2025-01-27T03:01:48Z\" level=info msg=\"1 feature flags enabled [EnableCSI]\" logSource=\"pkg/cmd/server/server.go:114\""}
2025-01-27T04:01:48+01:00    {"time":"2025-01-27T03:01:48.323417205Z","_p":"F","log":"time=\"2025-01-27T03:01:48Z\" level=info msg=\"plugin process exited\" cmd=/velero id=16 logSource=\"pkg/plugin/clientmgmt/process/logrus_adapter.go:80\" plugin=/velero"}

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.15.2
  • Velero features (use velero client config get features): DataMover is used
  • Kubernetes version (use kubectl version): 1.30
  • Kubernetes installer & version: Azure AKS
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Gui13 changed the title to Panic on WaitAllPodVolumesProcessed in velero 1.15.2 on Jan 27, 2025
@Lyndon-Li
Contributor

Lyndon-Li commented Jan 27, 2025

First of all, the running release is 1.15.1, not 1.15.2; see the logs:
level=info msg=\"Starting Velero server v1.15.1

Secondly, at first glance, I don't think this problem is specific to 1.15.1 or 1.15.2. In other words, it may be a long-standing problem that also occurs in previous releases.
The b.wg logic in WaitAllPodVolumesProcessed is unchanged in 1.15.

We need further checks, but I guess the cause of the problem is here:

		volumeBackup := newPodVolumeBackup(backup, pod, volume, repoIdentifier, b.uploaderType, pvc)
		if err := veleroclient.CreateRetryGenerateName(b.crClient, b.ctx, volumeBackup); err != nil {
			errs = append(errs, err)
			continue
		}
		b.wg.Add(1)

The PVB is created by CreateRetryGenerateName, and only after that is b.wg.Add called.
If the PVB succeeds or fails immediately, b.wg.Done may be called before b.wg.Add.

@Gui13 Please let us know under what circumstances you saw this panic, and whether it is reproducible in your environment. Please also share the velero log bundle from when the problem happened.

cc @reasonerjt @ywk253100
