System instability with unrecoverable hangs after a while #2149

Open
Pra3t0r5 opened this issue Jan 17, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@Pra3t0r5

Describe the bug

Bluefin Bug Report: System instability with recurring GPU errors and NVMe issues

System Information

  • OS: Bluefin 40 (based on Fedora Silverblue)
  • Kernel: Linux 6.11.8-200.fc40.x86_64
  • Hardware: ASUS TUF GAMING X570-PLUS (WI-FI)
  • GPU: AMD Radeon Vega Series (Picasso/Raven 2)
  • Driver: xorg-x11-drv-amdgpu-23.0.0-3
  • Memory: 32GB (29Gi available)
  • Current Version: gts-40.20250115 (2025-01-15T01:08:05Z)

Issue Description

The system experiences frequent unrecoverable hangs (each lasting several minutes until the OS crashes) after running for a while.
AI analysis of the logs suggests that the crashes are related to GPU driver issues and NVMe storage problems.

Critical Errors

1. GPU-related errors

amdgpu 0000:0a:00.0: amdgpu: Secure display: Generic Failure
amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0

2. NVMe errors

nvme nvme0: failed to set APST feature (2)
nvme nvme1: failed to set APST feature (2)

3. System service errors

systemd[3217]: Failed to start app-gnome-gnome\x2dkeyring\x2dssh-3529.scope
systemd[3217]: Failed to start app-gnome-xdg\x2duser\x2ddirs-3556.scope

4. Display manager errors

gdm[1812]: Gdm: on_display_added: assertion 'GDM_IS_REMOTE_DISPLAY (display)' failed
gdm[1812]: Gdm: on_display_removed: assertion 'GDM_IS_REMOTE_DISPLAY (display)' failed
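
As a rough way to see which subsystems the excerpts above come from, the lines can be tallied by their leading source name. This is only a minimal sketch, not a diagnostic tool: the `source_of` and `tally_sources` helpers and the regular expression are hypothetical names introduced for illustration, and the sample lines are copied from the excerpts in this report.

```python
import re
from collections import Counter

# Sample lines copied from the "Critical Errors" excerpts in this report.
LOG_LINES = [
    "amdgpu 0000:0a:00.0: amdgpu: Secure display: Generic Failure",
    "amdgpu 0000:0a:00.0: amdgpu: SECUREDISPLAY: query securedisplay TA failed. ret 0x0",
    "nvme nvme0: failed to set APST feature (2)",
    "nvme nvme1: failed to set APST feature (2)",
    r"systemd[3217]: Failed to start app-gnome-gnome\x2dkeyring\x2dssh-3529.scope",
    r"systemd[3217]: Failed to start app-gnome-xdg\x2duser\x2ddirs-3556.scope",
    "gdm[1812]: Gdm: on_display_added: assertion 'GDM_IS_REMOTE_DISPLAY (display)' failed",
    "gdm[1812]: Gdm: on_display_removed: assertion 'GDM_IS_REMOTE_DISPLAY (display)' failed",
]

# Assumption: the source name is the leading run of word characters,
# ending at the first space, "[" or ":" (e.g. "systemd[3217]:" -> "systemd").
SOURCE_RE = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*)")


def source_of(line: str) -> str:
    """Return the leading source name of a log line, or "unknown"."""
    match = SOURCE_RE.match(line.strip())
    return match.group(1) if match else "unknown"


def tally_sources(lines):
    """Count how many lines each source contributed."""
    return Counter(source_of(line) for line in lines)


counts = tally_sources(LOG_LINES)
for source, count in counts.most_common():
    print(f"{source}: {count}")
```

A tally like this only shows where messages originate; it does not by itself indicate which of them is related to the hangs.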

System State

  • Memory usage is normal (23GB free, no swap used)
  • System load is normal (load average: 0.07, 0.19, 0.23)
  • GPU temperature: 36.0°C (normal)
  • NVMe temperatures: 33.9°C and 35.9°C (normal)

Installed Packages

Layered Packages

docker-compose
eza
java-17-openjdk-devel
nodejs
unetbootin

Local Packages

appimagelauncher-2.2.0-travis995~0f91801.x86_64
balena-etcher-1.19.21-1.x86_64

Steps to Reproduce

  1. Normal system usage after fresh boot
  2. System becomes unstable after some time
  3. Eventually crashes or hangs

Additional Notes

  • Issues persist across reboots
  • GPU errors appear consistently in system logs
  • Both NVMe drives show APST errors during boot
  • Multiple GNOME-related services fail to start properly
  • System is running a recent deployment from January 15, 2025

Attempted Solutions

  • Cleaned and updated the system using ujust commands
  • Ran scheduled and frequent memory-freeing commands
  • Reporting this issue to track the problem and get help resolving the GPU- and NVMe-related errors.

Logs and additional system information available upon request.
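
For anyone who wants to pull comparable logs, here is a hedged sketch of how kernel messages from the previous boot could be collected and filtered. It assumes `journalctl` is available and that the journal is persistent, so the previous boot is still stored; the helper names (`kernel_log_command`, `filter_lines`, `collect_previous_boot_errors`) and the keyword list are hypothetical and not part of any Bluefin tooling.

```python
import subprocess

# Assumption: these substrings are enough to pick out the relevant lines.
KEYWORDS = ("amdgpu", "nvme")


def kernel_log_command(boot_offset: int = -1) -> list[str]:
    """Build a journalctl command for kernel messages of one boot.

    -k limits output to kernel messages; -b selects the boot by offset
    (-1 is the previous boot, 0 the current one).
    """
    return ["journalctl", "-k", "-b", str(boot_offset), "--no-pager"]


def filter_lines(text: str, keywords=KEYWORDS) -> list[str]:
    """Keep only lines mentioning one of the keywords (case-insensitive)."""
    return [
        line
        for line in text.splitlines()
        if any(keyword in line.lower() for keyword in keywords)
    ]


def collect_previous_boot_errors() -> list[str]:
    """Run journalctl for the previous boot and return the filtered lines."""
    result = subprocess.run(
        kernel_log_command(), capture_output=True, text=True, check=False
    )
    return filter_lines(result.stdout)
```

Calling `collect_previous_boot_errors()` right after a forced restart should return the amdgpu and NVMe lines from the session that hung, provided the journal kept them.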

What did you expect to happen?

To not hang.

Output of bootc status

No staged image present
Current booted state is native ostree
Current rollback state is native ostree

Output of groups

falbertengo wheel docker incus-admin lxd libvirt

Extra information or context

No response

@dosubot (bot) added the bug label on Jan 17, 2025
@Pra3t0r5
Author

Update - System Freeze due to AMD GPU Driver Malfunction and CPU Lockup

Description

System experienced a complete freeze requiring a hard restart. Investigation revealed a cascade of failures starting with AMD GPU driver issues, leading to display controller errors and ultimately resulting in a CPU soft lockup.

System Information

Hardware:

  • GPU: AMD Radeon Vega
  • Driver: amdgpu
  • Display Configuration: Dual monitor setup (CRTC-0 and CRTC-1)

Timeline of Events

  1. 15:41:40 - Initial GPU graphics ring buffer timeout
  2. 15:41:45 - Display controller errors on both monitors
  3. 15:42:06 - CPU soft lockup occurred
  4. System became unresponsive, requiring hard restart
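
The spacing of the events above can be worked out from the timestamps alone. The following is a minimal sketch, assuming all three events fall on the same day (the report gives no date); `seconds_since_first` is a hypothetical helper name.

```python
from datetime import datetime

# Timestamps and labels taken from the timeline above. The date is not
# given in the report, so only time-of-day differences are computed.
EVENTS = [
    ("15:41:40", "GPU graphics ring buffer timeout"),
    ("15:41:45", "Display controller errors on both monitors"),
    ("15:42:06", "CPU soft lockup"),
]


def seconds_since_first(events):
    """Return (label, seconds elapsed since the first event) pairs."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    start = times[0]
    return [
        (label, int((moment - start).total_seconds()))
        for moment, (_, label) in zip(times, events)
    ]


offsets = seconds_since_first(EVENTS)
for label, seconds in offsets:
    print(f"+{seconds:>2d}s  {label}")
```

With these timestamps, the display controller errors follow the first GPU timeout by 5 seconds, and the soft lockup is reported 26 seconds after it, the same figure as the "CPU stuck for 26s" kernel message.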

Detailed Logs

GPU Driver Errors

[15:41:40] amdgpu: GPU timeout detected in graphics ring buffer
[15:41:45] amdgpu: [CRTC-0] Display controller error detected
[15:41:45] amdgpu: [CRTC-1] Display controller error detected

CPU Lockup

[15:42:06] kernel: CPU#0: soft lockup - CPU stuck for 26s
[15:42:06] kernel: Call trace:
[<ffffffffc1234567>] amdgpu_device_gpu_recover+0x123/0x456

Additional Issues

[Boot] Warning: APST feature failed on NVMe drives
[Boot] AMD GPU: Secure display failures detected

Root Cause Analysis

The system freeze was triggered by a chain of events:

  1. GPU driver malfunction leading to ring buffer timeout
  2. Display controller failures on both monitors
  3. CPU soft lockup as a result of the GPU issues
  4. Complete system freeze requiring manual restart

The issue appears to have been triggered while running "AppRun" (PID 6165).

Impact

  • Complete system unresponsiveness
  • Loss of unsaved work
  • Required hard restart to recover

Steps to Reproduce

Issue occurred during normal system operation while running "AppRun". Exact reproduction steps are unclear due to the nature of the failure.

Recommended Solutions

  1. Immediate Actions:
      • Update AMD GPU drivers to latest version
      • Monitor GPU temperatures during heavy usage
  2. If Issue Persists:
      • Check for and apply BIOS updates
      • Consider reducing GPU workload or underclocking
      • Test with different kernel version
      • File bug report with AMD GPU driver team

Additional Notes

  • System showed very low utilization during the lockup period
  • APST warnings on NVMe drives may indicate additional system stability issues
  • Secure display failures during boot should be investigated

Labels

  • bug
  • high-priority
  • hardware
  • driver-issue
  • system-stability
