AMD Radeon RX 7700 XT GPU Freezes on Ubuntu 25.10 with KDE Plasma (Wayland) - Diagnosis and Remediation

System Configuration

  • GPU: AMD Radeon RX 7700 XT / 7800 XT (Navi 32, gfx1101)
  • OS: Ubuntu 25.10
  • Kernel: 6.17.0-6-generic
  • Desktop Environment: KDE Plasma (Wayland)
  • Mesa: 25.2.3-1ubuntu1
  • Driver: amdgpu (in-kernel)

Important: GPU freezes began after upgrading to Ubuntu 25.10

Diagnosis Summary

Analysis of system logs reveals multiple issues causing GPU instability:

1. SMU Firmware/Driver Interface Mismatch

amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040
amdgpu: SMU driver if version not matched

Impact: A System Management Unit (SMU) interface version mismatch between the driver and the firmware is often logged without ill effect, but it can contribute to:

  • Power management instability
  • GPU frequency scaling issues
  • Thermal management problems
  • Random freezes/hangs
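
To check whether your own system logs the same mismatch, the two version fields can be pulled out of the kernel log and compared. The following is a minimal sketch, not part of the original diagnosis: `smu_if_check` is a hypothetical helper name, and it assumes the log line has the same format as the one quoted above.

```shell
# Hypothetical helper: reads kernel log text on stdin and reports whether the
# SMU driver and firmware interface versions agree.
smu_if_check() {
    # Keep only the first line that carries both version fields.
    line=$(grep 'smu driver if version' | head -n 1)
    if [ -z "$line" ]; then
        echo "no SMU version line found"
        return 0
    fi
    # Extract the hex value that follows each label.
    drv=$(printf '%s\n' "$line" | sed -n 's/.*smu driver if version = \(0x[0-9a-fA-F]*\).*/\1/p')
    fw=$(printf '%s\n' "$line" | sed -n 's/.*smu fw if version = \(0x[0-9a-fA-F]*\).*/\1/p')
    if [ "$drv" = "$fw" ]; then
        echo "match driver=$drv fw=$fw"
    else
        echo "mismatch driver=$drv fw=$fw"
    fi
}

# Usage on a live system:
#   sudo dmesg | smu_if_check
```

A "mismatch" result only confirms that the warning is present; it does not by itself prove the mismatch is the cause of a freeze.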

2. GPU Scheduler Errors

[drm:drm_sched_entity_push_job [gpu_sched]] *ERROR* Trying to push to a killed entity

Impact: GPU jobs are being terminated prematurely, indicating:

  • GPU hang/recovery cycles
  • Driver failing to properly queue work
  • Potential memory management issues

3. Display Manager Workqueue Issues

workqueue: dm_irq_work_func [amdgpu] hogged CPU for >10000us 5 times

Impact: Interrupt handling in the amdgpu display manager (the driver's display component, not the login manager) is taking too long, which can cause:

  • Screen freezes
  • Desktop compositor stuttering
  • System responsiveness issues

4. TLB Fence Work Backlog

Multiple amdgpu_tlb_fence_work entries in workqueue indicate memory management issues with the GPU's Translation Lookaside Buffer.
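
To see how often each of the four signatures above appears on your own system, a log can be scanned for the literal strings. This is a rough sketch: `count_amdgpu_signatures` is a hypothetical helper name, and the patterns are copied from the log excerpts in this section, so messages worded differently on other kernels will not be counted.

```shell
# Hypothetical helper: reads log text on stdin and prints, for each signature
# discussed above, the number of lines that contain it.
count_amdgpu_signatures() {
    log=$(cat)
    for pattern in \
        'SMU driver if version not matched' \
        'Trying to push to a killed entity' \
        'dm_irq_work_func' \
        'amdgpu_tlb_fence_work'
    do
        # -F matches the pattern literally; -c prints the matching line count.
        # "|| true" keeps a zero-match result from being treated as a failure.
        n=$(printf '%s\n' "$log" | grep -c -F -- "$pattern" || true)
        printf '%s: %s\n' "$pattern" "$n"
    done
}

# Usage on a live system (current boot only):
#   sudo journalctl -k -b | count_amdgpu_signatures
```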

Root Causes

Primary Cause: Immature Kernel 6.17 Support

Ubuntu 25.10 uses kernel 6.17.0-6, which is a very recent kernel series. On the affected system, RDNA3 (Navi 32) support in this kernel version shows stability issues:

  • GPU scheduler regressions
  • SMU firmware compatibility issues
  • Memory management bugs

Contributing Factors

  1. Wayland Compositor Issues: KDE Plasma on Wayland with a bleeding-edge amdgpu stack can trigger compositor hangs
  2. Mesa 25.2.3: A very recent Mesa release that may contain regressions not yet fixed
  3. Firmware Mismatch: SMU firmware interface version incompatibility

Remediation Steps

Solution 1: Downgrade to Stable Kernel (Recommended)

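The most effective solution is to boot an older, better-tested kernel, provided one is available from your configured repositories:

# List the kernel image packages that apt can see.
# Ubuntu package names include an ABI number, so a bare name such as
# linux-image-6.11.0-generic will not resolve.
apt list 'linux-image-*-generic'

# Install a 6.11 kernel, or 6.8 (the Ubuntu 24.04 LTS kernel), replacing
# <version> with an exact version from the list above
sudo apt install linux-image-<version>-generic linux-headers-<version>-generic

# Reboot and select the older kernel under "Advanced options for Ubuntu" in the GRUB menu
sudo reboot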

After rebooting, verify you're on the stable kernel:

uname -r
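
If you want to script this check, the release string can be compared against the problematic series. The sketch below is one way to do it: `kernel_older_than` is a hypothetical helper name, the 6.17.0 threshold is an assumption taken from the kernel listed under System Configuration, and `sort -V` is the GNU coreutils version sort.

```shell
# Hypothetical helper: succeeds if kernel release $1 is strictly older than
# version $2 (for example: kernel_older_than "$(uname -r)" 6.17.0).
kernel_older_than() {
    # Strip the ABI/flavour suffix: 6.17.0-6-generic -> 6.17.0
    running=${1%%-*}
    limit=$2
    # Equal versions are not "older".
    [ "$running" = "$limit" ] && return 1
    # Version-sort the two values; the older one comes out first.
    lowest=$(printf '%s\n%s\n' "$running" "$limit" | sort -V | head -n 1)
    [ "$lowest" = "$running" ]
}

# Usage on a live system:
#   if kernel_older_than "$(uname -r)" 6.17.0; then
#       echo "running a pre-6.17 kernel"
#   else
#       echo "still on 6.17 or newer"
#   fi
```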

If stable, you can remove the problematic 6.17 kernel:

sudo apt remove linux-image-6.17.0-6-generic linux-headers-6.17.0-6-generic

Solution 2: Add amdgpu Kernel Parameters

Add stability-improving kernel parameters to /etc/default/grub:

sudo nano /etc/default/grub

Modify the GRUB_CMDLINE_LINUX_DEFAULT line to include:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.ppfeaturemask=0xffffffff amdgpu.gpu_recovery=1 amdgpu.runpm=0"

Parameter explanations:

  • amdgpu.ppfeaturemask=0xffffffff - Enables all PowerPlay features, including overdrive controls that are off by default (may help with SMU issues)
  • amdgpu.gpu_recovery=1 - Enables GPU hang detection and recovery
  • amdgpu.runpm=0 - Disables runtime power management (reduces SMU-related hangs)

Update GRUB and reboot:

sudo update-grub
sudo reboot
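
After the reboot, it is worth confirming that the parameters actually reached the kernel, since a typo in /etc/default/grub or a skipped update-grub fails silently. A minimal sketch follows: `check_amdgpu_params` is a hypothetical helper name, and the parameter list is copied from the GRUB line above.

```shell
# Hypothetical helper: takes a kernel command line as its first argument and
# reports which of the parameters from this section are present.
check_amdgpu_params() {
    cmdline=$1
    for want in amdgpu.ppfeaturemask=0xffffffff amdgpu.gpu_recovery=1 amdgpu.runpm=0; do
        # Pad with spaces so only whole, space-delimited parameters match.
        case " $cmdline " in
            *" $want "*) echo "present: $want" ;;
            *)           echo "missing: $want" ;;
        esac
    done
}

# Usage on a live system:
#   check_amdgpu_params "$(cat /proc/cmdline)"
```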

Solution 3: Update Firmware

Ensure you have the latest AMD GPU firmware:

sudo apt update
sudo apt install --reinstall linux-firmware
sudo update-initramfs -u
sudo reboot

Solution 4: Switch to X11 Session (Temporary Workaround)

If Wayland is contributing to freezes:

  1. Log out of KDE Plasma
  2. At the login screen (SDDM), click the session selector (usually bottom-left)
  3. Select "Plasma (X11)" instead of "Plasma (Wayland)"
  4. Log in and test stability
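
To confirm which session type you ended up in after logging in, the `XDG_SESSION_TYPE` environment variable can be inspected. The sketch below wraps that check in a small function; `describe_session` is a hypothetical helper name.

```shell
# Hypothetical helper: turns a session type string into a readable label.
describe_session() {
    case "$1" in
        wayland) echo "Wayland session" ;;
        x11)     echo "X11 session" ;;
        *)       echo "unknown session type: ${1:-unset}" ;;
    esac
}

# Usage inside the desktop session:
#   describe_session "$XDG_SESSION_TYPE"
```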

Solution 5: Disable GPU Power Management Features

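If dynamic power management (DPM) clock switching is causing issues, create a udev rule that pins the GPU to its highest performance level. Expect higher idle power draw and temperatures while the rule is active: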

sudo nano /etc/udev/rules.d/99-amdgpu-power.rules

Add:

KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_force_performance_level}="high"

Reload udev:

sudo udevadm control --reload-rules
sudo udevadm trigger
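
To check that the rule took effect, read the same sysfs attribute the rule writes to. This is a sketch under two assumptions: `show_perf_level` is a hypothetical helper name, and the default path assumes the GPU is `card0`, as in the rule above. On some systems the card is enumerated as `card1`, in which case both the rule and the path need adjusting.

```shell
# Hypothetical helper: prints the DPM performance level stored in a sysfs file.
# Defaults to card0 to mirror the udev rule; pass another path if needed.
show_perf_level() {
    file=${1:-/sys/class/drm/card0/device/power_dpm_force_performance_level}
    if [ -r "$file" ]; then
        printf 'performance level: %s\n' "$(cat "$file")"
    else
        printf 'cannot read %s\n' "$file"
        return 1
    fi
}

# Usage on a live system:
#   show_perf_level
#   show_perf_level /sys/class/drm/card1/device/power_dpm_force_performance_level
```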

Solution 6: Monitor for Kernel/Mesa Updates

Ubuntu 25.10 is still receiving updates. Monitor for:

  • Kernel updates that fix RDNA3 issues
  • Mesa updates with amdgpu stability fixes
  • Firmware updates

Check for updates regularly:

sudo apt update && sudo apt upgrade

Verification Steps

After applying fixes, verify stability:

  1. Check kernel messages for errors:

    sudo dmesg | grep -i amdgpu | grep -i error
  2. Monitor GPU status (rocm-smi ships with the ROCm tools and is not installed by default):

    rocm-smi
  3. Check for scheduler errors:

    sudo journalctl -b | grep "killed entity"
  4. Stress test GPU (optional):

    # Run a GPU task to test stability (glxgears is a light load and needs the mesa-utils package)
    glxgears -fullscreen
    # Or use a benchmark tool

Expected Outcome

  • Best case: Downgrading to kernel 6.11 or 6.8 should eliminate freezes
  • Good case: Kernel parameters + firmware update improve stability significantly
  • Acceptable: X11 session provides stable environment until Wayland issues are resolved

Long-term Recommendations

  1. Stay on LTS kernels for production systems requiring stability
  2. Test bleeding-edge kernels in VMs or on non-critical systems first
  3. Monitor AMD GPU mailing lists for known RDNA3 issues
  4. Consider filing a bug report with Ubuntu (Launchpad) or upstream (the drm/amd issue tracker on gitlab.freedesktop.org) if issues persist on stable kernels

Note: This diagnosis was generated using Claude Code based on actual system logs from an affected system. GPU freezes began immediately after upgrading to Ubuntu 25.10. While the information provided here is based on real diagnostic data and established troubleshooting practices, users should validate these remediation steps in their specific environment and maintain proper backups before making system changes.

If you experience similar issues, please test the solutions in order (kernel downgrade first) and report your results to help the community.
