Errors cause the graphics output to lock up. The mouse pointer still moves, but keyboard input is dead. libvirt guests keep running and SSH access still works, but the machine can't be shut down cleanly over SSH. Recovery requires a hard power-off by holding the power button.
syslog looks like:
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249393] radeon 0000:03:00.0: ring 0 stalled for more than 10248msec
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249395] radeon 0000:03:00.0: ring 3 stalled for more than 10248msec
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249398] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000007de00 last fence id 0x000000000007df67 on ring 3)
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249402] radeon 0000:03:00.0: GPU lockup (current fence id 0x0000000000035dda last fence id 0x0000000000035e12 on ring 0)
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761408] radeon 0000:03:00.0: ring 0 stalled for more than 10760msec
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761410] radeon 0000:03:00.0: ring 3 stalled for more than 10760msec
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761412] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000007de00 last fence id 0x000000000007df67 on ring 3)
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761415] radeon 0000:03:00.0: GPU lockup (current fence id 0x0000000000035dda last fence id 0x0000000000035e12 on ring 0)
Dec 19 08:10:49 kaim-eeyore kernel: [88493.448697] radeon 0000:03:00.0: Saved 4196 dwords of commands on ring 0.
Dec 19 08:10:49 kaim-eeyore kernel: [88493.448865] radeon 0000:03:00.0: GPU softreset: 0x0000004C
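To compare lockups across boots, lines like the ones above can be boiled down to a per-ring summary. The following is only a rough sketch: the regular expressions are modeled on the messages quoted here, the function name summarize_lockups is made up for illustration, and reading the gap between "current" and "last" fence id as the amount of submitted work that never completed is my interpretation, not something the log states.

```python
import re

# Patterns modeled on the radeon messages quoted in this note (assumption:
# other kernel versions may word these messages differently).
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize_lockups(lines):
    """Return {ring: {"max_stall_ms": int, "fence_gap": int or None}}."""
    summary = {}
    for line in lines:
        stall = STALL_RE.search(line)
        if stall:
            ring, msec = int(stall.group(1)), int(stall.group(2))
            entry = summary.setdefault(ring, {"max_stall_ms": 0, "fence_gap": None})
            # Keep the longest stall reported for this ring.
            entry["max_stall_ms"] = max(entry["max_stall_ms"], msec)
            continue
        lockup = LOCKUP_RE.search(line)
        if lockup:
            current = int(lockup.group(1), 16)
            last = int(lockup.group(2), 16)
            ring = int(lockup.group(3))
            entry = summary.setdefault(ring, {"max_stall_ms": 0, "fence_gap": None})
            # Difference between the last and the current fence id.
            entry["fence_gap"] = last - current
    return summary


if __name__ == "__main__":
    sample = [
        "radeon 0000:03:00.0: ring 0 stalled for more than 10760msec",
        "radeon 0000:03:00.0: GPU lockup (current fence id 0x0000000000035dda "
        "last fence id 0x0000000000035e12 on ring 0)",
    ]
    for ring, info in sorted(summarize_lockups(sample).items()):
        print(ring, info)
```

Feeding it the full syslog (for example by iterating over an open file) gives one entry per ring that reported a stall or a lockup.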
This happens with both the radeon and amdgpu drivers. It seems to happen sooner when using Chromium, regardless of whether hardware acceleration is enabled.
My current kernel command line attempts to fix this by disabling DPM:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on radeon.si_support=0 amdgpu.si_support=1 amdgpu.dc=1 amdgpu.dpm=0 isolcpus=10,11,22,23"
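After a reboot it is worth confirming that the running kernel actually received these options. A minimal sketch, assuming the usual /proc/cmdline interface. The helper names parse_cmdline, dpm_disabled and read_running_cmdline are hypothetical, and the check only looks at what was passed on the command line, not at what the driver ended up doing with it.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def dpm_disabled(params):
    """True if amdgpu.dpm=0 was passed on the command line."""
    return params.get("amdgpu.dpm") == "0"


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    # The options from GRUB_CMDLINE_LINUX_DEFAULT in this note; on the real
    # machine, use read_running_cmdline() instead of this literal string.
    wanted = ("intel_iommu=on radeon.si_support=0 amdgpu.si_support=1 "
              "amdgpu.dc=1 amdgpu.dpm=0 isolcpus=10,11,22,23")
    params = parse_cmdline(wanted)
    print("DPM disabled on command line:", dpm_disabled(params))
```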
Some people complain that disabling DPM results in very high fan speeds and a lot of noise. I don't notice any difference on my setup: an AMD FirePro W5000 with dual DisplayPort outputs inside a Dell Precision T5810 with a Xeon E5-2687W v4, plus a USB-powered 6-inch fan on top to push hot air out from under my desk.
The current theory is that with DPM (dynamic power management) enabled, there are power state transitions that the Linux drivers don't handle correctly, and these cause a lockup internal to the card. Hopefully this will be fixed by new firmware or by updates to the Linux driver stack.
Disabling DPM seems to solve one lockup issue but not all of them. I still see lockups due to DMA failures like:
Hopefully Linux 4.19 or 4.20 will help, but the current Debian testing 4.19 kernel can't boot on my system. Its initramfs doesn't seem to be created properly: it doesn't enumerate USB devices as early, and it cannot find my RAID or cryptsetup config to locate the root file system. I haven't debugged why this is happening.