@bradfa
Last active January 8, 2019 17:15

GPU errors cause the graphics output to lock up. The mouse pointer still moves but the keyboard is dead. libvirt guests keep running and SSH access still works, but the machine can't be shut down cleanly over SSH. Recovery requires a hard power-off by holding the power button.

syslog looks like:

Dec 19 08:10:48 kaim-eeyore kernel: [88492.249393] radeon 0000:03:00.0: ring 0 stalled for more than 10248msec
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249395] radeon 0000:03:00.0: ring 3 stalled for more than 10248msec
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249398] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000007de00 last fence id 0x000000000007df67 on ring 3)
Dec 19 08:10:48 kaim-eeyore kernel: [88492.249402] radeon 0000:03:00.0: GPU lockup (current fence id 0x0000000000035dda last fence id 0x0000000000035e12 on ring 0)
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761408] radeon 0000:03:00.0: ring 0 stalled for more than 10760msec
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761410] radeon 0000:03:00.0: ring 3 stalled for more than 10760msec
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761412] radeon 0000:03:00.0: GPU lockup (current fence id 0x000000000007de00 last fence id 0x000000000007df67 on ring 3)
Dec 19 08:10:49 kaim-eeyore kernel: [88492.761415] radeon 0000:03:00.0: GPU lockup (current fence id 0x0000000000035dda last fence id 0x0000000000035e12 on ring 0)
Dec 19 08:10:49 kaim-eeyore kernel: [88493.448697] radeon 0000:03:00.0: Saved 4196 dwords of commands on ring 0.
Dec 19 08:10:49 kaim-eeyore kernel: [88493.448865] radeon 0000:03:00.0: GPU softreset: 0x0000004C
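
To compare lockups across incidents, the "GPU lockup" lines above can be reduced to a ring number and a fence gap. This is a minimal sketch, not a driver-endorsed tool: the names `LOCKUP_RE` and `parse_lockups` are made up for illustration, and it assumes that "current fence id" is the last fence the GPU completed and "last fence id" is the last one the driver submitted, so that their difference is the number of fences still outstanding.

```python
import re

# Matches the radeon "GPU lockup" message format seen in the syslog above.
LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id 0x(?P<current>[0-9a-fA-F]+) "
    r"last fence id 0x(?P<last>[0-9a-fA-F]+) on ring (?P<ring>\d+)\)"
)


def parse_lockups(lines):
    """Return (ring, current, last, outstanding) for each lockup line."""
    results = []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m is None:
            continue  # not a lockup line, e.g. a "ring N stalled" message
        current = int(m.group("current"), 16)
        last = int(m.group("last"), 16)
        # Assumption: last - current = fences submitted but not yet completed.
        results.append((int(m.group("ring")), current, last, last - current))
    return results


# Two lines copied from the syslog excerpt above.
sample = [
    "Dec 19 08:10:48 kaim-eeyore kernel: [88492.249398] radeon 0000:03:00.0: "
    "GPU lockup (current fence id 0x000000000007de00 "
    "last fence id 0x000000000007df67 on ring 3)",
    "Dec 19 08:10:48 kaim-eeyore kernel: [88492.249402] radeon 0000:03:00.0: "
    "GPU lockup (current fence id 0x0000000000035dda "
    "last fence id 0x0000000000035e12 on ring 0)",
]

for ring, current, last, outstanding in parse_lockups(sample):
    print(f"ring {ring}: {outstanding} fences outstanding")
```

The current fence id does not advance between the 08:10:48 and 08:10:49 messages on either ring, which is consistent with the rings being fully stalled rather than just slow.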

Happens with both the radeon and amdgpu drivers. Seems to happen faster when using Chromium, regardless of whether hardware acceleration is enabled.

Current kernel command line, which attempts to fix this by disabling DPM:

GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on radeon.si_support=0 amdgpu.si_support=1 amdgpu.dc=1 amdgpu.dpm=0 isolcpus=10,11,22,23"
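
After a reboot it is worth confirming that the running kernel actually received these parameters. The following is a rough sketch under simple assumptions: `parse_cmdline` is a hypothetical helper, it splits on whitespace only (so quoted values containing spaces are not handled), and the sample string is the parameter list from the GRUB line above rather than output from a real boot.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' (bare flags) map to None.
    Assumption: no quoted values containing whitespace.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Parameter list copied from GRUB_CMDLINE_LINUX_DEFAULT above.
sample = (
    "intel_iommu=on radeon.si_support=0 amdgpu.si_support=1 "
    "amdgpu.dc=1 amdgpu.dpm=0 isolcpus=10,11,22,23"
)

params = parse_cmdline(sample)
print("amdgpu handles SI cards:", params.get("amdgpu.si_support") == "1")
print("amdgpu DPM disabled:", params.get("amdgpu.dpm") == "0")
```

On the machine itself, the same function can be fed the contents of `/proc/cmdline` instead of the sample string.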

Some people complain that disabling DPM results in very high fan speeds and lots of noise. I don't notice any difference with my setup: an AMD FirePro W5000 with dual DisplayPort outputs inside a Dell Precision T5810 with a Xeon E5-2687Wv4, plus a USB-powered 6 inch fan on top to push hot air out from under my desk.

Current theory is that with DPM (dynamic power management) enabled, there are power state transitions which the Linux drivers aren't handling correctly, and these cause a lockup internal to the card. Hopefully this will be fixed with new firmware or with updates to the Linux driver stack.

bradfa commented Jan 3, 2019

Disabling DPM seems to solve one lockup issue but not all of them. I still see lockups due to DMA failures like:

Jan  3 13:07:49 kaim-eeyore kernel: [106020.575998] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=115231, last emitted seq=115233
Jan  3 13:07:49 kaim-eeyore kernel: [106020.576002] [drm] GPU recovery disabled.

Hopefully Linux 4.19 or 4.20 will help, but the current Debian testing 4.19 kernel can't boot on my system. Its initramfs doesn't seem to have been created properly: it doesn't enumerate USB devices as early, and it can't find my RAID or cryptsetup config to locate the root file system. I haven't debugged why this is happening.

bradfa commented Jan 7, 2019

Pro tip: if you have an initrd which doesn't work and was built using an automated tool, don't go and rebuild all of your other initrds using the same tool as you'll likely end up crying ;)

Regarding graphics things, I'm giving up now on my AMD card. I've switched back to using nouveau on my Quadro NVS 310; although it's not perfect, at least it doesn't crash randomly all the time.

bradfa commented Jan 8, 2019

So even running Debian stable with Linux 4.9 and older AMD firmware, using the radeon driver (not amdgpu), the crashes still happen. I think this might actually be a hardware bug. My home APU system does not have stability issues like this, although that's not a direct comparison.
My W5000 card was bought used, so maybe it suffered some kind of ESD event or similar issue? Dunno.
