@Toliak
Last active May 1, 2024 09:15
Huawei Matebook D16 RLEF-X NVMe Laptop Disk is read-only after sleep problem research and fix

I bought a Huawei Matebook D16 RLEF-X laptop, installed Arch Linux, and ran into the following problem:

  1. Suspend the system (sudo systemctl suspend)
  2. Close the laptop lid
  3. Wait a few minutes
  4. Open the lid again
  5. Run mount and see that the disk is read-only. More precisely, it is reported as read-only but is actually unusable: no zsh history, no executables, and no way to power off other than a long press of the power button.
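The check in step 5 can be scripted. The following is a minimal sketch, assuming the kernel still exposes a readable /proc/mounts after resume; the function name is_mounted_ro and its optional second argument are hypothetical helpers introduced for illustration only.

```shell
#!/bin/sh
# is_mounted_ro MOUNTPOINT [MOUNTS_FILE]
# Exit status 0 if MOUNTPOINT is listed with the "ro" option,
# 1 if it is mounted without "ro", 2 if it is not listed at all.
# MOUNTS_FILE defaults to /proc/mounts; the parameter only exists so the
# function can be tried on a saved copy of the mount table.
is_mounted_ro() (
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    # Fields in /proc/mounts: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp { opts = $4; found = 1 }   # remember the last matching entry
        END {
            if (!found) exit 2
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1
        }
    ' "$mounts_file"
)
```

For example, `is_mounted_ro / && echo "root is read-only"` after waking the laptop. Note that this only reports the mount flag; as described above, the disk may be unusable regardless of what the flag says.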

Abridged inxi output:

System:
  Host: archlinux Kernel: 6.4.2-arch1-1-linux arch: x86_64 bits: 64
    Desktop: i3 v: 4.22 Distro: Arch Linux
Machine:
  Type: Laptop System: HUAWEI product: RLEF-XX v: M1010
    serial: <superuser required>
  Mobo: HUAWEI model: RLEF-XX-PCB v: M1010 serial: <superuser required>
    UEFI: HUAWEI v: 1.26 date: 01/30/2023
....
Drives:
  Local Storage: total: 476.94 GiB used: 63.92 GiB (13.4%)
  ID-1: /dev/nvme0n1 model: PCIe-8 SSD 512GB size: 476.94 GiB

Research part

My first step was to search for something like "huawei matebook disk read-only after suspend". To my surprise, I found this thread, created in 2023, that mentions exactly the case described above. The recommendation in the thread: switch the IOMMU to soft mode.

Unfortunately, booting with the kernel option iommu=soft did not change anything. (I did not regenerate grub.cfg; I just edited the command line in the GRUB menu for that boot.)

I searched further and found threads about a similar issue on an ASUS laptop and on an IdeaPad. The first one I skipped too quickly (I would arrive at the same idea a bit later). The fix in the second one seems to work, but it is a bit... expensive. I did not expect to disassemble the laptop and replace the NVMe drive after a single day of use :)

Another idea, from here, is to add the kernel option acpiphp.disable=1. Sadly, this did not help either.

Along the way, I found mentions of a conflict between Wi-Fi and NVMe and of turning off the TPM, but playing with the BIOS settings changed nothing.

My next step was to run more tests and capture more information about the situation than just "the disk is broken after suspend". I booted the Arch Linux ISO from a USB drive, so that the running OS did not depend on the NVMe. Then I mounted one of the NVMe partitions (the Linux root partition), suspended the laptop, and repeated the steps described at the top.

After waking from this "laptop anabiosis", the live USB OS worked fine, but the mounted disk was read-only and broken. I checked dmesg and found the root of the problem:

nvme 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
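To look for the same message on another machine, the kernel log can be filtered for it. This is a small sketch, assuming the message wording matches the line quoted above; d3cold_failed_devices is a hypothetical helper name, and it only prints the PCI addresses of the affected devices.

```shell
#!/bin/sh
# d3cold_failed_devices: read kernel log text on stdin and print the PCI
# address of every device that reported a failed D3cold -> D0 transition.
d3cold_failed_devices() {
    sed -nE "s/^.* ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): can't change power state from D3cold to D0.*/\1/p" |
        sort -u
}

# Usage (reading dmesg may require root on some systems):
#   sudo dmesg | d3cold_failed_devices
```

The printed address is the same one that appears in the sysfs path /sys/bus/pci/devices/<address>/.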

Honorable mention: after "breaking" the mounted NVMe from the live USB OS, the laptop's BIOS no longer found GRUB. This was solved by regenerating the GRUB config with grub-mkconfig.

Well, now my monkey-googling query could be made more specific. I was firmly convinced that this newly detailed problem was well known and surely already resolved. In fact, I just sank deeper into the swamp of Linux problems.

The Google results I found can be divided into two categories:

  • "The graveyard" of 2018-* threads with the same or extremely similar problem
  • Email dumps of conversations about kernel changes (or something else... you know, those sites that are just plain text with incomprehensible context and obscure pieces of C code)

The typical outcome, or last message, of a graveyard thread looks like:

  • Oh, I will switch to Windows
  • Just changed the NVMe and now it works
  • I have tried YYY and it did not help. Any more ideas? (*message created 4 years ago*)
  • Yet another kernel parameter that does not work

After digging through the graves, I read, purely by coincidence, one of the second-category results: it described a kernel patch that disables D3cold for a specified PCI device. Luckily, the patch was not complicated, so I kept the idea of doing something similar for later.

My last resort (short of the kernel patch) was to change /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed from 1 to 0. As I expected, the attempt failed (the cause is described below).
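For anyone who wants to repeat this attempt, here is a minimal sketch. The sysfs path is the one from this laptop; set_d3cold_allowed is a hypothetical helper name, and its optional second argument only exists so the function can be tried on an ordinary file. On this machine the change did not fix the problem.

```shell
#!/bin/sh
# set_d3cold_allowed VALUE [ATTR_FILE]
# Write 0 or 1 to the d3cold_allowed attribute and show the value before
# and after. Writing to sysfs needs a root shell.
set_d3cold_allowed() (
    value=$1
    attr=${2:-/sys/bus/pci/devices/0000:01:00.0/d3cold_allowed}
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; exit 2 ;;
    esac
    printf 'before: %s\n' "$(cat "$attr")"
    printf '%s\n' "$value" > "$attr" || exit 1
    printf 'after:  %s\n' "$(cat "$attr")"
)

# Usage, from a root shell:
#   set_d3cold_allowed 0
```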

No more resorts, no suggestions. The only way is to patch the kernel.

Related links

Briefly summarized links:

The solution part

  1. Find the PCI Vendor and PCI Class of the NVMe
lspci -vvvvvnn

...
01:00.0 Non-Volatile memory controller [0108]: Silicon Motion, Inc. Device [126f:1001] (rev 03) (prog-if 02 [NVM Express])
...

126f is VendorID, 0108 is ClassID.
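The two IDs can also be pulled out of the lspci output automatically. This is a sketch that assumes the `lspci -nn` line format shown above; pci_ids is a hypothetical helper name.

```shell
#!/bin/sh
# pci_ids: read `lspci -nn` output on stdin and print, for each device line,
# the slot, the class ID, the vendor ID and the device ID.
pci_ids() {
    sed -nE 's/^([0-9a-f:.]+) .*\[([0-9a-f]{4})\]: .*\[([0-9a-f]{4}):([0-9a-f]{4})\].*/\1 class=\2 vendor=\3 device=\4/p'
}

# Usage: list only NVMe controllers (class 0108)
#   lspci -nn | pci_ids | grep 'class=0108'
```

For the line quoted above this prints `01:00.0 class=0108 vendor=126f device=1001`.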

  2. Set up the kernel build system. I am using Arch Linux (btw), and the guide on the Arch Linux Wiki has exhaustive information about the setup.

One exception is section 2.1, "Avoid creating the doc": the patch provided there does not apply to the 6.4.2 kernel, so I removed make _htmldocs and "$pkgbase-docs" manually. I also modified the _make function to add -j$(nproc) to the make command.

  3. Optional step, which I used just to check the build system: build the kernel without any source changes. Start makepkg -s and leave the laptop for a while (approximately 30 minutes to 1 hour).

  4. Insert a fixup declaration that disables D3 (and therefore D3cold) for the specified device, somewhere near the DECLARE_PCI_FIXUP_CLASS_EARLY entries for the legacy ATA devices.

DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID, 0x0108, 8, quirk_no_ata_d3);
// PCI_VENDOR_ID_SILICON_POWER is my own define, equal to 0x126f (lspci reports this vendor as Silicon Motion, Inc.)

Optionally, I added a few debug prints.

My full patch diff looks like:

diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/pci.c src.new/archlinux-linux/drivers/pci/pci.c
--- src/archlinux-linux/drivers/pci/pci.c       2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/pci.c   2023-07-09 18:06:52.939961065 +0300
@@ -1445,6 +1445,7 @@
         * This device is quirked not to be put into D3, so don't put it in
         * D3
         */
+       pci_info(dev, "dev->dev_flags %x\n", (unsigned int)dev->dev_flags);
        if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
                return 0;
 
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/drivers/pci/quirks.c src.new/archlinux-linux/drivers/pci/quirks.c
--- src/archlinux-linux/drivers/pci/quirks.c    2023-07-09 18:07:45.873293132 +0300
+++ src.new/archlinux-linux/drivers/pci/quirks.c        2023-07-09 18:06:52.939961065 +0300
@@ -1340,6 +1340,7 @@
 /* Some ATA devices break if put into D3 */
 static void quirk_no_ata_d3(struct pci_dev *pdev)
 {
+       pci_info(pdev, "quirk_no_ata_d3 called\n");
        pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
 }
 /* Quirk the legacy ATA devices only. The AHCI ones are ok */
@@ -1355,6 +1356,10 @@
 DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
                                PCI_CLASS_STORAGE_IDE, 8, quirk_no_ata_d3);
 
+/* Do not suspend NVMe */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_SILICON_POWER, PCI_ANY_ID,
+                               0x0108, 8, quirk_no_ata_d3);
+
 /*
  * This was originally an Alpha-specific thing, but it really fits here.
  * The i82375 PCI/EISA bridge appears as non-classified. Fix that.
diff --color --unified --recursive '--exclude=.git' --text src/archlinux-linux/include/linux/pci_ids.h src.new/archlinux-linux/include/linux/pci_ids.h
--- src/archlinux-linux/include/linux/pci_ids.h 2023-07-09 18:07:45.883293132 +0300
+++ src.new/archlinux-linux/include/linux/pci_ids.h     2023-07-09 18:07:05.963294086 +0300
@@ -3120,4 +3120,6 @@
 
 #define PCI_VENDOR_ID_NCUBE            0x10ff
 
+#define PCI_VENDOR_ID_SILICON_POWER            0x126f
+
 #endif /* _LINUX_PCI_IDS_H */
  5. Compile the kernel (if you completed step 3, the compilation will be faster) and install it.
  6. Regenerate grub.cfg, reboot the laptop, and check dmesg. You should see messages about the quirk.
sudo dmesg | grep quirk
[    0.337952] pci 0000:01:00.0: quirk_no_ata_d3 called
[    1.939159] nvme 0000:01:00.0: platform quirk: setting simple suspend

The first one is my debug message, the second one already exists in Linux.

  7. Test suspend as described at the top.

Meanwhile: why does d3cold_allowed not work?

The quirk_no_ata_d3 function sets pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3, whereas the sysfs attribute d3cold_allowed modifies the dev->d3cold_allowed field.

d3cold_allowed is used in the pci_dev_check_d3cold function, which in turn is used only by the bridge update function pci_bridge_d3_update.

PCI_DEV_FLAGS_NO_D3, however, is checked in the pci_set_power_state function. That function has no d3cold_allowed check (or at least I cannot see one); hence, changing d3cold_allowed in sysfs is useless for the problem described here.
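This reasoning can be re-checked against any kernel tree by listing where each symbol is used. The sketch below assumes it is run from the root of the kernel sources and that grep supports --include (GNU grep does); where_used is a hypothetical helper name.

```shell
#!/bin/sh
# where_used SYMBOL [SRC_DIR]
# Print file:line:text for every whole-word occurrence of SYMBOL in the
# C sources and headers under SRC_DIR (default: drivers/pci).
where_used() (
    symbol=$1
    src=${2:-drivers/pci}
    grep -rnw --include='*.c' --include='*.h' -e "$symbol" "$src"
)

# Usage, from the kernel source root:
#   where_used d3cold_allowed
#   where_used PCI_DEV_FLAGS_NO_D3
```

Comparing the two listings shows which code paths consult the sysfs field and which consult the quirk flag; the result may differ between kernel versions.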

Conclusion

This solution seems to be the only way to fix the problem. Its largest caveat is that every kernel update via pacman becomes a recompilation headache.

I hope this post helps someone finally fix this annoying NVMe issue. Having run into such problems, I am sincerely glad to realize that the share of laptops running desktop Linux is still not below zero.


Update 2023.09.11

I've just found here that the problem can be fixed using the kernel parameter nvme_core.default_ps_max_latency_us=0.
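To make the parameter permanent with GRUB, it has to be added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The following is a sketch only: it assumes that the line exists and uses double quotes, and that the parameter contains no characters special to sed (such as | or &). add_kernel_param is a hypothetical helper name.

```shell
#!/bin/sh
# add_kernel_param PARAM [GRUB_DEFAULTS_FILE]
# Append PARAM to GRUB_CMDLINE_LINUX_DEFAULT unless it is already there.
# A backup of the original file is kept next to it with a .bak suffix.
add_kernel_param() (
    param=$1
    file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on that line.
    if grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        exit 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i.bak -E "s|^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"|\1 ${param}\"|" "$file"
    # Report failure if the line was not in the expected form.
    grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"
)

# Usage, from a root shell:
#   add_kernel_param nvme_core.default_ps_max_latency_us=0
```

Afterwards regenerate the config (on Arch Linux: `grub-mkconfig -o /boot/grub/grub.cfg`; on Debian-based systems: `update-grub`), reboot, and confirm with `cat /proc/cmdline` that the parameter is present.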

@Toliak
Author

Toliak commented Nov 23, 2023

@munashige
Hm... could you provide your cat /etc/os-release and uname -a?
I believe I have a live USB with Ubuntu, and I would like to check it this week. Feel free to mention me again if I have not provided any additional information by the end of the week.

I have tested the kernel and the arguments on Arch Linux.

@munashige

munashige commented Nov 23, 2023

Sure, here are the results from Debian 12.

Fun fact: suspend and even hibernation works well in live usb mode for debian/ubuntu-based distros and fedora. But once you install the system itself, all nasty symptoms will kick in.

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

@Toliak
Author

Toliak commented Nov 23, 2023

@munashige
Yes, as mentioned here, the live USB itself will work correctly. The problems appear with a manually mounted partition.
So, steps are:

  1. Boot live-usb
  2. Mount an NVMe partition somewhere
  3. Try to read/write the mounted partition -- should be ok
  4. Suspend/Wakeup
  5. Try to read/write the mounted partition again -- should fail
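Steps 3 and 5 can be done with a small probe. This is a rough sketch: can_write is a hypothetical helper name, and the mount point /mnt in the usage line is only an example.

```shell
#!/bin/sh
# can_write DIR
# Try to create, read back and remove a small file under DIR.
# Prints the result and returns non-zero if any part fails.
can_write() (
    dir=$1
    probe="$dir/.rw-probe.$$"
    if { printf 'probe\n' > "$probe"; } 2>/dev/null &&
       [ "$(cat "$probe" 2>/dev/null)" = "probe" ]; then
        rm -f "$probe"
        echo "read/write OK: $dir"
    else
        echo "read/write FAILED: $dir"
        exit 1
    fi
)

# Usage on the live USB, with the NVMe partition mounted on /mnt:
#   can_write /mnt          # before suspend
#   systemctl suspend
#   can_write /mnt          # after wakeup
```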

@munashige

munashige commented Dec 20, 2023

Alright, I have finally fixed the issue. Here is what helped me and can help users that do not use arch-based distros.
The issue was with linux kernel 5.x that caused suspension issues, battery drains, and even occasional system slowdowns (regardless of the provided solutions above).

  1. Installed Linux Mint Edge edition with kernel 6.2
  2. Edited GRUB_CMDLINE_LINUX_DEFAULT of /etc/default/grub.
  3. Pasted just this nvme_core.default_ps_max_latency_us=0
  4. Updated grub with sudo update-grub and rebooted.

Thus, the experience so far is great, linux now feels even snappier than windows. I must add that linux is quite power hungry on d16, so using auto-cpufreq is a must.
Thanks for this gist and also thanks everyone for the provided help 👍

@strange-dv

You can't even imagine how grateful I am to you. Thank you!
