@xmrk-btc
Last active March 30, 2024 08:54
xmrk node possible data loss notes
see 1st comment

xmrk-btc commented Feb 26, 2024

How it occurred

  1. A few days earlier I ran lncli --deletepayments. I had also started running liquid a week before the data loss, storing its blockchain on sda, pruned to 30GB.
  2. lncli stop
  3. installed updates; this took 30-60 minutes
  4. sync
  5. reboot (did not work, systemd became unresponsive)
  6. reboot -f
  7. lnd starts but takes much longer than usual. The lnd logs show rescans of the last 1000 blocks whenever a channel manager starts, the machine with bitcoind keeps reading from disk, and channels stay inactive or are being closed.
  8. stopped lnd after 15 minutes
  9. saw a penalty transaction, stopped lnd, ran zpool scrub, which reported no errors.
  10. started lnd again, ran it for 20 minutes, stopped it again

Environment

  • lnd 0.17.3 (64-bit) running on a Ryzen laptop with 32 GB RAM, referred to below as lnd-laptop.
  • channel.db was around 20GB, so most of it should be cached
  • three disks:
    • sda - internal SSD. Some sectors are unreadable. Stores / and /home using btrfs. Both were mounted without the discard option; this plus liquid probably killed the disk. The apt update mentioned above ran on sda.
    • sdb, sdc - external USB connected SSD disks, store lnd_data as ZFS RAID1, using zfs option sync=standard (never changed this)
  • Debian 12
  • zfs kernel module from bookworm-backports. Version zfs-2.1.12-0-g86783d7d9-dist running before the restart, upgraded to zfs-2.2.2-0-g494aaaed8-dist just before the fatal reboot.
  • lnd data encrypted by native ZFS encryption
  • bitcoind on another machine (call it bitcoin-laptop), 8 GB RAM laptop with 2 HDDs. (This machine was not restarted). Not pruned, with txindex. Bitcoind's chainstate directory on ZFS, using 4GB of lnd-laptop's memory as l2arc: lnd-laptop exposes 4GB ramdisk via iSCSI, and bitcoin-laptop connects and uses that iSCSI device as l2arc.
  • lnd-laptop has a damaged internal SSD (sda); this probably caused systemd to stop responding. I also got strange errors (SIGSEGV) from lnd when running it later for SCB recovery.
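
The note above that / and /home were mounted without the discard option can be checked with a small helper. This is a minimal sketch, not something used in the original setup: the function names has_discard and report_mount are made up for illustration, and it assumes findmnt from util-linux is available.

```shell
# has_discard OPTIONS
# Returns 0 if a comma-separated mount option string enables discard
# ("discard", or "discard=async" as btrfs reports it), 1 otherwise.
# "nodiscard" deliberately does not match.
has_discard() {
    local opt
    local IFS=,
    for opt in $1; do
        case "$opt" in
            discard | discard=*) return 0 ;;
        esac
    done
    return 1
}

# report_mount MOUNTPOINT
# Hypothetical wrapper: asks findmnt for the active mount options
# and prints whether discard is among them.
report_mount() {
    local opts
    opts=$(findmnt -n -o OPTIONS "$1") || return 2
    if has_discard "$opts"; then
        echo "$1: discard enabled"
    else
        echo "$1: no discard option"
    fi
}
```

Usage would be something like report_mount / and report_mount /home on lnd-laptop. A missing discard option does not by itself mean the disk never gets trimmed, since a periodic fstrim can do the same job.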

Misc

  • compacting channel.db with chantools reported no errors
  • channel.backup, located on the same ZFS filesystem, was up to date: it contained all opened channels, even those opened less than 1000 blocks ago. So this is not a filesystem-wide rollback.
  • I had a similar problem in September 2023 - the main problem was that lnd did not start, but I also suffered a smaller data loss. I was using the same 2 USB disks as today, with a different computer (a Raspberry Pi 4 then).
  • my channel with Blockstream Store probably did not suffer data loss - my LocalHtlcIndex (as seen by doing chantools dumpchannels) was the same as what peer reported when my node connected while doing recovery. I assume their node is always online, so it is strange there would be no update for a week.

Tests

  • ran on the same zfs pool that suffered the data loss, with compression turned off and a 500 MB test file
  • tested using diskchecker.pl, see https://brad.livejournal.com/2116715.html
  • diskchecker.pl on lnd-laptop with zfs kernel module ver. 2.2.2, did sync; reboot -f and verify was ok.
  • diskchecker.pl on old RPi 4 with 4GB RAM, zfs v2.1.5-1ubuntu6~22.04.2. Tried checking write cache while test was running (sdparm --get=WCE /dev/sd?), this caused some problem because sdparm froze and writing stopped. Did reboot -f (without even doing sync) shortly after and verify was ok.
  • repeated the same test on RPi but without sdparm, also ok
  • sdparm output (on the RPi, so the disks have different names):
# sdparm --get=WCE /dev/sda
   /dev/sda: Samsung   SSD 870 EVO 500G  0
WCE not found in Caching (SBC) mode page
# sdparm --get=WCE /dev/sdb
   /dev/sdb: 6iY  �b(HJ�C��O%�  0959
mode sense (10): transport: Host_status=0x03 [DID_TIME_OUT]
Driver_status=0x00 [DRIVER_OK]

WCE           1  [cha: y, def:  1]
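
The idea behind the diskchecker.pl tests above can be sketched in a few lines of shell. This is not diskchecker.pl itself, only a simplified illustration of the same principle: write records, force each one to disk, note it as acknowledged only afterwards, and after the forced reboot check that every acknowledged record is still intact. The function names and file layout are made up. It assumes GNU coreutils, where sync accepts file arguments, and it assumes the acknowledgement log lives on storage that is not affected by the crash (diskchecker.pl uses a second machine for this).

```shell
# record_body N
# Deterministic content for record N, so verification needs no stored copy.
record_body() {
    yes "record $1" | head -n 256
}

# write_records DATA_DIR ACK_LOG COUNT
# Writes COUNT records into DATA_DIR (on the pool under test). A record
# number is appended to ACK_LOG only after sync reports the file and its
# directory as flushed.
write_records() {
    local data_dir="$1" ack_log="$2" count="$3" i=0
    mkdir -p "$data_dir" || return 1
    while [ "$i" -lt "$count" ]; do
        record_body "$i" > "$data_dir/rec.$i" || return 1
        # GNU sync with file arguments flushes just these paths
        sync "$data_dir/rec.$i" "$data_dir" || return 1
        echo "$i" >> "$ack_log"
        i=$((i + 1))
    done
}

# verify_records DATA_DIR ACK_LOG
# Run after the crash/reboot: every acknowledged record must exist and
# match its expected content. Returns non-zero if any record was lost.
verify_records() {
    local data_dir="$1" ack_log="$2" i ok=0 lost=0
    while read -r i; do
        if record_body "$i" | cmp -s - "$data_dir/rec.$i"; then
            ok=$((ok + 1))
        else
            lost=$((lost + 1))
            echo "record $i was acknowledged but is missing or corrupt" >&2
        fi
    done < "$ack_log"
    echo "verified=$ok lost=$lost"
    [ "$lost" -eq 0 ]
}
```

A run would call write_records in a loop, do reboot -f while it is writing, then call verify_records with the same arguments after the pool is imported again. A passing result only shows that flushes were honored during that particular run; it does not rule out a disk or USB bridge that drops writes under other conditions.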
