xmrk-btc/gist:fee0a090a97888bbda8ae33ccf30ce61

Last active March 30, 2024 08:54

xmrk node possible data loss notes

gistfile1.txt

see 1st comment

Author

xmrk-btc commented Feb 26, 2024 (edited)

How it occurred

A few days earlier I ran lncli --deletepayments. I had also started running liquid a week before the data loss, storing its blockchain on sda, pruned to 30 GB.
lncli stop
installed updates, which took 30-60 minutes
sync
reboot (did not work, systemd became unresponsive)
reboot -f
lnd started, but startup took much longer than usual. The lnd logs showed rescans of the last 1000 blocks when starting any channel manager, the machine running bitcoind kept reading from disk, and channels stayed inactive or were being closed.
stopped lnd after 15 minutes
saw a penalty transaction, stopped lnd, ran zpool scrub, which reported no errors
started lnd again, ran it for 20 minutes, then stopped it again

Environment

lnd 0.17.3 (64-bit) running on a Ryzen laptop with 32 GB RAM; I will refer to it as lnd-laptop.
channel.db was around 20 GB, so most of it should have been cached
three disks:
- sda - internal SSD. Some sectors are unreadable. Stores / and /home using btrfs. Both were mounted without the discard option; this plus liquid probably killed the disk. The apt update mentioned above was on sda.
- sdb, sdc - external USB-connected SSDs, storing lnd_data as ZFS RAID1 with the zfs option sync=standard (never changed)
Debian 12
zfs kernel module from bookworm-backports. Version zfs-2.1.12-0-g86783d7d9-dist running before the restart, upgraded to zfs-2.2.2-0-g494aaaed8-dist just before the fatal reboot.
lnd data encrypted by native ZFS encryption
bitcoind on another machine (call it bitcoin-laptop), 8 GB RAM laptop with 2 HDDs. (This machine was not restarted). Not pruned, with txindex. Bitcoind's chainstate directory on ZFS, using 4GB of lnd-laptop's memory as l2arc: lnd-laptop exposes 4GB ramdisk via iSCSI, and bitcoin-laptop connects and uses that iSCSI device as l2arc.
lnd-laptop has a damaged internal SSD (sda), which probably caused systemd to stop responding. I also got strange errors (SIGSEGV) from lnd when running it later for SCB recovery.

Misc

compacting channel.db with chantools - no error
channel.backup was up to date: it contained all opened channels, even those opened less than 1000 blocks ago, and it is located on the same ZFS filesystem. So this was not a filesystem-wide rollback.
I had a similar problem in September 2023. The main problem then was that lnd did not start, but I also suffered a smaller data loss. I was using the same 2 USB disks as today, with a different computer (a Raspberry Pi 4).
my channel with Blockstream Store probably did not suffer data loss: my LocalHtlcIndex (as shown by chantools dumpchannels) matched what the peer reported when my node connected during recovery. I assume their node is always online, so it is strange that there would be no update for a week.

Tests

tests ran on the same zfs pool that suffered the data loss, with only compression turned off, using a 500 MB test file
tested using diskchecker.pl, see https://brad.livejournal.com/2116715.html
diskchecker.pl on lnd-laptop with zfs kernel module version 2.2.2: did sync; reboot -f, and verify was ok.
diskchecker.pl on the old RPi 4 with 4 GB RAM, zfs v2.1.5-1ubuntu6~22.04.2. I tried checking the write cache while the test was running (sdparm --get=WCE /dev/sd?), which caused a problem because sdparm froze and writing stopped. Did reboot -f (without even doing sync) shortly after, and verify was ok.
repeated the same test on the RPi without sdparm, also ok
sdparm returns the following (on the RPi, so the disks are renamed)

# sdparm --get=WCE /dev/sda
   /dev/sda: Samsung   SSD 870 EVO 500G  0
WCE not found in Caching (SBC) mode page
# sdparm --get=WCE /dev/sdb
   /dev/sdb: 6iY  �b(HJ�C��O%�  0959
mode sense (10): transport: Host_status=0x03 [DID_TIME_OUT]
Driver_status=0x00 [DRIVER_OK]

WCE           1  [cha: y, def:  1]
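
For readers unfamiliar with the diskchecker.pl tests listed above, the idea is roughly: write a block, fsync it, record that block as acknowledged only once fsync has returned, crash the machine, then check that every acknowledged block is still intact. The following is a minimal Python sketch of that pattern, not diskchecker.pl itself. The names write_blocks, verify_blocks, make_block and BLOCK_SIZE are hypothetical, and the acknowledgement log here is a local file for simplicity. For a meaningful test the log would have to live on storage that is not under test, for example on another machine.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch


def make_block(index):
    """Deterministic payload for block `index`, so verify can recompute it."""
    seed = hashlib.sha256(str(index).encode()).digest()  # 32 bytes
    return (seed * (BLOCK_SIZE // len(seed)))[:BLOCK_SIZE]


def write_blocks(data_path, log_path, count):
    """Write `count` blocks, fsync each one, then log it as acknowledged."""
    with open(data_path, "wb") as data, open(log_path, "a") as log:
        for i in range(count):
            data.write(make_block(i))
            data.flush()
            os.fsync(data.fileno())  # ask the OS to push the block to stable storage
            # Record the block as durable only after fsync has returned.
            # Assumption: in a real test this log is NOT on the disk under test.
            log.write(f"{i}\n")
            log.flush()
            os.fsync(log.fileno())


def verify_blocks(data_path, log_path):
    """Return indices of acknowledged blocks that are missing or corrupt."""
    with open(log_path) as log:
        acked = [int(line) for line in log if line.strip()]
    bad = []
    with open(data_path, "rb") as data:
        for i in acked:
            data.seek(i * BLOCK_SIZE)
            if data.read(BLOCK_SIZE) != make_block(i):
                bad.append(i)
    return bad
```

In this sketch, one would run write_blocks on the pool under test, force a reboot while it is writing, and run verify_blocks afterwards. An empty result means every block that was acknowledged after fsync survived the crash. A non-empty result would suggest that some layer (filesystem, USB bridge, or drive write cache) acknowledged a write it had not made durable.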
