dm-crypt + dm-integrity + dm-raid = awesome!
#!/usr/bin/env bash
#
# Author: Markus (MawKKe) ekkwam@gmail.com
# Date: 2018-03-19
#
#
# What?
#
# Linux dm-crypt + dm-integrity + dm-raid (RAID1)
#
# = Secure, redundant array with data integrity protection
#
# Why?
#
# You see, RAID1 is a dead-simple tool for disk redundancy,
# but it does NOT protect you from bit rot. There is no way
# for RAID1 to distinguish which drive has the correct data if rot occurs.
# This is a silent killer.
#
# But with dm-integrity, you can now have error detection
# at the block level. However, it alone does not provide error correction,
# and is pretty useless with just one disk (disks fail, shit happens).
#
# But if you use dm-integrity *below* RAID1, you get disk redundancy
# AND error checking AND error correction. Invalid data received from
# a drive causes a checksum (read) error, which the RAID array notices
# and repairs using the intact copy from the other drive.
#
# If you throw encryption into the mix, you'll have a secure,
# redundant array. Oh, and the data integrity can be protected with
# authenticated encryption, so no one can tamper with your data maliciously.
#
# How cool is that?
#
# Also: If you use RAID1 arrays as LVM physical volumes, the overall
# architecture is quite similar to ZFS! All with native Linux tools,
# and no hacky Solaris compatibility layers or licensing issues!
#
# (I guess you can use whatever RAID level you want, but RAID1 is the
# simplest and fastest to set up)
#
#
# Let's try it out!
#
# ---
# NOTE: The dm-integrity target is available since Linux kernel version 4.12.
# NOTE: This example requires LUKS2, which was only recently released (2018-03)
# NOTE: The authenticated encryption is still experimental (2018-03)
# ---
set -eux
# 1) Make dummy disks
cd /tmp
truncate -s 500M disk1.img
truncate -s 500M disk2.img
# Format the disks with luksFormat:
dd if=/dev/urandom of=key.bin bs=512 count=1
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk1.img key.bin
cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 disk2.img key.bin
# The luksFormat calls might take a while, since --integrity causes the disks to be wiped.
# dm-integrity is usually configured with 'integritysetup' (see below), but as
# it happens, cryptsetup can do all the integrity configuration automatically if
# the --integrity flag is specified.
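# If you want to double-check what was set up, you can inspect the LUKS2
# header; it should list the integrity algorithm for the data segment:
#
# $ cryptsetup luksDump disk1.img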
# Open/attach the encrypted disks
cryptsetup luksOpen disk1.img disk1luks --key-file key.bin
cryptsetup luksOpen disk2.img disk2luks --key-file key.bin
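# You can inspect the resulting device stack if you like; cryptsetup sets up
# a separate dm-integrity mapping underneath each dm-crypt device:
#
# $ dmsetup ls --tree
# $ cryptsetup status disk1luks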
# Create raid1:
mdadm \
    --create \
    --verbose --level 1 \
    --metadata=1.2 \
    --raid-devices=2 \
    /dev/md/mdtest \
    /dev/mapper/disk1luks \
    /dev/mapper/disk2luks
# Create a filesystem, add to LVM volume group, etc...
mkfs.ext4 /dev/md/mdtest
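# Or, instead of formatting the array directly, use it as an LVM physical
# volume (the volume group / LV names below are just examples):
#
# $ pvcreate /dev/md/mdtest
# $ vgcreate testvg /dev/md/mdtest
# $ lvcreate -n testlv -l 100%FREE testvg
# $ mkfs.ext4 /dev/testvg/testlv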
# Cool! Now you can 'scrub' the raid setup, which verifies
# the contents of each drive. With plain RAID1 a detected mismatch would be
# a problem (there is no way to tell which copy is right), but since we are
# now using dm-integrity, the raid1 *knows* which drive has the correct data
# and is able to fix it automatically.
#
# To scrub the array:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
#
# $ dmesg | tail -n 30
#
# You should see
#
# [957578.661711] md: data-check of RAID array md127
# [957586.932826] md: md127: data-check done.
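#
# For good measure you can also peek at md's mismatch counter afterwards
# (it should read 0 on a healthy array):
#
# $ cat /sys/block/md127/md/mismatch_cnt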
#
#
# Let's simulate disk corruption:
#
# $ dd if=/dev/urandom of=disk2.img seek=30000 count=30 bs=1k conv=notrunc
#
# (this writes 30 KiB of random data into disk2.img at an offset of about 30 MB,
# i.e. into the encrypted data area, well past the LUKS header)
#
#
# Run scrub again:
#
# $ echo check > /sys/block/md127/md/sync_action
#
# ... wait a while
#
# $ dmesg | tail -n 30
#
# Now you should see
# ...
# [959146.618086] md: data-check of RAID array md127
# [959146.962543] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39784
# [959146.963086] device-mapper: crypt: INTEGRITY AEAD ERROR, sector 39840
# [959154.932650] md: md127: data-check done.
#
# But now if you run scrub yet again:
# ...
# [959212.329473] md: data-check of RAID array md127
# [959220.566150] md: md127: data-check done.
#
# And since we didn't get any errors a second time, we can deduce that the invalid
# data was repaired automatically.
#
# Great! We are done.
#
# --------
#
# If you don't need encryption, then you can use 'integritysetup' instead of cryptsetup.
# It works in a similar fashion:
#
# $ integritysetup format --integrity sha256 disk1.img
# $ integritysetup format --integrity sha256 disk2.img
# $ integritysetup open --integrity sha256 disk1.img disk1int
# $ integritysetup open --integrity sha256 disk2.img disk2int
# $ mdadm --create ...
#
# ...and so on. You can still detect and repair disk errors, but you have no
# protection against malicious cold-storage attacks, and the data is readable by anybody.
#
# 2018-03 NOTE:
#
# If you override the default --integrity value (whatever it is) during formatting,
# then you must specify it again when opening, as in the example above. For some
# reason the algorithm is not autodetected. I guess the algorithm is not written
# into an on-disk header like it is with LUKS?
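#
# (You can check what the dm-integrity superblock does record with
# 'integritysetup dump disk1.img'; as far as I can tell it stores things like
# the tag size and sector size, but not the hash algorithm itself, which would
# explain why --integrity has to be repeated on open.)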
#
# ----------
#
# Read more:
# https://fosdem.org/2018/schedule/event/cryptsetup/
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMCrypt
# https://gitlab.com/cryptsetup/cryptsetup/wikis/DMIntegrity
# https://mirrors.edge.kernel.org/pub/linux/utils/cryptsetup/v2.0/v2.0.0-rc0-ReleaseNotes
@rbn920 rbn920 commented May 20, 2019

Thank you, just the example I was looking for. Out of curiosity, do you have any idea if there would be issues combining this approach with bcache? I.e., creating a bcache device (in writeback mode) using the raid you set up as a backing device?

@railroadeda3 railroadeda3 commented Mar 6, 2020

Thank you. That's an interesting setup. The title says dm-raid, though the command lines use md-raid. Are they combined now, or is this actually md-raid? LVM also has mirroring; it would be interesting to know whether that combines with integrity the same way. You could have per-LV mirroring.

@mkszuba mkszuba commented Mar 12, 2020

This should work with both dm-raid and LVM mirroring, as they both use md under the bonnet (or to be precise, LVM mirroring uses dm-raid and that in turn uses md).

@mkszuba mkszuba commented Mar 16, 2020

I have confirmed that this works with LVM. If in the above you replace the call to mdadm with all the steps necessary to create a mirrored logical volume on top of the two LUKS devices and instead of writing to a sysfs file you call something along the lines of lvchange --syncaction check /dev/testvg/testlv, it results in the same integrity-error messages appearing in the kernel log the first - and only the first - time said command is called.
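
For anyone who wants to try it, a rough sketch of that LVM variant on top of the two LUKS devices from the gist (the VG/LV names and the 200M size are just examples):

$ pvcreate /dev/mapper/disk1luks /dev/mapper/disk2luks
$ vgcreate testvg /dev/mapper/disk1luks /dev/mapper/disk2luks
$ lvcreate --type raid1 -m 1 -L 200M -n testlv testvg
$ mkfs.ext4 /dev/testvg/testlv
$ lvchange --syncaction check testvg/testlv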

@khimaros khimaros commented Aug 28, 2020

I've been experimenting with something similar.

The primary difference is that I run dm-integrity as a separate layer under the md-raid array to avoid encrypting once per disk.

This is only relevant for larger disk arrays, and I'm not sure whether it is an overall performance improvement. Needs benchmarks.

The full stack is: physical disk > dm-integrity > md-raid > dm-crypt > lvm > ext4

This ticks most of the boxes for me:

[x] mainline kernel support
[x] snapshotting
[x] block checksumming
[x] reliable parity striping (eg. with raid6)
[x] full disk encryption
[?] remote root fs unlock with dropbear-initramfs

I haven't verified initramfs behavior with this setup for use on a root partition.

I've used md directly instead of lvm-raid because under the hood lvm-raid uses md anyway and I'm more familiar with the tooling.
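
A minimal sketch of this stack with two example disks (all device, array and VG names here are made up; a real setup of this kind would use raid6 and more disks):

$ integritysetup format --integrity sha256 /dev/sdX1
$ integritysetup format --integrity sha256 /dev/sdY1
$ integritysetup open --integrity sha256 /dev/sdX1 sdX1int
$ integritysetup open --integrity sha256 /dev/sdY1 sdY1int
$ mdadm --create /dev/md/secure --level 1 --raid-devices=2 /dev/mapper/sdX1int /dev/mapper/sdY1int
$ cryptsetup luksFormat --type luks2 /dev/md/secure
$ cryptsetup luksOpen /dev/md/secure securecrypt
$ pvcreate /dev/mapper/securecrypt
$ vgcreate securevg /dev/mapper/securecrypt
$ lvcreate -n root -l 100%FREE securevg
$ mkfs.ext4 /dev/securevg/root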

@khimaros khimaros commented Sep 1, 2020

tl;dr, on a system with four 10GiB drives in a RAID-5 configuration using the setup described above, 200 bytes of randomly distributed corruption across two drives (in non-overlapping stripes) could result in unrecoverable failure of the entire array.

I've been battle-testing this setup for the past few days. It is much easier to repair bad checksums on a dm-integrity device before stopping the md array. If you stop the array and re-assemble it, any dm-integrity device with checksum errors will refuse to reattach.

For more detail and instructions to reproduce these failure modes, take a look at https://github.com/khimaros/raid-explorations#dm-integrity--md-considered-harmful
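
For reference, the repair I mean is the normal md repair pass, run while the array is still assembled (md127 being the example device name from the gist):

$ echo repair > /sys/block/md127/md/sync_action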

@tomato42 tomato42 commented Sep 27, 2020

A counterargument that dm-integrity with MD-RAID does work: https://securitypitfalls.wordpress.com/2018/05/08/raid-doesnt-work/

@khimaros khimaros commented Sep 27, 2020

@tomato42 -- I read that article and came to the opposite conclusion:

"While the functionality necessary to provide detection and correction of silent data corruption is available in the Linux kernel, the implementation likely will need few tweaks to not excerbate situations where the hardware is physically failing, not just returning garbage. Passing additional metadata about the I/O errors from the dm-integrity layer to the md layer could be a potential solution."

In other words, it's close, but as currently implemented it can often make problems worse.

@tomato42 tomato42 commented Sep 27, 2020

@khimaros I wrote that article 2 years ago. Since then that bug was fixed: https://www.spinics.net/lists/raid/msg63187.html

@khimaros khimaros commented Sep 27, 2020

@tomato42 -- aha! Thank you for that article. I referenced it during my own explorations.

Do you happen to know what combination of md tools and kernel version contains the required patch? Have you had a chance to revisit your own tests now that it has landed?

I've recently been running through some of the other tests with Debian Bullseye so it may be time to do a full refresh and annotate the tests w/ kernel+tools versions for each.

@tomato42 tomato42 commented Sep 27, 2020

Unfortunately no, I don't know which exact versions have the fixes. I only know that they are in the current version of RHEL 8. I've also verified that the behaviour is as expected: even gigabytes of read errors caused by checksum failures don't cause the dm-integrity volumes to be kicked from the array.

@Salamandar Salamandar commented Oct 3, 2020

@tomato42 @khimaros So the setup you're presenting here is RAID over LUKS+integrity.
If I understand properly, it is done that way so that RAID can supply the correct data and detect disk failures.

Is it possible to do LUKS over RAID over dm-integrity?
I'd prefer having a single encrypted partition to having multiple ones, unless you tell me there are good reasons for having multiple LUKS devices below RAID.

@tomato42 tomato42 commented Oct 3, 2020

It's not us presenting the RAID over LUKS+integrity setup :)
In my article I'm describing RAID over dm-integrity. Yes, I do it to detect and correct disk failures (both vocal, when the disk returns read errors, and silent, when the disk just returns garbage instead of the data previously written).

Yes, it's possible to do LUKS over RAID over dm-integrity. If you want both encryption and protection against disk failures, I'd suggest doing it like this. Using LUKS below RAID has the unfortunate effect that you then have to encrypt the data multiple times, so you will get worse performance than with LUKS above RAID.

One reason to do RAID over LUKS with integrity is that it's much easier to set up (the only difference is the use of special options when formatting the LUKS volume; opening and using it is the same as with regular LUKS, so you can follow most of the guides explaining setup and migration). As dm-integrity on its own is much newer, setting it up is much more manual and thus complicated. I recently wrote an article on how to do it in Fedora 31, RHEL 8, CentOS 8 and Arch Linux: https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/

@Salamandar Salamandar commented Oct 4, 2020

@tomato42 Yes, I didn't even think about the fact that RAID over LUKS needs to encrypt the data multiple times. That's one more argument for LUKS over RAID instead.

One reason to do RAID over LUKS with integrity is that it's much easier to set up

Well… In my own head LUKS over RAID is easier to understand because RAID is at "hardware level" and LUKS is at "OS level". But dm-integrity is also at OS level so… :/

I recently wrote an article on how to do it in Fedora 31, RHEL 8, CentOS 8 and Arch Linux: https://securitypitfalls.wordpress.com/2020/09/27/making-raid-work-dm-integrity-with-md-raid/

Thanks a lot. I'm keeping it, and if you want comments about the Debian implementation I can give you feedback.

@tomato42 tomato42 commented Oct 4, 2020

@Salamandar

Well… In my own head LUKS over RAID is easier to understand because RAID is at "hardware level" and LUKS is at "OS level". But dm-integrity is also at OS level so… :/

well, if we're talking about Linux, there's no limit to shenanigans with block devices :)

you can have LVM, on top of dm-crypt on top of md-raid, on top of dm-integrity on top of loop devices that use regular files, on an LVM...

and this is only about directly attached storage, with network based devices it can get really crazy

Thanks a lot. I'm keeping it, and if you want comments about the Debian implementation I can give you feedback.

Sure, feel free to add a comment about Debian-specific changes to the setup steps.

@Salamandar Salamandar commented Oct 5, 2020

Yeah, my daily job is about network-attached storage hardware, so… setups can be funny some days. Well, I just received my NAS today, so I'll start playing with your dm-integrity tutorial.

@khimaros khimaros commented Oct 8, 2020

Unfortunately no, I don't know which exact versions have the fixes. I only know that they are in the current version of RHEL 8. I've also verified that the behaviour is as expected: even gigabytes of read errors caused by checksum failures don't cause the dm-integrity volumes to be kicked from the array.

According to the Git tags on torvalds/linux@b76b471 the fix you referenced should be included in any kernel released after 5.4-rc1.

I re-ran my tests on Linux 5.8.10 and mdadm 4.1 on Debian Bullseye. The result there was very positive: the raid6 array survived even 1 MB+ of random corruption on 2/4 disks, and a manual scrub identified and corrected the corruption.

My take-away from this is that dm-integrity + md is DANGEROUS for unpatched kernels <5.4-rc1, but seems to be quite reliable for kernels including torvalds/linux@b76b471

@khimaros khimaros commented Oct 8, 2020

@tomato42 -- Following up on this after running some tests where I intentionally corrupted beyond raid6 parameters (100K+ of randomized corruption on 4/4 disks in the array): in cases where md doesn't have enough parity information to recalculate the correct value, checksum failures continue in perpetuity even after mdadm --action=repair.

One solution I've found is to run fsck.ext4 -c -y -f to add these to the bad block list at the filesystem layer, but I'm curious whether you're aware of any other solutions, either at the md or dm level, such as recalculating the integrity journal? Are you aware of any way to identify corrupted files based on these kernel messages?

@tomato42 tomato42 commented Oct 8, 2020

If there is not enough redundancy left in the array to recover a sector, then there's only one way to fix it: write some valid data to that sector.

So the way I'd do it is to use something like dd if=/dev/md0 of=/dev/null bs=4096 to find the first failing block, and then use dd if=/dev/zero of=/dev/md0 bs=4096 seek=<number of valid blocks> count=1 to overwrite just that block.

Mapping a file to the bad block is rather hard and depends on the file system used; but then again you can do a tar cf /dev/null /file/system/on/md and tar will complain about the read errors...

(yeah, rather brute force approach, but will definitely work)

@flower1024 flower1024 commented Apr 7, 2021

If you are looking for performance, it is better to keep the checksum data on another device.
You can't do that with cryptsetup, but you can with integritysetup.

I have mine on an SSD.

4x integritysetup -> mdraid -> cryptsetup -> ext4
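
If I read the integritysetup man page correctly, the detached-metadata variant looks roughly like this, with the tags/journal on an SSD partition and the data on the big HDD (device paths are examples, and the --data-device option should be double-checked against your cryptsetup version):

$ integritysetup format --integrity sha256 --data-device /dev/big-hdd /dev/ssd-part
$ integritysetup open --integrity sha256 --data-device /dev/big-hdd /dev/ssd-part hdd_int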

@ggeorgovassilis ggeorgovassilis commented Aug 26, 2021

Thanks for the write-up! There seems to be a significant performance issue with the combination of RAID 6 and dm-crypt.

I tried out the following setup: the baseline is a RAID6 with 4 rotational HDDs + dm-raid + dm-crypt, which I converted one disk at a time to dm-integrity + dm-raid + dm-crypt. The first disk resynced in about 12 hours (a full disk resync usually takes 10 hours); the second disk resynced at about 10 MB/s, so I stopped the process after a few hours and reverted to my original setup. CPU load was at no point particularly high.

I wrote about the issue here: https://blog.georgovassilis.com/2021/05/02/dm-raid-and-dm-integrity-performance-issue/
