@ccy — Last active June 15, 2023
Linux - Installation

Storage Layout

Production 1

Device    | Redundant Boot | Emergency | vmem LVM    | ZFS                           | Unuse
          | EFI   | /boot  | LVM (/)   | LV   | LV   | Storage   | Cache   | Log     |
SSD #1    | 500M  | 500M   | /         |      |      |           |         |         | 1
SSD #2    | 500M  | 500M   | /         |      |      |           |         |         |
NVMe      |       |        |           | swap | app  |           |         |         |
SSD / HDD |       |        |           |      |      | draid3:1s |         |         |
NVMe      |       |        |           |      |      |           | [check] |         |
NVMe      |       |        |           |      |      |           |         | 10G PLP |

Production 2 - Compact

Device    | Redundant Boot | Emergency | vmem LVM    | ZFS                           | Unuse
          | EFI   | /boot  | LVM (/)   | LV   | LV   | Storage   | Cache   | Log     |
NVMe #1   | 500M  | 500M   | /         | swap | app  |           | [check] |         | 1
NVMe #2   | 500M  | 500M   | /         |      |      |           |         | 10G PLP |
SSD / HDD |       |        |           |      |      | draid3:1s |         |         |

Example

This is an example of a machine configured with:

  1. EFI Partition in mdadm RAID1

  2. Boot Partition in mdadm RAID1

  3. Root partition in Thinly Provisioned LVM volume

Tip
Use UUIDs in configuration files whenever possible
blkid
blkid
/dev/md126: UUID="B895-E0BC" TYPE="vfat"
/dev/nvme0n1p1: UUID="02190f3f-386e-3c26-fe11-da342ee44207" UUID_SUB="e6abd74f-e861-4cca-de2b-b0371a7fe964" LABEL="fb3-a2:boot_efi" TYPE="linux_raid_member" PARTLABEL="Linux filesystem" PARTUUID="ba833d9e-2e05-45b3-bb6f-65a19b838594"
/dev/nvme0n1p2: UUID="0e424203-d116-836a-4311-45b0d1eac1b5" UUID_SUB="b4ed317f-38e6-c1ec-4873-a48833dba0fc" LABEL="fb3-a2:boot" TYPE="linux_raid_member" PARTLABEL="Linux filesystem" PARTUUID="3bc08e6b-641c-451b-8a2a-007e6e05e7e4"
/dev/md127: UUID="7584555f-a08a-411f-a663-043d18041d07" TYPE="xfs"
/dev/nvme1n1p2: UUID="0e424203-d116-836a-4311-45b0d1eac1b5" UUID_SUB="b256fed4-7a00-cc77-570e-233e44ba1369" LABEL="fb3-a2:boot" TYPE="linux_raid_member" PARTLABEL="Linux filesystem" PARTUUID="40ae1e9d-c486-45cc-8a75-e24849e242aa"
/dev/nvme1n1p1: UUID="02190f3f-386e-3c26-fe11-da342ee44207" UUID_SUB="00e1724a-f755-d3f1-4f3c-4d3a4bf016ff" LABEL="fb3-a2:boot_efi" TYPE="linux_raid_member" PARTLABEL="Linux filesystem" PARTUUID="aeb6561b-9571-4c19-b21b-9afac2c88e53"
/dev/mapper/system-root: UUID="f0ae0fdc-bec3-477c-8a79-dbdac5f0358d" TYPE="xfs"
/etc/fstab
UUID=f0ae0fdc-bec3-477c-8a79-dbdac5f0358d /                       xfs     defaults        0 0
UUID=7584555f-a08a-411f-a663-043d18041d07 /boot                   xfs     defaults        0 0
UUID=B895-E0BC          /boot/efi               vfat    umask=0077,shortname=winnt 0 2
/etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M rd.lvm.lv=system/root rd.md.uuid=0e424203:d116836a:431145b0:d1eac1b5"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true
GRUB_DEVICE_UUID=f0ae0fdc-bec3-477c-8a79-dbdac5f0358d
Note
The kernel parameter rd.lvm.lv does not accept a UUID; use the volume group/logical volume name.
Note
Set GRUB_DEVICE_UUID to the UUID of the root file system.
/boot/grub2/grub.cfg
search --no-floppy --fs-uuid --set=root --hint='mduuid/0e424203d116836a431145b0d1eac1b5'  7584555f-a08a-411f-a663-043d18041d07
...
search --no-floppy --fs-uuid --set=root 7584555f-a08a-411f-a663-043d18041d07
...
set kernelopts="root=/dev/mapper/system-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M rd.lvm.lv=system/root rd.md.uuid=0e424203:d116836a:431145b0:d1eac1b5 "
...
Warning
bug #64291: grub2-probe fail to get fs_uuid of LVM thin volume
efibootmgr
efibootmgr -v
BootCurrent: 0001
Timeout: 1 seconds
BootOrder: 0001,0000,0020,0021,0005,0014,0015,0016,0017
Boot0000* Rocky Linux   HD(1,GPT,ba833d9e-2e05-45b3-bb6f-65a19b838594,0x100,0x1f500)/File(\EFI\ROCKY\SHIMX64.EFI)
Boot0001* Rocky Linux   HD(1,GPT,aeb6561b-9571-4c19-b21b-9afac2c88e53,0x100,0x1f500)/File(\EFI\ROCKY\SHIMX64.EFI)
Boot0005* UEFI: Built-in EFI Shell      VenMedia(5023b95c-db26-429b-a648-bd47664c8012)..BO
Boot0014  UEFI: PXE IP4 P1 Intel(R) I210 Gigabit  Network Connection    PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/Pci(0x4,0x0)/Pci(0x0,0x0)/MAC(d05099de9976,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0015  UEFI: PXE IP6 P1 Intel(R) I210 Gigabit  Network Connection    PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/Pci(0x4,0x0)/Pci(0x0,0x0)/MAC(d05099de9976,0)/IPv6([::]:<->[::]:,0,0)..BO
Boot0016  UEFI: PXE IP4 P0 Intel(R) I210 Gigabit  Network Connection    PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/Pci(0x5,0x0)/Pci(0x0,0x0)/MAC(d05099de9975,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0017  UEFI: PXE IP6 P0 Intel(R) I210 Gigabit  Network Connection    PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/Pci(0x5,0x0)/Pci(0x0,0x0)/MAC(d05099de9975,0)/IPv6([::]:<->[::]:,0,0)..BO
Boot0020* UEFI OS       HD(1,GPT,ba833d9e-2e05-45b3-bb6f-65a19b838594,0x100,0x1f500)/File(\EFI\BOOT\BOOTX64.EFI)..BO
Boot0021* UEFI OS       HD(1,GPT,aeb6561b-9571-4c19-b21b-9afac2c88e53,0x100,0x1f500)/File(\EFI\BOOT\BOOTX64.EFI)..BO

Redundant Installation

Note
Applies to UEFI systems with GPT partitioning and the grub2 bootloader.

Prepare storage devices for redundant installation

Tip
Always use a 4096-byte block size for file systems.
Tip
Format NVMe devices to a 4096-byte physical block size whenever possible before use.

Storage devices come with two common physical sector sizes: 512 bytes (512n) and the 4096-byte Advanced Format (4kn). Some devices even expose a 512-byte logical sector on top of 4096-byte physical sectors, an emulation mode known as 512e. A 512e storage device reports in smartctl as:

Sector Sizes:     512 bytes logical, 4096 bytes physical
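A device's reported sector sizes can be checked with smartctl from smartmontools (the device path here is an example):

```shell
# Print the logical/physical sector sizes the drive reports
smartctl -i /dev/sda | grep -i 'sector size'
```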
Table 1. Common storage devices

Storage  | 512n       | 4kn        | 512e
SATA HDD | obsolete   | [check]    | [check]
SATA SSD | [check]    | [question] | [check]
NVMe M.2 | Format LBA | Format LBA | [close]

Warning
Avoid mixing storage devices with different physical sector sizes in a redundant array; doing so risks failure or data corruption.

For example, adding physical volumes with mixed block sizes to an LVM volume group fails:

# query physical/logical block size and file system block size
blockdev -v --getss --getpbsz --getbsz /dev/nvme0n1
get logical block (sector) size: 512
get physical block (sector) size: 512
get blocksize: 4096

# query physical/logical block size and file system block size
blockdev -v --getss --getpbsz --getbsz /dev/nvme1n1
get logical block (sector) size: 4096
get physical block (sector) size: 4096
get blocksize: 4096

vgcreate vgroup0 /dev/nvme0n1p1 /dev/nvme1n1p1
  Devices have inconsistent logical block sizes (512 and 4096).
  See lvm.conf allow_mixed_block_sizes.
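If mixing truly cannot be avoided, LVM exposes the override named in the error message above. This is a sketch only; the safer fix is to reformat the NVMe device to a matching LBA size:

```shell
# One-off override of devices/allow_mixed_block_sizes from lvm.conf
# (risky in redundant arrays; prefer matching sector sizes)
vgcreate --config 'devices/allow_mixed_block_sizes=1' \
    vgroup0 /dev/nvme0n1p1 /dev/nvme1n1p1
```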

Format NVMe to supported LBA size

Unlike other storage devices, most newer NVMe drives let the user choose the physical sector size with nvme-cli:

# Install nvme-cli utility
dnf install -y nvme-cli

# Set device name
DEV=/dev/nvme0n1

# Query supported LBA
nvme id-ns -H $DEV
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better

# Format to 4096
nvme format --lbaf=1 $DEV

# Re-query
nvme id-ns -H $DEV
LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format  1 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0x1 Better (in use)

# Query supported LBA formats of an Intel Optane P4801X
DEV=/dev/nvme1n1
nvme id-ns -H $DEV

LBA Format  0 : Metadata Size: 0   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good (in use)
LBA Format  1 : Metadata Size: 8   bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format  2 : Metadata Size: 16  bytes - Data Size: 512 bytes - Relative Performance: 0x2 Good
LBA Format  3 : Metadata Size: 0   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  4 : Metadata Size: 8   bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  5 : Metadata Size: 64  bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
LBA Format  6 : Metadata Size: 128 bytes - Data Size: 4096 bytes - Relative Performance: 0 Best
Warning
Some Intel NVMe storage devices fail to format with nvme-cli; use the Intel MAS CLI tool instead:
# Download Intel MAS CLI tools
curl -LO https://downloadmirror.intel.com/763590/Intel_MAS_CLI_Tool_Linux_2.2.zip

# Unzip
unzip Intel_MAS_CLI_Tool_Linux_2.2.zip

# Install
dnf install intelmas-2.2.18-0.i386.rpm

# Show available NVMe devices
intelmas show -all -intelssd

# Format Intel NVMe to Format 3
# It takes some time to finish
intelmas start -intelssd 0 -nvmeformat LBAFormat=3
WARNING! You have selected to format the drive!
Proceed with the format? (Y|N): y
Formatting...(This can take several minutes to complete)

- Intel Optane(TM) SSD DC P4801X Series PHKM926000T6100D -

Status : NVMeFormat successful.

Linux System booting sequence

Stage                           | File System                | Mount Point | Files
1. Power On                     |                            |             | machine
2. UEFI firmware                |                            |             | machine, POST
3. Grub2 boot loader            | FAT32                      | /boot/efi   | grubx64.efi
4. Linux Kernel                 | [mdadm, lvm] + [ext4, xfs] | /           | /boot/vmlinuz.img, /boot/initrd.img
5. init, mount root file system | [mdadm, lvm] + [ext4, xfs] | /           |

Disk layout in single disk installation

A grub2-bootable Linux installation consists of a minimum of two partitions:

  • EFI Partition

  • Root File System

sgdisk -p /dev/

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1050623   512.0 MiB   EF00
   2         1050624       196362239   93.1 GiB    8300

The underlying file systems of each partition:

lsblk -o NAME,FSTYPE,FSVER,MOUNTPOINT /dev/sda

NAME   FSTYPE FSVER MOUNTPOINT
sda
├─sda1 vfat   FAT32 /boot/efi
└─sda2 ext4   1.0   /

The /boot/efi partition is reported as FAT32:

file -s /dev/sda1

/dev/sda1: DOS/MBR boot sector, code offset 0x58+2, OEM-ID "mkfs.fat", sectors/cluster 8, Media descriptor 0xf8, sectors/track 32, heads 64, hidden sectors 2048, sectors 1048576 (volumes > 32 MB), FAT (32 bit), sectors/FAT 1024, reserved 0x1, serial number 0xc49d266f, unlabeled

And the directory hierarchy for /boot:

tree /boot -L 1 --dirsfirst

/boot
├── efi
├── grub
├── config-5.10.0-21-amd64
├── initrd.img-5.10.0-21-amd64
├── System.map-5.10.0-21-amd64
└── vmlinuz-5.10.0-21-amd64

In the above example, /dev/sda1 is

  • an EFI partition formatted as FAT32

  • mounted at /boot/efi

This allows a machine booted in UEFI mode to find the EFI binaries in the EFI partition (type code: EF00), then load vmlinuz and initrd from /boot.

Note
/boot resides on /dev/sda2 (ext4)
/dev/sda1 is mounted at /boot/efi (FAT32)
Table 2. Common partition codes

Code | Name
8200 | Linux swap
8300 | Linux filesystem
8E00 | Linux LVM
BF01 | Solaris /usr & Mac ZFS
BF07 | Solaris Reserved 1
EF00 | EFI system partition
FD00 | Linux RAID

Redundant disk solution: mdadm and LVM

A redundant installation requires a minimum of two storage devices to survive an unexpected disk failure.

There are two popular redundant disk solutions: mdadm and LVM.

LVM is more flexible to manage than mdadm, and it also supports RAID-type logical volumes.

Important
The EFI partition must use the FAT32 file system.

Partition | Mount Point | mdadm   | LVM        | Remark
EFI       | /boot/efi   | [check] | [remove]   | UEFI can’t access LVM volume groups. Use --metadata 1.0 for mdadm.
boot      | /boot       | [check] | [question] | Can Grub2 access an LVM volume?
root      | /           | [check] | [check]    |

At first glance it seems impossible to use mdadm to host the EFI partition, since the mdadm partition type code is FD00 (Linux RAID).

According to 1 2, however, mdadm supports creating the array with --metadata 1.0, which places the RAID metadata at the end of the partition and exposes the native file system signature to UEFI.

Grub2 loads the Linux kernel and the init ramdisk image from /boot at a later stage. It is unknown whether grub2 can access an LVM volume to load files from the /boot directory.

The root file system partition is mounted by the init process while the vmlinuz kernel boots. The kernel binaries can easily be built with mdadm, LVM or other modules.
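Following the --metadata 1.0 approach above, the EFI mirror could be created roughly like this (device names are examples):

```shell
# RAID1 with the superblock at the end of the partition,
# so UEFI sees a plain FAT32 file system
mdadm --create /dev/md/boot_efi --level=1 --raid-devices=2 \
      --metadata=1.0 /dev/sda1 /dev/sdb1

# Format the array as FAT32 for the firmware
mkfs.vfat -F32 /dev/md/boot_efi
```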

Prepare redundant disk layout for Linux

This is a basic disk layout that supports redundant usage:

Partition | RAID  | File System | Mount Point | Remarks
1         | mdadm | FAT32       | /boot/efi   | metadata=1.0
2         | mdadm | ext4 / xfs  | /boot       | Some distros’ kickstart doesn’t allow /boot in LVM
3         | LVM   | ext4 / xfs  | /           | Thinly Provisioned

Boot disk partition layout

sgdisk -p /dev/sda

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1232895   601.0 MiB   FD00
   2         1232896         3332095   1.0 GiB     FD00
   3         3332096       234440703   110.2 GiB   8E00
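A layout like the one printed above could be created with sgdisk; sizes follow the table, and the target disk is an example:

```shell
# 1: EFI (mdadm, FD00), 2: /boot (mdadm, FD00), 3: LVM PV (8E00)
sgdisk -Z /dev/sda
sgdisk -n 1:0:+601M -t 1:FD00 -c 1:boot_efi \
       -n 2:0:+1G   -t 2:FD00 -c 2:boot \
       -n 3:0:0     -t 3:8E00 -c 3:lvm /dev/sda
```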

The Linux RAID EFI partition is exposed as a FAT32 partition:

file -s /dev/sda1

/dev/sdo2: DOS/MBR boot sector, code offset 0x58+2, OEM-ID "mkfs.fat", sectors/cluster 8, Media descriptor 0xf8, sectors/track 4, sectors 1230720 (volumes > 32 MB), FAT (32 bit), sectors/FAT 1200, reserved 0x1, serial number 0x68eba4a1, unlabeled

mdadm raid layout

cat /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4]
md126 : active raid1 sda2[2] sdb2[1]
      615360 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 sda1[2] sdb1[1]
      1047552 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

mdadm --detail --scan

ARRAY /dev/md/boot metadata=1.2 name=localhost.localdomain:boot UUID=f0a6acbc:37355709:0f42d29d:9f2f7b8e
ARRAY /dev/md/boot_efi metadata=1.0 name=localhost.localdomain:boot_efi UUID=67f3946b:2a724f50:9f582167:e7cdf1f2

root file system in LVM

pvs
  PV           VG     Fmt  Attr PSize    PFree
  /dev/nvme1n1 swap   lvm2 a--   931.51g <693.09g
  /dev/sda3    system lvm2 a--  <110.20g <107.87g
  /dev/sdb3    system lvm2 a--  <110.20g <107.87g

vgs
  VG     #PV #LV #SN Attr   VSize    VFree
  system   1   1   0 wz--n- <220.40g  215.73g

lvs
  LV           VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  firebird_tmp swap   -wi-ao---- 119.21g
  root         system rwi-aor---  <2.33g                                    100.00
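The root LV above carries the rwi-aor attributes of a RAID1-type logical volume. A sketch of how such a volume (or the thinly provisioned variant from the layout table) might be created; names and sizes are illustrative:

```shell
# RAID1-type logical volume mirrored across the two PVs
lvcreate --type raid1 -m 1 -L 2.3G -n root system

# Alternatively, a thinly provisioned root:
#   lvcreate --type thin-pool -L 100G -n pool0 system
#   lvcreate -V 50G --thinpool system/pool0 -n root
```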

Rebuild initrd/initramfs for LVM based root file system

Making changes to an LVM-based root file system, such as converting a logical volume from linear to mirror, may cause the system to fail to mount the root file system during boot.

Rebuild the initramfs after such changes:

After rename root file system logical volume:

# Optional: Update rd.lvm.lv in current grub config file to reflect new volume group name:
vi /etc/default/grub

Rebuild initramfs:

# Update grub.cfg to reflect new changes
sudo grub2-mkconfig -o "$(readlink -e /etc/grub2.cfg)"

# Make initramfs: Set version variable using current version string
VER=$(uname -r)

# Optional: Make initramfs: make a backup
sudo cp /boot/initramfs-$VER.img /boot/initramfs-$VER.img.backup

# Make initramfs: redhat/centos/fedora/rocklinux
sudo dracut -f /boot/initramfs-$VER.img $VER
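After rebuilding, it may be worth confirming that the image actually contains the storage modules; lsinitrd ships with dracut:

```shell
# List dracut modules bundled into the rebuilt image
VER=$(uname -r)
lsinitrd /boot/initramfs-$VER.img | grep -E 'lvm|mdraid'
```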

Repair redundant disk

Useful scripts and commands to repair mdadm device:

# Define a new disk
NEW=/dev/sde

# zap the new disk
sgdisk -Z $NEW

# Optional: Clone partition from existing raid device
# or use gdisk define new partition manually
sgdisk /dev/sdb -R $NEW

# Wipe every partition's file system signature (sde1, sde2, ...)
wipefs -a ${NEW}?*

# Randomize partition GUIDs to avoid conflict with existing raid device in last replicate operation
sgdisk -G $NEW

# Refresh partition tables
partprobe

# Optional: clear mdadm metadata on the new partitions
mdadm --zero-superblock ${NEW}[1-2]

# Optional: scan for available mdadm devices and mount them
sudo mdadm --assemble --scan
mount /boot
mount /boot/efi

# Add the new device to the current raid devices; adjust the md numbers before executing
mdadm /dev/md127 --add ${NEW}1
mdadm /dev/md126 --add ${NEW}2
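After adding the new members, the resynchronisation can be monitored until both devices show [UU]:

```shell
# Watch rebuild progress every 5 seconds
watch -n 5 cat /proc/mdstat

# Or inspect a single array in detail
mdadm --detail /dev/md126
```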

Useful scripts and commands to repair LVM volume:

# Define new LVM partition
NEW=/dev/sde3

# Initialize physical volume
pvcreate ${NEW}

# Add new physical volume to system volume group
vgextend system ${NEW}

# Remove missing PVs in system volume group
vgreduce --removemissing --force system

# Repair a logical volume in volume group
lvconvert --repair system/root

# How to repair all LVs in VG at once?
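One possible answer to the question above is to iterate over the LVs that lvs reports for the volume group; a sketch, not validated against a degraded VG:

```shell
# Repair every logical volume in the "system" volume group
for lv in $(lvs --noheadings -o lv_name system); do
    lvconvert --repair "system/$lv"
done
```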

Format file system from 512 to 4096 block size

# Back up the EFI file system content
tar -zcf /tmp/boot-efi.tgz -C /boot/efi .

umount /boot/efi

# Load the current UUID into the shell environment
. <(blkid -o export /dev/md126)
echo $UUID

# Re-create the FAT32 file system with 4096-byte sectors, keeping the old UUID
mkfs.vfat -F32 -S 4096 -i ${UUID/-} /dev/md126

# Restore the content
mount /boot/efi
tar -zxvf /tmp/boot-efi.tgz -C /boot/efi

# Repeat for /boot
tar -zcf /tmp/boot.tgz -C /boot .
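The result can be verified with blockdev; 4096 is the expected value after the reformat:

```shell
# Query the file system block size of the rebuilt EFI array
blockdev --getbsz /dev/md126
```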

Issue: diskfilter writes are not supported

After setting up the redundant boot, grub may print

error: ../../grub-core/disk/diskfilter.c:916:diskfilter writes are not supported

at the boot-up screen.

A workaround is:

# Optional: Don't use grubenv, rename it
sudo mv /boot/grub2/grubenv /boot/grub2/grubenv.old

# Optional: Remove the grubenv
sudo rm /boot/grub2/grubenv

Troubleshoot Booting

Here are some possible scenarios that cause Linux to fail to boot:

  1. Storage devices were migrated

  2. The root file system's LVM volume was renamed, e.g. with lvrename or cockpit.

Update grub.cfg and initramfs

Most Linux booting issues can be solved by rebuilding grub.cfg and the initial RAM disk.

First, boot system into a compatible Linux live ISO/CD and drop to a shell console.

Next, mount both boot and root file systems:

# Optional: Activate boot devices stored in mdadm raid devices
mdadm --assemble --scan

# Optional: Activate LVM volume for root file system
pvs
vgchange -ay vg-new

# mount root filesystem. Example: LVM volume (VG "vg-new", LV "root")
mount /dev/vg-new/root /mnt

# mount boot filesystem. Example: mdadm volume
mount /dev/md126 /mnt/boot

# mount EFI device. Example: mdadm volume
mount /dev/md127 /mnt/boot/efi

# Change to root filesystem
mount --bind /proc /mnt/proc
mount --bind /dev /mnt/dev
mount --bind /sys /mnt/sys
chroot /mnt

With the existing file systems mounted and the chroot entered, we can start fixing the issue.

# Update rd.lvm.lv in current grub config file to reflect new volume group name:
vi /etc/default/grub

# optional: Disable probing other OS
cat << EOF | tee -a /etc/default/grub
GRUB_DISABLE_OS_PROBER=true
EOF

# Update grub.cfg to reflect new changes
grub2-mkconfig -o "$(readlink -e /etc/grub2.cfg)"

# Make initramfs: Switch to /boot
cd /boot

# Make initramfs: Set version variable using current version string
VER=$(uname -r)

# Make initramfs: or set a static version string if version of live system is different to actual system version
VER=5.14.0-162.18.1.el9_1.x86_64

# Make initramfs: make a backup
cp initramfs-$VER.img initramfs-$VER.img.backup

# Make initramfs: redhat based to build initramfs
dracut -f /boot/initramfs-$VER.img $VER

The fix is complete now. Finally, tidy the system and reboot:

# Exit chroot
exit

# umount boot and root
umount /mnt/boot/efi /mnt/boot /mnt

# reboot
reboot

Optional: Update EFI boot entries

Note
UEFI firmware should detect all available EFI Partitions in storage devices automatically and offer for booting. In general, it is not necessary to update these boot entries.
# Show current boot entries
efibootmgr -v

# Example: Remove an unused boot entry
BOOT=0006; efibootmgr -B -b $BOOT

# Add missing boot entries for MDADM EFI partition

# Get OS variables
. /etc/os-release

# Add boot entry for 1st device
DEV=/dev/sda
efibootmgr -c -d $DEV -p 2 -L "$NAME" -l "\EFI\\$ID\shimx64.efi"

# Add boot entry for 2nd device
DEV=/dev/sdb
efibootmgr -c -d $DEV -p 2 -L "$NAME" -l "\EFI\\$ID\shimx64.efi"