@Mausy5043
Last active March 6, 2022 16:01
RAID-6 failing/failed disk exchange procedure
# The failing drive is indicated by its device name.
# Because device names like `/dev/sde` are not static, the failing drive could be any one of the disks in the array.
# Use this command to link the static WWN id to its device name in the logs:
journalctl --since 2020-03-09 |grep /dev/sda |grep -e "WWN\|Prefail"
# Disks with problems today are:
WWN:5-0014ee-6055a237b and WWN:5-0014ee-605a043e2
The [237b] (POH:43332hrs) has 12 SMART errors (UNC) @ 42085hrs (T+1247) and an extended offline test failed @ 41930hrs
The [43e2] (POH:42023hrs) has 6 SMART errors (UNC) @ 41839hrs (T+184) and 35387hrs
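# The journalctl check above can be wrapped in a small helper that walks all
# member disks instead of one device at a time. A minimal sketch; the helper
# name and the device list in the usage line are my own, not from the original:

```shell
# Sketch: grep the journal for SMART WWN/Prefail lines, per device.
check_smart_logs() {
    local since="$1"; shift
    local dev
    for dev in "$@"; do
        echo "== $dev =="
        journalctl --since "$since" | grep "$dev" | grep -e "WWN\|Prefail"
    done
}
# usage: check_smart_logs 2020-03-09 /dev/sd{a..e}
```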
# My Problem: /dev/sde is failing
# Procedure below is based on information gathered from raid.wiki.kernel.org
#
# Find the serial number of the drive so we can find it easily in the enclosure
$ sudo hdparm -i /dev/sde | grep SerialNo
Model=ST3000DM001-9YN166, FwRev=CC4B, SerialNo=S1F0JZ8B
# Inspect the current partitioning information of the drive;
# `parted`, `sfdisk` and `fdisk` will all show it.
$ sudo fdisk -l /dev/sde
# Typically a RAID drive is used in its entirety.
# So, the info is purely FYI & just-in-case.
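# Since the layout may be wanted again for the replacement, the just-in-case
# info can be captured with an `sfdisk` dump. A sketch; the helper name and
# the backup filename are assumptions, not part of the original procedure:

```shell
# Sketch: save a disk's partition layout to a file, so it can be replayed
# onto the replacement later with:  sudo sfdisk <disk> < <backup-file>
backup_ptable() {
    sudo sfdisk --dump "$1" > "$2"
}
# usage: backup_ptable /dev/sde sde-parts.backup
```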
# 1. Tell mdadm to fail the drive (if it has not already done so)
$ sudo mdadm --manage /dev/md0 --fail /dev/sde1
# Now the RAID will be in a degraded state. Monitor `mdadm` regularly
# and wait to make sure `mdadm` has made its peace with the situation.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
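# Instead of eyeballing `/proc/mdstat`, the degraded state can be detected
# from the `[UUU_]` slot map. A sketch of my own, not part of the original:

```shell
# Sketch: succeed if the named array shows a missing member, i.e. a '_'
# in the [UUU_] slot map. Reads mdstat text on stdin, e.g.:
#   md_degraded md0 < /proc/mdstat && echo "md0 is degraded"
md_degraded() {
    grep -A2 "^$1 :" | grep -q '\[[U_]*_[U_]*\]'
}
```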
# 2. Tell mdadm to remove the drive
$ sudo mdadm --manage /dev/md0 --remove /dev/sde1
# Check again to confirm the removal.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
# 3. Shutdown the server
$ sudo poweroff
# 4a. Open the enclosure
# 4b. Identify the drive by serial number
# 4c. Remove the faulty drive
# 4d. Note the serial number of the new replacement drive
# 4e. Insert the replacement drive
# 4f. Close the enclosure
# 4g. Power on
# 5. Confirm the device name of the new drive
$ sudo hdparm -i /dev/sde | grep SerialNo
# If this doesn't return the correct serial number, try sdb, sdc or sdd
# Alternatively, query the disk by WWN: sudo smartctl --all /dev/disk/by-id/wwn-0x50014ee60507b79c
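# The hunt across sdb, sdc and sdd can be scripted in one go. A sketch; the
# helper name and device list are mine:

```shell
# Sketch: print the hdparm serial number for each candidate device.
list_serials() {
    local dev
    for dev in "$@"; do
        printf '%s: ' "$dev"
        sudo hdparm -i "$dev" 2>/dev/null | grep -o 'SerialNo=[^ ,]*' || echo 'n/a'
    done
}
# usage: list_serials /dev/sd{a..e}
```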
# 6. Partition the new disk
$ sudo parted /dev/sde
GNU Parted 3.2
Using /dev/sde
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mktable gpt
(parted) mkpart primary ext4 1049kB 3001GB
(parted) print
Model: ATA WDC WD30EFRX-68E (scsi)
Disk /dev/sde: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB  ext4         primary
(parted) quit
Information: You may need to update /etc/fstab
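# The same parted session can be run non-interactively with `-s`, which is
# handy if drives get swapped more than once. A sketch using the same label
# and bounds as the session above; the helper name is an assumption:

```shell
# Sketch: scripted equivalent of the interactive parted session above.
partition_new_disk() {
    sudo parted -s "$1" mklabel gpt mkpart primary ext4 1MiB 100%
}
# usage: partition_new_disk /dev/sde
```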
# Using `fdisk`:
$ sudo fdisk /dev/sdb
g # create a new, empty GPT partition table
n # add a new partition (accept all defaults)
t # change the partition type to Linux RAID (type 29 in recent fdisk)
w # write changes to disk and exit
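# The fdisk keystrokes above can also be done non-interactively with `sfdisk`.
# A sketch under the assumption of one whole-disk partition; the GUID is the
# standard GPT partition-type GUID for Linux RAID:

```shell
# Sketch: one whole-disk Linux RAID partition on a fresh GPT label.
# A19D880F-... is the GPT partition-type GUID for Linux RAID.
raid_partition() {
    sudo sfdisk "$1" <<'EOF'
label: gpt
,,A19D880F-05FC-4D3B-A006-743F0F84911E
EOF
}
# usage: raid_partition /dev/sde
```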
# 7. Add the disk to the array
$ sudo mdadm --manage /dev/md0 --add /dev/sde1
mdadm: added /dev/sde1
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat # Monitor the recovery.
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sde1[4] sdd1[2] sdb1[0] sdc1[1]
5860267008 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [UUU_]
[>....................] recovery = 0.0% (698880/2930133504) finish=558.8min speed=87360K/sec
unused devices: <none>
# Looking good so far!
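# Rather than re-running the status commands, `mdadm --wait` blocks until the
# resync/recovery is done, so the all-clear can be scripted. A sketch; the
# helper name is mine:

```shell
# Sketch: block until recovery completes, then report.
wait_for_rebuild() {
    sudo mdadm --wait "$1"   # blocks while resync/recovery is running
    echo "$1: rebuild finished"
}
# usage: wait_for_rebuild /dev/md0
```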