Last active
March 6, 2022 16:01
RAID-6 failing/failed disk exchange procedure
# The failing drive is indicated by its device name.
# Because device names like `/dev/sde` are not static, the failing drive could be any one of the disks in the array.
# Use this command to link the static WWN id to its device name in the log:
journalctl --since 2020-03-09 | grep /dev/sda | grep -e "WWN\|Prefail"
# Disks with problems today are:
WWN:5-0014ee-6055a237b and WWN:5-0014ee-605a043e2
The [237b] (POH:43332hrs) has 12 SMART errors (UNC) @ 42085hrs (T+1247) and an extended offline test failed @ 41930hrs
The [43e2] (POH:42023hrs) has 6 SMART errors (UNC) @ 41839hrs (T+184) and 35387hrs
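# A minimal sketch of another way to tie a WWN to its current device name,
# assuming udev maintains the usual `wwn-*` symlinks under /dev/disk/by-id.
# The function name `wwn_map` and its optional directory argument are
# hypothetical helpers, not part of any tool.

```shell
# Print "wwn-... -> sdX" for every whole-disk WWN symlink in a directory.
# $1: directory to scan (assumed default: /dev/disk/by-id)
wwn_map() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/wwn-*; do
        [ -L "$link" ] || continue                  # glob matched nothing, or not a symlink
        case "$link" in *-part*) continue ;; esac   # skip partition entries
        # readlink prints the link target (e.g. ../../sde); keep only the last component
        printf '%s -> %s\n' "${link##*/}" "$(basename "$(readlink "$link")")"
    done
}
```

# Usage: run `wwn_map` and look up the WWN reported by SMART in the left column.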
# My problem: /dev/sde is failing.
# The procedure below is based on information gathered from raid.wiki.kernel.org
#
# Find the serial number of the drive so we can find it easily in the enclosure
$ sudo hdparm -i /dev/sde | grep SerialNo
Model=ST3000DM001-9YN166, FwRev=CC4B, SerialNo=S1F0JZ8B
# Inspect the current partitioning information of the drives
# (the output of `parted`, `sfdisk` or `fdisk` will do)
$ sudo fdisk -l /dev/sde
# Typically a RAID drive is used in its entirety,
# so this info is purely FYI and just-in-case.
# 1. Tell mdadm to fail the drive (if it has not already done so)
$ sudo mdadm --manage /dev/md0 --fail /dev/sde1
# The array is now in a degraded state. Re-run the check below until the
# output is stable and shows the drive as faulty.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
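# A hedged sketch for checking the degraded state without reading the whole
# output by eye. It pulls the member pattern (such as [UUU_]) for one array
# out of mdstat-format text. The name `md_members` and the optional file
# argument are hypothetical; the default file is assumed to be /proc/mdstat.

```shell
# Print the member status pattern, e.g. [UUU_], of one md array.
# $1: array name as shown in mdstat (e.g. md0)
# $2: mdstat-format file (assumed default: /proc/mdstat)
md_members() {
    array="$1"
    file="${2:-/proc/mdstat}"
    awk -v a="$array" '
        $1 == a && $2 == ":" { found = 1; next }   # header line of the requested array
        found && match($0, /\[[U_]+\]/) {          # first [U_...] group after the header
            print substr($0, RSTART, RLENGTH)
            exit
        }
        found && /^$/ { exit }                     # blank line ends the array block
    ' "$file"
}
```

# Usage: an underscore in the pattern marks a missing member, so
#   case "$(md_members md0)" in *_*) echo "md0 is degraded" ;; esac
# prints a message only while the array is degraded.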
# 2. Tell mdadm to remove the drive
$ sudo mdadm --manage /dev/md0 --remove /dev/sde1
# Check again to confirm the removal.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
# 3. Shut down the server
$ sudo poweroff
# 4a. Open the enclosure
# 4b. Identify the drive by serial number
# 4c. Remove the faulty drive
# 4d. Note the serial number of the new replacement drive
# 4e. Insert the replacement drive
# 4f. Close the enclosure
# 4g. Power on
# 5. Confirm the device name of the new drive
$ sudo hdparm -i /dev/sde | grep SerialNo
# If this doesn't return the correct serial number, try sdb, sdc or sdd.
# Or query the drive by WWN: sudo smartctl --all /dev/disk/by-id/wwn-0x50014ee60507b79c
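# Rather than trying sdb, sdc and sdd one by one, the lookup in step 5 could be
# sketched as below. This assumes `lsblk` is available and reports serials in
# its SERIAL column; `dev_by_serial` is a hypothetical helper name.

```shell
# Print /dev/<name> for the disk whose serial number equals $1.
# Reads "NAME SERIAL" pairs on stdin, the layout of `lsblk -dno NAME,SERIAL`.
# Exits non-zero when no disk matches.
dev_by_serial() {
    awk -v s="$1" '
        $2 == s { print "/dev/" $1; found = 1 }
        END     { exit !found }
    '
}
```

# Usage, with the serial number noted in step 4d:
#   lsblk -dno NAME,SERIAL | dev_by_serial <serial-of-new-drive>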
# 6. Partition the new disk
$ sudo parted /dev/sde
GNU Parted 3.2
Using /dev/sde
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mktable gpt
(parted) mkpart primary ext4 1049kB 3001GB
(parted) print
Model: ATA WDC WD30EFRX-68E (scsi)
Disk /dev/sde: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB  ext4         primary
(parted) quit
Information: You may need to update /etc/fstab
# Alternatively, using `fdisk`:
$ sudo fdisk /dev/sde
g # create a new GPT partition table
n # add a new partition (accept all defaults)
t # set the partition type to Linux RAID (29)
w # write the changes to disk
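# The interactive parted session above could also be scripted. The sketch below
# only PRINTS a candidate command so it can be reviewed before it is run by hand.
# Assumptions that differ from the session above: it uses `1MiB 100%` instead of
# explicit sizes and sets the `raid` flag; `print_parted_cmd` is a hypothetical name.

```shell
# Print (do not execute) a scripted parted call for one disk:
# new GPT label, a single partition spanning the disk, raid flag on.
# $1: whole-disk device, e.g. /dev/sde
print_parted_cmd() {
    dev="$1"
    case "$dev" in
        /dev/*) ;;                                   # looks like a device path
        *) echo "usage: print_parted_cmd /dev/sdX" >&2; return 1 ;;
    esac
    printf 'sudo parted -s %s mklabel gpt mkpart primary 1MiB 100%% set 1 raid on\n' "$dev"
}
```

# Usage: `print_parted_cmd /dev/sde`, then double-check the device name against
# step 5 before pasting the printed command, since it wipes the partition table.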
# 7. Add the disk to the array
$ sudo mdadm --manage /dev/md0 --add /dev/sde1
mdadm: added /dev/sde1
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat # Monitor the recovery.
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sde1[4] sdd1[2] sdb1[0] sdc1[1]
      5860267008 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [UUU_]
      [>....................] recovery = 0.0% (698880/2930133504) finish=558.8min speed=87360K/sec
unused devices: <none>
# Looking good so far!
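# For monitoring the rebuild, a small sketch that reduces the mdstat output to
# the two numbers of interest: percentage done and estimated time remaining.
# `md_progress` is a hypothetical helper; the default input is assumed to be
# /proc/mdstat in the layout shown above.

```shell
# Print "<percent> <time-to-finish>" for every array that is rebuilding.
# $1: mdstat-format file (assumed default: /proc/mdstat)
md_progress() {
    file="${1:-/proc/mdstat}"
    awk '/(recovery|resync) *=/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "=") pct = $(i + 1)            # field after the bare "=" is the percentage
            if ($i ~ /^finish=/) {                   # e.g. finish=558.8min
                eta = $i
                sub(/^finish=/, "", eta)
            }
        }
        print pct, eta
    }' "$file"
}
```

# Usage: `md_progress` prints nothing once no array is recovering or resyncing,
# so an empty result together with a full [UUUU] pattern means the rebuild is done.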