@Mausy5043
Last active March 6, 2022 16:01
RAID-6 failing/failed disk exchange procedure
# The failing drive is indicated by its device name.
# Because device names like `/dev/sde` are not static, the failing drive could be any one of the disks in the array.
# Use this command to link the static WWN id to its device name in the logs:
journalctl --since 2020-03-09 |grep /dev/sda |grep -e "WWN\|Prefail"
# Disks with problems today are:
WWN:5-0014ee-6055a237b and WWN:5-0014ee-605a043e2
The [237b] (POH:43332hrs) has 12 SMART errors (UNC) @ 42085hrs (T+1247) and an extended offline test failed @ 41930hrs
The [43e2] (POH:42023hrs) has 6 SMART errors (UNC) @ 41839hrs (T+184) and 35387hrs
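# The journalctl check above can be wrapped in a small helper that walks all
# member disks instead of one device at a time. A minimal sketch; the helper
# name and the device list in the usage line are my own, not from the original:

```shell
# Sketch: grep the journal for SMART WWN/Prefail lines, per device.
check_smart_logs() {
    local since="$1"; shift
    local dev
    for dev in "$@"; do
        echo "== $dev =="
        journalctl --since "$since" | grep "$dev" | grep -e "WWN\|Prefail"
    done
}
# usage: check_smart_logs 2020-03-09 /dev/sd{a..e}
```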
# My Problem: /dev/sde is failing
# Procedure below is based on information gathered from raid.wiki.kernel.org
#
# Find the serial number of the drive so we can find it easily in the enclosure
$ sudo hdparm -i /dev/sde | grep SerialNo
Model=ST3000DM001-9YN166, FwRev=CC4B, SerialNo=S1F0JZ8B
# Inspect the current partitioning information of the drive;
# `parted`, `sfdisk` and `fdisk` will all show it.
$ sudo fdisk -l /dev/sde
# Typically a RAID drive is used in its entirety.
# So, the info is purely FYI & just-in-case.
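# Since the layout may be wanted again for the replacement, the just-in-case
# info can be captured with an `sfdisk` dump. A sketch; the helper name and
# the backup filename are assumptions, not part of the original procedure:

```shell
# Sketch: save a disk's partition layout to a file, so it can be replayed
# onto the replacement later with:  sudo sfdisk <disk> < <backup-file>
backup_ptable() {
    sudo sfdisk --dump "$1" > "$2"
}
# usage: backup_ptable /dev/sde sde-parts.backup
```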
# 1. Tell mdadm to fail the drive (if it has not already done so)
$ sudo mdadm --manage /dev/md0 --fail /dev/sde1
# Now the RAID will be in a degraded state. Monitor `mdadm` regularly
# and wait to make sure `mdadm` has made its peace with the situation.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
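# Instead of eyeballing `/proc/mdstat`, the degraded state can be detected
# from the `[UUU_]` slot map. A sketch of my own, not part of the original:

```shell
# Sketch: succeed if the named array shows a missing member, i.e. a '_'
# in the [UUU_] slot map. Reads mdstat text on stdin, e.g.:
#   md_degraded md0 < /proc/mdstat && echo "md0 is degraded"
md_degraded() {
    grep -A2 "^$1 :" | grep -q '\[[U_]*_[U_]*\]'
}
```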
# 2. Tell mdadm to remove the drive
$ sudo mdadm --manage /dev/md0 --remove /dev/sde1
# Check again to confirm the removal.
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat
# 3. Shutdown the server
$ sudo poweroff
# 4a. Open the enclosure
# 4b. Identify the drive by serial number
# 4c. Remove the faulty drive
# 4d. Note the serial number of the new replacement drive
# 4e. Insert the replacement drive
# 4f. Close the enclosure
# 4g. Power on
# 5. Confirm the device name of the new drive
$ sudo hdparm -i /dev/sde | grep SerialNo
# If this doesn't return the correct serial number, try sdb, sdc or sdd
# Alternatively, query the disk by WWN: sudo smartctl --all /dev/disk/by-id/wwn-0x50014ee60507b79c
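# The hunt across sdb, sdc and sdd can be scripted in one go. A sketch; the
# helper name and device list are mine:

```shell
# Sketch: print the hdparm serial number for each candidate device.
list_serials() {
    local dev
    for dev in "$@"; do
        printf '%s: ' "$dev"
        sudo hdparm -i "$dev" 2>/dev/null | grep -o 'SerialNo=[^ ,]*' || echo 'n/a'
    done
}
# usage: list_serials /dev/sd{a..e}
```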
# 6. Partition the new disk
$ sudo parted /dev/sde
GNU Parted 3.2
Using /dev/sde
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mktable gpt
(parted) mkpart primary ext4 1049kB 3001GB
(parted) print
Model: ATA WDC WD30EFRX-68E (scsi)
Disk /dev/sde: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB  ext4         primary
(parted) quit
Information: You may need to update /etc/fstab
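# The same parted session can be run non-interactively with `-s`, which is
# handy if drives get swapped more than once. A sketch using the same label
# and bounds as the session above; the helper name is an assumption:

```shell
# Sketch: scripted equivalent of the interactive parted session above.
partition_new_disk() {
    sudo parted -s "$1" mklabel gpt mkpart primary ext4 1MiB 100%
}
# usage: partition_new_disk /dev/sde
```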
# Using `fdisk`:
$ sudo fdisk /dev/sdb
g # create a new, empty GPT partition table
n # add a new partition (accept all defaults)
t # change the partition type to Linux RAID (type 29 in recent fdisk)
w # write changes to disk and exit
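# The fdisk keystrokes above can also be done non-interactively with `sfdisk`.
# A sketch under the assumption of one whole-disk partition; the GUID is the
# standard GPT partition-type GUID for Linux RAID:

```shell
# Sketch: one whole-disk Linux RAID partition on a fresh GPT label.
# A19D880F-... is the GPT partition-type GUID for Linux RAID.
raid_partition() {
    sudo sfdisk "$1" <<'EOF'
label: gpt
,,A19D880F-05FC-4D3B-A006-743F0F84911E
EOF
}
# usage: raid_partition /dev/sde
```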
# 7. Add the disk to the array
$ sudo mdadm --manage /dev/md0 --add /dev/sde1
mdadm: added /dev/sde1
$ sudo mdadm --detail /dev/md0; cat /proc/mdstat # Monitor the recovery.
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid6 sde1[4] sdd1[2] sdb1[0] sdc1[1]
5860267008 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/3] [UUU_]
[>....................] recovery = 0.0% (698880/2930133504) finish=558.8min speed=87360K/sec
unused devices: <none>
# Looking good so far!
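# Rather than re-running the status commands, `mdadm --wait` blocks until the
# resync/recovery is done, so the all-clear can be scripted. A sketch; the
# helper name is mine:

```shell
# Sketch: block until recovery completes, then report.
wait_for_rebuild() {
    sudo mdadm --wait "$1"   # blocks while resync/recovery is running
    echo "$1: rebuild finished"
}
# usage: wait_for_rebuild /dev/md0
```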