@DroidFreak32
Last active January 26, 2019 10:46
What are the things that may indicate that a disk is about to fail?
HDDs:
Number of damaged/bad sectors
Increased heat output
Increased clicking or other noise from the HDD
Problems reading or writing data
Initially, IBM introduced a disk monitoring technology for its disks named "Predictive Failure Analysis", but it was basic software that gave a binary result: "device OK" or "drive likely to fail soon".
Later, another variant, "IntelliSafe", was created by Compaq along with Seagate, Quantum & Conner. It measured health parameters, and each drive vendor was free to choose the parameters and their threshold values.
Then all these vendors, including WD, joined together to develop a standard named SMART.
It is a monitoring technology that is supported by most of the modern hard drives.
With SMART, different internal and external problems with the drive may be monitored and reported to the user.
Some of the problems reported by SMART are critical, and ignoring them is just waiting for the disk to fail and data to be lost.
Other parameters inform about a potential problem in the future and may not require immediate action to be taken.
Some parameters:
How long it takes the drive to spin up from 0 to 7200 RPM
R/W errors
Temperature
Bad sectors reallocated
Unrecoverable errors
Wear leveling count (SSD)
Total bytes Written (SSD)
Each drive manufacturer defines a set of attributes, and sets threshold values beyond which attributes should not pass under normal operation. Each attribute has a raw value, whose meaning is entirely up to the drive manufacturer (but often corresponds to counts or a physical unit, such as degrees Celsius or seconds), a normalized value, which ranges from 1 to 253 (with 1 representing the worst case and 253 representing the best) and a worst value, which represents the lowest recorded normalized value. The initial default value of attributes is 100 but can vary between manufacturers.
Normalized values are usually mapped so that higher values are better (exceptions include drive temperature, number of head load/unload cycles), but higher raw attribute values may be better or worse depending on the attribute and manufacturer. For example, the "Reallocated Sectors Count" attribute's normalized value decreases as the count of reallocated sectors increases. In this case, the attribute's raw value will often indicate the actual count of sectors that were reallocated, although vendors are in no way required to adhere to this convention.
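The relationship between the normalized value, the worst value and the threshold can be sketched in Python. This is only a minimal illustration: the `SmartAttribute` class and the two helper functions are hypothetical names, and the rule "at or below the threshold counts as failed" is the convention documented for smartctl, assumed here to apply to the drive in question.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int      # current normalized value
    worst: int      # lowest normalized value recorded so far
    threshold: int  # vendor-defined limit
    raw: int        # vendor-specific raw value (assumed to be a plain integer here)


def is_failing(attr):
    """True if the current normalized value is at or below the threshold."""
    return attr.value <= attr.threshold


def has_ever_failed(attr):
    """True if the worst recorded normalized value is at or below the threshold."""
    return attr.worst <= attr.threshold
```

For example, an attribute with value 100 and threshold 10 is healthy, while one whose value has dropped to 8 against the same threshold is failing.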
** Some important SMART attributes
Attribute Description
SMART 5 Reallocated Sectors Count
SMART 187 Reported Uncorrectable Errors
SMART 188 Command Timeout
SMART 197 Current Pending Sector Count
SMART 198 Uncorrectable Sector Count
The Reallocated Sectors Count S.M.A.R.T. parameter indicates the count of reallocated sectors (512 bytes). When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping and "reallocated" sectors are called remaps. This is why, on modern hard disks, you will not see "bad blocks" while testing the surface - all bad blocks are hidden in reallocated sectors.
However, the more sectors are reallocated, the more noticeable the drop in disk read/write speed becomes (10% or more).
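A quick check of the five attributes listed above might look like the following sketch. It assumes the raw values are plain counts (which vendors are not required to follow), and `CRITICAL_ATTRIBUTES` and `nonzero_critical` are hypothetical names chosen for this example.

```python
# Attribute IDs and names taken from the list above.
CRITICAL_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def nonzero_critical(raw_values):
    """Return the critical attributes whose raw value is above zero.

    raw_values maps attribute ID -> raw value (assumed to be a simple count).
    Attributes the drive does not report are skipped.
    """
    return {
        attr_id: (name, raw_values[attr_id])
        for attr_id, name in CRITICAL_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > 0
    }
```

Calling `nonzero_critical({5: 0, 197: 8, 9: 600})` reports only attribute 197, since attribute 5 is zero and attribute 9 is not in the list.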
###### Just for reference
001 Read Error Rate (Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
002 Throughput Performance Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing there is a high probability that there is a problem with the disk.
003 Spin-Up Time Average time of spindle spin up (from zero RPM to fully operational [milliseconds]).
004 Start/Stop Count A tally of spindle start/stop cycles. The spindle turns on, and hence the count is increased, both when the hard disk is turned on after having before been turned entirely off (disconnected from power source) and when the hard disk returns from having previously been put to sleep mode.
005* Reallocated Sectors Count Count of reallocated sectors. The raw value represents a count of the bad sectors that have been found and remapped. Thus, the higher the raw value, the more sectors the drive has had to reallocate. This value is primarily used as a metric of the life expectancy of the drive; a drive which has had any reallocations at all is significantly more likely to fail in the immediate months.
007 Seek Error Rate (Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
008 Seek Time Performance Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
009 Power-On Hours Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state. "By default, the total expected lifetime of a hard disk in perfect condition is defined as 5 years (running every day and night on all days). This is equal to 1825 days in 24/7 mode or 43800 hours." On some pre-2005 drives, this raw value may advance erratically and/or "wrap around" (reset to zero periodically).
010* Spin Retry Count Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
011 Recalibration Retries or Calibration Retry Count This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
012 Power Cycle Count This attribute indicates the count of full hard disk power on/off cycles.
173 SSD Wear Leveling Count Counts the maximum worst erase count on any block.
174 Unexpected power loss count Also known as "Power-off Retract Count" per conventional HDD terminology. Raw value reports the number of unclean shutdowns, cumulative over the life of an SSD, where an "unclean shutdown" is the removal of power without STANDBY IMMEDIATE as the last command (regardless of PLI activity using capacitor power). Normalized value is always 100.
175 Power Loss Protection Failure Last test result as microseconds to discharge cap, saturated at its maximum value. Also logs minutes since last test and lifetime number of tests. Raw value contains the following data: Bytes 0-1: Last test result as microseconds to discharge cap, saturates at max value. Test result expected in range 25 <= result <= 5000000, lower indicates specific error code. Bytes 2-3: Minutes since last test, saturates at max value. Bytes 4-5: Lifetime number of tests, not incremented on power cycle, saturates at max value. Normalized value is set to one on test failure or 11 if the capacitor has been tested in an excessive temperature condition, otherwise 100.
176 Erase Fail Count S.M.A.R.T. parameter indicates a number of flash erase command failures.
179 Used Reserved Block Count Total Pre-Fail attribute used at least in Samsung devices.
180 Unused Reserved Block Count Total Pre-Fail attribute used at least in HP devices.
181 Program Fail Count Total or Non-4K Aligned Access Count Either the total number of Flash program operation failures since the drive was deployed, or the number of user data accesses (both reads and writes) where LBAs are not 4 KiB aligned (LBA % 8 != 0) or where size is not modulus 4 KiB (block count != 8), assuming logical block size (LBS) = 512 B.
182 Erase Fail Count Pre-Fail Attribute used at least in Samsung devices.
183 SATA Downshift Error Count or Runtime Bad Block Western Digital, Samsung or Seagate attribute: Either the number of downshifts of link speed (e.g. from 6Gbps to 3Gbps) or the total number of data blocks with detected, uncorrectable errors encountered during normal operation. Although degradation of this parameter can be an indicator of drive aging and/or potential electromechanical problems, it does not directly indicate imminent drive failure.
184* End-to-End error / IOEDC This attribute is a part of Hewlett-Packard's SMART IV technology, as well as part of other vendors' IO Error Detection and Correction schemas, and it contains a count of parity errors which occur in the data path to the media via the drive's cache RAM.
187* Reported Uncorrectable Errors The count of errors that could not be recovered using hardware ECC (see attribute 195).
188* Command Timeout The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero.
190 Temperature Difference or Airflow Temperature Value is equal to (100-temp. °C), allowing manufacturer to set a minimum threshold which corresponds to a maximum temperature. This also follows the convention of 100 being a best-case value and lower values being undesirable. However, some older drives may instead report raw Temperature (identical to 0xC2) or Temperature minus 50 here.
191 G-sense Error Rate The count of errors resulting from externally induced shock and vibration.
192 Power-off Retract Count, Emergency Retract Cycle Count (Fujitsu), or Unsafe Shutdown Count Number of power-off or emergency retract cycles.
193 Load Cycle Count or Load/Unload Cycle Count (Fujitsu) Count of load/unload cycles into head landing zone position. Some drives use 225 (0xE1) for Load Cycle Count instead. Western Digital rates their VelociRaptor drives for 600,000 load/unload cycles, and WD Green drives for 300,000 cycles; the latter ones are designed to unload heads often to conserve power. On the other hand, the WD3000GLFS (a desktop drive) is specified for only 50,000 load/unload cycles. Some laptop drives and "green power" desktop drives are programmed to unload the heads whenever there has not been any activity for a short period, to save power. Operating systems often access the file system a few times a minute in the background, causing 100 or more load cycles per hour if the heads unload: the load cycle rating may be exceeded in less than a year. There are programs for most operating systems that disable the Advanced Power Management (APM) and Automatic Acoustic Management (AAM) features causing frequent load cycles.
194 Temperature or Temperature Celsius Indicates the device temperature, if the appropriate sensor is fitted. Lowest byte of the raw value contains the exact temperature value (Celsius degrees).
195 Hardware ECC Recovered (Vendor-specific raw value.) The raw value has different structure for different vendors and is often not meaningful as a decimal number.
196* Reallocation Event Count Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.
197* Current Pending Sector Count Count of "unstable" sectors (waiting to be remapped, because of unrecoverable read errors). If an unstable sector is subsequently read successfully, the sector is remapped and this value is decreased. Read errors on a sector will not remap the sector immediately (since the correct value cannot be read and so the value to remap is not known, and also it might become readable later); instead, the drive firmware remembers that the sector needs to be remapped, and will remap it the next time it's written. However, some drives will not immediately remap such sectors when written; instead the drive will first attempt to write to the problem sector and if the write operation is successful then the sector will be marked good (in this case, the "Reallocation Event Count" (0xC4) will not be increased). This is a serious shortcoming, for if such a drive contains marginal sectors that consistently fail only after some time has passed following a successful write operation, then the drive will never remap these problem sectors.
198* (Offline) Uncorrectable Sector Count The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
199 UltraDMA CRC Error Count The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).
200 Multi-Zone Error Rate The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.
201* Soft Read Error Rate or TA Counter Detected Count indicates the number of uncorrectable software read errors.
241 Total LBAs Written Total count of LBAs written.
242 Total LBAs Read Total count of LBAs read. Some S.M.A.R.T. utilities will report a negative number for the raw value since in reality it has 48 bits rather than 32.
243 Total LBAs Written Expanded The upper 5 bytes of the 12-byte total number of LBAs written to the device. The lower 7 byte value is located at attribute 0xF1.
244 Total LBAs Read Expanded The upper 5 bytes of the 12-byte total number of LBAs read from the device. The lower 7 byte value is located at attribute 0xF2.
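Several of the raw-value conventions in the reference list above come down to simple arithmetic, which the following Python sketch tries to make concrete. All function names are hypothetical and chosen for this example only. The sketch also assumes that the raw values are available as plain integers and that the "lower 7 bytes" of attributes 241/242 are the least significant part of the 12-byte total. Real drives may deviate from any of these conventions.

```python
# Attribute 9: 5 years of 24/7 operation = 1825 days = 43800 hours.
EXPECTED_LIFETIME_HOURS = 5 * 365 * 24


def lifetime_fraction_used(power_on_hours):
    """Attribute 9: share of the nominal 5-year lifetime already used."""
    return power_on_hours / EXPECTED_LIFETIME_HOURS


def airflow_temperature_celsius(normalized_value):
    """Attribute 190: the normalized value is (100 - temperature in deg C)."""
    return 100 - normalized_value


def temperature_celsius(raw_value):
    """Attribute 194: the lowest byte of the raw value holds the temperature."""
    return raw_value & 0xFF


def days_until_load_cycle_rating(rated_cycles, cycles_per_hour):
    """Attribute 193: days of continuous operation until the rating is reached."""
    return rated_cycles / cycles_per_hour / 24


def as_signed_32(raw_value):
    """Attribute 242: what a tool that keeps only 32 signed bits would display."""
    low = raw_value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def total_lbas(upper_5_bytes, lower_7_bytes):
    """Attributes 243/241 or 244/242: join both parts into one 12-byte count."""
    return (upper_5_bytes << 56) | lower_7_bytes
```

For instance, `days_until_load_cycle_rating(300_000, 100)` gives 125 days, which matches the remark that a 300,000-cycle rating may be exceeded in less than a year at 100 load cycles per hour. Likewise, `as_signed_32(3_000_000_000)` yields a negative number, which is the effect described for attribute 242.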
####### smartctl commands
smartctl -h : Show help
smartctl -i <device> : Look for "SMART support is: Available" to check whether your HDD supports SMART (<device> is e.g. /dev/sda)
smartctl -c <device> : "SMART capabilities"
smartctl -l selftest <device> : Results of self-tests, e.g. a short test
smartctl -a <device> : Detailed result of the disk
smartctl -H <device> : Overall health
The conf file is "/etc/smartd.conf"
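To work with the attribute table printed by smartctl -A (also included in smartctl -a), a small parser can help. The sketch below assumes the usual ten-column layout of smartmontools text output (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function name `parse_attribute_line` is hypothetical and the sample row is made up for illustration, not taken from a real drive.

```python
def parse_attribute_line(line):
    """Parse one attribute row of smartctl's text table into a dict.

    Returns None for lines that are not attribute rows (headers, blanks).
    """
    # Split into at most 10 fields so that a raw value containing spaces
    # stays together in the last field.
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "threshold": int(fields[5]),
        "type": fields[6],
        "raw": fields[9].strip(),
    }


# Illustrative sample row (made up, not from a real drive).
SAMPLE = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
row = parse_attribute_line(SAMPLE)
```

The raw value is kept as a string because some attributes print extra text after the number.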
** The meaning and interpretation of the attributes vary between manufacturers, and are sometimes considered a trade secret for one manufacturer or another.
** From a legal perspective, the term "S.M.A.R.T." refers only to a signaling method between internal disk drive electromechanical sensors and the host computer. Because of this the specifications of S.M.A.R.T. are entirely vendor specific and, while many of these attributes have been standardized between drive vendors, others remain vendor-specific. S.M.A.R.T. implementations still differ and in some cases may lack "common" or expected features such as a temperature sensor or only include a few select attributes while still allowing the manufacturer to advertise the product as "S.M.A.R.T. compatible."
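pre-fail: an attribute whose normalized value dropping to or below its threshold indicates that the disk is about to fail. As a rule of thumb for the error counters listed above: when the raw value is 0, nothing has happened; if the raw value is > 0, treat it as a warning sign.
old-age: normal wear and usage, not usually something to be concerned about by itself, e.g. it just says that the HDD has been powered on for 600 hrs.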