Mike Edwards on 23 Aug 2011 09:01:58 -0700 |
Re: [PLUG] Debian unstable locking up and corrupting filesystem |
> [ 13.720864] ata1.00: status: { DRDY ERR }
> [ 13.720905] ata1.00: error: { UNC }
> [ 13.744497] end_request: I/O error, dev sda, sector 195523

The drive is reporting an uncorrectable (UNC) error while reading a
specific sector. The most likely cause is an issue with the drive itself.
SMART may have some useful information for you, but beware - it's not
perfect.

If you don't have it already, install the smartmontools package. Running

    smartctl -a /dev/sda

should print out device information (model #, serial, etc), the SMART
status (a simple PASSED/FAILING flag according to thresholds on each
counter), and a full dump of the SMART counters.

In general, the SMART status is often less than helpful. It's quite common
to see a drive fail due to whole areas of the disk being rendered unusable
(head crash / damaged or unstable platters?), while specific counters
haven't reached the failure threshold yet, resulting in SMART reporting the
drive as PASSED.

At the same time, the SMART counters can be difficult to read, as they're
implemented in different ways on different drives. In my experience, the
counters you want to pay particular attention to are:

    Raw_Read_Error_Rate
    Reallocated_Sector_Ct
    Seek_Error_Rate
    Load_Cycle_Count
    Hardware_ECC_Recovered
    Reallocated_Event_Count
    Current_Pending_Sector
    Offline_Uncorrectable
    UDMA_CRC_Error_Count
    Multi_Zone_Error_Rate

Also of note - not all drives will have the same SMART counters. Certain
counters, like Raw_Read_Error_Rate, are found on all drives, while others
(Hardware_ECC_Recovered) may not appear on some drives.

Here's the tricky part - interpreting what these counters tell you. Again,
this is largely vendor-specific. WD drives will tend to keep the raw values
at 0 unless a specific type of error occurs, in which case the matching
counter is incremented.
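If you'd rather not pick those counters out of the full dump by eye, here is
a rough Python sketch that does it. The names parse_attributes, read_smartctl
and WATCHED are my own, made up for illustration. It assumes the attribute
table has the usual ten whitespace-separated columns and that smartctl is run
with enough privileges to query the drive. Treat it as a starting point, not
a finished tool.

```python
import subprocess

# Counter names copied from the list above; trim or extend for your drive.
WATCHED = {
    "Raw_Read_Error_Rate", "Reallocated_Sector_Ct", "Seek_Error_Rate",
    "Load_Cycle_Count", "Hardware_ECC_Recovered", "Reallocated_Event_Count",
    "Current_Pending_Sector", "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count", "Multi_Zone_Error_Rate",
}


def parse_attributes(text):
    """Return {name: columns} for the watched rows of a smartctl dump."""
    rows = {}
    for line in text.splitlines():
        # Split into at most ten columns; RAW_VALUE may contain spaces.
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID; skip everything else.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if name in WATCHED:
            rows[name] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "type": fields[6],
                "raw": fields[9].strip(),
            }
    return rows


def read_smartctl(device="/dev/sda"):
    """Run 'smartctl -a' on the device and return its text output."""
    # smartctl often exits non-zero even when it prints a full report,
    # so the return code is deliberately not checked here.
    result = subprocess.run(["smartctl", "-a", device],
                            capture_output=True, text=True)
    return result.stdout
```

Calling parse_attributes(read_smartctl("/dev/sda")) should give you one
entry per watched counter that the drive actually reports. Counters the
drive doesn't implement are simply absent.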
An example from a WD drive:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       48
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       24
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

With that example, all raw values are 0, except for Load_Cycle_Count (48)
and UDMA_CRC_Error_Count (24). This is a drive I'd keep an eye on, but
nothing to panic about yet.

Other vendors, like Seagate and Hitachi, will often constantly increment the
Raw_Read_Error_Rate and Seek_Error_Rate counters, meaning the raw value will
tell you little:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   109   099   006    Pre-fail  Always       -       130151115
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       123137049
...

In this case, you should pay close attention to the VALUE/WORST/THRESH
values - these are something like a rating system for each counter. On this
particular drive, the Raw_Read_Error_Rate counter has a current calculated
rating of 109, with the worst being seen at 99. If that rating hits 6 (the
THRESH value) or below, it will cause SMART to report the drive as failing.

Lastly, look at the TYPE field on these counters. There are two types -
Pre-fail and Old_age. The former indicates a definite failure condition, if
that counter passes its stated threshold.
The latter typically increments with drive age/use, and its threshold is a
bit more akin to how many miles are on your car - do you want to keep your
car after 100,000 miles? Do you want to keep your drive after X amount of
Power_On_Hours, or UDMA_CRC_Error_Counts?

Hope this helps. If you need assistance interpreting these counters, feel
free to email me the output of 'smartctl -a /dev/sda' and I'll tell you what
I can.

-- 
Mike Edwards                    | If this email address disappears,
Unsolicited advertisements to   | assume it was spammed to death.  To
this address are not welcome.   | reach me in that case, s/-.*@/@/

"Our progress as a nation can be no swifter than our progress in
education.  The human mind is our fundamental resource."
  -- John F. Kennedy
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug