Mike Edwards on 23 Aug 2011 09:01:58 -0700


Re: [PLUG] Debian unstable locking up and corrupting filesystem


> [   13.720864] ata1.00: status: { DRDY ERR }
> [   13.720905] ata1.00: error: {UNC }
> [   13.744497] end_request: I/O error, dev sda, sector 195523

This is an uncorrectable read error (UNC) on a specific sector; the
DRDY flag only means the drive reported itself ready when the error
was raised.  The most likely cause is an issue with the drive itself.

SMART may have some useful information for you, but beware - it's not
perfect.  If you don't have it already, install the smartmontools package.

Running smartctl -a /dev/sda should print out device information (model
#, serial, etc), SMART status (simple PASSED/FAILING flag according to
thresholds on each counter), and a full dump of the SMART counters.

In general, the SMART status is often less than helpful.  It's quite
common to see a drive fail due to whole areas of the disk being
rendered unusable (head crash / damaged or unstable platters?), while
specific counters haven't reached the failure threshold yet, resulting
in SMART reporting the drive as PASSED.  At the same time, the SMART
counters can be difficult to read, as they're implemented in different
ways on different drives.

In my experience, the counters you want to pay particular attention to are:
Raw_Read_Error_Rate
Reallocated_Sector_Ct
Seek_Error_Rate
Load_Cycle_Count
Hardware_ECC_Recovered
Reallocated_Event_Count
Current_Pending_Sector
Offline_Uncorrectable
UDMA_CRC_Error_Count
Multi_Zone_Error_Rate

Also of note - not all drives will have the same SMART counters.
Certain counters, like Raw_Read_Error_Rate, are found on most drives,
while others (Hardware_ECC_Recovered) may not appear on some drives.

Here's the tricky part - interpreting what these counters tell you.
Again, this is largely vendor-specific.

WD drives will tend to keep them at 0, unless a specific type of error
occurs, in which case it will increment them.  An example from a WD
drive:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       48
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       24
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

With that example, all raw values are 0, except for Load_Cycle_Count
(48) and UDMA_CRC_Error_Count (24).  This is a drive I'd keep an eye
on, but nothing to panic about yet.
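
For drives that behave this way, the check is simple enough to sketch
in a few lines of Python.  This is only an illustration: the function
name nonzero_watched is made up, the watch list is the one given
earlier in this message, and the raw values are copied by hand from
the WD table above.  It is only meaningful on drives that leave these
raw counters at 0 when nothing is wrong.

```python
# Counters worth watching, as listed earlier in this message.
WATCHED = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Load_Cycle_Count",
    "Hardware_ECC_Recovered",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Multi_Zone_Error_Rate",
}


def nonzero_watched(raw_by_name):
    """Return the watched counters whose raw value is not zero."""
    return {name: raw for name, raw in raw_by_name.items()
            if name in WATCHED and raw != 0}


# Raw values from the WD example table.
wd_example = {
    "Raw_Read_Error_Rate": 0,
    "Reallocated_Sector_Ct": 0,
    "Seek_Error_Rate": 0,
    "Load_Cycle_Count": 48,
    "Reallocated_Event_Count": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 0,
    "UDMA_CRC_Error_Count": 24,
    "Multi_Zone_Error_Rate": 0,
}

print(nonzero_watched(wd_example))
# prints {'Load_Cycle_Count': 48, 'UDMA_CRC_Error_Count': 24}
```

A nonzero Load_Cycle_Count is just normal use, so what matters is
which counters show up and whether they keep climbing between runs.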

Other vendors, like Seagate and Hitachi, will often constantly
increment the Raw_Read_Error_Rate and Seek_Error_Rate counters, meaning
the raw value will tell you little:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   109   099   006    Pre-fail  Always       -       130151115
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       123137049
...

In this case, you should pay close attention to the VALUE/WORST/THRESH
values - these are something like a rating system for each counter.  On
this particular drive, the Raw_Read_Error_Rate counter has a current
calculated rating of 109, with the worst seen so far being 99.  If
that rating drops to the threshold of 6 or below, it will cause SMART
to report the drive as failing.
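
To make the arithmetic concrete, here is a small Python sketch that
computes how much headroom each rating has over its threshold.  The
function smart_margin is hypothetical, the numbers are copied from the
table above, and treating a THRESH of 0 as "never fails" is my
assumption about how such attributes are meant to be read.

```python
def smart_margin(value, worst, thresh):
    """Summarize one attribute's normalized VALUE/WORST against THRESH."""
    return {
        # Headroom between the current rating and the failure threshold.
        "margin_now": value - thresh,
        # Headroom at the lowest rating the drive has ever recorded.
        "margin_worst": worst - thresh,
        # Assumption: a THRESH of 0 means "no threshold", so never failing.
        "failing_now": thresh > 0 and value <= thresh,
        "failed_in_past": thresh > 0 and worst <= thresh,
    }


# (VALUE, WORST, THRESH) from the example table above.
seagate_example = {
    "Raw_Read_Error_Rate": (109, 99, 6),
    "Reallocated_Sector_Ct": (100, 100, 36),
    "Seek_Error_Rate": (81, 60, 30),
}

for name, (value, worst, thresh) in seagate_example.items():
    print(name, smart_margin(value, worst, thresh))
```

On these numbers, Seek_Error_Rate is the one closest to trouble: its
worst recorded rating of 60 was 30 points above the threshold, while
Raw_Read_Error_Rate has never been closer than 93.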

Lastly, look at the TYPE field on these counters.  There are two types -
Pre-fail and Old_age.  The former indicates a definite failure
condition if that counter passes its stated threshold.  The latter
typically increments with drive age/use, and its threshold is a bit
more akin to how many miles are on your car - do you want to keep your
car after 100,000 miles?  Do you want to keep your drive after X amount
of Power_On_Hours, or UDMA_CRC_Error_Counts?

Hope this helps.  If you need assistance interpreting these counters,
feel free to email me the output of 'smartctl -a /dev/sda' and I'll
tell you what I can.


-- 
                                                                      
Mike Edwards                    |   If this email address disappears,   
Unsolicited advertisements to   |   assume it was spammed to death.  To
this address are not welcome.   |   reach me in that case, s/-.*@/@/

"Our progress as a nation can be no swifter than our progress in education.
The human mind is our fundamental resource."
  -- John F. Kennedy
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug