[PLUG] Fixing smartd "Offline uncorrectable sectors" *
* For some value of "fixed"...
I just brought up a server in a remote co-lo, and was getting:
May 20 21:02:18 host smartd[6707]: Device: /dev/sda, 13 Offline uncorrectable sectors
May 20 21:32:18 host smartd[6707]: Device: /dev/sda, 13 Offline uncorrectable sectors
That drive is an older one pulled from some other server: a Maxtor
DiamondMax Plus 9 6Y160M0 160G. It is disk0 of a software mirror set
that's sitting under LVM. So I have a mirror (which is NOT NOT NOT a
backup, but that's a different story) and I can probably keep running if
it dies, but... Dead hard drives are a PITA.
When I Google for that error I get stuff like:
http://smartmontools.sourceforge.net/badblockhowto.html
Unfortunately, nothing I found in Google told me what that error
actually *means*. The closest was
http://en.wikipedia.org/wiki/S.M.A.R.T which said:
ID   Hex  Attribute name              Better
[...]
198  C6   Uncorrectable Sector Count  \/
Description: The total number of uncorrectable errors when
reading/writing a sector. A rise in the value of this attribute
indicates defects of the disk surface and/or problems in the mechanical
subsystem. (or Off-Line Scan Uncorrectable Sector Count – Fujitsu)[15]
Unfortunately, that's not *quite* what I see. I am getting "198
Offline_Uncorrectable" on the Maxtor. Sigh. BTW the other drive is fine
as far as I can tell from SMART.
So just for fun, I zeroed free space on both partitions. From what I
read (mostly in badblockhowto.html), an attempt to write to a bad block
will cause the drive to Do Something About the error, and maybe fix* it.
Since according to df / has 142G available but only 7G used (5.6G is a
single VM), and /boot has 236M avail with 48M used, the odds favor that
if anything is really wrong it doesn't have data in it or it's in the VM
that I'll be re-rsyncing anyway.
# dd if=/dev/zero of=/boot/zero; rm -f /boot/zero
# dd if=/dev/zero of=/zero; rm -f /zero
dd: writing to `/zero': No space left on device
282258265+0 records in
282258264+0 records out
144516231168 bytes (145 GB) copied, 3992.15 s, 36.2 MB/s
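
For anyone repeating this, here is a minimal sketch of the same zero-fill wrapped in a function. The function name zero_fill and its optional size cap are made up for illustration and are not part of the original commands. The original dd calls used the default 512-byte block size (282258264 records x 512 bytes = 144516231168 bytes); the sketch assumes GNU dd and uses bs=1M, which is often faster, and adds a sync before deleting so the zeros are flushed to the drive first.

```shell
# zero_fill DIR [MAX_MB]   (hypothetical helper, not from the original post)
# Write zeros into DIR/zero until the filesystem is full, or until MAX_MB
# MiB have been written if a cap is given, then flush and remove the file.
zero_fill() {
    dir=$1
    max_mb=${2:-}            # optional cap, handy for a dry run on a small amount
    file="$dir/zero"
    if [ -n "$max_mb" ]; then
        dd if=/dev/zero of="$file" bs=1M count="$max_mb" 2>/dev/null || true
    else
        # Without a cap, dd is expected to stop with "No space left on device".
        dd if=/dev/zero of="$file" bs=1M 2>/dev/null || true
    fi
    sync                     # push the zeros out to the disk before deleting
    rm -f "$file"
}

# Usage (as root, and only if a full filesystem for a while is acceptable):
#   zero_fill /boot
#   zero_fill /
```

Filling / to 100% can upset running services, so trying it with a small cap first (for example zero_fill /tmp 10) is a reasonable sanity check.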
(Funny story, after the big dd ended, I was wondering why I showed 0
free space. Turns out fcheck was running an md5sum on 145G of /zero, so
while I *had* deleted it, since md5sum had the file handle open it
wasn't "gone" yet. Once I killed the md5sum I got my space back.)
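
The deleted-but-still-open effect is easy to reproduce on a small scale. This is only a sketch, assuming a Linux system with /proc mounted; the function name is invented for the demo.

```shell
# Demonstrate that an unlinked file survives while a descriptor is open.
# Assumes Linux with /proc; deleted_but_open_demo is a made-up name.
deleted_but_open_demo() {
    tmp=$(mktemp)
    exec 3<"$tmp"               # hold the file open on descriptor 3
    rm -f "$tmp"                # the name is gone, but the data is still allocated
    readlink /proc/self/fd/3    # prints the old path followed by " (deleted)"
    exec 3<&-                   # closing the descriptor is what frees the space
}

deleted_but_open_demo

# To hunt for the real culprit on a live system, lsof can list open files
# whose link count is zero:
#   lsof +L1
```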
Also, I understand that 145 GB written as reported by dd is larger than
the 142G reported free by df. I hate drive space math in different bases...
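
To make the base problem concrete, here is a small sketch that converts the byte count dd reported into both decimal GB (powers of 1000, which is what dd prints) and binary GiB (powers of 1024). The helper name bytes_to_units is made up. Which base df used above depends on how it was invoked: with GNU df, -h uses powers of 1024 and -H uses powers of 1000.

```shell
# bytes_to_units BYTES   (hypothetical helper)
# Print a byte count in decimal gigabytes and binary gibibytes.
bytes_to_units() {
    awk -v b="$1" 'BEGIN {
        printf "%.1f GB, %.1f GiB\n", b / 1e9, b / (1024 * 1024 * 1024)
    }'
}

bytes_to_units 144516231168    # prints: 144.5 GB, 134.6 GiB
```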
Anyway, short story long, that fixed it after I told SMART to re-test:
# smartctl -t offline /dev/sda
## Wait 300+ seconds
# smartctl -A /dev/sda | egrep '^198|^ID'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
198 Offline_Uncorrectable   0x0008   253   240   000    Old_age   Offline      -       0
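
If you want just the number (say, to record it before and after a fix), a rough sketch is to pull the RAW_VALUE column out of the smartctl -A table with awk. The function name smart_raw_value is invented here, and it assumes the ten-column layout shown above, with the raw value in the last column.

```shell
# smart_raw_value ID   (hypothetical helper)
# Read `smartctl -A` output on stdin and print the RAW_VALUE (10th column)
# of the attribute whose ID# matches.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (needs root and smartmontools):
#   smartctl -A /dev/sda | smart_raw_value 198
```

Run once before the zero-fill and once after the offline test finishes, this gives a simple before/after comparison.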
If I'd been really smart I'd have re-run the test (smartctl -t offline
/dev/sda) before the "fix" just in case... But that didn't occur to me
until just now.
Hopefully this is useful for someone,
JP
----------------------------|:::======|-------------------------------
JP Vossen, CISSP |:::======| http://bashcookbook.com/
My Account, My Opinions |=========| http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug