JP Vossen on 20 May 2009 22:20:44 -0700

[PLUG] Fixing smartd "Offline uncorrectable sectors" *

* For some value of "fixed"...

I just brought up a server in a remote co-lo, and was getting:

May 20 21:02:18 host smartd[6707]: Device: /dev/sda, 13 Offline uncorrectable sectors
May 20 21:32:18 host smartd[6707]: Device: /dev/sda, 13 Offline uncorrectable sectors

That drive is an older one pulled from some other server: a Maxtor 
DiamondMax Plus 9 6Y160M0 160G.  It is disk0 of a software mirror set 
that's sitting under LVM.  I have a mirror (which is NOT NOT NOT a 
backup, but that's a different story), so I can probably keep running if 
it dies, but...  Dead hard drives are a PITA.

When I Google for that error I get plenty of hits.

Unfortunately, nothing I found in Google told me what that error 
actually *means*.  The closest was an attribute table that said:
	ID:             198
	Hex:            C6
	Attribute name: Uncorrectable Sector Count
	Better:         \/
	Description:    The total number of uncorrectable errors when
	                reading/writing a sector. A rise in the value of
	                this attribute indicates defects of the disk
	                surface and/or problems in the mechanical
	                subsystem. (or Off-Line Scan Uncorrectable Sector
	                Count – Fujitsu)

Unfortunately, that's not *quite* what I see.  I am getting "198 
Offline_Uncorrectable" on Maxtor.  Sigh.  BTW the other drive is fine as 
far as I can tell from SMART.

So just for fun, I zeroed the free space on both partitions.  From what I 
read (mostly in badblockhowto.html), an attempt to write to a bad block 
will cause the drive to Do Something About the error, and maybe fix* it.  
According to df, / has 142G available with only 7G used (5.6G of that is 
a single VM), and /boot has 236M available with 48M used.  So the odds 
are that if anything is really wrong, it either has no data in it or 
it's in the VM that I'll be re-rsyncing anyway.
	# dd if=/dev/zero of=/boot/zero; rm -f /boot/zero
	# dd if=/dev/zero of=/zero; rm -f /zero
	dd: writing to `/zero': No space left on device
	282258265+0 records in
	282258264+0 records out
	144516231168 bytes (145 GB) copied, 3992.15 s, 36.2 MB/s
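
For anyone repeating this, here is a rough sketch of the same zero-fill 
wrapped in a shell function.  It is not what was run above: the function 
name zero_fill, the scratch file name, the bs=1M block size (GNU dd 
syntax), the optional size cap, and the sync before the rm are all 
assumptions added for illustration.  The sync is there so the zeros 
actually get flushed to the disk before the file goes away.

```shell
# Hypothetical helper, not from the original post.
# Usage: zero_fill DIR [MAX_MIB]
# Writes zeros into DIR until the filesystem is full (or MAX_MIB is reached),
# flushes them to disk, prints the number of bytes written, then removes the file.
zero_fill() {
    dir=$1
    max_mib=${2:-}                 # optional cap in MiB; empty means "fill it up"
    target="$dir/zero.$$"          # scratch file name is an assumption
    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$target" bs=1M count="$max_mib" 2>/dev/null || true
    else
        # dd exits non-zero on "No space left on device"; here that is the goal.
        dd if=/dev/zero of="$target" bs=1M 2>/dev/null || true
    fi
    sync                           # push the zeros out to the disk before deleting
    echo $(( $(wc -c < "$target") ))
    rm -f "$target"
}

# Example: a harmless 4 MiB trial run in a temporary directory.
# zero_fill "$(mktemp -d)" 4
```

Without the second argument it fills the filesystem completely, so 
anything else that needs to write there will fail until the rm runs.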

(Funny story, after the big dd ended, I was wondering why I showed 0 
free space.  Turns out fcheck was running an md5sum on 145G of /zero, so 
while I *had* deleted it, since md5sum had the file handle open it 
wasn't "gone" yet.  Once I killed the md5sum I got my space back.)
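
A quick way to spot that situation is to look for open file 
descriptors that point at deleted files.  The sketch below assumes a 
Linux /proc filesystem, where such links end in " (deleted)"; the 
function name deleted_open_files is made up for this example.

```shell
# Hypothetical helper: list files that process PID still holds open
# even though they have been deleted. Linux-only (relies on /proc).
deleted_open_files() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # readlink fails for descriptors we may not inspect; skip those.
        link=$(readlink "$fd" 2>/dev/null) || continue
        case $link in
            *' (deleted)') printf '%s\n' "$link" ;;
        esac
    done
}

# Example: check the current shell.
deleted_open_files $$
```

If lsof is installed, "lsof +L1" should give a system-wide view of the 
same thing (open files with a link count below 1).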

Also, I understand that 145 GB written as reported by dd is larger than 
the 142G reported free by df.  I hate drive space math in different bases...
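
For the record, here is the arithmetic on dd's numbers in both bases.  
This is only a sketch: the helper names are made up, and it does not 
settle which base df was using for its "142G", since that depends on 
how df was invoked.

```shell
# Hypothetical helpers: convert a byte count to decimal GB (10^9 bytes)
# and binary GiB (2^30 bytes), rounded to one decimal place.
bytes_to_gb()  { LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1e9 }'; }
bytes_to_gib() { LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'; }

# "records out" from the dd run above, times dd's default 512-byte block size
bytes=$((282258264 * 512))

echo "$bytes bytes"
echo "$(bytes_to_gb "$bytes") GB"
echo "$(bytes_to_gib "$bytes") GiB"
```

The product matches the 144516231168 bytes that dd reported, which is 
roughly 144.5 GB in powers of ten but only about 134.6 GiB in powers of 
two.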

Anyway, short story long, that fixed it after I told SMART to re-test:
	# smartctl -t offline /dev/sda
	## Wait 300+ seconds
	# smartctl -A /dev/sda | egrep '^198|^ID'
	198 Offline_Uncorrectable   0x0008   253   240   000    Old_age   Offline      -       0
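
To check that number from a script rather than by eye, something like 
the sketch below could pull out the raw value.  It assumes the raw 
value is the 10th whitespace-separated column of "smartctl -A" output, 
as in the line above; the function name smart_raw_value is made up.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# raw value (10th column) of the attribute whose ID is given as $1.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# On a real system (needs root and smartmontools):
#   smartctl -A /dev/sda | smart_raw_value 198

# Demo using the attribute line quoted in this post:
sample='198 Offline_Uncorrectable   0x0008   253   240   000    Old_age   Offline      -       0'
printf '%s\n' "$sample" | smart_raw_value 198
```

Running it once before the zero-fill and once after the re-test would 
give the before/after comparison.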

If I'd been really smart I'd have re-run the test (smartctl -t offline 
/dev/sda) before the "fix" just in case...  But that didn't occur to me 
until just now.

Hopefully this is useful for someone,
JP Vossen, CISSP            |:::======|
My Account, My Opinions     |=========|
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.