Rich Freeman via plug on 13 Jul 2021 07:13:15 -0700


[PLUG] New Hard Drive Testing Practices

Figured I'd start a conversation even if the topic is a little trivial.

I'm curious what others do when they obtain a new hard drive (or a
used one for that matter - I'm curious if you do anything
differently).  Do you do any kind of testing on drives before putting
them into service, assuming you can spare a few days/etc?  Do you take
any kind of measures to mitigate the potential of early failures in
lieu of testing?

I have a new drive that will probably arrive tomorrow and am going to
be taking another drive out of service that has an uncorrectable
sector (or 8 of them depending on how you count).  The disk is in a
RAID and it looks like the sector isn't even in use right now, but I
usually try to replace disks in this condition.

That actually creates a bit of a risk management question.  What is
the relative benefit of doing testing before putting a new disk into
service, knowing that it comes at the cost of delaying removal of a
potentially-failing disk from service?  Both the risk and benefit are
probably pretty low in this case.
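
To put a rough number on the delay side of that trade-off, here is a
minimal Python sketch that estimates how long full-surface passes over
a drive would take.  The 200 MB/s average throughput is purely an
assumption for illustration (real drives slow down toward the inner
tracks), and the function names are made up for this example.

```python
def full_pass_hours(capacity_tb: float, avg_mb_per_s: float) -> float:
    """Hours needed to read or write every byte of the drive once."""
    capacity_bytes = capacity_tb * 1e12           # vendors quote decimal TB
    seconds = capacity_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600


def delay_days(capacity_tb: float, avg_mb_per_s: float, passes: int) -> float:
    """Days of delay for a test that makes `passes` full-surface passes."""
    return passes * full_pass_hours(capacity_tb, avg_mb_per_s) / 24


if __name__ == "__main__":
    # ASSUMPTION: 200 MB/s sustained average; not a measured figure.
    scenarios = [
        ("one read pass", 1),
        ("one write pass plus one read pass", 2),
        ("four patterns, each written and read back", 8),
    ]
    for label, passes in scenarios:
        print(f"{label}: {delay_days(14, 200, passes):.1f} days")
```

Under that assumed throughput a single pass over 14TB is a bit over 19
hours, so an eight-pass destructive test lands at roughly six and a
half days of extra exposure on the old disk.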

Some options I can think of:

1. SMART short test (a no-brainer really, but largely ignores the surface)
2. SMART long test
3. badblocks destructive test (probably the most extensive practical
option, but this can take a number of days for a 14TB drive).  This
also has the advantage of probably detecting an SMR drive that managed
to sneak through, though this drive isn't supposed to be one.
4. Add the drive to the RAID without removing the old drive, then do a
scrub, and remove the old drive after the scrub passes.  This is
effectively a one-pass random-data destructive write test that only
costs the additional time of a read pass, since the write pass was
going to happen anyway.  However, it would probably only test the
in-use regions of the disk surface.  I'm using ZFS, so silent errors
won't be an issue.  Use care with anything that doesn't detect silent
errors: if there is a failure, you're going to have to deal with it
manually somehow (hopefully you can pick the version where 2/3 disks
agree).  This also only works for RAID options that support adding an
extra redundant disk temporarily.
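
For concreteness, the four options above could be sketched as command
lines.  The Python below only builds and prints the commands; it runs
nothing.  The device paths and the pool name "tank" are hypothetical
placeholders.  The ZFS sequence assumes a mirror vdev (a raidz layout
would need a different approach), and the 4096-byte badblocks block
size is chosen on the assumption that the default block size would
exceed badblocks' block-count limit on a drive this large.

```python
import shlex
from typing import List


def smart_test_cmd(device: str, kind: str) -> List[str]:
    """Options 1 and 2: start a SMART self-test ('short' or 'long')."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]


def smart_report_cmd(device: str) -> List[str]:
    """Show SMART attributes and the self-test log once a test finishes."""
    return ["smartctl", "-a", device]


def badblocks_cmd(device: str, block_size: int = 4096) -> List[str]:
    """Option 3: DESTRUCTIVE write-mode test; erases everything on device.

    -w write-mode test, -s show progress, -v verbose, -b block size.
    """
    return ["badblocks", "-w", "-s", "-v", "-b", str(block_size), device]


def zfs_mirror_swap_cmds(pool: str, old: str, new: str) -> List[List[str]]:
    """Option 4, ASSUMING a ZFS mirror: attach, scrub, check, then detach.

    Wait for the resilver to finish before scrubbing, and only detach
    the old disk after the scrub reports no errors.
    """
    return [
        ["zpool", "attach", pool, old, new],
        ["zpool", "scrub", pool],
        ["zpool", "status", "-v", pool],
        ["zpool", "detach", pool, old],
    ]


if __name__ == "__main__":
    new_disk = "/dev/sdX"   # hypothetical path of the new drive
    old_disk = "/dev/sdY"   # hypothetical path of the drive being retired
    commands = [
        smart_test_cmd(new_disk, "short"),
        smart_test_cmd(new_disk, "long"),
        smart_report_cmd(new_disk),
        badblocks_cmd(new_disk),
    ] + zfs_mirror_swap_cmds("tank", old_disk, new_disk)
    for argv in commands:
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: the badblocks line wipes
the target, so the device path deserves a manual double-check before
anything is actually run.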

Figured this might stir up some interesting discussion.

Philadelphia Linux Users Group