Rich Mingin (PLUG) via plug on 7 Nov 2022 15:37:00 -0800



Re: [PLUG] Box won't boot after RAID drive swap


On Mon, Nov 7, 2022 at 6:09 PM Keith via plug
<plug@lists.phillylinux.org> wrote:
>
> On 11/7/22 17:54, Walt Mankowski via plug wrote:
> > On Mon, Nov 07, 2022 at 09:56:02PM -0500, Rich Mingin (PLUG) via plug wrote:
> >> On Mon, Nov 7, 2022 at 4:30 PM Rich Freeman via plug
> >> <plug@lists.phillylinux.org> wrote:
> >>> Basically it does the right thing in most circumstances.  Hard to be
> >>> certain what is going on here, but it could be that the distro has
> >>> overridden the behavior and is preventing a degraded array from
> >>> mounting, or the array just isn't finding any drives.  Keep in mind
> >>> this is a computer that can't even be reliably booted to firmware, so
> >>> this is getting beyond the scope of raid.
> >> Getting ahead of the first issue: don't blame failure to boot on the
> >> array when the computer is frequently failing to complete basic
> >> power-on tests before turning off again. There is a hardware issue,
> >> beyond just the disks. No OS is loaded at that point; if the box is
> >> powering off mid-POST, there absolutely is a hardware problem to
> >> identify and resolve before anything with md/LVM/etc. comes into play.
> >>
> >> Could be a loose cable, could be power supply damage by the failing
> >> disk, could be intermittent cosmic ray errors. Too little data to
> >> guess meaningfully, beyond needing more troubleshooting.
> > It seems to be both an mdadm issue and a hardware issue. The first
> > thing I did was remove the old drive and replace it with a new
> > one. It got well into the boot process but refused to mount the array
> > with one of the drives missing. It also refused to boot without my
> > external backup drive plugged in, presumably because I was mounting
> > /dev/sde as /backup in /etc/fstab. (I really need to check my default
> > settings if I can ever get this box to boot again!)
> >
> > This business with it shutting down before it even finishes booting
> > started after I put the old drive back.
> >
> > I need to check the hardware cables again, and also dumb stuff like
> > maybe I'm just not plugging in the power cable all the way. But along
> > with all these computer problems I've also come down with a nasty
> > chest cold, and I'm just not feeling up to crawling under my desk
> > again today.
> >
> > Walt
>
>
> Feel better Walt !!
>
> Based on your findings, one of Rich's posts, and one of Leroy's posts,
> I now have some questions for the list. After doing a quick Google
> myself, I'm also not finding any examples of replacing a failed drive
> in a RAID 1 without removing the drive ***first*** while it is online.
>
> 1) Has anyone ever run a degraded RAID 1 (i.e. only one disk online)
> that was created with mdadm?  Was that a boot set or a data set?
>
> 2) Has anyone ever replaced a failed RAID 1 disk with mdadm without
> first removing the bad disk while the system was up? What were your
> steps, and is this (or your process) documented somewhere?
>
> Sorry, I'm suddenly curious about this.
>
> --
> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
> Keith C. Perry, MS E.E.
> Managing Member, DAO Technologies LLC
> (O) +1.215.525.4165 x2033
> (M) +1.215.432.5167
> www.daotechnologies.com
>

Yes, feel better Walt. Sorry to hear you're under the weather.
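
On the /backup mount keeping the box from booting without the
external drive plugged in: on most systemd-based distros, an fstab
entry without "nofail" makes the boot wait on that device and then
drop to emergency mode when it's absent. A rough sketch of the kind
of line I'd use instead (the UUID and filesystem type below are
placeholders, not your actual values; blkid will show the real ones):

  # /etc/fstab -- external backup drive; don't block boot if it's absent
  # (UUID and fs type here are placeholders -- blkid shows the real ones)
  UUID=<backup-drive-uuid>  /backup  ext4  defaults,nofail  0  2

With nofail the drive still gets mounted when it's present, but a
missing drive no longer blocks the rest of the boot.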

I have had an mdadm mirror degrade and boot without other issues. I
don't have the box currently, but it was running Ubuntu 20.04. One
disk failed and the box became a little wobbly (disk timeout/reset
spam), so I shut it down hard. The disk was missing/offline after a
power cycle, but the machine booted fine. I ended up just degrading
the RAID and running on a single disk for a while, since I was making
plans to part out/rebuild the machine anyway. Didn't have any further
issues.
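
For what it's worth, the cleanup on that mirror was roughly the
following; the array and device names here are examples, not what was
actually on that box:

  # A [U_] pattern in /proc/mdstat means the mirror is missing a member
  cat /proc/mdstat

  # Full state of the array, including failed/removed slots
  mdadm --detail /dev/md0

  # Drop the record of the member that is no longer attached, so the
  # array stops reporting it as failed
  mdadm /dev/md0 --remove detached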

I have rebuilt the RAID6 in my media server without offlining/removing
the failed disk first. It's not exactly 1:1, but it should be useful
data. The member disk failed out; I physically removed it, added
another, and partitioned it to match the existing layout (UEFI ESP
"stub" partition, then an md member for all the remaining space). On
the next boot I was able to use mdadm to add it to the array, and the
array automatically resilvered correctly. I believe the array members
were listed as sd[a-d] online, sde offline/failed, sd[f-k] online.
When I added the new disk, sde was apparently "held," so the system
made it sdl. Just before the next reboot for updates, I used mdadm to
verify the array and then remove the "missing" member. After that
reboot, the member disks were sd[a-k] again, as before. The array is
still online without any other issues.
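
From memory, the sequence was roughly the following; the device
names, partition numbers, and md device here are examples rather than
the exact ones on that box:

  # Copy the GPT layout from a surviving member (/dev/sda) onto the
  # new disk (/dev/sdl), then randomize the new disk's GUIDs. Note the
  # argument order: the destination is the -R argument, the source is
  # the trailing device -- double-check before running.
  sgdisk -R /dev/sdl /dev/sda
  sgdisk -G /dev/sdl

  # Add the new disk's md partition to the array; the resync/rebuild
  # kicks off on its own
  mdadm --manage /dev/md0 --add /dev/sdl2

  # Watch the rebuild progress
  cat /proc/mdstat

  # Before the next reboot: confirm the state and clear the stale slot
  # left behind by the old, now-detached disk
  mdadm --detail /dev/md0
  mdadm /dev/md0 --remove detached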

Neither of these machines was using LVM, since I just haven't gotten
around to adding it to my standard lineup.
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug