Rich Mingin (PLUG) via plug on 7 Nov 2022 15:37:00 -0800
Re: [PLUG] Box won't boot after RAID drive swap
On Mon, Nov 7, 2022 at 6:09 PM Keith via plug <plug@lists.phillylinux.org> wrote:
>
> On 11/7/22 17:54, Walt Mankowski via plug wrote:
> > On Mon, Nov 07, 2022 at 09:56:02PM -0500, Rich Mingin (PLUG) via plug wrote:
> >> On Mon, Nov 7, 2022 at 4:30 PM Rich Freeman via plug
> >> <plug@lists.phillylinux.org> wrote:
> >>> Basically it does the right thing in most circumstances. Hard to be
> >>> certain what is going on here, but it could be that the distro has
> >>> overridden the behavior and is preventing a degraded array from
> >>> mounting, or the array just isn't finding any drives. Keep in mind
> >>> this is a computer that can't even be reliably booted to firmware, so
> >>> this is getting beyond the scope of raid.
> >>
> >> Getting ahead of the first issue. Don't blame failure to boot on the
> >> array when the computer is frequently failing to complete basic
> >> power-on tests before turning off again. There is a hardware issue
> >> beyond just the disks. No OS is loaded at that point; if the box is
> >> powering off mid-POST, there absolutely is a hardware problem to
> >> identify and resolve before anything with md/LVM/etc. comes into play.
> >>
> >> Could be a loose cable, could be power supply damage from the failing
> >> disk, could be intermittent cosmic ray errors. Too little data to
> >> guess meaningfully, beyond needing more troubleshooting.
> >
> > It seems to be both an mdadm issue and a hardware issue. The first
> > thing I did was remove the old drive and replace it with a new
> > one. It got well into the boot process but refused to mount the array
> > with one of the drives missing. It also refused to boot without my
> > external backup drive plugged in, presumably because I was mounting
> > /dev/sde as /backup in /etc/fstab. (I really need to check my default
> > settings if I can ever get this box to boot again!)
> >
> > This business with it shutting down before it even finishes booting
> > started after I put the old drive back.
> >
> > I need to check the hardware cables again, and also dumb stuff like
> > maybe I'm just not plugging in the power cable all the way. But along
> > with all these computer problems I've also come down with a nasty
> > chest cold, and I'm just not feeling up to crawling under my desk
> > again today.
> >
> > Walt
>
> Feel better Walt!!
>
> Based on your findings, one of Rich's posts, and one of Leroy's posts, I
> now have some questions for the list. After doing a quick Google myself,
> I'm also not finding any examples of replacing a failed drive in a RAID 1
> without removing the drive ***first*** while it is online.
>
> 1) Has anyone ever run a degraded RAID 1 (i.e. only one disk online)
> that was created with mdadm? Was that a boot set or a data set?
>
> 2) Has anyone ever replaced a failed RAID 1 disk with mdadm without
> first removing the bad disk while the system was up? What were your
> steps, and is this (or your process) documented somewhere?
>
> Sorry, I'm suddenly curious about this.
>
> --
> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
> Keith C. Perry, MS E.E.
> Managing Member, DAO Technologies LLC
> (O) +1.215.525.4165 x2033
> (M) +1.215.432.5167
> www.daotechnologies.com

Yes, feel better Walt. Sorry to hear you're under the weather.
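Walt, on the box refusing to boot with the backup drive unplugged: if the /backup line in /etc/fstab doesn't have the nofail option, a systemd-based distro will wait on the missing device and then drop to emergency mode instead of finishing the boot. A rough sketch of what the entry could look like (I'm guessing at the filesystem type, adjust for whatever is really on that drive):

    # external backup drive; nofail lets the box boot even when it's unplugged
    /dev/sde   /backup   ext4   defaults,nofail   0   2

Mounting by UUID instead of /dev/sde would also keep that entry stable when device names shuffle around after swapping disks.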
To the first question: I have had an mdadm mirror degrade and boot without other issues. I don't have the box any more, but it was running Ubuntu 20.04. One disk failed and the box became a little wobbly (disk timeout/reset spam), so I shut it down hard. The disk was missing/offline after a power cycle, but the machine booted fine. I ended up just leaving the RAID degraded and running on a single disk for a while, since I was already planning to part out/rebuild the machine anyway. Didn't have any further issues.

To the second question: I have rebuilt the RAID6 in my media server without offlining/removing the failed disk first. It's not exactly 1:1 with a RAID 1, but it should be useful data. The member disk failed out, I physically removed it, added another, and partitioned it to match the existing disks (UEFI ESP "stub" partition, then an md member spanning the remaining space). On the next boot I was able to use mdadm to add it to the array, and the array automatically resynced. I believe the array members were listed as sd[a-d] online, sde offline/failed, sd[f-k] online. When I added the new disk, sde was apparently still "held", so the system named it sdl. Just before the next reboot for updates, I used mdadm to verify the array and then remove the "missing" member. After that reboot the member disks were sd[a-k] again, as before. It's still online without any other issues.

Neither of these machines was using LVM, since I just haven't gotten around to adding it to my standard lineup.
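For the record, the command sequence on the RAID6 box was roughly the following. This is from memory, /dev/md0 and /dev/sdl2 are placeholder names rather than the literal ones from that machine, and the same steps apply to a two-disk RAID 1:

    # after physically swapping the disk and partitioning it to match the others
    cat /proc/mdstat                            # array should be up but degraded
    mdadm --detail /dev/md0                     # note the slot marked faulty/removed
    mdadm --manage /dev/md0 --add /dev/sdl2     # add the new member; resync starts on its own
    watch cat /proc/mdstat                      # keep an eye on the rebuild

    # once the rebuild finished, before the next reboot
    mdadm --detail /dev/md0                     # sanity check
    mdadm --manage /dev/md0 --remove detached   # drop the record of the now-absent old disk

mdadm also accepts "failed" in place of "detached" for --remove, if the old member still shows up as faulty rather than being gone entirely.

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug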