Rich Freeman on 29 Sep 2013 05:18:07 -0700



Re: [PLUG] Spanning volumes with LVM (Ubuntu)


On Sat, Sep 28, 2013 at 6:26 PM, Matt Mossholder <matt@mossholder.com> wrote:
>
> There are several issues with utilizing RAID5 for MythTV storage. The
> biggest is that when you write data that doesn't fill an entire stripe, you
> end up having to read the stripe in first, so that you can then recalculate
> the parity of the new stripe.

That is really an issue with any parity-based RAID in general (RAID4/5/6
basically - plain mirroring and striping don't have it).  Nothing
MythTV-specific about it.

I don't really buy that MythTV wiki page.  Unless you're putting a
lot of demand on your array it shouldn't have trouble keeping up
with multiple HD streams - mine certainly didn't, as long as there is
decent RAM for the cache.  Even with multiple streams there is no
reason it can't write data out in stripe-sized chunks as long as your
buffers are sized right...
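To make the read-modify-write cycle concrete, here is a minimal sketch of
the XOR parity math (illustrative function names, not mdadm's actual code;
I'm assuming one parity chunk per stripe as in RAID5):

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_parity(data_chunks):
    # Full-stripe write: parity is computed from the new data alone,
    # so no reads from disk are needed.
    parity = data_chunks[0]
    for chunk in data_chunks[1:]:
        parity = xor(parity, chunk)
    return parity

def partial_write_parity(old_chunk, old_parity, new_chunk):
    # Partial-stripe write: the old data chunk and old parity must be
    # READ first, then new_parity = old_parity ^ old_data ^ new_data.
    # Those extra reads are the "read-modify-write" penalty.
    return xor(xor(old_parity, old_chunk), new_chunk)
```

Updating one chunk via the partial-write path yields the same parity as
recomputing the whole stripe - but only after two extra reads, which is why
full stripe-sized writes are so much cheaper.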

>
> I noticed when I was previously recording to RAID5 that I would get
> occasional corruption in the data. This disappeared completely when I
> rebuilt to use stand-alone recording drives.

A few years ago I used to get that problem.  I resolved it by
disabling fsync in the MythTV ThreadedFileWriter routine and
increasing its buffer sizes.  That is just 3-4 lines of code to change
and it is pretty trivial.  In more recent versions it hasn't been a
problem so I don't bother with it.  The routine already slowly dumps
data into the OS file buffer, so all you need to do is comment out the
line that tries to fsync it so that the OS can just write it to disk
in its own time.  Not sure if they fixed it or if something else
changed, but back in the ~0.24 days the buffer was pretty small
(megabytes) and they'd try to fsync out really small chunks of it, so
for a nice hour-long show there were probably tens of thousands of
fsyncs.
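A back-of-envelope sketch of where "tens of thousands" comes from; the
bitrate and chunk size below are my assumptions for illustration, not
measured MythTV values:

```python
# Rough fsync count for one hour-long HD recording (all figures assumed).
bitrate_mb_s = 2.5      # ballpark HD MPEG-2 stream, MB/s (assumption)
chunk_kb = 256          # hypothetical small flush chunk (assumption)
seconds = 3600          # one hour

total_mb = bitrate_mb_s * seconds           # ~9000 MB recorded
fsyncs = total_mb * 1024 / chunk_kb         # one fsync per chunk
# With these numbers: 36,000 fsyncs for a single recording,
# and that multiplies with each simultaneous stream.
```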

Ok, start of rant...

Recording multiple HD streams into small memory buffers and then
dumping the data out in small chunks to multiple files with fsyncs on
every write is just brain-dead - certainly doing that on a striped
RAID is going to cause problems.  There is no need to fsync at all for
something like MythTV - you just write to the file and it is the
kernel's job to make sure it makes it to disk.  Without the fsync, if
the system crashes there is some risk that you'll lose a minute of
video or whatever, but you're going to lose more than that while the
system reboots anyway and isn't recording.  In contrast, those fsyncs
from small buffers are going to kill the drive and result in buffer
overruns.  I have no idea why the MythTV authors did it that way, and
removing that fsync GREATLY improves performance.

For whatever reason there seem to be people out there who think that
you should fsync anytime you write to a file if you don't want to lose
the data.  Fsync should really only be used when you're dealing with
cross-network/process transactions (such as in a database) to ensure
that operations are consistent.  You can go read the big long post by
the ext3 maintainers about how what I just said is wrong, but then
I'll just point you to the short post by Linus where he modified their
filesystem code against their wishes so that it works just fine
without all the fsyncs.  Fsync just puts an unnecessary constraint on
the OS - it tells the OS to forget all its smart buffer/cache
management algorithms and just write this one piece of data right now
no matter what, and when every process acts this selfishly it kills
the performance of any drive configuration, but striped RAID even
more so.
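The two write patterns can be sketched like this (function names are mine,
not MythTV's - a toy illustration of the difference, not its actual
ThreadedFileWriter code):

```python
import os

def write_relaxed(path, chunks):
    # Just write; the kernel's writeback machinery decides when and in
    # what order the data actually hits the disk, so it can batch the
    # chunks into large, stripe-friendly writes.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)

def write_paranoid(path, chunks):
    # Force every chunk to the platter immediately.  This defeats the
    # page cache, and on parity RAID each small flush can trigger a
    # read-modify-write cycle of its own.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
```

Both produce byte-identical files; the only thing the fsync buys you is a
hard ordering guarantee that a video recorder simply doesn't need.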

Ok, getting off the fsync soapbox...  For most applications RAID5 is
going to be perfectly fine, especially for things like MythTV that are
writing large multimedia files.  That said, the whole
read-before-write issue in RAID is part of the reason that I switched
to btrfs.  The other issue is that I happened to run into parity
errors, which expose a real weakness in mdadm RAID's design.  If a
drive outright fails mdadm handles that just fine.  If for whatever
reason data changes on one of the drives there is no way for it to
know which n-1 combination of drives has the right data to rebuild the
bad one.  Btrfs and ZFS use block-level checksums so that it is always
clear if any particular block is right or not, and then both will use
redundant blocks if available to fix things.  The COW approach also
handles modifications to existing files better - in-place
modifications of files are basically transactional and after a crash
you'll either end up with the old data or the new data.  I think ext3
also achieves that with Linus's change to the default behaviors (that
was the part he got into a fight with the maintainers over - they
wanted it to lose existing data by default if you didn't fsync after
every write).
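A toy sketch of that difference, using CRC32 standing in for the real
btrfs/ZFS checksums and a two-data-block stripe with XOR parity:

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

blocks = {"d0": b"hello", "d1": b"world"}
parity = xor(blocks["d0"], blocks["d1"])
# Per-block checksums, stored separately as btrfs/ZFS do:
sums = {k: zlib.crc32(v) for k, v in blocks.items()}

# Simulate silent corruption of d0 on disk:
blocks["d0"] = b"hellx"

# Parity alone only says "something is wrong" - it cannot tell whether
# d0, d1, or the parity block itself is the bad one:
assert xor(blocks["d0"], blocks["d1"]) != parity

# Per-block checksums pinpoint the corrupt block...
bad = [k for k, v in blocks.items() if zlib.crc32(v) != sums[k]]
# ...so the redundant copy (here, the parity) can rebuild it:
for k in bad:
    other = "d1" if k == "d0" else "d0"
    blocks[k] = xor(parity, blocks[other])
```

With only n-1 parity relationships and no checksums, mdadm in the same
situation has to guess which drive to trust.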

Sorry about the rant.  :)

Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug