Keith C. Perry on 22 Apr 2016 00:01:07 -0700



Re: [PLUG] [plug-announce] TOMORROW - Tue, Apr 19, 2016: PLUG North - "Linux Containers" by Jason Plum and Rich Freeman (*6:30 pm* at CoreDial in Blue Bell)


Ah, so you're referring to the COW functionality of the fs.  I understand now.  Fair point, XFS is not a COW-type fs, but now that I understand your point, my fs of choice in that regard is NILFS2, which doesn't specifically have writable snapshots (yet).  So I can understand why it might not have been looked at, though for what is desired in ephemeral container use, a log-structured file system would work.
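
A rough sketch of what that looks like with nilfs-utils, in case anyone wants to play with it (the checkpoint number, device and mount point are made up; double-check the man pages for the exact flags):

  mkcp -s                        # create a checkpoint and mark it as a snapshot
  lscp                           # list checkpoints/snapshots and their numbers
  chcp ss 12                     # promote existing checkpoint 12 to a snapshot
  mount -t nilfs2 -r -o cp=12 /dev/sdb1 /mnt/snap   # mount that snapshot read-only
  chcp cp 12                     # demote it back to a plain checkpoint when done

The snapshots are read-only, which is exactly the limitation I mentioned, but for throwaway container images that may be good enough.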

"Sure, but when you get a silent bit flip on a disk the following happens:
1.  mdadm reads the bad data off the disk and notes the drive did not
return any errors.
2.  mdadm hands the data to lvm, which hands it to btrfs.
3.  btrfs checks the checksum and it is wrong, so it reports a read failure."

True, but if it were that "easy" to silently flip a bit, software RAID wouldn't be a thing.  That's more of a problem with hardware RAID because the controller can lie about the fsync status.  Worse than that, even on good cards, RAID subsystem issues like the classic battery failure can go unreported as well.  Beyond LVM's own checksum facilities there is also dmeventd, which should detect problems, so issues like what you describe should never happen silently.  When detected, there are a number of things that can be done to deal with failures (i.e. a bad disk in a RAID, JBOD or mirror set).  Every fs would be vulnerable to this situation, but it's just not something I've ever seen personally.  I can't remember the last time I heard of someone working directly with mdadm having such a problem.
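
Case in point, checking an md array for exactly this kind of disagreement is a one-liner (md0 is just an example):

  echo check > /sys/block/md0/md/sync_action   # start a background consistency check
  cat /proc/mdstat                             # watch the check progress
  cat /sys/block/md0/md/mismatch_cnt           # non-zero means copies/parity disagreed
  mdadm --detail /dev/md0                      # overall array state

Granted, as you note below, md can only tell you that the copies disagree, not which one is right.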

"Now, if you had a bunch of lvm volumes on separate drives and striped
btrfs across them that would work fine, but it kind of defeats the
purpose of using lvm in the first place.  You certainly wouldn't try
to snapshot such a beast while it was online!"

LOL, yeah, but now that you put it out there, someone will try it!


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Owner, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Thursday, April 21, 2016 8:04:54 PM
Subject: Re: [PLUG] [plug-announce] TOMORROW - Tue, Apr 19, 2016: PLUG North - "Linux Containers" by Jason Plum and Rich Freeman (*6:30 pm* at CoreDial in Blue Bell)

On Thu, Apr 21, 2016 at 6:12 PM, Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> "Sure, but there are downsides to running something like btrfs on LVM.
> LVM is also not feature-complete with btrfs even with regard to
> snapshotting.  I can reflink a single file in btrfs but there is no
> practical way to alias the cp command to create COW snapshots with
> LVM."
>
> Ahh, ok, if "reflink" means individual file recovery, then I see that point.

A reflink is a copy-on-write copy of a file.  Just as there is a
system call to copy or move a file, there is one to create a reflink.
If I have a 200GB file on a btrfs filesystem with 2GB of free space, I
can make a reflink copy of it which uses no additional space initially
(just an inode and some metadata).  If I make changes to the file the
changed extents will consume space.  Making a reflink is slower than a
subvolume snapshot or a hard link, but much faster than a file copy.
Unlike a hard link a modification to a reflink of a file doesn't touch
the original copy.
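
For example, with coreutils' cp (file names made up):

  cp --reflink=always huge.img huge-clone.img   # instant COW copy; errors out if the fs can't reflink
  du -h huge.img huge-clone.img                 # both names still share the same extents
  # writes to the clone allocate new extents; huge.img is untouched
  dd if=/dev/urandom of=huge-clone.img bs=1M count=100 conv=notrunc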

> "You also get the raid hole back
> when striping, and some other optimizations that are possible when the
> filesystem handles the device management.  I imagine that LVM would
> also not cope as well with striping across devices of mixed sizes and
> so on."
>
> Well this is interesting...  I've never heard of the "raid hole", but more importantly, one of the ideas behind using LVM is that file systems are poor at device management.  If they tried to do it themselves, they would become bloated with code that has little to do with implementing a good file system.  It's possible that BTRFS and ZFS go against that paradigm, but that doesn't make much sense to me, especially in the context of "do one thing and do it well".  However, I suspect there are plenty of use cases for both approaches at this point.

This was a popular counter-argument to zfs when it was first
introduced (btrfs wasn't even a concept then):
https://lkml.org/lkml/2006/6/9/389

There is a lot more to it than just convenience.  By having knowledge
of the used/free state of every block a COW filesystem can ensure that
data is never overwritten in place which greatly reduces the risk of
loss if a write is interrupted.  A traditional raid has to overwrite a
stripe in place leaving a period of time where the stripe is
inconsistent while it is partially written.
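
As an aside, newer mdadm can at least narrow that window with a journal device, if I'm remembering the option right (device names are illustrative):

  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd \
        --write-journal /dev/nvme0n1p1   # stripes get logged before being written in place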

>
> "I'm not suggesting that btrfs really is production-ready.  I'm just
> saying that you don't get the same functionality by building on LVM,
> and anybody using btrfs actually loses functionality if they put it on
> top of LVM."
>
> Is that really true though?  I can now see, based on what you said above, why someone might feel that BTRFS (or ZFS) includes enough utility to do good volume management.  But as far as file system features go, excluding RAID functionality, how would BTRFS on top of, say, /dev/sdb be "better" than BTRFS on top of /dev/myServer/myData (myServer being the VG and myData being the LV)?  It could be my ignorance of BTRFS, but a device is a device, so the fs should not know or even care that it is on an LV.

Btrfs doesn't know or even care that it is on an LV.  However, since
it is on a single volume it behaves in single-volume mode.  It can't
store its data redundantly across multiple disks.
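
For comparison, handing btrfs the raw disks directly looks something like this (devices and mount point made up):

  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc   # mirror data and metadata across both disks
  mount /dev/sdb /mnt/pool
  btrfs filesystem df /mnt/pool                    # shows the raid1 allocation profiles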

But wait, you say, LVM+mdadm can redundantly store the data across
multiple disks.

Sure, but when you get a silent bit flip on a disk the following happens:
1.  mdadm reads the bad data off the disk and notes the drive did not
return any errors.
2.  mdadm hands the data to lvm, which hands it to btrfs.
3.  btrfs checks the checksum and it is wrong, so it reports a read failure.

Without mdadm+lvm btrfs would have just read the redundant drive(s)
and returned the correct data without a failure.  Perhaps re-accessing
the data might cause it to be read correctly in this scenario if mdadm
picked a different copy, but I'm not sure about that.  If the problem
were a single bad bit in a raid5 stripe it would need to pick the
right set of n-1 drives to send the correct data to btrfs.  You could
trigger a raid scrub, which would just detect the error but not be
able to correct it since it won't know which copy is bad.
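
A btrfs scrub on a filesystem with its own redundancy, on the other hand, can both detect and repair, because the checksum identifies which copy is the bad one:

  btrfs scrub start -B /mnt/pool   # read everything, verify checksums, rewrite bad copies from the good mirror
  btrfs scrub status /mnt/pool     # reports corrected vs. uncorrectable errors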

Now, if you had a bunch of lvm volumes on separate drives and striped
btrfs across them that would work fine, but it kind of defeats the
purpose of using lvm in the first place.  You certainly wouldn't try
to snapshot such a beast while it was online!


-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug