Re: [PLUG] Virtualization clusters & shared storage

On Tue, Aug 14, 2018 at 5:59 PM, Keith C. Perry <kperry@daotechnologies.com> wrote:

LizardFS scales to 12 exabytes...

Ceph I could not quickly find- call it the same or more...

I think you or anyone else using either system would be fine :D

Both CephFS and LizardFS keep metadata separate from data, which is far better than storing the two together. I can give you some horror stories about that with commercial systems. In this regard there is going to be little difference between Ceph and LizardFS. They're both going to be able to perform well with proper builds. As far as I've seen reported, LizardFS does tend to be faster than other parallel systems, but I don't remember seeing a comparison to CephFS. I suspect they are probably similar in real-workload IOPS.

My ZFS idea was a throwaway LOL. I don't know enough about that system to make a recommendation other than to 1) not use ZFS and 2) bring up another node if you want more redundancy. Honestly, that would be my answer in any case unless you want to archive.

My hashing idea was also a throwaway, but I don't have the ECC concerns, and you are right about checking all copies in the case of mirroring (erasure coding would be even more of a pita to check). However, scrubbing IS part of the system.

ZFS under LizardFS without ECC ram... As I said before, there are no free lunches- you either have ECC ram or you don't. ZFS without ECC ram is NOT as safe as ZFS with ECC ram.

https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

The most important line in that is this:

"All that stuff about ZFS self-healing goes down the drain if the system isn't using ECC RAM"

It's not ZFS that makes you **safer**, it's ECC RAM. You can debate the merits of ZFS versus other filesystems, but that is a different conversation.

Actually, after reading that link I definitely would not recommend ZFS under a parallel file system without ECC. The potential for making matters worse is non-zero and as you point out, now we have the parallel file system correcting errors when it shouldn't have to.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Keith C. Perry, MS E.E.
Managing Member, DAO Technologies LLC
(O) +1.215.525.4165 x2033
(M) +1.215.432.5167
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Monday, August 13, 2018 6:25:00 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Mon, Aug 13, 2018 at 5:28 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
>
> I chose that link to illustrate that Ceph typically needs more steps
> to manage. In Ceph it re-balances automatically as expected OSD but
> adding and removing an OSD is more manual. In LizardFS, re-balancing
> is re-balancing. Adding or removing resources is the same thing,
> failed or not (unless you have data with a goal of 1) because LizardFS
> concerns itself with the state of **file** chunks. Naturally this
> implies that the state of chunk server is known but the important
> difference is that LizardFS works at a lower level. So to me, that's
> 3 wins... better granularity, simpler and more flexible management.
> Ceph is a different product- a product with wider scope and for that
> reason there is more complexity in their offerings. The FS part is
> maturing fast but they are really more known for their object storage
> and block devices.

I'm not sure I entirely agree with all of that statement, but after
reading up a bit more on LizardFS one contrast I will make is that
they've made a few different design choices that have pros and cons.
One of Ceph's big objectives was to eliminate all bottlenecks so that
it can scale really big. This led to driving object storage out of a
hash function. There is no central database of what object is stored
where - the storage is algorithmic, so anybody with the shared library
and the configuration can figure out exactly where any object will be
stored. This means that clients can go directly to storage nodes and
there is no master server (for object/volume storage - the POSIX
filesystem layer does require a metadata server that is more of a
bottleneck). However, this also means that when you add one node the
configuration changes, which changes the input to the hash function,
and it basically changes the destination of every single chunk of data
everywhere and causes a huge rebalance. Apparently when the cluster
gets really tiny that hash function has some bugs that can cause
issues, and from a bit of searching online that might not be entirely
limited to small clusters, but I didn't see many recent reports of
this problem.
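
To make the "storage is algorithmic" point concrete, here is a minimal
sketch of hash-driven placement. It is NOT Ceph's CRUSH algorithm, which
is considerably more sophisticated; it is plain hash-modulo placement
with made-up node and object names, just to show that a client can
compute a location from the name plus the node list, and that changing
the node list changes the answers.

```python
import hashlib

def place(object_name, nodes):
    """Pick a node using only the object name and the node list (no lookup table)."""
    digest = hashlib.sha256(object_name.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def moved_fraction(objects, old_nodes, new_nodes):
    """Fraction of objects whose computed node differs between two node lists."""
    moved = sum(place(o, old_nodes) != place(o, new_nodes) for o in objects)
    return moved / len(objects)

# Hypothetical cluster: three nodes, then a fourth is added.
objects = [f"obj-{i}" for i in range(10_000)]
old = ["node1", "node2", "node3"]
new = old + ["node4"]
print(f"{moved_fraction(objects, old, new):.0%} of objects change node")
```

With this naive scheme, going from three to four nodes relocates roughly
three quarters of the objects, which is the kind of large rebalance
described above.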

LizardFS has a master server, which means it can be a lot more
flexible about where stuff is stored, since the master server just
keeps a database of what goes where. When you add a node it doesn't
have to move everything around - it just allocates new stuff to the
new nodes and records where it put everything. It of course backs up
this critical metadata, but at any one time there is only one master
server, and every single operation ends up hitting the master server
to figure out where the data goes. Now, it looks like they tried to
minimize the IO hitting the master server for obvious reasons, but it
isn't going to be infinitely scalable.
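
As a rough sketch of the master-server approach (a toy model, not
LizardFS code; the class, method, and node names are all made up), the
master simply records where each chunk went, so adding a node leaves
existing placements alone:

```python
class Master:
    """Toy stand-in for a metadata master: remembers where each chunk lives."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.location = {}                      # chunk id -> node name
        self.load = {n: 0 for n in self.nodes}  # chunks stored per node

    def add_node(self, node):
        # Nothing is moved; the new node just becomes a candidate for new chunks.
        self.nodes.append(node)
        self.load[node] = 0

    def store(self, chunk_id):
        # Send the new chunk to the least-loaded node and record the decision.
        node = min(self.nodes, key=lambda n: self.load[n])
        self.location[chunk_id] = node
        self.load[node] += 1
        return node

    def lookup(self, chunk_id):
        # Every read has to ask the master; this is the scaling limit.
        return self.location[chunk_id]

master = Master(["node1", "node2"])
for i in range(6):
    master.store(f"chunk-{i}")
before = dict(master.location)

master.add_node("node3")
for i in range(6, 9):
    master.store(f"chunk-{i}")
```

After the add, the six original chunks are still where they were, and
the three new ones land on the empty node3.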

> You certainly could use ZFS under LizardFS and then figure out how to integrate the ZFS snapshot process into a "backup" procedure for the cluster.

That sounds incredibly problematic. You have many nodes. You'd need
to somehow pause the cluster so that those nodes all reach a
consistent state, then do the snapshots everywhere, and then resume
the cluster. Then when you do the backup you're capturing all the
redundancy in your backup since ZFS doesn't know anything about that,
so your backups are wasteful. During the pause you're effectively
offline.

>
> I looked up the cost of RAM for my machine intelligence workstation... 8GB ECC was ~$118 and 8GB non-ECC was $83...

So, I'm talking commodity hardware. A used COMPUTER at Newegg costs
$100-150 total. Clearly we're not talking about spending $83 on RAM.

However, the relative prices are about right for the RAM itself.

Again, though, you also need a CPU+motherboard that supports ECC. For
AMD this doesn't really cost you more as long as you're careful when
selecting the motherboard. For Intel it will likely cost you quite a
bit more, as they deliberately disable ECC on any CPU in the sweet
spot price-wise.

And I'm still not aware of any cheap ECC ARM solutions.

> I suppose as a protection mechanism you could generate signatures
> (with a fast hash like SHA-1) for every file on your system and store
> them in a database so that you can periodically run a check against your
> parallel filesystem storage. You could prove to yourself in a couple
> of years that your solution is viable (or not) without ECC.

This suffers for a few reasons:

1. I want to check all the redundant copies, and reading the files
with a filesystem call will only check one copy (no idea which).
2. This necessitates moving all the data for each file to a single
node to do the checksum, which means a ton of network traffic.
3. This operation would be ignorant of how data is stored on-disk,
which means purely random IO all over the cluster. Sure, the cluster
might be able to handle it, but it is unnecessary disk seeking.
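
For reference, the signature-database idea being discussed could look
something like the sketch below. It is only an illustration: the table
name, function names, and use of SQLite are my assumptions, and it reads
files through ordinary filesystem calls, so it has exactly the
limitations listed above.

```python
import hashlib
import os
import sqlite3

def file_digest(path, bufsize=1 << 20):
    """SHA-1 of a file, read in blocks so large files don't fill memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

def record(db, root):
    """Walk root and store (path, digest) for every file found."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS sigs (path TEXT PRIMARY KEY, digest TEXT)"
    )
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            db.execute(
                "INSERT OR REPLACE INTO sigs VALUES (?, ?)",
                (path, file_digest(path)),
            )
    db.commit()

def verify(db):
    """Return paths that are missing or whose digest no longer matches."""
    rows = db.execute("SELECT path, digest FROM sigs").fetchall()
    return [
        path for path, digest in rows
        if not os.path.exists(path) or file_digest(path) != digest
    ]
```

Note that a legitimately modified file shows up as a mismatch too, so in
practice record() would have to be re-run after intended changes.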

Ideally this scrubbing should be part of the filesystem, so that all
the nodes can do a low-priority scrub of the data stored on that node.
This could be done asynchronously with no network traffic (except to
report status or correct errors), and at idle priority on each node.
This would also hit all the redundant copies.
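
A minimal sketch of what such a node-local scrub might look like,
assuming a hypothetical on-disk layout where each chunk file has a
".crc" sidecar holding its CRC32 (this is not how LizardFS or any other
real system stores checksums; the names are made up):

```python
import os
import zlib

def write_chunk(chunk_dir, name, data):
    """Store a chunk plus a sidecar file holding its CRC32 (hypothetical layout)."""
    with open(os.path.join(chunk_dir, name), "wb") as f:
        f.write(data)
    with open(os.path.join(chunk_dir, name + ".crc"), "w") as f:
        f.write(str(zlib.crc32(data)))

def scrub(chunk_dir):
    """Re-read every local chunk and return the names whose CRC no longer matches."""
    bad = []
    for name in sorted(os.listdir(chunk_dir)):
        if name.endswith(".crc"):
            continue
        path = os.path.join(chunk_dir, name)
        with open(path, "rb") as f:
            actual = zlib.crc32(f.read())
        with open(path + ".crc") as f:
            expected = int(f.read())
        if actual != expected:
            bad.append(name)
    return bad
```

Each node would run scrub() only over its own chunk directory, ideally
under something like nice/ionice, and send just the list of bad chunk
names over the network.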

Now, for LizardFS on top of ZFS I'm not super-worried about silent
corruption of data at rest, because ZFS already detects that. The
only issue would be triggering renewal of any bad files. ZFS could
tell you that a particular file (at the lower level) is corrupt, but
then you need to get LizardFS to regenerate that file from the rest of
the cluster. It would get an IO error when it goes to read that file,
so as long as scrubs read all the data on disk you would be fine,
because ZFS would generate an IO error reading a file it knows is bad,
and then presumably LizardFS would go and find the redundant data and
regenerate it, overwriting the bad file with a new good one.

This wouldn't cover memory issues though that impacted the data before
ZFS wrote it to disk. If ZFS is handed corrupt data to store, it will
simply ensure that the data remains in the same corrupt state it was
given.

--
Rich
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug