Rich Freeman on 13 Aug 2018 15:25:18 -0700



Re: [PLUG] Virtualization clusters & shared storage


On Mon, Aug 13, 2018 at 5:28 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
>
> I chose that link to illustrate that Ceph typically needs more steps
> to manage.  In Ceph it re-balances automatically as expected when an
> OSD fails, but adding and removing an OSD is more manual.  In
> LizardFS, re-balancing is re-balancing.  Adding or removing resources
> is the same thing, failed or not (unless you have data with a goal of
> 1), because LizardFS concerns itself with the state of **file**
> chunks.  Naturally this implies that the state of the chunk server is
> known, but the important difference is that LizardFS works at a lower
> level.  So to me, that's 3 wins... better granularity, simpler and
> more flexible management.  Ceph is a different product- a product
> with wider scope, and for that reason there is more complexity in
> their offerings.  The FS part is maturing fast, but they are really
> better known for their object storage and block devices.

I'm not sure I entirely agree with all of that, but after reading up a
bit more on LizardFS, one contrast I will make is that they've made a
few different design choices, each with pros and cons.
One of Ceph's big objectives was to eliminate all bottlenecks so that
it can scale really big.  This led to driving object placement with a
hash function (CRUSH).  There is no central database of what object is
stored where - placement is algorithmic, so anybody with the shared
library and the configuration can figure out exactly where any object
will be stored.  This means that clients can go directly to storage
nodes and
there is no master server (for object/volume storage - the POSIX
filesystem layer does require a metadata server that is more of a
bottleneck).  However, this also means that when you add one node the
configuration changes, which changes the input to the hash function,
and it basically changes the destination of every single chunk of data
everywhere and causes a huge rebalance.  Apparently when the cluster
gets really tiny that hash function has some bugs that can cause
issues, and from a bit of searching online that might not be entirely
limited to small clusters, but I didn't see many recent reports of
this problem.
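
A toy illustration of why (naive modulo hashing here, not Ceph's
actual CRUSH algorithm, which works hard to minimize movement - but
the principle that a config change reshuffles the function's output
is the same):

    import hashlib

    def placement(object_id, num_nodes):
        # Any client with the same function and config computes the
        # same answer, so no central lookup table is needed.
        digest = hashlib.sha256(object_id.encode()).hexdigest()
        return int(digest, 16) % num_nodes

    objects = ["obj-%d" % i for i in range(10000)]
    before = {o: placement(o, 5) for o in objects}
    after = {o: placement(o, 6) for o in objects}  # add a sixth node

    moved = sum(1 for o in objects if before[o] != after[o])
    print("%.0f%% of objects change nodes"
          % (100.0 * moved / len(objects)))
    # ~83% with naive modulo: nearly everything has to rebalance.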

LizardFS has a master server, which means it can be a lot more
flexible about where stuff is stored, since the master server just
keeps a database of what goes where.  When you add a node it doesn't
have to move everything around - it just allocates new stuff to the
new nodes and records where it put everything.  It of course backs up
this critical metadata, but at any one time there is only one master
server, and every single operation ends up hitting the master server
to figure out where the data goes.  Now, it looks like they tried to
minimize the IO hitting the master server for obvious reasons, but it
isn't going to be infinitely scalable.
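
In toy form the contrast looks like this (a sketch of the general
idea only, not LizardFS's actual metadata structures):

    class MasterServer:
        # Toy metadata server: a database of chunk -> node placements.
        def __init__(self, nodes):
            self.nodes = list(nodes)
            self.locations = {}   # chunk_id -> node
            self.next_slot = 0

        def place(self, chunk_id):
            # New chunks go wherever the master decides; the answer
            # is recorded, not recomputed from a hash.
            node = self.nodes[self.next_slot % len(self.nodes)]
            self.next_slot += 1
            self.locations[chunk_id] = node
            return node

        def lookup(self, chunk_id):
            # Every client operation starts with a query like this --
            # flexible, but the master sits on the hot path.
            return self.locations[chunk_id]

        def add_node(self, node):
            # Nothing already stored moves; the new node simply
            # starts receiving new allocations.
            self.nodes.append(node)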

> You certainly could use ZFS under LizardFS and then figure out how to integrate the ZFS snapshot process into a "backup" procedure for the cluster.

That sounds incredibly problematic.  You have many nodes.  You'd need
to somehow pause the cluster so that all of those nodes reach a
consistent state, then take the snapshots everywhere, and then resume
the cluster.  Then when you run the backup you're capturing all of
the cluster's redundancy, since ZFS knows nothing about it, so your
backups are wasteful.  And during the pause you're effectively
offline.
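
Concretely, the procedure would have to look something like this (a
hypothetical sketch: the hostnames, dataset, and service name are
assumptions, and the "pause" is just stopping the chunkservers, since
I'm not aware of a real cluster-wide pause in LizardFS):

    import subprocess

    NODES = ["cs1", "cs2", "cs3"]   # hypothetical chunkserver hosts
    DATASET = "tank/lizardfs"       # hypothetical backing dataset

    def run(host, cmd):
        # Run a command on a remote node over ssh; abort on failure.
        subprocess.run(["ssh", host, cmd], check=True)

    # 1. Quiesce the cluster so every node reaches a consistent state
    #    (stand-in for a pause: stop the chunkserver everywhere).
    for host in NODES:
        run(host, "systemctl stop lizardfs-chunkserver")

    # 2. Snapshot the backing ZFS dataset on every node.
    for host in NODES:
        run(host, "zfs snapshot " + DATASET + "@backup")

    # 3. Resume.  Everything between steps 1 and 3 is downtime.
    for host in NODES:
        run(host, "systemctl start lizardfs-chunkserver")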

>
> I looked up the cost of RAM for my machine intelligence workstation...  8GB ECC was ~$118 and 8GB non-ECC was $83...

So, I'm talking commodity hardware.  A used COMPUTER at Newegg costs
$100-150 total.  Clearly we're not talking about spending $83 on RAM.

However, the relative prices are about right for the RAM itself.

Again, though, you also need a CPU+motherboard that supports ECC.  For
AMD this doesn't really cost you more as long as you're careful when
selecting the motherboard.  For Intel it will likely cost you quite a
bit more, as they deliberately disable ECC on any CPU in the sweet
spot price-wise.

And I'm still not aware of any cheap ECC ARM solutions.

> I suppose as a protection mechanism you could generate signatures
> (with a fast hash like SHA-1) for every file on your system and store
> them in a database so that you can periodically run a check against
> your parallel filesystem storage.  You could prove to yourself in a
> couple of years that your solution is viable (or not) without ECC.
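
For concreteness, a minimal sketch of that kind of external check
(the database file and table name are made up):

    import hashlib
    import os
    import sqlite3

    db = sqlite3.connect("signatures.db")  # hypothetical database
    db.execute("CREATE TABLE IF NOT EXISTS sigs"
               " (path TEXT PRIMARY KEY, sha1 TEXT)")

    def sha1_of(path):
        # Stream the file so large files don't have to fit in memory.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    def record(root):
        # First pass: store a signature for every file.
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                db.execute("INSERT OR REPLACE INTO sigs VALUES (?, ?)",
                           (p, sha1_of(p)))
        db.commit()

    def verify():
        # Later pass: re-hash everything and compare.
        for path, expected in db.execute("SELECT path, sha1 FROM sigs"):
            if sha1_of(path) != expected:
                print("MISMATCH:", path)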

This suffers for a few reasons:

1.  I want to check all the redundant copies, and reading the files
with a filesystem call will only check one copy (no idea which).
2.  This necessitates moving all the data for each file to a single
node to do the checksum, which means a ton of network traffic.
3.  This operation would be ignorant of how data is stored on-disk,
which means purely random IO all over the cluster.  Sure, the cluster
might be able to handle it, but it is unnecessary disk seeking.

Ideally this scrubbing should be part of the filesystem, so that all
the nodes can do a low-priority scrub of the data stored on that node.
This could be done asynchronously with no network traffic (except to
report status or correct errors), and at idle priority on each node.
This would also hit all the redundant copies.
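
Roughly the shape of what I mean, run independently on each node (a
sketch only - the chunk directory and the checksum-sidecar convention
are made up, not LizardFS's real on-disk format):

    import hashlib
    import os
    import time

    CHUNK_DIR = "/var/lib/chunkserver"  # hypothetical chunk store

    def report_bad_chunk(path):
        # Stub: a real implementation would ask the cluster to
        # re-replicate this chunk from a good copy elsewhere.
        print("bad chunk:", path)

    def scrub_node():
        # Each node checks only its own local chunks: no network
        # traffic, and every redundant copy gets verified on the
        # node that holds it.
        os.nice(19)  # idle-ish CPU priority; real work always wins
        for dirpath, _, names in os.walk(CHUNK_DIR):
            for name in names:
                if name.endswith(".sha1"):
                    continue  # skip the checksum sidecars themselves
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    data = f.read()
                # Hypothetical convention: each chunk has a
                # <name>.sha1 sidecar with its expected checksum.
                with open(path + ".sha1") as f:
                    expected = f.read().strip()
                if hashlib.sha1(data).hexdigest() != expected:
                    report_bad_chunk(path)
                time.sleep(0.01)  # throttle to stay low-impact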

Now, for LizardFS on top of ZFS I'm not super-worried about silent
corruption of data at rest, because ZFS already detects that.  The
only issue is triggering renewal of any bad files.  ZFS can tell you
that a particular file (at the lower level) is corrupt, but then you
need LizardFS to regenerate that file from the rest of the cluster.
Since ZFS returns an IO error rather than data it knows is bad,
LizardFS would hit that error whenever it reads the file.  So as long
as LizardFS's scrubs read all the data on disk, you'd be fine:
presumably LizardFS would see the error, find the redundant data
elsewhere, and regenerate the chunk, overwriting the bad file with a
good one.

This wouldn't cover memory issues that corrupt the data before ZFS
writes it to disk, though.  If ZFS is handed corrupt data to store, it
will faithfully ensure that the data remains in exactly the corrupt
state it was given.

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug