Rich Freeman on 12 Aug 2018 12:14:14 -0700


Re: [PLUG] Virtualization clusters & shared storage


On Sun, Aug 12, 2018 at 1:34 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> Rich, I was frustrated last year trying to find comparisons myself
> between GlusterFS, Ceph (their FS functionality), XtreemFS and
> LizardFS.  I ended up building most of these but then settled on
> LizardFS since it is the one that is software defined storage and not
> object storage.


Honestly, you'll have to explain what you mean by that distinction.
While Ceph does object storage, and the FS component is built on top
of that, ultimately all filesystems are software-defined, unless
they're literally implemented in hardware.

> To your questions...
> * Does the implementation protect against memory corruption on storage
> nodes that do not use ECC?   (Note, I said storage nodes, NOT client
> nodes.)
>
> Protect in what sense?  If you don't have ECC ram you'd be relying on
> redundancies elsewhere in the transport of the data to protect against
> bit changes.


That is my question: which filesystems do this?  Here is an example
algorithm I'd consider safe against memory corruption without ECC on
the storage node:

1.  Client (with ECC) passes data to filesystem layer on client.
2.  Filesystem layer (still with ECC since same host) computes checksum.
3.  Client filesystem layer passes data+checksum to some component of
the storage cluster.
4.  Storage cluster component verifies data vs checksum, then
distributes it through the cluster to the nodes that will store it.
5.  Each receiving node verifies data vs checksum, and stores
checksum+data on disk.  It then reads back the checksum only and
compares it to the checksum it received originally.
6.  Each receiving node acks completion of storage to the storage
component in #4, along with the checksum.
7.  Storage component waits for all acks, then compares all the
checksums received against the original checksum from the client.
8.  Storage component sends ack that sync is complete to the client,
along with the checksum.
9.  Client compares checksum it sent in #3 to the checksum it got back
in #8.  If they match then the sync is complete.

With an algorithm like this I believe we can tolerate a RAM failure
anywhere in the storage cluster, while the client is protected by ECC.
If there is a RAM failure somewhere in the cluster, the round trip of
checksums will fail at one of the verification steps and the sync
won't complete on the client, which can trigger error handling to
re-store the data.
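
To make the round trip concrete, here is roughly what I have in mind,
sketched in Python (the names are made up, not any real filesystem's
API):

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class StorageNode:
    def __init__(self):
        self.disk = {}                        # block_id -> (csum, data)

    def store(self, block_id, data, csum):
        if checksum(data) != csum:            # step 5: verify on receipt
            raise IOError("corrupt on receive")
        self.disk[block_id] = (csum, data)    # store checksum + data
        stored_csum, _ = self.disk[block_id]  # read the checksum back...
        if stored_csum != csum:               # ...compare to the original
            raise IOError("corrupt after write")
        return stored_csum                    # step 6: ack with checksum

class Distributor:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, block_id, data, csum):
        if checksum(data) != csum:            # step 4: verify before fan-out
            raise IOError("corrupt in transit")
        acks = [n.store(block_id, data, csum) for n in self.nodes]
        if any(a != csum for a in acks):      # step 7: compare all acks
            raise IOError("corrupt on a storage node")
        return csum                           # step 8: ack to the client

def client_write(dist, block_id, data):
    csum = checksum(data)                     # step 2: computed under ECC
    if dist.write(block_id, data, csum) != csum:   # steps 3 and 9
        raise IOError("round trip failed; re-store the data")

The point is that the one checksum computed on the ECC host is carried
through every hop and checked at every hand-off, including after the
data has hit disk.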

For comparison, here is an algorithm I don't consider safe:
1.  Client (with ECC) passes data to filesystem layer on client.
2.  Filesystem layer (still with ECC since same host) computes checksum.
3.  Client filesystem layer passes data+checksum to some component of
the storage cluster.
4.  Storage cluster component verifies data vs checksum, then
distributes it through the cluster to the nodes that will store it.
5.  Each receiving node verifies data vs checksum.
5a.  Each receiving node writes the data to a checksummed filesystem
like ZFS, which computes its own checksum.
5b.  The receiving node does NOT read-back the data or otherwise
confirm that the data ZFS wrote was the data that was verified in step
5 above.
6.  Each receiving node acks completion of storage to the storage
component in #4, along with the checksum.
7.  Storage component waits for all acks, then compares all the
checksums received against the original checksum from the client.
8.  Storage component sends ack that sync is complete to the client,
along with the checksum.
9.  Client compares checksum it sent in #3 to the checksum it got back
in #8.  If they match then the sync is complete.

The only change is that the original checksum is never written to disk
and re-verified.  This means that if the data changes between the time
it is verified in step 5 and the time it is handed to ZFS (where it
gets re-protected under a freshly computed checksum), that corruption
is never caught.  You could end up with two nodes whose ZFS both say
they have valid data, but the copies differ.

Now, if you changed the order of operations in the second algorithm so
that the data validation happened on a read-back from ZFS, that would
address this issue.
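
In sketch form the difference comes down to whether the node validates
what actually landed on disk.  (fs_write/fs_read here are made-up
stand-ins for whatever the chunk server does against its backing
filesystem, and checksum() is as in the earlier sketch.)

def store_unsafe(fs_write, block_id, data, csum):
    if checksum(data) != csum:       # verified here...
        raise IOError("corrupt on receive")
    fs_write(block_id, data)         # ...but RAM can flip a bit before this,
    return csum                      # and the ack still says "good"

def store_safe(fs_write, fs_read, block_id, data, csum):
    fs_write(block_id, data)
    if checksum(fs_read(block_id)) != csum:  # verify what actually landed
        raise IOError("corrupt between checksum and write")
    return csum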


>
> * Complexity/etc.
>
> What do you mean?

I have 3 PCs with 2 drives each.  How many commands will I have to
type to get the whole thing running, and how trivial are they?

With something like ZFS/mdadm, setting up a RAID is pretty simple.  I
can tell you that with Ceph it gets a bit more complex, with the need
to sync secrets across nodes, various tuning options, and so on.
There is an Ansible playbook out there that helps with a lot of that,
but sometimes I wonder if it is a recipe for losing all your data in
one bug...

> * Options for backup.
>
> Standard stuff here, you'll want /etc/lizardfs or /etc/mfs for your
> configuration files and /var/lib/lizardfs or /var/lib/mfs for your
> metadata.  Keep in mind that shadow masters and metaloggers also would
> give you copies of your metadata.


I was thinking more in terms of snapshots, serialization, etc.  That
is, I want to back up the cluster, not back up every node in the
cluster simultaneously.

> * How well does it scale down?  I'm looking at alternatives to
> largeish ZFS/NAS home-lab implementations, not to run Google.  Do I
> need 12 physical hosts to serve 10TB of data with redundancy to
> survive the loss of 1 physical host?
>
> It scales down very well.  I'm working on something now that I'm not
> ready to talk about yet but a common question that comes up on the
> list is can you run multiple chunk servers on one system.  The answer
> is yes.


Sure, but what happens when that one system fails, and you lose
multiple chunk servers at once?  Will the filesystem have ensured that
those multiple chunk servers weren't required to meet the replication
constraints?

> I'd have to know more about what you are doing but I know of scenarios
> where people are running single systems with 2 or more chunk servers
> against their storage instead of RAID or other systems to provide
> redundancy and availability.


Well, my whole goal is that I can have maybe ~15-18TB of disks
providing ~12TB of usable storage across a minimal number of systems
so that the loss of any one host doesn't affect the cluster.  With
Ceph that is relatively straightforward, with the main constraint
being that I need at least 3 physical hosts.  It doesn't really
support having only 2.  Ceph actually uses a storage node per disk, so
a host with multiple disks runs multiple storage nodes.  However, the
default redundancy is at the host level, not the storage node level,
so redundancy will always be satisfied for the loss of any one host.
You can actually have more complex mappings like tagging each node
with host+rack+room+building+site and have all kinds of rules on how
the data is split up.
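
Roughly what host-level redundancy buys you, again sketched in Python
(this is just an illustration of the idea, not Ceph's actual CRUSH
algorithm, and the names are made up):

import hashlib

# one storage daemon per disk, tagged with the host it lives on
osds = [
    ("host1", "osd.0"), ("host1", "osd.1"),
    ("host2", "osd.2"), ("host2", "osd.3"),
    ("host3", "osd.4"), ("host3", "osd.5"),
]

def place(block_id, replicas):
    # rank every daemon by a hash of (block, daemon), then take the
    # best daemon from each host until we have enough replicas -- no
    # two copies share a host, so losing any one host never loses all
    # of the copies
    ranked = sorted(osds, key=lambda o: hashlib.sha256(
        (block_id + o[1]).encode()).hexdigest())
    chosen, hosts_used = [], set()
    for host, osd in ranked:
        if host in hosts_used:
            continue
        chosen.append(osd)
        hosts_used.add(host)
        if len(chosen) == replicas:
            break
    return chosen

print(place("block-42", 2))   # two copies, always on different hosts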

>
> I looked at CephFS too and I like what I see, but it's still object storage.

Well, if all you used was CephFS you wouldn't know it was using object
storage, so I'm not sure why this is a concern.  You could view the
object store as an internal implementation detail: the filesystem is
one interface onto it, and the object store is simply another.

> Plus changing your data strategy on the fly or having multiple strategies seemed to be much more difficult to do (if possible at all).

So, it is somewhat complex, but my understanding is that you can
simply change the rules, and the remapping happens automatically.  The
main issue is that this can potentially cause a storm of IO as the
mapping of data to nodes is all based on a hash algorithm, so when you
change the inputs a little you change the outputs of the hash a lot.
For example, suppose you reduce redundancy from 3x to 2x.  You might
naively think that this just means the nodes would figure out which
1/3rd of the data to toss out.  In reality it changes the
destination addresses of every block of data in the cluster, so just
about every block is going to get rewritten somewhere else (other than
the percentage that ends up on the same node due to chance alone).
This is still automatic, but basically every node in the cluster ends
up with a massive work queue that they all work through as they export
data to other nodes, and also import data from other nodes doing the
same thing.  Eventually it all settles down...eventually.  What can
cause serious problems is going in and making another change while
this is in progress, as that basically doubles all the queues, and
eventually RAM becomes a problem on every node in the cluster.
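
A toy stand-in for the placement hash shows the effect (this is not
CRUSH itself, just the general behavior of hash-based placement):

import hashlib

def placement(block_id, nodes, replicas):
    # the destination depends on every input, so changing any one of
    # them changes the output for most blocks
    h = int(hashlib.sha256(f"{block_id}/{replicas}".encode()).hexdigest(), 16)
    return {(h + i) % nodes for i in range(replicas)}

before = {b: placement(f"block{b}", 12, 3) for b in range(100000)}
after  = {b: placement(f"block{b}", 12, 2) for b in range(100000)}

# blocks where some copy now lives on a node it wasn't on before
moved = sum(1 for b in before if not after[b] <= before[b])
print(f"{moved / len(before):.0%} of blocks get rewritten somewhere new")

In this toy model most blocks end up with at least one copy on a node
that did not previously hold one, and every one of those copies is
data that has to move across the cluster.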

As the admin you don't really have to think about this too much, but
you need to be aware of it, because anything that changes the mapping
is probably going to cause the whole cluster to go crazy.  This is
part of why they recommend storage nodes have a secondary LAN between
them - it helps keep this traffic from killing the rest of your LAN.

>
> I found LizardFS to be much easier to set up and work with than Ceph.
> Adding / removing resources, moving data around, rebuilding failed
> resources and changing replication goals are easy to do.  The other
> thing I really like is that it layers on a standard filesystem, which
> was important since filesystems take a long time to vet before you
> can say they are reliable.  You could in theory share a current
> filesystem with LizardFS (you just create a folder for it so that the
> root of the chunk server data starts there) to try it out - if you
> like it, moving things around is trivial, and if you don't, you just
> uninstall and delete that folder.  Ceph may allow you to do this too,
> but I didn't look into it.


Until recently, layering on another filesystem was the only way Ceph
worked, which is part of why I'm not sure about the distinction you're
making.
However, I'm completely willing to believe that the clustering layer
is simpler to administer.

>
> LizardFS has been around for a while now and they do have corporate
> backing.  I'm not sure why more people aren't familiar with it, but it
> took me a while to find it too.  Gluster and Ceph tend to suck the air
> out of the room since they are the most popular.


And that is what makes me nervous about LizardFS.  I'm really looking
for a solution that is easy to expand/maintain/etc.  One of the issues
if I scale up is that it becomes harder to "just rsync my data to
something else", since that something else has to be just as big as
what I'm replacing, and I can't just migrate the data in place.


-- 
Rich