Keith C. Perry on 12 Aug 2018 15:43:06 -0700



Re: [PLUG] Virtualization clusters & shared storage


"Honestly, you'll have to explain what you mean by that distinction?
While Ceph does object storage, and the FS component is built on top
of that, ultimately all filesystems are software-defined, unless
they're literally implemented in hardware."

Therein lies the rub... a definition. Allow me to offer an operational perspective, for what it's worth.

Object storage is more of a buzz / marketing / shiny-new-box-around-an-existing-idea paradigm than it is a different way to approach storage.  What you are essentially doing is applying a database concept to file system data.  That is to say, you are writing to a system where, once the write completes, you have a durable blob (or, in practice, chunks of data that make up a blob).  Think of it as working with an ACID-compliant file.  This is not new though- if you've ever had to move blobs in and out of databases for multiple users, you've played with object storage.  What Ceph et al. bring us is full filesystem semantics backed by such a system, whether it's cloud-based like Amazon's S3 or local like Ceph's own object storage system.

Software-defined storage works differently in that the system concerns itself with the desired state of your storage and works to maintain it.  This is most apparent when we're talking about replication, but it also extends to general filesystem health- e.g. are all the chunks of data for this file available?... is the data in them correct? etc.  It is far easier to say something like "I want 3 copies of this file... 2 in NY and 1 in PA" or "I want 2 copies of this file but at least one has to be on SSDs" than it is to spell out all the actions to do that AND maintain balance across all your OSD / chunk servers AND execute the specific actions needed to restore your storage definitions when a failure occurs.
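
To make that concrete, here is a rough sketch of how such a declarative goal looks in LizardFS.  The labels (ny, pa, ssd) are hypothetical- you pick them yourself in each chunk server's config, and the goal definitions live on the master:

    # /etc/mfs/mfschunkserver.cfg on a chunk server in NY (label name is up to you):
    LABEL = ny

    # /etc/mfs/mfsgoals.cfg on the master- goal id, goal name, then the labels
    # the copies must land on ("_" means any chunk server):
    5 two_ny_one_pa : ny ny pa
    6 one_on_ssd : ssd _

Once a file or directory is tagged with a goal, placing and maintaining the copies is the system's job, not yours.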

This may seem like a distinction without a difference, but then I would say: all things being equal, why not do the thing that is easiest?  If you've ever had a major storage disaster, the last thing you want to do with people breathing down your neck is follow the careful procedure for repairing a failed RAID array or LVM volume.  Yes, it can be done, but I would much rather know that my filesystem has already started moving data to maintain my stated data goals than have to take action myself and then still deal with the failure.  Object storage systems do not **inherently** re-balance in BOTH directions- i.e. in software-defined storage, under- and over-goal situations are both corrected without user intervention.

~ ~ ~

As far as I know, what you are looking for doesn't exist- maybe in ZFS or BTRFS, but I don't use either of those, nor would I use them for the underlying fs.  LizardFS does recommend XFS or ZFS, but ext4 is fine too (ext4 used to be in the documentation before ZFS).  ZFS to me is overkill and would require more resources.  I get your concern about detecting in-memory corruption, but if this were a critical issue today, every modern file system would have some sort of basic checking for it.  With ECC RAM becoming cheaper and more available to the consumer market I think this issue goes away, but my questions back to you would be:

How often have you seen this happen?
Are you 100% sure the corruption happened due to bad memory?
Are you 100% sure there were not any external forces that created this event?

Again, I agree it's **possible**, but IP networks already implement checksums.  So if we are not detecting bit flips with ECC RAM, and we're not having issues with bit flips elsewhere on a chunk server node after the data is received from the network interface (those would be corrected during a scrub), I'm not sure what we gain by adding another CRC mechanism for something that is **not probable**.  I did post this question to the LizardFS list, so I'll pass along the response when I see it.
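
That said, if you really want an end-to-end check on top of TCP checksums, ECC and chunk CRCs, you can always do it yourself at the application layer.  A trivial sketch (paths hypothetical; note the read-back may be served from the client's page cache rather than from a chunk server):

    src=$(sha256sum < /srv/data/file)
    cp /srv/data/file /mnt/lizardfs/file && sync
    dst=$(sha256sum < /mnt/lizardfs/file)   # drop caches first for a true round trip
    [ "$src" = "$dst" ] && echo "checksums match" || echo "MISMATCH- rewrite the file"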

~ ~ ~

I have 3 PCs with 2 drives each.  How many command lines will I type
to get the whole thing running?  And how trivial are they/etc?

The minimum I would do:

1) PC1... master & chunk server configs (3 files)
2) PC2... chunk server & metadata logger config (3 files)
3) PC3... chunk server config (2 files)

I'm not including steps like initializing the metadata or tuning, but with this setup you could run goal 2 or 3, xor2 (2 data + 1 parity), ec(2+1) (same as xor2), or ec(3+2).  From here you could add a shadow master (to PC3).  Any single data drive anywhere could fail (assuming the data drives are not also system drives), as could PC2 or PC3.  With ec(3+2), PC2 and PC3 could both fail and you would still be up.  PC1 in the basic case could not fail since it is the only metadata server; assuming it is backed up, you could restore it and pull over a copy of the metadata from PC2.
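
For a rough feel of the commands involved (a sketch, not a walkthrough- I'm assuming the mfs-style paths and daemon names from the LizardFS packages, and your distro's packaging may differ):

    # PC1: master + chunk server
    cp /var/lib/mfs/metadata.mfs.empty /var/lib/mfs/metadata.mfs  # initialize metadata
    # edit /etc/mfs/mfsmaster.cfg and /etc/mfs/mfschunkserver.cfg, then list the
    # drives the chunk server should use:
    echo "/mnt/disk1" >> /etc/mfs/mfshdd.cfg
    echo "/mnt/disk2" >> /etc/mfs/mfshdd.cfg
    mfsmaster start && mfschunkserver start

    # PC2: chunk server + metalogger (mfschunkserver.cfg, mfshdd.cfg, mfsmetalogger.cfg)
    mfschunkserver start && mfsmetalogger start

    # PC3: chunk server (mfschunkserver.cfg, mfshdd.cfg)
    mfschunkserver start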

That would probably take you less than 60 minutes- maybe less than 30  :)

~ ~ ~

I was thinking more in terms of snapshots, serialization/etc.  Ie I
want to backup the cluster, not backup every node in the cluster
simultaneously/etc.

I think I answered that- "the cluster" is its /etc configurations and its metadata.  The /etc files are static.

Yes, you can snapshot data.
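
In other words, backing up "the cluster" is a matter of grabbing the configs and the master's metadata, and a snapshot is a single command.  A sketch, again assuming the mfs-style paths:

    # back up the cluster state: configs plus the master's metadata
    tar czf lizardfs-cluster-backup.tgz /etc/mfs /var/lib/mfs/metadata.mfs*

    # snapshot a directory inside the filesystem (lazy, copy-on-write):
    lizardfs makesnapshot /mnt/lizardfs/projects /mnt/lizardfs/projects-snap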

I don't know what you mean by serialization, example?

~ ~ ~ 

Sure, but what happens when that one system fails, and you lose
multiple chunk servers at once?  Will the filesystem have ensured that
those multiple chunk servers weren't required to meet the replication
constraints?

If your one system is down then it is down, so while your data is "safe" you would probably only be good to the last backup of your server (and probably not even that, since the metadata would not be up to date).  I scaled this down to a trivial case that is not viable unless you have your metadata logged elsewhere.  I mentioned it because I know it's been done (and I certainly hope the people doing it are capturing the metadata elsewhere).  Really, the most basic viable deployment would be on two systems.

~ ~ ~

Well, my whole goal is that I can have maybe ~15-18TB of disks
providing ~12TB of usable storage across a minimal number of systems
so that the loss of any one host doesn't affect the cluster.  With
Ceph that is relatively straightforward, with the main constraint
being that I need at least 3 physical hosts.  It doesn't really
support having only 2.  Ceph actually uses a storage node per disk, so
a host with multiple disks runs multiple storage hosts.  However, the
default redundancy is at the host level, not the storage node level,
so redundancy will always be satisfied for the loss of any one host.
You can actually have more complex mappings like tagging each node
with host+rack+room+building+site and have all kinds of rules on how
the data is split up.

You can do a two-node cluster with LizardFS.  Essentially you'd have the equivalent of DRBD- network RAID-1.  LizardFS assumes one storage node per device as well, but as I said above, you can run more if you want.
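
A sketch of what that two-node setup boils down to (hostnames and mount point hypothetical): run a chunk server on each box, point both at the master, and set goal 2 so every chunk ends up with one copy per host:

    # on each host: MASTER_HOST in /etc/mfs/mfschunkserver.cfg points at the box
    # running the master, and mfshdd.cfg lists that host's local drives
    lizardfs setgoal -r 2 /mnt/lizardfs   # 2 copies of everything- network RAID-1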

~ ~ ~

So, it is somewhat complex, but my understanding is that you can
simply change the rules, and the remapping happens automatically.  The
main issue is that this can potentially cause a storm of IO as the
mapping of data to nodes is all based on a hash algorithm, so when you
change the inputs a little you change the outputs of the hash a lot.
For example, suppose you reduce redundancy from 3x to 2x.  You might
naively think that this just means that the nodes would just figure
out which 1/3rd of the data to toss out.  In reality it changes the
destination addresses of every block of data in the cluster, so just
about every block is going to get rewritten somewhere else (other than
the percentage that ends up on the same node due to chance alone).
This is still automatic, but basically every node in the cluster ends
up with a massive work queue that they all work through as they export
data to other nodes, and also import data from other nodes doing the
same thing.  Eventually it all settles down...eventually.  What can
cause serious problems is if you go in and make another change while
this is in progress as this basically doubles all the queues, and
eventually RAM becomes a problem on every node in the cluster.

As the admin you don't have to really think about this too much, but
you need to be aware of it because anything that changes the mapping
is going to probably cause the whole cluster to go crazy.  This is
part of why they recommend storage nodes have a secondary LAN between
them - it helps keep this traffic from killing the rest of your LAN.

I did not think that Ceph re-balances (down) automatically; if I were you, I'd want to prove that one to myself.  I haven't looked at it since spring 2017.  In LizardFS I can apply a goal to a file or a directory (even recursively) with one command and then monitor the progress.  The replication load can be throttled, so it is not a concern if your servers are doing other things.  I didn't think Ceph could do this at the file / directory level either, so if it can, that is great.
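
For example (directory name hypothetical, using the custom goal from the earlier sketch)- one command applies the goal recursively, and then you can watch individual files converge:

    lizardfs setgoal -r two_ny_one_pa /mnt/lizardfs/important
    lizardfs fileinfo /mnt/lizardfs/important/somefile   # shows where each chunk's copies live

The CGI web monitor that ships with LizardFS also shows under- and over-goal chunk counts cluster-wide.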

~ ~ ~

And that is what makes me nervous about LizardFS.  I'm really looking
for a solution that is easy to expand/maintain/etc.  One of the issues
if I scale up is that it becomes harder to "just rsync my data to
something else" - since that something else has to be just as big as
what I'm replacing, and I can't just migrate the data in-place.

?  So... you're a Linux guy and you don't get that the most popular thing isn't always the best thing?  You sure 'bout that?  LOL

All I can say is that I like what I see from the project, and I like what I've personally seen over the last year.  I have the same issue- data just keeps growing and I want my storage management to be easy to maintain and scale.  I determined last year that LizardFS got this right- more right than the others I mentioned- but that was like 6 to 8 months of my life to get to that point.  It's only now that I've been putting the time into building out deployments, and I really like what I am seeing.

There are case studies on the .com variant of the web site, but I think the whitepaper really gets to the power of the system.  That's what got me to try it, and once I saw it in action, I was sold.


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Sunday, August 12, 2018 3:13:57 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Sun, Aug 12, 2018 at 1:34 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> Rich, I was frustrated last year trying to find comparisons myself
> between GlusterFS, Ceph (their FS functionality), XtreemFS and
> LizardFS.  I ended up building most of these but then settled on
> LizardFS since it is the one that is software defined storage and not
> object storage.


Honestly, you'll have to explain what you mean by that distinction?
While Ceph does object storage, and the FS component is built on top
of that, ultimately all filesystems are software-defined, unless
they're literally implemented in hardware.

> To your questions...
> * Does the implementation protect against memory corruption on storage
> nodes that do not use ECC?   (Note, I said storage nodes, NOT client
> nodes.)
>
> Protect in what sense?  If you don't have ECC ram you'd be relying on
> redundancies elsewhere in the transport of the data to protect against
> bit changes.


That is my question - which filesystems do this.  Example algorithm
I'd consider safe against memory corruption without use of ECC on the
storage node:

1.  Client (with ECC) passes data to filesystem layer on client.
2.  Filesystem layer (still with ECC since same host) computes checksum.
3.  Client filesystem layer passes data+checksum to some component of
the storage cluster.
4.  Storage cluster component verifies data vs checksum, then
distributes it through the cluster to the nodes that will store it.
5.  Each receiving node verifies data vs checksum, and stores
checksum+data on disk.  It then reads back the checksum only and
compares it to the checksum it received originally.
6.  Each receiving node acks completion of storage to the storage
component in #4, along with the checksum.
7.  Storage component waits for all acks, then compares all the
checksums received against the original checksum from the client.
8.  Storage component sends ack that sync is complete to the client,
along with the checksum.
9.  Client compares checksum it sent in #3 to the checksum it got back
in #8.  If they match then the sync is complete.

With an algorithm like this I believe we can tolerate a RAM failure
anywhere in the storage cluster, and the client is protected by ECC.
If there is a RAM failure somewhere in the cluster then the round trip
of checksums is going to run into an issue and the sync won't complete
on the client, which can trigger error handling to re-store the data.

For comparison, here is an algorithm I don't consider safe:
1.  Client (with ECC) passes data to filesystem layer on client.
2.  Filesystem layer (still with ECC since same host) computes checksum.
3.  Client filesystem layer passes data+checksum to some component of
the storage cluster.
4.  Storage cluster component verifies data vs checksum, then
distributes it through the cluster to the nodes that will store it.
5.  Each receiving node verifies data vs checksum.
5a.  Each receiving node writes the data to a checksummed filesystem
like ZFS, which computes its own checksum.
5b.  The receiving node does NOT read-back the data or otherwise
confirm that the data ZFS wrote was the data that was verified in step
5 above.
6.  Each receiving node acks completion of storage to the storage
component in #4, along with the checksum.
7.  Storage component waits for all acks, then compares all the
checksums received against the original checksum from the client.
8.  Storage component sends ack that sync is complete to the client,
along with the checksum.
9.  Client compares checksum it sent in #3 to the checksum it got back
in #8.  If they match then the sync is complete.

The only change is that the original checksum is never written to disk
and re-verified.  This means that if the data changed between the time
it was checksummed and the time it was passed to ZFS (and
re-protected) that would not be caught.  You could end up with two
nodes whose ZFS both say they have valid data, but the copies differ.

Now, if you changed the order of operations in the second algorithm so
that the data validation happened on a read-back from ZFS that would
address this issue.


>
> * Complexity/etc.
>
> What do you mean?

I have 3 PCs with 2 drives each.  How many command lines will I type
to get the whole thing running?  And how trivial are they/etc?

With something like ZFS/mdadm setting up a RAID is pretty simple.  I
can tell you that with Ceph it gets a bit more complex with the need
to sync secrets and such across nodes, and various tuning options, and
so on.  There is an ansible playbook out there that helps with a lot
of that, but sometimes I wonder if it is a recipe for losing all your
data in one bug...

> * Options for backup.
>
> Standard stuff here, you'll want /etc/lizardfs or /etc/mfs for your
> configuration files and /var/lib/lizardfs or /var/lib/mfs for your
> metadata.  Keep in mind that shadow masters and metaloggers also would
> give you copies of your metadata.


I was thinking more in terms of snapshots, serialization/etc.  Ie I
want to backup the cluster, not backup every node in the cluster
simultaneously/etc.

> * How well does it scale down?  I'm looking at alternatives to
> largeish ZFS/NAS home-lab implementations, not to run Google.  Do I
> need 12 physical hosts to serve 10TB of data with redundancy to
> survive the loss of 1 physical host?
>
> It scales down very well.  I'm working on something now that I'm not
> ready to talk about yet but a common question that comes up on the
> list is can you run multiple chunk servers on one system.  The answer
> is yes.


Sure, but what happens when that one system fails, and you lose
multiple chunk servers at once?  Will the filesystem have ensured that
those multiple chunk servers weren't required to meet the replication
constraints?

> I'd have to know more about what you are doing but I know of scenarios
> where people are running single systems with 2 or more chunk servers
> against their storage instead of RAID or other systems to provide
> redundancy / availability.


Well, my whole goal is that I can have maybe ~15-18TB of disks
providing ~12TB of usable storage across a minimal number of systems
so that the loss of any one host doesn't affect the cluster.  With
Ceph that is relatively straightforward, with the main constraint
being that I need at least 3 physical hosts.  It doesn't really
support having only 2.  Ceph actually uses a storage node per disk, so
a host with multiple disks runs multiple storage hosts.  However, the
default redundancy is at the host level, not the storage node level,
so redundancy will always be satisfied for the loss of any one host.
You can actually have more complex mappings like tagging each node
with host+rack+room+building+site and have all kinds of rules on how
the data is split up.

>
> I looked at CephFS too and I like what I see but its still object storage.

Well, if all you used was CephFS you wouldn't know it was using object
storage.  I'm not sure why this is a concern.  You could view it as an
internal implementation detail, with the filesystem being an
interface, but with object store being a separate interface.

> Plus changing your data strategy on the fly or having multiple strategies seemed to be much more difficult to do (if at all).

So, it is somewhat complex, but my understanding is that you can
simply change the rules, and the remapping happens automatically.  The
main issue is that this can potentially cause a storm of IO as the
mapping of data to nodes is all based on a hash algorithm, so when you
change the inputs a little you change the outputs of the hash a lot.
For example, suppose you reduce redundancy from 3x to 2x.  You might
naively think that this just means that the nodes would just figure
out which 1/3rd of the data to toss out.  In reality it changes the
destination addresses of every block of data in the cluster, so just
about every block is going to get rewritten somewhere else (other than
the percentage that ends up on the same node due to chance alone).
This is still automatic, but basically every node in the cluster ends
up with a massive work queue that they all work through as they export
data to other nodes, and also import data from other nodes doing the
same thing.  Eventually it all settles down...eventually.  What can
cause serious problems is if you go in and make another change while
this is in progress as this basically doubles all the queues, and
eventually RAM becomes a problem on every node in the cluster.

As the admin you don't have to really think about this too much, but
you need to be aware of it because anything that changes the mapping
is going to probably cause the whole cluster to go crazy.  This is
part of why they recommend storage nodes have a secondary LAN between
them - it helps keep this traffic from killing the rest of your LAN.

>
> I found LizardFS to be much easier to setup and work with than Ceph.
> Adding / removing resources, moving data around, rebuilding failed
> resources and changing replication goals are easy to do. The other
> thing I really like is that it layers on a standard filesystem, which
> was important since filesystems take a long time to vet before you can
> say they are reliable.  You could in theory share a current filesystem
> with LizardFS (you just create a folder for it so that the root of the
> chunk server data starts there) to try it out- if you like it, moving
> things around is trivial, and if you don't, you just uninstall and
> delete that folder.  Ceph may allow you to do this but I didn't look into it.


Until recently layering on another filesystem was the only way Ceph
worked.  This is why I'm not sure why you're making this distinction.
However, I'm completely willing to believe that the clustering layer
is simpler to administer.

>
> LizardFS has been around for awhile now and they do have corporate
> backing.  I'm not sure why more people aren't familiar with it but it
> took me awhile to find it too.  Gluster and Ceph tend to suck the air
> out of the room since they are the most popular.


And that is what makes me nervous about LizardFS.  I'm really looking
for a solution that is easy to expand/maintain/etc.  One of the issues
if I scale up is that it becomes harder to "just rsync my data to
something else" - since that something else has to be just as big as
what I'm replacing, and I can't just migrate the data in-place.


-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug