Keith C. Perry on 12 Aug 2018 22:45:50 -0700


[Date Prev] [Date Next] [Thread Prev] [Thread Next] [Date Index] [Thread Index]

Re: [PLUG] Virtualization clusters & shared storage


Comment below are prepended with ">>"

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Sunday, August 12, 2018 8:40:14 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Sun, Aug 12, 2018 at 6:42 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
>
> Software defined storage works differently in that the system concerns itself with the state of your storage and works to maintain that.  It is most apparent when were talking about replication but also extends to general filesystem health- e.g. are all my chunks of data available for this file?... is the data in them correct? etc. It is far easier to say something like "I want 3 copies of this file... 2 in NY and 1 in PA" or "I want 2 copies of this file but at least one has to be on SSDs" than it is to specify all the actions to do that AND maintain balance across all your OSD / chuck servers AND execute the specific actions needed to maintain your storage definitions when a failure occurs.

Isn't that basically what Ceph(FS) already does?  You tell it how much
replication you want, and under what constraints, and it ensures it
stays that way.  You don't have to manually do anything when a failure
happens - as long as there is sufficient space remaining it will
reshuffle your data automatically to bring it back into redundancy.
Now, obviously if the cluster is 99% full and you lose a replica it
isn't going to be able to re-achieve redundancy until you add more
storage, but as soon as you do it will shuffle the data around.

So, I'm not seeing the distinction you're trying to make, unless you
think that other cluster filesystems don't automatically heal/etc, or
make you micromanage what data goes where...

>> Ceph's operations, re-balancing and the like, from what I remember work at the OSD level not at the file or directory level.  I don't remember finding in documentation that supports the later.  CepthFS might be closer to an SDS than I give it credit for.  For me, LizardFS is a simpler solution.

> As far as I know what you are looking for doesn't exist- maybe in ZFS
> or BTRFS but I don't use either one of those nor would I for the fs
> for underneath.  LizardFS does recommend xfs or zfs but ext4 is fine
> too (that used to be in the documentation before zfs).  ZFS to me is
> overkill and would require more resources.  I get the concern you
> have about detecting in memory corruption but if this was a critical
> issue today every modern file system would have some sort of basic
> checking.  With ECC ram becoming cheaper and more available to the
> consumer market I think this issue goes away but my questions back to
> you would be:

I think that we're not seeing a lot of concern with this because
everybody just throws money at the problem and buys ECC RAM.  I want
to do storage on CHEAP hardware, as in $100 per storage node or
something on that order of magnitude (I'd do them on Pis if their IO
weren't so bad).  Cheap hardware means no ECC.

>> I get that but at a certain point there is no "free" lunch.  I've built way more systems without ECC ram than with and I've never had bad memory in a system that didn't have another issue.  Bad ram is usually the result of something else going wrong in my experience.

>
> How often have you seen this happen?
> Are you 100% sure the corruption happened due to bad memory?

I have absolutely had filesystem corruption due to bad memory.

>> How often though?

> Are you 100% sure there were not any external forces that created this event?

Considering the system could barely stay up for more than 10 minutes
once it started failing, I'm pretty confident the memory was bad.

>> that doesn't scream memory to me actually

Now, that was a pretty easy to discover situation so backups would
have been an option (and this system didn't actually store anything
all that critical anyway).  I'd be more concerned about stuff
discovered much later.

>> fair point there, people usually find out too late that their backups are corrupt

> Again, I agree its **possible** but IP networks already implement
> checksums so if we are not detecting bit flips with ECC ram and we're
> not having an issues with bit flips elsewhere on a chunk server node
> (which would be corrected during a scrub) after the data is received
> from the network interface, I'm not sure what we gain by adding
> another CRC mechanism if this is **not probable** to happen.  I did
> post this question to the list so I'll pass along the response when I
> see it.

I'm not using ECC RAM in my storage nodes - the goal is to keep them
cheap.  So, this isn't redundancy in software on top of ECC, but
rather in place of it.  It is silly to not check CRCs since we're
already computing them for other purposes and all we need to do is
simply not discard them, but store them with the data and return them
with the acknowledgement packet.  What does that cost, a single field?
 It is silly not to safeguard against it.

>> I'm not sure I agree wit you statement about CRC's.  However, even if that were true at every layer of the stack it would break the concept of layer independence to not through them out after the data has been as successfully passed up or down.  Trying to handle this in software is not going to work. We're saying a bit flip is a error that software is not going to pick up- hence the need for ECC.  If you not using ECC then all bets are off since the computed CRC could be wrong.  In the example you gave you mentioned you had ECC in the clients and NOT in the storage OSDs / chunk servers.  To borrow your own language, if you can about your data and have this concern won't it be "silly" not to use ECC memory there too?  Why go "cheap" when you're talking about your data?

But, then again the market for these cluster filesystems doesn't tend
to be people running them at home, so I get why they aren't
super-concerned with it.

>> Very true

>
> I was thinking more in terms of snapshots, serialization/etc.  Ie I
> want to backup the cluster, not backup every node in the cluster
> simultaneously/etc.
>
> I think answered that- "the cluster" is its /etc configurations and metadata.  The /etc files are static.
>
> Yes, you can snapshot data.
>
> I don't know what you mean by serialization, example?

I have two snapshots of the same data at different times, and I want a
file/pipe that can be used to reproduce one snapshot from the other,
or a file/pipe that can be used to recreate one of the snapshots given
nothing else.  Backups, essentially.

>> This sounds like versioning to me and that is what a snapshot gives you.  For better description of what is happening I would refer you to the whitepaper since it is covered in there.

> I did not think that Ceph re-balances (down) automatically, if I were you, I'd want to prove that one to myself.

I've already done it in small-scale tests.  Just run a fleet of VMs
with a 100GB volume on each.  If you don't set noout then when the
cluster consensus decides that an OSD is missing it is removed from
the cluster and then data rebalances.  This is automatic unless you
disable it, which you would not want to do normally.

In fact, one of the more annoying things is that if you want to shut
down the cluster you have to set flags to disable this behavior
otherwise you'll create a storm as the nodes don't shut down
synchronously.  Granted, that is a single command line.  Then you have
to re-enable it when the cluster comes back up.  It isn't really
intended to be something you're shutting down at the end of the day -
I imagine you could automate it to a degree though you'd need to make
sure the cluster is all up before telling it to start doing the
integrity checks.

>> LizardFS doesn't work that way once missing chucks or chunks that don't meet there are are detected the system begin to re-balance itself at a rate you tune to your server.  You're not going to see an storms of data because of missing chunks unless that is what you want.  It is complete plausible that a chuck server could go down and all the files and directories meet their goals.  The re-balancing in that case is related to load and resource consumption.  In Cepth I seem to recall that that sort of re-balancing is manually done http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/  <-- In LizardFS I don't have to worry about the concept of weighs.  When it comes to removing a disk or chunk server, there is no need to migrate data from it unless something had a goal of 1.


> I didn't think Ceph could do this at the file / directory level either so if it can, that is great.

Ceph has a layer where these settings can be applied.  So, you can
have a group of objects/volumes/filesystems that use one set of rules,
and another set that use another.  They can all share the same storage
nodes collectively - you don't have to allocate individual nodes to a
particular policy.  I don't think you could do it at the
file/directory level unless you made it a new mountpoint.  Of course,
it is all POSIX so you can have as many different Ceph mounts as you
want, and you can also have the same filesystem mounted on multiple
clients at the same time.

At the same layer where you'd apply the redundancy policies you also
manage security (at the cluster level).  Keys are assigned to
permissions/etc, so different clients can have access to different
sets of data.  The data itself is distributed across all the storage
nodes, so the data from one client can be sitting next to the data for
another.

>> This is one of the most significant differences.  I don't have to do a mount point for each data policy.  I like managing things from the file or directory level.  The space can get craved up however I want with whatever permissions I want and whatever quotas if I need them.  You can also password protect mount points.

The concern I had with Ceph is that to me if felt like most user executed operations required several steps whereas in LizardFS you might be modifying a config file and reloading or executing a single command.  I would still content that this is because Ceph is a more layered product that has capabilities other than CephFS.  I was looking for something as simple and as capable as possible- Cepth relative to LizardFS is not that.  I also really liked XtreemFS (simple and better geo-spacial performance but appears to now be defunct- only bug fixes) more than Cepth too but LizardsFS has the better overall feature set and monitoring as well as pretty active development. 

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug