Keith C. Perry on 13 Aug 2018 14:29:16 -0700



Re: [PLUG] Virtualization clusters & shared storage


We're getting a bit off topic, so I'll hit the Ceph points and then make some ECC RAM points.

Ceph
^^^^

I chose that link to illustrate that Ceph typically needs more steps to manage.  Ceph re-balances automatically at the OSD level as expected, but adding and removing an OSD is more manual.  In LizardFS, re-balancing is re-balancing.  Adding or removing resources is the same operation whether the resource failed or not (unless you have data with a goal of 1), because LizardFS concerns itself with the state of **file** chunks.  Naturally this implies that the state of a chunk server is known, but the important difference is that LizardFS works at a lower level.  So to me, that's 3 wins... better granularity, simpler and more flexible management.  Ceph is a different product- a product with wider scope, and for that reason there is more complexity in its offerings.  The FS part is maturing fast, but Ceph is really better known for its object storage and block devices.
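
A quick illustration of what that looks like in practice- this is only a minimal sketch, assuming the stock lizardfs client tools and a hypothetical mount at /mnt/lizardfs:

# keep 3 copies of everything created under this directory
lizardfs setgoal 3 /mnt/lizardfs/projects
# confirm the goal that is in effect
lizardfs getgoal /mnt/lizardfs/projects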

In regards to the ZFS type of snapshot, that is not there.  However, I struggle to find the need for it.  If we are saying data is growing and we're going to move to a distributed filesystem to take advantage of the replication and ease-of-management features, only to say we're concerned about backing it up again, then we're going to need even more space.  My current data protection solution (not on LizardFS yet) keeps all data versions in more than one place for whatever retention period is required.  That storage system IS the backup for the primary fileserver, but that means there are 3 copies of current data plus anything versioned.  This is what parallel filesystems excel at- if I want 3 copies of my data and want to snapshot to bring in some versioning, I can do that in one system and gain a lot of flexibility.  There is no functional need to produce another backup unless you want to be more paranoid (I'm not going to say that is bad).  So 4 copies?... ok, then why not just bring up a 4th storage node?  At some point you're going to hit diminishing returns even if you are sleeping better.  I seem to recall a ZFS fan bringing snapshot sending up on the list.  One thing that ZFS folks in particular get wrong is that ZFS != parallel filesystem.  It's still closer to basic filesystem concepts than it is to distributed or parallel filesystem concepts.  You certainly could use ZFS under LizardFS and then figure out how to integrate the ZFS snapshot process into a "backup" procedure for the cluster.
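
And if you do want point-in-time copies inside the cluster, a snapshot is a one-liner.  Another sketch, again assuming the lizardfs client tools and hypothetical paths:

# lazily clone the directory tree; chunks are only duplicated when either copy changes
lizardfs makesnapshot /mnt/lizardfs/data /mnt/lizardfs/snapshots/data-2018-08-13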

It goes without saying that if you were doing this to archive data then that would be a different conversation... and even more storage  :)


ECC
^^^

I looked up the cost of RAM for my machine intelligence workstation...  8GB ECC was ~$118 and 8GB non-ECC was $83... 42% is a high premium **percentage**, but in the grand scheme of things, would the additional $140 for 32GB of ECC vs 32GB of non-ECC be worth the peace of mind?  That is an individual choice and is an example of the current state of affairs.  What is true is that the price premium is shrinking.
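
For what it's worth, the quick math behind those numbers (just shell arithmetic with bc):

# premium percentage for one 8GB stick
echo 'scale=4; (118 - 83) / 83 * 100' | bc
42.1600
# extra cost for 4 sticks (32GB)
echo '4 * (118 - 83)' | bc
140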

The vast majority of consumer and business computers are not using ECC RAM... this is especially true with laptops, which is what most people and businesses are buying.  So, how is it that we aren't seeing a systemic issue with bits getting silently flipped in our data that ends up on one of these wonderful servers with that glorious ECC memory?  Garbage in would be happily and successfully checked, so where is the outrage?

Answer?  This is not a problem-

But CERN says...

Yea, I'm aware of their study and it doesn't change my statement, because when I say it's "not a problem" it is because, as you and I previously stated, data is checked as it is passed along through our systems.  Previously I mentioned that IP has error-checking functions.  Another place we have them is in the PCI subsystem, in hardware.  Here is an example of what has been happening on my laptop today...

[Mon Aug 13 10:53:00 2018] pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
[Mon Aug 13 10:53:00 2018] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
[Mon Aug 13 10:53:00 2018] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000                                                                                   
[Mon Aug 13 10:53:00 2018] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)                                                                                                     

Ok, the error got corrected, but what is device 8086:9d15?

lspci -vn

00:1c.5 0604: 8086:9d15 (rev f1) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 123
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        Memory behind bridge: df100000-df1fffff
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: 1043:1b1d
        Capabilities: [a0] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Access Control Services
        Capabilities: [200] L1 PM Substates
        Capabilities: [220] #19
        Kernel driver in use: pcieport
        Kernel modules: shpchp


now lspci -tnn

-[0000:00]-+-00.0  Intel Corporation Skylake Host Bridge/DRAM Registers [8086:1904]
           +-02.0  Intel Corporation HD Graphics 520 [8086:1916]
           +-04.0  Intel Corporation Skylake Processor Thermal Subsystem [8086:1903]
           +-14.0  Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f]
           +-14.2  Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31]
           +-16.0  Intel Corporation Sunrise Point-LP CSME HECI #1 [8086:9d3a]
           +-17.0  Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode] [8086:9d03]
           +-1c.0-[01]----00.0  NVIDIA Corporation GM108M [GeForce 940M] [10de:1347]
           +-1c.5-[02]----00.0  Intel Corporation Wireless 7265 [8086:095a]
           +-1f.0  Intel Corporation Sunrise Point-LP LPC Controller [8086:9d48]
           +-1f.2  Intel Corporation Sunrise Point-LP PMC [8086:9d21]
           +-1f.3  Intel Corporation Sunrise Point-LP HD Audio [8086:9d70]
           \-1f.4  Intel Corporation Sunrise Point-LP SMBus [8086:9d23]


Damn Intel wireless, but I digress...

(this is not surprising though)

There are a number of chips in hardware subsystems that use error correcting code to detect and automatically fix bit errors.  The reason why hardware has to do this is because the hardware is the physical thing that is ultimately responsible for successful I/O to the human- whether it is a keyboard or network traffic.  The CRCs at the IP layer would be meaningless if the PCI bus and NIC chipsets didn't do something to prevent garbage in.  Sure, you can generate CRCs in software and pass them along, but ONLY if there is something external to that process playing a supervisory role to make sure the software is manipulating good data.  Furthermore, the reason why ECC is a good-to-have and not a must-have is because we have adequate checking elsewhere in hardware, so there is good data fidelity without it.
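
To make the software side of that concrete, here is the sort of end-to-end check you can do yourself- a trivial sketch with made-up paths, using the stock cksum tool:

# CRC and byte count of the source file
cksum /tmp/payload.bin
# push a copy across whatever stack you like (filesystem, network mount, cluster)
cp /tmp/payload.bin /mnt/cluster/payload.bin
# re-check on the far side; a differing CRC means something corrupted the data in between
cksum /mnt/cluster/payload.bin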

Assuming good memory was in a system to begin with, when you find consistently bad memory it is usually because of something else that happened.  That something else, in my experience, is some sort of failure in either the system power supply or motherboard (and usually there, in the power stages).  After that come external power events- over-voltage, possibly under-voltage, or "dirty" power (i.e. noise)- then thermal events, then mechanical ones (e.g. vibrations).  I wasn't saying memory errors don't happen, just that, generally speaking, it would be incredibly strange not to be able to track bad memory back to another event.  Certainly the edge case can always happen, no matter how rare it is.

I really do get your concerns, and in a perfect world ECC RAM would be the norm, but there is a reason why that hasn't become the case.  Quick anecdote... I recently found and imaged old Maxtor system disks from 2 of my earliest Linux workstations... circa 1996.  I was able to launch both images in virtualization, and I went through just about everything in my home directories and the data was sound.  2 weeks later, BOTH those drives spun no more.  They were Maxtors, so that is not surprising, but the workstations they were in also did not have ECC RAM.  I think we've been making decent motherboards for a while as far as data integrity is concerned, and they have only gotten better.  Considering that almost all of these distributed and parallel filesystems- even something like Lustre (http://lustre.org/getting-started-with-lustre/)- talk about being usable on commodity hardware, there must be a consensus that ECC RAM is not a **must** have for these systems.

I suppose as a protection mechanism you could generate signatures (with a fast hash like SHA-1) for every file on your system and store them in a database so that you can periodically run a check against your parallel filesystem storage.  You could prove to yourself in a couple of years that your solution is viable (or not) without ECC.
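
A minimal sketch of that idea, using a flat signature file instead of a database and a hypothetical mount at /mnt/lizardfs:

# build the baseline of SHA-1 signatures
find /mnt/lizardfs -type f -print0 | xargs -0 sha1sum > /root/baseline.sha1
# later (weeks or months on), re-check the cluster and print only files that no longer match
sha1sum -c --quiet /root/baseline.sha1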


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Monday, August 13, 2018 7:13:10 AM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Mon, Aug 13, 2018 at 1:44 AM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> Comment below are prepended with ">>"

Uh, you might want to try another MUA for lists.

In any case, everything below without quoting is a reply, and
everything else has one more > than I received it with.

> From: "Rich Freeman" <r-plug@thefreemanclan.net>
> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
> Sent: Sunday, August 12, 2018 8:40:14 PM
> Subject: Re: [PLUG] Virtualization clusters & shared storage
>
> On Sun, Aug 12, 2018 at 6:42 PM Keith C. Perry
> <kperry@daotechnologies.com> wrote:
> >
>
> >> Ceph's operations, re-balancing and the like, from what I remember
> >> work at the OSD level, not at the file or directory level.  I don't
> >> remember finding anything in the documentation that supports the
> >> latter.  CephFS might be closer to an SDS than I give it credit for.  For me,
> >> LizardFS is a simpler solution.

Ceph's policies work at the pool level.  A cluster can contain many
pools.  Any particular OSD can store data from multiple pools, and
typically all the OSDs contain data for all the pools.

>
> > Are you 100% sure there were not any external forces that created this event?
>
> Considering the system could barely stay up for more than 10 minutes
> once it started failing, I'm pretty confident the memory was bad.
>
> >> that doesn't scream memory to me actually

Well, memtest86 thought otherwise.  It is possible the issue was in
the CPU or the motherboard I suppose.  Either way data was getting
corrupted.  And if checksums were being computed off of the device
entirely it wouldn't really matter where.

>
> >> I'm not sure I agree with your statement about CRCs.  However, even
> >> if that were true at every layer of the stack, it would break the
> >> concept of layer independence to not throw them out after the
> >> data has been successfully passed up or down.

Actually, layer independence is exactly why you WOULDN'T throw them
out.  The lower layers need not check the checksums as long as they
preserve all the data they're given and pass it back.  Having the
lower layers know about them would greatly optimize scrubbing, though,
as they wouldn't have to transmit the data across the network to
verify it.

If you want to use independent layers, then the lower layers are just
given a block of data to store, and they store it.  That block of data
just happens to contain a checksum.

> >> Trying to handle this in software is not going to work. We're
> >> saying a bit flip is an error that software is not going to pick up-
> >> hence the need for ECC.  If you're not using ECC then all bets are off
> >> since the computed CRC could be wrong.

Sure, but since that CRC is sent in a round trip before the sync
completes, if the CRC were computed wrong then you'd get a mismatch
before the sync finishes, which means the client still has the correct
data and can repeat the operation.

And if the write completes, and then a CRC is computed incorrectly on
read, it isn't a problem, because you can just use the data from the
other nodes and scrub or kick out the bad node.  If the CRCs are long
enough the chance of having multiple bit flips result in a wrong but
consistent set of data is very low.

> >> In the example you gave you mentioned you had ECC in the clients
> >> and NOT in the storage OSDs / chunk servers.  To borrow your own
> >> language, if you care about your data and have this concern, won't it
> >> be "silly" not to use ECC memory there too?  Why go "cheap" when
> >> you're talking about your data?

ECC only protects against RAM errors.  What if there is a CPU error?

The whole point of having complete hardware redundancy is to handle a
hardware failure at any level (well, that and scalability).  If you
generate the CRCs at a higher layer, and verify them at a higher
layer, then the lower layer doesn't have to be reliable, since you
have redundancy.

Doing it in software is essentially free, so why not do it?

And ECC memory certainly isn't cheap.  If you are running Intel you'll
be charged a pretty penny for it, if you're using AMD you just have to
pay for the RAM but that is expensive, and if you're using ARM good
luck with using it at all.

Obviously I care about my data, which is why I'm not going to touch
any of these solutions unless I'm convinced they address this problem.
Right now I'm using ECC+ZFS, but that is fairly limiting because I am
restricted to the number of drives I can fit in a relatively
inexpensive single box.  With a cluster filesystem I can in theory
pick up cheap used PCs as disposable nodes.

> >
> > I was thinking more in terms of snapshots, serialization/etc.  Ie I
> > want to backup the cluster, not backup every node in the cluster
> > simultaneously/etc.
> >
> > I think I answered that- "the cluster" is its /etc configurations and metadata.  The /etc files are static.
> >
> > Yes, you can snapshot data.
> >
> > I don't know what you mean by serialization, example?
>
> I have two snapshots of the same data at different times, and I want a
> file/pipe that can be used to reproduce one snapshot from the other,
> or a file/pipe that can be used to recreate one of the snapshots given
> nothing else.  Backups, essentially.
>
> >> This sounds like versioning to me and that is what a snapshot gives
> >> you.  For better description of what is happening I would refer you
> >> to the whitepaper since it is covered in there.

Of course a snapshot gives you versioning.  What I want is the ability
to get the snapshot OFF of the cluster for backups/etc, or just to
move data around.

zfs send or btrfs send are examples of serialization.  So is tar.
However, utilities that aren't integrated with the filesystem can't
efficiently do incremental backups as they have to read every file to
tell what changed, or at least stat them all.

> >> In Ceph I seem to recall that that sort of re-balancing is done manually: http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/

Uh, did you read that page?  It contains no steps that perform
re-balancing.  All the instructions are simply for adding or removing
the OSD itself.  The cluster does the rebalancing whenever an OSD is
added or removed, unless this is disabled (which might be done during
administration, such as when adding many nodes at once so that you
aren't triggering multiple rounds of balancing).

> >> <-- In LizardFS I don't have to worry about the concept of weights.
> >> When it comes to removing a disk or chunk server, there is no need
> >> to migrate data from it unless something had a goal of 1.

That sounds like a Ceph bug, actually, despite them describing it as a
"corner case" - when a node is out the data should automatically
migrate away, especially if you aren't around.  If I have a device
fail I don't really have the option to mark it back in and mess around
with weights.  I need the data to move off automatically.

But, reading that makes me less likely to use Ceph, as the developers
seem unconcerned with smaller clusters...

So, thanks for pointing that out.

> The concern I had with Ceph is that to me it felt like most user
> executed operations required several steps whereas in LizardFS you
> might be modifying a config file and reloading or executing a single
> command.

And that is a concern I share.  There is an ansible playbook for
managing a Ceph cluster which makes the whole thing a lot more
goal-driven.  However, it also makes me worry that a playbook bug
could end up wiping out my whole cluster.  Also, I had trouble setting
per-host configs and I suspect it was an incompatibility between the
playbook and the version of ansible I was running.  Of course, as with
many upstream projects, the playbook docs did not mention any strict
dependency version requirements.  That makes me think I'd need an
Ubuntu container just to run my playbooks since that is probably what
the authors are developing it on.

But, in my testing the playbooks just worked.  You pointed them at a
bunch of Debian hosts and they would go find every disk matching the
given list attached to all of them, wipe them, configure them for
Ceph, and assemble them into a cluster.  If you added a disk it
would get noticed the next time the playbook is run and that disk
would be added to the cluster.  Failures would be handled by the
cluster itself.  I never ran into that "corner case" they mentioned
but that doesn't make me less concerned - I'd want to understand
exactly what causes it and whether it would apply to me before I used
Ceph for anything really.  Maybe it only applies if you try to split
10MB across 3 nodes.  Or maybe it is one of those corner cases that
"probably" won't wipe out your 50PB production cluster...

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug