Rich Freeman on 13 Aug 2018 04:13:27 -0700
Re: [PLUG] Virtualization clusters & shared storage
On Mon, Aug 13, 2018 at 1:44 AM Keith C. Perry <kperry@daotechnologies.com> wrote:
>
> Comments below are prepended with ">>"

Uh, you might want to try another MUA for lists.  In any case,
everything below without quoting is a reply, and everything else has
one more > than I received it with.

> From: "Rich Freeman" <r-plug@thefreemanclan.net>
> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
> Sent: Sunday, August 12, 2018 8:40:14 PM
> Subject: Re: [PLUG] Virtualization clusters & shared storage
>
> On Sun, Aug 12, 2018 at 6:42 PM Keith C. Perry
> <kperry@daotechnologies.com> wrote:

> >> Ceph's operations, re-balancing and the like, from what I remember
> >> work at the OSD level, not at the file or directory level.  I don't
> >> remember finding documentation that supports the latter.  CephFS
> >> might be closer to an SDS than I give it credit for.  For me,
> >> LizardFS is a simpler solution.

Ceph's policies work at the pool level.  A cluster can contain many
pools.  Any particular OSD can store data from multiple pools, and
typically all the OSDs contain data for all the pools.

> > Are you 100% sure there were not any external forces that created
> > this event?
>
> Considering the system could barely stay up for more than 10 minutes
> once it started failing, I'm pretty confident the memory was bad.

> >> that doesn't scream memory to me actually

Well, memtest86 thought otherwise.  It is possible the issue was in
the CPU or the motherboard, I suppose.  Either way data was getting
corrupted.  And if checksums were being computed off of the device
entirely it wouldn't really matter where.

> >> I'm not sure I agree with your statement about CRC's.  However,
> >> even if that were true at every layer of the stack it would break
> >> the concept of layer independence to not throw them out after the
> >> data has been successfully passed up or down.

Actually, layer independence is exactly why you WOULDN'T throw them
out.  The lower layers need not check the checksums as long as they
preserve all the data they're given and pass it back.  Having the
lower layers know about them would greatly optimize scrubbing, though,
as they wouldn't have to transmit the data across the network to
verify it.  If you want to use independent layers, then the lower
layers are just given a block of data to store, and they store it.
That block of data just happens to contain a checksum.

> >> Trying to handle this in software is not going to work.  We're
> >> saying a bit flip is an error that software is not going to pick
> >> up - hence the need for ECC.  If you're not using ECC then all bets
> >> are off since the computed CRC could be wrong.

Sure, but since that CRC is sent in a round trip before the sync
completes, if the CRC were computed wrong then you'd get a mismatch
before the sync finishes, which means the client still has the correct
data and can repeat the operation.  And if the write completes, and
then a CRC is computed incorrectly on read, it isn't a problem,
because you can just use the data from the other nodes and scrub or
kick out the bad node.  If the CRCs are long enough, the chance of
multiple bit flips resulting in a wrong but consistent set of data is
very low.
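To make that concrete, here is a toy sketch of the idea (Python,
purely illustrative - this is not how Ceph or LizardFS actually
implement their checksumming, just the "checksum at a higher layer,
dumb replicas underneath" concept):

    import hashlib

    # Toy model: three "storage nodes" that are just dumb byte stores with
    # no knowledge of checksums; the client computes and verifies them.
    nodes = [dict(), dict(), dict()]

    def _verify(record):
        """Return the payload if its embedded checksum matches, else None."""
        digest, data = record[:32], record[32:]
        return data if hashlib.sha256(data).digest() == digest else None

    def put(key, data):
        record = hashlib.sha256(data).digest() + data  # checksum travels with the data
        for node in nodes:                             # write every replica
            node[key] = record
        # round trip: re-read and verify before considering the write complete
        assert all(_verify(node[key]) == data for node in nodes)

    def get(key):
        for node in nodes:                             # first replica that verifies wins
            data = _verify(node.get(key, b""))
            if data is not None:
                return data                            # a corrupt replica is simply skipped
        raise IOError("all replicas corrupt")

    put("vm.img", b"some important bytes")
    # Simulate silent corruption on one node; the read below still succeeds.
    nodes[0]["vm.img"] = b"\x00" * len(nodes[0]["vm.img"])
    assert get("vm.img") == b"some important bytes"

The point is that the storage nodes never interpret the checksum; the
client (or a scrubber, with redundant copies to fall back on) does all
of the verification.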
Why go "cheap" when > >> you're talking about your data? ECC only protects against RAM errors. What if there is a CPU error? The whole point of having complete hardware redundancy is to handle a hardware failure at any level (well, that and scalability). If you generate the CRCs at a higher layer, and verify them at a higher layer, then the lower layer doesn't have to be reliable, since you have redundancy. Doing it in software is essentially free, so why not do it? And ECC memory certainly isn't cheap. If you are running Intel you'll be charged a pretty penny for it, if you're using AMD you just have to pay for the RAM but that is expensive, and if you're using ARM good luck with using it at all. Obviously I care about my data, which is why I'm not going to touch any of these solutions unless I'm convinced they address this problem. Right now I'm using ECC+ZFS, but that is fairly limiting because I am restricted to the number of drives I can fit in a relatively inexpensive single box. With a cluster filesystem I can in theory pick up cheap used PCs as disposable nodes. > > > > I was thinking more in terms of snapshots, serialization/etc. Ie I > > want to backup the cluster, not backup every node in the cluster > > simultaneously/etc. > > > > I think answered that- "the cluster" is its /etc configurations and metadata. The /etc files are static. > > > > Yes, you can snapshot data. > > > > I don't know what you mean by serialization, example? > > I have two snapshots of the same data at different times, and I want a > file/pipe that can be used to reproduce one snapshot from the other, > or a file/pipe that can be used to recreate one of the snapshots given > nothing else. Backups, essentially. > > >> This sounds like versioning to me and that is what a snapshot gives > >> you. For better description of what is happening I would refer you > >> to the whitepaper since it is covered in there. Of course a snapshot gives you versioning. What I want is the ability to get the snapshot OFF of the cluster for backups/etc, or just to move data around. zfs send or btrfs send are examples of serialization. So is tar. However, utilities that aren't integrated with the filesystem can't efficiently do incremental backups as they have to read every file to tell what changed, or at least stat them all. > >> In Cepth I seem to recall that that sort of re-balancing is manually done http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ Uh, did you read that page? It contains no steps that perform re-balancing. All the instructions are simply for adding or removing the OSD itself. The cluster does the rebalancing whenever an OSD is added or removed, unless this is disabled (which might be done during administration, such as when adding many nodes at once so that you aren't triggering multiple rounds of balancing). > >> <-- In LizardFS I don't have to worry about the concept of weighs. > >> When it comes to removing a disk or chunk server, there is no need > >> to migrate data from it unless something had a goal of 1. That sounds like a Ceph bug, actually, despite them describing it as a "corner case" - when a node is out the data should automatically migrate away, especially if you aren't around. If I have a device fail I don't really have the option to mark it back in and mess around with weights. I need the data to move off automatically. But, reading that makes me less likely to use Ceph, as the developers seem unconcerned with smaller clusters... So, thanks for pointing that out. 
> >> In Ceph I seem to recall that that sort of re-balancing is manually
> >> done
> >> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/

Uh, did you read that page?  It contains no steps that perform
re-balancing.  All the instructions are simply for adding or removing
the OSD itself.  The cluster does the rebalancing whenever an OSD is
added or removed, unless this is disabled (which might be done during
administration, such as when adding many nodes at once so that you
aren't triggering multiple rounds of balancing).

> >> <-- In LizardFS I don't have to worry about the concept of weights.
> >> When it comes to removing a disk or chunk server, there is no need
> >> to migrate data from it unless something had a goal of 1.

That sounds like a Ceph bug, actually, despite them describing it as a
"corner case" - when a node is out the data should automatically
migrate away, especially if you aren't around.  If I have a device
fail I don't really have the option to mark it back in and mess around
with weights.  I need the data to move off automatically.

But, reading that makes me less likely to use Ceph, as the developers
seem unconcerned with smaller clusters...  So, thanks for pointing
that out.

> The concern I had with Ceph is that to me it felt like most
> user-executed operations required several steps, whereas in LizardFS
> you might be modifying a config file and reloading or executing a
> single command.

And that is a concern I share.  There is an ansible playbook for
managing a Ceph cluster which makes the whole thing a lot more
goal-driven.  However, it also makes me worry that a playbook bug
could end up wiping out my whole cluster.  Also, I had trouble setting
per-host configs, and I suspect it was an incompatibility between the
playbook and the version of ansible I was running.  Of course, as with
many upstream projects, the playbook docs did not mention any strict
dependency version requirements.  That makes me think I'd need an
Ubuntu container just to run my playbooks, since that is probably what
the authors are developing it on.

But, in my testing the playbooks just worked.  You pointed them at a
bunch of Debian hosts and they would find every attached disk that
matched the given list, wipe those disks, configure them for Ceph, and
assemble them into a cluster.  If you added a disk it would get
noticed the next time the playbook was run and that disk would be
added to the cluster.  Failures would be handled by the cluster
itself.

I never ran into that "corner case" they mentioned, but that doesn't
make me less concerned - I'd want to understand exactly what causes it
and whether it would apply to me before I used Ceph for anything,
really.  Maybe it only applies if you try to split 10MB across 3
nodes.  Or maybe it is one of those corner cases that "probably" won't
wipe out your 50PB production cluster...

-- 
Rich