Rich Freeman on 13 Aug 2018 15:25:18 -0700
Re: [PLUG] Virtualization clusters & shared storage
On Mon, Aug 13, 2018 at 5:28 PM Keith C. Perry <kperry@daotechnologies.com> wrote:
>
> I chose that link to illustrate that Ceph typically needs more steps
> to manage. In Ceph it re-balances automatically per OSD as expected,
> but adding and removing an OSD is more manual. In LizardFS,
> re-balancing is re-balancing. Adding or removing resources is the
> same thing, failed or not (unless you have data with a goal of 1),
> because LizardFS concerns itself with the state of **file** chunks.
> Naturally this implies that the state of the chunk server is known,
> but the important difference is that LizardFS works at a lower
> level. So to me, that's 3 wins... better granularity, simpler and
> more flexible management. Ceph is a different product - a product
> with wider scope, and for that reason there is more complexity in
> their offerings. The FS part is maturing fast, but they are really
> better known for their object storage and block devices.

I'm not sure I entirely agree with all of that statement, but after
reading up a bit more on LizardFS, one contrast I will make is that
they've made a few different design choices that have pros and cons.

One of Ceph's big objectives was to eliminate all bottlenecks so that
it can scale really big. This led to driving object placement from a
hash function. There is no central database of what object is stored
where - placement is algorithmic, so anybody with the shared library
and the configuration can figure out exactly where any object will be
stored. This means that clients can go directly to storage nodes and
there is no master server (for object/volume storage - the POSIX
filesystem layer does require a metadata server, which is more of a
bottleneck).

However, this also means that when you add one node the configuration
changes, which changes the input to the hash function, and that
basically changes the destination of every single chunk of data
everywhere, causing a huge rebalance. Apparently when the cluster gets
really tiny that hash function has some bugs that can cause issues,
and from a bit of searching online that might not be entirely limited
to small clusters, though I didn't see many recent reports of the
problem.

LizardFS has a master server, which means it can be a lot more
flexible about where stuff is stored, since the master server just
keeps a database of what goes where. When you add a node it doesn't
have to move everything around - it just allocates new stuff to the
new nodes and records where it put everything. It of course backs up
this critical metadata, but at any one time there is only one master
server, and every single operation ends up hitting the master server
to figure out where the data goes. It looks like they tried to
minimize the IO hitting the master server for obvious reasons, but it
isn't going to be infinitely scalable.
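To make that contrast concrete, here is a toy sketch in Python. It is
not Ceph's CRUSH algorithm or LizardFS's metadata format - just a
deliberately naive model of "placement computed from a hash of the
configuration" versus "placement recorded in a master table", to show
why a configuration change forces relocation in the first scheme but
not the second:

import hashlib

def hash_placement(chunk_id, nodes):
    # Anyone holding the node list can compute the location directly --
    # no lookup required, but the answer depends on the whole configuration.
    digest = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

chunks = ["chunk-%d" % i for i in range(10000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]          # add one storage node

moved = sum(hash_placement(c, old_nodes) != hash_placement(c, new_nodes)
            for c in chunks)
print("naive hash placement: %d of %d chunks would relocate"
      % (moved, len(chunks)))

# Master-server style: a lookup table. Existing entries never change when
# a node is added; only chunks written afterwards can land on the new node.
placement_table = {c: hash_placement(c, old_nodes) for c in chunks}
placement_table["chunk-10000"] = "node-d"   # new data goes to the new node

Ceph's real placement function is much more sophisticated than a bare
modulo, but the structural trade-off is the same: algorithmic placement
is a pure function of the configuration, so changing the configuration
changes placements, while a table only grows - at the cost of a central
server that every lookup has to consult.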
> You certainly could use ZFS under LizardFS and then figure out how
> to integrate the ZFS snapshot process into a "backup" procedure for
> the cluster.

That sounds incredibly problematic. You have many nodes. You'd need to
somehow pause the cluster so that those nodes all reach a consistent
state, then do the snapshots everywhere, and then resume the cluster.
Then when you do the backup you're capturing all the redundancy in
your backup, since ZFS doesn't know anything about it, so your backups
are wasteful. And during the pause you're effectively offline.

> I looked up the cost of RAM for my machine intelligence
> workstation... 8GB ECC was ~$118 and 8GB non-ECC was $83...

So, I'm talking commodity hardware. A used COMPUTER at Newegg costs
$100-150 total. Clearly we're not talking about spending $83 on RAM.
However, the relative prices are about right for the RAM itself.

Again, though, you also need a CPU+motherboard that supports ECC. For
AMD this doesn't really cost you more as long as you're careful when
selecting the motherboard. For Intel it will likely cost you quite a
bit more, as they deliberately disable ECC on any CPU in the sweet
spot price-wise. And I'm still not aware of any cheap ECC ARM
solutions.

> I suppose as a protection mechanism you could generate signatures
> (with a fast hash like SHA-1) for every file on your system and
> store them in a database so that you can periodically run a check
> against your parallel filesystem storage. You could prove to
> yourself in a couple of years that your solution is viable (or not)
> without ECC.

This suffers for a few reasons:

1. I want to check all the redundant copies, and reading the files
   with a filesystem call will only check one copy (no idea which).

2. It requires moving all the data for each file to a single node to
   do the checksum, which means a ton of network traffic.

3. The operation would be ignorant of how data is stored on-disk,
   which means purely random IO all over the cluster. Sure, the
   cluster might be able to handle it, but it is unnecessary disk
   seeking.

Ideally this scrubbing should be part of the filesystem, so that each
node can do a low-priority scrub of the data stored on that node. This
could be done asynchronously, with no network traffic (except to
report status or correct errors) and at idle priority on each node,
and it would also hit all the redundant copies. (There's a rough
sketch of this contrast at the end of this message.)

Now, for LizardFS on top of ZFS I'm not super-worried about silent
corruption of data at rest, because ZFS already detects that. The only
issue would be triggering renewal of any bad files. ZFS can tell you
that a particular file (at the lower level) is corrupt, but then you
need to get LizardFS to regenerate that file from the rest of the
cluster. LizardFS would get an IO error when it goes to read that
file, so as long as scrubs read all the data on disk you would be
fine: ZFS would return an IO error when reading a file it knows is
bad, and presumably LizardFS would then go find the redundant data and
regenerate it, overwriting the bad file with a new good one.

This wouldn't cover memory issues that corrupted the data before ZFS
wrote it to disk, though. If ZFS is handed corrupt data to store, it
will simply ensure that the data remains in the same corrupt state it
was given.

--
Rich
___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug
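For what it's worth, here is the rough Python sketch of that scrubbing
contrast mentioned above. The paths, table layout, and checksum files
are hypothetical, and none of this is LizardFS or ZFS code; it just
shows why checking signatures through the mounted filesystem behaves so
differently from letting each node scrub its own chunks:

import hashlib
import os
import sqlite3

def check_via_mount(mountpoint, db):
    # The signature-database proposal: read every file back through the
    # POSIX mount and compare against a stored SHA-1. Every byte crosses
    # the network, and only one replica of each chunk ever gets read.
    bad = []
    for path, expected in db.execute("SELECT path, sha1 FROM signatures"):
        h = hashlib.sha1()
        with open(os.path.join(mountpoint, path), "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        if h.hexdigest() != expected:
            bad.append(path)
    return bad

def scrub_local_chunks(chunk_dir):
    # Filesystem-style scrub: each storage node verifies the chunks it
    # holds against checksums stored beside them. No network traffic,
    # every replica gets checked, and it can run at idle I/O priority.
    bad = []
    for name in os.listdir(chunk_dir):
        if name.endswith(".sha1"):
            continue
        with open(os.path.join(chunk_dir, name), "rb") as f:
            actual = hashlib.sha1(f.read()).hexdigest()
        with open(os.path.join(chunk_dir, name + ".sha1")) as f:
            expected = f.read().strip()
        if actual != expected:
            bad.append(name)
    return bad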