Keith C. Perry on 14 Aug 2018 15:00:40 -0700
Re: [PLUG] Virtualization clusters & shared storage
LizardFS scales to 12 exabytes... For Ceph I could not quickly find a figure- call it the same or more... I think you or anyone else using either system would be fine :D

Both CephFS and LizardFS keep metadata separate from data, which is infinitely better than storing metadata and data together- I can give you some horror stories about that with commercial systems. In this regard there is going to be little difference between Ceph and LizardFS. They're both going to be able to perform well with proper builds. As far as I've seen reported, LizardFS does tend to be faster than other parallel systems, but I don't remember seeing a comparison to CephFS. I suspect they are probably similar in real-workload IOPS.

My ZFS idea was a throwaway, LOL. I don't know enough about that system to make a recommendation other than to 1) not use ZFS and 2) bring up another node if you want more redundancy. Honestly, that would be my answer in any case unless you want to archive.

My hashing idea was also a throwaway, but I don't have the ECC concerns, and you are right about checking all copies in the case of mirroring (erasure coding would be even more of a pita to check). However, scrubbing IS part of the system.

ZFS under LizardFS without ECC RAM... As I said before, there are no free lunches- you either have ECC RAM or you don't. ZFS without ECC RAM is NOT as safe as ZFS with ECC RAM.

https://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

The most important line in that is this: "All that stuff about ZFS self-healing goes down the drain if the system isn't using ECC RAM"

It's not ZFS that makes you **safer**, it's ECC RAM. You can debate the merits of ZFS versus other filesystems, but that is a different conversation. Actually, after reading that link, I definitely would not recommend ZFS under a parallel file system without ECC.
The potential for making matters worse is non-zero and, as you point out, now we have the parallel file system correcting errors when it shouldn't have to.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
Keith C. Perry, MS E.E.
Managing Member, DAO Technologies LLC
(O) +1.215.525.4165 x2033
(M) +1.215.432.5167
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Monday, August 13, 2018 6:25:00 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Mon, Aug 13, 2018 at 5:28 PM Keith C. Perry <kperry@daotechnologies.com> wrote:
>
> I chose that link to illustrate that Ceph typically needs more steps
> to manage. In Ceph it re-balances automatically across OSDs as
> expected, but adding and removing an OSD is more manual. In LizardFS,
> re-balancing is re-balancing. Adding or removing resources is the same
> thing, failed or not (unless you have data with a goal of 1), because
> LizardFS concerns itself with the state of **file** chunks. Naturally
> this implies that the state of a chunk server is known, but the
> important difference is that LizardFS works at a lower level. So to
> me, that's 3 wins... better granularity, simpler and more flexible
> management. Ceph is a different product- a product with wider scope,
> and for that reason there is more complexity in their offerings. The
> FS part is maturing fast but they are really more known for their
> object storage and block devices.

I'm not sure I entirely agree with all of that statement, but after reading up a bit more on LizardFS, one contrast I will make is that they've made a few different design choices that have pros and cons.

One of Ceph's big objectives was to eliminate all bottlenecks so that it can scale really big. This led to driving object storage out of a hash function.
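To make that concrete, here is a toy sketch of purely hash-driven placement. This is a naive modulo scheme for illustration only- Ceph's real CRUSH algorithm is far more sophisticated and moves much less data on a config change- but the principle that placement is computed from the configuration, not looked up, is the same. All node and object names here are made up:

```python
import hashlib

def placement(object_id, nodes):
    # Purely algorithmic: anyone with the node list can compute this,
    # so no central lookup is needed.
    h = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["osd%d" % i for i in range(10)]
objects = ["obj%d" % i for i in range(10000)]

before = {o: placement(o, nodes) for o in objects}
# Adding one node changes the configuration, hence the hash mapping:
after = {o: placement(o, nodes + ["osd10"]) for o in objects}

moved = sum(1 for o in objects if before[o] != after[o])
# With naive modulo hashing, roughly 90% of objects change destination.
print("%.0f%% of objects moved after adding one node" % (100.0 * moved / len(objects)))
```

The point of the toy: when placement is a pure function of the configuration, any configuration change reshuffles data, and the design work goes into making that reshuffle small.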
There is no central database of what object is stored where - the storage is algorithmic, so anybody with the shared library and the configuration can figure out exactly where any object will be stored. This means that clients can go directly to storage nodes and there is no master server (for object/volume storage - the POSIX filesystem layer does require a metadata server that is more of a bottleneck).

However, this also means that when you add one node the configuration changes, which changes the input to the hash function, and it basically changes the destination of every single chunk of data everywhere and causes a huge rebalance. Apparently when the cluster gets really tiny that hash function has some bugs that can cause issues, and from a bit of searching online that might not be entirely limited to small clusters, but I didn't see many recent reports of this problem.

LizardFS has a master server, which means it can be a lot more flexible about where stuff is stored, since the master server just keeps a database of what goes where. When you add a node it doesn't have to move everything around - it just allocates new stuff to the new nodes and records where it put everything. It of course backs up this critical metadata, but at any one time there is only one master server, and every single operation ends up hitting the master server to figure out where the data goes. Now, it looks like they tried to minimize the IO hitting the master server for obvious reasons, but it isn't going to be infinitely scalable.

> You certainly could use ZFS under LizardFS and then figure out how to
> integrate the ZFS snapshot process into a "backup" procedure for the
> cluster.

That sounds incredibly problematic. You have many nodes. You'd need to somehow pause the cluster so that those nodes all reach a consistent state, then do the snapshots everywhere, and then resume the cluster.
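For contrast, the master-server model described above can be caricatured in a few lines. Because placement is recorded in a database rather than computed from the configuration, adding a node forces no rebalance at all. This is a toy round-robin stand-in, not LizardFS's actual allocation policy, and all names are invented:

```python
class ToyMaster:
    """Toy metadata server: placement is recorded, not computed."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.locations = {}   # chunk id -> chunk server
        self._next = 0

    def place(self, chunk_id):
        # Round-robin for simplicity; a real master weighs free space etc.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        self.locations[chunk_id] = node
        return node

    def add_node(self, node):
        # Existing records are untouched; only future chunks can land here.
        self.nodes.append(node)

master = ToyMaster(["cs0", "cs1", "cs2"])
before = {c: master.place(c) for c in ("chunk%04d" % i for i in range(1000))}
master.add_node("cs3")
moved = sum(1 for c, n in before.items() if master.locations[c] != n)
print(moved)  # 0 -- adding a node moves nothing
```

The trade is exactly as described: total flexibility about where data sits, in exchange for every placement decision flowing through one server.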
Then when you do the backup you're capturing all the redundancy in your backup, since ZFS doesn't know anything about that, so your backups are wasteful. During the pause you're effectively offline.

> I looked up the cost of RAM for my machine intelligence workstation...
> 8GB ECC was ~$118 and 8GB non-ECC was $83... So, I'm talking commodity
> hardware.

A used COMPUTER at Newegg costs $100-150 total. Clearly we're not talking about spending $83 on RAM. However, the relative prices are about right for the RAM itself.

Again, though, you also need a CPU+motherboard that supports ECC. For AMD this doesn't really cost you more as long as you're careful when selecting the motherboard. For Intel it will likely cost you quite a bit more, as they deliberately disable ECC on any CPU in the sweet spot price-wise. And I'm still not aware of any cheap ECC ARM solutions.

> I suppose as a protection mechanism you could generate signatures
> (with a fast hash like SHA-1) for every file on your system and store
> them in a database so that you can periodically run a check against
> your parallel filesystem storage. You could prove to yourself in a
> couple of years that your solution is viable (or not) without ECC.

This suffers for a few reasons:

1. I want to check all the redundant copies, and reading the files with a filesystem call will only check one copy (no idea which).

2. This necessitates moving all the data for each file to a single node to do the checksum, which means a ton of network traffic.

3. This operation would be ignorant of how data is stored on-disk, which means purely random IO all over the cluster. Sure, the cluster might be able to handle it, but it is unnecessary disk seeking.

Ideally this scrubbing should be part of the filesystem, so that all the nodes can do a low-priority scrub of the data stored on that node. This could be done asynchronously with no network traffic (except to report status or correct errors), and at idle priority on each node.
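A per-node scrub of that kind could look roughly like this- a minimal sketch, assuming a hypothetical on-disk layout where each chunk file sits next to a stored ".sum" checksum; no real chunk-server format is implied:

```python
import hashlib
import os
import tempfile

def scrub(chunk_dir):
    """Recompute each local chunk's checksum and compare with the stored one.

    Purely local work: no network traffic is needed unless a mismatch
    has to be reported so the cluster can re-replicate the chunk.
    """
    bad = []
    for name in os.listdir(chunk_dir):
        if name.endswith(".sum"):
            continue
        path = os.path.join(chunk_dir, name)
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        with open(path + ".sum") as f:
            stored = f.read().strip()
        if digest != stored:
            bad.append(name)   # here you'd ask the cluster to regenerate it
    return bad

# Demo: two chunks, one silently corrupted after its checksum was stored.
d = tempfile.mkdtemp()
for name, data in [("chunk_0001", b"good data"), ("chunk_0002", b"also good")]:
    with open(os.path.join(d, name), "wb") as f:
        f.write(data)
    with open(os.path.join(d, name + ".sum"), "w") as f:
        f.write(hashlib.sha256(data).hexdigest())
with open(os.path.join(d, "chunk_0002"), "wb") as f:
    f.write(b"bit-flipped")

print(scrub(d))  # ['chunk_0002']
```

Since each node only reads its own disks, this sidesteps all three objections above: every replica gets checked where it lives, nothing crosses the network, and the IO can be sequential and idle-priority.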
This would also hit all the redundant copies.

Now, for LizardFS on top of ZFS I'm not super-worried about silent corruption of data at rest, because ZFS already detects that. The only issue would be triggering renewal of any bad files. ZFS could tell you that a particular file (at the lower level) is corrupt, but then you need to get LizardFS to regenerate that file from the rest of the cluster. It would get an IO error when it goes to read that file, so as long as scrubs read all the data on disk you would be fine, because ZFS would generate an IO error reading a file it knows is bad, and then presumably LizardFS would go and find the redundant data and regenerate it, overwriting the bad file with a new good one.

This wouldn't cover memory issues, though, that impacted the data before ZFS wrote it to disk. If ZFS is handed corrupt data to store, it will simply ensure that the data remains in the same corrupt state it was given.

-- 
Rich

___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug