Rich Freeman on 17 Aug 2018 17:06:05 -0700
Re: [PLUG] Virtualization clusters & shared storage
On Fri, Aug 17, 2018 at 7:41 PM Keith C. Perry <kperry@daotechnologies.com> wrote:
>
> "Again, the principle is layering. If you embed a checksum at a higher
> layer, and verify it at a higher layer, then the lower layer is a
> black box. The higher layer has a high degree of assurance that the
> data is intact no matter how unreliable the lower layer is, if the
> checksum verifies and is sufficiently long. Confidence has nothing to
> do with which system is weaker, though reliability does. That is, if
> your sha512 of the data matches, it is EXTREMELY unlikely that anything
> changed randomly (if unsigned then deliberate tampering is still a
> concern), so you have VERY high confidence. However, if your disk
> storage system randomly alters 1% of its data every day then your
> sha512s will probably always mismatch, giving you HIGH confidence of
> your LOW reliability."
>
> Putting aside my statements about ECC being used elsewhere in
> hardware, your statement is omitting the very important point that
> the software layers sit on top of hardware. If we treat the hardware
> layer as a black box then the difference is that either you have a
> layer that detects and corrects single-bit errors or you don't. If
> it is the latter then it is entirely possible that you are checking
> bad data, since software must assume hardware is telling it the right
> thing. The total system confidence would have to weight a system
> with ECC RAM higher than systems without. Furthermore, each layer
> is just confirming the data handed off at the interface.

The client WOULD have ECC, and thus it could reliably verify the
checksums. I agree that the storage nodes, without ECC, could not
reliably do so. They would still attempt to do so, since it is still
better to detect errors earlier, and during scrubbing/etc. It is fine
if they miss an error, since it will eventually be picked up.

I'm not considering the "hardware" as some general layer that is a
black box.
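To make the layering point concrete, here is a minimal sketch (hypothetical helper names; Python's hashlib for sha512) of a client that records a checksum at write time and later uses it to pick out an intact replica, treating everything below it as a black box:

```python
import hashlib


def store(data):
    """Client computes a sha512 digest at write time, before the data
    ever touches the (untrusted) storage layer."""
    return data, hashlib.sha512(data).hexdigest()


def pick_good_copy(replicas, digest):
    """Return the first replica whose sha512 matches the digest recorded
    at write time, or None if every copy has been altered."""
    for copy in replicas:
        if hashlib.sha512(copy).hexdigest() == digest:
            return copy
    return None


data, digest = store(b"important record")

# One replica silently corrupted on disk: the digest identifies
# which copy is still correct, not merely that the copies differ.
good = pick_good_copy([b"important recorc", data], digest)
assert good == data

# Every replica damaged: the client knows none can be trusted.
assert pick_good_copy([b"xx", b"yy"], digest) is None
```

This is only a sketch of the idea, not any particular filesystem's implementation, but it shows why the verification can live at a layer that runs on reliable (ECC) hardware even when the storage nodes below it do not.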
I'm saying the client software and the hardware it is running on (with
ECC) is one layer, and the storage nodes, with their software and
hardware (without ECC), are another.

Also, any decent software checksum is not going to be designed to only
detect single-bit errors. I'd use a much longer hash, which would
detect almost any kind of random modification.

> "I'm not really sure where I'm even advocating for ZFS here
> specifically. It is needed with something like lizardfs if it doesn't
> provide any protection for data at rest. Ceph has bluestore which
> makes zfs completely unnecessary. btrfs is also a perfectly fine
> solution conceptually, though its implementations have been
> problematic over the years."
>
> Except that you kind of just did :D  I mentioned before LizardFS
> has a regular process that checks chunks to make sure they have
> accurate data, but you are saying that it needs something like <insert
> something else here>. If you don't like LizardFS's approach that's fair,
> but to say it doesn't have this ability is incorrect

How does LizardFS, without ZFS, detect and recover from modifications
to data on-disk? Does it compute its own checksums?

> I do not think you can even have parallel file systems without the
> ability to check that the data is consistent across all copies. That
> would be useless.

Determining that data is consistent across all copies is not
sufficient. You have to be able to determine what the CORRECT data is.

In fact, Ceph not running on a storage layer that implements
checksumming (such as bluestore or zfs) DOES lack this protection.
When it does a scrub it confirms that all the copies are identical.
However, if not using a checksumming filesystem it cannot tell which
copy is right - only that they differ. That means that manual
intervention is required to resolve the issue.

zfs and bluestore don't just determine that copies differ - they
determine which one is right. They can even determine if neither is
correct.

> So...
> "What is your alternative? As I said ZFS isn't the only solution that
> provides on-disk checksums, but as far as I can tell you aren't
> advocating for any alternative which does."
>
> ...wasn't presented because there is a process...
>
> https://github.com/lizardfs/lizardfs/blob/master/external/crcutil-1.0/README

What is this? What does it actually do? I gather that it is a library
used by lizardfs in some way. I have no idea from this file whether it
has anything to do with protecting data at rest.

> "If you're using lizardfs on ext4, and lizardfs doesn't checksum data
> at rest, and you have data modified on disk, then at best it will
> detect a discrepancy during scrubbing and not know how to resolve it.
> At worst it will provide incorrect data to clients before the next
> scrub, which will get used to process data, and maybe store new data
> on the cluster, which then gets created with matching copies that
> aren't detectable with scrubbing. Errors can propagate if not
> detected."
>
> ...is not correct.

What isn't correct about it? IF lizardfs doesn't checksum data at
rest, then all my concerns apply. Technically the statement is still
true if the condition is false, as is the case with any if statement,
but that is a bit beside the point.

In any case I'm mostly interested in whether LizardFS protects data at
rest, in particular using a checksum that is generated and verified
outside of any of the storage nodes.

> "Sounds like you just described another solution for backups..."
>
> Yes, by utilizing the parallel file system's native abilities. I
> wouldn't be doing anything other than physically rotating media
> around.

As long as the media isn't accessible to the filesystem while it is
rotated out, then that is fine.

> Putting aside your point about human error (if I can wipe out my
> cluster, why **can't** I wipe out my S3 buckets?)

You're not going to do it in one command line.
Sure, deliberate sabotage is a potential issue (and something that any
serious company will safeguard against). But you aren't going to
accidentally wipe out both your S3 and your LizardFS storage. If you
can, then you're doing it wrong.

You absolutely have to protect against logical errors when you have a
backup strategy. I don't trust myself not to make a mistake. Maybe my
ansible script has a bug and wipes out every node in my storage
cluster. You can't keep your backups in a place where they are easy to
delete by accident.

If you're talking about a business with multiple employees, then you
also can't keep them in a place where one employee can wipe out both
your active data and the backups, even deliberately. Separation of
privilege is useful both to prevent accidents and ill intentions...

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug