Rich Freeman on 17 Aug 2018 17:06:05 -0700



Re: [PLUG] Virtualization clusters & shared storage


On Fri, Aug 17, 2018 at 7:41 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> "Again, the principle is layering.  If you embed a checksum at a higher
> layer, and verify it at a higher layer, then the lower layer is a
> black box.  The higher layer has a high degree of assurance that the
> data is intact no matter how unreliable the lower layer is if the
> checksum verifies and is sufficiently long.  Confidence has nothing to
> do with which system is weaker, though reliability does.  That is, if
> your sha512 of the data matches it is EXTREMELY unlikely that anything
> changed randomly (if unsigned then deliberate tampering is still a
> concern), so you have VERY high confidence.  However, if your disk
> storage system randomly alters 1% of its data every day then your
> sha512s will probably always mismatch, giving you HIGH confidence of
> your LOW reliability."
>
> Putting aside my statements about ECC being used elsewhere in
> hardware, your statement is omitting the very important point that
> the software layers sit on top of hardware.  If we treat the hardware
> layer as a black box then the difference is that either you have a
> layer that detects and corrects single-bit errors or you don't.  If
> it is the latter then it is entirely possible that you are checking
> bad data, since software must assume hardware is telling it the right
> thing.  The total system confidence would have to weight a system
> with ECC RAM higher than systems without.  Furthermore, each layer
> is just confirming the data handed off at the interface.

The client WOULD have ECC, and thus it could reliably verify the checksums.

I agree that the storage nodes, without ECC, could not reliably do so.
They would still attempt to do so, since it is still better to detect
errors as early as possible, including during scrubbing and the like.
It is fine if they miss an error, since it will eventually be picked
up by the client.

I'm not considering the "hardware" as some general layer that is a
black box.  I'm saying the client software and the hardware it is
running on (with ECC) is one layer, and the storage nodes, with their
software and hardware (without ECC) are another.

Also, any decent software checksum is not going to be designed only to
detect single-bit errors.  I'd use a much longer hash, which would
detect almost any kind of random modification.
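
As a rough sketch of what I mean (the store_chunk/fetch_chunk calls
are hypothetical stand-ins for whatever the storage cluster exposes,
not any particular API):

    import hashlib

    def write_with_checksum(store_chunk, chunk_id, data):
        # Compute a long cryptographic digest on the client (which has
        # ECC RAM), before the data ever leaves the trusted layer.
        digest = hashlib.sha512(data).hexdigest()
        store_chunk(chunk_id, data)   # hand off to the black box
        return digest                 # kept and verified at the higher layer

    def read_with_checksum(fetch_chunk, chunk_id, expected_digest):
        data = fetch_chunk(chunk_id)  # storage layer is untrusted
        if hashlib.sha512(data).hexdigest() != expected_digest:
            # A random modification of any size is overwhelmingly
            # unlikely to leave a 512-bit digest unchanged.
            raise IOError("checksum mismatch on chunk %r" % chunk_id)
        return data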

>
> "I'm not really sure where I'm even advocating for ZFS here
> specifically.  It is needed with something like lizardfs if it doesn't
> provide any protection for data at rest.  Ceph has bluestore which
> makes zfs completely unnecessary.  btrfs is also a perfectly fine
> solution conceptually, though its implementations have been
> problematic over the years."
>
> Except that you kind of just did :D  I mentioned before that LizardFS
> has a regular process that checks chunks to make sure they have
> accurate data, but you are saying that it needs something like <insert
> something else here>.  If you don't like LizardFS's approach that's
> fair, but to say it doesn't have this ability is incorrect.

How does LizardFS, without ZFS, detect and recover from modifications
to data on-disk?  Does it compute its own checksums?

>  I do not think you can even have parallel file systems without the ability to check that the data is consistent across all copies.  That would be useless.

Determining that data is consistent across all copies is not
sufficient.  You have to be able to determine what the CORRECT data
is.

In fact, Ceph not running on a storage layer that implements
checksumming (such as bluestore or zfs) DOES lack this protection.
When it does a scrub it confirms that all the copies are identical.
However, if not using a checksumming filesystem it cannot tell which
copy is right - only that they differ.  That means that manual
intervention is required to resolve the issue.

zfs or bluestore don't just determine that copies differ - they
determine which one is right.  They can even determine if neither is
correct.
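
As a contrived sketch of the difference (not how Ceph or zfs actually
implements it), compare what a scrub can conclude with and without a
trusted checksum:

    import hashlib

    def scrub_without_checksum(replicas):
        # All you can report is whether the copies agree.
        return len(set(replicas)) == 1

    def scrub_with_checksum(replicas, trusted_digest):
        # With a checksum written at a higher layer, you can pick the
        # replica that is actually correct, or report that none is.
        for i, data in enumerate(replicas):
            if hashlib.sha512(data).hexdigest() == trusted_digest:
                return i      # index of a good copy to repair from
        return None           # every copy is bad; manual recovery needed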

>
> So...
>
> "What is your alternative?  As I said ZFS isn't the only solution that
> provides on-disk checksums, but as far as I can tell you aren't
> advocating for any alternative which does."
>
> ...wasn't presented because there is a process...
>
> https://github.com/lizardfs/lizardfs/blob/master/external/crcutil-1.0/README

What is this?  What does it actually do?  I gather that it is a
library used by lizardfs in some way.  I have no idea from this file
whether it has anything to do with protecting data at rest.

> "If you're using lizardfs on ext4, and lizardfs doesn't checksum data
> at rest, and you have data modified on disk, then at best it will
> detect a discrepancy during scrubbing and not know how to resolve it.
> At worst it will provide incorrect data to clients before the next
> scrub, which will get used to process data, and maybe store new data
> on the cluster, which then gets created with matching copies that
> aren't detectable with scrubbing.  Errors can propagate if not
> detected."
>
> ...is not correct.

What isn't correct about it?

IF lizardfs doesn't checksum data at rest, then all my concerns apply.
Technically the statement is still true if the condition is false, as
is the case with any if statement, but that is a bit beside the point.
In any case I'm mostly interested in whether LizardFS protects data at
rest, in particular using a checksum that is generated and verified
outside of any of the storage nodes.
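
The propagation concern looks something like this (a contrived sketch,
not LizardFS code; fetch_chunk/store_chunk are hypothetical):

    def derive_and_store(fetch_chunk, store_chunk, src_id, dst_id):
        data = fetch_chunk(src_id)   # may be silently corrupted at rest
        result = data.upper()        # stand-in for real processing
        # The derived result is written out with matching replicas, so
        # a scrub that only compares copies will consider dst_id
        # healthy even though it was computed from bad input.
        store_chunk(dst_id, result)

A checksum generated and verified on the client would catch the bad
read before the derived data ever gets written.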

>
> "Sounds like you just described another solution for backups..."
>
> Yes, by utilizing the parallel file system's native abilities.  I wouldn't be doing anything other than physically rotating media around.

As long as the media isn't accessible to the filesystem while it is
rotated out, that is fine.

> Putting aside your point about human error (if I can wipe out my cluster, why **can't** I wipe out my S3 buckets?)

You're not going to do it in one command line.  Sure, deliberate
sabotage is a potential issue (and something that any serious company
will safeguard against).  But, you aren't going to accidentally wipe
out both your S3 and your LizardFS storage.  If you can, then you're
doing it wrong.

You absolutely have to protect against logical errors when you have a
backup strategy.  I don't trust myself not to make a mistake.  Maybe
my Ansible script has a bug and wipes out every node in my storage
cluster.  You can't keep your backups in a place where it is easy to
delete them by accident.  If you're talking about a business with
multiple employees, then you also can't keep them in a place where one
employee can wipe out both your active data and the backups, even
deliberately.  Separation of privilege is useful both to prevent
accidents and ill intentions...

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug