Keith C. Perry on 17 Aug 2018 16:43:04 -0700



Re: [PLUG] Virtualization clusters & shared storage


"Again, the principle is layering.  If you embed a checksum at a higher
layer, and verify it at a higher layer, then the lower layer is a
black box.  The higher layer has a high degree of assurance that the
data is intact no matter how unreliable the lower layer is if the
checksum verifies and is sufficiently long.  Confidence has nothing to
do with which system is weaker, though reliability does.  That is, if
your sha512 of the data matches it is EXTREMELY unlikely that anything
changed randomly (if unsigned then deliberate tampering is still a
concern), so you have VERY high confidence.  However, if your disk
storage system randomly alters 1% of its data every day then your
sha512s will probably always mismatch, giving you HIGH confidence of
your LOW reliability."

Putting aside my statements about ECC being used elsewhere in hardware, your statement omits the very important point that the software layers sit on top of hardware.  If we treat the hardware layer as a black box, then the difference is that either you have a layer that detects and corrects single-bit errors or you don't.  If it is the latter, then it is entirely possible that you are checking bad data, since software must assume hardware is telling it the right thing.  The total system confidence would have to weight a system with ECC RAM higher than a system without.  Furthermore, each layer is just confirming the data handed off at its interface.  Unless bits are added at each stage and a new hash is generated, the weight of each software layer is the same.  Networking is a quasi-example of this, since packet systems encapsulate data and then checksum the result (so the data in the packet is preserved).  That is not something that happens at every layer transition.
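
Just to make the layering point concrete (this is an illustration only, not code from any of the systems we're discussing), here is a minimal Python sketch of "checksum at a higher layer, verify at a higher layer."  The storage underneath is treated as a black box, and a sha512 mismatch only tells you that something below changed the data, not which layer did it:

  import hashlib

  def store(path: str, data: bytes) -> str:
      """Hand data to the (untrusted) lower layer and keep a sha512 at the higher layer."""
      with open(path, "wb") as f:
          f.write(data)
      return hashlib.sha512(data).hexdigest()

  def load_verified(path: str, expected_sha512: str) -> bytes:
      """Read the data back and verify it against the checksum kept at the higher layer."""
      with open(path, "rb") as f:
          data = f.read()
      if hashlib.sha512(data).hexdigest() != expected_sha512:
          # Something below (disk, controller, filesystem, RAM in between)
          # altered the data; we know THAT it happened, not WHERE.
          raise IOError("checksum mismatch for " + path)
      return data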

"I'm not really sure where I'm even advocating for ZFS here
specifically.  It is needed with something like lizardfs if it doesn't
provide any protection for data at rest.  Ceph has bluestore which
makes zfs completely unnecessary.  btrfs is also a perfectly fine
solution conceptually, though its implementations have been
problematic over the years."

Except that you kind of just did :D  I mentioned before that LizardFS has a regular process that checks chunks to make sure they have accurate data, but you are saying that it needs something like <insert something else here>.  If you don't like LizardFS's approach, that's fair, but to say it doesn't have this ability is incorrect.  I do not think you can even have a parallel file system without the ability to check that the data is consistent across all copies.  That would be useless.

So...

"What is your alternative?  As I said ZFS isn't the only solution that
provides on-disk checksums, but as far as I can tell you aren't
advocating for any alternative which does."

...wasn't presented because there is a process...

https://github.com/lizardfs/lizardfs/blob/master/external/crcutil-1.0/README

...thus...

"If you're using lizardfs on ext4, and lizardfs doesn't checksum data
at rest, and you have data modified on disk, then at best it will
detect a discrepancy during scrubbing and not know how to resolve it.
At worst it will provide incorrect data to clients before the next
scrub, which will get used to process data, and maybe store new data
on the cluster, which then gets created with matching copies that
aren't detectable with scrubbing.  Errors can propagate if not
detected."

...is not correct.
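
For reference, the crcutil code linked above is what supplies the CRC math.  Purely as an illustration (a sketch, not LizardFS's actual code path), a chunk scrub along these lines is enough to detect at-rest modification of a replica, by comparing a freshly computed CRC against the one recorded when the chunk was written:

  import zlib

  def crc_of_chunk(path: str) -> int:
      """Compute a CRC32 over a chunk file, streaming so large chunks aren't loaded whole."""
      crc = 0
      with open(path, "rb") as f:
          for block in iter(lambda: f.read(1 << 20), b""):
              crc = zlib.crc32(block, crc)
      return crc

  def scrub(path: str, stored_crc: int) -> bool:
      """True if the replica still matches the CRC recorded at write time."""
      return crc_of_chunk(path) == stored_crc

A replica that fails this check can then be discarded and re-fetched from one of the copies that still verifies.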

"Sounds like you just described another solution for backups..."

Yes, by utilizing the parallel file system's native abilities.  I wouldn't be doing anything other than physically rotating media around.  As I envision this on LizardFS (I've kind of played around with this), I would be adding and removing resources and waiting some delta for the system to reach data consistency.  I could then remove them with the updated data set and the system would re-balance again.  As long as there is always enough resource to safely run in the reduced state, my data is safe.  Honestly, I'm not sure if this will work the way I want ultimately, but I think there is a way.
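
Roughly what I have in mind, sketched in Python with the cluster-facing calls left as hypothetical stubs (they are NOT real LizardFS commands), assuming the only safety condition is that every chunk meets its replication goal before media is pulled:

  import time

  # Hypothetical placeholders: stand-ins for whatever status query and detach
  # procedure the cluster actually provides; not real LizardFS API calls.
  def cluster_undergoal_chunks() -> int:
      raise NotImplementedError("replace with the cluster's own status query")

  def detach_chunkserver(node: str) -> None:
      raise NotImplementedError("replace with the cluster's own detach procedure")

  def rotate_out(node: str, poll_seconds: int = 60) -> None:
      """Pull a chunkserver for off-site rotation only once every chunk meets its goal."""
      while cluster_undergoal_chunks() > 0:
          time.sleep(poll_seconds)
      detach_chunkserver(node)
      # The removed media now holds a consistent copy of the data set; the
      # cluster re-balances onto the remaining nodes while it is away.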

Putting aside your point about human error (if I can wipe out my cluster, why **can't** I wipe out my S3 buckets?), I do agree that having a "technology firewall" between your main system and your data protection system is a good idea.  This is why I'm still not sure of the exact role of a parallel file system relative to my current practices.  It may be that they only complement each other.


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Friday, August 17, 2018 5:59:38 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Fri, Aug 17, 2018 at 5:26 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> In your scenario you're trying to have things both ways.  On the one
> hand you're saying that you would you have ECC ram on clients to make
> sure source data is trusted but then on the storage system, you would
> not use ECC ram and you want to use some ZFSy 'ish checksum magic to
> get to the same confidence level.
>
> That's an asymmetric solution because doing things in hardware is not
> the same as software.  At best you have to have **expectation** that
> the confidence level is that of the weakest system.  In practice I'm
> sure it would be higher but as a rule I would not think that way.

Again, the principle is layering.  If you embed a checksum at a higher
layer, and verify it at a higher layer, then the lower layer is a
black box.  The higher layer has a high degree of assurance that the
data is intact no matter how unreliable the lower layer is if the
checksum verifies and is sufficiently long.  Confidence has nothing to
do with which system is weaker, though reliability does.  That is, if
your sha512 of the data matches it is EXTREMELY unlikely that anything
changed randomly (if unsigned then deliberate tampering is still a
concern), so you have VERY high confidence.  However, if your disk
storage system randomly alters 1% of its data every day then your
sha512s will probably always mismatch, giving you HIGH confidence of
your LOW reliability.

Also, all ECC is doing is calculating its OWN set of checksums, and
verifying them.  It is completely redundant if there is a higher layer
that is independent of the hardware it is implemented on doing the
same thing.  The main benefit of a redundant layer of ECC would be in
reducing latency if the higher layer has to take steps to correct
errors, but as you point out errors in RAM are very rare, so it
doesn't add much value in practice, provided you're checking for it at
a higher level.  It is important that the checksum verification take
place on a system with ECC RAM, because otherwise the checksum could
fail but the execution logic of the software not work correctly and
thus behave as if it passed.  When you have RAM problems all bets are
off.

As far as having it both ways go, it is completely logical.  Your
clients need ECC ANYWAY, because they're doing operations on the data.
If the filesystem is properly designed you can leverage the higher
reliability of the clients to make up for the lower reliability of the
storage nodes.  You could have 10 clients and 10,000 storage nodes,
and in that case why would you want to pay for all that ECC RAM on the
storage nodes if the clients can just leverage their ECC RAM to
provide the exact same security?

> I am rather concerned about the ZFS fanboy'ism though- more generally since it doesn't apply to a conversation about parallel file systems.

I'm not really sure where I'm even advocating for ZFS here
specifically.  It is needed with something like lizardfs if it doesn't
provide any protection for data at rest.  Ceph has bluestore which
makes zfs completely unnecessary.  btrfs is also a perfectly fine
solution conceptually, though its implementations have been
problematic over the years.

What I'm advocating for is checksums on data at rest, because no
amount of ECC or verification at other levels in the system will help
you if bits get flipped on the storage and this isn't detectable by
your algorithm.

> As I illustrated before there is other hardware correction code
> implemented on motherboards today.  ECC is an additional step.  That
> doesn't mean data is not safe without ECC it means data is more safe
> with ECC.  Despite the engineering points and recent talk about cosmic
> rays flipping bits in flight, the reality still is that most people do
> not use ECC ram and have very high data fidelity.  People trust these
> systems to their finances, store memories and other important data
> every day.  This was going on long before all the cloud stuff too.

I don't think I said anything contrary to any of this.

> ZFS vs. everything-else is like systemd vs. everything-else... it is
> **another** approach to something.  It is not **the** approach.  I'm
> trying to avoid that kind of discussion.  If you like ZFS fine, if
> you like something else that is also fine.  You still have to have a
> complete data management strategy.

What is your alternative?  As I said ZFS isn't the only solution that
provides on-disk checksums, but as far as I can tell you aren't
advocating for any alternative which does.

If you're using lizardfs on ext4, and lizardfs doesn't checksum data
at rest, and you have data modified on disk, then at best it will
detect a discrepancy during scrubbing and not know how to resolve it.
At worst it will provide incorrect data to clients before the next
scrub, which will get used to process data, and maybe store new data
on the cluster, which then gets created with matching copies that
aren't detectable with scrubbing.  Errors can propagate if not
detected.

If you're using lizardfs on top of zfs then you're fine if there is a
disk corruption, because ZFS will generate an IO error and if lizardfs
is even remotely sane that will trigger a recovery of the affected
data.  That still won't protect you from a RAM error if lizardfs lacks
internal checksums, since the bits could flip in RAM between the time
when zfs verifies its checksums and lizardfs generates whatever
checksums it uses for network transmission.

> My simple test is this... can you
> rebuild your systems from a complete failure regardless of what caused
> it?  You can certainly achieve that WITHOUT ZFS or ECC ram.  Using
> parallel filesystems on commodity hardware is another tool in the tool
> box for long term data management.

That test is too simple.  It isn't enough to be able to rebuild a
system from complete failure.  You have to be able to DETECT that
failure and know that you need to rebuild the data.

And if your rebuild solution is offline backups, then you're talking
about a long period of downtime during the rebuild.  Solutions that
provide for redundancy and integrity checking can provide a high
degree of assurance that any errors will be detected prior to the data
being used, and recover from that error in a few hundred milliseconds,
while not exposing the other layers of the solution to the error.

> Also, I don't agree with redundancy not being the same as backups.
> It depends on how you do things.  People and companies are poor at
> managing offline backups at a minimum.  You end up with stacks of
> media (or files), off-line as you say, stored some place never to be
> touch and checked for fidelity and viability in a disaster.  It is
> far more efficient and practical to have your data system online with
> versioning and then at least have one copy off **premise** but also
> online.

Sounds like you just described another solution for backups...

A key element here is that you're using different technology for
active storage vs your backup solution, and that your backup solution
can't be told to discard data prematurely, and the backup solution is
physically remote from your active data.

Just defining another location in your software defined storage and
setting a policy doesn't really accomplish that, because a software
bug could take out all that data at once.  An incorrect command using
a privileged operator could instruct both the active and "backup"
nodes to discard all their data at once.

Now, if your main system is lizardfs/ceph/whatever, and then you're
generating serialized snapshots and uploading those to S3, and you
have some S3 policy set to protect files for a certain period of time,
and the account used to upload those backups has permissions to create
but not delete backup files, then you have some decent separation.  A
mistake in administration isn't going to wipe out both your active
cluster and your S3 backups at once, because there is no one account
that has access to do both.

And if your backups are in a stack of tapes somewhere then you're all
the safer because you don't have to worry about whether you failed to
think of something in setting up all your fancy roles/etc.

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug