Keith C. Perry on 17 Aug 2018 19:56:00 -0700



Re: [PLUG] Virtualization clusters & shared storage


"I agree that the storage nodes, without ECC, could not reliably do so.
They would still attempt to do so since it is still better to detect
errors earlier, and during scrubbing/etc.  It is fine if they miss an
error since it will eventually be picked up."

Exactly.  Which means something could get flipped and still check out as good on the way to the client.  This makes the solution asymmetric since you could still be affected by bit flips during the retrieval process.  As long as the metadata server is good, there is an opportunity to correct a bad chunk on disk, provided there is not a systemic memory problem.

"Also, any decent software checksum is not going to be designed to only
detect single bit errors.  I'd use a much longer hash which would
detect almost any kind of random modification."

A checksum (i.e. a MAC or hash) is NOT error correcting code.  Checksums confirm identity- yes, I received the correct data, or no, I did not.  The length of the checksum doesn't give it the ability to do correction.  You can use sha1, sha256 or sha512 and they all give you an identity string; the lengths are just different.  Error correction code in its most basic hardware design can detect n bit flips and make n-1 corrections.  When you implement these things in hardware, the correction ability is part of how the electrical circuit is designed.  This is why they are more expensive and why we're talking apples and oranges.  Software does not have the same durability as an electrical circuit whose design and operation includes correcting errors.
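
To make the detection-vs-correction distinction concrete, here is a small Python sketch (purely illustrative- it has nothing to do with LizardFS internals or real ECC circuitry): a SHA-256 can only tell you that something changed, while a toy Hamming(7,4) code- the textbook ancestor of what ECC hardware implements- can locate and repair a single flipped bit.

import hashlib

def hamming74_encode(d):
    # d = [d1, d2, d3, d4]; codeword positions 1..7 are p1 p2 d1 p3 d2 d3 d4
    d1, d2, d3, d4 = d
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]

def hamming74_correct(c):
    # recompute the three parity checks; the syndrome is the 1-based
    # position of a single flipped bit (0 means no error detected)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]   # recover d1..d4

data = [1, 0, 1, 1]

# A hash only DETECTS the flip; it cannot say which bit changed, let alone fix it.
good = bytes(data)
bad = bytes([data[0] ^ 1] + data[1:])
print(hashlib.sha256(good).hexdigest() == hashlib.sha256(bad).hexdigest())   # False

# The ECC analogue LOCATES and REPAIRS the flip.
codeword = hamming74_encode(data)
codeword[5] ^= 1                      # corrupt one bit of the stored codeword
print(hamming74_correct(codeword) == data)                                   # True

Point being, the hash gives you identity, the code gives you correction.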

"How does LizardFS, without ZFS, detect and recover from modifications
to data on-disk?  Does it compute its own checksums?"

There is ZFS again...

Once again, yes, checksums are computed- the underlying filesystem is irrelevant.  ZFS isn't getting you anything here.

As I understand it, the checksum is part of the metadata, and if a chunk comes back with the wrong one, that chunk will be replaced.

"What is this?  What does it actually do?  I gather that it is a
library used by lizardfs in some way.  I have no idea from this file
whether it has anything to do with protecting data at rest."

I gave you and anyone else a starting point to research this for yourself.  You keep telling me what LizardFS doesn't do and yet you're not reading.  I've already made my evaluation and I'm comfortable with the product on XFS.  If you think LizardFS is a better product on ZFS then you can run it that way- plenty of LizardFS users do.  My point is that that is redundant and more resource hungry, and if I need more resources maybe I should just go ahead and build the system with ECC RAM anyway.

"Determining that data is consistent across all copies is not
sufficient.  You have to be able to determine what the CORRECT data
is."

This is why you protect your metadata with as much veracity as your actual data.  If the data does not match the checksum, the storage system is going to want to replace that chunk of data from another server whose copy did match.  That is how it is corrected.
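
If it helps, the idea looks roughly like this in Python (a made-up sketch, not LizardFS code- scrub_chunk and the server names are hypothetical): the checksum held with the metadata is the arbiter, any replica that matches it can serve as the source, and any replica that doesn't gets overwritten.

import hashlib

def scrub_chunk(meta_sha, replicas):
    # replicas: dict of server name -> bytes (each server's copy of one chunk)
    # the checksum stored with the metadata decides which copies are good
    good = next((blob for blob in replicas.values()
                 if hashlib.sha256(blob).hexdigest() == meta_sha), None)
    if good is None:
        raise RuntimeError("no replica matches the metadata checksum")
    return {srv: blob if hashlib.sha256(blob).hexdigest() == meta_sha else good
            for srv, blob in replicas.items()}

chunk = b"some chunk of file data"
meta_sha = hashlib.sha256(chunk).hexdigest()        # kept by the metadata server
copies = {"cs1": chunk, "cs2": b"some chunk of f1le data", "cs3": chunk}  # cs2 rotted on disk
repaired = scrub_chunk(meta_sha, copies)
print(all(hashlib.sha256(b).hexdigest() == meta_sha for b in repaired.values()))  # True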

"In fact, Ceph not running on a storage layer that implements
checksumming (such as bluestore or zfs) DOES lack this protection.
When it does a scrub it confirms that all the copies are identical.
However, if not using a checksumming filesystem it cannot tell which
copy is right - only that they differ.  That means that manual
intervention is required to resolve the issue."

Yikes... well, that's scary if Ceph works that way.  If the storage server doesn't have ECC and there is a flip, the metadata could still say it's good.  If ZFS steps in and says neither copy is good, then what?  ZFS panics the fs and it shuts down the storage node???  Wow, and no thank you.  I don't need my servers fighting like that, and I'm not going to manually look at chunks of binary or even text data to determine what is right.  Trust or do not trust your filesystem.

"What isn't correct about it?"

Already answered that-

"IF lizardfs doesn't checksum data at rest, then all my concerns apply.
Technically the statement is still true if the condition is false, as
is the case of any if statement, but that is a bit beside the point.
In any case I'm mostly interested in whether LizardFS protects data at
rest, in particular using a checksum that is generated and verified
outside of any of the storage nodes."

This is now a different request.  You want to checksum your data outside of the parallel file system, in the local file system.  Ok, you can do that if you want by running a filesystem that does this, but I will continue to maintain that it is not necessary because there is already a robust system in place to check the validity of the chunks, and you are risking a metadata fight.  Without ECC RAM (and ignoring any other hardware ECC) I could point to a ZFS system and say I want something checking outside of that filesystem too.  Another software mechanism doing that **outside** of the filesystem while having power over the metadata doesn't make sense to me and creates another failure point.

The nice thing about ECC is that it just happens.  Software doesn't get to fight it because it wouldn't know what to fight.

"As long as the media isn't accessible to the filesystem while it is
being rotated out then that is fine."

Once the resources are removed from the system they can't be modified.  In LizardFS, when you mark a resource for removal, no further changes are written to it and that will start the re-balancing if it can.  I guess they could stay in that state, but I would actually store them offline to maximize disk life.

"You're not going to do it in one command line.  Sure, deliberate
sabotage is a potential issue (and something that any serious company
will safeguard against).  But, you aren't going to accidentally wipe
out both your S3 and your LizardFS storage.  If you can, then you're
doing it wrong."

That was going to be my point.  There are plenty of access controls in LizardFS and the trash doesn't purge right away.  A user literally cannot delete something immediately (unless the system is set up to allow a minimum trash time of 0).  It would, however, show up immediately in the monitor and with the command-line admin tool, so it could be detected.  You would need access to the metadata partition to manually force a purge, and that has access controls too.
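
A toy Python model of that delete-to-trash behavior (the class and method names are made up- the point is just the minimum retention time that has to elapse before a purge can succeed):

import time

class TrashBin:
    def __init__(self, min_trash_seconds):
        self.min_trash_seconds = min_trash_seconds
        self.trash = {}                          # path -> time the delete was requested

    def delete(self, path):
        self.trash[path] = time.time()           # nothing destroyed yet, only marked

    def purge(self, path):
        age = time.time() - self.trash[path]
        if age < self.min_trash_seconds:
            raise PermissionError("%s has only been in trash %.0fs; retention is %ds"
                                  % (path, age, self.min_trash_seconds))
        del self.trash[path]                     # only now is the data really gone

tb = TrashBin(min_trash_seconds=86400)           # 24h minimum trash time
tb.delete("/projects/report.dat")
try:
    tb.purge("/projects/report.dat")             # too early: refused
except PermissionError as e:
    print(e)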

"You absolutely have to protect against logical errors when you have a
backup strategy.  I don't trust myself not to make a mistake.  Maybe
my ansible script has a bug and wipes out every node in my storage
cluster.  You can't keep your backups in a place where it is easy to
delete by accident.  If you're talking about a business with multiple
employees then you also can't keep them in a place where one employee
can wipe out both your active data and the backups either, even
deliberately.  Separation of privilege is useful both to prevent
accidents and ill intentions..."

Agreed! This is why in a corporate setting you get data offsite and keep multiple versions.  :D

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Friday, August 17, 2018 8:05:49 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Fri, Aug 17, 2018 at 7:41 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> "Again, the principle is layering.  If you embed a checksum at a higher
> layer, and verify it at a higher layer, then the lower layer is a
> black box.  The higher layer has a high degree of assurance that the
> data is intact no matter how unreliable the lower layer is if the
> checksum verifies and is sufficiently long.  Confidence has nothing to
> do with which system is weaker, though reliability does.  That is, if
> your sha512 of the data matches it is EXTREMELY unlikely that anything
> changed randomly (if unsigned then deliberate tampering is still a
> concern), so you have VERY high confidence.  However, if your disk
> storage system randomly alters 1% of its data every day then your
> sha512s will probably always mismatch, giving you HIGH confidence of
> your LOW reliability."
>
> Putting aside my statements about ECC being used elsewhere in
> hardware, your statement is omitting the very important point that
> the software layers sit on top of hardware.  If we treat the hardware
> layer as a black box then the difference is that either you have a
> layer that detects and correct single bit errors or you don't.  If
> it is the latter then it is entirely possible that you are checking
> bad data since software must assume hardware is telling it the right
> thing.  The total system confidence would have to weight a system
> with ECC ram higher than systems without.  Furthermore each layer
> is just confirming the data handed off at the interface.

The client WOULD have ECC, and thus it could reliably verify the checksums.

I agree that the storage nodes, without ECC, could not reliably do so.
They would still attempt to do so since it is still better to detect
errors earlier, and during scrubbing/etc.  It is fine if they miss an
error since it will eventually be picked up.

I'm not considering the "hardware" as some general layer that is a
black box.  I'm saying the client software and the hardware it is
running on (with ECC) is one layer, and the storage nodes, with their
software and hardware (without ECC) are another.

Also, any decent software checksum is not going to be designed to only
detect single bit errors.  I'd use a much longer hash which would
detect almost any kind of random modification.

>
> "I'm not really sure where I'm even advocating for ZFS here
> specifically.  It is needed with something like lizardfs if it doesn't
> provide any protection for data at rest.  Ceph has bluestore which
> makes zfs completely unnecessary.  btrfs is also a perfectly fine
> solution conceptually, though its implementations have been
> problematic over the years."
>
> Except that you kind of just did :D I mentioned before LizardFS
> has a regular process that checks chunks to make sure they have
> accurate data but you are saying that it needs something like <insert
> something else here>.  If you don't like LizardFS's approach that's fair
> but to say it doesn't have this ability is incorrect

How does LizardFS, without ZFS, detect and recover from modifications
to data on-disk?  Does it compute its own checksums?

>  I do not think you can even have parallel file systems without the ability to check that the data is consistent across all copies.  That would be useless.

Determining that data is consistent across all copies is not
sufficient.  You have to be able to determine what the CORRECT data
is.

In fact, Ceph not running on a storage layer that implements
checksumming (such as bluestore or zfs) DOES lack this protection.
When it does a scrub it confirms that all the copies are identical.
However, if not using a checksumming filesystem it cannot tell which
copy is right - only that they differ.  That means that manual
intervention is required to resolve the issue.

zfs or bluestore don't just determine that copies differ - they
determine which one is right.  They can even determine if neither is
correct.

>
> So...
>
> "What is your alternative?  As I said ZFS isn't the only solution that
> provides on-disk checksums, but as far as I can tell you aren't
> advocating for any alternative which does."
>
> ...wasn't presented because there is a process...
>
> https://github.com/lizardfs/lizardfs/blob/master/external/crcutil-1.0/README

What is this?  What does it actually do?  I gather that it is a
library used by lizardfs in some way.  I have no idea from this file
whether it has anything to do with protecting data at rest.

> "If you're using lizardfs on ext4, and lizardfs doesn't checksum data
> at rest, and you have data modified on disk, then at best it will
> detect a discrepancy during scrubbing and not know how to resolve it.
> At worst it will provide incorrect data to clients before the next
> scrub, which will get used to process data, and maybe store new data
> on the cluster, which then gets created with matching copies that
> aren't detectable with scrubbing.  Errors can propagate if not
> detected."
>
> ...is not correct.

What isn't correct about it?

IF lizardfs doesn't checksum data at rest, then all my concerns apply.
Technically the statement is still true if the condition is false, as
is the case of any if statement, but that is a bit beside the point.
In any case I'm mostly interested in whether LizardFS protects data at
rest, in particular using a checksum that is generated and verified
outside of any of the storage nodes.

>
> "Sounds like you just described another solution for backups..."
>
> Yes, by utilizing the parallel file system's native abilities.  I wouldn't be doing anything other than physically rotating media around.

As long as the media isn't accessible to the filesystem while it is
being rotated out then that is fine.

> Putting aside your point about human error (if I can wipe out my cluster, why **can't** I wipe out my S3 buckets?)

You're not going to do it in one command line.  Sure, deliberate
sabotage is a potential issue (and something that any serious company
will safeguard against).  But, you aren't going to accidentally wipe
out both your S3 and your LizardFS storage.  If you can, then you're
doing it wrong.

You absolutely have to protect against logical errors when you have a
backup strategy.  I don't trust myself not to make a mistake.  Maybe
my ansible script has a bug and wipes out every node in my storage
cluster.  You can't keep your backups in a place where it is easy to
delete by accident.  If you're talking about a business with multiple
employees then you also can't keep them in a place where one employee
can wipe out both your active data and the backups either, even
deliberately.  Separation of privilege is useful both to prevent
accidents and ill intentions...

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug