Keith C. Perry on 12 Aug 2018 10:35:07 -0700



Re: [PLUG] Virtualization clusters & shared storage


Rich, I was frustrated last year trying to find comparisons myself between GlusterFS, Ceph (their FS functionality), XtreemFS and LizardFS.  I ended up building most of these but then settled on LizardFS since it is the one that is software-defined storage rather than object storage.  Part of the problem is that everyone has a slightly different way of doing things, so the first thing I did to help myself out was get a firm grasp of what object storage is (or should be) and what software-defined storage is (or should be).  After a while it became easier to tell whether the new thing being tossed out was actually doing something new and more efficient, or just wording the same things differently.  It's exhausting to constantly take a deep dive on these things.

To your questions...
* Does the implementation protect against memory corruption on storage
nodes that do not use ECC?   (Note, I said storage nodes, NOT client
nodes.)

Protect in what sense?  If you don't have ECC RAM, you'd be relying on redundancies elsewhere in the transport of the data to protect against bit changes.  I'm not sure what LizardFS does in this regard.

* Does the implementation protect against on-disk corruption for data
at rest?  (I'm lumping into "implementation" any disk-layer solutions
being used, like ZFS/Bluestore/whatever, as many distributed systems
separate these.)

Chunkservers check their data for consistency every 10 seconds by default.  Recent versions let you set that interval with millisecond granularity if you want more aggressive scrubbing.
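
For reference, that interval is a chunkserver setting.  A minimal sketch, assuming the HDD_TEST_FREQ option that LizardFS inherited from the MooseFS lineage (newer releases may expose a millisecond-granularity variant, so check the mfschunkserver.cfg man page for your version):

    # /etc/mfs/mfschunkserver.cfg
    # HDD_TEST_FREQ = 10     # default: test chunk integrity every 10 seconds
    HDD_TEST_FREQ = 5        # scrub more aggressively

The chunkserver needs a reload (or restart) to pick up the change.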

* Does the implementation support EC/striping/etc so that physical
disk requirements aren't multiples of the usable capacity?

My answer would be yes.  As I point out in my description of how I would do a build with JP's hardware, this system is very space efficient.  You can read more about the goal / replication system here: https://docs.lizardfs.com/adminguide/replication.html, or see page 9 of the current whitepaper here: https://lizardfs.com/documentation/  (I would recommend reading that anyway to understand the architecture of the system; it's not that long and it has examples and diagrams.)
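
As a rough illustration of the goal syntax (the goal names and paths here are hypothetical; the exact format is in the replication docs above), you define replicated and erasure-coded goals on the master in mfsgoals.cfg and then apply them per file or directory from a client:

    # /etc/mfs/mfsgoals.cfg on the master
    1 one_copy    : _
    2 two_copies  : _ _
    3 ec_3_1      : $ec(3,1)    # 3 data + 1 parity part, ~1.33x raw space

    # on a mounted client (older installs use mfssetgoal)
    lizardfs setgoal -r ec_3_1 /mnt/lizardfs/vm-images

That per-directory control is what makes the space usage flexible: bulk data can sit on an EC goal while hot or critical data keeps full copies.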

* What is the recommended RAM:storage ratio like?

See https://docs.lizardfs.com/introduction.html#hardware-recommendation

* Complexity/etc.

What do you mean?

* Options for backup.

Standard stuff here: you'll want /etc/lizardfs or /etc/mfs for your configuration files and /var/lib/lizardfs or /var/lib/mfs for your metadata.  Keep in mind that shadow masters and metaloggers would also give you copies of your metadata.
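
A minimal sketch of what that looks like in practice (paths assume the /etc/mfs layout; the metadata file names follow the MooseFS convention LizardFS inherited, so verify them on your install):

    # run on the master (or a shadow master / metalogger host)
    tar czf lizardfs-backup-$(date +%F).tar.gz \
        /etc/mfs \
        /var/lib/mfs/metadata.mfs*

With the configs and a recent metadata snapshot you can rebuild a master; the chunk data itself is protected by your replication goals.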

* How well does it scale down?  I'm looking at alternatives to
largeish ZFS/NAS home-lab implementations, not to run Google.  Do I
need 12 physical hosts to serve 10TB of data with redundancy to
survive the loss of 1 physical host?

It scales down very well.  I'm working on something now that I'm not ready to talk about yet, but a common question that comes up on the list is whether you can run multiple chunkservers on one system.  The answer is yes.  I'd have to know more about what you are doing, but I know of scenarios where people are running single systems with 2 or more chunkservers against their storage, instead of RAID or other systems, to provide redundancy and availability.
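
To give an idea of what that looks like, here is roughly how two chunkserver instances on one box are set up (the option names are taken from the stock mfschunkserver.cfg as I understand it, so treat this as a sketch and check the man page): each instance gets its own config, data path, listen port and disk list.

    # /etc/mfs/mfschunkserver-a.cfg
    DATA_PATH = /var/lib/mfs-a
    CSSERV_LISTEN_PORT = 9422
    HDD_CONF_FILENAME = /etc/mfs/mfshdd-a.cfg    # lists this instance's disks

    # /etc/mfs/mfschunkserver-b.cfg
    DATA_PATH = /var/lib/mfs-b
    CSSERV_LISTEN_PORT = 9522
    HDD_CONF_FILENAME = /etc/mfs/mfshdd-b.cfg

    # start each instance against its own config
    mfschunkserver -c /etc/mfs/mfschunkserver-a.cfg start
    mfschunkserver -c /etc/mfs/mfschunkserver-b.cfg start

With a goal of 2, the master will then keep each chunk on both instances, which is what gives you the RAID-like redundancy on a single host.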

* How efficiently can it scrub/etc for bad on-disk storage?  (One
thing that concerns me about separate storage layers is whether there
is an automated way to actually fix issues the storage layer detects,
since they can't fix themselves without redundancy at that layer.)

LizardFS does tasks like this without user interaction.  There are controls for the process as well as load monitoring so that you don't thrash your running processes because of background disk operations.
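
For example (option names assumed from the stock master config, so double-check mfsmaster.cfg on your version), the master lets you cap how many chunks are being replicated to or from any one chunkserver at a time, which is how you keep background healing from starving client I/O:

    # /etc/mfs/mfsmaster.cfg
    CHUNKS_WRITE_REP_LIMIT = 2     # max chunks being written to a chunkserver for replication
    CHUNKS_READ_REP_LIMIT  = 10    # max chunks being read from a chunkserver for replication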

~ ~ ~

I don't know enough about the network transport to answer the memory-protection question.  I'm going to ask on the list since I would like to know the answer too.

I looked at CephFS too and I like what I see, but it's still object storage underneath.  The thing that really caught my attention with Ceph is its network block devices.  You are right that it is more mature, and most of the use cases point to running VMs (probably diskless workstations too).  I thought I might be able to use those in a more general-purpose way, but for what I was looking to do it wasn't appropriate.

LizardFS vs. Ceph is an asymmetric comparison.  While LizardFS does essentially one thing and does it well (software-defined storage), Ceph really has three major components (object store, block device and filesystem) that can serve different needs.  CephFS is really the point of comparison.  CephFS's network and topology awareness through its CRUSH maps didn't resonate with me.  Plus, changing your data strategy on the fly, or having multiple strategies, seemed much more difficult to do (if possible at all).

I found LizardFS to be much easier to set up and work with than Ceph.  Adding / removing resources, moving data around, rebuilding failed resources and changing replication goals are all easy to do.  The other thing I really like is that it layers on top of standard filesystems, which was important to me since filesystems take a long time to vet before you can say they are reliable.  You could in theory share an existing filesystem with LizardFS (you just create a folder for it so that the root of the chunkserver data starts there) to try it out: if you like it, moving things around is trivial, and if you don't, you just uninstall and delete that folder.  Ceph may allow you to do the same, but I didn't look into it.
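
For example (a sketch assuming the stock mfshdd.cfg disk list and the default mfs service user), pointing a chunkserver at a subfolder of an existing filesystem is all it takes to try it out:

    mkdir /tank/lizardfs-chunks
    chown mfs:mfs /tank/lizardfs-chunks

    # /etc/mfs/mfshdd.cfg -- one directory per line that the chunkserver may use
    /tank/lizardfs-chunks

Uninstalling and removing that directory undoes the experiment.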

LizardFS has been around for a while now and they do have corporate backing.  I'm not sure why more people aren't familiar with it, but it took me a while to find it too.  Gluster and Ceph tend to suck the air out of the room since they are the most popular.

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "Rich Freeman" <r-plug@thefreemanclan.net>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Sunday, August 12, 2018 6:54:32 AM
Subject: Re: [PLUG] Virtualization clusters & shared storage

On Sat, Aug 11, 2018 at 6:22 PM Keith C. Perry
<kperry@daotechnologies.com> wrote:
>
> JP, not to give you more reading material but along the same lines...  https://docs.lizardfs.com/cookbook/hypervisors.html#using-lizardfs-as-shared-storage-for-proxmoxve
>

Keith - have you seen any documentation that compares LizardFS to some
of the newer options like CephFS?  I'm finding it difficult to
actually find comparisons of the various distributed options, and
every time somebody tosses one out I feel like I will have to end up
doing a deep dive and basically do my own comparison.

Things that I'm interested in are things like:

* Does the implementation protect against memory corruption on storage
nodes that do not use ECC?   (Note, I said storage nodes, NOT client
nodes.)
* Does the implementation protect against on-disk corruption for data
at rest?  (I'm lumping into "implementation" any disk-layer solutions
being used, like ZFS/Bluestore/whatever, as many distributed systems
separate these.)
* Does the implementation support EC/striping/etc so that physical
disk requirements aren't multiples of the usable capacity?
* What is the recommended RAM:storage ratio like?
* Complexity/etc.
* Options for backup.
* How well does it scale down?  I'm looking at alternatives to
largeish ZFS/NAS home-lab implementations, not to run Google.  Do I
need 12 physical hosts to serve 10TB of data with redundancy to
survive the loss of 1 physical host?
* How efficiently can it scrub/etc for bad on-disk storage?  (One
thing that concerns me about separate storage layers is whether there
is an automated way to actually fix issues the storage layer detects,
since they can't fix themselves without redundancy at that layer.)

I listed the first one first for a reason because this info is
frustratingly difficult to find.  Many of the options let you store
data on ZFS or something similar, but they are vague on whether they
actually guarantee that they'll detect corruption while the data is
being handled in RAM on the storage nodes.  I want cheap/disposable
storage nodes, so I'd prefer a system that assumes they're unreliable.
If a known-good hash is computed on the client, and preserved
end-to-end, or at least until AFTER another hash is computed and
checked when the disk layer takes over, then it should be safe.
However, if the data is sent over the network and the hash is
forgotten after network transmission is checked, and the data sits in
RAM unprotected until the disk layer takes over (which might just be 3
lines of code), then that is a vulnerability.

Right now CephFS seems to be the most attractive option for me, with
the caveat that the "FS" part of it is newish, and I'm not sure where
they are with the failure tolerance on the MDS layer.  Ceph for block
storage (and probably also for serving volumes for VMs) sounds like it
is much more mature.

So far you're the only one I've heard advocate LizardFS, which sounds
similar to CephFS, so I'm curious about how they compare.  CephFS also
separates the disk storage layer, though they now offer their own
which is optimized more for Ceph (they wanted the on-disk
checksumming/etc, but didn't want to implement all the other POSIX/etc
stuff for what is just a block storage back end, which I think makes
sense).  I'm pretty sure you could dump it on ext4, but I'm sure it is
much more common to put it on something like ZFS for the checksums.

-- 
Rich
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug