Lee H. Marzke on 10 Aug 2018 12:31:20 -0700



Re: [PLUG] Virtualization clusters & shared storage


Keith,

LizardFS seems to have similar disk efficiency to VSAN set to hostFailuresToTolerate=2.

60TB total disk, ~20TB usable, or perhaps ~40TB with erasure coding.
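
Back-of-the-envelope math for those figures, as a quick sanity check (my own
approximation in Python, not anything from the LizardFS or VSAN docs):

    # 5 nodes x ~12TB of data disk each
    raw_tb = 5 * 12                      # 60 TB raw across the cluster

    # goal 3 = three full replicas of every chunk
    goal3_usable = raw_tb / 3            # ~20 TB usable

    # EC(5,2) = 5 data parts + 2 parity parts per chunk
    ec52_usable = raw_tb * 5 / (5 + 2)   # ~42.9 TB usable

    print(goal3_usable, ec52_usable)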

Might be great for a lab; however, this now uses 5 physical servers for
just storage, and no hypervisors.

Quite a lot of complexity and setup for storage compared to a single dual-head
TrueNAS box.

Does this have any de-duplication?

Note that Tegile has de-duplication on all models, and Nimble has de-dup on
the all-flash units, which may save significant space in a lab with many
repetitive test environments.


Lee


----- Original Message -----
> From: "Keith C. Perry" <kperry@daotechnologies.com>
> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
> Sent: Friday, 10 August, 2018 14:43:24
> Subject: Re: [PLUG] Virtualization clusters & shared storage

> From JP...
> 
> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x
> 2TB drives in RAID5 for 10TB local storage each.
> 
> ...one of your builds...
> 
> 3 node cluster with redundant FreeNAS:
>        1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>        2. VM node2: CPU/RAM used, 10TB local space wasted
>        3. VM node3: CPU/RAM used, 10TB local space wasted
>        4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>        5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a mirror
> 
> My build with LizardFS with the same hardware...
> 
> 5 node cluster, with a standard goal of 3 or EC(5,2) (see the goal sketch just below):
>        1. LizardFS master + chunkserver 1 (~12TB for data storage) + VM node 1
>        2. LizardFS shadow master + chunkserver 2 (~12TB for data storage) + VM node 2
>        3. LizardFS metalogger + chunkserver 3 (~12TB for data storage) + VM node 3
>        4. LizardFS metalogger + chunkserver 4 (~12TB for data storage) + VM node 4
>        5. LizardFS metalogger + chunkserver 5 (~12TB for data storage) + VM node 5
> 
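> As a rough illustration of how those goals might be defined (goal syntax and
> file paths from memory, so treat the names and locations as assumptions), the
> master's mfsgoals.cfg takes lines of the form "id name : definition":
> 
>        3    3      : _ _ _        # three full copies on any chunkservers
>        15   ec_5_2 : $ec(5,2)     # erasure coding: 5 data + 2 parity parts
> 
> and then something like "lizardfs setgoal -r ec_5_2 /mnt/lizardfs/vms" would
> apply the EC goal recursively to a directory tree.
> 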
> This deployment would give you around 60TB of raw space.  For argument's sake,
> if we assume you're using the same goal type for all data, then goal 3 would
> give you about 20TB of data space and EC(5,2) would give you about 42.86TB.  In
> addition, some other common points...
> 
> 1. I'm ignoring system and metadata requirements because they are negligible.
> Even if the OS were on a 500GB drive, that would be more than enough space for
> the OS and for metadata that could manage tens of millions of files.  This is
> why the master and shadow master can also be chunkservers.  However, it is
> general practice to use a small RAID-1 for the OS volume for better availability.
> 2. While using RAID or LVM is possible, it is completely unnecessary with
> LizardFS.  The system can use single drives directly on a chunkserver, so the
> only overhead is the file system formatting.
> 3. R710s, if I recall correctly, can have 2 or 4 NIC ports.  I would strongly
> recommend using bonded NICs so that you have the highest network throughput
> possible (mode 5 or mode 6 if your NICs can do it; LACP/mode 4 if you have a
> capable switch).  A bonding sketch follows this list.
> 
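> As a sketch of the bonding setup with iproute2 (mode 6 / balance-alb shown;
> eth0/eth1 are placeholder interface names, adjust to your hardware):
> 
>        # create the bond, enslave two NICs, bring it up (run as root)
>        ip link add bond0 type bond mode balance-alb miimon 100
>        ip link set eth0 down && ip link set eth0 master bond0
>        ip link set eth1 down && ip link set eth1 master bond0
>        ip link set bond0 up
>        # for LACP use "mode 802.3ad" instead, with a switch configured for it
> 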
> This build would give you a completely fault-tolerant and highly available
> storage system with low storage overhead.  I won't say anything about IOPS
> because that depends on your workloads, but network bonds help quite a bit
> structurally, since the network is a typical bottleneck.  Specifically, this
> system can tolerate:
> 
> 1. any 2 disks can fail anywhere... LizardFS automatically load balances your
> data across all your disks on all your chunkservers, whether you use individual
> disks, LVM or RAID.  When a disk fails, the system will begin to migrate data
> to maintain your goals if there are available resources.  If you give disks to
> the system directly, you would just replace the failed one; no messing with
> RAID or LVM procedures to replace a failed disk.  LizardFS will re-balance the
> data again when the failed resources are replaced.  There is no downtime or
> data loss.
> 2. any 2 chunkservers can fail...  extending from the above, any 2 nodes can
> die completely, even the master server (which holds the metadata).  When the
> master fails, the shadow master takes over automatically.  There is no downtime
> or data loss.
> 3. if both the master and shadow master go down, there are 3 metaloggers, which
> just receive metadata.  The master service can be brought up manually on any of
> those 3 servers, or they can simply be a place from which the real master
> restores its metadata.  Here there will be downtime, but still no data loss.
> 
> So, with 5 servers LizardFS can give you a very resilient system which scales
> up, out or both, and which can be adjusted easily to facilitate multiple
> strategies in one storage system.
> 
> 
> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
> Keith C. Perry, MS E.E.
> Managing Member, DAO Technologies LLC
> (O) +1.215.525.4165 x2033
> (M) +1.215.432.5167
> www.daotechnologies.com
> 
> ----- Original Message -----
> From: "JP Vossen" <jp@jpsdomain.org>
> To: plug@lists.phillylinux.org
> Sent: Wednesday, August 8, 2018 9:23:53 PM
> Subject: Re: [PLUG] Virtualization clusters & shared storage
> 
> First, thanks to Lee and Andy for the term I didn't know for what I
> mean: hyper-converged infrastructure.
>	https://en.wikipedia.org/wiki/Hyper-converged_infrastructure
> 
> It looks like Proxmox does that:
>	https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure
> 
> Thanks Keith, I had seen LizardFS but was not aware of the implications.
> 
> Doug, Kubernetes is not in play, though it might be in the future.  Or I
> may be missing your point.
> 
> Lee, thanks for the insights and VMware updates, as always. :-)
> 
> I've used FreeNAS in the past and like it, but I'm not sure I'm
> explaining my thoughts as well as I'd like.  But let me try this:
> 
> (I  wrote this next part before I learned that "hyper-converged" is what
> I mean, but I'm leaving it here in case it's still useful.)
> 
> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x
> 2TB drives in RAID5 for 10TB local storage each.
> 
> 3 node cluster with redundant FreeNAS:
>	1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>	2. VM node2: CPU/RAM used, 10TB local space wasted
>	3. VM node3: CPU/RAM used, 10TB local space wasted
>	4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>	5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a
> mirror
> 
> 3 node cluster with -on-node-shared-storage- hyper-converged storage:
>	1. VM node1: CPU/RAM used, 20TB local/shared space available +
>	2. VM node2: CPU/RAM used, 20TB local/shared space available +
>	3. VM node3: CPU/RAM used, 20TB local/shared space available +
> + For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space
> loss (rough math sketched below).  If it were REALLY smart, it would keep the
> storage for the local VMs local while still replicating, but that would require
> a lot more hooks into the entire system, not just some kind of replicated system.
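> 
> A rough sketch of that comparison, using the assumptions above (my arithmetic,
> not anything measured):
> 
>        per_node_tb = 10
>        freenas_usable = 1 * per_node_tb          # 10 TB: one FreeNAS box, mirrored to the other
>        hyperconv_usable = (3 - 1) * per_node_tb  # 20 TB: 3 nodes minus parity-style loss
>        print(freenas_usable, hyperconv_usable)   # 2x the disk, on 3 servers instead of 5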
> 
> But the point here is that with my idea I have 2x the disk with 3/5ths
> the servers.  Or put another way, I can now do a 5 node cluster with
> even more CPU, RAM and space dedicated to actually running VMs, and not
> lose 2/5ths of the nodes to just storing the VMs.
> 
> That said, I'd thought about the rebuild overhead, but not in depth, and
> that--and general "parity" or redundancy however implemented--are
> significant.  So my 2/5ths comparisons are not 100% fair.  But still,
> the idea apparently does have merit.
> 
> 
> On 08/08/2018 07:50 PM, Lee H. Marzke wrote:
>> 
>> JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or
>> equivalent rackmount solutions from https://www.ixsystems.com.  I run my lab on
>> FreeNAS and a Dell 2950 server with 6x 2TB disks and 2x SSD.  If you put storage
>> into the servers you will find all sorts of edge cases that you hadn't planned
>> on.
>> 
>> Just taking down a server for a quick RAM swap will cause it to need to
>> rebuild using lots of network/CPU.  If you have TBs of fast SSD storage on
>> multiple servers and don't have 10Gb connectivity between hosts, or you have
>> slow HDs, you will have pain.  Generally you try to migrate data off nodes
>> prior to maintenance, which may take several days.
>> 
>> The VMware solutions have changed a lot, and though they do not meet JP's
>> need for *free*, they may fit someone else's need for a reliable / highly
>> available solution.  Of course, ESXi and VSAN are free for a 60-day trial
>> after install for lab use.
>> 
>> First, there is a VMware migrate-*both* mode where you can migrate both the
>> hypervisor and the storage in one go, for cases where the two storage units
>> are not connected across hypervisors.  Needless to say, this takes a long time
>> to sequentially move memory, then disk, to the remote server, and it doesn't
>> help improve HA.
>> 
>> Next, VMware VSAN is catching on really fast, and VMware is hiring like mad
>> to fill new VSAN technical sales roles nationwide.  VSAN uses storage in each
>> host (minimum of one SSD and one HD) as high-performance object
>> storage on each compute node.  All VM objects are stored on two hosts
>> minimum, with vSAN taking care of all the distribution.  The hosts must be
>> linked on a 1Gb (preferably 10Gb) private network for back-end communication.
>> Writes are sent and committed to two nodes before being acknowledged.
>> You get one big storage pool - and allocate storage to VMs as you like -
>> with no sub-LUNs or anything else to manage.  If you have 4 or more hosts,
>> instead of mirroring data over 2 hosts you can do erasure coding (the
>> equivalent of RAID 5/6, but with the disks spread out across hosts).  So now
>> you're not losing 50% of your storage, but you have more intensive CPU and
>> network operations.  The vSAN software is pre-installed into ESXi these days -
>> you just need to activate it and apply a license after the free 60-day trial.
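>> 
>> Rough space overhead for those policies, as I understand them (my own numbers;
>> exact vSAN behavior depends on the FTT policy you pick):
>> 
>>        mirror_ftt1 = 1 / 2    # two full copies across hosts -> 50% usable
>>        raid5_ec    = 3 / 4    # 3 data + 1 parity, needs 4+ hosts -> 75% usable
>>        raid6_ec    = 4 / 6    # 4 data + 2 parity, needs 6+ hosts -> ~67% usable
>>        print(mirror_ftt1, raid5_ec, raid6_ec)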
>> 
>> Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles
>> would be used locally in the hyper-converged solutions as well (perhaps taking
>> 10% to 20% of the cycles away from a host for storage and replication), so you
>> may need more / larger hosts in a hyper-converged solution to make up for that.
>> Remember that mirroring takes little CPU but wastes 50% of your storage, while
>> any erasure coding is much more CPU intensive, and more network intensive.
>> 
>> The other solutions mentioned, except a ZFS server, are likely way too complex
>> for a lab storage solution.  Is a company really going to give
>> a lab team 6 months of effort to put together storage that may or may
>> not perform?  Can you do a business justification to spend dozens of man-months
>> of effort just to save the $20K on an entry-level TrueNAS ZFS?
>> 
>> Lee
>> 
>> 
>> ----- Original Message -----
>>> From: "Vossen JP" <jp@jpsdomain.org>
>>> To: "Philadelphia Linux User's Group Discussion List"
>>> <plug@lists.phillylinux.org>
>>> Sent: Wednesday, 8 August, 2018 17:13:17
>>> Subject: [PLUG] Virtualization clusters & shared storage
>> 
>>> I have a question about virtualization cluster solutions.  One thing
>>> that has always bugged me is that VM vMotion/LiveMigration features
>>> require shared storage, which makes sense, but they always seem to
>>> assume that shared storage is external, as in a NAS or SAN.  What would
>>> be REALLY cool is a system that uses the cluster members' "local" storage
>>> as JBOD that becomes the shared storage.  Maybe that's how some of these
>>> solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but
>>> that seems to me to be a great solution for the lab & SOHO market.
>>>
>>> What I mean is, say I have at least 2 nodes in a cluster, though 3+
>>> would be better.  Each node would have at least 2 partitions, one for
>>> the OS/Hypervisor/whatever and the other for shared & replicated
>>> storage.  The "shared & replicated" partition would be, well, shared &
>>> replicated across the cluster, providing shared storage without needing
>>> an external NAS/SAN.
>>>
>>> This is important to me because we have a lot of hardware sitting around
>>> that has a lot of local storage.  It's basically all R710/720/730 with
>>> PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives.  While I
>>> *can* allocate some nodes for FreeNAS or something, that increases my
>>> required node count and wastes the CPU & RAM in the NAS nodes while also
>>> wasting a ton of local storage on the host nodes.  It would be more
>>> resource efficient to just use the "local" storage that's already
>>> spinning.  The alternative we're using now (that sucks) is that the
>>> hypervisors are all just stand-alone with local storage.  I'd rather get
>>> all the cluster advantages without the NAS/SAN issues
>>> (connectivity/speed, resilience, yet more rack space & boxes).
>>>
>>> Are there solutions that work that way and I've just missed it?
>>>
>>>
>>> Related, I'm aware of these virtualization environment tools, any more
>>> good ones?
>>> 1. OpenStack, but this is way too complicated and overkill
>>> 2. Proxmox sounds very cool
>>> 3. Cloudstack likewise, except it's Java! :-(
>>> 4. Ganeti was interesting but it looks like it may have stalled out
>>> around 2016
>>> 5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
>>> 6. https://en.wikipedia.org/wiki/OpenNebula with some Java and might do
>>> on-node-shared-storage?
>>> 7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except
>>> it's Java
>>>
>>> I'm asking partly for myself to replace my free but not F/OSS ESXi
>>> server at home and partly for a work lab that my team needs to rebuild
>>> in the next few months.  We have a mishmash right now, much of it ESXi.
>>> We have a lot of hardware laying around, but we have *no budget* for
>>> licenses for anything.  I know Lee will talk about the VMware starter
>>> packs and deals like that, but not only do we have no budget, that kind of
>>> thing is a nightmare politically and procedurally and is a no-go; it's
>>> free or nothing.  And yes, I know that free costs money in terms of
>>> people time, but that's already paid for and while we're already busy,
>>> this is something that has to happen.
>>>
>>> Also we might like to branch out from ESXi anyway...  We are doing
>>> some work in AWS, but that's not a solution here, though cross-cloud
>>> tools like Terraform (and Ansible) are in use, and the more we can use
>>> them here too the better.
> Thanks,
> JP
> --  -------------------------------------------------------------------
> JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/

-- 
"Between subtle shading and the absence of light lies the nuance of iqlusion..." - Kryptos 

Lee Marzke, lee@marzke.net http://marzke.net/lee/ 
IT Consultant, VMware, VCenter, SAN storage, infrastructure, SW CM 
+1 800-393-5217 voice/text 
+1 484-348-2230 fax
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug