John Von Essen on 12 Aug 2018 12:19:32 -0700



Re: [PLUG] Virtualization clusters & shared storage


Hey, there’s nothing wrong with Supermicro servers; they last like 6 months now before they break.

John

Sent from my iPhone

> On Aug 9, 2018, at 12:56 PM, Lee H. Marzke <lee@marzke.net> wrote:
> 
> JP,
> 
> Thanks for your further description.  From your use of 2 x FreeNAS I assume this is production and not a lab?
> 
> For a lab, you can usually obtain a 4-hour response time to HW failures,
> and many labs might tolerate 6 hours of downtime as a rare event.  So you can avoid
> the 2nd NAS unit entirely.  In my recent talk I showed that you can easily replicate any FreeNAS
> volume to Amazon S3 if it is static, or ZFS send it to Rsync.net if it contains running VMs.
> Most HW failures will not cause loss of the pool, so you're back up and running in a few
> hours.  Loss of an entire pool would require full replication back, taking a long time per volume.
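> 
> As a rough sketch, that off-site replication step amounts to piping an incremental
> ZFS send into a receive on the remote end.  (Pool, dataset, snapshot, and host names
> below are illustrative placeholders, not anything specific to Rsync.net.)
> 
>     # Hedged sketch: incremental ZFS replication over ssh to an off-site box.
>     import subprocess
> 
>     def replicate(dataset="tank/vmstore", prev="daily-1", cur="daily-0",
>                   target="user@offsite-host", remote_ds="backup/vmstore"):
>         """Pipe an incremental 'zfs send' into 'zfs receive' on the remote side."""
>         send = subprocess.Popen(
>             ["zfs", "send", "-i", f"{dataset}@{prev}", f"{dataset}@{cur}"],
>             stdout=subprocess.PIPE)
>         recv = subprocess.run(
>             ["ssh", target, "zfs", "receive", "-F", remote_ds],
>             stdin=send.stdout)
>         send.stdout.close()
>         return send.wait() == 0 and recv.returncode == 0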
> 
> For production, you would typically use 2 controllers on 1 set of dual-port SCSI disks (HA)
> instead of replication to a 2nd unit.  FreeNAS lacks support for HA, so you would typically
> use a commercial unit such as TrueNAS or Nexenta.  I also like Tegile and Nimble.
> 
> For your discussion of hyper-converged, you're making assumptions that are not even close.
> 
> Non-hyper-converged hosts should not have any local disks - just take them out.
> 
> For FreeNAS, most CPU/RAM is used.  All free RAM is used for the ARC read cache, and CPU
> is used intermittently for rebuilds and scrubs.
> 
> On a hyper-converged solution you typically get much less disk space than you think.
> VMware vSAN DOES NOT use parity disks locally.  It is an object store of data (+ cache)
> with one or more copies on another host.  Any failure (disk, cache, or complete host)
> is handled by getting the data from the 2nd host.  All synchronous writes are ack'ed after
> two hosts have committed the write.  I think many of the other hyper-converged solutions
> are similar.  vSAN does not enforce keeping storage and VMs on the same host, while others
> may do that.  (Datrium, for instance, keeps VMs in a local fast SSD cache on each host using
> a custom ESX module, with all writes re-ordered and written sequentially to a central JBOD box.)
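> 
> As a toy model of that write rule (not the actual vSAN code, just the
> ack-after-two-commits behavior described above; all names are illustrative):
> 
>     # Toy model: a synchronous write is acknowledged to the guest only after
>     # the required number of hosts (2 for a mirror) have committed it.
>     class Host:
>         """Stand-in for one ESXi host's local object store."""
>         def __init__(self):
>             self.store = []
>         def commit(self, block):
>             self.store.append(block)   # pretend this is a durable write
>             return True
> 
>     def replicated_write(block, hosts, copies_required=2):
>         committed = sum(1 for h in hosts if h.commit(block))
>         return committed >= copies_required   # ack only if enough copies landed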
> 
> With only 3 hosts you are limited to mirroring data, losing 50% of storage.  That is two
> storage locations plus a witness for each object volume.  Think of a witness as a checksum
> of the objects in the volume, so you can still prove a single object is correct by looking at the
> checksum on the witness.  The remaining usable storage must have space reserved for snapshots, VM
> swap files, etc., so it is only 75% usable at best.
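> 
> As a back-of-envelope sketch of what that leaves you (10 TB raw per host is just an
> illustrative number):
> 
>     # Usable space for the 3-host mirrored case described above (illustrative).
>     raw_per_host_tb = 10
>     hosts = 3
>     raw_tb = raw_per_host_tb * hosts       # 30 TB raw across the cluster
>     mirrored_tb = raw_tb / 2               # two copies of every object -> 15 TB
>     usable_tb = mirrored_tb * 0.75         # keep ~25% slack for snapshots,
>                                            # VM swap, and rebuild headroom
>     print(raw_tb, mirrored_tb, usable_tb)  # 30 15.0 11.25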
> 
> To lose less storage than a mirror you need 4 or more hosts, and a 10Gb network for the replication.
> You have a choice of RAID 5 (tolerates loss of one host) or RAID 6 (tolerates loss of
> two hosts, min 5 hosts).  Note that any disk or cache failure on a node causes that
> entire node to fail, as there is no local RAID or mirroring.  You fix the disk issue, then
> rebuild the data on that node when it's back up.
> 
> So with 4 hosts and RAID 5 you have data on 3 nodes and parity on 1, so you lose
> only 25%.  Obviously using more hosts is beneficial, up to a limit of 32 hosts max.
> With disk RAID, you increase storage efficiency with more disks; with vSAN you add more
> hosts + networking + disks.
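> 
> The efficiency math is just data hosts over total hosts (illustrative only; it
> ignores witness/metadata overhead and the slack mentioned above):
> 
>     # Rough storage efficiency by protection scheme spread across hosts.
>     def efficiency(data_hosts, parity_hosts):
>         return data_hosts / (data_hosts + parity_hosts)
> 
>     print(efficiency(1, 1))   # 2-way mirror            -> 0.50
>     print(efficiency(3, 1))   # RAID-5 style, 4 hosts   -> 0.75
>     print(efficiency(3, 2))   # RAID-6 style, 5 hosts   -> 0.60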
> 
> You can also have multiple disk groups per host (HD + SSD), each replicated to a
> similar group on other hosts,  so loss of a disk only causes that disk group to
> fail, not the entire host.
> 
> I think it's very interesting that vSAN is using object storage under the covers.  But
> unlike the others, enabling it is just filling in one checkbox - and it's running on your ESXi hosts.
> 
> Lee
> 
> 
> ----- Original Message -----
>> From: "Vossen JP" <jp@jpsdomain.org>
>> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
>> Sent: Wednesday, 8 August, 2018 21:23:53
>> Subject: Re: [PLUG] Virtualization clusters & shared storage
> 
>> First, thanks to Lee and Andy for the term I didn't know for what I
>> mean: hyper-converged infrastructure.
>>    https://en.wikipedia.org/wiki/Hyper-converged_infrastructure
>> 
>> It looks like Proxmox does that:
>>    https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure
>> 
>> Thanks Keith, I had seen LizardFS but was not aware of the implications.
>> 
>> Doug, Kubernetes is not in play, though it might be in the future.  Or I
>> may be missing your point.
>> 
>> Lee, thanks for the insights and VMware updates, as always. :-)
>> 
>> I've used FreeNAS in the past and like it, but I'm not sure I'm
>> explaining my thoughts as well as I'd like.  But let me try this:
>> 
>> (I  wrote this next part before I learned that "hyper-converged" is what
>> I mean, but I'm leaving it here in case it's still useful.)
>> 
>> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs, and 6x
>> 2TB drives in RAID5 for 10TB local storage each.
>> 
>> 3 node cluster with redundant FreeNAS:
>>    1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>>    2. VM node2: CPU/RAM used, 10TB local space wasted
>>    3. VM node3: CPU/RAM used, 10TB local space wasted
>>    4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>>    5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a
>> mirror
>> 
>> 3 node cluster with -on-node-shared-storage- hyper-converged storage:
>>    1. VM node1: CPU/RAM used, 20TB local/shared space available +
>>    2. VM node2: CPU/RAM used, 20TB local/shared space available +
>>    3. VM node3: CPU/RAM used, 20TB local/shared space available +
>> + For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space
>> loss.  If it was REALLY smart, it would keep the store for the local VMs
>> local while still replicating, but that would require a lot more hooks
>> into the entire system, not just some kind of replicated system.
>> 
>> But the point here is that with my idea I have 2x the disk with 3/5ths
>> the servers.  Or put another way, I can now do a 5 node cluster with
>> even more CPU, RAM and space dedicated to actually running VMs, and not
>> lose 2/5ths of the nodes to just storing the VMs.
>> 
>> That said, I'd thought about the rebuild overhead, but not in depth, and
>> that--and general "parity" or redundancy however implemented--are
>> significant.  So my 2/5ths comparisons are not 100% fair.  But still,
>> the idea apparently does have merit.
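>> 
>> As a quick worked version of that comparison (10TB usable per node after local
>> RAID5 is the working assumption from above):
>> 
>>     # Layout 1: 5 servers, 3 VM nodes + 2 FreeNAS (2nd is a mirror) -> 10 TB shared
>>     dedicated_nas_usable_tb = 1 * 10
>> 
>>     # Layout 2: 3 servers, hyper-converged, one node's worth of capacity lost
>>     # to parity/replication (the "(3-1) * 10TB" guess above) -> 20 TB shared
>>     hyper_converged_usable_tb = (3 - 1) * 10
>> 
>>     print(dedicated_nas_usable_tb, hyper_converged_usable_tb)   # 10 20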
>> 
>> 
>>> On 08/08/2018 07:50 PM, Lee H. Marzke wrote:
>>> 
>>> JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or
>>> equivalent rackmount solutions from https://www.ixsystems.com.  I run my lab on
>>> FreeNAS and a Dell 2950 server with 6x 2TB disks and 2x SSD.  If you put storage
>>> into the servers you will find all sorts of edge cases that you hadn't planned
>>> on.
>>> 
>>> Just taking down a server for a quick RAM swap will cause it to need to rebuild
>>> using lots of network/CPU.  If you have TBs of fast SSD storage on multiple servers
>>> and don't have 10Gb connectivity between hosts, or you have slow HDs, you will have
>>> pain.  Generally you try to migrate data off nodes prior to maintenance - which may
>>> take several days.
>>> 
>>> The VMware solutions have changed a lot, and though they do not meet JP's
>>> needs for *free*, they may fit someone else's needs for a reliable / highly
>>> available solution.  Of course ESXi and vSAN are free for a 60-day trial after
>>> install for lab use.
>>> 
>>> First there is a VMware migrate-*both* mode where you can migrate both the
>>> hypervisor and then the storage in one go, where the two storage units are not
>>> connected across hypervisors.  Needless to say this takes a long time, sequentially
>>> moving memory and then disk to the remote server, and it doesn't help improve HA.
>>> 
>>> Next, VMware vSAN is catching on really fast and VMware is hiring like mad
>>> to fill new vSAN technical sales roles nationwide.  vSAN uses storage in each
>>> host (minimum of one SSD and one HD) and uses high-performance object
>>> storage on each compute node.  All VM objects are stored on two hosts
>>> minimum, with vSAN taking care of all the distribution.  The hosts must be
>>> linked on a 1Gb (pref 10Gb) private network for back-end communication.
>>> Writes are sent and committed to two nodes before being acknowledged.
>>> You get one big storage pool - and allocate storage to VMs as you like -
>>> with no sub-LUNs or anything else to manage.  If you have 4 or more hosts,
>>> instead of mirroring data over 2 hosts you can do erasure coding (equiv
>>> of RAID 5/6, but with disks spread out across hosts).  So now you're not
>>> losing 50% of your storage, but you have more intensive CPU and network
>>> operations.  The vSAN software is pre-installed into ESX these days - just
>>> need to activate it and apply a license after the 60-day free trial.
>>> 
>>> Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles
>>> would be used locally in the hyper-converged solutions as well (perhaps taking 10%
>>> to 20% of cycles away from a host for storage and replication), so you may need
>>> more / larger hosts in a hyper-converged solution to make up for that.  Remember
>>> mirroring takes little CPU but wastes 50% of your storage; any erasure coding is
>>> much more CPU intensive, and more network intensive.
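>>> 
>>> As a rough sizing sketch (the 10-20% figure above is the only input from this
>>> thread; the host count is illustrative):
>>> 
>>>     # If storage/replication eats ~15% of each host, how many hyper-converged
>>>     # hosts do you need to match 5 hosts' worth of pure VM capacity?
>>>     vm_hosts_needed = 5
>>>     storage_overhead = 0.15
>>>     hosts_required = vm_hosts_needed / (1 - storage_overhead)
>>>     print(round(hosts_required, 2))   # ~5.88 -> plan for 6 hosts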
>>> 
>>> The other solutions mentioned, except a ZFS server, are likely way too complex
>>> for a lab storage solution.  Is a company really going to give
>>> a lab team 6 months of effort to put together storage that may or may
>>> not perform?  Can you do a business justification to spend dozens of man-months of
>>> effort just to save the $20K on an entry-level TrueNAS ZFS?
>>> 
>>> Lee
>>> 
>>> 
>>> ----- Original Message -----
>>>> From: "Vossen JP" <jp@jpsdomain.org>
>>>> To: "Philadelphia Linux User's Group Discussion List"
>>>> <plug@lists.phillylinux.org>
>>>> Sent: Wednesday, 8 August, 2018 17:13:17
>>>> Subject: [PLUG] Virtualization clusters & shared storage
>>> 
>>>> I have a question about virtualization cluster solutions.  One thing
>>>> that has always bugged me is that VM vMotion/LiveMigration features
>>>> require shared storage, which makes sense, but they always seem to
>>>> assume that shared storage is external, as in a NAS or SAN.  What would
>>>> be REALLY cool is a system that uses the cluster members' "local" storage
>>>> as JBOD that becomes the shared storage.  Maybe that's how some of these
>>>> solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but
>>>> that seems to me to be a great solution for the lab & SOHO market.
>>>> 
>>>> What I mean is, say I have at least 2 nodes in a cluster, though 3+
>>>> would be better.  Each node would have at least 2 partitions, one for
>>>> the OS/Hypervisor/whatever and the other for shared & replicated
>>>> storage.  The "shared & replicated" partition would be, well, shared &
>>>> replicated across the cluster, providing shared storage without needing
>>>> an external NAS/SAN.
>>>> 
>>>> This is important to me because we have a lot of hardware sitting around
>>>> that has a lot of local storage.  It's basically all R710/720/730 with
>>>> PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives.  While I
>>>> *can* allocate some nodes for FreeNAS or something, that increases my
>>>> required node count and wastes the CPU & RAM in the NAS nodes while also
>>>> wasting a ton of local storage on the host nodes.  It would be more
>>>> resource efficient to just use the "local" storage that's already
>>>> spinning.  The alternative we're using now (that sucks) is that the
>>>> hypervisors are all just stand-alone with local storage.  I'd rather get
>>>> all the cluster advantages without the NAS/SAN issues
>>>> (connectivity/speed, resilience, yet more rack space & boxes).
>>>> 
>>>> Are there solutions that work that way that I've just missed?
>>>> 
>>>> 
>>>> Related, I'm aware of these virtualization environment tools, any more
>>>> good ones?
>>>> 1. OpenStack, but this is way too complicated and overkill
>>>> 2. Proxmox sounds very cool
>>>> 3. Cloudstack likewise, except it's Java! :-(
>>>> 4. Ganeti was interesting but it looks like it may have stalled out
>>>> around 2016
>>>> 5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
>>>> 6. https://en.wikipedia.org/wiki/OpenNebula with some Java and might do
>>>> on-node-shared-storage?
>>>> 7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except
>>>> it's Java
>>>> 
>>>> I'm asking partly for myself to replace my free but not F/OSS ESXi
>>>> server at home and partly for a work lab that my team needs to rebuild
>>>> in the next few months.  We have a mishmash right now, much of it ESXi.
>>>> We have a lot of hardware lying around, but we have *no budget* for
>>>> licenses for anything.  I know Lee will talk about the VMware starter
>>>> packs and deals like that, but not only do we have no budget, that kind of
>>>> thing is a nightmare politically and procedurally and is a no-go; it's
>>>> free or nothing.  And yes I know that free costs money in terms of
>>>> people time, but that's already paid for and while we're already busy,
>>>> this is something that has to happen.
>>>> 
>>>> Also we might like to branch out from ESXi anyway...  We are doing
>>>> some work in AWS, but that's not a solution here, though cross-cloud
>>>> tools like Terraform (and Ansible) are in use and the more we can use
>>>> them here too the better.
>> Thanks,
>> JP
>> --  -------------------------------------------------------------------
>> JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/
> 
> -- 
> "Between subtle shading and the absence of light lies the nuance of iqlusion..." - Kryptos 
> 
> Lee Marzke, lee@marzke.net http://marzke.net/lee/ 
> IT Consultant, VMware, VCenter, SAN storage, infrastructure, SW CM 
> +1 800-393-5217 voice/text 
> +1 484-348-2230 fax
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug