John Von Essen on 12 Aug 2018 12:19:32 -0700
Re: [PLUG] Virtualization clusters & shared storage
Hey, there's nothing wrong with Supermicro servers - they last like 6 months now before they break.

John

Sent from my iPhone

> On Aug 9, 2018, at 12:56 PM, Lee H. Marzke <lee@marzke.net> wrote:
>
> JP,
>
> Thanks for your further description. From your use of 2 x FreeNAS I assume this is production and not a lab?
>
> For a lab, you can usually obtain a 4h response time to HW failures, and many labs might tolerate 6 hours of downtime as a rare event. So you can avoid the 2nd NAS unit entirely. In my recent talk I showed that you can easily replicate any FreeNAS volume to Amazon S3 if it is static, or ZFS send to Rsync.net if it contains running VMs. Most HW failures will not cause loss of the pool, so you're back up and running in a few hours. Loss of an entire pool would require full replication back, taking a long time per volume.
>
> For production, you would typically use 2 controllers on 1 set of dual-port SCSI disks (HA) instead of replication to a 2nd unit. FreeNAS lacks support for HA, so you would typically use a commercial unit such as TrueNAS or Nexenta. I also like Tegile and Nimble.
>
> For your discussion of hyper-converged, you're making assumptions that are not even close.
>
> Non-hyper-converged hosts should not have any local disks - just take them out.
>
> For FreeNAS, most CPU/RAM is used. All free RAM is used for the ARC, and CPU is used intermittently for rebuilds and scrubs.
>
> On a hyper-converged solution you typically get much less disk space than you think. VMware vSAN DOES NOT use parity disks locally. It is an object store of data (+ cache) with one or more copies on another host. Any failure (disk, cache, or complete host) is handled by getting the data from the 2nd host. All synchronous writes are ack'ed after two hosts have committed the write. I think many of the other hyper-converged solutions are similar. vSAN does not enforce keeping storage and VMs on the same host, while others may do that. (Datrium, for instance, keeps VMs in a local fast SSD cache on each host using a custom ESX module, with all writes re-ordered and written sequentially to a central JBOD box.)
>
> With only 3 hosts you are limited to mirroring data, losing 50% of storage. This is two storage locations and a witness for each object volume. Think of a witness as a checksum of the objects in the volume. So you can still prove a single object is correct by looking at the checksum on the witness. The remaining usable storage must have space reserved for snapshots, VM swap files, etc., so it is only 75% usable at best.
>
> To lose less storage than a mirror you need 4 or more hosts, and a 10Gb network for the replication. You have a choice of RAID 5 (which tolerates the loss of one host) or RAID 6 (which tolerates the loss of two hosts, minimum 6 hosts). Note that any disk or cache failure on a node causes that entire node to fail, as there is no local RAID or mirroring. You fix the disk issue, then rebuild the data on that node when it's back up.
>
> So with 4 hosts and RAID 5 you have data on 3 nodes and parity on 1, so you lose only 25%. Obviously using more hosts is beneficial, up to a limit of 32 hosts max. With disk RAID, you increase storage efficiency with more disks; with vSAN you add more hosts + networking + disks.
>
> You can also have multiple disk groups per host (HD + SSD), each replicated to a similar group on other hosts, so loss of a disk only causes that disk group to fail, not the entire host.
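Putting rough numbers on the capacity math above - a minimal Python sketch, assuming equal raw capacity per host, fixed 3+1 / 4+2 stripes for the RAID 5/6 erasure coding, and a flat 25% set-aside for snapshots, VM swap and rebuild headroom (real vSAN sizing is more involved):

    # usable space across N hosts under mirroring vs. erasure coding (illustrative only)
    def usable_tb(hosts, tb_per_host, scheme, slack=0.25):
        raw = hosts * tb_per_host
        if scheme == "mirror":        # FTT=1 mirroring: two full copies of every object
            data = raw / 2
        elif scheme == "raid5":       # erasure coding, 3 data + 1 parity
            data = raw * 3 / 4
        elif scheme == "raid6":       # erasure coding, 4 data + 2 parity
            data = raw * 4 / 6
        else:
            raise ValueError(scheme)
        return data * (1 - slack)     # leave room for snapshots, swap, rebuilds

    print(usable_tb(3, 10, "mirror"))   # 3 hosts x 10 TB mirrored -> ~11 TB usable (50%, then 75% of that)
    print(usable_tb(4, 10, "raid5"))    # 4 hosts x 10 TB, RAID-5  -> ~22 TB usable (lose 25%, then slack)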
> I think it's very interesting that vSAN is using object storage under the covers. But unlike others, it's just filling in one checkbox - and it's running on your ESXi hosts.
>
> Lee
>
> ----- Original Message -----
>> From: "Vossen JP" <jp@jpsdomain.org>
>> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
>> Sent: Wednesday, 8 August, 2018 21:23:53
>> Subject: Re: [PLUG] Virtualization clusters & shared storage
>>
>> First, thanks to Lee and Andy for the term I didn't know for what I mean: hyper-converged infrastructure. https://en.wikipedia.org/wiki/Hyper-converged_infrastructure
>>
>> It looks like Proxmox does that: https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure
>>
>> Thanks Keith, I had seen LizardFS but was not aware of the implications.
>>
>> Doug, Kubernetes is not in play, though it might be in the future. Or I may be missing your point.
>>
>> Lee, thanks for the insights and VMware updates, as always. :-)
>>
>> I've used FreeNAS in the past and like it, but I'm not sure I'm explaining my thoughts as well as I'd like. So let me try this:
>>
>> (I wrote this next part before I learned that "hyper-converged" is what I mean, but I'm leaving it here in case it's still useful.)
>>
>> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x 2TB drives in RAID5 for 10TB of local storage each.
>>
>> 3-node cluster with redundant FreeNAS:
>> 1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>> 2. VM node2: CPU/RAM used, 10TB local space wasted
>> 3. VM node3: CPU/RAM used, 10TB local space wasted
>> 4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>> 5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a mirror
>>
>> 3-node cluster with on-node-shared-storage (hyper-converged) storage:
>> 1. VM node1: CPU/RAM used, 20TB local/shared space available +
>> 2. VM node2: CPU/RAM used, 20TB local/shared space available +
>> 3. VM node3: CPU/RAM used, 20TB local/shared space available +
>> + For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space loss. If it was REALLY smart, it would keep the store for the local VMs local while still replicating, but that would require a lot more hooks into the entire system, not just some kind of replicated store.
>>
>> But the point here is that with my idea I have 2x the disk with 3/5ths of the servers. Or, put another way, I can now do a 5-node cluster with even more CPU, RAM and space dedicated to actually running VMs, and not lose 2/5ths of the nodes to just storing the VMs.
>>
>> That said, I'd thought about the rebuild overhead, but not in depth, and that - and general "parity" or redundancy, however implemented - is significant. So my 2/5ths comparisons are not 100% fair. But still, the idea apparently does have merit.
>>
>>> On 08/08/2018 07:50 PM, Lee H. Marzke wrote:
>>>
>>> JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or equivalent rackmount solutions from https://www.ixsystems.com. I run my lab on FreeNAS and a Dell 2950 server with 6x 2TB disks and 2x SSDs. If you put storage into the servers you will find all sorts of edge cases that you hadn't planned on.
>>>
>>> Just taking down a server for a quick RAM swap will cause it to need to rebuild using lots of network/CPU. If you have TBs of fast SSD storage on multiple servers and don't have 10Gb connectivity between hosts, or you have slow HDs, you will have pain.
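To put a rough number on that network pain, a minimal wire-speed sketch in Python - illustrative only, since it ignores protocol overhead, disk throughput limits and everything else that makes real evacuations slower:

    # hours to move a given amount of data off a node, at raw link speed
    def hours_to_move(terabytes, link_gbps):
        bits = terabytes * 8e12              # decimal TB -> bits
        return bits / (link_gbps * 1e9) / 3600

    for gbps in (1, 10):
        print(f"{gbps:>2} Gbps: ~{hours_to_move(10, gbps):.1f} h to evacuate 10 TB")
    # 1 Gbps -> ~22 h; 10 Gbps -> ~2.2 h, and that assumes the disks can keep up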
>>> Generally you try to migrate data off nodes prior to maintenance - which may take several days.
>>>
>>> The VMware solutions have changed a lot, and though they do not meet JP's needs for *free*, they may fit someone else's needs for a reliable / highly available solution. Of course ESXi and vSAN are free for a 60-day trial after install for lab use.
>>>
>>> First, there is a VMware migrate-*both* mode where you can migrate both the hypervisor and the storage in one go, where the two storage units are not connected across hypervisors. Needless to say, this takes a long time, sequentially moving memory and then disk to the remote server, and it doesn't help improve HA.
>>>
>>> Next, VMware vSAN is catching on really fast and VMware is hiring like mad to fill new vSAN technical sales roles nationwide. vSAN uses storage in each host (minimum of one SSD and one HD) and uses high-performance object storage on each compute node. All VM objects are stored on two hosts minimum, with vSAN taking care of all the distribution. The hosts must be linked on a 1Gb (preferably 10Gb) private network for back-end communication. Writes are sent and committed to two nodes before being acknowledged. You get one big storage pool - and allocate storage to VMs as you like - with no sub-LUNs or anything else to manage. If you have 4 or more hosts, instead of mirroring data over 2 hosts you can do erasure coding (the equivalent of RAID 5/6, but with the disks spread out across hosts). So now you're not losing 50% of your storage, but you have more intensive CPU and network operations. The vSAN software is pre-installed into ESX these days - you just need to activate it and apply a license after the free 60-day trial.
>>>
>>> Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles would be used locally in the hyper-converged solutions as well (perhaps taking 10% to 20% of cycles away from a host for storage and replication), so you may need more / larger hosts in a hyper-converged solution to make up for that. Remember that mirroring takes little CPU but wastes 50% of your storage; erasure coding is much more CPU- and network-intensive.
>>>
>>> The other solutions mentioned, except a ZFS server, are likely way too complex for a lab storage solution. Is a company really going to give a lab team 6 months of effort to put together storage that may or may not perform? Can you do a business justification to spend dozens of man-months of effort just to save the $20K on an entry-level TrueNAS ZFS?
>>>
>>> Lee
>>>
>>> ----- Original Message -----
>>>> From: "Vossen JP" <jp@jpsdomain.org>
>>>> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
>>>> Sent: Wednesday, 8 August, 2018 17:13:17
>>>> Subject: [PLUG] Virtualization clusters & shared storage
>>>>
>>>> I have a question about virtualization cluster solutions. One thing that has always bugged me is that VM vMotion/LiveMigration features require shared storage, which makes sense, but they always seem to assume that shared storage is external, as in a NAS or SAN. What would be REALLY cool is a system that uses the cluster members' "local" storage as JBOD that becomes the shared storage. Maybe that's how some of the solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but that seems to me to be a great solution for the lab & SOHO market.
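A purely conceptual toy of that idea, with every name below made up for illustration: each "node" directory stands in for one host's local disk, each object is written to two of them before the write is considered done, and a read can be served from any surviving copy. Real systems such as Ceph, GlusterFS or vSAN add placement maps, quorum and self-healing on top of this:

    import hashlib, os

    NODES = ["node1/brick", "node2/brick", "node3/brick"]   # stand-ins for per-host local storage

    def put(name, data, replicas=2):
        """Write `replicas` copies onto distinct nodes, chosen by hashing the name."""
        start = int(hashlib.sha256(name.encode()).hexdigest(), 16) % len(NODES)
        for i in range(replicas):                # only "ack" once every replica is on disk
            node = NODES[(start + i) % len(NODES)]
            os.makedirs(node, exist_ok=True)
            with open(os.path.join(node, name), "wb") as f:
                f.write(data)

    def get(name):
        """Read from whichever node still holds a copy (tolerates losing one node)."""
        for node in NODES:
            path = os.path.join(node, name)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return f.read()
        raise FileNotFoundError(name)

    put("vm-disk-0001", b"...")                  # lands on two of the three "nodes"
    print(len(get("vm-disk-0001")))              # still readable if either copy survives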
>>>> What I mean is, say I have at least 2 nodes in a cluster, though 3+ would be better. Each node would have at least 2 partitions, one for the OS/hypervisor/whatever and the other for shared & replicated storage. The "shared & replicated" partition would be, well, shared & replicated across the cluster, providing shared storage without needing an external NAS/SAN.
>>>>
>>>> This is important to me because we have a lot of hardware sitting around that has a lot of local storage. It's basically all R710/720/730 with PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives. While I *can* allocate some nodes for FreeNAS or something, that increases my required node count and wastes the CPU & RAM in the NAS nodes while also wasting a ton of local storage on the host nodes. It would be more resource-efficient to just use the "local" storage that's already spinning. The alternative we're using now (which sucks) is that the hypervisors are all just stand-alone with local storage. I'd rather get all the cluster advantages without the NAS/SAN issues (connectivity/speed, resilience, yet more rack space & boxes).
>>>>
>>>> Are there solutions that work that way that I've just missed?
>>>>
>>>> Related, I'm aware of these virtualization environment tools; any more good ones?
>>>> 1. OpenStack, but this is way too complicated and overkill
>>>> 2. Proxmox sounds very cool
>>>> 3. CloudStack likewise, except it's Java! :-(
>>>> 4. Ganeti was interesting, but it looks like it may have stalled out around 2016
>>>> 5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
>>>> 6. https://en.wikipedia.org/wiki/OpenNebula with some Java, and might do on-node shared storage?
>>>> 7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except it's Java
>>>>
>>>> I'm asking partly for myself, to replace my free-but-not-F/OSS ESXi server at home, and partly for a work lab that my team needs to rebuild in the next few months. We have a mishmash right now, much of it ESXi. We have a lot of hardware lying around, but we have *no budget* for licenses for anything. I know Lee will talk about the VMware starter packs and deals like that, but not only do we have no budget, that kind of thing is a nightmare politically and procedurally and is a no-go; it's free or nothing. And yes, I know that free costs money in terms of people time, but that's already paid for, and while we're already busy, this is something that has to happen.
>>>>
>>>> Also, we might like to branch out from ESXi anyway... We are doing some work in AWS, but that's not a solution here, though cross-cloud tools like Terraform (and Ansible) are in use, and the more we can use them here too the better.
>>
>> Thanks,
>> JP
>> --
>> -------------------------------------------------------------------
>> JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/
>
> --
> "Between subtle shading and the absence of light lies the nuance of iqlusion..."
> - Kryptos
>
> Lee Marzke, lee@marzke.net  http://marzke.net/lee/
> IT Consultant, VMware, VCenter, SAN storage, infrastructure, SW CM
> +1 800-393-5217 voice/text
> +1 484-348-2230 fax

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug