Lee H. Marzke on 10 Aug 2018 12:31:20 -0700
Re: [PLUG] Virtualization clusters & shared storage
Keith, LizardFS seems to have similar disk efficiency to VSAN set to hostFailuresToTolerate=2: 60TB total disk, ~20TB usable, or perhaps ~40TB with erasure coding. Might be great for a lab; however, this is now using 5 physical servers for just storage, and no hypervisors. Quite a lot of complexity and setup for storage compared to a single dual-head TrueNAS box.

Does this have any de-duplication? Note that Tegile has de-duplication on all models, and Nimble has de-dup on the all-flash units, which may significantly save space in a lab with many repetitive test environments.

Lee

----- Original Message -----
> From: "Keith C. Perry" <kperry@daotechnologies.com>
> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
> Sent: Friday, 10 August, 2018 14:43:24
> Subject: Re: [PLUG] Virtualization clusters & shared storage
>
> From JP...
>
> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x 2TB drives in RAID5 for 10TB local storage each.
>
> ...one of your builds...
>
> 3 node cluster with redundant FreeNAS:
> 1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
> 2. VM node2: CPU/RAM used, 10TB local space wasted
> 3. VM node3: CPU/RAM used, 10TB local space wasted
> 4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
> 5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a mirror
>
> My build with LizardFS with the same hardware...
>
> 5 node cluster, with a standard goal of 3 or EC(5,2):
> 1. LizardFS master + chunkserver 1 (~12TB for data storage) + VM node 1
> 2. LizardFS shadow master + chunkserver 2 (~12TB for data storage) + VM node 2
> 3. LizardFS metalogger + chunkserver 3 (~12TB for data storage) + VM node 3
> 4. LizardFS metalogger + chunkserver 4 (~12TB for data storage) + VM node 4
> 5. LizardFS metalogger + chunkserver 5 (~12TB for data storage) + VM node 5
>
> This deployment would give you around 60TB of raw space. For argument's sake, if we assume you're using the same goal type for all data, then goal 3 would give you about 20TB of data space and EC(5,2) would give you 42.86TB.
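For anyone who wants to check the arithmetic, here is a rough sketch of where those usable-capacity figures come from, assuming plain N-copy replication for a goal of N and a k-data/m-parity split for EC(k,m); the function names are only illustrative, not anything from LizardFS itself, and a real deployment loses a bit more to metadata and chunk overhead.

    # Rough usable-capacity estimates; ignores metadata, filesystem and chunk overhead.
    def usable_replication(raw_tb, goal):
        # "goal N" keeps N full copies of every chunk
        return raw_tb / goal

    def usable_erasure(raw_tb, data_parts, parity_parts):
        # EC(k,m) stores k data parts + m parity parts per chunk
        return raw_tb * data_parts / (data_parts + parity_parts)

    raw = 5 * 6 * 2                          # 5 chunkservers x 6 x 2TB drives = 60TB raw
    print(usable_replication(raw, 3))        # ~20.0 TB with goal 3
    print(usable_erasure(raw, 5, 2))         # ~42.86 TB with EC(5,2)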
> In addition, some other common points...
>
> 1. I'm ignoring system and metadata requirements because they are negligible. Even if the OS were on a 500GB drive, that would be more than enough space for the OS and metadata that could manage tens of millions of files. This is why the master and shadow master can also be chunkservers. However, it is a general practice to use a small RAID-1 for the OS volume for better availability.
> 2. While using RAID or LVM is possible, it is completely unnecessary with LizardFS. The system can use single drives directly on a chunkserver, thus the overhead is just the file system formatting.
> 3. R710s, if I recall correctly, can have 2 or 4 ports. I would strongly recommend using bonded NICs so that you have the highest network throughput possible (mode 5 or mode 6 if your NICs can do it; LACP/mode 4 if you have a capable switch).
>
> This build would give you a completely fault tolerant and available storage system with low storage overhead. I won't say anything about IOPS because that depends on your workloads, but network bonds help quite a bit structurally since that is a typical bottleneck. Specifically, this system can tolerate:
>
> 1. Any 2 disks can fail anywhere... LizardFS automatically load balances your data across all your disks on all your chunkservers, so whether you use individual disks, LVM or RAID, the system will do that. When a disk fails, the system will begin to migrate data to maintain your goals if there are available resources. If you specify disks directly in the system, you would just replace the failed one- no messing with RAID and LVM procedures to replace a failed disk. LizardFS will re-balance the data again when failed resources are replaced. There is no downtime or data loss.
> 2. Any 2 chunkservers can fail... extending from above, any 2 nodes can die completely- even the master server (which has the metadata). When the master fails, the shadow master takes over automatically. There is no downtime or data loss.
> 3. If both the master and shadow master go down, there are 3 metaloggers which just receive metadata. Any of those 3 servers can manually have the master service brought up, or just be a place from which the real master restores its metadata. Here there will be downtime but still no data loss.
>
> So, with 5 servers LizardFS can give you a very resilient system which scales up, out or both, and can be adjusted easily to facilitate multiple strategies in one storage system.
>
>
> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
> Keith C. Perry, MS E.E.
> Managing Member, DAO Technologies LLC
> (O) +1.215.525.4165 x2033
> (M) +1.215.432.5167
> www.daotechnologies.com
>
> ----- Original Message -----
> From: "JP Vossen" <jp@jpsdomain.org>
> To: plug@lists.phillylinux.org
> Sent: Wednesday, August 8, 2018 9:23:53 PM
> Subject: Re: [PLUG] Virtualization clusters & shared storage
>
> First, thanks to Lee and Andy for the term I didn't know for what I mean: hyper-converged infrastructure.
> https://en.wikipedia.org/wiki/Hyper-converged_infrastructure
>
> It looks like Proxmox does that:
> https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure
>
> Thanks Keith, I had seen LizardFS but was not aware of the implications.
>
> Doug, Kubernetes is not in play, though it might be in the future. Or I may be missing your point.
>
> Lee, thanks for the insights and VMware updates, as always. :-)
>
> I've used FreeNAS in the past and like it, but I'm not sure I'm explaining my thoughts as well as I'd like. But let me try this:
>
> (I wrote this next part before I learned that "hyper-converged" is what I mean, but I'm leaving it here in case it's still useful.)
>
> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x 2TB drives in RAID5 for 10TB local storage each.
>
> 3 node cluster with redundant FreeNAS:
> 1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
> 2. VM node2: CPU/RAM used, 10TB local space wasted
> 3. VM node3: CPU/RAM used, 10TB local space wasted
> 4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
> 5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a mirror
>
> 3 node cluster with -on-node-shared-storage- hyper-converged storage:
> 1. VM node1: CPU/RAM used, 20TB local/shared space available +
> 2. VM node2: CPU/RAM used, 20TB local/shared space available +
> 3. VM node3: CPU/RAM used, 20TB local/shared space available +
> + For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space loss. If it was REALLY smart, it would keep the storage for the local VMs local while still replicating, but that would require a lot more hooks into the entire system, not just some kind of replicated system.
>
> But the point here is that with my idea I have 2x the disk with 3/5ths the servers. Or put another way, I can now do a 5 node cluster with even more CPU, RAM and space dedicated to actually running VMs, and not lose 2/5ths of the nodes to just storing the VMs.
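As a sanity check on the comparison JP is making, here is a tiny sketch that simply re-runs his arithmetic; the "(n-1) x per-node" parity factor is his own simplification, not how any of these systems actually computes overhead, and the script is purely illustrative.

    # Re-running JP's back-of-the-envelope numbers (his assumptions, not a product's).
    TB_PER_NODE = 10

    def hyperconverged(nodes):
        # every node runs VMs; the shared pool loses roughly one node's worth to parity
        return nodes, (nodes - 1) * TB_PER_NODE

    # 3 VM nodes + 2 dedicated FreeNAS boxes (the second box only mirrors the first)
    print("FreeNAS build: 3 VM nodes, 10TB shared")
    # 3-node hyper-converged: "2x the disk with 3/5ths the servers"
    print("HC, 3 nodes:   %d VM nodes, %dTB shared" % hyperconverged(3))   # 3, 20
    # all 5 nodes hyper-converged
    print("HC, 5 nodes:   %d VM nodes, %dTB shared" % hyperconverged(5))   # 5, 40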
> That said, I'd thought about the rebuild overhead, but not in depth, and that--and general "parity" or redundancy however implemented--are significant. So my 2/5ths comparisons are not 100% fair. But still, the idea apparently does have merit.
>
> On 08/08/2018 07:50 PM, Lee H. Marzke wrote:
>>
>> JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or equivalent rackmount solutions from https://www.ixsystems.com. I run my lab on FreeNAS and a Dell 2950 server with 6x 2TB disks and 2x SSD. If you put storage into the servers you will find all sorts of edge cases that you hadn't planned on.
>>
>> Just taking down a server for a quick RAM swap will cause it to need to rebuild, using lots of network/CPU. If you have TBs of fast SSD storage on multiple servers and don't have 10Gb connectivity between hosts, or you have slow HDs, you will have pain. Generally you try to migrate data off nodes prior to maintenance - which may take several days.
>>
>> The VMware solutions have changed a lot, and though they do not meet JP's needs for *free*, they may fit someone else's needs for a reliable / highly available solution. Of course ESXi and VSAN are free for a 60-day trial after install for lab use.
>>
>> First there is a VMware migrate-*both* mode where you can migrate both the hypervisor and then the storage in one go, where the two storage units are not connected across hypervisors. Needless to say this takes a long time to sequentially move memory, then disk, to the remote server, and it doesn't help improve HA.
>>
>> Next, VMware VSAN is catching on really fast and VMware is hiring like mad to fill new VSAN technical sales roles nationwide. VSAN uses storage in each host (minimum of one SSD and one HD) and uses high-performance object storage on each compute node. All VM objects are stored on two hosts minimum, with vSAN taking care of all the distribution. The hosts must be linked on a 1Gb (pref. 10Gb) private network for back-end communication. Writes are sent and committed to two nodes before being acknowledged. You get one big storage pool - and allocate storage to VMs as you like - with no sub-LUNs or anything else to manage. If you have 4 or more hosts, instead of mirroring data over 2 hosts you can do erasure coding (equivalent of RAID 5/6, but with disks spread out across hosts). So now you're not losing 50% of your storage, but you have more intensive CPU and network operations. The vSAN software is pre-installed into ESXi these days - you just need to activate it and apply a license after the free 60-day trial.
>>
>> Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles would be used locally in the hyperconverged solutions as well (perhaps taking 10% to 20% of cycles away from a host for storage and replication), so you may need more / larger hosts in a hyperconverged solution to make up for that. Remember mirroring takes little CPU but wastes 50% of your storage; any erasure coding is much more CPU intensive, and more network intensive.
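To put rough numbers behind the mirroring-versus-erasure-coding trade-off Lee describes, here is a small sketch using the commonly cited vSAN policy ratios (2x for FTT=1 mirroring, 3x for FTT=2 mirroring, 4/3x for RAID-5, 1.5x for RAID-6) and host minimums; treat it as an approximation only, since real usable space also depends on slack space, dedup/compression and per-object overhead.

    # Approximate raw-to-usable ratios for common vSAN-style storage policies.
    POLICIES = {
        # policy name:            (raw TB consumed per usable TB, minimum hosts)
        "RAID-1 mirror, FTT=1":   (2.0, 3),
        "RAID-1 mirror, FTT=2":   (3.0, 5),
        "RAID-5 erasure, FTT=1":  (4 / 3, 4),
        "RAID-6 erasure, FTT=2":  (1.5, 6),
    }

    raw_tb = 60
    for name, (factor, min_hosts) in POLICIES.items():
        print(f"{name}: ~{raw_tb / factor:.1f}TB usable (needs >= {min_hosts} hosts)")
    # FTT=2 mirroring -> ~20TB and RAID-6 erasure coding -> ~40TB, matching the figures above.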
>> The other solutions mentioned, except a ZFS server, are likely way too complex for a lab storage solution. Is a company really going to give a lab team 6 months of effort to put together storage that may or may not perform? Can you do a business justification to spend dozens of man-months of effort just to save the $20K on an entry-level TrueNAS ZFS?
>>
>> Lee
>>
>>
>> ----- Original Message -----
>>> From: "Vossen JP" <jp@jpsdomain.org>
>>> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
>>> Sent: Wednesday, 8 August, 2018 17:13:17
>>> Subject: [PLUG] Virtualization clusters & shared storage
>>
>>> I have a question about virtualization cluster solutions. One thing that has always bugged me is that VM vMotion/LiveMigration features require shared storage, which makes sense, but they always seem to assume that shared storage is external, as in a NAS or SAN. What would be REALLY cool is a system that uses the cluster members' "local" storage as JBOD that becomes the shared storage. Maybe that's how some of these solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but that seems to me to be a great solution for the lab & SOHO market.
>>>
>>> What I mean is, say I have at least 2 nodes in a cluster, though 3+ would be better. Each node would have at least 2 partitions, one for the OS/hypervisor/whatever and the other for shared & replicated storage. The "shared & replicated" partition would be, well, shared & replicated across the cluster, providing shared storage without needing an external NAS/SAN.
>>>
>>> This is important to me because we have a lot of hardware sitting around that has a lot of local storage. It's basically all R710/720/730 with PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives. While I *can* allocate some nodes for FreeNAS or something, that increases my required node count and wastes the CPU & RAM in the NAS nodes while also wasting a ton of local storage on the host nodes. It would be more resource efficient to just use the "local" storage that's already spinning. The alternative we're using now (that sucks) is that the hypervisors are all just stand-alone with local storage. I'd rather get all the cluster advantages without the NAS/SAN issues (connectivity/speed, resilience, yet more rack space & boxes).
>>>
>>> Are there solutions that work that way and I've just missed them?
>>>
>>>
>>> Related, I'm aware of these virtualization environment tools, any more good ones?
>>> 1. OpenStack, but this is way too complicated and overkill
>>> 2. Proxmox sounds very cool
>>> 3. CloudStack likewise, except it's Java! :-(
>>> 4. Ganeti was interesting but it looks like it may have stalled out around 2016
>>> 5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
>>> 6. https://en.wikipedia.org/wiki/OpenNebula with some Java and might do on-node-shared-storage?
>>> 7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except it's Java
>>>
>>> I'm asking partly for myself, to replace my free but not F/OSS ESXi server at home, and partly for a work lab that my team needs to rebuild in the next few months. We have a mishmash right now, much of it ESXi. We have a lot of hardware lying around, but we have *no budget* for licenses for anything. I know Lee will talk about the VMware starter packs and deals like that, but we not only have no budget, that kind of thing is a nightmare politically and procedurally and is a no-go; it's free or nothing.
>>> And yes, I know that free costs money in terms of people time, but that's already paid for, and while we're already busy, this is something that has to happen.
>>>
>>> Also, we might like to branch out from ESXi anyway... We are doing some work in AWS, but that's not a solution here, though cross-cloud tools like Terraform (and Ansible) are in use, and the more we can use them here too the better.
>
> Thanks,
> JP
> --
> -------------------------------------------------------------------
> JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/
> ___________________________________________________________________________
> Philadelphia Linux Users Group -- http://www.phillylinux.org
> Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
> General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug

--
"Between subtle shading and the absence of light lies the nuance of iqlusion..." - Kryptos

Lee Marzke, lee@marzke.net
http://marzke.net/lee/
IT Consultant, VMware, VCenter, SAN storage, infrastructure, SW CM
+1 800-393-5217 voice/text
+1 484-348-2230 fax

___________________________________________________________________________
Philadelphia Linux Users Group -- http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion -- http://lists.phillylinux.org/mailman/listinfo/plug