Keith C. Perry on 11 Aug 2018 15:23:05 -0700



Re: [PLUG] Virtualization clusters & shared storage


JP, not to give you more reading material but along the same lines...  https://docs.lizardfs.com/cookbook/hypervisors.html#using-lizardfs-as-shared-storage-for-proxmoxve

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 
Keith C. Perry, MS E.E. 
Managing Member, DAO Technologies LLC 
(O) +1.215.525.4165 x2033 
(M) +1.215.432.5167 
www.daotechnologies.com

----- Original Message -----
From: "JP Vossen" <jp@jpsdomain.org>
To: plug@lists.phillylinux.org
Sent: Saturday, August 11, 2018 6:15:36 PM
Subject: Re: [PLUG] Virtualization clusters & shared storage

I've been reading the Proxmox docs 
(https://pve.proxmox.com/pve-docs/pve-admin-guide.html), and if I'm 
understanding them, for HCI it requires 3+ nodes and it uses Ceph.  I 
haven't finished reading and I haven't experimented, so I might be wrong.

*If* I am not wrong, the result sounds a lot like what Keith described, 
but I need to keep reading, then experiment.  It sounds like you can 
experiment with nested virtualization, so I'll probably try it on my 
home ESXi, but I do have hardware if I need it.

I have no idea if there is de-dup in that solution.

This might be interesting: 
https://www.reddit.com/r/sysadmin/comments/5uulqm/best_distributed_file_system_glusterfs_vs_ceph_vs/

*If* I get anywhere with this I'll report back, and possibly do a talk 
on it.


On 08/10/2018 05:10 PM, Keith C. Perry wrote:
> In my scenario, since I'm a KVM guy, all 5 of those servers could mount the storage and run VM guests, so I've actually added VM hosts and network throughput.  My apologies for not mentioning that.  I would only run guests on the 3 systems that were not actively handling the metadata, since more CPU and RAM are used there.  The chunkservers and metaloggers are light on CPU.
> 
> I would not say this is complex at all.  I was illustrating how LizardFS could use JP's resources much more efficiently while still being incredibly fault tolerant.  I could go so far as to reduce this to one server and run multiple chunkservers to build out the same redundancy.  Some users do that for single-server storage systems because working with LizardFS is much easier than dealing with RAID, LVM, or even filesystems like ZFS or BTRFS when it comes to recovering or re-balancing data.  Is that overkill or overly complex?  I would say it depends.  I used to think so, but after working with this system for over a year, I see a lot of value in that.  The real point here is that you can scale the system from a single-server deployment up and out to whatever your needs are, usually without taking the system offline.
> 
> As far as dedup...
> 
> https://www.phoronix.com/scan.php?page=news_item&px=LizardFS-2018-Roadmap
> 
> "LizardFS in 2017 achieved ACL support with in-memory deduplication, a new task engine, initial work on a Hadoop plug-in, read-ahead caching, secondary group support, recursive remove, FreeBSD support, and more."
> 
> 
> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
> Keith C. Perry, MS E.E.
> Managing Member, DAO Technologies LLC
> (O) +1.215.525.4165 x2033
> (M) +1.215.432.5167
> www.daotechnologies.com
> 
> ----- Original Message -----
> From: "Lee H. Marzke" <lee@marzke.net>
> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
> Sent: Friday, August 10, 2018 3:31:13 PM
> Subject: Re: [PLUG] Virtualization clusters & shared storage
> 
> Keith,
> 
> LizardFS seems to have similar disk efficiency to VSAN set to FailuresToTolerate=2.
> 
> 60TB total disk,   ~20TB usable,  or perhaps ~40TB with erasure coding.
> 
> Might be great for a lab; however, this is now using 5 physical servers for
> just storage, and no hypervisors.
> 
> Quite a lot of complexity and setup for storage compared to a single dual-head
> TrueNAS box.
> 
> Does this have any de-duplication?
> 
> Note that Tegile has de-duplication on all models, and Nimble has de-dup on
> the all-flash units, which may significantly save space in a lab with many
> repetitive test environments.
> 
> 
> Lee
> 
> 
> ----- Original Message -----
>> From: "Keith C. Perry" <kperry@daotechnologies.com>
>> To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
>> Sent: Friday, 10 August, 2018 14:43:24
>> Subject: Re: [PLUG] Virtualization clusters & shared storage
> 
>> From JP...
>>
>> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x
>> 2TB drives in RAID5 for 10TB local storage each.
>>
>> ...one of your builds...
>>
>> 3 node cluster with redundant FreeNAS:
>>         1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>>         2. VM node2: CPU/RAM used, 10TB local space wasted
>>         3. VM node3: CPU/RAM used, 10TB local space wasted
>>         4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>>         5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a mirror
>>
>> My build with LizardFS with the same hardware...
>>
>> 5-node cluster, with a standard goal of 3 or EC(5,2)
>>         1. LizardFS master + chunkserver 1 (~12TB for data storage) + VM node 1
>>         2. LizardFS shadow master + chunkserver 2 (~12TB for data storage) + VM node 2
>>         3. LizardFS metalogger + chunkserver 3 (~12TB for data storage) + VM node 3
>>         4. LizardFS metalogger + chunkserver 4 (~12TB for data storage) + VM node 4
>>         5. LizardFS metalogger + chunkserver 5 (~12TB for data storage) + VM node 5
>>
>> This deployment would give you around 60TB of raw space.  For argument's sake,
>> if we assume you're using the same goal type for all data, then goal 3 would
>> give you about 20TB of data space and EC(5,2) would give you 42.86TB (a rough
>> arithmetic sketch follows the list below).  In addition, some other common
>> points...
>>
>> 1. I'm ignoring system and metadata requirements because they are negligible.
>> Even if the OS were on a 500GB drive, that would be more than enough space for
>> the OS and for metadata that could manage tens of millions of files.  This is
>> why the master and shadow master can also be chunkservers.  However, it is
>> general practice to use a small RAID-1 for the OS volume for better availability.
>> 2. While using RAID or LVM is possible, it is completely unnecessary with
>> LizardFS.  The system can use single drives directly on a chunkserver, so the
>> only overhead is the filesystem formatting.
>> 3. R710s, if I recall correctly, can have 2 or 4 NIC ports.  I would strongly
>> recommend using bonded NICs so that you have the highest network throughput
>> possible (mode 5 or mode 6 if your NICs can do it, LACP/mode 4 if you have a
>> capable switch).
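>>
>> To make the capacity arithmetic above concrete, here is a small illustrative
>> Python sketch (just my back-of-the-envelope check, not anything LizardFS
>> ships): goal 3 keeps 3 full copies of every chunk, so usable space is raw/3,
>> while EC(5,2) stores 5 data parts plus 2 parity parts, so usable space is
>> raw * 5/7.
>>
>>     # Rough usable-capacity check for the 5-node build above (illustrative only).
>>     raw_tb = 5 * 6 * 2                 # 5 chunkservers x 6 drives x 2TB = 60TB raw
>>
>>     goal3_tb = raw_tb / 3              # goal 3: three full replicas -> ~20TB usable
>>     ec52_tb = raw_tb * 5 / (5 + 2)     # EC(5,2): 5 data + 2 parity -> ~42.86TB usable
>>
>>     print(f"raw: {raw_tb}TB  goal 3: {goal3_tb:.2f}TB  EC(5,2): {ec52_tb:.2f}TB")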
>>
>> This build would give you a completely fault-tolerant and highly available
>> storage system with low storage overhead.  I won't say anything about IOPS
>> because that depends on your workloads, but network bonds help quite a bit
>> structurally since the network is a typical bottleneck.  Specifically, this
>> system can tolerate:
>>
>> 1. Any 2 disks can fail anywhere.  LizardFS automatically load-balances your
>> data across all your disks on all your chunkservers, so whether you use
>> individual disks, LVM, or RAID, the system will do that.  When a disk fails,
>> the system will begin to migrate data to maintain your goals if there are
>> available resources.  If you give disks directly to the system, you would just
>> replace the failed one - no messing with RAID or LVM procedures to replace a
>> failed disk.  LizardFS will re-balance the data again when failed resources are
>> replaced.  There is no downtime or data loss.
>> 2. Any 2 chunkservers can fail.  Extending from the above, any 2 nodes can die
>> completely - even the master server (which holds the metadata).  When the
>> master fails, the shadow master takes over automatically.  There is no downtime
>> or data loss.
>> 3. If both the master and shadow master go down, there are 3 metaloggers which
>> just receive metadata.  The master service can be brought up manually on any of
>> those 3 servers, or they can simply serve as a place from which the real master
>> restores its metadata.  Here there will be downtime, but still no data loss.
>>
>> So, with 5 servers LizardFS can give you a very resilient system which scales
>> up, out, or both, and which can be adjusted easily to facilitate multiple
>> strategies in one storage system.
>>
>>
>> ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
>> Keith C. Perry, MS E.E.
>> Managing Member, DAO Technologies LLC
>> (O) +1.215.525.4165 x2033
>> (M) +1.215.432.5167
>> www.daotechnologies.com
>>
>> ----- Original Message -----
>> From: "JP Vossen" <jp@jpsdomain.org>
>> To: plug@lists.phillylinux.org
>> Sent: Wednesday, August 8, 2018 9:23:53 PM
>> Subject: Re: [PLUG] Virtualization clusters & shared storage
>>
>> First, thanks to Lee and Andy for the term I didn't know for what I
>> mean: hyper-converged infrastructure.
>> 	https://en.wikipedia.org/wiki/Hyper-converged_infrastructure
>>
>> It looks like Proxmox does that:
>> 	https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure
>>
>> Thanks Keith, I had seen LizardFS but was not aware of the implications.
>>
>> Doug, Kubernetes is not in play, though it might be in the future.  Or I
>> may be missing your point.
>>
>> Lee, thanks for the insights and VMware updates, as always. :-)
>>
>> I've used FreeNAS in the past and like it, but I'm not sure I'm
>> explaining my thoughts as well as I'd like.  But let me try this:
>>
>> (I  wrote this next part before I learned that "hyper-converged" is what
>> I mean, but I'm leaving it here in case it's still useful.)
>>
>> Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x
>> 2TB drives in RAID5 for 10TB local storage each.
>>
>> 3 node cluster with redundant FreeNAS:
>> 	1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
>> 	2. VM node2: CPU/RAM used, 10TB local space wasted
>> 	3. VM node3: CPU/RAM used, 10TB local space wasted
>> 	4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
>> 	5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a
>> mirror
>>
>> 3 node cluster with -on-node-shared-storage- hyper-converged storage:
>> 	1. VM node1: CPU/RAM used, 20TB local/shared space available +
>> 	2. VM node2: CPU/RAM used, 20TB local/shared space available +
>> 	3. VM node3: CPU/RAM used, 20TB local/shared space available +
>> + For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space loss (a
>> rough numeric sketch follows below).  If it were REALLY smart, it would keep
>> the storage for the local VMs local while still replicating, but that would
>> require a lot more hooks into the entire system, not just some kind of
>> replicated storage system.
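>>
>> To illustrate the comparison I'm making, here's a quick back-of-the-envelope
>> Python sketch (my own assumptions from above, not any particular product's
>> behavior): the FreeNAS layout leaves ~10TB usable across 5 boxes, while the
>> hyper-converged layout gives roughly (nodes - 1) * 10TB usable across 3 boxes.
>>
>>     # Back-of-the-envelope comparison of the two layouts above (assumptions as stated).
>>     per_node_tb = 10                        # each R710: 6x 2TB in RAID5 -> ~10TB local
>>
>>     # Layout 1: 3 VM nodes (local space unused) + a mirrored FreeNAS pair
>>     freenas_nodes = 5
>>     freenas_usable_tb = per_node_tb         # 10TB shared; the mirror holds the second copy
>>
>>     # Layout 2: 3 hyper-converged VM nodes; assume (N-1) * 10TB after "parity"
>>     hci_nodes = 3
>>     hci_usable_tb = (hci_nodes - 1) * per_node_tb   # ~20TB shared
>>
>>     print(f"FreeNAS layout: {freenas_nodes} nodes, {freenas_usable_tb}TB usable")
>>     print(f"HCI layout:     {hci_nodes} nodes, {hci_usable_tb}TB usable")
>>     # -> roughly 2x the usable space on 3/5ths of the servers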
>>
>> But the point here is that with my idea I have 2x the disk with 3/5ths
>> the servers.  Or put another way, I can now do a 5 node cluster with
>> even more CPU, RAM and space dedicated to actually running VMs, and not
>> lose 2/5ths of the nodes to just storing the VMs.
>>
>> That said, I'd thought about the rebuild overhead, but not in depth, and
>> that--and general "parity" or redundancy however implemented--are
>> significant.  So my 2/5ths comparisons are not 100% fair.  But still,
>> the idea apparently does have merit.
>>
>>
>> On 08/08/2018 07:50 PM, Lee H. Marzke wrote:
>>>
>>> JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or
>>> equivalent rackmount solutions from https://www.ixsystems.com.  I run my lab on
>>> FreeNAS and a Dell 2950 server with 6x 2TB disks and 2x SSD.  If you put storage
>>> into the servers, you will find all sorts of edge cases that you hadn't planned
>>> on.
>>>
>>> Just taking down a server for a quick RAM swap will cause it to need to
>>> rebuild, using lots of network/CPU.  If you have TBs of fast SSD storage on
>>> multiple servers and don't have 10Gb connectivity between hosts, or you have
>>> slow HDs, you will have pain.  Generally you try to migrate data off nodes
>>> prior to maintenance - which may take several days.
>>>
>>> The VMware solutions have changed a lot, and though they do not meet JP's
>>> needs for *free*, they may fit someone else's needs for a reliable / highly
>>> available solution.  Of course, ESXi and VSAN are free for a 60-day trial
>>> after install for lab use.
>>>
>>> First, there is a VMware migrate-*both* mode where you can migrate both the
>>> hypervisor and the storage in one go, even where the two storage units are not
>>> connected across hypervisors.  Needless to say, this takes a long time to
>>> sequentially move memory, then disk, to the remote server, and it doesn't help
>>> improve HA.
>>>
>>> Next, VMware VSAN is catching on really fast, and VMware is hiring like mad
>>> to fill new VSAN technical sales roles nationwide.  VSAN uses storage in each
>>> host (minimum of one SSD and one HD) and provides high-performance object
>>> storage on each compute node.  All VM objects are stored on two hosts
>>> minimum, with vSAN taking care of all the distribution.  The hosts must be
>>> linked on a 1Gb (preferably 10Gb) private network for back-end communication.
>>> Writes are sent and committed to two nodes before being acknowledged.
>>> You get one big storage pool - and allocate storage to VMs as you like -
>>> with no sub-LUNs or anything else to manage.  If you have 4 or more hosts,
>>> instead of mirroring data over 2 hosts you can do erasure coding (equiv
>>> of RAID 5/6, but with disks spread out across hosts).  So now you're not
>>> losing 50% of your storage, but you have more intensive CPU and network
>>> operations.  The vSAN software is pre-installed into ESX these days - just
>>> need to activate it and apply a license after the free 60-day trial.
>>>
>>> Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles
>>> would be used locally in the hyperconverged solutions as well (perhaps taking
>>> 10% to 20% of cycles away from a host for storage and replication), so you may
>>> need more / larger hosts in a hyperconverged solution to make up for that.
>>> Remember, mirroring takes little CPU but wastes 50% of your storage, while
>>> erasure coding is much more CPU intensive and more network intensive.
>>>
>>> The other solutions mentioned, except a ZFS server, are likely way too complex
>>> for a lab storage solution.  Is a company really going to give
>>> a lab team 6 months of effort to put together storage that may or may
>>> not perform?  Can you do a business justification to spend dozens of man-months
>>> of effort just to save the $20K on an entry-level TrueNAS ZFS?
>>>
>>> Lee
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Vossen JP" <jp@jpsdomain.org>
>>>> To: "Philadelphia Linux User's Group Discussion List"
>>>> <plug@lists.phillylinux.org>
>>>> Sent: Wednesday, 8 August, 2018 17:13:17
>>>> Subject: [PLUG] Virtualization clusters & shared storage
>>>
>>>> I have a question about virtualization cluster solutions.  One thing
>>>> that has always bugged me is that VM vMotion/LiveMigration features
>>>> require shared storage, which makes sense, but they always seem to
>>>> assume that shared storage is external, as in a NAS or SAN.  What would
>>>> be REALLY cool is a system that uses the cluster members' "local" storage
>>>> as JBOD that becomes the shared storage.  Maybe that's how some of these
>>>> solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but
>>>> that seems to me to be a great solution for the lab & SOHO market.
>>>>
>>>> What I mean is, say I have at least 2 nodes in a cluster, though 3+
>>>> would be better.  Each node would have at least 2 partitions, one for
>>>> the OS/Hypervisor/whatever and the other for shared & replicated
>>>> storage.  The "shared & replicated" partition would be, well, shared &
>>>> replicated across the cluster, providing shared storage without needing
>>>> an external NAS/SAN.
>>>>
>>>> This is important to me because we have a lot of hardware sitting around
>>>> that has a lot of local storage.  It's basically all R710/720/730 with
>>>> PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives.  While I
>>>> *can* allocate some nodes for FreeNAS or something, that increases my
>>>> required node count and wastes the CPU & RAM in the NAS nodes while also
>>>> wasting a ton of local storage on the host nodes.  It would be more
>>>> resource efficient to just use the "local" storage that's already
>>>> spinning.  The alternative we're using now (that sucks) is that the
>>>> hypervisors are all just stand-alone with local storage.  I'd rather get
>>>> all the cluster advantages without the NAS/SAN issues
>>>> (connectivity/speed, resilience, yet more rack space & boxes).
>>>>
>>>> Are there solutions that work that way that I've just missed?
>>>>
>>>>
>>>> Related, I'm aware of these virtualization environment tools, any more
>>>> good ones?
>>>> 1. OpenStack, but this is way too complicated and overkill
>>>> 2. Proxmox sounds very cool
>>>> 3. Cloudstack likewise, except it's Java! :-(
>>>> 4. Ganeti was interesting but it looks like it may have stalled out
>>>> around 2016
>>>> 5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
>>>> 6. https://en.wikipedia.org/wiki/OpenNebula with some Java and might do
>>>> on-node-shared-storage?
>>>> 7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except
>>>> it's Java
>>>>
>>>> I'm asking partly for myself to replace my free but not F/OSS ESXi
>>>> server at home and partly for a work lab that my team needs to rebuild
>>>> in the next few months.  We have a mishmash right now, much of it ESXi.
>>>> We have a lot of hardware laying around, but we have *no budget* for
>>>> licenses for anything.  I know Lee will talk about the VMware starter
>>>> packs and deals like that but we not only have no budget, that kind of
>>>> thing is a nightmare politically and procedurally and is a no-go; it's
>>>> free or nothing.  And yes I know that free costs money in terms of
>>>> people time, but that's already paid for and while we're already busy,
>>>> this is something that has to happen.
>>>>
>>>> Also we might like to branch out from ESXi anyway...  We are doing some
>>>> work in AWS, but that's not a solution here, though cross-cloud tools
>>>> like Terraform (and Ansible) are in use and the more we can use
>>>> them here too the better.
Later,
JP
--  -------------------------------------------------------------------
JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug