JP Vossen on 11 Aug 2018 15:15:31 -0700



Re: [PLUG] Virtualization clusters & shared storage


I'll get to the newer replies but I want to address this one first.

First, thanks again for all the thought and replies, I'm learning a lot.

Second, Lee, we're still not communicating, but I think I figured out why. You are assuming a rational world where you can actually spend money and use services where it makes sense to do that, and where people's time is factored in. Unfortunately, that's not where I work.

Assume instead no rational thought, no budget for anything ever, no external services allowed ever (in this context), and people are already paid for and time doesn't matter. I can use whatever I can scrounge during whatever time I can steal. That's not 100% true or fair, but for this context it's true. (And my immediate boss & team are awesome, the problems are...elsewhere.)

One thing in particular really sums it up:
You said: "Non hyper-converged hosts should not have any local disks - just take them out." But I can't do that; that would be a complete waste of an average of 12TB of raw disk per node! That's why I thought of what I now know is HCI to begin with.

So...

* This is a lab
** Actually it will be 2 labs merged into 1
*** One in VA, one in CA, merging into VA in months
** Performance, especially for disk, is not critical
** We mostly need lots of pretty small (4 CPU, 4G RAM, 500G thin-disk) nodes with snapshot trees (and thus lots of storage, but speed is not critical)
** I'm doing all of this in my work "spare time" because someone has to
** We might want to do some amount of "self-service"
** We have free ESXi silos now, 1+ in one lab, 2+ in the other...I think
* I have zero budget!
* But I can scrounge a *bunch* of old servers, with LOTS of local disk; NOT counting the existing ESXi
** If the servers are not out of support already, they will be soon!
** I don't know how many or what the specs are; at least 3-5 as I talked about, probably a bunch more with even more local disk
** AFAIK, all the servers have 4 port 1G NICs (whatever Dell puts in R7[123]0's)
* I have no idea what network gear is available, but it's not much and it's NOT fiber or 10G or recent...
* I have zero budget, which means no external services like rsync.net or S3
** And if I used one and someone found out I'd probably get fired

So obviously there's a whole lot of project steps I've left out of this discussion, because this is about what I *can* do going forward, *if* I can even talk them into not just doing a bunch more free ESXi silos.

But thanks again for the details and insights!


On 08/09/2018 12:56 PM, Lee H. Marzke wrote:
JP,

Thanks for your further description.  From your use of 2 x FreeNAS I assume this is production and not a lab?

For a lab, you can usually obtain a 4h response time to HW failures,
and many labs might tolerate 6 hours downtime as a rare event.  So you can avoid
the 2nd NAS unit entirely.  In my recent talk I showed that you can easily replicate any FreeNAS
volume to Amazon S3 if it is static, or ZFS send to rsync.net if it contains running VM's.
Most HW failures will not cause loss of the pool, so you're back up and running in a few
hours.  Loss of an entire pool would require full replication back, taking a long time per volume.
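
As a hypothetical example, a nightly replication job could be as simple as a snapshot
plus a zfs send piped over SSH.  This is only a sketch, assuming an rsync.net-style
account that accepts zfs recv; the dataset, account, and remote pool names below are
placeholders, not real values from this thread:

#!/usr/bin/env python3
"""Minimal sketch of an off-site ZFS replication job.  Assumes the remote
account accepts 'zfs recv' over SSH; all names here are placeholders."""
import subprocess
from datetime import datetime

DATASET = "tank/vms"                     # hypothetical local dataset holding the VMs
REMOTE = "user1234@usw-s001.rsync.net"   # hypothetical remote account
REMOTE_DATASET = "data1/lab-replica"     # hypothetical dataset on the remote pool

def replicate():
    snap = f"{DATASET}@repl-{datetime.now():%Y%m%d-%H%M%S}"
    # Take a consistent point-in-time snapshot (recursive, to catch child datasets).
    subprocess.run(["zfs", "snapshot", "-r", snap], check=True)
    # Stream the snapshot to the remote side.  A real job would send
    # incrementals (zfs send -i) after the first full copy.
    send = subprocess.Popen(["zfs", "send", "-R", snap], stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "recv", "-F", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")

if __name__ == "__main__":
    replicate()

Run from cron on the FreeNAS box (or any host with the pool imported); the first full
send is slow, after that incrementals keep the off-site copy reasonably current.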

For production, you would typically use 2 controllers on 1 set of dual-port SCSI disks (HA)
instead of replication to a 2nd unit.  FreeNAS lacks support for HA, so you would typically
use a commercial unit such as TrueNAS or Nexenta.  I also like Tegile and Nimble.

For your discussion of hyper-converged, you're making assumptions that are not even close.

Non hyper-converged hosts should not have any local disks - just take them out.

For FreeNAS, most CPU/RAM is put to use.  All free RAM is used for the ARC read cache, and CPU
is used intermittently for rebuilds and scrubs.

On a hyper-converged solution you typically get much less disk space than you think.
VMware VSAN DOES NOT use parity disks locally.  It is an object store of data (+ cache)
with one or more copies on another host.  Any failure (disk, cache, or complete host)
is handled by getting the data from the 2nd host.  All synchronous writes are ack'ed after
two hosts have committed the write.  I think many of the other hyper-converged solutions
are similar.  vSAN does not enforce keeping storage and VMs on the same host, while others
may do that.  (Datrium, for instance, keeps VMs in a local fast SSD cache on each host using
a custom ESX module, with all writes re-ordered and written sequentially to a central JBOD box.)

With only 3 hosts you are limited to mirroring data, losing 50% of storage.  This is two
storage locations and a witness for each object volume.  Think of a witness as a checksum
of the objects in the volume, so you can still prove a single object is correct by looking at the
checksum on the witness.  The remaining usable storage must also reserve space for snapshots, VM
swap files, etc., so it is only 75% usable at best.

To lose less storage than a mirror you need 4 or more hosts, and a 10Gb network for the replication.
You have a choice of RAID 5 (tolerates loss of one host) or RAID 6 (tolerates loss of
two hosts, min 5 hosts).  Note that any disk or cache failure on a node causes that
entire node to fail, as there is no local RAID or mirroring.  You fix the disk issue, then
rebuild the data on that node when it's back up.

So with 4 hosts and RAID 5 you have data on 3 nodes and parity on 1, so you lose
only 25%.  Obviously using more hosts is beneficial, up to a limit of 32 hosts max.
With disk RAID you increase storage efficiency by adding more disks; with vSAN you add more
hosts + networking + disks.
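
To put rough numbers on that, here is a small illustrative sketch.  The stripe widths
and the 25% slack reserve are assumptions for the example, not vendor-documented figures:

#!/usr/bin/env python3
"""Back-of-the-envelope usable capacity for host-level redundancy.
Stripe widths and the 25% slack reserve are illustrative assumptions."""

def usable_tb(hosts, tb_per_host, data, parity, slack=0.25):
    """Raw capacity reduced by redundancy overhead, then by slack for
    snapshots, VM swap files, and rebuild headroom."""
    efficiency = data / (data + parity)   # share of raw space holding unique data
    return hosts * tb_per_host * efficiency * (1 - slack)

# 3 hosts, mirrored objects (2 copies + witness): lose 50%, then slack
print(usable_tb(hosts=3, tb_per_host=10, data=1, parity=1))   # ~11.2 TB of 30 TB raw
# 4 hosts, RAID-5 style (3 data + 1 parity): lose 25%, then slack
print(usable_tb(hosts=4, tb_per_host=10, data=3, parity=1))   # ~22.5 TB of 40 TB raw
# 6 hosts, RAID-6 style (4 data + 2 parity)
print(usable_tb(hosts=6, tb_per_host=10, data=4, parity=2))   # 30.0 TB of 60 TB raw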

You can also have multiple disk groups per host (HD + SSD), each replicated to a
similar group on other hosts,  so loss of a disk only causes that disk group to
fail, not the entire host.

I think it's very interesting that vSAN is using object storage under the covers.  But
unlike others, it's enabled by just filling in one checkbox, and it's running on your existing ESXi hosts.

Lee


----- Original Message -----
From: "Vossen JP" <jp@jpsdomain.org>
To: "Philadelphia Linux User's Group Discussion List" <plug@lists.phillylinux.org>
Sent: Wednesday, 8 August, 2018 21:23:53
Subject: Re: [PLUG] Virtualization clusters & shared storage

First, thanks to Lee and Andy for the term I didn't know for what I
mean: hyper-converged infrastructure.
	https://en.wikipedia.org/wiki/Hyper-converged_infrastructure

It looks like Proxmox does that:
	https://pve.proxmox.com/wiki/Hyper-converged_Infrastructure

Thanks Keith, I had seen LizardFS but was not aware of the implications.

Doug, Kubernetes is not in play, though it might be in the future.  Or I
may be missing your point.

Lee, thanks for the insights and VMware updates, as always. :-)

I've used FreeNAS in the past and like it, but I'm not sure I'm
explaining my thoughts as well as I'd like.  But let me try this:

(I  wrote this next part before I learned that "hyper-converged" is what
I mean, but I'm leaving it here in case it's still useful.)

Assume for simplicity that I have 5 R710s with 24GB RAM, 8 CPUs and 6x
2TB drives in RAID5 for 10TB local storage each.

3 node cluster with redundant FreeNAS:
	1. VM node1: CPU/RAM used for VMs, 10TB local space wasted
	2. VM node2: CPU/RAM used, 10TB local space wasted
	3. VM node3: CPU/RAM used, 10TB local space wasted
	4. FreeNAS 1: CPU/RAM somewhat wasted, 10TB space available
	5. FreeNAS 2: CPU/RAM somewhat wasted, 0TB space available since it's a
mirror

3 node cluster with -on-node-shared-storage- hyper-converged storage:
	1. VM node1: CPU/RAM used, 20TB local/shared space available +
	2. VM node2: CPU/RAM used, 20TB local/shared space available +
	3. VM node3: CPU/RAM used, 20TB local/shared space available +
+ For 20TB I'm assuming (3-1) * 10TB, for some kind of parity space
loss.  If it were REALLY smart, it would keep the storage for the local VMs
local while still replicating, but that would require a lot more hooks
into the entire system, not just some kind of replicated system.

But the point here is that with my idea I have 2x the disk with 3/5ths
the servers.  Or put another way, I can now do a 5 node cluster with
even more CPU, RAM and space dedicated to actually running VMs, and not
lose 2/5ths of the nodes to just storing the VMs.

That said, I'd thought about the rebuild overhead, but not in depth, and
that--and general "parity" or redundancy however implemented--are
significant.  So my 2/5ths comparisons are not 100% fair.  But still,
the idea apparently does have merit.
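
To put numbers on the comparison, here's a quick sketch of my rough math.  The
(n-1)/n "parity" overhead for the hyper-converged case is just my assumption for
illustration, not a product figure:

#!/usr/bin/env python3
"""Rough comparison of the two layouts above (5 x R710, ~10TB usable each).
The parity overhead for the hyper-converged case is an assumption."""

TB_PER_NODE = 10
TOTAL_NODES = 5

# Layout A: 3 VM nodes (local disk idle) + 2 FreeNAS nodes (one is a mirror)
freenas_vm_nodes = 3
freenas_usable_tb = TB_PER_NODE          # only the primary FreeNAS exports space

# Layout B: all 5 nodes hyper-converged, losing one node's worth to "parity"
hci_vm_nodes = TOTAL_NODES
hci_usable_tb = (TOTAL_NODES - 1) * TB_PER_NODE

print(f"FreeNAS split  : {freenas_vm_nodes} VM nodes, {freenas_usable_tb} TB shared")
print(f"Hyper-converged: {hci_vm_nodes} VM nodes, {hci_usable_tb} TB shared")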


On 08/08/2018 07:50 PM, Lee H. Marzke wrote:

JP, if you want cheap storage for your lab, I think you can't beat FreeNAS or
equiv rackmount solutions from https://www.ixsystems.com. I run my lab on
FreeNAS and a Dell 2950 server with 6x2TB disks and 2 x SSD. If you put storage
into the servers you will find all sorts of edge cases that you hadn't planned
on.

Just taking down a server for a quick RAM swap will cause it to need to rebuild
using lots of network/CPU.  If you have TBs of fast SSD storage on multiple servers and
don't have 10Gb connectivity between hosts, or you have slow HDs, you will have pain.
Generally you try to migrate data off nodes prior to maintenance, which may
take several days.

The VMware solutions have changed a lot, and though they do not meet JP's
needs for *free*, they may fit someone else's needs for a reliable / highly
available solution.  Of course ESXi and vSAN are free for a 60 day trial after
install for lab use.

First there is a VMware Migrate *both* mode where you can migrate both the
Hypervisor and then storage in one go, where the two storage units are not
connected across Hypervisors.  Needless to say this takes a long time to
sequentially move memory, then disk, to the remote server, and it doesn't
help improve HA.

Next, VMware VSAN is catching on really fast and VMware is hiring like mad
to fill new VSAN technical sales roles nationwide.   VSAN uses storage in each
host (minimum of one SSD and one HD) and uses high-performance object
storage on each compute node.  All VM objects are stored on two hosts
minimum, with vSAN taking care of all the distribution.  The hosts must be
linked on a 1Gb (pref 10Gb) private network for back-end communication.
Writes are sent and committed to two nodes before being acknowledged.
You get one big storage pool - and allocate storage to VMs as you like -
with no sub-LUNs or anything else to manage.   If you have 4 or more hosts,
instead of mirroring data over 2 hosts you can do erasure coding (equiv
of RAID 5/6 but with disks spread out across hosts).   So now you're not
losing 50% of your storage, but you have more intensive CPU and network
operations.   The vSAN software is pre-installed into ESX these days - you just
need to activate it and apply a license after the free 60 day trial.

Not sure why you say FreeNAS is wasting CPU in more nodes, as those CPU cycles
would be used locally in the hyper-converged solutions as well (perhaps taking 10%
to 20% of cycles away from a host for storage and replication), so you may need
more / larger hosts in a hyper-converged solution to make up for that.   Remember mirroring
takes little CPU but wastes 50% of your storage; any erasure coding is
much more CPU intensive, and more network intensive.

The other solutions mentioned, except a ZFS server, are likely way too complex
for a lab storage solution.   Is a company really going to give
a lab team 6 months of effort to put together storage that may or may
not perform?  Can you do a business justification to spend dozens of MM of
effort just to save the $20K on an entry level TrueNAS ZFS?

Lee


----- Original Message -----
From: "Vossen JP" <jp@jpsdomain.org>
To: "Philadelphia Linux User's Group Discussion List"
<plug@lists.phillylinux.org>
Sent: Wednesday, 8 August, 2018 17:13:17
Subject: [PLUG] Virtualization clusters & shared storage

I have a question about virtualization cluster solutions.  One thing
that has always bugged me is that VM vMotion/LiveMigration features
require shared storage, which makes sense, but they always seem to
assume that shared storage is external, as in a NAS or SAN.  What would
be REALLY cool is a system that uses the cluster members' "local" storage
as JBOD that becomes the shared storage.  Maybe that's how some of these
solutions work (via Ceph, GlusterFS or ZFS?) and I've missed it, but
that seems to me to be a great solution for the lab & SOHO market.

What I mean is, say I have at least 2 nodes in a cluster, though 3+
would be better.  Each node would have at least 2 partitions, one for
the OS/Hypervisor/whatever and the other for shared & replicated
storage.  The "shared & replicated" partition would be, well, shared &
replicated across the cluster, providing shared storage without needing
an external NAS/SAN.

This is important to me because we have a lot of hardware sitting around
that has a lot of local storage.  It's basically all R710/720/730 with
PERC RAID and 6x or 8x drive bays full of 1TB to 4TB drives.  While I
*can* allocate some nodes for FreeNAS or something, that increases my
required node count and wastes the CPU & RAM in the NAS nodes while also
wasting a ton of local storage on the host nodes.  It would be more
resource efficient to just use the "local" storage that's already
spinning.  The alternative we're using now (that sucks) is that the
hypervisors are all just stand-alone with local storage.  I'd rather get
all the cluster advantages without the NAS/SAN issues
(connectivity/speed, resilience, yet more rack space & boxes).

Are there solutions that work that way and I've just missed it?


Related, I'm aware of these virtualization environment tools, any more
good ones?
1. OpenStack, but this is way too complicated and overkill
2. Proxmox sounds very cool
3. Cloudstack likewise, except it's Java! :-(
4. Ganeti was interesting but it looks like it may have stalled out
around 2016
5. https://en.wikipedia.org/wiki/OVirt except it's Java and too limited
6. https://en.wikipedia.org/wiki/OpenNebula with some Java and might do
on-node-shared-storage?
7. Like AWS: https://en.wikipedia.org/wiki/Eucalyptus_(software) except
it's Java

I'm asking partly for myself to replace my free but not F/OSS ESXi
server at home and partly for a work lab that my team needs to rebuild
in the next few months.  We have a mishmash right now, much of it ESXi.
We have a lot of hardware laying around, but we have *no budget* for
licenses for anything.  I know Lee will talk about the VMware starter
packs and deals like that but we not only have no budget, that kind of
thing is a nightmare politically and procedurally and is a no-go; it's
free or nothing.  And yes I know that free costs money in terms of
people time, but that's already paid for and while we're already busy,
this is something that has to happen.

Also we might like to branch out from ESXi anyway...  We are doing
some work in AWS, but that's not a solution here, though cross cloud
tools like Terraform (and Ansible) are in use and the more we can use
them here too the better.



Later,
JP
--  -------------------------------------------------------------------
JP Vossen, CISSP | http://www.jpsdomain.org/ | http://bashcookbook.com/
___________________________________________________________________________
Philadelphia Linux Users Group         --        http://www.phillylinux.org
Announcements - http://lists.phillylinux.org/mailman/listinfo/plug-announce
General Discussion  --   http://lists.phillylinux.org/mailman/listinfo/plug