Cloud, DevOps, Evangelism

The Best Part of vSphere 5: FDM

vSphere 5 is announced. Long live the king!

Now that we've got that out of the way, I'd like to mention what I think is the single most important new feature, and the one most likely to get overlooked. vSphere 5 brings a whole ton of neato new stuff that plenty of my colleagues will write about (and so will I), but if I had to pick just one, it would be the new Fault Domain Manager.

History of HA

VMware has had an 'HA' feature for years at this point. It worked fine for what it was designed to do (restart VMs on host failure), but had a number of limitations:

  1. It felt very 'hacked on', although VMware did a great job of making it look baked in. Indeed, it was a licensed product from Legato (now owned by EMC) called Automated Availability Manager, or AAM. It had its own separate binaries, log files, and problems. Users had to go digging through arcane log files for its information, its error messages were often vague or misleading, and it seemed relatively unstable from a configuration perspective.
  2. It had a number of limitations around master hosts. The AAM model allowed only 5 master hosts, and VMware didn't expose that in the APIs or vCenter at all. If you lost all 5, you lost HA. Users on blade chassis systems had no control over where those masters landed. You could try to force it by selectively shutting down hosts so that a new master would be elected, or play games with your cluster design to make sure that no more than 4 hosts in a cluster lived in any one chassis, but that's painful.
  3. It also had scaling issues - the design left VMware no way to support larger HA clusters.
  4. It relied exclusively on the network to detect failures, so a network-isolated host and a genuinely crashed host looked the same.

Bring On FDM

Fault Domain Manager is a ground-up rewrite of the VMware HA system. It fixes all of the problems above.

FDM is now built into the base of vSphere ESXi - there are no magic special binaries, nothing additional for vCenter to install when a host joins a cluster, and just one easy log file (fdm.log).

The master election process is now much more interesting. Instead of the 5-master situation, where those masters became a 'single' point of failure, it has been totally redesigned. Now there is one master and many secondaries. The master controls the distribution of information to the secondaries, but is not a single point of failure during a host failure. During a host failure (including that of the master), any of the secondaries can run the recovery process, part of which is electing a new master (which happens very quickly). No more need to worry about how many blades are in a chassis (at least not from an HA master perspective).
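The key property of the election step is that every surviving host can reach the same answer independently. A minimal sketch of that idea (this is an illustration, not VMware's actual election algorithm; the host names and the tie-break rule are assumptions):

```python
def elect_master(live_hosts):
    """Deterministically pick a master from the surviving hosts.

    Because every host applies the same rule to the same membership
    list, all secondaries independently agree on the same new master.
    (Illustrative tie-break only: lowest host name wins.)
    """
    if not live_hosts:
        raise ValueError("no hosts available to elect a master")
    return min(live_hosts)

cluster = {"esx01", "esx02", "esx03", "esx04"}   # hypothetical cluster
master = elect_master(cluster)                    # "esx01"
cluster.discard(master)                           # the master fails...
new_master = elect_master(cluster)                # ...and the rest agree on "esx02"
```

The point is the determinism: no coordinator is needed to run the election, so losing the master never takes HA down with it.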

Scaling - while no changes to the limits have been announced to my knowledge, this new design would certainly allow for larger clusters if needed.

Isolation detection has also been improved. FDM can now distinguish between a host that is fully network-partitioned but still running and a host that has plain old crashed. It does this because all HA-enabled hosts in a cluster now use the datastores as a second channel for heartbeats, allowing a better determination of what actually caused an HA event.
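The logic of combining the two heartbeat channels can be sketched like this (a hedged illustration of the idea only; the function and state names are mine, not VMware's):

```python
def classify_host(network_hb_ok, datastore_hb_ok):
    """Classify a host's state from its two heartbeat channels.

    network_hb_ok:   heartbeats still arriving over the management network
    datastore_hb_ok: heartbeats still being written to shared datastores
    """
    if network_hb_ok:
        # Reachable over the network: nothing to recover.
        return "alive"
    if datastore_hb_ok:
        # Silent on the network but still writing to disk: the host is
        # up but isolated/partitioned, so its VMs are still running.
        return "isolated"
    # Silent on both channels: treat as crashed and restart its VMs.
    return "failed"
```

With network heartbeats alone (the old AAM model), the middle case and the last case are indistinguishable, which is exactly the ambiguity the datastore channel removes.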

And, a positive for many environments: HA no longer uses DNS at all - no dependency on DNS or hosts files!

Lastly, there's a neat little new way to test HA without pulling the plug on your host. Run this:

vsish -e set /reliability/crashMe/Panic 1

and you get an instant PSOD, simulating a crashed host.

I hope you are as excited about FDM as I am. If you want to learn more, Duncan Epping and Frank Denneman have updated their HA deep dive book: buy it here: