There are various high availability options available but which is best fit for your application is something we as Architect always have to make a decision.
To make this decision, you must understand business requirement.
Case:01 Zero Downtime
It means you cannot take a downtime during
OS Upgrade and maintenance
DB upgrade and maintenance
Application Update and maintenance
So we need to design for fail at OS, DB and Application and virtualization environment even for ESXi host level failure
vSphere HA will protect against ESXi host failure
OS & Application level failure will be protected using in guest clustering e.g. MSCS, Veritas Cluster, FT (only for application which can be scaled horizontally on single vCPU)
Only application level failure can be achieved using vSphere HA for application protection introduced in 5.5.
You can use Symantec Application HA.You can find the list of application which are supported here.
Note:In some environments, Operations team may not want automatic restart of the application in the event of only an application error. Instead, immediate notification and manual intervention may be preferred to determine the root cause of the problem.
Scenario: 01 Suppose vSphere HA event is triggered due to ESXi host failure, which in turn will fail OS, MSCS/Veritas cluster will detect OS is failed. It will move the application to the other node. During this failover services won’t be able. Time is generally within seconds but no data is lost neither user experiences any appreciable outage
Scenario: 02 Suppose OS inside the VM fails, MSCS/Veritas detects it and fails over application to Another Node. During this failover services won’t be able. Time is generally within minutes but no data is lost neither user experiences any appreciable outage but it again clearly ruled by failover time. So to say it is zero down it not a right term
Scenario: 03 Suppose application service restarts, MSCS/Veritas detects tries to re-start it, if re-start fails it moves the application to another Node.During this failover services won’t be able. Time is generally within minutes but no data is lost neither user experiences any appreciable outage but it again clearly ruled by failover time. So to say it is zero down it not a right term
Scenario: 04 Suppose OS/Application needs a maintenance window, simple failover application using MSCS/Veritas to anther node.
In scenario 02,03,04 discussed above, Downtime is time needed to failover services, it can vary from few seconds to few minutes. So here complete protection is done at Hardware, OS and application level. If any of the layers fails, it won’t impact end users. Biggest reason in vSphere people prefer in-guest agent to get rolling upgrades. We will discuss this in more detail below.
Hosting Business Critical Application in Cloud
In-guest clustering is very complicated to configure. This complexity increases further when you wish to host such application inside vCloud director.
As of 5.1 (haven’t seen anything 5.5 yet)
There is no support for clustering inside vCloud Director
There is no support for RDM when using in vCloud Director
So you can configure a cluster using vSphere but the moment vCloud director comes into picture we face technical limitations. Please note in vCloud director VM is created via vCloud portal not via vCenter
Update: One of the experts in vCloud director actually contacted me and explained why vCD doesn’t support RDM and there is no plan for it. Getting RDM inside cloud breaks the principle of portability in case customer wish to move workloads between cloud.
Business critical application which uses oracle for hosting their application bring another challenge with them. Oracle license policy is most inflexible. Hosting oracle database inside cloud means dedicating host to oracle. It is simply not going to meet the economies of scale. Though we can use VM-Host affinity rule to do so but this is not explicitly accepted or denied by Oracle. You need to read lot into the license agreement as mentioned by Michael Webster here
Understanding Oracle Certification, Support and Licensing for VMware Environments white paper published by VMware
Case of In-Guest Clustering (Why?)
Rolling upgrade is the only use case for recommending in-guest cluster agent . During rolling application remains online i.e from application perspective zero down time. I have always asked my customers don’t we have schedule maintenance window? If answer is always Yes, then 9 out of 10 cases I have not recommended using in-guest cluster. As all plan upgrades/changes to the OS, Database can be done during this window.
Over and above following points makes my cases further strong against in-guest cluster
1. We can use Snapshot technology to do upgrade of OS or database which gives us roll back point.
2. We have vSphere 5.0 improved HA functionality which smartly detects host is isolated or gone down.At the max VM comes up in 15 minutes (along with Applications what I refer as “Ready to Serve”) when HA event occurs. So just for 15 minutes downtime (extremely conservative estimate, refer here), I don’t like operations team to carry overhead of configuring in-guest clustering and bring complexity when it gets on virtualized platform.
3. And how many times in a year application has failed and it needs monitoring? If failure rate is almost no, then again this makes case for no in-guest clustering.
The final design choice will be ruled by how much downtime a business is ready to tolerate, and the cost they are willing to invest in the extra resources and skills to install and operate software that provides application monitoring. It is a trade-off.