Tuesday, March 17, 2009

VMkernel Scheduler

Original Post
Introduction
Details on the ESX Server scheduler are commonly requested when I engage customers and partners. People want to know more about how the scheduler works, when SMP should be used, and what the deal is with SMP co-scheduling. This page will answer these questions and others as they arise in the forum or the discussion portion of this page.

Terminology and Architecture
In VMware parlance, the monitor is the part of our products that provides a virtual interface to the guest operating systems. The VMkernel is the part of our products that manages interactions with the devices, handles memory allocation, and schedules access to the CPU resources, among other things. This is shown in the following figure.



This document will provide information on one part of the VMkernel: the scheduler.

Performance Scaling and the Scheduler
It is a critical requirement for enterprise deployments that an operating system provide fast and fair access to the underlying resources. As a critical part of this design, the scheduler has undergone countless engineer-years of development to guarantee that this requirement is met. We've now released dozens of papers showing linear scaling of workloads as vCPU count is scaled up within a single VM and VM count is scaled up within a single host. Here are a few such papers that contain supporting data.

  1. VM scaling as demonstrated by Oracle databases.
  2. VM and vCPU scaling under IBM DB2 load.
  3. VM and vCPU scaling with SQL Server running in the VM.

The scheduler's ability to scale fairly up to and beyond fully committed CPU resources is no accident. In fact, in a conversation I had with a QA manager I was assured that the VMkernel's scheduler will fairly distribute CPU resources to all VMs at least up to 4x CPU overcommitment. Of course, on a system with the CPU over-committed by 4x, each VM will only run at about 1/4 of native speed, but the scheduler keeps every VM running at that level. Not one at 1/8 speed, one at 1/10 speed, and another at 1/4 speed.

SMP and the Scheduler
As ESX Server supports uniprocessor (UP) and symmetric multiprocessor (SMP) VMs, the fair-and-fast requirement for the scheduler must be upheld in the presence of concurrently executing UP and SMP VMs. Internal testing of this requirement shows fair scheduling even in the presence of concurrently executing 1-way, 2-way, and 4-way VMs.

In fact, the ability to fairly execute under such environments is a very tricky problem for a scheduler. We've run analysis on competitors' products and found that the ability to fairly balance differently-sized VMs is something of which ESX Server alone is capable. Stay tuned in the coming months as we back this claim up with performance data.

Cell Size
One construct that assists the scheduler in optimally placing VMs on a heavily utilized system is the cell. A cell is a logical grouping of a subset of the CPU cores in the system. In the ESX 3 versions the cell size is four. Since cells are statically assigned to physical cores, each four-core processor falls in exactly one cell. When only dual-core processors are present, a cell comprises two sockets. The most important thing to know about cells is the following:

A VM cannot span more than one cell.

This means that four-way VMs run on only one socket at a time in systems with quad-core CPUs. For this case, the number of options presented to the scheduler is equal to the number of sockets. In future versions of ESX we plan to increase the cell size to eight. In some cases (such as systems with hexa-core CPUs) a modification of the cell size can improve performance. See KB article 1007361 for more information.

UP or SMP?
When and if to use SMP is a common question from VMware users. The simple answer is to use SMP only when needed. Why only use SMP when needed? There are three reasons:

  1. SMP schedulers are less efficient than UP schedulers. This can be confirmed with a simple experiment using trivial benchmarks like Netperf or Passmark: on UP systems (either virtual or native) the UP hardware abstraction layer (HAL) will provide marginally better results than the SMP HAL.
  2. Even when unused, idle vCPUs require resources from the VMkernel. Memory is needed to maintain data structures and CPU resources are needed to virtualize the idle system. The amount of work needed to support an idle vCPU varies greatly but is usually in the realm of 1-2% of a single CPU core.
  3. The work required to deliver timer interrupts increases quadratically with the number of vCPUs, and for guests with high timer interrupt rates, like RHEL5, the number of timer interrupts delivered by the VMkernel can be quite high. See the Red Hat Enterprise Linux documentation for more information on this issue.

What About Co-scheduling?

Back in the days of ESX Server 2.5, SMP VMs had to have their vCPUs co-scheduled at the same instant to begin running. Because only 2-way VMs were supported at that time, this meant that two CPU cores had to be available simultaneously to launch a 2-way VM. On a server with a total of only two cores, the VM therefore could not run concurrently with any other process on the server, including the service console, the web interface, or anything else.

This requirement was relaxed in ESX Server 3.0 through a technique called relaxed co-scheduling. Effectively, SMP VMs can have their vCPUs scheduled at slightly different times, and idle vCPUs do not necessarily have to be scheduled concurrently with running vCPUs. More details on this are available in the Co-scheduling SMP VMs in VMware ESX Server page.

NUMA Considerations
Support for non-uniform memory access (NUMA) architectures was introduced in ESX Server 2. This meant that the scheduler became aware that memory access is not uniform across CPUs. Each CPU node has access to its own local memory and to a larger pool of remote memory (which is itself the local memory of the other CPU nodes). Because access to local memory is much faster than access to remote memory, the scheduler favors placing processes on the nodes that hold those processes' memory.

Subsequent generations of ESX Server continued to optimize for the use of NUMA memory. This included placement of vCPUs next to needed memory and startup of VMs at NUMA nodes with resources available for execution. All of this is transparently handled by the scheduler but it should be noted that the newer your version of ESX Server, the better its NUMA scheduling is.

Configure freshly booted ESX and ESXi with PowerShell

These scripts automate much of the drudgery needed to incorporate a fresh ESXi or ESX Classic Server into a Virtual Center DRS cluster. Automation is the backbone of scalability.

Script Download Link

Excellent article about ESX memory

While reading my newsfeeds, I came across an awesome technical post! It's a technical paper, written by fellow blogger Gabrie van Zanten, about the memory usage of your ESX server as viewed from the Service Console. Take your time and read through every bit of it. It is worth it: link.

Nice reading: VDI article

Ruben Spruijt from PQR has written a nice article on VDI, "VDI makes sense!":

English version

I think it's quite a good article, easy to read and understand. He also explains some things about the HP Remote Graphics Software (RGS) and a BladePC.

ESXi ssh and non-root users

Original Post

I’ve never seen this before. I wrote an article about root SSH access to an ESXi system. Today I noticed a blog entry that describes how you can disable root access for SSH and create users that can use “su” to become root! Cool stuff.

Check the article here! Here’s the procedure:

1. Log in to the console.
2. Edit inetd.conf:

vi /etc/inetd.conf

3. Search for the following line (type "/ssh"). This is the line you uncommented to enable SSH in the first place:

ssh stream tcp nowait root /sbin/dropbearmulti dropbear ++min=0,swap,group=shell -i

4. Add -w to the end of this line (type "i" for insert mode):

ssh stream tcp nowait root /sbin/dropbearmulti dropbear ++min=0,swap,group=shell -i -w

5. Exit and save the file (press Escape, then type ":x").
6. Create a /home directory:

mkdir /home

7. Create a new unprivileged user:

useradd your_name

8. Change the password for this user:

passwd your_name

9. Note that any file or sub-directory you create under the / directory will be deleted every time you reboot!

I worked around this as follows:

tar cvf home.tar /home
mv home.tar /opt

Then edit /etc/rc.local and add this line at the bottom:
tar xvf /opt/home.tar -C /

10. Reboot the server

reboot

11. Once rebooted, log in with SSH using your new unprivileged user.
12. Use

su -

to change to the root user.

Tested on:
VMware ESXi 3.5.0_Update_2-103909


ESXi and SSH, what’s next

Original Post

I get a lot of questions about ESXi and SSH. Most people manage to connect to their ESXi but don’t know what to do next because there’s no actual Service Console there. Well the answer is short and simple: vim-cmd.

A couple of examples of things you can do with vim-cmd:

Enter maintenance mode: vim-cmd /hostsvc/maintenance_mode_enter

List all registered VMs: vim-cmd /vmsvc/getallvms

Install VMware Tools in the VM with a given ID: vim-cmd /vmsvc/tools.install [vmid]

Power on a specific VM: vim-cmd /vmsvc/power.on [vmid]
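As a rough illustration of how these fit together, here is a minimal sketch (run in the unsupported console or over SSH) that looks up a VM's ID by name and powers it on. The VM name "myvm" and the parsing of the getallvms output are assumptions, so adjust them for your environment:

# find the ID of the VM named "myvm" in the getallvms listing (name is a placeholder)
VMID=`vim-cmd /vmsvc/getallvms | grep myvm | awk '{print $1}'`

# check its current power state, then power it on
vim-cmd /vmsvc/power.getstate $VMID
vim-cmd /vmsvc/power.on $VMID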

So check out the link above and start trying out this powerful command.

HowTo: connecting using SSH without a password

1. Create your SSH key with the help of PuTTYgen (Windows) or ssh-keygen (Linux)
2. Create a ".ssh" directory on the host
3. Place the keyfile on the host
4. cat keyfile >> authorized_keys (you can also use vi and copy & paste)
5. chmod 0600 on .ssh and authorized_keys
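For the Linux/OpenSSH case, the list above roughly translates into the sketch below, run from your workstation. The host name "esxhost" is a placeholder, and the sketch assumes you are connecting as root with root's home directory on ESXi being / (so the key ends up in /.ssh); if scp is not available on the host, paste the key in with vi instead:

# 1. generate a key pair on your workstation
ssh-keygen -t rsa

# 2. and 3. create the .ssh directory on the host and copy the public key over
ssh root@esxhost "mkdir -p /.ssh"
scp ~/.ssh/id_rsa.pub root@esxhost:/.ssh/keyfile

# 4. and 5. append the key to authorized_keys and tighten the permissions
ssh root@esxhost "cat /.ssh/keyfile >> /.ssh/authorized_keys && chmod 0600 /.ssh /.ssh/authorized_keys"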

HOWTO: ESXi and SSH

By default, SSH access to ESXi isn't possible. But there's a way to get it working; just do the following:
  1. Go to the ESXi console and press alt+F1
  2. Type: unsupported (One thing to note here is that there is no prompt when you press Alt-F1. You just type “unsupported” blindly.)
  3. Enter the root password
  4. At the prompt type “vi /etc/inetd.conf”
  5. Look for the line that starts with “#ssh” (you can search by pressing “/”)
  6. Remove the “#” (press “x” while the cursor is on the character)
  7. Save “/etc/inetd.conf” by typing “:wq!”
  8. Restart the management service “/sbin/services.sh restart”
  9. Just do a kill -HUP `ps | grep inetd`


Done!
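If you'd rather not edit the file interactively, the same change can in principle be made with a couple of commands from the unsupported console. This is just a sketch of an alternative to steps 4 through 9, relying on the busybox tools that should be present; test it before depending on it:

# make a backup, then strip the leading "#" from the ssh line in inetd.conf
cp /etc/inetd.conf /etc/inetd.conf.bak
sed 's/^#ssh/ssh/' /etc/inetd.conf > /tmp/inetd.conf && cp /tmp/inetd.conf /etc/inetd.conf

# restart the management services and make inetd re-read its configuration
/sbin/services.sh restart
kill -HUP `ps | grep [i]netd | awk '{print $1}'`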


Monday, March 16, 2009

How to Provision VMs Using NetApp FlexClones

Original Post

When properly implemented and configured, VMware Virtual Infrastructure can make provisioning new servers a task that takes only minutes. In fact, in my own lab (running equipment that is, admittedly, several years old and woefully underpowered), I can provision new servers running Windows Server 2003 R2 in less than 10 minutes. That’s pretty impressive.

As impressive as those numbers may be (and I’m sure there are readers out there with even more impressive numbers), if we leverage some vendor-specific storage functionality we can achieve some really impressive times. For example, leveraging NetApp FlexClones could allow us to provision new VMs in seconds. Let’s take a quick look at how that’s done.

In this article, I’m going to discuss how to use FlexClones for provisioning new VMs in a VMware VI3 environment. This is not an exhaustive treatise on the subject, but rather an introduction to the process and some of the configuration that needs to take place in your environment. (Disclaimer: Use this stuff at your own risk.)

Configuring ESX Server

First, we need to change the configuration of ESX Server to enable it to see the FlexClones on the SAN. The change we need to make is to enable resignaturing; that is, to enable ESX to recognize an existing VMFS datastore even if it is presented on a different LUN ID than the LUN ID it had when it was created. When a VMFS datastore is created, ESX (or VirtualCenter) places a signature in the datastore that contains the LUN ID (among other information). If this datastore is then presented back out with a LUN ID that doesn’t match the LUN ID in the signature, then it won’t be recognized by ESX Server. Since we’ll be using FlexClones to make identical copies of VMFS datastores (including their signatures) and then present them out as new LUNs (with different LUN IDs than the original), we need to enable resignaturing in order for ESX Server to see the new LUNs.

There are two ways to enable resignaturing:

  • From the command line, type esxcfg-advcfg -s 1 /LVM/EnableResignature (you must be root)
  • From VirtualCenter, select the ESX Server, go to the Configuration tab, select Advanced Settings, choose LVM from the list on the left, and then change the value of LVM.EnableResignature to 1
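For the command-line route, here is a quick sketch of setting the option, verifying that it took, and then rescanning an HBA so the newly presented clone LUNs show up. The adapter name vmhba1 is only an example; use the adapter that connects to your SAN:

esxcfg-advcfg -s 1 /LVM/EnableResignature    # enable resignaturing
esxcfg-advcfg -g /LVM/EnableResignature      # should now report 1
esxcfg-rescan vmhba1                         # rescan the (example) storage adapter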

Once this change is set, ESX will recognize LUNs in FlexClones as "snap-XXXXXXXX-name". You can easily rename them once they have been added to VirtualCenter.

Please note that this process can introduce some oddities in your storage discovery/creation process. Make sure that you have the LUN properly recognized and configured for access by all applicable hosts before you start placing VMs on the LUN.

Creating/Preparing VMs for Cloning

One advantage that VirtualCenter's cloning has over this technique is that the process of preparing a VM for cloning is fully automated: VirtualCenter handles it all behind the scenes, launching SysPrep for Windows guests or using open source software for other guests. All an administrator has to do is make sure that SysPrep is installed properly on VirtualCenter.

In this process, the guest OS preparation has to be done manually, and the placement of VMs onto the VMFS datastores has to be considered. Since we will be making exact copies of the VMFS datastores, all VMs on that datastore will also be copied. If you are sure that one of the cloned VMs will never be started up from the cloned VMFS, then you can leave it alone, but any guest OS that will be started up in the cloned datastore will need to be prepared first. Again, for Windows guests, this means running SysPrep to generate new SIDs and reseal the operating system to factory defaults.

Let’s say you wanted to be able to quickly provision servers running Windows Server 2003 using FlexClones. You’d first need to create a new VM and the accompanying VMDK files, placing it on a VMFS that either a) is empty and will contain only this VM; or b) contains VMs that will never be powered on after they are cloned. You’d then install Windows Server 2003 in that VM, install VMware Tools (not required but highly recommended), install any applicable patches or third-party software packages, and finally run SysPrep to prepare it for cloning. After all those steps have been completed, you can create the FlexClone.

Creating FlexClones on the Storage System

Please note that there is a tremendous amount of additional information pertaining to the use of Snapshots in VMware environments that I have not covered here. I highly recommend TR 3428 from NetApp, which covers this information in detail, including best practices for volume configuration, Snapshot reserve, fractional reserve, etc.

Now, having said all that, and assuming that you’ve followed some of these guidelines, here’s how we go about creating FlexClones on the storage system. (This assumes you’ve built a VM and prepared it for cloning as described in the previous section.)

  1. Logged into the storage system with appropriate permissions, take a snapshot of the FlexVol containing the LUN that has the VMFS datastore you want cloned. You can call this Snapshot something like "clone_base_snapshot" or similar, but be sure to use a name that makes sense to you and helps you understand the purpose of this snapshot. The command to do this would be:
    snap create fvol_master clone_base_snapshot

    This creates a Snapshot of the FlexVol fvol_master named clone_base_snapshot.

  2. Create a FlexClone based on the Snapshot you just created:
    vol clone create fvol_clone1 -b fvol_master clone_base_snapshot

    This creates a new FlexVol named fvol_clone1, which is based on the Snapshot named clone_base_snapshot in the FlexVol fvol_master.

  3. Because this is an exact copy of the original flexible volume, including LUNs and LUN maps, Data ONTAP will spit out some messages about LUNs being taken offline and such. To fix this, unmap the LUN(s) in the new FlexClone and remap them with different LUN IDs:
    lun unmap /vol/fvol_clone1/lun_name igroupname
    lun map /vol/fvol_clone1/lun_name igroupname 3

    Obviously, substitute the appropriate LUN ID for the 3 in the above command line. This remaps the LUN to the specified igroup with a new LUN ID and, assuming you’ve enabled resignaturing, makes the LUN (which is a VMFS datastore) visible to ESX Server and VirtualCenter.

  4. Unless you want Snapshots of the FlexClone, disable scheduled Snapshots on the FlexClone using the snap sched command:
    snap sched fvol_clone1 0

    This disables scheduled Snapshots, but manual Snapshots are still allowed. (To disable all Snapshots, you’d need to set the no_snap volume option.)

At this point, you now have the original VMFS datastore and any virtual machines contained therein (contained in the LUN on the original FlexVol), as well as an exact copy of that VMFS datastore (contained in the LUN on the FlexClone).

Registering the VMs

The VMs (consisting of the VMX, VMXF, NVRAM, and VMDK files) were cloned along with the LUN and the FlexVol, but VMware doesn’t know they are there. In order for the VMs to be usable, we must first register them.

  1. Log into one of the ESX servers as root. You may either SSH in as a normal user and su to root, or login at the console as root.
  2. Use the vmware-cmd utility to register the VMs. Let’s assume that the cloned datastore shows up as "san-lun-clone1" in VirtualCenter, and that a VM called win1 exists on that VMFS datastore. The command to use would look something like this:
    vmware-cmd -s register /vmfs/volumes/san-lun-clone1/win1/win1.vmx
  3. For each VM on the datastore that needs to be recognized by ESX (and has been properly prepared in advance, as noted above), repeat this process. With a little work, it should be fairly easy to write a script that finds all the *.vmx files on a datastore and registers them; a rough sketch follows below. (Anyone care to improve on it?)
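Taking up that challenge, here is a minimal sketch of such a script, run as root on the ESX host. The datastore name follows the san-lun-clone1 example above; adapt the rest to your own layout:

#!/bin/sh
# register every .vmx file found on the cloned datastore (name from the example above)
DATASTORE=/vmfs/volumes/san-lun-clone1

for VMX in `find $DATASTORE -name '*.vmx'`; do
    echo "Registering $VMX"
    vmware-cmd -s register "$VMX"
done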

At this point, you now have the following:

  • The original SAN LUN, with all the VMs stored there
  • A cloned SAN LUN, with all the same data as the original (but occupying far less space than a traditional copy)
  • VMs registered and ready for use from both SAN LUNs

Having already enabled resignaturing, created and prepared the VMs, and taken the base snapshot, you could now easily create additional clones by simply creating the FlexClone and registering the VMs. If you were to have a script that automated that process (perhaps using SSH shared keys or RSH to access the NetApp storage system from ESX), that entire process could be fairly easily automated. I’ll leave that automation as an exercise for enterprising readers.
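As a starting point for that exercise, here is a very rough sketch of what such a wrapper might look like, run from the ESX Service Console and assuming passwordless SSH to the filer has already been set up. Every name in it (the filer host, volumes, igroup, LUN ID, and HBA) is a placeholder taken from the examples above:

#!/bin/sh
FILER=netapp1    # placeholder filer hostname

# create a new clone from the existing base snapshot and remap its LUN with a new ID
ssh root@$FILER "vol clone create fvol_clone2 -b fvol_master clone_base_snapshot"
ssh root@$FILER "lun unmap /vol/fvol_clone2/lun_name igroupname"
ssh root@$FILER "lun map /vol/fvol_clone2/lun_name igroupname 4"
ssh root@$FILER "snap sched fvol_clone2 0"

# rescan so ESX sees the resignatured datastore, then register its VMs
# (using the registration loop sketched in the previous section)
esxcfg-rescan vmhba1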

As a matter of best practice, please note that leaving resignaturing enabled (i.e., leaving the LVM.EnableResignature setting at 1) may lead to problems if LUNs are inadvertently re-signed. For long-term operation, I would advise disabling resignaturing once cloned LUNs have been re-signed and are visible in the VI Client.

In future articles, we’ll take a closer look at the question of "Should I use FlexClones?" instead of "How do I use FlexClones?".

Co-scheduling SMP VMs in VMware ESX3 Server

Original post
Background
VMware ESX Server efficiently manages a mix of uniprocessor and multiprocessor VMs, providing a rich set of controls for specifying both absolute and relative VM execution rates. For general information on cpu scheduling controls and other resource management topics, please see the official VMware Resource Management Guide.

For a multiprocessor VM (also known as an "SMP VM"), it is important to present the guest OS and applications executing within the VM with the illusion that they are running on a dedicated physical multiprocessor. ESX Server faithfully implements this illusion by supporting near-synchronous coscheduling of the virtual CPUs within a single multiprocessor VM.

The term "coscheduling" refers to a technique used in concurrent systems for scheduling related processes to run on different processors at the same time. This approach, alternatively referred to as "gang scheduling", has historically been applied to running high-performance parallel applications, such as scientific computations. VMware ESX Server pioneered a form of coscheduling that is optimized for running SMP VMs efficiently.

Motivation
An operating system generally assumes that all of the processors it manages run at approximately the same rate. This is certainly true in non-virtualized environments, where the OS manages physical processor hardware. However, in a virtualized environment, the processors managed by a guest OS are actually virtual cpu abstractions scheduled by the hypervisor, which time-slices physical processors across multiple VMs.

At any particular point in time, each virtual cpu (VCPU) may be scheduled, descheduled, preempted, or blocked waiting for some event. Without coscheduling, the VCPUs associated with an SMP VM would be scheduled independently, breaking the guest's assumptions regarding uniform progress. We use the term "skew" to refer to the difference in execution rates between two or more VCPUs associated with an SMP VM.

Inter-VCPU skew violates the assumptions of guest software. Non-trivial skew can result in severe performance problems, and may even induce failures when the guest expects inter-VCPU operations to complete quickly. Let's first consider the performance implications of skew. Guest OS kernels typically use spin locks for interprocessor synchronization. If the VCPU currently holding a lock is descheduled, then the other VCPUs in the same VM will waste time busy-waiting until the lock is released. Similar performance problems can also occur in multi-threaded user-mode applications, which may also synchronize using locks or barriers. Unequal VCPU progress will also confuse the guest OS cpu scheduler, which attempts to balance load across VCPUs.

An extreme form of this performance problem may also lead to correctness issues. For example, a guest kernel may perform inter-processor operations, such as TLB shootdowns, that are expected to complete quickly on physical hardware (e.g. several microseconds). The guest OS may timeout if it finds that such operations have not completed after an unreasonably long period of time (e.g. several milliseconds). Without coscheduling, we have observed this behavior in practice for several different guest operating systems, including Windows BSODs, and Linux kernel panics.

Strict coscheduling in ESX Server 2.x
VMware introduced support for running SMP VMs with the release of ESX Server 2 in 2003. ESX Server 2.x implemented coscheduling using an approach based on skew detection and enforcement.

The ESX scheduler maintains a fine-grained cumulative skew value for each VCPU within an SMP VM. A VCPU is considered to be making progress when it is running or idling. A VCPU's skew increases when it is not making progress while at least one of its sibling VCPUs is making progress. A VCPU is considered to be "skewed" if its cumulative skew value exceeds a configurable threshold, typically a few milliseconds.

Once any VCPU is skewed, all of its sibling VCPUs within the same SMP VM are forcibly descheduled ("co-stopped") to prevent additional skew. After a VM has been co-stopped, the next time any VCPU is scheduled, all of its sibling VCPUs must also be scheduled ("co-started"). This approach is called "strict" coscheduling, since all VCPUs must be scheduled simultaneously after skew has been detected.

In some situations, such as when the physical machine has few cores, and is running a mix of UP and SMP VMs, coscheduling may incur "fragmentation" overhead. For example, consider an ESX Server with two physical cores running one dual-VCPU VM and one single-VCPU VM. When the UP VM is running, the scheduler cannot use the remaining physical core to run just one of the SMP VM's two VCPUs. This effect is typically negligible in systems with larger numbers of cores (or with hyperthreading enabled), due to the increased flexibility available when mapping VCPUs to hardware execution contexts.

Note that a VCPU executing in the guest OS idle loop can be descheduled without affecting coscheduling, since the guest OS can't tell the difference. In other words, an idle VCPU does not accumulate skew, and is treated as if it were running for coscheduling purposes. This optimization ensures that idle guest VCPUs don't waste physical processor resources, which can instead be allocated to other VMs. For example, an ESX Server with two physical cores may be running one VCPU each from two different VMs, if their sibling VCPUs are idling, without incurring any coscheduling overhead. Similarly, in the fragmentation example above, if one of the SMP VM's VCPU is idling, then there will be no coscheduling fragmentation, since its sibling VCPU can be scheduled concurrently with the UP VM.

Relaxed coscheduling in ESX Server 3.x
The coscheduling algorithm employed by the ESX scheduler was significantly enhanced with the release of ESX Server 3 in 2006. The basic coscheduling approach is still based on skew detection and enforcement.

However, instead of requiring all VCPUs to be co-started, only those VCPUs that are skewed must be co-started. This ensures that when any VCPU is scheduled, all other VCPUs that are "behind" will also be scheduled, reducing skew. This approach is called "relaxed" coscheduling, since only a subset of a VM's VCPUs must be scheduled simultaneously after skew has been detected.

To be more precise, suppose an SMP VM consists of multiple VCPUs, including VCPUs A, B, and C. Suppose VCPU A is skewed, but VCPUs B and C are not skewed. Since VCPU A is skewed, VCPU B can be scheduled to run only if VCPU A is also co-started. This ensures that the skew between A and B will be reduced. But note that VCPU C need not be co-started to run VCPU B. As an optimization, the ESX scheduler will still try to co-start VCPU C opportunistically, but will not require this as a precondition for running VCPU B.

Relaxed coscheduling significantly reduces the possibility of coscheduling fragmentation, improving overall processor utilization.

Conclusions
ESX Server employs sophisticated cpu scheduling algorithms that enforce rate-based quality-of-service for both uniprocessor and multiprocessor VMs. For multiprocessor VMs, coscheduling techniques ensure that virtual CPUs make uniform progress, faithfully implementing the illusion that the VM is running on dedicated multiprocessor hardware, ensuring efficient execution of guest software. Optimizations such as relaxed coscheduling and descheduling idle VCPUs provide a high-performance execution environment that efficiently utilizes physical host resources.

Appendix: ESX Server coscheduling statistics
ESX Server 3.x exports statistics related to the coscheduling behavior of multiprocessor VMs. The "esxtop" utility can be used to examine these statistics on a live ESX system.

The %CSTP column in the CPU statistics panel shows the fraction of time the VCPUs of a VM spent in the "co-stopped" state, waiting to be "co-started". This gives an indication of the coscheduling overhead incurred by the VM. If this value is low, then any performance problems should be attributed to other issues, and not to the coscheduling of the VM's virtual cpus.
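If you would rather capture %CSTP over time than watch it interactively, esxtop's batch mode can record the counters to a file for later analysis. The delay and iteration values below are only examples:

esxtop -b -d 5 -n 100 > cosched-stats.csv    # sample every 5 seconds, 100 iterations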

VCP-VI 3 Reference Card

A good collection of cards to remember VMware VirtualCenter's properties - VI 3 Reference Card

Friday, March 13, 2009

HowTo ESXi 3.5 Update 2 on a USB memory key

For those like me who would like to check out ESXi 3.5 Update 2 but don’t want to install it on a local hard disk, here’s a good PDF about how to install it on a USB memory key.

In short:

  1. First get the following tools: 7-Zip (free) and WinImage (demo)
  2. Download the ESXi ISO
  3. Open the ISO with 7-Zip
  4. Extract “install.tgz”
  5. Open “install.tgz” with 7-Zip
  6. Click on “install.tar”
  7. Browse to “usr\lib\vmware\installer\”
  8. Open "VMware-VMvisor-big-3.5.0_Update_2-103909.i386.dd.bz2"
  9. Extract “VMware-VMvisor-big-3.5.0_Update_2-103909.i386.dd”
  10. Open WinImage and go to Disk, click on “Restore Virtual Harddisk Image on physical drive”
  11. Select a physical drive
  12. Select “VMware-VMvisor-big-3.5.0_Update_2-103909.i386.dd”
  13. And click “yes” to write the DD image to the USB Disk

Done! For a more detailed procedure check the PDF above; it also includes screenshots!
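For those working from a Linux machine instead of Windows, steps 8 through 13 can be approximated with standard command-line tools in place of 7-Zip and WinImage. This is only a sketch; /dev/sdX is a placeholder for your USB key, and dd will overwrite whatever device you point it at, so double-check it first:

# decompress the disk image extracted from install.tgz
bunzip2 VMware-VMvisor-big-3.5.0_Update_2-103909.i386.dd.bz2

# write the image to the USB key (replace /dev/sdX with the correct device!)
dd if=VMware-VMvisor-big-3.5.0_Update_2-103909.i386.dd of=/dev/sdX bs=1M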

Inside of VMware High Availability


Everybody probably knows the basics of VMware HA, so I’m not going to explain how to set it up or that it uses a heartbeat for monitoring outages or isolation.

However, I do want to explain the different kinds of nodes, their roles, and the isolation response. Here we go…

Primary and Secondary nodes

A VMware HA cluster consists of primary and secondary nodes. Primary nodes hold cluster settings and all "node states", which are synced between primaries. Secondary nodes send their state info (resource occupation) to the primary nodes.

Nodes send a heartbeat to each other, which is the mechanism to detect possible outages. Primary nodes send heartbeats to primary nodes only. Secondary nodes also send their heartbeats only to primary nodes. Nodes send out these heartbeats every second by default. However this is a changeable value: das.failuredetectioninterval. (Advanced Settings on your HA-Cluster)

The first 5 hosts that join the VMware HA cluster are automatically selected as primary nodes. All the others automatically become secondary nodes. When you do a reconfigure for HA, the primary and secondary nodes are selected again; this selection is random. When a primary node fails, one might expect an automatic re-election or promotion to occur. Be aware that this isn’t the case! A re-election/promotion of a single host only occurs when the failed host is either put in “Maintenance Mode” or removed from the cluster!

You are probably wondering what will happen when, for instance, all your primary nodes fail at the same time… No HA restart will take place! This is an unaddressed issue, and it’s the reason why you can only account for 4 host failures within a cluster. Remember: five primaries, four host failures = one primary left!

The fail-over coordinator

You need at least one primary node because the “fail-over coordinator” role is assigned to a primary. The fail-over coordinator coordinates the restart of virtual machines on the remaining primary and secondary hosts, taking restart priorities into account. Keep in mind that when two hosts fail at the same time, it handles the restarts sequentially. In other words, it restarts the VMs of the first failed host (taking restart priorities into account) and then restarts the VMs of the host that failed second (again taking restart priorities into account). If the fail-over coordinator fails, one of the other primaries will take over.

Isolation Response

Now that we are talking about restarts… As of ESX 3.5 U3 / vCenter 3.5 U3, the new default isolation response value is “Leave powered on”. In other words, when a host network isolation occurs, the VMs remain running. The default setting used to be “Power off VM”, which meant a restart of the VMs on a different host.

One might argue about this setting. I prefer “Power off VM” because I don’t want VMs to keep running on an isolated host. But to each his own. Even when the isolation response is set to “Leave powered on”, VMs are still restarted when a host completely dies.

I guess most of you would like to know how HA can tell whether a host is isolated or completely unavailable. It comes down to the VMDK file locks: no other host is able to boot a VM while its files are locked. And yes, the other hosts will try to boot those VMs, because they don’t know whether the “missing” host is isolated or completely dead. So if all network connections on a host are dead, including the vSwitch for the VMs, those VMs will not be restarted on a different host (if they are using FC storage, that is). They will only be restarted if the storage connection also fails for whatever reason and the VMDK file locks time out.

This also means that when using iSCSI you should never ever set this setting to “Leave powered on”, because that will cause a split-brain scenario for sure. (VMs will be restarted on a new host because the lock timed out, while the VMs are also still running on the original host.)

So what happens when a host is completely dead? Yes, you’ve probably guessed it by now: the VMDK file locks time out and the VMs are restarted on another host.