Chapter 10 High Availability and Business Continuity

Once your servers are installed, storage is provisioned, virtual networking is pinging, and virtual machines are running, the time has come to define the strategies that make your virtual infrastructure one that provides high availability and business continuity. The deployment of a virtual infrastructure opens many new doors for disaster-recovery planning. The virtual infrastructure administrator will lead the charge into a new era of ideologies and methodologies for ensuring that business continues as efficiently as possible in the face of corrupted data, failed servers, or even lost data centers.

In this chapter you will learn to:

Cluster virtual machines with Microsoft Cluster Services (MSCS)

Implement and manage VMware High Availability (HA)

Back up virtual machines with VMware Consolidated Backup (VCB)

Restore virtual machines with VMware Consolidated Backup (VCB)

Clustering Virtual Machines

When critical services and applications call for the highest levels of availability, many network administrators turn to Microsoft Cluster Services (MSCS). Microsoft Windows Server 2003 supports network load-balancing clusters and server clusters depending on the version of Windows Server 2003 installed on the server.

Microsoft Clustering

The network load-balancing (NLB) configuration involves an aggregation of servers that balances the requests for applications or services. In a typical NLB cluster, all nodes are active participants in the cluster and are consistently responding to requests for services. NLB clusters are most commonly deployed as a means of providing enhanced performance and availability. NLB clusters are best suited for scenarios involving Internet Information Services (IIS), virtual private networking, and Internet Security and Acceleration (ISA) Server, to name a few. Figure 10.1 details the architecture of an NLB cluster.

Figure 10.1 An NLB cluster can contain up to 32 active nodes that distribute traffic equally across each node. The NLB software allows the nodes to share a common name and IP address that is referenced by clients.


NLB Support from VMware

As of this writing, VMware does not support the use of NLB clusters across virtual machines. This is not to say it cannot be configured, or that it will not work; however, it is not a VMware-supported configuration.

Unlike NLB clusters, server clusters are used solely for the sake of availability. Server clusters do not provide performance enhancements outside of high availability. In a typical server cluster, multiple nodes are configured to be able to own a service or application resource, but only one node owns the resource at a given time. Server clusters are most often used for applications like Microsoft Exchange, Microsoft SQL Server, and DHCP services, which each share a need for a common datastore. The common datastore houses the information accessible by the node that is online and currently owns the resource, as well as the other possible owners that could assume ownership in the event of failure. Each node requires at least two network connections: one for the production network and one for the cluster service heartbeat between nodes. Figure 10.2 details the structure of a server cluster.

The different versions of Windows Server 2003 offer various levels of support for NLB and server clusters. Table 10.1 outlines the cluster support available in each version of Windows Server 2003.

Windows Clustering Storage Architectures

Server clusters built on Windows Server 2003 can only support up to eight nodes when using a fibre channel-switched fabric. Storage architectures that use SCSI disks as direct attached storage or that use a fibre channel-arbitrated loop result in a maximum of only two nodes in a server cluster. Clustering virtual machines in ESX Server utilizes a simulated SCSI shared storage connection and is therefore limited to only two-node clustering. In addition, the clustered virtual machine solution uses only SCSI-2 reservations, not SCSI-3 reservations, and supports only the SCSI miniport drivers, not the STORPort drivers.

Figure 10.2 Server clusters are best suited for applications and services like SQL Server, Exchange Server, DHCP, etc., that use a common data set.


Table 10.1: Windows Server 2003 Clustering Support

Operating System                        | Network Load Balancing (NLB) | Server Cluster
Windows Server 2003 Web Edition         | Yes (up to 32 nodes)         | No
Windows Server 2003 Standard Edition    | Yes (up to 32 nodes)         | No
Windows Server 2003 Enterprise Edition  | Yes (up to 32 nodes)         | Yes (up to 8 nodes in a fibre channel switched fabric)
Windows Server 2003 Datacenter Edition  | Yes (up to 32 nodes)         | Yes (up to 8 nodes in a fibre channel switched fabric)

MSCS, when constructed properly, provides automatic failover of services and applications hosted across multiple cluster nodes. When multiple nodes are configured as a cluster for a service or application resource, only one node owns the resource at any given time. When the current resource owner experiences failure, causing a loss in the heartbeat between the cluster nodes, another node assumes ownership of the resource to allow continued access with minimal data loss. To configure multiple Windows Server 2003 nodes into a Microsoft cluster, the following requirements must be met:

♦ Nodes must be running either Windows Server 2003 Enterprise Edition or Datacenter Edition.

♦ All nodes should have access to the same storage device(s).

♦ All nodes should have two similarly connected and configured network adapters: one for the production network and one for the heartbeat network.

♦ All nodes should have Microsoft Cluster Services installed.

Virtual Machine Clustering Scenarios

The clustering of Windows Server 2003 virtual machines using Microsoft Cluster Services (MSCS) can be done in one of three different configurations:

Cluster-in-a-Box The clustering of two virtual machines that exist in the same ESX Server host.

Cluster-Across-Boxes The clustering of two virtual machines that are running on different ESX Server hosts.

Physical-to-Virtual Clustering The clustering of a physical server and a virtual machine that is running on an ESX Server host.

Clustering has long been considered an advanced technology implemented only by those with strong skills in building and managing high-availability environments. While this reputation might be more rumor than truth, clustering certainly becomes a more complex solution once virtual machines are blended into the deployment.

While you may achieve results setting up clustered virtual machines, you may not receive support for your clustered solution if you violate any of the clustering restrictions put forth by VMware. The following list summarizes the dos and don'ts of clustering virtual machines as published by VMware:

♦ Only 32-bit virtual machines with a boot disk on local storage can be configured as nodes in a server cluster.

♦ Only two-node clustering is allowed.

♦ Virtual machines configured as cluster nodes must use the LSI Logic SCSI adapter and the vmxnet network adapter.

♦ Virtual machines in a clustered configuration cannot be connected to a switch configured with a NIC team.

♦ Virtual machines in a clustered configuration are not valid candidates for VMotion, nor can they be part of a DRS or HA cluster.

♦ ESX Server hosts that run virtual machines that are part of a server cluster cannot be configured to perform a boot from SAN.

♦ ESX Server hosts that run virtual machines that are part of a server cluster cannot have both QLogic and Emulex HBAs.

Cluster-in-a-Box

The cluster-in-a-box scenario involves configuring two virtual machines hosted by the same ESX Server as nodes in a server cluster. The shared disks of the server cluster can exist as .vmdk files stored on local VMFS volumes. Figure 10.3 details the configuration of a cluster-in-a-box scenario.

After reviewing the diagram of a cluster-in-a-box configuration, you might wonder why you would want to deploy such a thing. The truth is, you wouldn't want to deploy a cluster-in-a-box, because it still maintains a single point of failure. With both virtual machines running on the same host, a failure of that host takes down both nodes, and no failover occurs; this architecture contradicts the very reason for creating failover clusters. This setup might, and I use might loosely, be used only to "play" with clustering services or to test clustering services and configurations. But ultimately, even for testing, it is best to use the cluster-across-boxes configuration to get a better understanding of how clustering might be deployed in a production scenario.

Figure 10.3 A cluster-in-a-box configuration does not provide protection against a single point of failure. Therefore, it is not a common or suggested form of deploying Microsoft server clusters in virtual machines.


Cluster-in-a-Box

As suggested in the first part of this chapter, server clusters are deployed for high availability. High availability is not achieved using a cluster-in-a-box, and therefore this configuration should be avoided for any type of critical production applications and services.

Cluster-Across-Boxes

While the cluster-in-a-box scenario is more of an experimental or educational tool for clustering, the cluster-across-boxes configuration provides a solid solution for critical virtual machines with stringent uptime requirements — for example, the enterprise-level servers and services like SQL Server and Exchange Server that are heavily relied on by the bulk of the end-user community. The cluster-across-boxes scenario, as the name implies, draws its high availability from the fact that the two nodes in the cluster are managed on different ESX Server hosts. In the event that one of the hosts fails, the second node of the cluster will assume ownership of the cluster group and its resources, and the service or application will continue responding to client requests.

The cluster-across-boxes configuration requires that virtual machines have access to the same shared storage, which must reside on a storage device external to the ESX Server hosts where the virtual machines run. The virtual hard drives that make up the operating system volume of the cluster nodes can be a standard VMDK implementation; however, the drives used as the shared storage must be set up as a special kind of drive called a raw device mapping. The raw device mapping is a feature that allows a virtual machine to establish direct access to a LUN on a SAN device.

Raw Device Mappings (RDMs)

A raw device mapping (RDM) is not direct access to a LUN, nor is it a normal virtual hard disk file. An RDM is a blend between the two. When adding a new disk to a virtual machine, as you will soon see, the Add Hardware Wizard presents the Raw Device Mappings as an option on the Select a Disk page. This page defines the RDM as having the ability to give a virtual machine direct access to the SAN, thereby allowing SAN management. I know this seems like a contradiction to the opening statement of this sidebar; however, we're getting to the part that oddly enough makes both statements true.

By selecting an RDM for a new disk, you're forced to select a compatibility mode for the RDM. An RDM can be configured in either Physical Compatibility mode or Virtual Compatibility mode. The Physical Compatibility mode option allows the virtual machine to have direct raw LUN access. The Virtual Compatibility mode, however, is the hybrid configuration that allows raw LUN access but only through a VMDK file acting as a proxy. The image shown here details the architecture of using an RDM in Virtual Compatibility mode:

So why choose one over the other if both are ultimately providing raw LUN access? Since the RDM in Virtual Compatibility mode uses a VMDK proxy file, it offers the advantage of allowing snapshots to be taken. By using the Virtual Compatibility mode, you will gain the ability to use snapshots on top of the raw LUN access in addition to any SAN-level snapshot or mirroring software. Or, of course, in the absence of SAN-level software, the VMware snapshot feature can certainly be a valuable tool. The decision to use Physical Compatibility or Virtual Compatibility is predicated solely on the opportunity and/or need to use VMware snapshot technology.
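If you prefer the Service Console over the Add Hardware Wizard, the mapping file for an RDM can also be created with the vmkfstools utility. The following is only a sketch; the device path, datastore, and file names are placeholders that will differ in your environment:

# Virtual Compatibility mode (VMDK proxy file, VMware snapshots allowed)
vmkfstools -r /vmfs/devices/disks/vmhba1:0:3:0 /vmfs/volumes/Storage1/Node1/quorum-rdm.vmdk

# Physical Compatibility mode (raw pass-through, no VMware snapshots)
vmkfstools -z /vmfs/devices/disks/vmhba1:0:3:0 /vmfs/volumes/Storage1/Node1/quorum-rdm.vmdk

The resulting .vmdk mapping file can then be attached to a virtual machine just like any other existing virtual disk.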

A cluster-across-boxes requires a more complex setup than the cluster-in-a-box scenario. When clustering across boxes, communication between the virtual machines and communication from each virtual machine to the shared storage devices must both be configured properly. Figure 10.4 provides details on the setup of a two-node virtual machine cluster-across-boxes using Windows Server 2003 guest operating systems.

Perform the following steps to configure Microsoft Cluster Services across virtual machines on separate ESX Server hosts.

Figure 10.4 A Microsoft cluster built on virtual machines residing on separate ESX hosts requires shared storage access from each virtual machine using a raw device mapping (RDM).


Creating the First Cluster Node

To create the first cluster node, follow these steps:

1. Create a virtual machine that is a member of a Windows Active Directory domain.

2. Right-click the new virtual machine and select the Edit Settings option.

3. Click the Add button and select the Hard Disk option.

4. Select the Raw Device Mappings radio button option, as shown in Figure 10.5, and then click the Next button.

Figure 10.5 Raw device mappings allow virtual machines to have direct LUN access.


5. Select the appropriate target LUN from the list of available targets, as shown in Figure 10.6.

6. Select the datastore location, shown in Figure 10.7, where the VMDK proxy file should be stored, and then click the Next button.

Figure 10.6 The list of available targets includes only the LUNs not formatted as VMFS.


Figure 10.7 By default the VMDK file that points to the LUN is stored in the same location as the existing virtual machine files.


7. Select the Virtual radio button option to allow VMware snapshot functionality for the raw device mapping, as shown in Figure 10.8. Then click Next.

Figure 10.8 The Virtual Compatibility mode enables VMware snapshot functionality for RDMs. The physical mode allows raw LUN access but without VMware snapshots.


8. Select the virtual device node to which the RDM should be connected, as shown in Figure 10.9. Then click Next.

Figure 10.9 The virtual device node for the additional RDMs in a cluster node must be on a different SCSI node.


SCSI nodes for RDMs

RDMs used for shared storage in a Microsoft server cluster must be configured on a SCSI node that is different from the SCSI node to which the hard disk holding the operating system is connected. For example, if the operating system's virtual hard drive is configured to use the SCSI0 node, then the RDM should use the SCSI1 node.
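To illustrate how this looks behind the scenes, here is a hypothetical excerpt from a cluster node's .vmx configuration file after an RDM has been added on a second controller; the file names and values are examples only:

scsi0:0.present = "true"
scsi0:0.fileName = "Node1.vmdk"
scsi1.present = "true"
scsi1.virtualDev = "lsilogic"
scsi1.sharedBus = "virtual"
scsi1:0.present = "true"
scsi1:0.fileName = "quorum-rdm.vmdk"

The operating system disk remains on controller SCSI0, the RDM proxy file sits on SCSI 1:0, and the scsi1.sharedBus entry corresponds to the SCSI Bus Sharing setting you will configure in step 12.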

9. Click the Finish button.

10. Right-click the virtual machine and select the Edit Settings option.

11. Select the new SCSI controller that was added as a result of adding the RDMs on a separate SCSI controller.

12. Select the Virtual radio button option under the SCSI Bus Sharing options, as shown in Figure 10.10.

13. Repeat steps 2 through 9 to configure additional RDMs for shared storage locations needed by nodes of a Microsoft server cluster.

14. Configure the virtual machine with two network adapters. Connect one network adapter to the production network and connect the other network adapter to the network used for heartbeat communications between nodes. Figure 10.11 shows a cluster node with two network adapters configured.

NICs in a Cluster

Because of PCI addressing issues, all RDMs should be added prior to configuring the additional network adapters. If the NICs are configured first, you may be required to revisit the network adapter configuration after the RDMs are added to the cluster node.

Figure 10.10 The SCSI bus sharing for the new SCSI adapter must be set to Virtual to support running a virtual machine as a node in a Microsoft server cluster.


Figure 10.11 A node in a Microsoft server cluster requires at least two network adapters. One adapter must be able to communicate on the production network, and the second adapter is configured for internal cluster heartbeat communication.


15. Power on the first node of the cluster, and assign valid IP addresses to the network adapters configured for the production and heartbeat networks. Then format the additional drives and assign drive letters, as shown in Figure 10.12.

16. Shut down the first cluster node.

17. In the VirtualCenter inventory, select the ESX Server host where the first cluster node is configured and then select the Configuration tab.

18. Select Advanced Settings from the Software menu.

Figure 10.12 The RDMs presented to the first cluster node are formatted and assigned drive letters.


19. In the Advanced Settings dialog box, configure the following options, as shown in Figure 10.13 (a Service Console alternative appears after this list):

♦ Set the Disk.ResetOnFailure option to 1.

♦ Set the Disk.UseLunReset option to 1.

♦ Set the Disk.UseDeviceReset option to 0.
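If you would rather make these changes from the Service Console than from the VI Client, the esxcfg-advcfg utility can set the same values. This is a sketch that assumes the standard option paths for these settings:

esxcfg-advcfg -s 1 /Disk/ResetOnFailure
esxcfg-advcfg -s 1 /Disk/UseLunReset
esxcfg-advcfg -s 0 /Disk/UseDeviceReset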

Figure 10.13 ESX Server hosts with virtual machines configured as cluster nodes require changes to be made to several advanced disk configuration settings.


20. Proceed to the next section to configure the second cluster node and the respective ESX Server host.

Creating the Second Cluster Node

To create the second cluster node, follow these steps:

1. Create a second virtual machine that is a member of the same Active Directory domain as the first cluster node.

2. Add the same RDMs to the second cluster node using the same SCSI node values. For example, if the first node used SCSI 1:0 for the first RDM and SCSI 1:1 for the second RDM, then configure the second node to use the same configuration. As in the first cluster node configuration, add all RDMs to the virtual machine before moving on to step 3 to configure the network adapters. Don't forget to edit the SCSI bus sharing configuration for the new SCSI adapter.

3. Configure the second node with an identical network adapter configuration.

4. Verify that the hard drives corresponding to the RDMs can be seen in Disk Manager. At this point the drives will show a status of Healthy, but drive letters will not be assigned.

5. Power off the second node.

6. Edit the advanced disk settings for the ESX Server host with the second cluster node.

Creating the Management Cluster

To create the management cluster, follow these steps:

1. If you have the authority, create a new user account that belongs to the same Windows Active Directory domain as the two cluster nodes. The account does not need to be granted any special group memberships at this time.

2. Power on the first node of the cluster and log in as a user with administrative credentials.

3. Click Start→Programs→Administrative Tools, and select the Cluster Administrator console.

4. Select the Create new cluster option from the Open Connection to Cluster dialog box, as shown in Figure 10.14. Click OK.

Figure 10.14 The first cluster created is used to manage the nodes of the cluster.


5. Provide a unique name for the name of the cluster, as shown in Figure 10.15. Ensure that it does not match the name of any existing computers on the network.

Figure 10.15 Configuring a Microsoft server cluster is heavily based on domain membership and the cluster name. The name provided to the cluster must be unique within the domain to which it belongs.


6. As shown in Figure 10.16, the next step is to execute the cluster feasibility analysis to check for all cluster-capable resources. Then click Next.

Figure 10.16 The cluster analysis portion of the cluster configuration wizard identifies that all cluster-capable resources are available.


7. Provide an IP address for cluster management. As shown in Figure 10.17, the IP address configured for cluster management should be an IP address that is accessible from the network adapters configured on the production network. Click Next.

Figure 10.17 The IP address provided for cluster management should be unique and accessible from the production network.


Cluster Management

To access and manage a Microsoft cluster, create a Host (A) record in the zone that corresponds to the domain to which the cluster nodes belong.
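If the zone is hosted on a Windows DNS server, the Host (A) record can also be created from the command line with dnscmd. The server name, zone, and IP address below are hypothetical and must be replaced with your own values:

dnscmd dc1.mydomain.com /RecordAdd mydomain.com cluster1 A 172.30.0.200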

8. Provide the account information for the cluster service user account created in step 1 of the “Creating the Management Cluster” section. Note that the Cluster Service Account page of the New Server Cluster Wizard, shown in Figure 10.18, acknowledges that the account specified will be granted membership in the local administrators group on each cluster node. Therefore, do not share the cluster service password with users who should not have administrative capabilities. Click Next.

Figure 10.18 The cluster service account must be a domain account and will be granted local administrator rights on each cluster node.


9. When the cluster creation timeline, shown in Figure 10.19, has completed, click Next.

Figure 10.19 The cluster installation timeline provides a running report of the items configured as part of the installation process.


10. Review the Cluster Administrator snap-in to examine the new management cluster that was created, shown in Figure 10.20.

Figure 10.20 The completion of the initial cluster management creation wizard will result in a Cluster Group and all associated cluster resources.


Adding the Second Node to the Management Cluster

To add the second node to the management cluster, follow these steps:

1. Leave the first node powered on and power on the second node.

2. Right-click the name of the cluster, select the New option, and then click the Node option, as shown in Figure 10.21.

Figure 10.21 Once the management cluster is complete, an additional node can be added.


3. Specify the name of the node to be added to the cluster and then click Next, as shown in Figure 10.22.

Figure 10.22 You can type the name of the second node into the text box or find it using the Browse button.


4. Once the cluster feasibility check has completed (see Figure 10.23), click the Next button.

Feasibility Stall

If the feasibility check stalls and reports a 0x00138f error stating that a cluster resource cannot be found, the installation will continue to run. This is a known issue with the Windows Server 2003 cluster configuration. If you allow the installation to continue, it will eventually complete and function as expected. For more information visit http://support.microsoft.com/kb/909968.

5. Review the Cluster Administrator to verify that two nodes now exist within the new cluster.

Figure 10.23 A feasibility check is executed against each potential node to validate the hardware configuration that supports the appropriate shared resources and network configuration parameters.


At this point the management cluster is complete; from here, application and service clusters can be configured. Some applications like Microsoft SQL Server 2005 and Microsoft Exchange Server 2007 are not only cluster-aware applications but also allow for the creation of a server cluster as part of the standard installation wizard. Other cluster-aware applications and services can be configured into a cluster using the cluster administrator.

Physical-to-Virtual Clustering

The last type of clustering scenario to discuss is the physical-to-virtual clustering configuration. As you may have guessed, this involves building a cluster with two nodes where one node is a physical machine and the other node is a virtual machine. Figure 10.24 details the setup of a two-node physical-to-virtual cluster.

Figure 10.24 Clustering physical machines with virtual machine counterparts can be a cost-effective way of providing high availability.


The constraints surrounding the construction of a physical-to-virtual cluster are identical to those noted in the previous configuration. Likewise, the steps to configure the virtual machine acting as a node in the physical-to-virtual cluster are identical to the steps outlined in the previous section. The virtual machine must have access to all the same storage locations as the physical machine. The virtual machine must also have access to the same pair of networks used by the physical machine for production and heartbeat communication, respectively.

The advantage of implementing a physical-to-virtual cluster is the resulting high availability with reduced financial outlay. Because of the two-node limitation of virtual machine clustering, physical-to-virtual clustering ends up as an N+1 solution, where N physical servers are each clustered with a virtual machine node and the +1 is a single additional physical server running ESX Server that hosts those virtual machine nodes. Each physical-virtual machine cluster forms a failover pair. With the scope of the cluster design limited to failover pairs, the most important design aspect in a physical-to-virtual cluster is the scale of the host running ESX Server. As you may have figured, the more powerful the ESX Server host, the more simultaneous failover incidents it can handle. A more powerful ESX Server will scale better to handle multiple physical host failures, whereas a less powerful ESX Server might only handle a single physical host failure before performance levels experience a noticeable decline.

VMware High Availability (HA)

High availability has been an industry buzzword that has stood the test of time. The need and/or desire for high availability is often a significant component to network infrastructure design. Within the scope of ESX Server, VMware High Availability (HA) is a component of the VI3 Enterprise product that provides for automatic failover of virtual machines. But — and it's a big but at this point in time — HA does not provide high availability in the traditional sense of the term. Commonly, the term HA means automatic failover of a service or application to another server.

Understanding HA

The VMware HA feature provides an automatic restart of the virtual machines that were running on an ESX Server host at the time it became unavailable, shown in Figure 10.25.

Figure 10.25 VMware HA provides an automatic restart of virtual machines that were running on an ESX Server host when it failed.


In the case of VMware HA, there is still a period of downtime when a server fails. Unfortunately, the duration of the downtime is not a value that can be calculated because it is unknown ahead of time how long it will take to boot a series of virtual machines. From this you can gather that, at this point in time, VMware HA does not provide the same level of high availability as found in a Microsoft server cluster solution. When a failover occurs between ESX Server hosts as a result of the HA feature, there is also potential for data loss, because the virtual machines are effectively powered off abruptly when the server fails and are then brought back up minutes later on another server.

HA: Within, but Not Between, Sites

A requisite of HA is that each node in the HA cluster must have access to the same SAN LUNs. This requirement prevents HA from being able to fail over between ESX Server hosts in different locations unless both locations have been configured to have access to the same storage devices. It is not enough for the LUNs merely to contain the same data by way of SAN-replication software: mirroring data from a LUN on a SAN in one location to a LUN on a SAN at a hot site does not satisfy the shared-storage requirement of HA (or of VMotion or DRS).

In the VMware HA scenario, two or more ESX Server hosts are configured in a cluster. Remember, a VMware cluster represents a logical aggregation of CPU and memory resources, as shown in Figure 10.26. By editing the cluster settings, the VMware HA feature can be enabled for a cluster and configured with the number of host failures it must support.

Figure 10.26 A VMware ESX Server cluster logically aggregates the CPU and memory resources from all nodes in the cluster.


When ESX Server hosts are configured into a VMware HA cluster, they receive all the cluster information. VirtualCenter informs each node in the HA cluster about the cluster configuration.

HA and VirtualCenter

While VirtualCenter is most certainly required to enable and manage VMware HA, it is not required to execute HA. VirtualCenter is a tool that notifies each VMware HA-cluster node about the HA configuration. Once the nodes have been updated with the information about the cluster, VirtualCenter no longer maintains a persistent connection with each node. Each node continues to function as a member of the HA cluster independent of its communication status with VirtualCenter.

When an ESX Server host is added to a VMware HA cluster, a set of HA-specific components is installed on the ESX Server. These components, shown in Figure 10.27, include:

♦ Automated Availability Manager (AAM)

♦ VMap

♦ vpxa

Figure 10.27 Adding an ESX Server host to an HA cluster automatically installs the AAM, VMap, and possibly the vpxa components on the host.


The AAM, effectively the engine for HA, is a Legato-based component that keeps an internal database of the other nodes in the cluster. The AAM is responsible for the intracluster heartbeat used to identify available and unavailable nodes. Each node in the cluster establishes a heartbeat with each of the other nodes over the Service Console network. As a best practice, you should provide redundancy to the AAM heartbeat by establishing the Service Console port group on a virtual switch with an underlying NIC team. Though the Service Console could be multihomed and have an AAM heartbeat over two different networks, this configuration is not as reliable as the NIC team. The AAM is extremely sensitive to hostname resolution; the inability to resolve names will most certainly result in an inability to execute HA. When problems arise with HA functionality, look first at hostname resolution. During HA troubleshooting, you should identify the answers to questions like the following (the Service Console checks shown after this list can help):

♦ Is the DNS server configuration correct?

♦ Is the DNS server available?

♦ If DNS is on a remote subnet, is the default gateway correct and functional?

♦ Does the /etc/hosts file have bad entries in it?

♦ Does the /etc/resolv.conf have the right search suffix?

♦ Does the /etc/resolv.conf have the right DNS server?
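Most of these questions can be answered in a few seconds from the Service Console of the suspect host. The hostname and gateway address below are placeholders for your own environment:

cat /etc/resolv.conf
cat /etc/hosts
nslookup silo105.mydomain.com
ping -c 2 172.30.0.1

The first two commands display the DNS search suffix, name server entries, and any static host entries; nslookup confirms that the configured DNS server can resolve the other nodes in the cluster; and the ping confirms that the default gateway is reachable.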

Adding a Host to VirtualCenter

When a new host is added to the VirtualCenter inventory, it must be added by its hostname or HA will not function properly. As just noted, HA is heavily reliant on successful name resolution. ESX Server hosts should not be added to the VirtualCenter inventory using IP addresses.

The AAM on each ESX Server host keeps an internal database of the other hosts belonging to the cluster. All hosts in a cluster are considered either a primary host or a secondary host. However, only one ESX Server in the cluster is considered the primary host at a given time, with all others considered secondary hosts. The primary host functions as the source of information for all new hosts and defaults to the first host added to the cluster. If the primary host experiences failure, the HA cluster will continue to function. In fact, in the event of primary host failure, one of the secondary hosts will move up to the status of primary host. The process of promoting secondary hosts to primary is limited to four hosts; thus, only five hosts could ever assume the role of primary host in an HA cluster.

While the AAM is busy managing the intranode communications, the vpxa service manages the HA components. The vpxa service communicates with the AAM through a third component called the VMap.

Configuring HA

Before we detail how to set up and configure the HA feature, let's review the requirements of HA. To implement HA, all of the following requirements should be met:

♦ All hosts in an HA cluster must have access to the same shared storage locations used by all virtual machines in the cluster. This includes any fibre channel, iSCSI, and NFS datastores used by virtual machines.

♦ All hosts in an HA cluster should have an identical virtual networking configuration. If a new switch is added to one host, the same new switch should be added to all hosts in the cluster.

♦ All hosts in an HA cluster must resolve the other hosts using DNS names.

A Test for HA

An easy and simple test for identifying HA capability for a virtual machine is to perform a VMotion. The requirements of VMotion are actually more stringent than those for performing an HA failover, though some of the requirements are identical. In short, if a virtual machine can successfully perform a VMotion across the hosts in a cluster, then it is safe to assume that HA will be able to power on that virtual machine from any of the hosts. To perform a full test of a virtual machine on a cluster with four nodes, perform a VMotion from node 1 to node 2, node 2 to node 3, node 3 to node 4, and finally node 4 back to node 1. If it works, then you have passed the test!

First and foremost, to configure HA a cluster must be created. Once the cluster is created, you can enable and configure HA. Figure 10.28 shows a cluster enabled for HA.

Figure 10.28 A cluster of ESX Server hosts can be configured with HA and DRS. The features are not mutually exclusive and can work together to provide availability and performance optimization.


Configuring an HA cluster revolves around three different settings:

♦ Host failures allowed

♦ Admission control

♦ Virtual machine options

The configuration option for the number of host failures to allow, shown in Figure 10.29, is a critical setting. It directly influences the number of virtual machines that can run in the cluster before the cluster is in jeopardy of being unable to support an unexpected host failure.

Figure 10.29 The number of host failures allowed dictates the amount of spare capacity that must be retained for use in recovering virtual machines after failure.


Real World Scenario

HA Configuration Failure

It is not uncommon for a host in a cluster to fail during the configuration of HA. Remember the stress we put on DNS earlier in the chapter. Well, if DNS is not set correctly, you will find that the host cannot be configured for HA. Take, for example, a cluster with three nodes being configured as an HA cluster to support two-node failure. Enabling HA forces a configuration of each node in the cluster. The image here shows an HA cluster where one of the nodes, Silo104, has thrown an error related to the HA agent and is unable to complete the HA configuration:

In this example, because the cluster was attempting to allow for two-node failure and only two nodes were successfully configured, that failover level cannot be met. The cluster in this case is now warning that there are insufficient resources to satisfy the HA failover level; naturally, with only two working nodes we cannot cover a two-node failure. The image here shows an error on the cluster due to the failure on Silo104:

In the tasks pane of the graphic, you may have noticed that Silo105 and Silo106 both completed the HA configuration successfully. This provides evidence that the problem is probably isolated to Silo104. Reviewing the Tasks & Events tab to get more detail on the error reveals exactly that. The next image shows that the error was caused by an inability to resolve a name. This confirms the suspicion that the error is with DNS.

Perform the following steps to review or edit the DNS server for an ESX Server:

1. Use the Virtual Infrastructure (VI) Client to connect to a VirtualCenter server.

2. Click the name of the host in the inventory tree on the left.

3. Click the Configuration tab in the details pane on the right.

4. Click DNS and Routing from the Advanced menu.

5. If needed, edit the DNS server, as shown in the image here, to a server that can resolve the other nodes in the HA cluster:

Although they should not be edited on a regular basis, you can also check the /etc/hosts and /etc/resolv.conf files, which should contain static lists of hostnames to IP addresses or the DNS search domain and name servers, respectively. This image offers a quick look at the information inside the /etc/hosts and /etc/resolv.conf files. These tools can be valuable for troubleshooting name resolution.
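For reference, healthy versions of these two files typically look something like the following; the hostnames, addresses, and domain shown here are illustrative only:

# /etc/hosts
127.0.0.1      localhost.localdomain   localhost
172.30.0.104   silo104.mydomain.com    silo104

# /etc/resolv.conf
search mydomain.com
nameserver 172.30.0.2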

Once the DNS server, /etc/hosts, or /etc/resolv.conf has been corrected, the host with the failure can be reconfigured for HA. It's not necessary to remove the HA configuration from the cluster and then re-enable it. The image here shows the right-click context menu of Silo104, where it can be reconfigured for HA now that the name resolution problem has been fixed.

Upon completion of the configuration of the final node, the errors at the host and cluster levels will be removed, the cluster will be configured as desired, and the error regarding the inability to satisfy the failover level will disappear.

To explain the workings of HA and the differences in the configuration settings, let's look at implementation scenarios. For example, consider five ESX Server hosts named Silo101 through Silo105. All five hosts belong to an HA cluster configured to support single-host failure. Each node in the cluster is equally configured with 12GB of RAM. If each node runs eight virtual machines with 1GB of memory allocated to each virtual machine, then 8GB of unused memory across four hosts is needed to support a single-host failure. The 12GB of memory on each host minus 8GB for virtual machines leaves 4GB of memory per host. Figure 10.30 shows our five-node cluster in normal operating mode.

Figure 10.30 A five-node cluster configured to allow single-host failure.


Let's assume that the Service Console and virtual machine overhead consume 1GB of memory, leaving 3GB of spare memory per host. If Silo101 fails, the remaining four hosts each have 3GB of memory to contribute to running the virtual machines orphaned by the failure. The 8GB of orphaned virtual machines can then be powered on across the remaining four hosts, which collectively have 12GB of memory to spare. In this case, the configuration supported the failover. Figure 10.31 shows our five-node cluster down to four hosts after the failure of Silo101. Now assume in this same scenario that Silo101 and Silo102 both experience failure. That leaves 16GB of virtual machines to restart across only three hosts with just 3GB of spare memory each, or 9GB in total. In this case, the cluster is deficient, and not all of the orphaned virtual machines will be restarted.

Figure 10.31 A five-node cluster configured to allow single-host failure is deficient in resources to support a second failed node.


Primary Host Limit

In the previous section introducing the HA feature, we mentioned that the Automated Availability Manager (AAM) caps the number of primary hosts at five. This limitation translates into a maximum of four host failures allowed in a cluster.

The admission control setting goes hand in hand with the Number of host failures allowed setting. There are two possible settings for admission control:

♦ Do not power on virtual machines if they violate availability constraints (known as strict admission control).

♦ Allow virtual machines to be powered on even if they violate availability constraints (known as guaranteed admission control).

In the previous example, virtual machines would not power on when Silo102 experienced failure because by default an HA cluster is configured to use strict admission control. Figure 10.32 shows an HA cluster configured to use the default setting of strict admission control.

Figure 10.32 Strict admission control for an HA cluster prioritizes resource balance and fairness over resource availability.


With strict admission control, the cluster will reach a point at which it will no longer start virtual machines. Figure 10.33 shows a cluster configured for two-node failover. A virtual machine with more than 3GB of memory reserved is powering on, and the resulting error is posted stating that insufficient resources are available to satisfy the configured HA level.

Figure 10.33 Strict admission control imposes a limit at which no more virtual machines can be powered on because the HA level would be jeopardized.


If the admission control setting of the cluster is changed from strict admission control to guaranteed admission control, then virtual machines will power on even in the event that the HA failover level is jeopardized. Figure 10.34 shows a cluster reconfigured to use guaranteed admission control.

Figure 10.34 Guaranteed admission control reflects the idea that when failure occurs, availability is more important than resource fairness and balance.


With that same cluster now configured with guaranteed admission control, the virtual machine with more than 3GB of memory can now successfully power on. In Figure 10.35, the virtual machine has successfully powered on despite the large memory use and lack of available unused resources to achieve the proper HA failover.

Figure 10.35 Guaranteed admission control allows resource consumption beyond the levels required to maintain spare resources for use in the event of a server failure.


Overcommitment in an HA cluster

When the admission control setting is set to allow virtual machines to be powered on even if they violate availability constraints, you could find yourself in a position where more memory is allocated to virtual machines than physically exists. This situation, called overcommitment, can lead to poor performance on virtual machines that are forced to page information from fast RAM out to the slower disk-based swap file.

HA Restart Priority

Not all virtual machines are equal. Some are more important or more critical and require higher priority when ensuring availability. When an ESX Server host fails and the remaining cluster nodes are tasked with bringing virtual machines back online, they have only a finite amount of resources to allocate to the virtual machines that need to be powered on. Rather than leave the important virtual machines to chance, an HA cluster allows for the prioritization of virtual machines. The restart priority options for virtual machines in an HA cluster are Low, Medium, High, and Disabled. For those virtual machines that should be brought up first, the restart priority should be set to High. For those virtual machines that should be brought up if resources are available, the restart priority can be set to Medium or Low. For those virtual machines that will not be missed for a period of time and should not be brought online during the period of reduced resource availability, the restart priority should be set to Disabled. Figure 10.36 shows an example of virtual machines with various restart priorities configured to reflect their importance. The diagram details a configuration where virtual machines like domain controllers, database servers, and cluster nodes are assigned higher restart priority.

Figure 10.36 Restart priorities help minimize the downtime for more important virtual machines.


The restart priority is only put into place for the virtual machines running on the ESX Server host that experienced an unexpected failure. Virtual machines running on hosts that have not failed are not affected by the restart priority. It is possible, then, that virtual machines configured with a restart priority of High may not be powered on by the HA feature due to limited resources, which are in part due to lower-priority virtual machines that continue to run. For example, as shown in Figure 10.37, Silo101 hosts five virtual machines with a priority of High and five other virtual machines with priority values of Medium and Low. Meanwhile, Silo102 and Silo103 each hold ten virtual machines, but of the 20 virtual machines between them, only four are considered of high priority. When Silo101 fails, Silo102 and Silo103 will begin powering on the virtual machines with a high priority. However, assume there were only enough resources to power on four of the five virtual machines with high priority. That leaves a high-priority virtual machine powered off while all other virtual machines of medium and low priority continue to run on Silo102 and Silo103.

Figure 10.37 High-priority virtual machines from a failed ESX Server host may not be powered on because of a lack of resources — resources consumed by virtual machines with a lower priority that are running on the other hosts in an HA cluster.


At this point in the VI3 product suite, you can still manually remedy this imbalance. Any disaster-recovery plan in a virtual environment built on VI3 should include a contingency plan that identifies virtual machines to be powered off to make resources available for those virtual machines with higher priority as a result of the network services they provide. If the budget allows, construct the HA cluster to ensure that there are ample resources to cover the needs of the critical virtual machines, even in times of reduced computing capacity.

HA Isolation Response

Previously, we introduced the Automated Availability Manager (AAM) and its role in conducting the heartbeat that occurs among all the nodes in the HA cluster. The heartbeat among the nodes in the cluster identifies the presence of each node to the other nodes in the cluster. When a heartbeat is no longer presented from a node in the HA cluster, the other cluster nodes spring into action to power on all the virtual machines that the missing node was running.

But what if the node with the missing heartbeat was not really missing? What if the heartbeat was missing but the node was still running? And what if the node with the missing heartbeat is still locking the virtual machine files on the SAN LUN, thereby preventing the other nodes from powering on the virtual machines?

Let's look at two particular examples of a situation VMware refers to as a split-brained HA cluster. Let's assume there are three nodes in an HA cluster: Silo101, Silo102, and Silo103. Each node is configured with a single virtual switch for VMotion, and a second virtual switch consisting of a Service Console port and a virtual machines port group, as shown in Figure 10.38.

Figure 10.38 ESX Server hosts in an HA cluster using a single virtual switch for Service Console and virtual machine communication.


To continue with the example, suppose that an administrator mistakenly unplugs the Silo101 Service Console network cable. When each of the nodes identifies a missing heartbeat from another node, the discovery process begins. After 15 seconds of missing heartbeats, each node then pings an address called the isolation response address. By default this address is the default gateway IP address configured for the Service Console. If the ping attempt receives a reply, the node considers itself valid and continues as normal. If a host does not receive a response, as Silo101 wouldn't, it considers itself in isolation mode. At this point, the node will identify the cluster's Isolation Response configuration, which will guide the host to either power off the existing virtual machines it is hosting or leave them powered on. This isolation response value, shown in Figure 10.39, is set on a per-virtual machine basis. So what should you do? Power off the existing virtual machine? Or leave it powered on?

The answer to this question is highly dependent on the virtual and physical network infrastructures in place. In our example, the Service Console and virtual machines are connected to the same virtual switch bound to a single network adapter. In this case, when the cable for the Service Console was unplugged, communication to the Service Console and every virtual machine on that host was lost. The solution, then, should be to power off the virtual machines. By doing so, the other nodes in the cluster will detect that the file locks have been released and begin to power on the virtual machines that would otherwise have remained unavailable on the isolated host.

Figure 10.39 The Isolation Response identifies the action to occur when an ESX Server host determines it is offline but powered on.


In the next example, we have the same scenario but a different infrastructure, so we don't need to worry about powering off virtual machines in a split-brain situation. Figure 10.40 diagrams a virtual networking architecture in which the Service Console, VMotion, and virtual machines all communicate through individual virtual switches bound to different physical network adapters. In this case, if the network cable connecting the Service Console is removed, the heartbeat will once again be missing; however, the virtual machines will be unaffected since they reside on a different network that is still passing communications between the virtual and physical networks.

Figure 10.40 Redundancy in the physical infrastructure with isolation of virtual machines from the Service Console in the virtual infrastructure provides greater flexibility for isolation response.


Figure 10.41 shows the isolation response setting of Leave powered on, which would accompany an infrastructure built with redundancy at the virtual and physical network levels.

Figure 10.41 The option to leave virtual machines running when a host is isolated should only be set when the virtual and the physical networking infrastructures support high availability.


Real World Scenario

Configuring the Isolation Response Address

In some highly secure virtual environments, Service Console access is limited to a single, nonrouted management network. In some cases, the security plan calls for the elimination of a default gateway on the Service Console port configuration. The idea is to lock the Service Console onto the local subnet, thus preventing any type of remote network access. The disadvantage, as you might have guessed, is that without a default gateway IP address configured for the Service Console, there is no isolation address to ping as a determination of isolation status.

It is possible, however, to customize the isolation response address for scenarios just like this. The address can be any IP address, but it should be one that is always available and will not be removed from the network.

Perform the following steps to define a custom isolation response address:

1. Use the VI Client to connect to a VirtualCenter server.

2. Right-click on an existing cluster and select the Edit Settings option.

3. Click the VMware HA node.

4. Click the Advanced Options button.

5. Type das.isolationaddress in the Option column in the Advanced Options (HA) dialog box.

6. Type the IP address to be used as the isolation response address for ESX Server hosts that miss the AAM heartbeat. The following image shows a sample configuration in which the servers will ping the IP address 172.30.0.2:

7. Click the OK button twice.

This interface can also be configured with the following options (a sample configuration follows this list):

♦ das.isolationaddress1: To specify the first address to try.

♦ das.isolationaddress2: To specify the second address to try.

♦ das.defaultfailoverhost: To identify the preferred host to fail over to.

♦ das.failuredetectiontime: Used to change the amount of time required for failover detection.
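As an illustration, a cluster hardened with redundant isolation addresses and a slightly longer detection window might carry advanced options similar to the following; the values shown are examples rather than recommendations:

Option                       Value
das.isolationaddress1        172.30.0.2
das.isolationaddress2        172.30.0.3
das.failuredetectiontime     20000
das.defaultfailoverhost      silo105.mydomain.com

The das.failuredetectiontime value is expressed in milliseconds, so 20000 extends the default 15-second detection window to 20 seconds.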

To support a redundant HA architecture, it is best to ensure that the Service Console port is sitting atop a NIC team where each physical NIC bound to the virtual switch is connected to a different physical switch.

Backing Up with VMware Consolidated Backup (VCB)

Virtual machines are no less likely than their physical counterparts to lose data, become corrupted, or fail. Though some may argue against that point, it is certainly the safest way to look at virtual machines; without that school of thought, you might jeopardize the infrastructure through overconfidence in virtual machine stability. It's better to be safe than sorry. When it comes to virtual machine backup planning, VMware suggests three different methods you can use to support your disaster recovery/business continuity plan. The three methods are:

♦ Using backup agents inside the virtual machine.

♦ Using VCB to perform virtual machine backups.

♦ Using VCB to perform file-level backups (Windows guests only).

VMware Consolidated Backup (VCB) is VMware's first entry into the backup space. (For those of you who have worked with ESX 2, I am not counting the vmsnap.pl and vmres.pl scripts as attempts to provide a backup product.) VCB is a backup framework that integrates easily with a handful of third-party products. Although VCB can be used on its own, it lacks some of the nice features third-party backup products bring to the table, such as backup cataloging, scheduling capability, and media management. For this reason, I recommend that you use the VCB framework in conjunction with third-party products that have been tested with it.

More than likely, none of the three methods listed will suffice if used alone. As this chapter provides more details about each of the methods, you'll see how a solid backup strategy is based on using several of these methods in a complementary fashion.

Using Backup Agents in a Virtual Machine

Oh so many years ago when virtualization was not even a spot on your IT roadmap, you were backing up your physical servers according to some kind of business need. For most organizations, the solution involved the purchase, installation, configuration, and execution of third-party backup agents on the operating systems running on physical hardware. Now that you have jumped onto the cutting edge of technology by leading the server consolidation charge into a virtual IT infrastructure, you can still back up using the same traditional methods. Virtual machines, like physical machines, are targets for third-party backup tools. The downside to this time-tested model is the need to continue paying for the licenses needed to perform backups across all servers. As shown in Figure 10.42, you'll need a license for every virtual machine you wish to back up: 100 virtual machines = 100 licenses. Some vendors allow for a single ESX Server license that permits an unlimited number of agent licenses to be installed on virtual machines on that host.

In this case, virtualization has not lowered the total cost of ownership, and the return on investment has not changed with regard to what is owed to the third-party backup vendor. So perhaps this is not the best avenue down which you should travel. With that being said, let's look at other options that rely heavily on the virtualized nature of the guest operating system. These options include:

♦ Using VCB for full virtual machine backups.

♦ Using VCB for single VMDK backups.

♦ Using VCB for file-level backups.

Figure 10.42 Using third-party backup agents inside a virtual machine does not take advantage of virtualization. Virtual machines are treated just like their physical counterparts for the sake of a disaster recovery or business continuity plan.

Using VCB for Full Virtual Machine Backups

As we mentioned briefly in the opening section, VCB is a framework for backup that integrates with third-party backup software. It is a series of scripts that performs online, LAN-free backups of virtual machines or virtual machine files.

VCB for Fibre Channel… and iSCSI Too!

When first released, VCB was offered as a fibre channel-only solution; VMware did not support VCB used over an iSCSI storage network. The latest release of VCB offers support for use with iSCSI storage.

The requirements for VCB 1.1 include:

♦ A physical server running Windows Server 2003 Service Pack 1.

♦ If using Windows Server 2003 Standard Edition, the VCB server must be configured not to automatically assign drive letters; this is done by using diskpart to execute the automount disable and automount scrub commands (see the example after this list).

♦ Network connectivity for access to VirtualCenter.

♦ Fibre channel HBA with access to all SAN LUNs where virtual machine files are stored.

♦ Installation of the third-party software prior to installing and configuring VCB.
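The automount requirement noted above for Windows Server 2003 Standard Edition is handled with the diskpart utility on the VCB proxy. A quick sketch of the session looks like this:

C:\> diskpart
DISKPART> automount disable
DISKPART> automount scrub
DISKPART> exit

The automount disable command stops Windows from automatically assigning drive letters to newly discovered volumes (such as the VMFS LUNs presented to the proxy), and automount scrub clears any previously cached mount information.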

VCB on Fibre Channel Without Multipathing

The VCB proxy requires a fibre channel HBA to communicate with fibre channel SAN LUNs regarding backup processes. VCB does not, however, support multiple HBAs or multipathing software like EMC PowerPath. Insert only one fibre channel HBA into a VCB proxy.

Figure 10.43 looks at the VCB components and architecture.

Figure 10.43 A VCB deployment relies on several communication mediums, including network access to VirtualCenter and fibre channel access to all necessary SAN LUNs.


VMware regularly tests third-party support for VCB, and, as a result, the list of supported backup products continues to change. As of this writing, the following products were listed as backup products for which VMware provides integration scripts:

♦ EMC NetWorker

♦ Symantec Backup Exec

♦ Tivoli Storage Manager

♦ Veritas NetBackup

In addition to these four products, VMware lists the following vendors as having created integrations that allow their products to capitalize on the VCB framework:

♦ Vizioncore vRanger Pro, formerly ESXRanger Pro

♦ Computer Associates (CA) BrightStor ARCserve

♦ CommVault Galaxy

♦ EMC Avamar

♦ HP Data Protector versions 5.5 and 6

Be sure to regularly visit VMware's website to download and review the PDF at http://www.vmware.com/pdf/vi_3backupguide.pdf.

Although considered a framework for backup, VCB can actually be used as a backup product. However, it lacks the nice scheduling and graphical interface features of third-party products like Vizioncore vRanger Pro. Two of the more common VCB commands, shown in Figures 10.44 and 10.45, are:

♦ vcbVmName: This command enumerates the various ways a virtual machine can be referenced in the vcbMounter. Here's an example:

vcbVmName -h 172.30.0.120 -u administrator -p Sybex!! -s ipaddr:172.30.0.24

where

 ♦ -h <host>: the VirtualCenter or ESX Server host to connect to

 ♦ -u <username>: the user account used to connect to the host

 ♦ -p <password>: the password for that user account

 ♦ -s ipaddr:<IP address>: the IP address of the virtual machine to identify

♦ vcbMounter: This command performs the backup of a virtual machine or its files (a sample invocation follows this list). Its parameters include:

 ♦ -h <host>: the VirtualCenter or ESX Server host to connect to

 ♦ -u <username>: the user account used to connect to the host

 ♦ -p <password>: the password for that user account

 ♦ -a <virtual machine identifier>: the virtual machine to back up, specified as ipaddr, moref, name, or uuid

 ♦ -t [fullvm | file]: the type of backup to perform, either a full virtual machine backup or a file-level backup

 ♦ -r <backup directory>: the directory on the VCB proxy where the backup will be stored
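
To put the parameters together, here is what a complete full virtual machine backup command might look like. The host, credentials, and virtual machine IP address are carried over from the earlier vcbVmName example; the backup directory is an assumed example path on the VCB proxy:

vcbMounter -h 172.30.0.120 -u administrator -p Sybex!! -a ipaddr:172.30.0.24 -t fullvm -r E:\VCBBackups\Server1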

VCB Proxy Backup Directory

When specifying the VCB backup directory using the -r parameter, do not specify an existing folder. For example, if a backup directory E:\VCBBackups already exists and a new backup should be stored in a subdirectory named Server1, then specify the subdirectory without creating it first. In this case, the -r parameter might read as follows:

-r E:\VCBBackups\Server1

The vcbMounter will create the new directory as needed. If the directory is created first, an error will be thrown at the beginning of the backup process. The error will state that the directory already exists.

Figure 10.44 The vcbVmName command enumerates the various virtual machine identifiers that can be used when running the VCB command.


Figure 10.45 The vcbMounter command can be used to perform full virtual machine backups or file-level backups for Windows virtual machines.


When VCB performs a full backup of a virtual machine, it engages the VMware snapshot functionality to quiesce the file system and perform the backup. Remember that snapshots are not complete copies of data. Instead, a snapshot is the creation of a differencing file (or redo log) to which all changes are written. When the vcbMounter command is used, a snapshot is taken of the virtual machine, as shown in Figure 10.46.

Figure 10.46 The snapshots created by VCB can be viewed in the snapshot manager of a virtual machine.


Any writes that occur during the backup are made to the differencing file. Meanwhile, VCB is busy making a copy of the VMDK, which is read-only for the duration of the backup job. Figure 10.47 details the full virtual machine backup process. Once the backup job is completed, the snapshot is deleted, committing all the writes captured in the differencing file back to the virtual machine disk file.

Figure 10.47 Performing a full virtual machine backup utilizes the VMware snapshot functionality, which ensures that an online backup is correct as of that point in time.


Snapshots and VMFS Locking

Snapshots grow by default in 16MB increments, and for the duration of time it takes to grow a snapshot, a lock is held on the VMFS volume so the respective metadata can be updated to reflect the change in the snapshot. For this reason, do not instantiate snapshots for many virtual machines at once. Although the lock is held only for the update to the metadata, the more virtual machines snapshotting at the same time, the greater the chance of contention on the VMFS metadata. This factor should drive your backup strategy: back up many virtual machines simultaneously only if their files are located on separate VMFS volumes.

Perform the following steps to run a full virtual machine backup using VCB:

1. Log in to the backup proxy where VCB is installed.

2. Open a command prompt and change directories to the C:\Program Files\VMware\VMware Consolidated Backup Framework directory.

3. Use the vcbVmName tool to enumerate virtual machine identifiers. At the command prompt, type:

vcbVmName -h <host> -u <username> -p <password> -s ipaddr:<IP address of the virtual machine>

4. From the results of running the vcbVmName tool, select which identifier to use (moref, name, uuid, or ipaddr) in the vcbMounter command.

5. At the command prompt, type:

vcbMounter -h <host> -u <username> -p <password> -a ipaddr:<IP address of the virtual machine> -t fullvm -r <backup directory>

Once the backup is complete, a list of the files can be reviewed in the directory provided in the backup script. Figure 10.48 shows the files created as part of the completed full backup of a virtual machine named Server1.

Figure 10.48 A full virtual machine backup using VCB creates a directory of files that include a configuration file (VMX), log files, and virtual machine hard drives (VMDK), among others.
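
If the same backup needs to run on a regular schedule without a third-party product, the command can be wrapped in a small batch file on the VCB proxy and scheduled with the Windows Task Scheduler. The following is a minimal sketch that reuses the example host, credentials, and paths from this chapter; the batch file name is an assumption:

@echo off

rem backup-server1.bat -- full virtual machine backup of Server1 using VCB

cd /d "C:\Program Files\VMware\VMware Consolidated Backup Framework"

rem vcbMounter creates the target directory itself and fails if it already exists,

rem so clear out any leftover export from a previous run first

if exist E:\VCBBackups\Server1 rmdir /s /q E:\VCBBackups\Server1

vcbMounter -h 172.30.0.120 -u administrator -p Sybex!! -a ipaddr:172.30.0.24 -t fullvm -r E:\VCBBackups\Server1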


Redundant Paths

Let's look at an example of a VCB backup proxy with a single QLogic fibre channel HBA connected to a single fibre channel switch, which is in turn connected to two storage processors on the storage device. This configuration results in two different paths being available to the VCB server. The image here shows that a VCB server with a single HBA will find each LUN twice because of the redundancy at the storage-processor level:

With the older versions of VCB, this configuration presents a problem that causes every backup attempt to fail. For versions prior to VCB 1.0.3, the LUNs identified in Disk Management as Unknown and Unreadable must be disabled. The option to disable a LUN is located in its properties. The following image displays the General tab of the LUN properties in Disk Management, where a path to a LUN can be disabled. To remove the redundant, unused paths from Computer Management, set the Device Usage drop-down list to Do not use this device (disable).

Disabling the redundant paths does, of course, present a problem when a LUN trespasses to the other storage processor, because the path now required to reach the LUN is likely the one that was disabled.

Using VCB for Single VMDK Backups

Sometimes a full backup is just too much: too much data that hasn't changed, or too much data that is backed up more regularly and isn't needed again. For example, what if just the operating system drive needs to be backed up, not all the user data stored on other virtual machine disk files? A full backup would back up everything. For those situations, VCB provides a means of performing single virtual machine disk backups.

Perform the following steps for a single VMDK backup (a worked example with sample values follows these steps):

1. Log in to the backup proxy where VCB is installed.

2. Open a command prompt and change directories to the C:\Program Files\VMware\VMware Consolidated Backup Framework directory.

3. Use the vcbVmName tool to enumerate virtual machine identifiers. At the command prompt, type:

vcbVmName -h <host> -u <username> -p <password> -s ipaddr:<IP address of the virtual machine>

4. From the results of running the vcbVmName command, note the moref value of the virtual machine.

5. Use the following command to create a snapshot of the virtual machine:

vcbSnapshot -h <host> -u <username> -p <password> -c <virtual machine moref> <snapshot name>

6. Note the snapshot ID (ssid) from the results of step 5.

7. Enumerate the disks within the snapshot using the vcbSnapshot command:

vcbSnapshot -h <host> -u <username> -p <password> -l <virtual machine moref> <ssid>

8. Change to the backup directory of the virtual machine and export the desired VMDK using the vcbExport command:

vcbExport -d <destination file name> -s <source disk name reported in step 7>

9. Remove the snapshot by once again using the vcbSnapshot command:

vcbSnapshot -h <host> -u <username> -p <password> -d <virtual machine moref> <ssid>
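
Pulled together, the sequence might look like the following. The moref (moref:vm-1892), snapshot name, and snapshot ID (ssid:snapshot-57) are hypothetical example values; the actual moref comes from the vcbVmName output in step 4, the actual ssid is printed by the -c command in step 5, and the source disk name passed to vcbExport is reported by the -l command in step 7:

vcbSnapshot -h 172.30.0.120 -u administrator -p Sybex!! -c moref:vm-1892 Server1-vcb-snap

vcbSnapshot -h 172.30.0.120 -u administrator -p Sybex!! -l moref:vm-1892 ssid:snapshot-57

vcbExport -d E:\VCBBackups\Server1\scsi0-0-0-server1.vmdk -s <source disk name reported by the -l command>

vcbSnapshot -h 172.30.0.120 -u administrator -p Sybex!! -d moref:vm-1892 ssid:snapshot-57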

Using VCB for File-Level Backups

For Windows virtual machines, and only for Windows virtual machines, VCB offers file-level backups. A file-level backup is an excellent complement to the full virtual machine or the single VMDK backup discussed in the previous sections. For example, suppose you built a virtual machine using two virtual disks: one for the operating system and one for the custom user data. The operating system's virtual disk will not change often with the exception of the second Tuesday of each month when new patches are released. So that virtual disk does not need consistent and regular backups. On the other hand, the virtual disk that stores user data might be updated quite frequently. To get the best of both worlds and implement an efficient backup strategy, you need to do a single VMDK backup (for the OS) and file-level backup (for the data).

Perform the following steps to conduct a file-level backup using VCB (a sample session follows these steps):

1. Log in to the backup proxy where VCB is installed.

2. Open a command prompt and change directories to C:\Program Files\VMware\VMware Consolidated Backup Framework.

3. Use the vcbVmName tool to enumerate virtual machine identifiers. At the command prompt, type:

vcbVmName -h <host> -u <username> -p <password> -s ipaddr:<IP address of the virtual machine>

4. From the results of running the vcbVmName tool, select which identifier to use (moref, name, uuid, or ipaddr) in the vcbMounter command.

5. As shown in Figure 10.49, type the following at the command prompt:

vcbMounter -h <host> -u <username> -p <password> -a ipaddr:<IP address of the virtual machine> -t file -r <mount point directory>

6. Browse to the mounted directory to back up the required files and folders.

7. After the file- or folder-level backup is complete, use the following command, shown in Figure 10.50, to remove the mount point:

mountvm -u <mount point directory>

8. Exit the command prompt.
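
As a point of reference, a complete file-level session using the example host, credentials, and virtual machine IP address from earlier in the chapter might look like the following; the mount directory E:\VCBMounts\Server1 is an assumed example path:

vcbMounter -h 172.30.0.120 -u administrator -p Sybex!! -a ipaddr:172.30.0.24 -t file -r E:\VCBMounts\Server1

After copying the needed files and folders out of the mounted directories, remove the mount point:

mountvm -u E:\VCBMounts\Server1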

Figure 10.49 A file- or folder-level backup begins with mounting the virtual machine drives as directories under a mount point on the VCB server.


Figure 10.50 After performing a file- or folder-level backup using the vcbMounter command, the mount point must be removed using the mountvm command.


Sticky Snapshots

If a snapshot refuses to delete when the mountvm -u command is issued, it can always be deleted from the snapshot manager user interface, which is accessible through the VI Client.

Real World Scenario

VCB with Third-Party Products

Once you have mastered the VCB framework by understanding the vcbMounter commands and the way that VCB works, working with VCB and third-party products is an easy transition. The third-party products simply call upon the VCB framework to run the vcbMounter command, all the while wrapping the process up nicely inside the GUI of the third-party product. This allows the backups to be scheduled as backup jobs.

Let's look at an example with Symantec Backup Exec 11d. Once the 11d product is installed, followed by the installation of VCB, a set of integration scripts can be extracted from VCB to support the Backup Exec installation. When a backup job is created in Backup Exec 11d, a pre-backup script runs (which calls vcbMounter to create the snapshot and mount the VMDKs), and once the backup job completes, a post-backup script runs to unmount the VMDKs. During the period of time that the VMDKs are mounted into the file system, the Backup Exec product has access to the mounted VMDKs in order to back them up to disk or tape, as specified in the backup job. See the following sample scripts, which perform a full virtual machine backup of a virtual machine with the IP address 192.168.4.1.

First, the pre-backup script example:

"C:\Program Files\VMware\VMware Consolidated Backup Framework\backupexec\pre-backup.bat" Server1_FullVM 192.168.4.1-fullvm

Now, the post-backup script example:

"C:\Program Files\VMware\VMware Consolidated Backup Framework\backupexec\post-backup.bat" Server1_FullVM 192.168.1.10-fullVM

Notice that there is no reference to vcbMounter or the parameters required to run the command. Behind the scenes, the pre-backup.bat and post-backup.bat files read a configuration file named config.js to pull defaults for some of the vcbMounter parameters and then use the information given in the lines shown earlier. When vcbMounter extracts the virtual machine files to the file system of the VCB proxy, the files will be found in a folder named 192.168.4.1-fullvm in a directory specified in the configuration file. A portion of the configuration file is shown here. Note that the file identifies the directory to mount the backups to (F:\\mnt) as well as the VirtualCenter server to connect to (vc01.vlearn.vmw) and the credentials to be used (administrator/Password1):

/*

* Generic configuration file for VMware Consolidated Backup (VCB).

*/

/*

* Directory where all the VM backup jobs are supposed to reside in. 

* For each backup job, a directory with a unique name derived from the

* backup type and the VM name will be created here.

* If omitted, BACKUPROOT defaults to c:\\mnt. 

*

* Make sure this directory exists before attempting any VM backups. 

*/

BACKUPROOT="F:\\mnt";

/*

* URL that is used by "mountvm" to obtain the block list for a

* disk image that is to be mounted on the backup proxy.

*

* Specifying this option is mandatory. There is no default

* value.

*/

HOST="vc01.vlearn.vmw";

/*

* Port for communicating with all the VC SDK services.

* Defaults to 902

*/

// PORT="902";

/*

* Username/password used for authentication against the mountvm server.

* Specifying these options is mandatory.

*/

USERNAME="administrator"; PASSWORD="Password1";

The combination of the configuration file with the parameters passed at the time of execution results in a successful mount and copy of the virtual machine disk files followed by an unmount.

Depending on the size of the virtual machines to be backed up, it might be more feasible to back up to disk and then create a second backup job to take the virtual machine backups to a tape device.

Restoring with VMware Consolidated Backup (VCB)

Restoring data in a virtual environment can take many forms. If using VCB in combination with an approved third-party backup application, there are three specific types of restores that can be defined. These restore types include:

Centralized Restore One backup agent on the VCB proxy.

Decentralized Restore Several backup agents installed around the network, but not every system has one.

Self-Service Restore Each virtual machine contains a backup agent.

Why are we discussing backup agents in the restore section? Remember, the number of backup agents purchased directly influences how, and how quickly, virtual machines can be restored.

No matter how you implement the whole backup/restore process, you must understand that it's either ‘‘pay me now or pay me later.’’ Something that is easier to back up is often more difficult to restore. On the flip side, something that is more difficult to back up is often easier to restore. Figure 10.51 shows the difference between the centralized restore and the self-service restore.

Figure 10.51 Backup agents are not just for backup. They also allow restore capability. The number of backup agents purchased and installed directly affects the recovery plan.


Self-Service Restore Is Always Quicker

If you are looking for a restore solution focused solely on speed of restore and administrative effort, then the self-service restore method is ideal. Of course, the price is a bit heftier than that of its counterparts because an agent is required in each virtual machine. A centralized restore methodology requires two touches on the data to be restored. The first touch gets the data from the backup media to the VCB proxy server, and the second touch gets the data from the VCB proxy server back into the virtual machine. The latter happens via standard Server Message Block (SMB) or Common Internet File System (CIFS) traffic in a Windows environment. This is a literal \\servername\sharename copy of the data back to the virtual machine where the data belongs.
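
To make the second touch concrete, the copy from the VCB proxy back into the virtual machine is nothing more exotic than a standard file copy over a Windows share. A minimal sketch, assuming the restored data sits in E:\Restores\Server1 on the proxy and the target virtual machine exposes its D: drive through the administrative share D$:

xcopy E:\Restores\Server1 \\Server1\D$\Restores /E /I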

Perhaps the best solution is to find a happy medium between the self-service restore and the centralized restore methods. This way you can reduce (though not necessarily minimize) the number of backup agents, while still allowing critical virtual machines to have data restored immediately.

To demonstrate a restore of a full virtual machine backup, let's continue with the earlier examples. At this point, a full backup of Server1 has been created. Figure 10.52 shows that Server1 has now been deleted and is gone.

Restoring a Full Virtual Machine Backup

When bad things happen, such as the deletion or corruption of a virtual machine, a restore from a full virtual machine backup will return the environment to the point in time when the backup was taken.

Figure 10.52 A server from the inventory is missing, and a search through the datastores does not locate the virtual machine disk files.


Perform the following steps to restore a virtual machine from a full virtual machine backup (a sample vcbRestore command follows these steps):

1. Connect to the VCB proxy and use FastSCP or WinSCP to establish a secure copy protocol session with the ESX Server host. As shown in Figure 10.53, the data from the E:\VCBBackups\Server1 folder can be copied into a temporary directory in the Service Console. The temporary directory houses all of the virtual machine files from the backup of the original virtual machine.

Figure 10.53 The FastSCP utility, as the name proclaims, offers a fast, secure copy protocol application to move files back and forth between Windows and ESX.


2. Once the copy to the temporary location is complete, verify the existence of the files, as shown in Figure 10.54. Use Putty.exe to connect to the Service Console, navigate to the temporary directory where the backup files are stored, and then use the ls command to list all the files in the directory.

Figure 10.54 Virtual machine files needed for the restore are located in the temporary directory specified in the command.

3. From a command prompt, type the following:

vcbRestore -h 172.30.0.120 -u administrator -p Sybex123 -s <path to the temporary directory holding the backup files>

Figure 10.55 shows the virtual machine restore process.

4. Once the restore from the temporary location is complete, verify the existence of the files by browsing the datastore or by glancing at the inventory tree in VirtualCenter.

Figure 10.55 Virtual machine restore times are highly dependent on the size of the virtual machine and the number of writes to it.
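
Using the values from this chapter's examples, the vcbRestore command issued in the Service Console session opened in step 2 might look like the following, where /tmp/vcbrestore/Server1 is an assumed temporary directory holding the copied backup files:

vcbRestore -h 172.30.0.120 -u administrator -p Sybex123 -s /tmp/vcbrestore/Server1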

Restoring a Single File from a Full Virtual Machine Backup

Problems in the datacenter are not always as catastrophic as losing an entire virtual machine because of corruption or deletion. In fact, it is probably more common to experience minor issues like corrupted or deleted files. A full virtual machine backup does not have to be restored as a full virtual machine. Using the mountvm tool, it is possible to mount a virtual machine hard drive into the file system of the backup proxy (VCB) server. Once the hard drive is mounted, it can be browsed the same as any other directory on the server.

Putting Files into a VMDK

Files cannot be put directly into a VMDK. Restoring files directly to a virtual machine requires a backup agent installed on the virtual machine.

Let's say that a virtual machine named Server1 has a full backup that has been completed. An administrator deletes a file named FILE TO RECOVER.txt that was on the desktop of his/her profile on Server1 and now needs to recover the file. (No, it's not in the Recycle Bin anymore.) Using the mountvm command, the VMDK backup of Server1 can be mounted into the file system of the VCB proxy server and the file can be recovered. Figure 10.56 shows the mountvm command used to mount a backup VMDK named scsi0-0-0-server1.vmdk into a mount point named server1_restore_dir off the root of the E drive. Figure 10.56 also shows the Windows Explorer application drilled into the mounted VMDK.

Figure 10.56 The mountvm command allows a VMDK backup file to be mounted into the VCB proxy server file system, where it can be browsed in search of files or folders to be recovered.
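
A rendering of the command shown in Figure 10.56 might look like the following; the mount point matches the figure's description, while the location of the backup VMDK under E:\VCBBackups\Server1 is an assumed example path:

mountvm -d E:\VCBBackups\Server1\scsi0-0-0-server1.vmdk -cycleId E:\server1_restore_dir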


Perform the following steps to conduct a single file restore from a full virtual machine backup:

1. Log in to the VCB proxy and navigate to the directory holding the backup files for the virtual machine that includes the missing file.

2. Browse the backup directory and note the name of the VMDK to mount to the VCB file system.

3. Open a command prompt and change to the C:\Program Files\VMware\VMware Consolidated Backup Framework directory.

4. Type the following command:

mountvm -d <path to the backup VMDK> -cycleId <mount point directory>

5. Browse the file system of the VCB proxy server to find the new mount point. The new mount point will contain a subdirectory named Letters followed by a directory for the drive letter of the VMDK that has been mounted. These directories can now be browsed and manipulated as needed to recover the missing file.

6. Once the file or folder recovery is complete, type the following command:

mountvm -u <mount point directory>

7. Close the command prompt window.

Real World Scenario

User Data Backups in Windows

Although VCB offers the functionality to mount the virtual machine hard drives for file- or folder-level recovery, I recommend that you back up your custom user data drives and directories on a more regular basis than a full virtual machine backup. In addition to the methods discussed here, like the file- and folder-level backups with VCB or third-party backup software like Vizioncore vRanger Pro, there are also tools like Shadow Copies of Shared Folders that are native to the Windows operating system.

Shadow Copies of Shared Folders builds on the Volume Shadow Copy Service available in Windows Server 2003 and later. It offers scheduled online copies of changes to files that reside in shared folders. The frequency of the schedule determines the number of previous versions that will exist, up to the maximum of 64. The value in complementing a VCB and Vizioncore backup strategy with shadow copies is the ease of restore. Ideally, with shadow copies enabled, users can be trained to recover deletions and corruptions without involving the IT staff. Only when the previous version is no longer in the list of available restores will the IT staff need to get involved with a single file restore. And for the enterprise-level shadow copy deployment, Microsoft has recently released System Center Data Protection Manager (SCDPM). SCDPM is shadow copies on steroids, providing frequent online backups of files and folders across the entire network.
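
Shadow copies are normally scheduled from the Shadow Copies tab of a volume's properties, but on Windows Server 2003 they can also be created and listed on demand with the vssadmin utility. A quick sketch for a data volume D: inside the virtual machine:

vssadmin Create Shadow /For=D:

vssadmin List Shadows /For=D: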

For more information on Shadow Copies of Shared Folders, visit Microsoft's website at http://www.microsoft.com/windowsserver2003/techinfo/overview/scr.mspx.

For more information on System Center Data Protection Manager, visit Microsoft's website at http://www.microsoft.com/systemcenter/dpm/default.mspx.

Restoring VCB Backups with VMware Converter Enterprise

Perhaps one of the best new features of VirtualCenter 2.5 is the integration of VMware Converter Enterprise. Better still, VMware extended the functionality of VMware Converter to allow it to restore backups that were made using VMware Consolidated Backup. Figure 10.57 shows the VMware Converter Import Wizard option.

Figure 10.58 shows the files that are part of the VCB backup, including the required VMX file that must be referenced. During the import process, shown in Figure 10.59, you will need to provide the UNC path to the VMX file for the virtual machine to be restored.

Figure 10.57 The VMware Converter Import Wizard greatly simplifies the procedure for restoring VCB backups.


Figure 10.58 The VCB backup files include a VMX file with all the data about the virtual machine.


Figure 10.59 The VMware Converter Import Wizard requires a UNC path that references the VMX file of the virtual machine to be restored.


The examples shown in the previous two figures show the configuration for a backup server named DR1 with a folder that has been shared as MNT. Therefore the appropriate path for the VMX file of the virtual machine to be restored would be \\DR1\MNT\192.168.168.8-fullVM\VAC-DC3.vmx. The remaining steps of the Import Wizard are identical to those outlined in Chapter 7.

This particular feature alone makes VirtualCenter 2.5 an invaluable tool for building a responsive disaster recovery and business continuity plan.

The Bottom Line

Cluster virtual machines with Microsoft Clustering Services (MSCS) Clustering virtual machines provides a means of creating an infrastructure that supports high availability for individual virtual machines.

Master It A critical network service requires minimal downtime. You need to design a failover solution for the virtual machine that hosts the network service. Your solution should provide the least amount of service outage while utilizing existing hardware and software platforms.

Implement and manage VMware High Availability (HA). VMware HA enabled on clusters of ESX Servers allows virtual machines from a failed ESX Server host to be restarted on another host. This feature reduces both the downtime and the administrative effort required in response to a failed server.

Master It Domain controllers, mail servers, and database servers must be the first virtual machines to restart in the event of server failure.

Master It In the event of server failure, you do not want virtual machines to be prevented from being powered on because of excessive resource contention.

Master It Virtual machines used for testing purposes should not be powered on by cluster nodes if they were running on the ESX Server host that failed.

Master It Your virtual infrastructure includes redundancy at each level, including switches and NICs. Service Console ports, VMkernel ports, and virtual machine port groups exist on separate virtual switches. You need to ensure that virtual machines continue to run even if the Service Console loses network connectivity.

Back up virtual machines with VMware Consolidated Backup (VCB). VCB is a framework upon which third-party backup solutions can be constructed to perform full virtual machine and file-level backups. While the framework can be used on its own, it lacks any type of automation feature or the ability to write directly to tape.

Master It You need to design a data-recovery plan. The company purchased licenses for VMware Consolidated Backup. You must determine how VCB can accomplish your backup goals. What types of backups does VCB support?

Master It You need to implement VCB as part of a regularly scheduled backup job.

Restore virtual machines with VMware Consolidated Backup (VCB). The VCB framework encompasses not just the backup processes but also restore capabilities. Tools included with VCB allow backups of full virtual machines, individual VMDK files, or specific files from within the virtual machine operating system. In addition, VMware Converter Enterprise offers the simplest restore procedure with its support for restoring VCB backups.

Master It You need to minimize the financial impact of implementing a backup strategy for your virtual infrastructure.

Master It You need to minimize the amount of time required to restore data to any of the virtual machines in your environment.

Master It You have a full virtual machine backup of a system named Server1. A user deletes a file that is included on the last backup of Server1. You need to recover the file.

Master It You need to quickly restore a VCB backup of a virtual machine. The backup is stored in a shared folder named VMBackups on a server named Backup1. The name of the virtual machine is Server17.
