Configure HA for a cluster

Cluster HA depends on shared storage and dynamic migration technologies to provide simple and efficient HA services for applications running on all VMs in the cluster. It reduces service interruption caused by host hardware failure. Cluster HA is applicable to scenarios that require service continuity.

HA mechanism

CVM virtualizes a group of hosts into a cluster that uses a shared resource pool. After you enable HA for the cluster, CVM monitors the running state of all hosts and VMs in the cluster.

When a host fails, CVM migrates the VMs on the host to available hosts in the cluster.

When a VM fails, CVM restarts the VM on its current host. If the restart succeeds, CVM does not migrate the VM. If the restart fails, CVM migrates the VM to another host and restarts it there.

When the network between a host and the shared storage fails, CVM migrates the VMs on the host to available hosts in the cluster.
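The three responses above can be summarized as a small decision function. This is only an illustrative sketch of the documented behavior, not CVM code: the `Failure` enum, the `ha_response` function, and the returned strings are hypothetical names introduced for this example.

```python
from enum import Enum


class Failure(Enum):
    # Hypothetical labels for the three failure types described above.
    HOST_DOWN = "host_down"
    VM_DOWN = "vm_down"
    STORAGE_LINK_DOWN = "storage_link_down"


def ha_response(failure: Failure, restart_succeeded: bool = False) -> str:
    """Return the action the text above describes for a given failure type."""
    if failure is Failure.VM_DOWN:
        # A failed VM is restarted first; migration is only the fallback.
        if restart_succeeded:
            return "restart VM, no migration"
        return "migrate VM to another host and restart it"
    # A host failure and a host-to-storage network failure are handled alike:
    # the VMs on the affected host are moved to available hosts.
    return "migrate all VMs on the host to available hosts"


for failure in Failure:
    print(failure.name, "->", ha_response(failure))
```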

HA benefits

Automatically monitors the running state of hosts and VMs, and migrates a failed VM or the VMs on a failed host to other hosts in the cluster.

Reserves enough resources for VMs to restart if hosts fail.

Automatically migrates VMs between hosts to ensure service continuity in case of hardware failure.

Automatically selects suitable hosts for VMs on a failed host based on the resource usage if you enable both HA and DRS for the cluster.

Restrictions and guidelines

All hosts in an HA-enabled cluster must have the same virtual switch configuration, including virtual switch quantity, name, and forwarding mode.

To ensure that VMs in an HA-enabled cluster can migrate between hosts in the cluster, make sure the image files of all VMs in the cluster are stored on shared storage. As a best practice, do not enable HA or DRS if the VMs use local storage.

In an HA-enabled cluster, all hosts must use CPUs from the same manufacturer. Clusters containing hosts that use CPUs of the same model from the same manufacturer can provide better migration compatibility.

To prevent VM name conflicts, make sure no host in the cluster is in an abnormal state before you disable HA for the cluster. If a VM name conflict occurs, enable HA for the cluster again.

While HA is being enabled or disabled for a cluster, do not start, deploy, or migrate VMs, and do not restart or shut down hosts in the cluster.

To reinstall the CVK component for a host in an HA-enabled cluster, first delete the host from the cluster, reinstall the CVK component, and then add the host to the cluster again.

Before you enable HA for a cluster, make sure all hosts in the cluster have reserved sufficient system resources so that the VMs can migrate between the hosts.

If a VM enabled with the Intel RDT noisy neighbor quieting feature is migrated to a host that does not support the feature, the feature will be disabled for the VM.
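Two of the guidelines above (identical virtual switch configuration and a common CPU manufacturer) lend themselves to an offline check before you enable HA. The sketch below assumes you have already collected the host inventory by some means of your own. The `VSwitch` and `Host` classes, the `precheck` function, and the sample values are hypothetical and are not part of any CVM interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VSwitch:
    name: str
    forwarding_mode: str


@dataclass
class Host:
    name: str
    cpu_vendor: str
    vswitches: frozenset  # frozenset of VSwitch objects


def precheck(hosts):
    """Return a list of problems; an empty list means both checks pass."""
    problems = []
    if not hosts:
        return problems
    ref = hosts[0]  # every other host is compared with the first one
    for host in hosts[1:]:
        # Comparing the sets covers quantity, name, and forwarding mode at once.
        if host.vswitches != ref.vswitches:
            problems.append(f"{host.name}: virtual switch configuration differs from {ref.name}")
        if host.cpu_vendor != ref.cpu_vendor:
            problems.append(f"{host.name}: CPU manufacturer differs from {ref.name}")
    return problems


# Illustrative inventory; host names, switch names, and modes are made up.
hosts = [
    Host("cvk01", "Intel", frozenset({VSwitch("vswitch0", "VEB")})),
    Host("cvk02", "Intel", frozenset({VSwitch("vswitch0", "VEB"), VSwitch("vswitch1", "VEB")})),
]
for problem in precheck(hosts):
    print(problem)
```

The other guidelines, such as shared storage placement and reserved resources, depend on live cluster state and are not covered by this sketch.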

Procedure

On the top navigation bar, click Resources.

From the left navigation pane, select Compute > Host Pool Name > Cluster Name.

Click HA.

Enable HA as needed.

If you enable HA for the cluster, you must select a default startup priority for the VMs in the cluster.

Enable service network HA, HA access control, and local disk HA as needed.

If you enable HA access control, you must specify the minimum number of nodes, select a failover host, or set the reserved CPU and memory percentages.

If you enable local disk HA, you must set the check interval.

Click OK.

Parameters

Startup Priority: Select a default startup priority for the VMs in the cluster. Options include Low, Medium, and High. You can set the startup priority for a VM when you add or edit the VM. After a host fails, the system migrates the VMs on the host based on their startup priorities until all the VMs are migrated or the cluster does not have any available resources.
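To illustrate how startup priority can shape the outcome when resources are scarce, the following sketch orders VMs by priority and places them while capacity remains. It rests on several assumptions that the text above does not confirm: that High is handled before Medium and Low, that memory is the only constrained resource, and that a VM that does not fit is skipped rather than stopping the whole process. The `plan_migration` function and the sample data are hypothetical.

```python
# Lower number means handled earlier (assumed order: High, Medium, Low).
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def plan_migration(vms, free_memory_gb):
    """vms is a list of (name, priority, memory_gb) tuples.

    Returns (migrated, skipped) lists of VM names.
    """
    migrated, skipped = [], []
    # sorted() is stable, so VMs with equal priority keep their input order.
    for name, priority, memory_gb in sorted(vms, key=lambda vm: PRIORITY_ORDER[vm[1]]):
        if memory_gb <= free_memory_gb:
            free_memory_gb -= memory_gb
            migrated.append(name)
        else:
            skipped.append(name)
    return migrated, skipped


vms = [("db", "High", 16), ("web", "Medium", 8), ("batch", "Low", 8), ("cache", "High", 4)]
migrated, skipped = plan_migration(vms, free_memory_gb=28)
print("migrated:", migrated)
print("skipped:", skipped)
```

With 28 GB free in this example, the two High VMs and the Medium VM fit, and the Low VM is left without a target.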

Service Network HA: Configure whether to enable service network HA. When the service network of a VM fails, the VM can be migrated to another host in the cluster. Virtual switches that are not bound to physical NICs and those that use the management network or VXLAN forwarding mode do not support HA failure detection.

HA Access Control: Select whether to enable HA access control. If you enable HA access control, configure one of the following parameters:

Min Nodes: Specify the minimum number of hosts for HA to take effect on the cluster. If the number of hosts that are operating correctly in the cluster is smaller than the specified minimum node number, HA cannot take effect on the cluster. To avoid migration failure caused by inaccurate resource calculation, make sure all hosts in the cluster have the same CPU quantity and memory size.

Failover Host: Select the hosts to which failed VMs are migrated. These hosts cannot be used for regular VM migration or for adding VMs. The failover hosts must use the same shared storage as the service hosts. A host that has running VMs cannot be used as a failover host.

Reserved Resource: Set the reserved CPU and memory percentages. When the remaining resources in the cluster fall below the specified percentages, you cannot start new VMs, set VMs to the running or suspended state, or migrate running VMs to the cluster.
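The Min Nodes and Reserved Resource conditions are simple threshold tests, sketched below. The function names and the sample figures are hypothetical. The sketch also assumes that a shortfall in either CPU or memory is enough to block an operation, which the description above does not state explicitly.

```python
def ha_effective(healthy_hosts: int, min_nodes: int) -> bool:
    """Min Nodes: HA takes effect only if enough hosts are operating correctly."""
    return healthy_hosts >= min_nodes


def above_reserve(total: int, used: int, reserved_pct: int) -> bool:
    """True if the remaining share of one resource is at least reserved_pct."""
    remaining = total - used
    # Integer form of: remaining / total * 100 >= reserved_pct
    return remaining * 100 >= reserved_pct * total


def operation_allowed(cpu, memory, reserved_cpu_pct, reserved_mem_pct) -> bool:
    """cpu and memory are (total, used) pairs in any consistent unit."""
    # Assumption: both resources must stay at or above their reserve.
    return above_reserve(*cpu, reserved_cpu_pct) and above_reserve(*memory, reserved_mem_pct)


print(ha_effective(healthy_hosts=2, min_nodes=3))
print(operation_allowed(cpu=(64, 40), memory=(512, 450), reserved_cpu_pct=20, reserved_mem_pct=20))
```

In the second call, 37.5% of CPU is still free but only about 12% of memory is, so the operation would be refused under a 20% memory reserve.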

Local Disk HA: Select whether to enable local disk HA. If you enable this feature, the system checks the disks in the root directory of the CVK host for disk failures. If a disk failure is detected, the system migrates the VMs on the host that use the shared storage to another host in the same cluster. If you enable this feature, you must also set the check interval.

Host Storage Failure Response: Select the action to take on VMs when a shared storage failure occurs. This parameter is editable only when the value for the Shared Storage Fault Action parameter in system settings is set to Do Not Restart Host.

Migrate: Migrates VMs that have all data stored on the shared storage to other hosts in the cluster when a shared storage failure occurs.

No Action: Freezes VMs that have all or some data stored on the shared storage. A VM cannot be frozen if any of the following conditions applies to it:

The disk bus type is USB.

The disk bus type is high-speed SCSI (FC and iSCSI) for block devices.

Disks are encrypted.

The disk cache mode is writeback or writethrough.

LVM raw blocks exist.

NFS storage is attached.

After the shared storage recovers, a frozen VM automatically returns to the running state.
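The conditions that prevent a VM from being frozen can be expressed as a checklist over its disks. The sketch below is a hypothetical model for reasoning about the list above: the `Disk` fields and the string labels such as `"usb"` and `"scsi-block"` are invented for this example and do not correspond to actual CVM attribute names.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    bus: str                     # hypothetical labels: "virtio", "usb", "scsi-block", ...
    cache: str                   # hypothetical labels: "none", "writeback", "writethrough", ...
    encrypted: bool = False
    lvm_raw_block: bool = False
    on_nfs: bool = False


def freeze_blockers(disks):
    """Return the reasons a VM with these disks cannot be frozen (empty if none)."""
    reasons = []
    for index, disk in enumerate(disks):
        if disk.bus == "usb":
            reasons.append(f"disk {index}: USB bus")
        if disk.bus == "scsi-block":
            reasons.append(f"disk {index}: high-speed SCSI block device")
        if disk.encrypted:
            reasons.append(f"disk {index}: encrypted")
        if disk.cache in ("writeback", "writethrough"):
            reasons.append(f"disk {index}: cache mode {disk.cache}")
        if disk.lvm_raw_block:
            reasons.append(f"disk {index}: LVM raw block")
        if disk.on_nfs:
            reasons.append(f"disk {index}: NFS storage")
    return reasons


disks = [Disk(bus="virtio", cache="none"), Disk(bus="virtio", cache="writeback")]
print(freeze_blockers(disks))
```

A VM whose disks produce an empty list meets none of the listed conditions and is therefore a candidate for freezing under No Action.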

Timeout: The interval between execution of the Host Storage Failure Response No Action policy and execution of the Host Storage Failure Response Migrate policy. During this interval, I/O commands are reissued to the storage device. The default is 12000 minutes. The maximum value is 2147483647 minutes.