Nutanix Metro Availability Operations – Part 2

In the first part of this Nutanix Metro Availability series I discussed the datacenter failure recovery operation (failover) to a secondary site. In this second part I cover the operational procedure for resuming normal operations back to the primary site (failback) after a successful failover.

If you missed the announcement of NOS 4.1, please refer to All Nutanix 4.1 Features Overview in One Page (Beyond Marketing).

 

Datacenter Failure Failback – Operation

The example below follows a datacenter failure recovery (failover) as described in my first article.

I had two sites that were replicating distinct containers to each other. After a network or datacenter outage the Metro Availability peering was automatically broken to allow each surviving cluster to operate independently.

After the container promotion, the virtual machines from Site 1 were automatically restarted on the surviving cluster in Site 2 and operations resumed. At this point in time, with virtual machines running in Site 2, the container in Site 1 had become out-of-sync (1), and vice-versa.

When the datacenter or the network is back online it is necessary to re-enable synchronous replication between the sites, clusters and containers. This is a one-click manual step executed by the administrator when it is safe to resume normal operations. Once the re-enable operation is executed, containers are re-activated, replication is re-started, and the Nutanix clusters identify the data that was created during the outage. This data is then shipped to Site 1 and the clusters get back in sync.

(1)

Please note that at this point in time virtual machines are running in Site 2, with the primary container (blue) also located in Site 2.

Since replication is re-established it is possible to move virtual machines back to Site 1; however, it is first necessary to promote the container (blue) in Site 1 back to the primary state (2). This is an important step, because if virtual machines are migrated back to Site 1 without the container promotion, their IOs would still traverse the network to Site 2.

(2)

 

After this step operations are back to normal: virtual machines are once again in sync across datacenters and clusters, and are running in their correct site. Despite the manual intervention required of the administrator, resuming normal operations is in reality a simple two-click procedure. Please note that these two manual steps can also be fully automated through the use of run-book automation tools.
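To illustrate what such run-book automation might look like, here is a minimal Python sketch of the two failback steps (re-enable replication, then promote the Site 1 container back to primary) driven over HTTP. The Prism address, authentication token, protection-domain name, and especially the REST endpoint paths (`metro_avail_enable`, `metro_avail_promote`) are assumptions for illustration only; consult the Nutanix Prism REST API documentation for the actual calls.

```python
# Hypothetical run-book sketch of the two manual failback steps described
# above. Endpoint paths, parameter names, and the protection-domain name
# are ASSUMPTIONS, not the documented Nutanix API.
import json
import urllib.request

PRISM_SITE1 = "https://prism-site1.example.com:9440"  # hypothetical address


def metro_action_url(base, pd_name, action):
    """Build the (assumed) endpoint for a Metro Availability action on a
    protection domain, e.g. action='enable' or action='promote'."""
    return (f"{base}/PrismGateway/services/rest/v2.0"
            f"/protection_domains/{pd_name}/metro_avail_{action}")


def make_request(url, token):
    """Prepare an authenticated POST; executing it is left to the caller."""
    req = urllib.request.Request(url, data=json.dumps({}).encode(),
                                 method="POST")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/json")
    return req


def failback(pd_name, token):
    # Step 1: re-enable replication so Site 2 ships the delta back to Site 1.
    urllib.request.urlopen(
        make_request(metro_action_url(PRISM_SITE1, pd_name, "enable"), token))
    # Step 2: promote the Site 1 container back to primary BEFORE migrating
    # the VMs, so their IO no longer traverses the network to Site 2.
    urllib.request.urlopen(
        make_request(metro_action_url(PRISM_SITE1, pd_name, "promote"), token))


# Usage (would perform real HTTP calls against the Prism endpoint):
# failback("blue-container-pd", token="...")
```

The point of the sketch is simply that both manual clicks map to two idempotent API actions, which is what makes them easy to wrap in an orchestration tool.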

 

In Part 3 I will explain the operational procedure to add a new node to an existing replication container and also provide some additional technical details on Nutanix Metro Availability architecture.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.

4 comments

    • michael on 12/01/2014 at 12:55 am

    hi andre

    thx for sharing! what about stretched clusters, or even stretched guest clusters? are they supported to minimize app downtime?

    thx

    • Michael on 12/06/2014 at 1:39 am

    Thank you Andre, thats great news!

    • Thomas on 09/10/2015 at 6:24 am

    If this is a true stretched cluster why isn't it mentioned as VMware Metro Cluster certified? Best Regards Thomas

  1. Metro Availability is not a stretched cluster from a Nutanix perspective; it is actually a synchronously replicated datastore, not a stretched one. From a VMware admin perspective it feels like a stretched cluster, but it is not. Metro Availability is available to all supported hypervisors, and that's one of the nice things about being hypervisor independent.
