In the first part of this Nutanix Metro Availability series I discussed the datacenter failure recovery operation (failover) to a secondary site. In this second part I talk about the operational procedure to resume normal operations back to the primary site (failback) after a successful failover operation.
If you missed the announcement of NOS 4.1, please refer to All Nutanix 4.1 Features Overview in One Page (Beyond Marketing).
Datacenter Failure Failback – Operation
The example below follows a datacenter failure recovery (failover) as described in my first article.
I had two sites that were replicating distinct containers to each other. After a network or datacenter outage, the Metro Availability peering was automatically broken to allow each surviving cluster to operate independently.
After the container promotion, Nutanix started executing the virtual machines from Site 1 in the cluster at Site 2: the virtual machines were automatically restarted on the surviving cluster and operations resumed. At this point, with virtual machines running in Site 2, the container in Site 1 has become out-of-sync (1), and vice versa.
When the datacenter or the network is back online, it is necessary to re-enable synchronization between sites, clusters and containers. This is a one-click manual step executed by the administrator when it’s safe to resume normal operations. Once the re-enable operation is executed, containers are re-activated, replication is restarted, and the Nutanix clusters work out which data was created during the outage. This data is then shipped to Site 1 and the clusters are in sync once again.
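To make the resynchronization idea concrete, here is a minimal Python sketch of shipping only the data that changed during the outage. It is a toy model and not Nutanix code: the Container class, the block dictionary and the changed_blocks and resync functions are all hypothetical names chosen for illustration, and the way Nutanix actually tracks changed data is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Toy stand-in for a replicated container (hypothetical, not a Nutanix object)."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> data


def changed_blocks(source: Container, target: Container) -> dict:
    """Return the blocks that are missing or different on the target."""
    return {
        block_id: data
        for block_id, data in source.blocks.items()
        if target.blocks.get(block_id) != data
    }


def resync(source: Container, target: Container) -> int:
    """Ship only the delta to the target and return the number of blocks sent."""
    delta = changed_blocks(source, target)
    target.blocks.update(delta)
    return len(delta)


# Site 2 kept serving writes during the outage, so it is the source of truth.
site1 = Container("blue@site1", {1: "a", 2: "b"})
site2 = Container("blue@site2", {1: "a", 2: "b2", 3: "c"})

shipped = resync(site2, site1)
print(shipped, site1.blocks == site2.blocks)
```

The sketch only handles new and modified blocks; deletions and writes that arrive while the resync is in flight are deliberately left out to keep the example short.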
Please note that at this point virtual machines are still running in Site 2, with the primary container (blue) located in Site 2.
Since replication is re-established, it is possible to move virtual machines back to Site 1; however, it is first necessary to promote the container (blue) in Site 1 back to the primary state (2). This is an important step, because if virtual machines are migrated back to Site 1 without the container promotion, their I/O would still traverse the network to Site 2.
After this step operations are back to normal: virtual machines are once again in sync across datacenters and clusters, and are running in their correct site. Despite the manual intervention required from the administrator to resume normal operations, this is in reality a simple 2-click procedure. Please note that those two manual steps can also be fully automated through the use of run-book automation tools.
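As an illustration of what such run-book automation might encode, the Python sketch below models the failback order described above: re-enable replication, promote the container in Site 1, then move the virtual machines. Everything here is an assumption made for the example; MetroState and the three step functions are hypothetical and do not correspond to any Nutanix API or tool. The guard in migrate_vms is a design choice of this sketch: the migration itself would work without the promotion, but the I/O would then cross the inter-site link.

```python
from dataclasses import dataclass, field


@dataclass
class MetroState:
    """Toy model of the state right after failover (hypothetical)."""
    replication_enabled: bool = False
    in_sync: bool = False
    primary_site: str = "site2"
    vm_site: str = "site2"
    log: list = field(default_factory=list)


def reenable_replication(state: MetroState) -> None:
    # Click 1: re-enable replication so the outage delta reaches Site 1.
    state.replication_enabled = True
    state.in_sync = True  # assumption: the resync is treated as already complete
    state.log.append("re-enable replication")


def promote_container(state: MetroState, site: str) -> None:
    # Click 2: make the container at `site` primary again.
    if not state.in_sync:
        raise RuntimeError("containers must be in sync before promotion")
    state.primary_site = site
    state.log.append(f"promote container at {site}")


def migrate_vms(state: MetroState, site: str) -> None:
    # Guard: only move VMs to the site that holds the primary container,
    # so their I/O stays local instead of crossing the inter-site link.
    if state.primary_site != site:
        raise RuntimeError("promote the container first")
    state.vm_site = site
    state.log.append(f"migrate VMs to {site}")


def failback(state: MetroState, site: str = "site1") -> MetroState:
    """Run the failback steps in the order described in the article."""
    reenable_replication(state)
    promote_container(state, site)
    migrate_vms(state, site)
    return state


state = failback(MetroState())
print(state.log)
```

Running the steps out of order, for example calling migrate_vms before promote_container, raises an error in this model, which is the kind of check a run-book would typically enforce.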
In Part 3 I will explain the operational procedure to add a new node to an existing replication container and also provide some additional technical details on Nutanix Metro Availability architecture.
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.