In the first part of this Nutanix Metro Availability series I discussed the datacenter failure recovery operation (failover) to a secondary site. In the second part I covered the operational procedure to resume normal operations after a successful failover. In this third and last part I discuss the operational procedure to recover an entire datacenter to a new Nutanix cluster in a new site.
If you missed the announcement of NOS 4.1, please refer to All Nutanix 4.1 Features Overview in One Page (Beyond Marketing).
Datacenter Failure Recovery to a new Cluster – Operation
The example below follows a datacenter failure recovery (failover) as described in my first two articles.
I had two sites that were replicating distinct containers to each other (bi-directional). After a network or datacenter outage the Metro Availability peering was automatically broken to allow each surviving cluster to operate independently. However, in this case, let’s assume that site 1 was completely lost due to flooding or another natural disaster.
In this case site 2 has all the data for site 1, and site 1 is completely down. The administrator decides to move the entire workload belonging to site 1 to site 3. (Please note that the administrator may choose to temporarily run the workload from site 1 in site 2 until it’s time to move to site 3.)
Just like the other metro cluster operations, re-establishing operations in a brand new site is just a couple of clicks away. After racking, stacking and configuring the new cluster in site 3, the administrator needs to establish a connection between sites 2 and 3. This can easily be done via the PRISM UI.
The first picture demonstrates the scenario described, where site 2 is hosting the workload from site 1 (blue) and its own workload (green). Now the data must be migrated to a completely new cluster in a new site (1).
The next step is to enable replication of the blue container between sites 2 and 3 (2).
The replication, if bidirectional, will start to synchronize the data in the container (3) between both sites. Please note that Metro Availability replication works in conjunction with all NOS data management features such as compression, de-duplication, shadow clones, automated tiering and others. Additionally, Metro Availability also offers compression over the wire to reduce the amount of bandwidth required.
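To get a feel for how compression over the wire affects the initial synchronization, here is a rough back-of-the-envelope sketch. The function name, the decimal unit conversion and the numbers in the usage note are my own assumptions for illustration; real transfer times depend on the workload, the link and how compressible the data actually is.

```python
def estimated_sync_hours(data_gb: float, link_mbps: float,
                         compression_ratio: float = 1.0) -> float:
    """Rough time, in hours, to ship data_gb over a link of link_mbps.

    compression_ratio is the assumed on-the-wire reduction
    (2.0 means the data shrinks to half its size). Illustrative only.
    """
    if data_gb < 0 or link_mbps <= 0 or compression_ratio < 1.0:
        raise ValueError("expected data_gb >= 0, link_mbps > 0, ratio >= 1.0")
    # Decimal units: 1 GB = 8,000 megabits.
    wire_megabits = data_gb * 8 * 1000 / compression_ratio
    seconds = wire_megabits / link_mbps
    return seconds / 3600
```

For example, under these assumptions 4,500 GB over a 1,000 Mbps link works out to about 10 hours uncompressed, and about 5 hours if the data compresses 2:1 on the wire.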
Once the replication is complete, the next step is to promote the blue container (4) in site 3. The container promotion tells the Nutanix cluster in site 3 to now run the virtual machines that originally belonged to site 1. When that is done the virtual machines will automatically restart on the new cluster (5) and operations will be resumed. The promotion step is a one-click manual procedure, but it can also be fully automated with some basic scripting or run-book automation tools. Please note that this scenario assumes stretch clusters and stretch VLANs are in use.
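As an illustration of what such run-book automation could look like, here is a minimal sketch that waits for the container to be in sync and then promotes it. Everything in it is hypothetical: `MetroClient`, `replication_in_sync` and `promote` are names I made up for the example and are not the Nutanix API, and `FakeCluster` is an in-memory stand-in so the sketch runs without a real cluster. In practice you would back the interface with whatever your cluster tooling actually exposes.

```python
import time
from typing import Callable, Protocol


class MetroClient(Protocol):
    """Hypothetical interface for illustration; NOT the Nutanix API."""

    def replication_in_sync(self, container: str) -> bool: ...
    def promote(self, container: str) -> None: ...


def promote_when_synced(client: MetroClient, container: str,
                        poll_seconds: float = 30.0, max_polls: int = 120,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the container is in sync, then promote it.

    Returns the number of polls it took. Raises TimeoutError if the
    container is still not in sync after max_polls checks.
    """
    for attempt in range(1, max_polls + 1):
        if client.replication_in_sync(container):
            client.promote(container)
            return attempt
        sleep(poll_seconds)
    raise TimeoutError(f"{container!r} not in sync after {max_polls} polls")


class FakeCluster:
    """In-memory stand-in that reports 'in sync' on the Nth poll."""

    def __init__(self, polls_until_synced: int) -> None:
        self._remaining = polls_until_synced
        self.promoted = []

    def replication_in_sync(self, container: str) -> bool:
        self._remaining -= 1
        return self._remaining <= 0

    def promote(self, container: str) -> None:
        self.promoted.append(container)


if __name__ == "__main__":
    site3 = FakeCluster(polls_until_synced=3)
    # Skip real sleeping in the demo by passing a no-op sleep function.
    polls = promote_when_synced(site3, "blue", sleep=lambda _s: None)
    print(f"promoted {site3.promoted} after {polls} polls")
```

The key design choice is that promotion only happens after a positive sync check, and that the loop gives up with an error rather than promoting a container that never finished synchronizing.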
I have been in the technology and infrastructure space for a long time and have managed very large datacenters. I have never seen a solution that allows efficient data migration with failover and failback operations in such a simple and elegant manner.
If you are interested in reading the first two parts of this series, see Nutanix Metro Availability Operations – Part 1 and Nutanix Metro Availability Operations – Part 2.
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.