Nutanix is paranoid about data loss and enforces multiple architectural considerations and checks to ensure data is always protected and available.
Nutanix's architectural considerations include eliminating single points of failure and bottlenecks in management services, which makes the system tolerant of failures. Failure tolerance is key to a stable, scalable distributed system, and the ability to keep functioning in the presence of failures is crucial for availability.
Techniques such as vector clocks, two-phase commit, consensus algorithms, leader elections, eventual and strict consistency, multiple replicas, dynamic flow control, rate limiting, exponential back-offs, optimistic replication, automatic failover, hinted handoffs, data scrubbing and checksumming, among others, all contribute to Nutanix's ability to handle failures, and they also provide the backbone for recovering from them easily.
NDFS uses a replication factor (RF) and checksums to ensure data redundancy and availability in the case of a node or disk failure or data corruption. When a node or disk fails, the data is automatically re-replicated across the nodes in the cluster to maintain the defined replication factor and data SLA; this is called re-protection. Re-protection may also happen after a Controller VM goes down.
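To make re-protection concrete, here is a minimal sketch in Python. It is not Nutanix code: the `placement` map, the node names, the least-loaded target choice and the use of CRC32 as the checksum are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def is_corrupt(data: bytes, stored_checksum: int) -> bool:
    """True when a replica no longer matches the checksum recorded at write time.
    CRC32 is a stand-in; the checksum NDFS actually uses is not stated here."""
    return zlib.crc32(data) != stored_checksum


def reprotect(placement, failed_node, live_nodes, rf=2):
    """placement maps extent id -> set of nodes holding a replica (hypothetical model).
    Returns a new map in which every extent is back at `rf` replicas."""
    # Count replicas per surviving node so new copies go to the least-loaded node.
    load = defaultdict(int)
    for nodes in placement.values():
        for node in nodes:
            if node != failed_node:
                load[node] += 1

    repaired = {}
    for extent, nodes in placement.items():
        survivors = set(nodes) - {failed_node}
        if not survivors:
            raise RuntimeError(f"{extent}: no surviving replica to copy from")
        while len(survivors) < rf:
            candidates = [n for n in live_nodes
                          if n != failed_node and n not in survivors]
            if not candidates:
                raise RuntimeError(f"{extent}: not enough nodes for RF {rf}")
            target = min(candidates, key=lambda n: (load[n], n))
            survivors.add(target)
            load[target] += 1
        repaired[extent] = survivors
    return repaired
```

Because the new copies are spread over whichever surviving nodes carry the least data, every node takes part in the repair instead of one spare absorbing all of it.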
Node and block awareness is a feature that enables the NDFS metadata layer to choose the best placement for data and metadata in the cluster, ensuring the cluster tolerates single or multiple node failures, or an entire block failure. This is critical to maintaining data availability across large clusters, because data is never just placed at random on different hosts in the cluster.
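A placement policy of this kind can be sketched as follows. This is an illustrative Python sketch, not the NDFS algorithm: the `node_to_block` map and the function name are hypothetical, and the first replica is pinned to the local node on the assumption that one copy stays where the VM runs.

```python
def place_replicas(node_to_block, rf, local_node):
    """Pick `rf` nodes for an extent, preferring one replica per block.

    node_to_block: dict mapping node name -> block (chassis) id.
    Falls back to distinct nodes when there are fewer blocks than replicas.
    """
    chosen = [local_node]
    used_blocks = {node_to_block[local_node]}

    # First pass (block awareness): take nodes from blocks not used yet.
    for node in sorted(node_to_block):
        if len(chosen) == rf:
            break
        if node_to_block[node] not in used_blocks:
            chosen.append(node)
            used_blocks.add(node_to_block[node])

    # Second pass (node awareness): any node that is not already chosen.
    for node in sorted(node_to_block):
        if len(chosen) == rf:
            break
        if node not in chosen:
            chosen.append(node)

    if len(chosen) < rf:
        raise ValueError("not enough nodes to satisfy the replication factor")
    return chosen
```

With five nodes spread over three blocks, RF 3 puts each replica in a different block, so losing a whole block still leaves two copies.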
Because NDFS always writes data to multiple nodes, it is extremely important that the consistency model is strict, ensuring that writes are only acknowledged once two or more copies have been successfully committed to disk on different nodes or blocks. This requires a clear understanding of the CAP theorem (Consistency, Availability and Partition Tolerance) (http://en.m.wikipedia.org/wiki/CAP_theorem).
Medusa, the metadata layer, stores and manages all of the cluster metadata in a distributed ring based on a heavily modified Apache Cassandra, and the Paxos algorithm is used to enforce strict consistency.
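The acknowledgement rule can be illustrated with a small Python sketch. The `Replica` class and `replicated_write` function are hypothetical names, and each replica is modelled as an in-memory dictionary rather than a disk.

```python
class Replica:
    """Toy stand-in for a node that can durably commit a write."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.store = {}

    def commit(self, key, data):
        if not self.healthy:
            return False          # node is down or the disk write failed
        self.store[key] = data
        return True


def replicated_write(replicas, key, data, required=2):
    """Acknowledge the write only if at least `required` replicas committed it."""
    acks = sum(1 for replica in replicas if replica.commit(key, data))
    return acks >= required
```

A write sent to one healthy and one failed replica returns `False`, so the caller never sees an acknowledgement for data that exists in only one place. A real system would also have to clean up or repair the partial write, which this sketch leaves out.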
“Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures. Paxos is usually used where durability is required (for example, to replicate a file or a database), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded numbers of replicas are unresponsive. There is also a mechanism to drop a permanently failed replica or to add a new replica.” This service (Medusa) runs on every node in the cluster.
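For readers who have not seen Paxos before, the following is a minimal single-decree sketch in Python. It is a textbook illustration running in one process, not how Medusa implements it; the class and function names and the `up` flag used to simulate an unresponsive replica are assumptions.

```python
class Acceptor:
    def __init__(self):
        self.up = True          # set to False to simulate an unresponsive replica
        self.promised = -1      # highest proposal number promised so far
        self.accepted_n = -1    # number of the proposal accepted so far, if any
        self.accepted_v = None  # value of that proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if self.up and n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if self.up and n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return True
        return False


def propose(acceptors, n, value):
    """Try to get `value` chosen with proposal number n.
    Returns the chosen value, or None if no majority was reached."""
    majority = len(acceptors) // 2 + 1

    replies = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in replies if ok]
    if len(granted) < majority:
        return None

    # If any acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    prev_n, prev_v = max(granted, key=lambda reply: reply[0])
    if prev_n >= 0:
        value = prev_v

    acks = sum(1 for a in acceptors if a.accept(n, value))
    return value if acks >= majority else None
```

With three acceptors, a value can still be chosen while one is down, and once a value has been chosen a later proposal with a different value ends up returning the original one. That is the property that keeps replicas from disagreeing.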
The larger the cluster, the higher the chance of a double failure (e.g. two drives), which may lead to data loss. Today, with NOS 4.1, NDFS can implement RF 2 and RF 3; with RF 3 it tolerates up to two simultaneous component failures without data loss. However, it is important to understand that the larger the cluster is, the lower the chance of a triple disk failure causing data loss, because it is less likely that the same data is stored on all three failed drives. Nutanix distributes data across all drives available in the cluster in 1 MB extents, but also always keeps a copy of the data local to where the VM is running for performance reasons; this is called data locality.
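A rough back-of-the-envelope calculation shows why the odds improve with cluster size. The Python sketch below assumes replicas are placed uniformly at random on distinct drives, which is a simplification: it ignores data locality and node and block awareness, and the drive counts are hypothetical.

```python
from math import comb


def p_extent_on_failed_drives(n_drives, rf=3):
    """Probability that one extent has all `rf` replicas on `rf` specific
    failed drives, assuming uniform random placement on distinct drives."""
    return 1 / comb(n_drives, rf)


for drives in (12, 48):
    ways = comb(drives, 3)
    print(f"{drives} drives: 1 in {ways} per extent")
```

For 12 drives there are 220 possible three-drive combinations, and for 48 drives there are 17,296, so the chance that a given extent sits entirely on the three failed drives drops sharply. This is a per-extent figure only; it does not account for how many extents the cluster holds.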
The larger the cluster, the faster it can recover from failures too, because all nodes in the cluster effectively contribute to rebuilding the lost data. This also lowers the chance of data loss from a double drive failure, because NDFS does not thrash a small number of disks to recover from a drive loss, i.e. it does not repair to a hot spare or replacement drive as RAID groups or other hyper-converged solutions do. The performance impact during recovery from a drive failure is also much lower on Nutanix than on traditional RAID systems.
Nutanix also provides native inter-cluster data replication (synchronous and asynchronous) and backup-to-cloud (AWS now; Azure soon).
For more information on Nutanix data protection, check out The Nutanix Bible.
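The effect of spreading the rebuild can be estimated with simple arithmetic. The Python sketch below is an idealized model: it assumes each participant contributes the same rebuild throughput and that the work divides evenly, and the figures are hypothetical rather than measured.

```python
def rebuild_hours(data_gb, per_participant_mb_s, participants):
    """Idealized rebuild time when `participants` share the work evenly.
    Uses 1 GB = 1000 MB."""
    total_mb = data_gb * 1000
    seconds = total_mb / (per_participant_mb_s * participants)
    return seconds / 3600


# Hypothetical figures: 3.6 TB to rebuild at 100 MB/s per participant.
hot_spare = rebuild_hours(3600, 100, participants=1)
distributed = rebuild_hours(3600, 100, participants=10)
```

Under these assumptions a single hot spare needs about 10 hours, while ten participants finish in about 1 hour. The shorter the rebuild, the shorter the window in which a second failure can hit data that is still under-replicated.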
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net