Apr 16 2014

Nutanix 4.0 Hybrid On-Disk De-Duplication Explained

Nutanix is a distributed scale-out 3-tier platform, utilizing RAM, flash and HDD. This combination provides access to frequently accessed data in microseconds, instead of the milliseconds seen when only flash devices are used. This is an awesome feature that influences and enhances the end-user experience for any type of workload.

De-duplication allows the sharing of guest VM data on premium storage tiers (RAM and Flash). Performance of guest VMs suffers when active data can no longer fit in the premium tiers. If guest VMs are substantially similar, for example if the Nutanix cluster is used to host numerous Windows desktops, enabling de-duplication substantially improves performance. When used in the appropriate situation, de-duplication makes the effective size of the premium tiers larger so that the active data can fit.

The Nutanix de-duplication engine was designed for scale-out, providing near-instantaneous application response times. Nutanix de-duplication is 100% software defined, with no dedicated controllers or hardware crutch; and because Nutanix is platform agnostic, this feature is available in whatever hypervisor or VDI solution you choose to work with (today vSphere, Hyper-V and KVM are supported).

Nutanix added inline de-duplication for the performance tier (RAM and flash) in NOS 3.5. The new NOS 4.0 release introduces de-duplication for the capacity tier, the extent store, allowing organizations to achieve greater VM density: more virtual machines per node.

The capacity tier de-duplication is a post-process de-duplication, meaning common blocks are consolidated by a Curator background process, by default every 6 hours, while de-duplication in the performance tier is inline, meaning it happens as data blocks traverse RAM or flash. This hybrid de-duplication approach allows the Nutanix CVM to be less intrusive and use fewer CPU cycles to detect common data blocks.

The ON-DISK capacity de-duplication must be enabled per container in the NOS 4.0 GUI (picture below), and is mostly recommended for VDI persistent desktops and server workloads. However, it is possible to enable and disable de-duplication per VMDK (vDisk) using NCLI. A future version of PRISM will provide the ability to manage de-duplication per VM or VMDK.

[Screenshot: Screen Shot 2014-04-07 at 9.04.24 PM]

Every write I/O larger than 64KB is fingerprinted with the US Secure Hash Algorithm 1 (SHA1), using native SHA1 optimizations available on Intel processors, and only a single copy of each common data block is stored in the Nutanix cluster. In the VDI context this means that persistent desktops can be deployed without the capacity or performance penalties found in most storage solutions.
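To make the fingerprinting idea concrete, here is a minimal Python sketch of per-chunk SHA1 fingerprinting. The 64KB threshold and 16KB chunk size come from this article; everything else (function name, return shape) is illustrative, not Nutanix internals, and Python's `hashlib` stands in for the hardware-accelerated SHA1 the CVM actually uses.

```python
import hashlib

FINGERPRINT_THRESHOLD = 64 * 1024  # only writes larger than 64KB are fingerprinted
CHUNK_SIZE = 16 * 1024             # fingerprinting granularity (16KB in NOS 4.0)

def fingerprint_write(data: bytes):
    """Return one SHA1 fingerprint per 16KB chunk of a write,
    or an empty list if the write is too small to fingerprint."""
    if len(data) <= FINGERPRINT_THRESHOLD:
        return []
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]
```

Two writes containing the same 16KB pattern produce identical fingerprints, which is what lets the cluster keep a single physical copy of that data.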

The picture below shows the new NOS 4.0 Storage View with On-Disk De-Duplication Savings. (Click on the picture to see full size)

[Screenshot: Screen Shot 2014-04-13 at 9.33.42 PM]

If the VMs or desktops use linked clones (Nutanix VAAI snapshots), the de-duplication happens at a different level: the linked clone tracks the parent/child hierarchy, and in this case no fine-grained de-duplication is required.

How it works

The de-duplication happens via Curator full scans. As I mentioned, full scans happen every 6 hours, and the process looks for common fingerprinted data to de-duplicate. The de-duplication itself happens in the Stargate component, and during the post-process de-duplication the CVM may incur CPU overhead in the 10 to 15% range.


During this background process, Curator scans the metadata looking for duplicate fingerprints. If a duplicated SHA1 fingerprint is found, Stargate re-writes the data in a new location, removing the duplicated copies. Behind the scenes, NDFS increments refcounts against what has already been de-duplicated, i.e. against the shared extents.
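The scan-and-refcount step can be sketched as follows. This is a toy model that assumes extents are simple (fingerprint, data) pairs; the real Curator/Stargate pipeline works on cluster metadata, but the bookkeeping idea is the same: keep one physical copy per fingerprint and count logical references to it.

```python
def curator_scan(extents):
    """Illustrative post-process de-duplication pass.

    `extents` is a list of (fingerprint, data) pairs. Returns a
    de-duplicated store (one physical copy per fingerprint) plus a
    refcount of logical references per shared extent."""
    store, refcounts = {}, {}
    for fp, data in extents:
        if fp not in store:
            store[fp] = data                          # first occurrence: keep it
        refcounts[fp] = refcounts.get(fp, 0) + 1      # every logical reference counts
    return store, refcounts
```

After the scan, duplicated data occupies a single physical extent while the refcount records how many vDisks still point at it.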

In NOS 4.0 the default de-duplication uses 16KB granularity for SHA1 fingerprinting. For this reason NOS 4.0 introduces the concept of two extent group types, 16KB and 4KB. De-duplication is done at a 16KB block size, while caching in the performance tier is done at 4KB extent granularity for better utilization of the caching resources.

In the second part of this de-duplication article I will demonstrate how to track and optimize de-duplication for different types of workloads.

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Permanent link to this article: http://myvirtualcloud.net/?p=6208

Apr 15 2014

Nutanix 4.0 Features Overview (Beyond Marketing)

Today Nutanix announced the release of NOS 4.0. This is a major release, introducing features in the areas of data services, performance, resiliency, data protection, and management and analytics. It’s been eight months since NOS 3.5 was announced with the Elastic De-Duplication Engine, PRISM UI, RESTful API and SRM support, and Nutanix is now delivering another major release. Just as I did for the VMware Horizon View releases when I was at VMware, I am going to start doing the Beyond Marketing series for Nutanix.

Please refer to the product Release Notes for official information from Nutanix.

Let’s start with the New Features…

Core Data Services


  • Hybrid On-Disk De-Duplication

De-duplication allows the sharing of guest VM data on premium storage tiers (RAM and Flash). Performance of guest VMs suffers when active data can no longer fit in the premium tiers. If guest VMs are substantially similar, for example if the Nutanix cluster is used to host numerous Windows desktops, enabling de-duplication substantially improves performance. When used in the appropriate situation, de-duplication makes the effective size of the premium tiers larger so that the active data can fit.

  • Shadow Clones (Official Support)

Shadow Clones are finally out of tech preview. Shadow Clones intelligently analyze the I/O access pattern at the storage layer to identify files shared in read-only mode (e.g. a linked clone replica). When a 100% read-only disk is discovered, Nutanix automatically creates a snapshot at the storage layer on each Controller VM (CVM) and redirects all read I/O to the local copy, drastically improving the end-user experience. Read more at Nutanix Shadow Clones Explained and Benchmarked.




Performance

Multiple performance improvements have been added to NOS 4.0, increasing overall system performance by 20% compared to NOS 3.5. A separate article will discuss the performance improvements in NOS 4.0 in more detail.

  • Multi-disk OpLog Store

Nutanix now utilizes all SSDs to host the OpLog store, increasing on-disk OpLog capacity and improving performance, as different vDisks can use different SSDs at the same time for writing OpLog data.

  • Other performance improvements include fault isolation at the vDisk level and O_DIRECT I/O in both the extent store and the OpLog store.




Resiliency

  • Tunable Fault Tolerance (RF-3)

Replication Factor 3 (also known as FT2) protects data against two simultaneous node, disk or NIC failures.

  • Smart Pathing (CVM/AutoPathing 2.0)

The new and improved CVM AutoPathing 2.0 prevents performance loss during rolling upgrades, minimizing I/O timeouts by pre-emptively redirecting NFS traffic to other CVMs. Failover traffic is automatically load-balanced across the rest of the cluster based on node load.

  • Availability Domains (Failure Domain Awareness)

Also known as ‘Block Fault Tolerance’ or ‘Rack-able Unit Fault Tolerance’, the availability domain feature adds the concept of block awareness to Nutanix cluster deployments. It works by managing the placement of data and metadata in the cluster, ensuring that no two replicas of the same data are stored in the same Nutanix block, for high availability purposes.
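A toy sketch of block-aware replica placement, assuming nodes are described by a made-up dict with a `block` field; the actual NDFS placement logic is far more involved, but the core constraint is choosing a replica target outside the primary copy's block whenever one exists.

```python
import random

def place_replica(nodes, primary):
    """Pick a replica target for data whose primary copy lives on `primary`.

    Prefer a node in a different block (block awareness); if the cluster
    has only one block, fall back to any other node (node awareness)."""
    candidates = [n for n in nodes if n["block"] != primary["block"]]
    if not candidates:
        candidates = [n for n in nodes if n is not primary]
    return random.choice(candidates)
```

With this constraint, losing an entire block still leaves at least one replica of every extent intact.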



Data Protection


  • Snapshot Browser

The new snapshot browser functionality allows administrators to see and restore point-in-time array-based snapshots of a VM or a group of VMs in a local or remote protection domain. This functionality is powered by extremely detailed scheduling that allows for very granular and application-consistent snapshots.

(click on the image to enlarge)

  • Snapshot Scheduling via PRISM

Nutanix UI now provides calendar-based scheduling for backups and replication, with the ability to specify data retention policies per remote site. Nutanix effectively delivers a single pane of glass that allows administrators to configure and manage local and remote VM/file backups.

  • Improved Nutanix Storage Replication Adapter (SRA)

The Nutanix SRA now quickly detects the files corresponding to the VMs protected in SRM, with support for up to 50 VMs per vStore protected group in SRM. Support has also been added for multiple SRM devices in an SRM protection group, and for executing multiple SRM recovery plans in parallel.

  • Disaster Recovery Support for Hyper-V

Nutanix 4.0 extends its DR capabilities to Hyper-V, providing a VM-centric native disaster recovery solution. The Hyper-V support now has feature parity with Nutanix DR for ESX. Being VM-centric means that in addition to protecting the files associated with a VM, Nutanix also orchestrates powering down, un-registering, registering/cloning, and powering on the VM in the destination cluster/site.


Management and Analytics


  • One-Click NOS Upgrade

As the name says, it’s a one-click NOS upgrade for the entire Nutanix cluster. Nutanix one-click upgrade automatically indicates when a new NOS version is available and will auto-download the binaries if the auto-download option is enabled. With a single click, Nutanix upgrades all nodes in the cluster using a highly parallel process, rebooting one CVM at a time in a rolling fashion. The entire cluster upgrade can be fully monitored by the administrator.
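The rolling mechanism can be sketched as a simple loop, with made-up node dicts and an `upgrade_one` callback standing in for the real upgrade work; the key property is that only one CVM is ever out of service at a time, so the cluster keeps serving I/O throughout.

```python
def rolling_upgrade(nodes, upgrade_one):
    """Illustrative rolling upgrade: take one node out of service at a
    time, upgrade it, and bring it back before touching the next."""
    for node in nodes:
        node["in_service"] = False   # traffic is redirected to other CVMs
        upgrade_one(node)            # reboot/upgrade this CVM
        node["in_service"] = True    # node rejoins before the next reboot
```

In the real product this loop is driven cluster-wide by the one-click upgrade workflow, combined with the AutoPathing traffic redirection described earlier.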

  • Cluster Health

Nutanix Cluster Health is a great asset in maintaining availability for Tier 1 workloads. Cluster Health lets you monitor and visualize the overall health of cluster nodes, VMs and disks from a variety of different views. With the ability to set availability requirements at the workload level, Cluster Health will visually dissect what’s important and give you guidance on how to take corrective action.

(click on the image to enlarge)

  • Prism Central (Multi-Cluster UI)

Nutanix now provides a single UI to monitor multiple clusters in the same or different datacenters. Prism Central saves administrators from having to sign in individually to every cluster, and provides aggregated cluster health, alerts and historical data. Administrators are effectively able to manage all Nutanix clusters from the same UI.

(click on the image to enlarge)

  • PowerShell Support and Automation Kit

One of the big new things for workflow automation in Nutanix NOS 4.0 is the addition of PowerShell cmdlets to interact with the Nutanix APIs. The POSH library covers the entire extent and functionality of the Nutanix GUI. Basically, anything that can be done via the GUI can also be done via REST, and can also be done via POSH; including Alerts, Authentication, Backup and Disaster Recovery, Clusters, Containers, Disks, VM and Host, Remote Sites, Multi-Cluster, Storage Pools, SNMP, etc.

  • Smart Support

When enabled by the administrator, the smart support feature collects statistics from all the nodes in the cluster and sends a summary to Nutanix via email. This information is used for debugging and troubleshooting. In the future this data may also be used to auto-diagnose problems and alert administrators of possible misconfigurations.

Stay tuned, more news to come soon!

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Permanent link to this article: http://myvirtualcloud.net/?p=6218

Apr 10 2014

Nutanix has never lost data… and here is why!

One of the things Nutanix is particularly proud of is being able to say that no customer data has ever been lost or damaged due to system or component failures. This is a big achievement for any storage solution vendor.

However, please note that although I have personally done the research and internal questioning about data loss for current and past versions of NOS (Nutanix OS), this is not an official Nutanix article.

In all openness, in the past, when there was no proper confirmation prompt before deleting protection groups, there were a couple of cases where users ended up manually forcing data deletion. But that issue is long gone, and the Nutanix Prism UI now ensures that users really do intend to delete a protection domain.

It’s nice to be able to state something like that, but we are not so bold as to think this could never happen. That is the reason why our engineering team is paranoid about data loss and enforces multiple architectural considerations and checks to ensure data is always protected and available.

Some of these architectural considerations include zero single points of failure or bottlenecks in management services, making the system tolerant to failures. Tolerance of failures is key to a stable, scalable distributed system, and the ability to function in the presence of failures is crucial for availability.

Techniques like vector clocks, two-phase commit, consensus algorithms, leader elections, eventual and strict consistency, multiple replicas, dynamic flow control, rate limiting, exponential back-offs, optimistic replication, automatic failover, hinted-handoffs, data scrubbing, checksumming among others all go towards the ability of Nutanix to handle failures.

NDFS uses a replication factor (RF) and checksums to ensure data redundancy and availability in the case of a node or disk failure or corruption. When a node or disk fails, the data is automatically re-replicated among all nodes in the cluster to maintain the RF; this is called re-protection. Re-protection may also be triggered when a Controller VM is down.

Node and block awareness is a feature that enables the NDFS metadata layer to choose the best placement for data and metadata in the cluster, always ensuring the cluster tolerates single or multiple node failures, or an entire block failure. This is a critical piece in maintaining data availability across big clusters, ensuring data is not just randomly placed on different hosts in the cluster. Moving forward we will also see the ability to ensure data is distributed across racks, or even datacenters.

Because NDFS is always writing data to multiple nodes, it’s extremely important that the consistency model is strict, ensuring that writes are only acknowledged once two or more copies have been successfully committed to disk on different nodes or blocks. This requires a clear understanding of the CAP theorem (Consistency, Availability and Partition Tolerance) (http://en.m.wikipedia.org/wiki/CAP_theorem).
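A minimal sketch of that acknowledgement rule, with commit callbacks standing in for real replica writes (the names and shapes are mine, not NDFS internals): the write is acknowledged only after `rf` copies are durably committed.

```python
def replicated_write(block, commit_fns, rf=2):
    """Illustrative strict-consistency write path.

    `commit_fns` are callables, one per candidate node, each returning
    True once the block is durable on that node. The write is only
    acknowledged (True) after `rf` successful commits."""
    committed = 0
    for commit in commit_fns:
        if commit(block):
            committed += 1
            if committed == rf:
                return True          # safe to acknowledge to the client
    return False                     # not enough replicas: do not acknowledge
```

Refusing to acknowledge until the replica count is met is what makes a node failure immediately after the write non-fatal: a committed copy always survives elsewhere.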

Medusa, the metadata layer, stores and manages all of the cluster metadata in a distributed ring like manner based upon a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency.




“Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures. Paxos is usually used where durability is required (for example, to replicate a file or a database), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive. There is also a mechanism to drop a permanently failed replica or to add a new replica.” This service runs on every node in the cluster.

The larger the cluster, the higher the chance of a double drive failure, which could lead to data loss. Today, with NOS 3.5, NDFS uses RF 2, meaning it tolerates a single drive failure, like RAID 5; but at the same time it is important to understand that the larger the cluster, the lower the chance of a double disk failure causing data loss, because there is less risk of the same data being stored on the two failed drives. Nutanix distributes data across all drives in the cluster in 1MB extents.
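A back-of-the-envelope way to see this: if the RF2 replica of each extent is placed uniformly at random across the other drives (a simplifying assumption, not the real placement policy), the chance that a given extent on a failed drive also lived on a second failed drive shrinks as the drive count grows.

```python
def p_extent_lost(num_drives: int) -> float:
    """Probability that a given extent on a failed drive had its RF2
    replica on one specific other failed drive, assuming replicas are
    spread uniformly across the remaining drives (illustrative model)."""
    return 1.0 / (num_drives - 1)
```

For example, with 5 drives a given extent has a 25% chance of sharing both failed drives, while with 101 drives that drops to 1%; the double-failure risk per extent falls roughly linearly with cluster size.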

The larger the cluster, the faster it can recover from a failure (node or disk), because all nodes in the cluster effectively contribute to rebuilding the lost data. This also lowers the chance of data loss following a drive failure, as NDFS does not thrash a small number of disks in a RAID set to recover from a drive loss (i.e. repairing to a hot spare or replacement drive). The performance impact during recovery from a drive failure is also lower on NDFS than on traditional RAID systems.
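A rough illustration of why rebuilds speed up with cluster size, using an entirely made-up per-node rebuild rate: aggregate rebuild bandwidth grows with the number of surviving nodes, so the estimated rebuild time drops as nodes are added.

```python
def rebuild_time_hours(failed_tb: float, nodes: int, per_node_mbps: float = 200) -> float:
    """Estimate rebuild time when all surviving nodes rebuild in parallel.

    `per_node_mbps` is a hypothetical per-node rebuild rate, purely for
    illustration; the point is the 1/(nodes - 1) scaling, not the figure."""
    aggregate_mbps = per_node_mbps * (nodes - 1)   # every survivor contributes
    seconds = failed_tb * 1024 * 1024 / aggregate_mbps
    return seconds / 3600
```

Under this model, doubling the number of surviving nodes halves the rebuild time, which is exactly the opposite of a RAID set rebuilding onto a single hot spare.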

As I mentioned before, it’s nice to be able to state something like that, but we are not so bold as to think this could never happen. Therefore a robust backup and disaster recovery strategy is extremely important, and Nutanix covers all the bases here too. I am going to discuss backup and DR in a future article, but in the meantime you may watch this failover and failback video I recorded a while back.

I would also like to recommend this article on resiliency by my colleague Damien Philip (http://pdamien58.blogspot.com/2014/03/resilience-part-1.html).


Thanks to Steven Poitras for allowing me to use content from The Nutanix Bible.

Thanks to Michael Webster for revising this article.


This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Permanent link to this article: http://myvirtualcloud.net/?p=6182
