Apr 15 2014

Nutanix 4.0 Features Overview (Beyond Marketing)


Today Nutanix announced the release of NOS 4.0. This is a major release that introduces features in the areas of data services, performance, resiliency, data protection, and management and analytics. It's been eight months since NOS 3.5 was announced with the Elastic De-Duplication Engine, the PRISM UI, a RESTful API, and SRM support, and Nutanix is now delivering another major release. Just like I did for the VMware Horizon View releases when I was at VMware, I am going to start doing the Beyond Marketing series for Nutanix.

Please refer to the product Release Notes for official information from Nutanix.

Let’s start with the New Features…
 

Core Data Services

 

  • Hybrid On-Disk De-Duplication

De-duplication allows guest VM data to be shared on the premium storage tiers (RAM and flash). Guest VM performance suffers when the active data no longer fits in those tiers. If guest VMs are substantially similar, for example when the Nutanix cluster hosts numerous Windows desktops, enabling de-duplication substantially improves performance. Used in the appropriate situation, de-duplication makes the effective size of the premium tiers larger so that the active data can fit.
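
The mechanics are easier to see in miniature. Below is a minimal, hypothetical Python sketch of content-based de-duplication; it is not Nutanix's implementation, and the chunk size and SHA-1 fingerprinting are assumptions for illustration only:

import hashlib

# Toy content-addressed store: identical chunks are stored once and
# shared, so the effective capacity of a premium tier grows with the
# amount of duplicate data (e.g., many similar Windows desktops).
CHUNK_SIZE = 16 * 1024  # hypothetical chunk size

store = {}       # fingerprint -> chunk bytes (the "premium tier")
vdisk_maps = {}  # vdisk name -> ordered list of fingerprints

def write_vdisk(name, data):
    fingerprints = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()  # content fingerprint
        store.setdefault(fp, chunk)           # keep each unique chunk once
        fingerprints.append(fp)
    vdisk_maps[name] = fingerprints

# Two near-identical "desktops" share almost all chunks in the store.
write_vdisk("desktop-1", b"A" * CHUNK_SIZE * 4)
write_vdisk("desktop-2", b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE)
print(len(store), "unique chunks backing 8 logical chunks")  # -> 2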

  • Shadow Clones (Official Support)

Shadow Clones are finally out of tech preview. Shadow Clones intelligently analyze the I/O access pattern at the storage layer to identify files shared in read-only mode (e.g., a linked-clone replica). When a 100% read-only disk is discovered, Nutanix automatically creates a snapshot at the storage layer on each Controller VM (CVM) and redirects all read I/O to the local copy, drastically improving the end-user experience. Read more at Nutanix Shadow Clones Explained and Benchmarked.
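
To make the access-pattern idea concrete, here is a rough Python sketch of the detection logic as I understand it; the class and thresholds are hypothetical, not Nutanix code:

# Watch per-vDisk access patterns; once a vDisk is read by multiple
# remote nodes with no writes, it becomes a candidate for local
# read-only caching on each CVM.
class VDiskAccessTracker:
    def __init__(self):
        self.readers = set()  # nodes that have read this vDisk
        self.writes = 0

    def record(self, node, op):
        if op == "write":
            self.writes += 1
        else:
            self.readers.add(node)

    def is_shadow_clone_candidate(self, min_readers=2):
        # 100% read-only and shared by several nodes, e.g. a
        # linked-clone replica disk.
        return self.writes == 0 and len(self.readers) >= min_readers

tracker = VDiskAccessTracker()
for node in ("cvm-a", "cvm-b", "cvm-c"):
    tracker.record(node, "read")
print(tracker.is_shadow_clone_candidate())  # True: serve reads locally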

 

Performance

 

Multiple performance improvements have been added to NOS 4.0, increasing overall system performance by 20% compared to NOS 3.5. A future article will discuss the NOS 4.0 performance improvements in more detail.

  • Multi-disk OpLog Store

Nutanix now utilizes all SSDs to host the oplog store, increasing on-disk oplog capacity and improving performance, since different vDisks can write oplog data to different SSDs at the same time.
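
As a rough illustration of why spreading the oplog across devices helps, here is a naive sketch with made-up device names; it is not the actual placement algorithm:

from itertools import cycle

# Round-robin oplog placement: concurrent vDisks write their oplog
# data to different SSDs instead of queueing on a single device.
ssds = cycle(["ssd0", "ssd1", "ssd2", "ssd3"])

def place_oplog_episode(vdisk):
    device = next(ssds)  # naive round-robin for illustration
    return f"{vdisk} oplog write -> {device}"

for vd in ("vdisk-1", "vdisk-2", "vdisk-3", "vdisk-4"):
    print(place_oplog_episode(vd))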

  • Other performance improvements include fault isolation at the vDisk level, ODirect I/O in the extent store, and ODirect I/O in the oplog store.

 

Resiliency

 

  • Tunable Fault Tolerance (RF-3)

Replication Factor 3 (also known as FT2) protects data against two simultaneous node, disk, or NIC failures.

  • Smart Pathing (CVM/AutoPathing 2.0)

The new and improved CVM AutoPathing 2.0 prevents performance loss during rolling upgrades, minimizing I/O timeouts by pre-emptively redirecting NFS traffic to other CVMs. Failover traffic is automatically load-balanced across the rest of the cluster based on node load.

  • Availability Domains (Failure Domain Awareness)

Also known as 'Block Fault Tolerance' or 'Rack-able Unit Fault Tolerance', the availability domains feature adds the concept of block awareness to Nutanix cluster deployments. It works by managing the placement of data and metadata in the cluster, ensuring that no two replicas of the same data are stored in the same Nutanix block, for high-availability purposes.
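
A minimal sketch of block-aware placement, assuming a hypothetical topology map (illustrative only, not the NDFS placement code):

# Pick RF nodes such that no two replicas share a block.
topology = {
    "node-a1": "block-a", "node-a2": "block-a",
    "node-b1": "block-b", "node-b2": "block-b",
    "node-c1": "block-c", "node-c2": "block-c",
}

def place_replicas(rf):
    chosen, used_blocks = [], set()
    for node, block in topology.items():
        if block not in used_blocks:
            chosen.append(node)
            used_blocks.add(block)
        if len(chosen) == rf:
            return chosen
    raise RuntimeError("not enough blocks to honor block awareness")

print(place_replicas(2))  # two replicas land in two different blocks
print(place_replicas(3))  # RF-3: three replicas across three blocks

With this property, losing an entire block takes out at most one replica of any given piece of data.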

[Image: host failure fault tolerance]

 

Data Protection

 

  • Snapshot Browser

The new snapshot browser allows administrators to view and restore point-in-time, array-based snapshots of a VM or a group of VMs in a local or remote protection domain. This functionality is backed by a detailed scheduling engine that allows for very granular, application-consistent snapshots.


  • Snapshot Scheduling via PRISM

The Nutanix UI now provides calendar-based scheduling for backups and replication, with the ability to specify data retention policies per remote site. Nutanix effectively delivers a unified pane of glass that allows administrators to configure and manage local and remote VM/file backups.

  • Improved Nutanix Storage Replication Adapter (SRA)

The Nutanix SRA now features fast detection of the files corresponding to the VMs protected in SRM, with support for up to 50 VMs per vStore protected group in SRM. Support for multiple SRM devices in an SRM protection group has been added, as well as support for executing multiple SRM recovery plans in parallel.

  • Disaster Recovery Support for Hyper-V

Nutanix 4.0 extends its DR capabilities to Hyper-V, providing a VM-centric, native disaster recovery solution. Hyper-V support now has feature parity with Nutanix DR for ESX. Being VM-centric means that, in addition to protecting the files associated with a VM, Nutanix also orchestrates powering down, un-registering, registering/cloning, and powering on the VM in the destination cluster/site.

 

Management and Analytics

 

  • One-Click NOS Upgrade

As the name says, it's a one-click NOS upgrade for the entire Nutanix cluster. The one-click upgrade automatically indicates when a new NOS version is available and will download the binaries if the auto-download option is enabled. With a single click, Nutanix upgrades all nodes in the cluster using a highly parallel process, rebooting one CVM at a time in a rolling upgrade. The administrator can fully monitor the entire cluster upgrade.
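
Conceptually, the rolling part looks something like this sketch (hypothetical names; the real orchestration is far more involved):

import time

cvms = ["cvm-1", "cvm-2", "cvm-3", "cvm-4"]

def stage_binaries(cvm):
    # Staging new bits is non-disruptive and can happen in parallel.
    print(f"staging new NOS binaries on {cvm}")

def restart_and_wait(cvm):
    # Restarts happen strictly one CVM at a time so the cluster
    # keeps serving I/O throughout the upgrade.
    print(f"restarting {cvm} and waiting for services to come back")
    time.sleep(0.1)  # stand-in for a real health check

for cvm in cvms:
    stage_binaries(cvm)
for cvm in cvms:
    restart_and_wait(cvm)
print("cluster upgraded with at most one CVM down at any moment")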

  • Cluster Health

Nutanix Cluster Health is a great asset in maintaining availability for Tier 1 workloads. Cluster Health provides the ability to monitor and visualize the overall health of cluster nodes, VMs, and disks from a variety of views. With the ability to set availability requirements at the workload level, Cluster Health visually dissects what's important and gives you guidance on how to take corrective action.


  • Prism Central (Multi-Cluster UI)

Nutanix now provides a single UI to monitor multiple clusters in the same or different datacenters. Prism Central saves administrators from having to sign in to every cluster individually and provides aggregated cluster health, alerts, and historical data. Administrators are effectively able to manage all Nutanix clusters from the same UI.


  • PowerShell Support and Automation Kit

One of the big new things for workflow automation in NOS 4.0 is the addition of PowerShell cmdlets that interact with the Nutanix APIs. The POSH library covers the full extent of the functionality in the Nutanix GUI. Basically, anything that can be done via the GUI can also be done via REST, and can also be done via POSH, including alerts, authentication, backup and disaster recovery, clusters, containers, disks, VMs and hosts, remote sites, multi-cluster management, storage pools, SNMP, etc.
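
Since everything in the GUI is also exposed via REST, scripting against the API is straightforward from any language. Here is a hedged Python example; the gateway path, port, and field names are assumptions based on Prism's REST interface, so verify them against your version's API documentation:

import requests

CLUSTER = "https://prism.example.local:9440"  # hypothetical address

resp = requests.get(
    f"{CLUSTER}/PrismGateway/services/rest/v1/vms",
    auth=("admin", "password"),  # replace with real credentials
    verify=False,                # lab only: self-signed certificate
)
resp.raise_for_status()
for vm in resp.json().get("entities", []):
    print(vm.get("vmName"))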

  • Smart Support

When enabled by the administrator, the Smart Support feature collects statistics from all the nodes in the cluster and sends a summary to Nutanix via email. This information is used for debugging and troubleshooting. In the future this data may also be used to auto-diagnose problems and alert administrators of possible misconfigurations.

Stay tuned, more news to come soon!

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Permanent link to this article: http://myvirtualcloud.net/?p=6218

Apr 10 2014

Nutanix has never lost data… and here is why!

One of the things Nutanix is particularly proud of is being able to say that no customer data has ever been lost or damaged due to system or component failures. This is a big achievement for any storage solution vendor.

However, please note that although I have personally done the research and internal questioning about data loss for current and past versions of NOS (Nutanix OS), this is not an official Nutanix article.

In all openness: in the past, when there was no proper confirmation prompt before deleting protection groups, there were a couple of cases where users ended up manually forcing data deletion. But that issue is long gone, and the Nutanix Prism UI now ensures that users really do intend to delete a protection domain.

It's nice to be able to state something like that, but we are not so bold as to think it could never happen. That is why our engineering team is paranoid about data loss and enforces multiple architectural considerations and checks to ensure data is always protected and available.

Some of these architectural considerations include zero single points of failure or bottlenecks for management services, making the system tolerant of failures. Tolerance of failures is key to a stable, scalable distributed system, and the ability to function in the presence of failures is crucial for availability.

Techniques like vector clocks, two-phase commit, consensus algorithms, leader election, eventual and strict consistency, multiple replicas, dynamic flow control, rate limiting, exponential back-offs, optimistic replication, automatic failover, hinted handoffs, data scrubbing, and checksumming, among others, all contribute to Nutanix's ability to handle failures.
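
To give a flavor of one of these, here is exponential back-off with jitter in miniature (a generic pattern, not Nutanix source):

import random
import time

def with_backoff(op, attempts=5, base=0.1, cap=5.0):
    # Retry a flaky operation with increasing, randomized delays
    # instead of hammering a struggling peer.
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
    raise RuntimeError("operation failed after retries")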

NDFS uses a replication factor (RF) and checksums to ensure data redundancy and availability in the case of a node or disk failure or corruption. When a node or disk fails, the data is automatically re-replicated among all nodes in the cluster to maintain the RF; this is called re-protection. Re-protection may also be triggered while a Controller VM is down.
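
The interplay of checksums and re-protection can be sketched like this (illustrative only; extent contents and node names are made up):

import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

# An extent stored with RF2 alongside its expected checksum.
replicas = {"node-1": b"extent data", "node-2": b"extent data"}
expected = checksum(b"extent data")

replicas["node-2"] = b"extent dat\x00"  # simulate silent corruption

for node, data in list(replicas.items()):
    if checksum(data) != expected:
        # Re-protect from any healthy copy elsewhere in the cluster.
        good = next(d for d in replicas.values()
                    if checksum(d) == expected)
        replicas[node] = good
        print(f"re-protected corrupt replica on {node}")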

Node and block awareness is a feature that enables the NDFS metadata layer to choose the best placement for data and metadata in the cluster, always ensuring that the cluster tolerates single or multiple node failures, or an entire block failure. This is a critical piece in maintaining data availability across big clusters, ensuring data is not just randomly placed on different hosts. Moving forward we will also see the ability to ensure data is distributed across racks, or even datacenters.

Because NDFS is always writing data to multiple nodes, it's extremely important that the consistency model is strict, ensuring that writes are only acknowledged once two or more copies have been successfully committed to disk on different nodes or blocks. This requires a clear understanding of the CAP theorem (Consistency, Availability, and Partition tolerance; http://en.m.wikipedia.org/wiki/CAP_theorem).
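
In sketch form, the write path this paragraph describes looks like the following; the helper names are hypothetical, and the point is that acknowledgment waits for every replica commit:

def commit_to_disk(node, data):
    # Stand-in for a synchronous, durable commit on one node.
    print(f"committed {len(data)} bytes on {node}")
    return True

def replicated_write(data, nodes, rf=2):
    committed = [n for n in nodes[:rf] if commit_to_disk(n, data)]
    if len(committed) == rf:
        return "ACK"  # safe: rf durable copies on distinct nodes
    raise IOError("write not acknowledged; replication incomplete")

print(replicated_write(b"guest write", ["node-1", "node-2", "node-3"]))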

Medusa, the metadata layer, stores and manages all of the cluster metadata in a distributed, ring-like manner based on a heavily modified Apache Cassandra. The Paxos algorithm is utilized to enforce strict consistency.

 

[Image: the NDFS metadata ring]

 

"Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures. Paxos is usually used where durability is required (for example, to replicate a file or a database), in which the amount of durable state could be large. The protocol attempts to make progress even during periods when some bounded number of replicas are unresponsive. There is also a mechanism to drop a permanently failed replica or to add a new replica." This service runs on every node in the cluster.
http://en.wikipedia.org/wiki/Paxos_(computer_science)
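
For the curious, a single-decree Paxos round fits in a few lines of Python. This is a textbook simplification (synchronous, in-memory, no failure handling), not Cassandra's or Nutanix's implementation:

class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot promised so far
        self.accepted = None  # (ballot, value) accepted, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1: gather promises from a majority of acceptors.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, re-propose that value.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask the acceptors to accept; majority wins.
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, ballot=1, value="replica-set-A"))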

The larger the cluster, the higher the chance of a double drive failure, which could lead to data loss. Today, with NOS 3.5, NDFS uses RF2, meaning it tolerates a single drive failure, like RAID 5. At the same time, the larger the cluster, the lower the chance of a double disk failure actually causing data loss, because there is less risk of the same data being stored on the two failed drives: Nutanix distributes data across all drives in the cluster in 1MB extents.
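
A back-of-the-envelope model shows why (assumed math, not official Nutanix numbers): with RF2 and replicas placed on random distinct drives, a given extent is lost in a double drive failure only if both of its copies sat on exactly those two drives, which happens with probability 1 / C(n, 2) for n drives:

from math import comb

for n_drives in (8, 24, 80):
    pairs = comb(n_drives, 2)  # number of possible drive pairs
    print(f"{n_drives} drives: a given extent occupies the failed "
          f"pair with probability 1/{pairs}")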

The larger the cluster, the faster it can recover from a failure (node or disk), because all nodes in the cluster effectively contribute to rebuilding the lost data. This process also lowers the chance of data loss after a single drive failure, since NDFS does not thrash a small number of disks in a RAID set to recover from a drive loss (i.e., rebuilding to a hot spare or replacement drive). The performance impact during recovery from a drive failure is also lower on NDFS than on traditional RAID systems.

As I mentioned before, it's nice to be able to state something like that, but we are not so bold as to think it could never happen. Therefore a robust backup and disaster recovery strategy is extremely important, and Nutanix covers all the bases here as well. I am going to discuss backup and DR in a future article, but in the meantime you may watch this failover and failback video I recorded a while back.

I would also like to recommend this article on resilience by my colleague Damien Philip (http://pdamien58.blogspot.com/2014/03/resilience-part-1.html).

 

Thanks to Steven Poitras for allowing me to use content from The Nutanix Bible.

Thanks to Michael Webster for revising this article.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Permanent link to this article: http://myvirtualcloud.net/?p=6182

Apr 08 2014

VMware vExpert 2014

Last week I found out I have been awarded the VMware vExpert 2014.

 


 

The vExpert program recognizes those who have demonstrated significant contributions to the community and a willingness to share their expertise with others.

This is the 5th consecutive year I have received the vExpert award, and it's an honor to be part of this select group of people who, in one way or another, go out of their way, beyond their day-to-day responsibilities, to help the virtualization community. This group of bloggers, podcasters, web-casters, and tweeters is amazing.

Congratulations to all vExperts 2014!
 
Original Blog Post by John Troyer: vExpert 2014 awardees announced

I would like to thank John Troyer, Corey Romero, and the VMware Social Media & Community Team for their efforts to move this program forward internally at VMware.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.

Permanent link to this article: http://myvirtualcloud.net/?p=6175

