Nutanix recently announced NOS 4.1. This release is mostly focused on enhancements in the areas of resiliency, security, disaster recovery, analytics, supportability and management. However, even though it is a ‘dot’ release, NOS 4.1 delivers very important features, and in my opinion this version has enough meat to be considered a major release.
It’s been only 5 months since NOS 4.0 was announced with the introductions of Hybrid On-Disk De-Duplication, Failure Domain Awareness, Disaster Recovery for Hyper-V, Snapshot Scheduling, One-Click NOS Upgrade and others. If you missed the NOS 4.0 release announcement read about it at Nutanix 4.0 Features Overview (Beyond Marketing).
This is the power of software-defined architectures running on standard x86 hardware, with no special-purpose machine doing one thing and one thing only. The software approach allows for faster release cycles, and whenever the underlying hardware gets faster you get to enjoy the performance benefits right away.
Please refer to the official product Release Notes for official information from Nutanix.
Let’s start with the New Features…
- Cloud Connect
Nutanix allows administrators to implement and manage VM centric Disaster Recovery policies across multiple sites and datacenters using a multi-topology architecture.
Whilst the Nutanix built-in DR capabilities allow administrators to specify snapshot retention policies, this approach can be expensive even when data de-duplication and compression are already enabled across the cluster. There must be enough storage capacity available to retain the data and to handle multiple backups and snapshots over a long period of time.
Nutanix has now officially announced the ability to leverage the global, distributed and highly available infrastructure of Amazon Web Services for data backup and restore. That means that an on-premise Nutanix cluster is now able to back up and restore virtual machines to AWS, with organizations billed directly by AWS for EC2 and S3 costs.
With Cloud Connect, organizations are able to schedule, manage and use local and remote snapshots and replication for backup and disaster recovery from within the Nutanix PRISM user interface.
- Local snapshots with Time Stream
- Backup to Amazon S3
- Integration with VSS and SRM
- Quick restore and state recovery
- WAN-optimized replication for DR
- Works with ESXi and Hyper-V
- 15 minute Cloud RPO
Administrators also have the ability to control and fine-tune the backup schedule and retention policies to meet the needs of the workload. Backups can run as often as every 15 minutes or as rarely as once a day, enabling you to choose the frequency that best meets the infrastructure capacity and SLAs.
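To make the schedule and retention trade-off concrete, here is a minimal Python sketch of how many snapshots a given policy keeps around. The 15 minute and one day bounds come from the text above; the function name and the `retention_days` parameter are my own illustrative assumptions, not a Nutanix API.

```python
MIN_INTERVAL_MIN = 15        # shortest schedule mentioned in the article
MAX_INTERVAL_MIN = 24 * 60   # longest schedule mentioned in the article (one day)


def snapshots_retained(interval_min, retention_days):
    """Return how many snapshots exist at steady state for a schedule.

    interval_min   -- minutes between backups (hypothetical parameter name)
    retention_days -- how long each snapshot is kept (hypothetical parameter name)
    """
    if not MIN_INTERVAL_MIN <= interval_min <= MAX_INTERVAL_MIN:
        raise ValueError("interval must be between 15 minutes and one day")
    if retention_days < 0:
        raise ValueError("retention_days must be >= 0")
    return (retention_days * 24 * 60) // interval_min


# A week of retention at both ends of the allowed schedule range.
frequent = snapshots_retained(15, 7)
daily = snapshots_retained(24 * 60, 7)
print(frequent, daily)
```

The gap between the two results is the reason the schedule should follow the SLA rather than simply being set as tight as possible.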
A Nutanix NOS instance (software-only) runs on AWS, and each AWS NOS instance is managed as a remote site in Nutanix; it’s all integrated and managed via a single pane of glass. In AWS the remote site requires an M1.xlarge instance, and the AWS NOS instance may be created in an availability zone of your choice, increasing reliability and resiliency as needed by your organization. You can have as many AWS NOS instances as needed.
The NOS cluster services run on this AWS virtual instance and use Amazon EBS for metadata and S3 to store the backup data. All communication is done via VPC or SSH tunnels, with optimized, de-duplicated data transfer for both backup and restore operations. Data is de-duplicated and compressed before it is backed up to the public cloud, with net savings of 75% in both network bandwidth usage and storage footprint. It is also possible to throttle the network bandwidth consumed by backups to ensure that application performance is not affected. Note that data transfer throughput with SSH is approximately 25% lower than with VPC, so VPC is the recommended method.
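As a back-of-the-envelope illustration of those two numbers, the Python sketch below estimates how much data actually crosses the wire and how long one backup takes over VPC versus SSH. The 75% savings and 25% SSH penalty are the approximate figures quoted above and will vary with the data; the 100 GB change set and 100 Mbit/s link are made-up inputs, and the function is hypothetical rather than part of any Nutanix tooling.

```python
def cloud_backup_estimate(changed_gb, link_mbps, savings=0.75,
                          ssh_penalty=0.25, use_ssh=False):
    """Return (GB sent to the cloud, transfer time in seconds) for one backup.

    savings and ssh_penalty default to the approximate figures quoted in the
    article; changed_gb and link_mbps are hypothetical inputs.
    """
    if changed_gb < 0 or link_mbps <= 0:
        raise ValueError("changed_gb must be >= 0 and link_mbps must be > 0")
    sent_gb = changed_gb * (1.0 - savings)
    effective_mbps = link_mbps * (1.0 - ssh_penalty) if use_ssh else link_mbps
    megabits = sent_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return sent_gb, megabits / effective_mbps


sent, seconds_vpc = cloud_backup_estimate(100, 100)
_, seconds_ssh = cloud_backup_estimate(100, 100, use_ssh=True)
print(f"{sent:.0f} GB sent, VPC {seconds_vpc / 60:.1f} min, SSH {seconds_ssh / 60:.1f} min")
```

With these inputs only 25 GB of the 100 GB change set is transferred, taking roughly 33 minutes over VPC and roughly 44 minutes over SSH.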
Just like any other cloud service, AWS is subject to failures and outages. For this reason the AWS NOS instance takes automatic periodic snapshots of the EBS volumes, which are stored in S3. An AWS failure or unavailability will automatically raise an alert on the on-premise NOS cluster, maintaining that single pane of glass for managing all your Nutanix clusters and instances.
I will soon be working on a video demo to publish here, but in the meantime you can watch this video with Disaster Recovery – Failover and Failback with Nutanix.
Nutanix’s scale-out solution already handles the large majority of workloads using existing NX models, but certain workloads that have very large active datasets or are write I/O intensive can benefit from additional performance. The NX-8150 is the new platform for those Tier 1 business-critical workloads, and it can be mixed with existing Nutanix clusters.
The NX-8150 is the first platform to fully leverage the Nutanix scalable write buffer, which improves performance and drives down latencies for the most demanding applications. In addition, the larger SSD tier for active data satisfies the latency and IOPS requirements for database workloads, without the excessive cost of an all-flash appliance.
The ability to mix different node types in a single cluster makes it practical and easier to eliminate infrastructure silos in datacenters.
The NX-8150 is targeted at Microsoft Exchange, SharePoint, SAP, and high-performance databases such as MS SQL Server and Oracle RAC, and it comes with 4 times the number of flash devices compared to previous Nutanix platforms to accommodate a much larger active data set. The NX-8150 is also the first Nutanix platform that offers flexible configuration of storage and server (compute) resources from the factory, including multiple CPU options, 3 different SSD configurations, a variety of memory profiles, and multiple options for connectivity, including the ability to expand to 4x 10GbE ports.
Another benefit of the NX-8150 platform with flexible configuration is the ability to minimize the number of software licenses (applications and hypervisor) to be purchased. Think vSphere and Oracle.
Tests and validations will soon come out from Nutanix performance and engineering labs. Here is an example of such validations for Microsoft Exchange workloads: using the new platform, with only a 2U footprint, Nutanix is able to manage 6x more mailboxes per node (3,300+ mailboxes per node following Microsoft testing guidelines and practices) and can linearly scale the system to support 100,000s of mailboxes simply by adding NX-8150 nodes.
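Taking the linear scaling claim at face value, a quick sizing sketch in Python looks like this. The 3,300 mailboxes per node is the figure quoted above; the function is purely illustrative and ignores any headroom you would want for node failures or growth.

```python
def nodes_for_mailboxes(mailboxes, per_node=3300):
    """Return the minimum node count, assuming perfectly linear scaling.

    per_node defaults to the per-node figure quoted in the article.
    """
    if mailboxes < 0 or per_node <= 0:
        raise ValueError("mailboxes must be >= 0 and per_node must be > 0")
    return -(-mailboxes // per_node)  # ceiling division without floats


for target in (3300, 100_000, 250_000):
    print(target, "mailboxes ->", nodes_for_mailboxes(target), "nodes")
```

Under that assumption, 100,000 mailboxes need 31 nodes, since 30 nodes would top out at 99,000.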
- NX-9240 (All Flash)
Nutanix’s scale-out solution already handles the large majority of workloads using existing NX models. For Tier 1 business-critical workloads that have very large active datasets or are write I/O intensive and require additional performance, Nutanix introduced the NX-8150 as part of the NOS 4.1 release, and it can be mixed with existing Nutanix clusters.
The new NX-9240 appliance is built to run applications with very large working sets, such as databases supporting online transaction processing (OLTP), that not only require exceptionally fast storage performance but also demand the predictable and consistent I/O latency that flash can deliver. The new NX-9240 is 100% all-flash storage, offering ~20TB raw per 2U.
Flash capacity is optimized using Nutanix’s scale-out compression and de-duplication technologies that leverage unused compute resources across all nodes in the cluster, avoiding performance bottlenecks.
Unlike other solutions, this is a true scale-out All Flash storage where storage capacity and performance are augmented simply by adding nodes, one-at-a-time, non-disruptively, for 100% linear scalability with no maximum limit.
In this first release (NOS 4.1) the NX-9240 All Flash nodes cannot be mixed with other node types because the new nodes do not have the concept of automated tiering, given it’s all flash. Therefore a new cluster must be created only with NX-9240 nodes; however all other NOS capabilities such as disaster recovery, backup and even the new metro cluster availability can be used between different clusters. A future release of NOS will allow the mix and match of All Flash and Hybrid nodes.
Security and Compliance
- Data At Rest Encryption (NX-3060-E, NX-3061-E, NX-6060-E)
Nutanix clusters are deployed in a variety of customer environments requiring different levels of security, including sensitive/classified environments. These customers typically harden IT products deployed in their datacenters based on very specific guidelines, and are mandated to procure products that have obtained industry standard certifications.
Data-at-rest encryption is one such key criterion that customers use to evaluate a product when procuring IT solutions to meet their project requirements.
Nutanix Data-at-Rest encryption satisfies regulatory requirements for government agencies, banking, financial, healthcare and other G2000 enterprise customers who consider data security products and solutions. This new feature allows Nutanix customers to encrypt all or selected partitions on persistent storage using a strong encryption algorithm and only allows access to this data (decryption) when presented with the correct credentials.
- Compliant with regulatory requirements for data at rest encryption
- Leverages FIPS 140-2 Level-2 validated self-encrypting drives
- Future proof (uses open standard protocols: KMIP, TCG)
To enable Nutanix DRE a 3rd party key management server is required. At the time of the launch only ESXi is supported and only the SafeNet KeySecure Cryptographic Key Management System is certified, but over time other key management systems will be supported. Nutanix supports any KMIP 1.0 compliant key management system, but others have not yet been certified. The key management system can even be a VM running on the Nutanix cluster.
Currently it is not possible to mix a Nutanix DRE enabled cluster with a non-DRE cluster because the platform requires special FIPS 140-2 Level 2 SED drives to meet the data at rest encryption requirements. By breaking the homogeneity of the cluster, one would violate the data at rest encryption requirement for copies of data stored on non-SED drives. However, both DRE and non-DRE clusters can be managed via the same PRISM Central UI.
- One-Click Hypervisor and Firmware Upgrade
I kinda soft-launched this feature in my article Nutanix One-Click Upgrade now takes care of Firmware and Hypervisor too! after Nutanix CEO Dheeraj Pandey revealed the new feature in a tweet. Love it!
In addition to the already non-disruptive NOS upgrade, the One-Click upgrade feature now ensures that the hypervisor (vSphere and Hyper-V) is also automatically upgraded in a distributed and rolling fashion. Nutanix is solving a huge problem, one that otherwise requires either manual intervention or external automation tools.
In addition to the new hypervisor upgrade feature Nutanix is now enabling NCC (Nutanix Cluster Health) upgrade and hardware firmware upgrade. Let’s talk about firmware upgrade.
Upgrading firmware on servers, controllers, disks and other interfaces is probably one of the major pain points for datacenter administrators. The workflow normally goes as follows: schedule a maintenance outage, go to the manufacturer website, download the correct firmware, flash the firmware via a web interface or with a Linux CLI tool, reboot the server, and then repeat this task for every server in the datacenter that requires a firmware upgrade. It’s complex and lengthy.
Because NOS runs in hypervisor user space Nutanix is able to execute complex firmware and hardware upgrade processes across multiple server components without requiring reboots. Using Nutanix highly distributed parallel processing the One-Click upgrade takes care of firmware upgrade for all disk devices in the cluster with zero impact to virtual machines and workloads. No reboots required, true AlwaysOn!
As a bonus the framework that serves as the engine for Nutanix Cluster Health, also known as NCC, is also part of the One-Click upgrade process.
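For readers who have not automated this before, here is a minimal Python sketch of what a rolling hypervisor upgrade means in general: one node at a time, VMs moved away first, a health check before moving on. This is not Nutanix's implementation; the `Node` class, `rolling_upgrade` function, version labels and health check are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    version: str
    vms: list = field(default_factory=list)


def rolling_upgrade(nodes, target_version, is_healthy=lambda node: True):
    """Upgrade nodes one at a time, stopping at the first failed health check.

    Returns a list of log lines describing what happened.
    """
    if len(nodes) < 2:
        raise ValueError("a rolling upgrade needs at least two nodes")
    log = []
    for node in nodes:
        if node.version == target_version:
            continue  # already upgraded, nothing to do
        peers = [n for n in nodes if n is not node]
        # Move running VMs to the other nodes (round-robin) before touching this one.
        displaced, node.vms = node.vms, []
        for i, vm in enumerate(displaced):
            peers[i % len(peers)].vms.append(vm)
        node.version = target_version  # stands in for the real upgrade and reboot
        if not is_healthy(node):
            log.append(f"{node.name}: health check failed, upgrade halted")
            break
        log.append(f"{node.name}: upgraded to {target_version}")
    return log


cluster = [Node("node-a", "v1", ["vm1", "vm2"]),
           Node("node-b", "v1", ["vm3"]),
           Node("node-c", "v1", [])]
for line in rolling_upgrade(cluster, "v2"):
    print(line)
```

The sketch deliberately leaves VMs wherever they landed; a real system would rebalance afterwards. The point is only that the cluster keeps serving workloads because no more than one node is out of service at any moment.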
- Metro Availability
Over the last couple years Nutanix introduced many features around availability and resiliency to the platform. Today Nutanix has built-in self-healing capabilities, node resilience and tunable redundancy features, virtual machine centric backup and replication, automated backup to Cloud, and many other features vital for running enterprise workloads.
However, business-critical applications demand continuous data availability. This means that access to application and user data must be preserved even during a datacenter outage or planned maintenance event. Many IT teams use metro area networks to maintain connectivity between datacenters so that if one site goes down the other location can run all applications and services with minimal disruption. Keeping the applications running, however, requires immediate access to all data.
The new Nutanix Metro Availability feature stretches datastores and containers for virtual machine clusters across two or more sites located up to 400km apart. The mandatory synchronous data replication is natively integrated into Nutanix, requiring no hardware changes. During the data replication Nutanix uses advanced compression technologies for efficient network communications between datacenters, saving bandwidth and speeding data management.
For existing Nutanix customers it is good to know that the implementation of the metro availability feature uses the same concept of data protection groups that already exists in PRISM for backup and replication across Nutanix clusters, now adding a synchronous replication option where administrators are also able to monitor and manage cluster peering and promote containers or break peers.
By default, the container on one side (site) is the primary point of service, and the other side (site) is the secondary and synchronously receives a copy of all the data blocks written at the primary site. Since this is done at the container level, it’s possible to have multiple containers and datastores, and the direction of replication can be defined per container.
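The primary/secondary idea can be sketched in a few lines of Python. This is a toy model under my own assumptions, not how NOS implements it: the `Site` and `StretchedContainer` classes and their methods are hypothetical, and real synchronous replication also has to deal with failures, timeouts and ordering.

```python
class Site:
    """A cluster at one location, holding blocks per container."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # container name -> {block id: data}

    def store(self, container, block_id, data):
        self.blocks.setdefault(container, {})[block_id] = data
        return True


class StretchedContainer:
    """A container with a primary site and a synchronously updated secondary."""

    def __init__(self, name, primary, secondary):
        self.name = name
        self.primary = primary
        self.secondary = secondary

    def write(self, block_id, data):
        # Synchronous replication: acknowledge only after BOTH sites hold the block.
        ok_primary = self.primary.store(self.name, block_id, data)
        ok_secondary = self.secondary.store(self.name, block_id, data)
        return ok_primary and ok_secondary

    def promote_secondary(self):
        # Swap roles, for example after losing the primary site.
        self.primary, self.secondary = self.secondary, self.primary


site_a, site_b = Site("site-a"), Site("site-b")
# The replication direction is chosen per container.
ctr1 = StretchedContainer("ctr1", primary=site_a, secondary=site_b)
ctr2 = StretchedContainer("ctr2", primary=site_b, secondary=site_a)
ack1 = ctr1.write(0, b"hello")
ack2 = ctr2.write(0, b"world")
print(ack1, ack2, site_b.blocks["ctr1"][0], site_a.blocks["ctr2"][0])
```

Because the write is only acknowledged once both copies exist, losing either site loses no acknowledged data, which is where the zero RPO comes from.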
Nutanix Metro Availability supports heterogeneous deployments and does not require identical platforms and hardware configurations at each site. Virtualization teams can now non-disruptively migrate virtual machines between sites during planned maintenance events, providing continuous data protection with zero recovery point objective (RPO) and a near zero recovery time objective (RTO).
The requirements to enable metro availability are simple: enough bandwidth to handle the data change rate, and a round trip time of <=5ms. A redundant network link is also highly recommended.
- <=5ms RTT
- Bandwidth depends on ‘data change rate’
- Recommended: redundant physical networks between sites
- 2 Nutanix clusters, one on each site
- Mixing hardware models allowed
- ESXi (other hypervisors soon)
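The 400km distance and the 5ms round trip limit fit together, as this small Python sketch shows. It assumes roughly 5 microseconds of one-way propagation delay per kilometre of fiber, which is a common rule of thumb and not a Nutanix figure; real links add switching and queuing delay and rarely follow a straight line. The helper names are hypothetical.

```python
FIBER_US_PER_KM = 5.0   # assumed one-way propagation delay in optical fiber
MAX_RTT_MS = 5.0        # round trip limit stated in the requirements above


def fiber_rtt_ms(distance_km, us_per_km=FIBER_US_PER_KM):
    """Best-case round trip time over fiber, ignoring switching and queuing."""
    return 2 * distance_km * us_per_km / 1000.0


def metro_blockers(rtt_ms, link_mbps, change_rate_mbps):
    """Return the hard requirements that are not met (empty list means OK).

    Redundant links are recommended rather than required, so they are not checked.
    """
    blockers = []
    if rtt_ms > MAX_RTT_MS:
        blockers.append(f"RTT {rtt_ms} ms exceeds {MAX_RTT_MS} ms")
    if link_mbps < change_rate_mbps:
        blockers.append("link bandwidth is below the data change rate")
    return blockers


best_case = fiber_rtt_ms(400)
print(best_case, metro_blockers(best_case, link_mbps=1000, change_rate_mbps=200))
```

At 400km the propagation delay alone is about 4ms round trip, which leaves only around 1ms for everything else, so measure the actual RTT rather than relying on distance.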
What I like the most about the Nutanix platform is that using One-Click NOS, Hypervisor and Firmware Upgrade customers will be able to start using the new feature as soon as it is available. This is the power of the true software-defined datacenter.
- Configurable remote Syslog forwarding enables you to send logs to a remote server using the TCP/UDP protocols. Syslog is a standard for computer message logging. It permits separation of the software that generates messages from the system that stores them and the software that reports and analyzes them. In Nutanix each log in /home/nutanix/data/logs/ is prefixed with the name of the module (for example, cassandra) generating the information.
- The multi-cluster management feature (also known as Prism Central) now allows convenient and automated cluster NOS upgrades through a web console upgrade dialog. Automatic software alerts notify you of available upgrades, which you can install manually or automatically. This is in addition to the One-Click Hypervisor and Firmware Upgrade already supported by Prism, enabling even more powerful multi-datacenter management. You can see a full demo of Prism Central at Nutanix PRISM Central Demo Video (multi-datacenter management).
- Volume Shadow Copy Service (VSS) support for Hyper-V hosts. If you are interested in the Hyper-V support roadmap I recommend reading this article by Tim Isaacs, Nutanix and Microsoft Private Cloud: We continue our journey with Microsoft. There’s a lot happening in the Hyper-V, SCVMM and Azure arena.
- Improvements on the handling of NOS Oplog for better performance and stability. The Oplog is similar to a filesystem journal and is built to handle bursty writes, coalesce them and then sequentially drain the data to the extent store. You will find more tech info about OpLog and Data Path at The Nutanix Bible. The performance improvements will come with the official release notes and PR for NOS 4.1.
- Simplified drive replacement procedure, making all HDDs and SSDs on all Nutanix supported platforms hot-swappable for both local and remote disks. The drive replacement can now be fully monitored via the PRISM GUI and nCLI. Additionally, it is possible to clearly identify the location of the failed drive via Prism and the drive carrier LED.
- The chassis LED can now be turned on or off from the Prism UI Hardware page in either the Diagram or Table view to help identify the correct Nutanix block. In large data centers this is a must-have feature.
- Support for Dell XC720xd series hardware, in accordance with the Nutanix Announces Global Agreement with Dell announcement.
- Shadow clones are now enabled by default. When a vDisk is read by multiple VMs (such as the base image for a VDI clone pool), NOS caches the vDisk on all the nodes in the cluster. Nutanix Shadow Clones allow for distributed caching of a particular disk or VM data that is in a ‘multi-reader’ scenario. This works in any multi-reader scenario (e.g. deployment servers, repositories, etc.). Read more about Shadow Clones at Nutanix Shadow Clones Explained and Benchmarked.
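To illustrate the multi-reader caching idea behind Shadow Clones from the list above, here is a toy Python model. It is a sketch under my own assumptions, not the NOS algorithm: the `ShadowCloneTracker` class is hypothetical, and the threshold of two reader nodes is an arbitrary value chosen for the example.

```python
class ShadowCloneTracker:
    """Toy model of per-node caching for a vDisk read from several nodes."""

    def __init__(self, owner_node, min_reader_nodes=2):
        self.owner_node = owner_node
        self.min_reader_nodes = min_reader_nodes  # hypothetical threshold
        self.reader_nodes = set()
        self.local_caches = {}   # node -> {block id: data}
        self.remote_reads = 0    # reads that had to cross the network

    def read(self, node, block_id, vdisk):
        if node == self.owner_node:
            return vdisk[block_id]          # owner always reads locally
        self.reader_nodes.add(node)
        if len(self.reader_nodes) >= self.min_reader_nodes:
            cache = self.local_caches.setdefault(node, {})
            if block_id not in cache:
                self.remote_reads += 1      # first touch: fetch and keep a copy
                cache[block_id] = vdisk[block_id]
            return cache[block_id]
        self.remote_reads += 1              # single reader: plain remote read
        return vdisk[block_id]


base_image = {0: b"boot", 1: b"os"}
tracker = ShadowCloneTracker(owner_node="node-a")
tracker.read("node-b", 0, base_image)   # one remote reader so far
tracker.read("node-c", 0, base_image)   # second reader node: caching starts
tracker.read("node-c", 0, base_image)   # served from node-c's local cache
print(tracker.remote_reads)
```

Once the vDisk is recognised as multi-reader, each node pays the network cost for a block only once, which is why a VDI boot storm against a shared base image benefits so much.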
- System Center Operations Manager and System Center Virtual Machine Manager
Nutanix now offers single pane of glass management in Microsoft environments with full Storage Management Initiative Specification (SMI-S) support by SNIA.
“The SMI-S defines a method for the interoperable management of a heterogeneous Storage Area Network (SAN), and describes the information available to a WBEM Client from an SMI-S compliant CIM Server and an object-oriented, XML-based, messaging-based interface designed to support the specific requirements of managing devices in and through SANs” Here is a good introduction to SMI-S by the Microsoft team.
The integration allows Microsoft administrators to monitor performance and health of Nutanix software objects such as clusters, storage containers, controller VMs and others via SCVMM, and also to monitor Nutanix hardware objects such as server nodes, fans, power supplies and others via SCOM.
Here are some screenshot examples, but over time I will write more about each individual management pack.
Nutanix Cluster – Containers View
Nutanix Cluster – Performance >> Clusters
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.