Nutanix Controller Failure. Will Users Notice?

Reliability and resiliency are key, if not the most important, attributes of the Nutanix Distributed File System (NDFS). As a distributed system, NDFS is built to handle component, service, and controller VM (CVM) failures.

“Reliability is the probability of an item operating for a certain amount of time without failure. As such, the reliability function is a function of time, in that every reliability value has an associated time value.” – Reliability Hotwire

“Resiliency is the ability to provide and maintain an acceptable level of service in the face of faults and challenges to normal operation.” – wikipedia.com

The Nutanix cluster automatically selects the optimal path between a hypervisor host and its guest VM data. This is known as automatic path redirection, or Autopath.

When local data is available, the optimal path is always through the local CVM to local storage devices. However, in some situations the VM data is not available on local storage, such as when a VM was recently migrated to another host. In those cases, the CVM redirects read requests across the network to storage on another host.

Nutanix Autopath also constantly monitors the status of CVMs in the cluster. If any process fails to respond two or more times in a 30-second period, the storage path on the affected host is redirected to another, healthy CVM. To prevent constant switching between CVMs, the data path is not restored until the original CVM has been stable for at least 30 seconds.
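
To make the timing rules concrete, here is a minimal sketch of the heuristic described above: repeated unresponsiveness within a 30-second window triggers a redirect, and the path only fails back after the local CVM has been healthy for 30 seconds again. This is an illustration of the published behavior, not Nutanix code; the names (AutopathMonitor, poll, and so on) are hypothetical.

```python
# Illustrative sketch only -- not Nutanix source code.
import time

FAILURE_WINDOW = 30          # seconds in which repeated failures trigger a redirect
FAILURES_TO_REDIRECT = 2     # "two or more times in a 30-second period"
STABLE_BEFORE_FAILBACK = 30  # seconds the local CVM must stay healthy before failback

class AutopathMonitor:
    def __init__(self, local_cvm, remote_cvms):
        self.local_cvm = local_cvm
        self.remote_cvms = remote_cvms
        self.failure_times = []
        self.redirected = False
        self.healthy_since = None

    def poll(self, is_responsive):
        """Call periodically with the result of a CVM health probe."""
        now = time.time()
        if is_responsive:
            self.healthy_since = self.healthy_since or now
        else:
            self.failure_times.append(now)
            self.healthy_since = None

        # Only failures inside the rolling 30-second window count.
        self.failure_times = [t for t in self.failure_times
                              if now - t <= FAILURE_WINDOW]

        if not self.redirected and len(self.failure_times) >= FAILURES_TO_REDIRECT:
            self.redirected = True
            return ("redirect", self.remote_cvms[0])   # any healthy peer would do

        if (self.redirected and self.healthy_since is not None
                and now - self.healthy_since >= STABLE_BEFORE_FAILBACK):
            self.redirected = False
            return ("restore", self.local_cvm)

        current = self.remote_cvms[0] if self.redirected else self.local_cvm
        return ("no_change", current)
```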

A CVM “failure” could include a user powering down the CVM, a CVM rolling upgrade, or any other event that brings down the CVM. In any of these cases, Autopath kicks in and transparently redirects storage traffic so it is served by another CVM in the cluster.

The hypervisor and CVM communicate using a private network on a dedicated virtual switch. This means that all storage traffic happens via an internal IP address on the CVM. The external IP address of the CVM is used for remote replication and for CVM-to-CVM communication.

 

 

In the event of a local CVM failure, the local addresses previously used by the local CVM become unavailable. In that case, NDFS automatically detects the outage and redirects storage traffic to another CVM in the cluster over the network. The re-routing is transparent to the hypervisor and to the VMs running on the host.

This means that even if a CVM is powered down, VMs can continue to perform IO operations. NDFS is also self-healing, meaning it will detect the CVM has been powered off and will automatically reboot or power on the local CVM. Once the local CVM is back up and available, traffic is seamlessly transferred back and served by the local CVM again.

NDFS uses a replication factor (RF) and checksums to ensure data redundancy and availability in the case of a node or disk failure or data corruption. When a node or disk fails, the data is re-replicated among all nodes in the cluster to maintain the RF; this is called re-protection. Re-protection may also happen after a CVM goes down.
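
As a rough illustration of that redundancy model (and not of the actual NDFS data path), the sketch below keeps RF copies of each extent with a checksum and re-replicates under-protected extents when a node disappears. The Cluster class, the extent placement, and the SHA-1 checksum choice are all assumptions made for the example.

```python
# Toy model of RF + checksums + re-protection; not the NDFS implementation.
import hashlib
import random

RF = 2  # replication factor

class Cluster:
    def __init__(self, node_ids):
        # node -> {extent_id: (data, checksum)}
        self.nodes = {n: {} for n in node_ids}

    def write_extent(self, extent_id, data):
        checksum = hashlib.sha1(data).hexdigest()
        for n in random.sample(list(self.nodes), RF):   # RF copies on distinct nodes
            self.nodes[n][extent_id] = (data, checksum)

    def fail_node(self, node_id):
        self.nodes.pop(node_id)   # that node's disks are gone
        self.reprotect()

    def reprotect(self):
        """Re-replicate any extent that has fewer than RF surviving copies."""
        all_extents = {e for exts in self.nodes.values() for e in exts}
        for extent_id in all_extents:
            holders = [n for n, exts in self.nodes.items() if extent_id in exts]
            if not holders:
                continue                      # no surviving replica to copy from
            missing = RF - len(holders)
            if missing <= 0:
                continue
            data, checksum = self.nodes[holders[0]][extent_id]
            assert hashlib.sha1(data).hexdigest() == checksum   # verify before copying
            spare_nodes = [n for n in self.nodes if n not in holders]
            for n in spare_nodes[:missing]:
                self.nodes[n][extent_id] = (data, checksum)

cluster = Cluster(["node-A", "node-B", "node-C", "node-D"])
cluster.write_extent("vm1-extent-0", b"guest VM data block")
cluster.fail_node("node-A")   # any extent that had a copy on node-A is re-protected
```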

Below we show a graphical representation of how this looks for a failed CVM:

 

[Figure: storage I/O from the host with the failed CVM is redirected over the network to a healthy CVM in the cluster]

 

What Will Users Notice?

During the switching process, the host with the failed CVM may report that the datastore is unavailable, and guest VMs on that host may appear to “hang” until the storage path is restored. The pause can be as short as 15 seconds and may only be noticed as a brief spike in latency. Although the primary copy of the guest VM data is unavailable because it is stored on disks mapped to the failed CVM, the replicas of that data are still accessible. As soon as the redirection takes place, VMs can resume reads and writes.

Performance may decrease slightly because IO is now traveling across the network rather than across the internal virtual switch. However, because the redirected traffic goes across the 10GbE network, most workloads will not degrade in a way that is perceptible to users.

This behavior is very important because it defines the resiliency of the Nutanix architecture, allowing guest VMs to keep running even during a storage outage. In other solutions, where the storage stack is not independent, a failure or a kernel panic could force guest VMs and applications to be restarted on a different host, causing serious application outages.

 

[Important Update] As of NOS 3.5.3.1, the VM pause is nearly imperceptible to Guest VMs and applications.

 

What Happens If Another CVM Fails?

A second CVM failure will have the same impact on the VMs on that host, which means two hosts will now be sending IO requests across the network. More important, however, is the additional risk to guest VM data. With two CVMs unavailable, there are now two sets of physical disks that are inaccessible. In a cluster with a replication factor of two, there is a chance that some VM data extents have become unavailable, at least until one of the CVMs resumes operation.

However, if the subsequent failure occurs after the data from the first node has been re-protected, the impact is the same as a single host failure. You can continue to lose nodes in a Nutanix cluster, provided the failures happen after the short re-protection time and until you run out of physical space to re-protect the VMs.
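
The difference between simultaneous and sequential failures can be shown with a toy simulation (again, an illustration rather than a Nutanix tool): with RF=2, two nodes failing at the same time can take both copies of some extents offline, while a second failure that happens only after re-protection has completed leaves every extent with at least one surviving copy. The four-node layout and random placement are assumptions.

```python
# Toy simulation of RF=2 extent availability under simultaneous vs. sequential failures.
import random

RF = 2
NODES = ["A", "B", "C", "D"]

def place_extents(n_extents=1000):
    # extent_id -> set of nodes holding a replica
    return {i: set(random.sample(NODES, RF)) for i in range(n_extents)}

def offline(placement, down_nodes):
    # An extent is offline only if every one of its replicas sits on a failed node.
    return sum(1 for replicas in placement.values() if replicas <= down_nodes)

placement = place_extents()
print("two simultaneous failures:",
      offline(placement, {"A", "B"}), "extents offline")

# Now fail "A" first and let re-protection finish: its replicas are re-copied
# onto surviving nodes before the second failure occurs.
for replicas in placement.values():
    if "A" in replicas:
        replicas.discard("A")
        replicas.add(random.choice([n for n in NODES if n not in replicas and n != "A"]))

print("second failure after re-protection:",
      offline(placement, {"A", "B"}), "extents offline")   # expect 0
```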

 

Boot Drive Failure

Each CVM boots from a SATA SSD. During cluster operation, this drive also holds component logs and related files. A boot drive failure will eventually cause the CVM to fail. The host does not access the boot drive directly, so other guest VMs can continue to run. In this case, NDFS Autopath will also redirect the storage path to another CVM. In parallel, NDFS constantly monitors the SSDs to predict failures (I’ll write more about this in the future).
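
Exactly how NDFS predicts SSD failures isn’t documented here, but as a flavor of what such monitoring looks like, the sketch below reads SMART attributes with smartctl (from smartmontools) and flags wear-related counters that are approaching their thresholds. It is not how NDFS monitors drives internally; the device path, the watched attribute names, and the safety margin are assumptions, and the parsing assumes the classic ATA attribute table format.

```python
# Illustrative SMART check via smartctl; NOT how NDFS monitors drives internally.
import subprocess

WATCHED = {"Reallocated_Sector_Ct", "Media_Wearout_Indicator", "Wear_Leveling_Count"}
MARGIN = 10   # flag attributes whose normalized value is within 10 of the threshold

def smart_attributes(device="/dev/sda"):
    # smartctl uses bit-mask exit codes, so don't treat a non-zero return as fatal.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    attrs = {}
    for line in out.splitlines():
        parts = line.split()
        # Classic ATA table: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = {"value": int(parts[3]), "thresh": int(parts[5])}
    return attrs

def looks_unhealthy(attrs):
    return any(name in WATCHED and a["value"] <= a["thresh"] + MARGIN
               for name, a in attrs.items())

if __name__ == "__main__":
    attrs = smart_attributes()
    print("pre-failure warning" if looks_unhealthy(attrs) else "drive looks healthy")
```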

 

Thanks to Steven Poitras for allowing me to use content from The Nutanix Bible.

Thanks to Michael Webster, Josh Odgers and Prasad Athawale for contributing and revising this article.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net.

19 comments


  1. I have about 4000 desktops running on Nutanix, so I wanted to find out what their experience would be if I performed an upgrade during production. I ran a test on my test cluster and my VM seemed to hang for about 20 seconds. Would the user notice? Maybe. Would I do an upgrade during production hours? Probably not.

    I ran my test on 3.5.2.1 so I’m looking forward to trying it on 3.5.3.1.

    • forbsy on 03/24/2014 at 3:49 am

    In other solutions, where the storage stack is not independent, a failure or a kernel panic could force guest VMs and applications to be restarted on a different host, causing serious application outages.

    Wouldn’t a kernel panic cause HA to kick in and restart VMs on any platform running vSphere? Are you saying Nutanix can handle an ESXi server failing and its guest VMs will still not need to be restarted on a different host? I understand if you’re referring to other solutions’ CVMs, but you said kernel panic, so I wasn’t sure if you were referring to the hypervisor kernel.

  2. forbsy,

    Good point. I’m trying to convey the idea that if the panic is related to the storage stack, an independent solution like Nutanix guarantees that VMs are not affected for a lengthy amount of time and do not go down. If the panic is in a storage stack that lives in the kernel, there is a chance the entire kernel becomes unstable and crashes the host, forcing VMs to be restarted on a different host after the failure.

    Please note that this probably also applies to solutions that use a kernel module but are not shipped baked into the hypervisor. So, I am not talking about any specific solution here.

  3. Josh, I saw your video. Great stuff!!
    Would you allow me to publish it on my blog with proper acknowledgments?

  4. Sure! The first time I tried it, the performance was as expected: only the video encoding paused, but the VM was still responsive. But when I went to record the video it hung for 20 seconds, or rather the Mac View Client hung for 20 seconds. I guess I could try recording the video from the vSphere console; then I could eliminate the View Client. For now I’ll just say the worst case is that your VM pauses for 20 seconds until you upgrade to 3.5.3.1.

    Of course that was with only 1 VM running… I guess now I need to take it to the next level and have 100 VMs running at the same time on there.

  5. Josh, I have done some tests with the same version you have deployed and I don’t see the VM hanging like you did. This makes me believe it was the Mac View Client hanging for 20 seconds.

  6. Hello Andre,
    I just tried to kill the CVM (via esxcli vm process kill -t force -w WorldID), hoping that the other CVMs in the Nutanix cluster would notice and restart the killed CVM (as you mention, “NDFS is also self-healing, meaning it will detect the CVM has been powered off and will automatically reboot or power on the local CVM.”, and Steven Poitras has the exact same sentence in the Nutanix Bible).
    Unfortunately, even after leaving the cluster alone for 15 minutes, the CVM had not been restarted.
    I asked my local Nutanix SE and he said that only in the event of an application crash inside the CVM would the CVM be restarted by its peers.
    So I played along and launched a “sudo kill -9 1” while SSHed into one CVM to try and create a crash (not the best way to do it, I concur). The CVM was still up, I could still SSH to it, but “cluster status” was stuck in the “Failed to reach a node where Genesis is up. Retrying…” loop and the CVM never got restarted by the other surviving CVMs.

    Am I doing something wrong on both counts? Is there something I’m missing here?

    • forbsy on 04/02/2014 at 7:09 am

    Yikes. If you have to orchestrate the CVM failure in a specific manner, where’s the resiliency value? I generally like what Nutanix brings to the table, but this is one example of where I feel hypervisor (kernel-mode) integration of a storage IO solution trumps the limitations of solutions that rely on controller VMs.

    • Andre Leibovici on 04/02/2014 at 8:23 am

    forbsy, although Sylvain crashed the CVM manually, all VMs still run on the same host without major impact, as they automatically start using data blocks from other hosts. This is the resiliency that you won’t get in a kernel-based approach if there’s an issue with the storage processes.

    Sylvain is talking about the self-healing process, where, if the controller panics for any reason, it should self-heal and restart accordingly, returning data path control to the CVM native to the host.

    Sylvain, let me do a bit of research and get back to you on how the self-healing exactly works. What NOS version are you using?

    -Andre

  7. @forbsy: Yeah, I can confirm that the VMs that were running on the host are still fine, no impact for them. I’m much more comfortable with the hypervisor doing its job, which is running VMs, and Nutanix doing its own job, namely providing storage and data resilience.

    Not having a strong link between the two is one of our drivers in the choice of Nutanix. We do not want anything messing around with our hypervisor kernel (in fact we tried but never implemented vShield Endpoint Protection for the very same reason).

    @Andre: We are running NOS 3.5.2.1 (the upgrade to 3.5.3.1 is on my to-do list).

  8. @Sylvain: 3.5.3.2 is also available now. As you know, upgrades to the entire cluster are rolling, automated and non-disruptive to your environment. I would recommend upgrading the cluster and testing it again. In the meantime I will speak to the engineers to see where and how services are self-healed.

  9. @Andre: Are you sure it’s available yet? I don’t see it on the Nutanix Support Portal.
    Anyway, I need to document the upgrade process in relation to our internal processes, so I’ll have to wait a bit before upgrading.

    • forbsy on 04/02/2014 at 2:18 pm

    @Andre: Once again I’m assuming you’re referring to other solutions’ controller-based VMs having a kernel panic and not having that resiliency. I was talking about hypervisor-kernel-integrated (non-VM-controller) solutions like VSAN or PernixData. It still sounds like there’s an issue with the Nutanix CVM not self-healing properly, or only self-healing if it fails in a certain manner.

    @Sylvain: There’s actually a very good reason that you do want your hypervisor handling all aspects of your storage IO and not relying on a guest VM. Frank Denneman does a great job explaining this in Chad Sakac’s blog:

    http://virtualgeek.typepad.com/virtual_geek/2014/03/a-few-thoughts-and-opinions-on-vsan-and-hyperconvergence.html#more

    Scroll to the bottom (last comment) to read Frank’s explanation. In the end, the hypervisor kernel is actually where you want to handle your storage IO. If that wasn’t the case, you’d have a bunch of storage controller virtual machines spun up in your vSphere environment to handle that today.

    I find it hard to believe that Nutanix wouldn’t have taken advantage of tight hypervisor kernel-mode integration with NDFS if that option had been available to them during development. As mentioned, I think Nutanix is a nice solution for hyper-convergence, but the issue you ran into, coupled with the points that Frank makes, points out why a controller VM IO solution presents issues, especially as you scale out, which is the Nutanix model.

  10. @forbsy I hear you, respect your opinion and understand your concerns and points. Frank is a friend of mine and I also respect his opinion. However, the best solution is the one that works. Betamax was better than VHS, yet VHS won! I could give you many examples along these lines.

    Nutanix has clusters with hundreds of nodes, and a cluster with 1,600 nodes running without issues for a government agency. Therefore, the scalability issue is FUD from the competition.

    Also, Nutanix views hypervisors as a commodity: you should use whatever hypervisor you want, or the one that fits your wallet. I can also see the day when Nutanix will enable native use of any VM format across hypervisors, effectively enabling multi-hypervisor solutions. A multi-hypervisor environment is something that Gartner says more than 60% of organizations expect to have in the future.

    Since Nutanix is not baked into the kernel, the innovation rate is also likely to be much faster.

    Ultimately it’s up to customers to choose the path they want to take, since both are valid approaches.

    -Andre

    Note: On the self-healing issue, I think it’s because he is running an old version of NOS. I’ll double-check with engineering.

    • forbsy on 04/02/2014 at 4:07 pm

    LOL. I totally remember the Beta vs VHS wars. Not sure I remember which was better :). Totally agreed, both are valid approaches. I wasn’t arguing that vSphere kernel integration of the IO path was the best way. Rather, any solution that uses hypervisor kernel integration of the storage IO path is the preferred solution, IMO. VSAN has their solution, and if Hyper-V had a hyper-converged solution that relied on kernel-mode integration, I feel it would be the way to go (if I were inclined to use Hyper-V).
    Agreed that support across multiple hypervisors (either to capture more market share, or to work within a heterogeneous datacentre) is something that makes sense. The customers I deal with today pretty much only have a single hypervisor, but that could change in the future.

  11. @forbsy: Not to add fuel to the fire, but I knew about the post from Frank (multi-year vExpert veteran here…) and while I see his point in theory, I always prefer field tests to form an opinion.
    Not being aligned to a vendor (no ties to any), I get the opportunity to field test every product that my clients can get their hands on.
    In this case, the client is running 4000+ VMs on ~200 hosts, and he does have a multi-hypervisor policy, KVM & ESXi.
    We tried vSAN (we were part of the beta at a very early stage) and ultimately dismissed it for various reasons, but one of them was its strong ties to the ESX kernel and the vendor lock-in that it provokes.
    We are not planning to try ScaleIO (because it’s block based and the client runs an NFS-only shop) but we are testing Nutanix, and except for this specific point (which can be solved otherwise and may very well be a problem on our side, I haven’t even opened a support case yet) we are seeing some really good performance (better than the performance we are seeing from our usual NetApp arrays).

    Ultimately my motto is: use whatever works for you in your specific context, with the client’s specific constraints, but make your choice wisely after having tested the available solutions.

    @Andre: Back on topic: the version we are using is the second to last; I would not qualify it as “old” per se. Are you sure that’s the root of the problem?

    Sylvain.

  12. @Andre: I just upgraded to 3.5.3.1 and was able to reproduce the behavior (crash the CVM, wait for it to be automatically restarted by another CVM, it never happens).
    I may open a low-priority support case to get a definitive answer from support, if you want.

    • Ian Forbes on 04/08/2014 at 7:47 am

    @Andre - If a CVM experiences a failure and NDFS re-routes storage IO for that node to another CVM, is it possible that the CVM could get overwhelmed servicing the remote node’s storage IO as well as its own?

  13. @Ian Forbes, Nutanix has an algorithm to dynamically hash requests for different vDisks to different CVMs on different nodes. I believe this path also changes dynamically based on CVM load, but I need to confirm that. It might be a topic for another blog post.
