Zero RTO Application Restores – Myth or Reality?

Most of my life I have been closer to applications and primary storage, and later on, when managing technology teams, I always counted on well-trained data protection professionals to handle such a critical role in organizations. One thing was clear to me: backup is equivalent to insurance. It is a tool that is there for when sh*t hits the fan (pardon the expression).

Other than Data Domain adding deduplication and, more recently, backup vendors using an HCI approach to store backups in scale-out clusters, not much has changed since my customer days. Some vendors are now starting to tinker with the cloud as a tape replacement, but for the most part the cost is still prohibitive due to the high capacity requirements for incremental, daily, weekly and monthly full backups.

I may not be a data protection authority, but I do understand RPO, RTO and the business impacts.

 

RPO (Recovery Point Objective) refers to the amount of data at risk. It is determined by the amount of time between data protection events and reflects the amount of data that potentially could be lost during disaster recovery. The metric is an indication of the amount of data at risk of being lost.

RTO (Recovery Time Objective) is related to downtime. The metric refers to the amount of time it takes to recover from a data loss event and how long it takes to return to service. RTO refers then to the amount of time the system’s data is unavailable or inaccessible preventing normal service.

Descriptions By David Vellante (2008)
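To make the two metrics concrete, here is a minimal sketch that measures both on a made-up incident timeline. The function names, the dates and the four-hour restore are my own illustrative assumptions, not figures from any product.

```python
from datetime import datetime, timedelta

def data_at_risk(protection_events, failure_time):
    """RPO exposure: time since the last protection event before the failure."""
    earlier = [t for t in protection_events if t <= failure_time]
    if not earlier:
        raise ValueError("no protection event precedes the failure")
    return failure_time - max(earlier)

def downtime(failure_time, service_restored_time):
    """RTO exposure: how long the service was unavailable."""
    return service_restored_time - failure_time

# Hypothetical timeline: hourly protection events, a failure at 02:45,
# and service back at 06:45 after a conventional copy-back restore.
events = [datetime(2018, 1, 1, h) for h in (0, 1, 2)]
failure = datetime(2018, 1, 1, 2, 45)
restored = datetime(2018, 1, 1, 6, 45)

print(data_at_risk(events, failure))  # 0:45:00
print(downtime(failure, restored))    # 4:00:00
```

In this example 45 minutes of changes are lost (the RPO side) and the service is down for four hours (the RTO side). The two numbers move independently: more frequent protection events shrink the first, while only a faster restore path shrinks the second.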

Zero RTO!

The team at Datrium is knee-deep in data protection technologies, and its founding team includes the Data Domain founding team. They made data protection an integral part of Datrium DVX, alongside primary storage (BTW, the fastest and most scalable in history), and the most exciting part is Zero RTO!

RTO ZERO means that when an application or VM restore is needed, the restore is instantaneous — a single click and ZERO waiting to restore the application to a consistent state. Please note that I am not talking about reverting a snapshot that is co-located on the same storage sub-system that the application is running on (like VMware vSphere and HCI snapshots) — this is a restore from data at-rest on different media than your primary storage. Let’s have a look at how this works!

There are two important things to know about how Datrium DVX works:

 

  1. The DVX Hyperdriver Software on each compute node (hypervisor) manages all active data for the VMs within that host. It provides scalable IO performance, availability and data management capabilities.
  2. The DVX Data pool provides persistence and resiliency for a durable copy of all data in the cluster. In normal operation it is write-only, but it also provides streaming read performance for flash uploads as well as cluster coordination for simple management.

 

The DVX maintains two distinct “namespaces” (filesystem metadata about VMs and datastore-files).

The “datastore” contains the current, live version of all VMs and files. This is what the hypervisor “sees.” The hypervisor management tool (vCenter for VMware, or RHEV Manager for Red Hat) allows you to browse the contents of the live datastore at any time. These files always contain the most recently written data.

The “snapstore”, on the other hand, contains point-in-time snapshots of the live datastore as it existed previously. Every time a protection group causes a snapshot to be taken, entries are made in the snapstore with the contents of every file in the live datastore at that instant in time.

Datrium uses a “redirect-on-write” (ROW) technique to store incoming data. New data is always written to new locations (vs. copy-on-write techniques that can introduce delays as changes are copied). Because only changes are stored in a snapshot, and because DVX only stores compressed and deduplicated data, snapshots consume relatively little capacity.
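For readers who have not met redirect-on-write before, the idea can be sketched with a toy volume. This is a generic illustration of the technique, not Datrium's implementation, and the class and method names are hypothetical.

```python
class RedirectOnWriteVolume:
    """Toy redirect-on-write volume: data is never overwritten in place."""

    def __init__(self):
        self.log = []          # append-only list of physical locations
        self.block_map = {}    # logical block address -> index into the log
        self.snapshots = {}    # snapshot name -> frozen copy of the block map

    def write(self, lba, data):
        # New data always lands in a new location; only the map is updated.
        self.log.append(data)
        self.block_map[lba] = len(self.log) - 1

    def snapshot(self, name):
        # A snapshot copies metadata only; no data blocks are moved or copied.
        self.snapshots[name] = dict(self.block_map)

    def read(self, lba, snapshot=None):
        mapping = self.block_map if snapshot is None else self.snapshots[snapshot]
        return self.log[mapping[lba]]


vol = RedirectOnWriteVolume()
vol.write(0, b"v1")
vol.snapshot("hourly-1")
vol.write(0, b"v2")              # redirected; the old block stays where it was

print(vol.read(0))               # b'v2'
print(vol.read(0, "hourly-1"))   # b'v1'
```

Because the overwrite goes to a fresh location, the snapshot keeps pointing at the old block without any data being copied at write time. A copy-on-write design would first have to copy the old block aside before overwriting it, which is where the extra delay comes from.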

If, for example, a given VM’s protection policy causes snapshots to be taken every hour and retained for two days, then the snapstore would contain up to 48 different versions of this VM’s files. These policies can be overlapped, delivering an RPO of 10 minutes and retaining up to 2,000 snapshots per VM.
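The arithmetic behind the hourly example is easy to check, and the same calculation gives a rough budget for overlapping policies. In this sketch the three overlapping policies are hypothetical, and I assume each policy's snapshots count separately against the per-VM limit, which makes the total an upper bound.

```python
from datetime import timedelta

MAX_SNAPSHOTS_PER_VM = 2000  # limit quoted in the article

def retained_snapshots(interval, retention):
    """Number of snapshots a single policy keeps at steady state."""
    return retention // interval

# The example from the text: hourly snapshots kept for two days.
hourly = retained_snapshots(timedelta(hours=1), timedelta(days=2))

# Hypothetical overlapping policies on the same VM: (interval, retention).
policies = [
    (timedelta(minutes=10), timedelta(days=1)),
    (timedelta(hours=1), timedelta(days=2)),
    (timedelta(days=1), timedelta(days=30)),
]
total = sum(retained_snapshots(i, r) for i, r in policies)
effective_rpo = min(i for i, _ in policies)  # the most frequent policy wins

print(hourly)                         # 48
print(total)                          # 222
print(effective_rpo)                  # 0:10:00
print(total <= MAX_SNAPSHOTS_PER_VM)  # True
```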

Datrium DVX provides two uses for the point-in-time copies of VMs and datastore files: restoring/reverting VMs/files in the live datastore, and creating net new VMs/files (cloning).

 

  • Restoring/reverting replaces the state of live VMs or datastore-files with the state from the point in time when the snapshot was taken. It instantly “rolls back” VMs or datastore-files to a previous point in time.
  • Cloning is the process of taking a point-in-time snapshot and creating a net new VM that is immediately populated with the state contained in the snapshot. It is an instantaneous way to create one or more copies of existing VMs and applications.
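The two namespaces and the two operations can be modeled in a few lines. This is a toy model to show the difference between revert and clone, assuming file contents are held by reference; the class name and its methods are hypothetical and are not a Datrium API.

```python
class ToyNamespaces:
    """Toy model of a live datastore plus a snapstore of frozen file maps."""

    def __init__(self):
        self.datastore = {}   # live namespace: VM name -> {file name: content ref}
        self.snapstore = {}   # (VM name, snapshot id) -> frozen file map

    def take_snapshot(self, vm, snap_id):
        # Record what every file of the VM looked like at this instant.
        self.snapstore[(vm, snap_id)] = dict(self.datastore[vm])

    def revert(self, vm, snap_id):
        # Replace the live state of an existing VM with the snapshot state.
        self.datastore[vm] = dict(self.snapstore[(vm, snap_id)])

    def clone(self, vm, snap_id, new_vm):
        # Create a net new VM populated from the snapshot; the source is untouched.
        if new_vm in self.datastore:
            raise ValueError(f"{new_vm} already exists in the live datastore")
        self.datastore[new_vm] = dict(self.snapstore[(vm, snap_id)])


ns = ToyNamespaces()
ns.datastore["sql01"] = {"disk.vmdk": "ref-A"}
ns.take_snapshot("sql01", "09:00")
ns.datastore["sql01"]["disk.vmdk"] = "ref-B"    # the VM keeps running and changing

ns.clone("sql01", "09:00", "sql01-test")        # new VM at the 09:00 state
print(ns.datastore["sql01"]["disk.vmdk"])       # ref-B (live VM unaffected)
print(ns.datastore["sql01-test"]["disk.vmdk"])  # ref-A

ns.revert("sql01", "09:00")                     # roll the live VM back
print(ns.datastore["sql01"]["disk.vmdk"])       # ref-A
```

Both operations only touch metadata in this model, which is why they can be instantaneous: no file contents are copied, only references to them.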

In contrast, conventional backup tools that understand VMs (many don’t know VM-level objects) would need to restore/copy the data from an external repository (such as a NAS) via a proxy server. Because the primary storage and the backup tool operate independently and share no global deduplication awareness, the entire VM dataset has to be restored. The system must transfer large amounts of data, and restores take hours and sometimes days to complete, depending on storage and link performance.

It is crucial to understand that Datrium DVX global deduplication means the system only has to stream differential data from the data pool up to the compute nodes when a restore is triggered. That transfer happens both synchronously and asynchronously, allowing applications to restart instantly.
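A small sketch shows why deduplication awareness shrinks the transfer. Blocks are identified by a content fingerprint, and only fingerprints the host does not already hold need to travel. The use of SHA-256 and the function names are my assumptions for illustration; the article does not say how DVX fingerprints data.

```python
import hashlib

def fingerprint(block: bytes) -> str:
    """Content fingerprint of a block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()

def blocks_to_fetch(snapshot_blocks, host_fingerprints):
    """Unique fingerprints the snapshot needs that the host does not hold."""
    needed = {fingerprint(b) for b in snapshot_blocks}
    return needed - set(host_fingerprints)

# Hypothetical VM: the snapshot and the live state differ in one data block.
snapshot_blocks = [b"os", b"app", b"data-v1", b"os"]
host_fingerprints = {fingerprint(b) for b in (b"os", b"app", b"data-v2")}

to_fetch = blocks_to_fetch(snapshot_blocks, host_fingerprints)

print(len(snapshot_blocks))  # 4 blocks for a full copy-back restore
print(len(to_fetch))         # 1 block for a differential restore
```

A backup tool with no view into what primary storage already holds has to send all four blocks. With a shared fingerprint index only the one changed block moves, and the gap widens as the VM gets larger and the changes stay small.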

The ability to instantly “roll back” or clone a VM from the data pool delivers Zero RTO, and that is what matters at the end of the day when trying to restore systems. In many cases, even a thirty-minute delay in restoring systems can cost organizations millions of dollars.

Please note that there are situations where you may need to work with both Datrium and the leading backup vendors, especially with heterogeneous on-premises storage and sub-5-minute remote-site RPO.

Check out the paper on data protection and secondary storage (here).

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

 
