I solved Natural Hazard DR with Python and Datrium


One of the critical issues with disaster recovery, in general, is that you only want to trigger DR when the disaster strikes.

As much as I would love to say that applications should, ideally, move around freely, the reality is that there are too many interdependencies for anyone to want to migrate applications before a natural disaster has actually happened, or while it is still uncertain whether it will happen.

Some of the new app development architectures are trying to solve this problem with container infrastructures. However, for stateful datasets, this is still a problem if application-level coherence isn’t available across zones and regions.

Organizations may have their replication and DR plans in place, but the missing component is the intelligence to deal with natural disaster events as they unfold, without relying on human intervention and without triggering DR when it is not needed.

The solution to the problem is to be as prepared as possible in case a DR event becomes necessary, and that means having the most recent data available in the target DR datacenter. Then, if disaster strikes, you have the shortest Recovery Point Objective (RPO) possible.

One way to solve the problem is to have all your core applications synchronously replicated between datacenters. While this seems logical, it is extremely expensive and will not withstand a ransomware attack or data corruption, since all data is automatically replicated from the source to the target datacenter.

This post assumes some prior knowledge of how Datrium snapshot and replication work.

The Python tool

I built this Python tool for the Datrium DVX to enable real-time weather and geological monitoring, using live data from the U.S. Geological Survey, and to trigger actions under certain conditions. The tool uses the Datrium API SDK for Python.

These conditions may be winds or temperatures above defined limits, or a high-magnitude earthquake that has just hit the region and has the potential to cripple the public infrastructure in the next few minutes or hours.

The tool can be configured to monitor natural hazard events in a particular location every few minutes, and if conditions are satisfied, it will snapshot applications and force replication to the DR site.
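To make the monitor-evaluate-trigger flow concrete, here is a minimal sketch of such a polling loop. It uses the public USGS earthquake GeoJSON feed; the trigger_snapshot_and_replicate() function is a hypothetical placeholder for the snapshot and replication calls you would make through the Datrium Python SDK (or any other platform's API), not the SDK's actual interface, and the thresholds are illustrative.

import time
import requests

# Public USGS GeoJSON feed of earthquakes detected in the last hour.
USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"

# Example thresholds -- tune these to your region and risk tolerance.
MAGNITUDE_THRESHOLD = 5.0
POLL_INTERVAL_SECONDS = 300  # check every 5 minutes


def trigger_snapshot_and_replicate():
    # Hypothetical placeholder: call the Datrium Python SDK here to snapshot
    # the relevant protection group and force replication to the DR site.
    print("Threshold exceeded: snapshotting and replicating to the DR site...")


def significant_quake(feed, min_magnitude):
    # Return True if any event in the feed meets the magnitude threshold.
    # A real implementation would also filter by distance to the datacenter.
    for event in feed.get("features", []):
        magnitude = event["properties"].get("mag") or 0.0
        if magnitude >= min_magnitude:
            return True
    return False


while True:
    quakes = requests.get(USGS_FEED, timeout=30).json()
    if significant_quake(quakes, MAGNITUDE_THRESHOLD):
        trigger_snapshot_and_replicate()
    time.sleep(POLL_INTERVAL_SECONDS)

The tool itself also covers weather conditions and more configuration options; the sketch above only illustrates the basic monitoring pattern.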

If you are curious about Datrium Zero RTO, I recommend this read.

This solution is generally better than snapshotting and replicating all your VMs every few minutes, due to resource utilization and cost. Typically, organizations have well-defined SLAs, but if they need to increase the snapshot and replication frequency of specific applications under certain natural hazard conditions, they can use this tool.

Note that Datrium also provides orchestrated DR plans that can be triggered if necessary, and when used in conjunction with this tool, you will always have the latest data available when needed.

Of course, this tool can also be repurposed for different infrastructures, but Datrium offers the perfect platform for Site DR, Cloud DR, and Ransomware protection.

I have posted the Python source code under the Apache License, Version 2.0, on my GitHub repository here.
(Natural Hazard monitoring)

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Backups – Feasibility vs. Requirements

A recent conversation regarding application protection triggered me to think about feasibility vs. business requirements. This topic applies to pretty much any enterprise technology domain, but in this case, I want to focus on backup and snapshot technologies. I’ll use Datrium as an example given its high degree of flexibility.

First, some key technology fundamentals:

  1. Datrium provides native VM-centric snapshots
  2. Snapshots are non-intrusive and offer zero performance impact
  3. Snapshots may be configured as often as every minute, allowing 1-minute RPO restores
  4. 1.2 Million snapshots are supported
  5. 2,000 snapshots per VM are supported
  6. Entire datastores can be snapshotted at once
  7. Snapshot retention can be of any desired period
  8. Local snapshots enable Zero RTO restores

If the fundamentals above are presented to application owners as an elective SLA, they will likely choose the SLA that minimizes their downtime exposure. Every 60 minutes seems to be the response I hear the most, but 30- and 15-minute RPOs aren't uncommon.

Compared to current backup strategies that offer backups once every 24 hours, this is already a radical change towards improving application SLAs.

The next topic is the snapshot retention period, or how long the application owner would like to keep point-in-time copies – and the answer is usually anything between one and seven years.

Next, we start having a conversation about snapshots for site recovery vs. snapshots for ransomware recovery. Yes, those have very different requirements.

As an example, you may want applications snapshotted every 10 minutes and kept onsite for 24 hours, for the purpose of recovering individual applications from a ransomware attack. However, you may likewise want another time series of snapshots taken every hour, kept for one month, and at the same time replicated offsite for the purpose of disaster recovery.

The application X data protection schema could look like this:

  • Time-Series 1 – 10-minute RPO – 24-hour retention
  • Time-Series 2 – 1-hour RPO – 30-day retention > offsite replication
  • Time-Series 3 – 24-hour RPO – 7-year retention > cloud replication

In this scenario, application X can be restored to a known good state at 10-minute intervals for the first 24 hours, at 1-hour intervals for 30 days, and at 24-hour intervals for 7 years. It seems like a good protection plan to me, but the application SLA is what defines whether the plan is good enough.

Datrium offers metadata-based snapshots with compression and deduplication, which certainly helps with data reduction across common datasets. However, if the data being created does not dedupe well, the snapshots will generate overhead that can consume a reasonable amount of storage capacity. Ultimately, it also comes down to the application's data change rate – how much unique data is being created.


Storage Capacity

Let's look at the same scenario from a storage consumption angle using the Datrium Global De-Dupe Index across all customers (4.4x), but please note that your mileage may vary according to the data. To make the math easier, let's use an initial 100 TB dataset, and for change rates let's assume 0.2% per 10-minute snapshot, 0.4% per hourly snapshot, and 1% per daily snapshot.

I'm not going to get into the calculation details in this post, but with the snapshot frequency and retention periods mentioned, the total capacity required in this case would be ~667 TB.
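If you want to sanity-check that figure, here is a simplified back-of-the-envelope model. It assumes each snapshot adds unique data equal to the change rate times the full dataset, with the 4.4x reduction ratio applied uniformly; that is not the exact model behind the ~667 TB number, but it lands in the same ballpark.

# Rough capacity estimate for the three-tier schedule above.
# Assumption: unique data per snapshot = change rate x full dataset,
# with the 4.4x global data-reduction ratio applied uniformly.

DATASET_TB = 100.0
REDUCTION = 4.4

# (change rate per snapshot, number of snapshots retained)
TIERS = [
    (0.002, 24 * 6),    # every 10 minutes, kept for 24 hours
    (0.004, 30 * 24),   # hourly, kept for 30 days
    (0.010, 7 * 365),   # daily, kept for 7 years
]

raw_delta_tb = sum(rate * DATASET_TB * count for rate, count in TIERS)
total_tb = (DATASET_TB + raw_delta_tb) / REDUCTION

print(f"Estimated capacity after 7 years: ~{total_tb:.0f} TB")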

So it is important to ensure you have adequate usable capacity, or have plans to add capacity in the future, since 668 TB will be the total capacity required by year 7. Datrium's current upper capacity limit is 4.4 Petabytes, but insiders are already talking in Exabyte terms.

Optionally, you may prefer to have longer retention policies in a DR or cloud site than on-premises.

(Protection schedule)

Snapshot Counting

Another layer of decision making is the total number of snapshots. Assuming we have 100 VMs backed up according to the schedule above, we would have roughly 352,000 snapshots after 7 years. However, if we back up 1,000 VMs with this same schedule, there would be about 3.5 million snapshots, well above the system limits mentioned earlier.
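The arithmetic is simple enough to sketch. Exact figures depend on the day-count and rounding assumptions used, so treat the numbers below as an order-of-magnitude check rather than a reproduction of the totals quoted above.

# Retained snapshots per VM at steady state for the three-tier schedule.
PER_VM = (24 * 6) + (30 * 24) + (7 * 365)   # 10-minute, hourly, and daily tiers

for vm_count in (100, 1000):
    print(f"{vm_count:>5} VMs -> ~{PER_VM * vm_count:,} snapshots retained")

With 1,000 VMs the count lands in the millions, well past the 1.2 million snapshot limit listed in the fundamentals, which is the point of the paragraph above.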

The engineering and QA teams are always improving system limits, but for production environments, the solution should be architected with current limits in mind.


Conclusion

As you can see, there is tremendous flexibility in the way Datrium enables organizations to protect applications, but with power comes responsibility, and IT must work with application owners to achieve the expected SLA within the existing constraints.

Just because it is possible, doesn’t mean you should do it.

A more common approach nowadays is the creation of SLA tiers, such as bronze, silver, and gold, that are then offered to the business as alternatives, eliminating complexity and ensuring the platform is correctly sized.

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

We executed a Disaster Recovery Plan for our VDI!

I have been working with EUC solutions for a long time, and for as long as I can remember, VDI solutions like Horizon and XenDesktop have lacked a practical way to deliver DR.

In order to execute DR, administrators had to abstract components of the solution, virtualize applications, manually replicate golden images and make heavy use of non-persistent desktops.

Even then, it is complex to build an effective DR plan that caters to business requirements, and for that reason most IT organizations opt for standby VDI deployments. This, of course, is an expensive solution, as organizations keep paying for resources until the DR plan is triggered; and there can be lengthy and cumbersome manual work involved.

Before we look into DR for VDI, I would like to highlight what I have been saying for a while: more and more customers are adopting scale-out technologies for VDI. However, instead of HCI, which is inflexible in terms of performance and storage capacity ratios, these customers are adopting disaggregated architectures like Datrium. Datrium is perfect for VDI, as it also removes numerous challenges in deploying and scaling VDI deployments.

Here is a presentation that Simon Long and I did with ActualTech Media, where we demo DR for a full-stack VMware Horizon deployment. Enjoy!

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net
