I solved Natural Hazard DR with Python and Datrium.
One of the critical issues with disaster recovery, in general, is that you only want to trigger DR when the disaster strikes.
As much as I would love to say that applications should ideally move around freely, the reality is that there are too many interdependencies for us to want to migrate applications before a natural disaster has happened, or while it is still uncertain that it will.
Some of the new app development architectures are trying to solve this problem with container infrastructures. However, for stateful datasets, this is still a problem if application-level coherence isn’t available across zones and regions.
Organizations may have their replication and DR plans in place, but the missing component is the intelligence to deal with natural disaster events as they unfold, without relying on human interaction and without triggering DR if not needed.
The solution to the problem is to be as prepared as possible in case a DR event becomes necessary, and that means having the most recent data available in the target DR datacenter. Hence, if the disaster strikes, you have the shortest possible Recovery Point Objective (RPO).
One way to solve the problem is to synchronously replicate all your core applications between datacenters. While this seems logical, it is extremely expensive and will not withstand a ransomware attack or data corruption, since every change, good or bad, is automatically replicated from the source to the target datacenter.
This post assumes some prior knowledge of how Datrium snapshot and replication work.
The Python tool
I built this Python tool for the Datrium DVX to monitor weather and geological conditions using real-time data from the U.S. Geological Survey and to trigger actions under certain conditions. The tool uses the Datrium API SDK for Python.
These conditions may be high winds or temperatures above the defined limits, or a high-magnitude earthquake that has just hit the region and could cripple the public infrastructure within the next few minutes or hours.
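To make the idea of "defined limits" concrete, here is a minimal sketch of how such conditions could be evaluated. It is not the code from my repository; the `HazardLimits` class, the `breached_conditions` function, and every default threshold value are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HazardLimits:
    """Hypothetical per-site thresholds; the defaults are illustrative only."""
    max_wind_mph: float = 60.0
    max_temp_f: float = 110.0
    min_quake_magnitude: float = 5.5


def breached_conditions(
    limits: HazardLimits,
    wind_mph: Optional[float] = None,
    temp_f: Optional[float] = None,
    quake_magnitude: Optional[float] = None,
) -> List[str]:
    """Return a human-readable reason for every limit that is exceeded.

    A reading of None means "no data available" and is skipped, so a
    missing sensor value never triggers an action on its own.
    """
    reasons = []
    if wind_mph is not None and wind_mph > limits.max_wind_mph:
        reasons.append(f"wind {wind_mph} mph exceeds {limits.max_wind_mph} mph")
    if temp_f is not None and temp_f > limits.max_temp_f:
        reasons.append(f"temperature {temp_f} F exceeds {limits.max_temp_f} F")
    if quake_magnitude is not None and quake_magnitude >= limits.min_quake_magnitude:
        reasons.append(
            f"earthquake magnitude {quake_magnitude} reaches {limits.min_quake_magnitude}"
        )
    return reasons


if __name__ == "__main__":
    # Example readings; an empty list would mean "do nothing".
    for reason in breached_conditions(HazardLimits(), wind_mph=75.0, quake_magnitude=6.1):
        print("trigger:", reason)
```

Returning the list of reasons, rather than a plain boolean, makes it easy to log why a snapshot and replication were forced.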
The tool can be configured to monitor natural hazard events in a particular location every few minutes, and if conditions are satisfied, it will snapshot applications and force replication to the DR site.
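The polling behavior described above could look roughly like the following sketch for the earthquake case. It reads the public USGS GeoJSON summary feed, where each feature carries a magnitude in `properties.mag` and `[longitude, latitude, depth]` in `geometry.coordinates`. Everything else is an assumption for illustration: the function names, the radius and magnitude defaults, and especially the `protect` callback, which stands in for whatever snapshot and replication calls you make through the Datrium SDK. I am deliberately not showing SDK calls here.

```python
import json
import math
import time
import urllib.request

# Public USGS feed of all earthquakes in the past hour (GeoJSON).
USGS_FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def fetch_usgs_feed(url=USGS_FEED):
    """Download and decode the feed; network errors propagate to the caller."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def nearby_quakes(feed, site_lat, site_lon, radius_km, min_magnitude):
    """Return feed features at or above min_magnitude within radius_km of the site."""
    hits = []
    for feature in feed.get("features", []):
        mag = (feature.get("properties") or {}).get("mag")
        coords = (feature.get("geometry") or {}).get("coordinates")
        if mag is None or not coords or mag < min_magnitude:
            continue
        lon, lat = coords[0], coords[1]  # GeoJSON order is longitude, latitude
        if haversine_km(site_lat, site_lon, lat, lon) <= radius_km:
            hits.append(feature)
    return hits


def monitor(fetch, protect, site_lat, site_lon, radius_km=200.0,
            min_magnitude=5.5, interval_s=300, max_polls=None, sleep=time.sleep):
    """Poll the feed and call protect(new_quakes) once per newly seen event.

    protect is a hypothetical callback: this is where snapshot and forced
    replication would be requested. Returns how many times it was called.
    """
    seen, polls, triggered = set(), 0, 0
    while max_polls is None or polls < max_polls:
        quakes = nearby_quakes(fetch(), site_lat, site_lon, radius_km, min_magnitude)
        new = [q for q in quakes if q.get("id") not in seen]
        if new:
            seen.update(q.get("id") for q in new)
            protect(new)
            triggered += 1
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return triggered


if __name__ == "__main__":
    # Example site coordinates; replace print with your own protection routine.
    monitor(fetch_usgs_feed, lambda quakes: print("protect now:", len(quakes)),
            site_lat=37.33, site_lon=-121.89, max_polls=1)
```

Tracking the event IDs already seen keeps one earthquake from forcing a new snapshot on every poll while it remains in the hourly feed.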
If you are curious about Datrium Zero RTO, I recommend this reading.
This solution is generally better than having all your VMs snapshot and replicate every few minutes, because of resource utilization and cost. Typically, organizations have well-defined SLAs, but if they need to increase the snapshot and replication frequency of specific applications under certain natural hazard conditions, they can use this tool.
Note that Datrium also provides orchestrated plans for DR that can be triggered if necessary, and when used in conjunction with this tool, you will always have the latest data available when needed.
Of course, this tool can also be repurposed for different infrastructures, but Datrium offers the perfect platform for Site DR, Cloud DR, and Ransomware protection.
I have posted the Python source code under the Apache License, Version 2.0, on my GitHub repository here.
This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net