Breaking the Data Gravity Hypothesis… The Data Anti-Gravity

Data Gravity is the term used to describe the hypothesis that Data, like a planet, has mass, and that applications and services are naturally attracted to it. This is the same effect gravity has on objects around a planet.

Dave McCrory coined the term Data Gravity to explain that, because of Latency and Throughput constraints, applications and services will, or should, always execute in proximity to Data: “Latency and Throughput, which act as the accelerators in continuing a stronger and stronger reliance or pull on each other.”

The theory is also used to describe how growth in Data mass restrains applications from moving to or from private and public clouds. Some people also point to inertia as the cause of the real data mobility problem, and note that transferring large amounts of data is still hard because of the speed of light.

Data Gravity – in the Clouds – by McCrory


However the hypothesis is applied, the truth is often in the eye of the beholder. Most commonly, either a sound technology has not yet been developed, or the technology has not been used in a way that breaks the hypothesis.

The Data Gravity theory is highly applicable to the evolution of datacenters and clouds, both private and public. The rise of host-attached Flash devices, and the ability to use them on local computing buses rather than over a network, is a clear indication that applications benefit from Data proximity.

However, when it comes to application and system mobility, we are still constrained by Latency and Throughput, which makes such data movement hard, particularly when dealing with vast Data Lakes. McCrory also identified the key factors that prevent Data Gravity mitigation: Network Bandwidth, Network Latency, Data Contention vs. Data Distribution, Volume of Data vs. Processing, Data Governance/Laws/Security/Provenance, Metadata Creation/Accumulation/Context, and Failure State.

In the case of data movement between clouds, the real puzzle is how to reduce Data to its most essential form: a sequence of bits and bytes that never repeats itself. This technology, known as Data Deduplication, has been around for many years, but it has always been used in a self-contained manner, meaning that Data is de-duplicated within a container, a drive, a host, a cluster, or on the wire.

If it were possible to de-duplicate application data at a global level, across datacenters, clouds, Data Lakes, and systems, then we would be guaranteeing a very high level of data availability in every part of the globe, because data would become ubiquitous and universal.


How does that work in practice?

An application running in a private datacenter has each data block de-duplicated and hashed locally, creating a unique fingerprint. These fingerprints are then compared to the hashes already available on AWS, whether from this same system or from all systems of all customers running on AWS, which are also uniquely hashed and fingerprinted. Only the outstanding, unique data is then transferred before the application is migrated from on-premises to AWS, in a fraction of the time and bandwidth required today with traditional mechanisms.
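The fingerprint comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product implements it: the fixed 4 KiB block size, the use of SHA-256, and the names `fingerprint`, `split_blocks` and `plan_transfer` are all hypothetical choices made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems may chunk differently


def fingerprint(block: bytes) -> str:
    """Return a content hash that identifies a block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def plan_transfer(data: bytes, remote_fingerprints: set):
    """Work out which blocks must actually cross the wire.

    Returns (recipe, to_send):
      recipe  - ordered fingerprints needed to rebuild the data at the destination
      to_send - fingerprint -> block for blocks the destination does not hold yet
    """
    recipe = []
    to_send = {}
    for block in split_blocks(data):
        fp = fingerprint(block)
        recipe.append(fp)
        # Skip blocks the destination already has, and local duplicates.
        if fp not in remote_fingerprints and fp not in to_send:
            to_send[fp] = block
    return recipe, to_send
```

Given three local blocks where two are identical and the destination already holds that repeated block, only one block needs to be sent; the recipe alone is enough to reference the rest.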

Universal De-duplication makes data ubiquitous and universal, common to every possible application and system, while Metadata takes on a vital role: building datasets and enforcing policies and distribution. Metadata is what distinguishes my Data Lake from your Data Lake, and my system from your system.
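One way to picture the split between shared data and per-owner Metadata is a single block store keyed by fingerprint, with each dataset reduced to an ordered list of fingerprints. The sketch below is a simplified assumption for illustration; the `BlockStore` class and the `write_dataset` and `read_dataset` helpers are hypothetical names, and policy enforcement and security are left out entirely.

```python
import hashlib


class BlockStore:
    """A shared pool of unique blocks, keyed by content fingerprint."""

    def __init__(self):
        self.blocks = {}

    def put(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        # Store the block only if this fingerprint has not been seen before.
        self.blocks.setdefault(fp, block)
        return fp

    def get(self, fp: str) -> bytes:
        return self.blocks[fp]


def write_dataset(store: BlockStore, chunks: list) -> list:
    """Store the chunks and return the dataset's metadata: its list of fingerprints."""
    return [store.put(chunk) for chunk in chunks]


def read_dataset(store: BlockStore, recipe: list) -> bytes:
    """Rebuild a dataset from its metadata and the shared block pool."""
    return b"".join(store.get(fp) for fp in recipe)


store = BlockStore()
mine = write_dataset(store, [b"os-image", b"my-db"])
yours = write_dataset(store, [b"os-image", b"your-logs"])
```

Here `mine` and `yours` are different datasets that share one stored copy of the common block; only the Metadata tells them apart.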

Universal Deduplication solves the puzzle and most of the key issues described by McCrory. The bigger the pool of de-duplicated data available at a given location (AWS, Azure, On-Premise), the less bandwidth is required, because most of the necessary data is already there. Data Contention and Distribution issues are gone because data is ubiquitous and common to all systems, while Metadata starts playing a vital role. Data Governance becomes a Metadata intricacy, not a data problem. Encryption and Data Security become concerns at the Metadata level, not at the Data level.
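The relationship between pool size and bandwidth can be made concrete with a small calculation. The function below is a rough sketch, assuming fixed-size blocks and ignoring the cost of exchanging the fingerprints themselves; `bytes_to_transfer` and the sample fingerprints are hypothetical.

```python
def bytes_to_transfer(source_fps: list, dest_pool: set, block_size: int) -> int:
    """Estimate payload bytes needed to move a dataset to a destination.

    source_fps - fingerprints of the blocks that make up the dataset
    dest_pool  - fingerprints already present at the destination
    Only unique blocks missing from the destination have to be sent.
    """
    missing = set(source_fps) - dest_pool
    return len(missing) * block_size


dataset = ["fp-a", "fp-b", "fp-c", "fp-d"]

empty_pool = bytes_to_transfer(dataset, set(), 4096)
half_pool = bytes_to_transfer(dataset, {"fp-a", "fp-b"}, 4096)
full_pool = bytes_to_transfer(dataset, {"fp-a", "fp-b", "fp-c", "fp-d"}, 4096)
```

As the destination pool grows to cover more of the dataset, the estimated transfer shrinks, reaching zero once every block is already there.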

While the matter needs a further in-depth discussion, it is clear that unless we look at the problem with different eyes, we will not solve the problems of exponential data growth and data mobility. Universal de-duplication is a sound way to address the Data Gravity hypothesis. The Data Anti-Gravity.


How does that relate to Datrium?

On a smaller scale, that’s one of the critical issues that Datrium is solving for enterprise data center workloads, especially Virtual Machines and Stateful Containers. In a Datrium solution, Data is universally de-duplicated across drives, hosts, systems, clusters, links, multi-datacenter deployments or on AWS – for a customer or enterprise.

A simple example: an IT admin may load stale, legacy data from a backup onto a secondary or DR site, even if that data was not backed up from the same system or the same dataset, or with the same backup tool, and the primary site will never re-send data that is already on the destination system. No checkpoints required, no pre-synchronization – it’s a simple universal metadata hash and fingerprint comparison at the most basic level.

Another example: when archiving or restoring data from AWS S3, only data with unique hash fingerprints is sent, reducing the amount of storage and bandwidth required. More importantly, this also solves a more meaningful issue: the artificial Data Gravity created by AWS through the high egress cost for data.


This article was first published by Andre Leibovici (@andreleibovici) at
