Danger! Beware of potential Data Loss through Bit Rot…

Not all Hyperconverged Solutions are created equal. It’s all very well for vendors to discuss data resiliency and data recovery, but when they claim to support Tier One workloads yet give no real consideration and protection to customers’ data, there is a problem for the customer. One such example is VSAN. Unfortunately for customers, VSAN does not have the core software-based data protection mechanisms in place to prevent data loss due to bit rot. We all know that a data loss event in any organization is significant and impactful, and in this day and age it is not tolerated. In this article I intend to outline how bit rot leads to data loss and the differences in features between Nutanix and VSAN.

Bit rot is the deterioration of the integrity of data stored on storage media. It is also known as data rot and silent corruption. Most disks, disk controllers and file systems are subject to a small degree of unrecoverable failure. As disk capacities, data sets, and the amount of data stored on magnetic and flash media keep growing, the likelihood of data decay and other forms of uncorrected and undetected data corruption increases.

Different techniques can mitigate the risk of such underlying failures, such as increasing redundancy and implementing integrity-checking and self-repairing algorithms. The ZFS file system was designed to address many of these data corruption issues. EMC Isilon OneFS also has a service called MediaScan that periodically checks for and resolves drive bit errors across the cluster. The Nutanix NDFS file system also includes data protection and recovery mechanisms.

A NetApp study found that the risk of losing data through bit rot events is thousands of times higher than predicted by “MTBF” failure models.

The problem that bit rot poses to distributed storage systems, where multiple copies of the data exist, is that these systems may write or replicate a bad copy of the data, making all copies unusable. In some cases, a good copy of the data could be overwritten with a bad one. There are two main methods to prevent or correct bit rot: the first is to perform disk scrubs, which is something every reputable array vendor does; the second involves the use of redundant copies and checksumming to verify data integrity.

 

“A checksum or hash sum is a small-size datum computed from a block of digital data for the purpose of detecting errors which may have been introduced during its transmission or storage. It is usually applied to an installation file after it is received from the download server. By themselves checksums are often used to verify data integrity, but should not be relied upon to also verify data authenticity.” – Wikipedia
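
To make this concrete, here is a minimal sketch (hypothetical example code, not from any vendor) showing how a checksum computed at write time exposes a silently flipped bit at read time, using Python’s standard zlib CRC-32:

```python
import zlib

def checksum(block: bytes) -> int:
    """Compute a CRC-32 checksum for a block of data."""
    return zlib.crc32(block)

# Write path: store the checksum alongside the data (e.g. in metadata).
block = b"some application data written to disk"
stored_crc = checksum(block)

# Simulate bit rot: a single bit silently flips on the media.
rotted = bytearray(block)
rotted[5] ^= 0x01

# Read path: recompute and compare. A mismatch reveals silent corruption.
print(checksum(block) == stored_crc)          # True  -> data is intact
print(checksum(bytes(rotted)) == stored_crc)  # False -> bit rot detected
```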

 

Stargate

Every Nutanix node has a process called Stargate that, amongst many other things, is responsible for processing checksums. While data is being written, a checksum is computed and stored as part of its metadata. Any time the data is read, the checksum is recomputed to ensure the data is valid. In the event that the checksum and data don’t match, the replica of the data is read and replaces the non-valid copy.

The response from each replica will carry the checksums of the updated data block. Each replica can then verify them to ensure that everyone wrote the exact same data. The Stargate service issuing a WriteOp can then store the resulting checksums within the metadata entry – this permits a disk scrubber to later verify these checksums.
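
As a rough illustration of the read path described above (a simplified sketch, not actual Nutanix/Stargate code), the idea is: store the checksum in metadata on write, recompute it on read, and if a copy fails verification, serve and repair from a replica that still matches:

```python
import zlib

class Replica:
    """Toy stand-in for one stored copy of a piece of data."""
    def __init__(self, data: bytes):
        self.data = data

def write(replicas, data: bytes, metadata: dict) -> None:
    # On write, every replica stores the same bytes and the checksum
    # is recorded in the metadata entry for later verification.
    for r in replicas:
        r.data = data
    metadata["crc"] = zlib.crc32(data)

def read(replicas, metadata: dict) -> bytes:
    # On read, recompute the checksum of each copy and compare it with
    # the value stored in metadata; repair any copy that fails the check.
    for r in replicas:
        if zlib.crc32(r.data) == metadata["crc"]:
            for other in replicas:
                if zlib.crc32(other.data) != metadata["crc"]:
                    other.data = r.data          # replace the non-valid copy
            return r.data
    raise IOError("all replicas failed checksum verification")

# Usage: one copy silently rots, yet the read succeeds and repairs it.
meta = {}
copies = [Replica(b""), Replica(b"")]
write(copies, b"important data", meta)
copies[1].data = b"importanu data"               # simulated bit rot
assert read(copies, meta) == b"important data"
assert copies[1].data == b"important data"       # bad copy was repaired
```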

 

Disk Scrubber (Curator)

Another important service is the Curator, which performs continuous non-deterministic fault recovery. Besides being responsible for data replication, the Curator continuously monitors data integrity by verifying checksums of random data groups across the entire cluster.

The disk scrubbing activity is done at low priority for all disks in the cluster. Any corrupted data results in the data replica being marked as bad – thus triggering re-replication from a good replica. So even if a disk sector were to go bad after a successful I/O, Stargate’s scrubber operation would detect it and then create new replicas as necessary.
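
Conceptually, a disk scrubber is just a throttled background loop that re-verifies stored checksums and queues re-replication for anything that no longer matches. The sketch below is a hypothetical, in-memory illustration of that idea, not Curator code:

```python
import time
import zlib

def scrub(extents, metadata, repair_queue, pause_s=0.01):
    """One low-priority scrub pass: verify every replica against its stored checksum."""
    for extent_id, replicas in extents.items():
        expected = metadata[extent_id]["crc"]
        for idx, replica in enumerate(replicas):
            if zlib.crc32(replica) != expected:
                # Mark this replica as bad so a new copy can be created
                # from a replica that still verifies correctly.
                repair_queue.append((extent_id, idx))
        time.sleep(pause_s)  # throttle so user I/O keeps priority

# Usage with toy in-memory structures standing in for disks.
extents = {"e1": [b"alpha", b"alpha"], "e2": [b"beta", b"bexa"]}
metadata = {"e1": {"crc": zlib.crc32(b"alpha")}, "e2": {"crc": zlib.crc32(b"beta")}}
repairs = []
scrub(extents, metadata, repairs)
print(repairs)  # [('e2', 1)] -> second replica of extent e2 needs re-replication
```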

 

Nutanix & VSAN – The Variances

In summary, the Nutanix distributed file system has a number of features that ensure checksums are computed and data integrity is prioritized to guarantee that customer data is safe. VSAN in its current release, 6.0, does not yet provide software-based checksums to protect against bit rot, which should be concerning for organizations adopting distributed storage architectures.

In fairness, VSAN does provide limited support for hardware-based checksums, but it depends on the controller being used, and it was difficult to find data on this while searching the web or the HCL. According to some blog posts only two controllers are supported, and the What’s New VSAN 6.0 document has the following mention: “Support for hardware-based checksum – Limited support for controller-based checksums for detecting corruption issues and ensuring data integrity (Refer to VSAN HCL for certified controllers).” I hear that VSAN will be introducing software-based checksums in 2016.

[UPDATE] VSAN 6.1 was announced today (31/8/2015) and still doesn’t provide software-based checksums.

[UPDATE] VSAN 6.2 now offers end-to-end software-based checksumming, but for the All-Flash solution only.

 

We may be competitors, but data loss prevention, protection and integrity are important considerations for those recommending or purchasing an HCI system, and we should all be clear on what these differences are to make informed choices.

Not all Hyperconverged Solutions are created equal!

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net

Comments

    • Andre Leibovici on 09/01/2015 at 7:26 am

    Julian, that’s great to hear, but until checksumming and data integrity checks are in place no organization should consider using VSAN for any type of workload; it’s just too risky, especially considering that any storage media or networking interfaces are accepted by VSAN.

    As I mentioned in my article, we may be competitors, but data loss prevention, protection and integrity are important considerations for those recommending or purchasing an HCI system, and we should all be clear on what these differences are to make informed choices.

    • Patrick Bingham on 09/02/2015 at 11:05 am

    Andre, you might want to check this site out.

    https://blogs.vmware.com/virtualblocks/2015/08/31/too-soon-nah-vsan-technology-preview/#sf40690832

    “Data Integrity

    While not as sexy as deduplication and erasure-coding, data integrity is critical and goes back to our commitment to deliver a highly-available and highly-resilient storage solution. To that end, we are excited to announce that we plan to test end-to-end, software checksums in the beta as well. The goal is to protect against storage bit rot, network problems, software and firmware issues. The checksum will use CRC32c, which utilizes special CPU instructions thanks to Intel, for the best performance. These software checksums will complement the hardware-based checksums available today.”

  1. Hi Patrick,

    They are talking about the “dot dot” NEXT release, because the .next release is not even out yet. In my opinion it’s an acknowledgement of the fact that VSAN is not able to provide data integrity protection today.

    As I mentioned in my article, we may be competitors, but data loss prevention, protection and integrity are important considerations for those recommending or purchasing an HCI system, and we should all be clear on what these differences are to make informed choices.

    -Andre

  2. Curious about your thoughts on a solution Microsoft has been advocating since Exchange 2010 was released (and has admitted to using in-house).

    If checksums are a requirement to run a Tier 1 application, then why does Microsoft recommend (and use internally) deploying a DAG with Exchange in a JBOD (no RAID, no controller) configuration?

    http://blogs.technet.com/b/ucedsg/archive/2010/05/06/can-i-really-use-jbod-storage-with-exchange-2010.aspx

    Is Exchange an acceptable type of workload to not rely on your storage vendor for checksums?

    Consider that some file systems handle their own recovery (ReFS/ZFS/BTRFS), as do some applications (Oracle uses its own checksums, which have the added bonus of catching in-memory corruption, something a back-end disk system isn’t going to realize was bad data to begin with).

    While checksums are great and having more layers that deploy them is useful (especially for frail, legacy applications and file systems that did not handle failure well), it would appear most modern OSes, file systems, and applications can handle bit rot (as well as a number of failure conditions that the back-end disk system just isn’t going to catch).

    • Andre Leibovici on 09/20/2015 at 11:06 pm

    John, Exchange 2010 and newer all have application level software-based checksumming. That is particularly useful to maintain integrity at the database layer.

    “Database checksumming (also known as Online Database Scanning) is the process where the database is read in large chunks and each page is checksummed (checked for physical page corruption). Checksumming’s primary purpose is to detect physical corruption and lost flushes that may not be getting detected by transactional operations (stale pages).”

    http://blogs.technet.com/b/exchange/archive/2011/12/14/database-maintenance-in-exchange-2010.aspx#checksumming

    While ReFS/ZFS/BTRFS detect and recover from file-system-level corruption, they will not protect the remote copies of the data that are specific to distributed systems such as Nutanix and VSAN. If you have 2 good copies and 1 bad copy, how will the guest file system decide which copy should be used if 1 good copy goes missing?
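
    To illustrate the point with a toy sketch (hypothetical code, not any vendor’s implementation): because the expected checksum lives in metadata, the distributed layer can tell good copies from bad ones directly, so even a single surviving good copy is enough to repair the rest:

```python
import zlib

def pick_good_replicas(replicas, stored_crc):
    """Split replica copies into good and bad using the checksum kept in metadata."""
    good, bad = [], []
    for copy in replicas:
        (good if zlib.crc32(copy) == stored_crc else bad).append(copy)
    return good, bad

stored_crc = zlib.crc32(b"payload")    # recorded at write time
copies = [b"payload", b"paXload"]      # one good copy left, one rotted copy
good, bad = pick_good_replicas(copies, stored_crc)
print(len(good), len(bad))             # 1 1 -> the lone good copy is still identifiable
```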

    Exchange is a good example where corruption protection was added to the application/database layer, but that is far from reality for the large majority of workloads. I wonder if Microsoft added that only to support DAGs with JBOD.

  3. Given the number of times I saw Exchange 2003/2007 corrupt EDBs on NetApp/EMC arrays, I suspect it was added for more than just JBOD. While bit rot is scary, application, OS, and hypervisor crashes mid log commit are shockingly more common in my anecdotal experience. Oracle also does checksums in its DB product for similar reasons.

    If memory serves, ZFS and ReFS verify the checksum on every read (and will automatically repair the “bad” copy if they happen to be reading from it at that moment). Unlike the application level, they will not initiate a repair unless you are using their parity/redundancy systems (although with ReFS you will get a nice event ID 133, and by default it will block the read and retry; if it reads the other copy it will go through, and if you’re monitoring your logs like you are supposed to, you could down that node, run a repair/chkdsk/re-mirror and fix it). In the event both copies have become corrupted, then your application-level redundancy can take a shot at repairing it (SQL AlwaysOn, DAG mirror). The key point is that modern file systems can and will detect bit rot, defeating its most dangerous trait (that it is silent and you don’t notice until it is too late).

    Now if you are de-duping the duplicate copies of a DAG or AlwaysOn (which I believe Nutanix recommends), I would share the level of concern you have. Bit rot could take out multiple copies of data in your BCA application clusters, undermining the entire point of these features. I suspect (I don’t actually know) that this is why checksums and dedupe are both mentioned for release together in the .NEXT edition.

    • Andre Leibovici on 09/21/2015 at 9:56 pm

    John, you are mistaken. That’s not how de-duplication works in distributed architectures. In such architectures the fault tolerance or replication factor is always maintained, even after de-duplication. Checksumming is a function completely independent of de-duplication.

