
Oct 02 2014

How to Fingerprint existing VMDK/VHD/RAW for De-duplication on Nutanix

In my article How to enable Nutanix De-duplication per VMDK/VHD/RAW I explained how to enable or disable de-duplication and fingerprinting-on-write for individual VMDK/VHD or RAW vdisks via nCLI.

Once de-duplication and fingerprinting-on-write are both enabled for a given vdisk, NOS processes every new write with the US Secure Hash Algorithm 1 (SHA1), using the native SHA1 optimizations available on Intel processors. The resulting hashes are then used for post-process de-duplication, which is executed by the Curator background process. However, I omitted a somewhat important detail about this process.

Because only new writes are fingerprinted-on-write, earlier writes that now make up the data blocks of an existing vdisk are not fingerprinted, and therefore will not be considered for post-process de-duplication.

To fingerprint existing vdisk data, you need a new nCLI command called vdisk_manipulator. It lets you fingerprint entire disks that already hold data, for example after a Nutanix NOS 4.x upgrade or after enabling de-duplication for the first time in a container.
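To make the gap concrete, here is a minimal Python sketch of the idea. It is a toy model and not how NOS stores or hashes data: the `ToyVdisk` class, its method names, and the 4 KB block size are all assumptions made for illustration. It only shows that blocks written before fingerprinting was enabled carry no hash until something fingerprints them after the fact.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size (assumption, not the NOS value)


class ToyVdisk:
    """Toy model of a vdisk; hypothetical, not the NOS implementation."""

    def __init__(self):
        self.fingerprint_on_write = False
        self.blocks = []  # list of (data, fingerprint or None)

    def write(self, data: bytes) -> None:
        # Only writes that arrive while fingerprinting is enabled get a hash.
        fp = hashlib.sha1(data).hexdigest() if self.fingerprint_on_write else None
        self.blocks.append((data, fp))

    def dedup_candidates(self) -> int:
        # A post-process pass can only match blocks that carry a fingerprint.
        return sum(1 for _, fp in self.blocks if fp is not None)

    def add_fingerprints(self) -> None:
        # Toy analogue of fingerprinting existing data after the fact.
        self.blocks = [
            (data, fp or hashlib.sha1(data).hexdigest())
            for data, fp in self.blocks
        ]


disk = ToyVdisk()
disk.write(b"a" * BLOCK_SIZE)      # written before the feature was enabled
disk.fingerprint_on_write = True
disk.write(b"a" * BLOCK_SIZE)      # identical content, written afterwards

before = disk.dedup_candidates()   # only the second block has a fingerprint
disk.add_fingerprints()
after = disk.dedup_candidates()    # both blocks now have the same fingerprint
print(before, after)
```

In this toy model the two blocks hold identical data, yet they can only be matched once the older block has been fingerprinted as well.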

 

To fingerprint an entire vdisk

% vdisk_manipulator --vdisk_name=NFS:90967668 --operation=add_fingerprints

Note: The process may take a little while depending on the disk size because NOS will need to map and hash all vdisk data blocks.

 

To fingerprint a portion of a vdisk, for example the first 10 GB

% vdisk_manipulator --vdisk_name=NFS:90967668 --operation=add_fingerprints --end_offset_mb=10240
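The offset is given in megabytes, so 10 GB becomes 10 × 1024 = 10240. If you script this for several vdisks, a small helper along the following lines can do the conversion and assemble the arguments. This is only a sketch: the function name `build_fingerprint_command` is made up for illustration, it uses only the three options shown in this article, and it assumes they are written with a leading double dash.

```python
def build_fingerprint_command(vdisk_name, end_offset_gb=None):
    """Return the argument list for a fingerprinting run (hypothetical helper)."""
    args = [
        "vdisk_manipulator",
        f"--vdisk_name={vdisk_name}",
        "--operation=add_fingerprints",
    ]
    if end_offset_gb is not None:
        # The option takes megabytes: 1 GB = 1024 MB.
        args.append(f"--end_offset_mb={end_offset_gb * 1024}")
    return args


cmd = build_fingerprint_command("NFS:90967668", end_offset_gb=10)
print(" ".join(cmd))
```

The helper only builds the command; it does not run it, so you can review the result before executing anything on a Controller VM.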

 

Fingerprinting only a portion of the disk is useful when a vdisk contains the operating system (Windows, Linux, etc.) and also a large amount of non-dedupable data, such as videos, data that is already de-duplicated or compressed at the application level (Exchange, Zip files), or transactional databases. Manually fingerprinting data generates a large amount of metadata that over time may demand extra RAM for the Cassandra distributed database, so be thoughtful about fingerprinting vdisks unnecessarily. In cases where de-duplication is not ideal, consider enabling compression for the container instead.
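The trade-off can be illustrated with a small Python sketch. It is a toy model with made-up data and an assumed 4 KB block size, not a measurement of NOS behaviour: every block costs one fingerprint of metadata, but only repeated blocks give anything back.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block size (assumption)


def fingerprint_stats(blocks):
    """Return (fingerprints generated, distinct fingerprints) for the blocks."""
    fingerprints = [hashlib.sha1(block).digest() for block in blocks]
    return len(fingerprints), len(set(fingerprints))


# "OS-like" data: the same 4 blocks repeated, e.g. many clones of one image.
os_like = [bytes([i % 4]) * BLOCK_SIZE for i in range(100)]

# "Unique" data: every block differs, as with video or already-compressed files.
unique_like = [i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(100)]

os_total, os_distinct = fingerprint_stats(os_like)
uq_total, uq_distinct = fingerprint_stats(unique_like)
print(os_total, os_distinct)   # many fingerprints, few distinct: good candidate
print(uq_total, uq_distinct)   # as many distinct as total: metadata with no savings
```

Both data sets produce the same number of fingerprints, but only the first one contains duplicates that a post-process pass could collapse.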

The vdisk_manipulator also has additional options to delete fingerprints and to compress or decompress data.

Please note that if you enabled post-process de-duplication at the container level when you first created the container, all data in every vdisk is automatically fingerprinted-on-write and de-duplicated.

As you can see, the functional implementation of fingerprinting-on-write and de-duplication is per vdisk, but in Nutanix PRISM it has been exposed as a container feature for ease of use and simplicity.

 

If you want to learn more about on-disk de-duplication I suggest Nutanix 4.0 Hybrid On-Disk De-Duplication Explained or the Nutanix Bible.

 

This article was first published by Andre Leibovici (@andreleibovici) at myvirtualcloud.net