Linked by Thom Holwerda on Mon 2nd Nov 2009 23:20 UTC
Sun Solaris, OpenSolaris ZFS has received built-in deduplication. "Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data - files, blocks, or byte ranges - are checksummed using some hash function that uniquely identifies data with very high probability. Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples."
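The table described in the quote can be sketched in a few lines of Python. This is only an illustration of block-level dedup with a checksum table and reference counts, not how ZFS implements it; the `DedupStore` class, its `disk` list, and the tiny block size are all made up for the example.

```python
import hashlib


class DedupStore:
    """Toy block-level dedup table: checksum -> [storage location, refcount]."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.disk = []     # stand-in for allocated on-disk blocks
        self.table = {}    # checksum -> [index into self.disk, reference count]

    def write(self, data):
        """Store data block by block; return the block locations used."""
        locations = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            key = hashlib.sha256(block).digest()
            entry = self.table.get(key)
            if entry is None:
                # First copy: allocate new space and remember where it lives.
                self.disk.append(block)
                entry = [len(self.disk) - 1, 0]
                self.table[key] = entry
            # Every write, new or duplicate, takes one reference.
            entry[1] += 1
            locations.append(entry[0])
        return locations


store = DedupStore(block_size=4)
first = store.write(b"AAAABBBBAAAA")   # two distinct blocks, "AAAA" appears twice
second = store.write(b"AAAABBBB")      # nothing new needs to be allocated
print(len(store.disk), first, second)
```

Five blocks are written in total, but only two are ever allocated; the rest just bump a reference count.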
Thread beginning with comment 392572
Clarification requested
by aaronb on Tue 3rd Nov 2009 18:42 UTC
aaronb Member since:
2005-07-06

ZFS deduplication is synchronous....


What happens when you turn de-duplication on for an existing ZFS pool?

I am unsure whether the existing data is de-duplicated or not.

RE: Clarification requested
by dilidolo on Tue 3rd Nov 2009 22:39 in reply to "Clarification requested"
dilidolo Member since:
2006-02-02

You can back up and restore, or use ZFS send and receive, so that the existing data gets rewritten with dedup on. I believe async dedup is still in development.

We use NetApp and Data Domain. NetApp only has async, while Data Domain only has sync. Data Domain is now owned by EMC, which outbid NetApp for it, so we'll see when NetApp gets both. Hopefully Sun will beat NetApp to having both.

Score: 1

RE: Clarification requested
by Beket_ on Wed 4th Nov 2009 12:18 in reply to "Clarification requested"
Beket_ Member since:
2009-07-10

I would *expect* it to be the same as with the compression= and checksum= options.

For example, if you switch your checksumming algorithm from A to B, the old files keep using A and new files use B. Likewise, if you enable compression on a dataset that already has uncompressed files, they remain uncompressed. Only newly written data is affected.
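To make that behaviour concrete, here is a toy Python model of a property that is only consulted at write time. The `Dataset` class and its fields are invented for illustration and say nothing about ZFS internals; the last step mimics what a backup/restore or send/receive does by writing the old data again.

```python
import zlib


class Dataset:
    """Toy dataset whose compression property is consulted only at write time."""

    def __init__(self):
        self.compression = False
        self.blocks = {}   # name -> (was_compressed, stored bytes)

    def write(self, name, data):
        if self.compression:
            self.blocks[name] = (True, zlib.compress(data))
        else:
            self.blocks[name] = (False, data)

    def read(self, name):
        compressed, stored = self.blocks[name]
        return zlib.decompress(stored) if compressed else stored


ds = Dataset()
ds.write("old", b"x" * 1000)
ds.compression = True              # flipping the property rewrites nothing
ds.write("new", b"x" * 1000)
old_compressed = ds.blocks["old"][0]
new_compressed = ds.blocks["new"][0]

# Rewriting the old data is what finally applies the new setting to it.
ds.write("old", ds.read("old"))
```

If dedup follows the same pattern, existing blocks would only be deduplicated once something writes them again.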

Score: 1

Kebabbert Member since:
2007-07-27

The standard checksum algorithm for dedup is SHA256. Incidentally, Niagara SPARC computes SHA256 in on-chip hardware, achieving 41GB/sec.

You can also choose to use fletcher4, which is very fast but not cryptographically strong. That means collisions are more likely than with SHA256, although the probability is still very low.

With SHA256, the chance that two differing blocks collide is 2^(-256), which is an extremely low probability: roughly 10^(-77).
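A quick way to check the arithmetic is to work in base-10 logarithms. The sketch below assumes the hash behaves like a uniform random function; the pool size (2^35 unique blocks, i.e. 4 PiB of 128 KiB blocks) is a made-up example, and the whole-pool figure uses the usual birthday-bound approximation n^2 / 2^(bits+1).

```python
import math

HASH_BITS = 256

# Chance that two specific, different blocks share a 256-bit hash,
# assuming the hash behaves like a uniform random function.
log10_pair = -HASH_BITS * math.log10(2)


def log10_any_collision(num_blocks, bits=HASH_BITS):
    """log10 of the birthday-bound approximation n^2 / 2^(bits+1)."""
    return 2 * math.log10(num_blocks) - (bits + 1) * math.log10(2)


# Hypothetical pool: 2**35 unique blocks (4 PiB of 128 KiB blocks).
log10_pool = log10_any_collision(2 ** 35)

print(f"one pair:   about 10^{log10_pair:.1f}")
print(f"whole pool: about 10^{log10_pool:.1f}")
```

So 2^(-256) comes out at about 10^(-77), and even the chance of any collision anywhere in a pool that size stays around 10^(-56).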

However, you can request that whenever two blocks have the same hash, ZFS compares them bit for bit before sharing them. This makes dedup totally safe against collisions.
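The effect of that bit-for-bit check can be shown with a deliberately bad checksum, so that collisions are easy to produce. Everything here is hypothetical (`weak_hash`, `VerifyingDedupStore`, the `verify` flag) and reference counting is left out to keep it short; it only illustrates the idea, not ZFS's code.

```python
def weak_hash(block):
    """Deliberately poor checksum (hypothetical) so collisions are easy to produce."""
    return sum(block) % 256


class VerifyingDedupStore:
    def __init__(self, verify):
        self.verify = verify
        self.disk = []      # stand-in for allocated blocks
        self.table = {}     # checksum -> list of indexes into self.disk

    def write(self, block):
        """Return the location that a later read of this block will use."""
        candidates = self.table.setdefault(weak_hash(block), [])
        for idx in candidates:
            # Without verify, a checksum match alone is trusted.
            if not self.verify or self.disk[idx] == block:
                return idx
        # No usable match: allocate a new block and remember it.
        self.disk.append(block)
        candidates.append(len(self.disk) - 1)
        return candidates[-1]


a, b = b"ab", b"ba"          # different data, same weak_hash
trusting = VerifyingDedupStore(verify=False)
careful = VerifyingDedupStore(verify=True)
trusting_result = [trusting.write(a), trusting.write(b)]
careful_result = [careful.write(a), careful.write(b)]
print(trusting_result, careful_result)
```

Without verification the second block is silently mapped onto the first one's storage; with it, the colliding block gets its own space, at the cost of an extra read and compare on every checksum match.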

Score: 2