Linked by Thom Holwerda on Mon 2nd Nov 2009 23:20 UTC
Sun Solaris, OpenSolaris ZFS has received built-in deduplication. "Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data - files, blocks, or byte ranges - are checksummed using some hash function that uniquely identifies data with very high probability. Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples."
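The table described in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ZFS's actual implementation: the `DedupStore` class, the fixed `block_size`, the in-memory `blocks` list standing in for disk, and the choice of SHA-256 are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block-level dedup table (hypothetical, not ZFS code)."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.table = {}   # checksum -> [storage location, reference count]
        self.blocks = []  # simulated disk: one entry per unique block

    def write(self, data):
        """Store data block by block and return each block's location."""
        locations = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            entry = self.table.get(digest)
            if entry is None:
                # First copy of this block: allocate new space.
                self.blocks.append(block)
                entry = [len(self.blocks) - 1, 0]
                self.table[digest] = entry
            # New or existing, one more reference now points at the block.
            entry[1] += 1
            locations.append(entry[0])
        return locations
```

Writing the same block twice allocates space once and only bumps the reference count the second time, which is where the savings on highly replicated data come from.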
Thread beginning with comment 392521
RE[3]: I skimmed the article...
by Laurence on Tue 3rd Nov 2009 14:31 UTC in reply to "RE[2]: I skimmed the article..."
Laurence Member since:
2007-03-26

Err, no. There is no way the intro/outro scenes are going to be byte-by-byte-identical in the encoded data for different episodes even if they look identical to the eye. Even if nothing else is, the timestamp metadata for each frame is going to differ.


I guess that depends on the codec used.
I thought many MPEG codecs didn't have a timestamp as such and used a form of encoding that allowed an MPEG file (be it a video container file or an MP3 audio file) to be chopped into parts at any random point, with each of the parts still playable individually (much like the myth about a worm's ability to be chopped up with each part staying alive).

Besides, your point is only valid for shows that have a pre-opening credits teaser rather than those (typically older) shows that always opened with music and credits.

Score: 2

Tuxie Member since:
2009-04-22

Well, why don't you just try it for yourself?

diff <(head -c 100000 file1.avi) <(head -c 100000 file2.avi)

This will compare the first 100000 bytes of file1.avi and file2.avi.
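For the dedup question, what matters is less whether the two files differ somewhere and more how many whole blocks they share. A rough sketch of that check follows, assuming block-level dedup with fixed-size, aligned blocks; the `shared_blocks` helper and the 4096-byte default are hypothetical choices, not anything taken from ZFS.

```python
import hashlib


def shared_blocks(a, b, block_size=4096):
    """Count distinct aligned blocks that appear in both byte strings."""
    def digests(data):
        # Hash every aligned block; a short final block is hashed as-is.
        return {
            hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)
        }
    return len(digests(a) & digests(b))
```

To try it on real files, read each one with `open("file1.avi", "rb").read(100000)` and pass the two results in. Note that a single inserted byte shifts every later block boundary, so data that is nearly identical can still share no aligned blocks at all.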

Score: 1

Laurence Member since:
2007-03-26

Well, why don't you just try it for yourself?

diff <(head -c 100000 file1.avi) <(head -c 100000 file2.avi)

This will compare the first 100000 bytes of file1.avi and file2.avi.


Unfortunately I'm moving house in a couple of days, so my DVDs are packed away and I can't rip and diff. (Though if anyone else is able to perform this test, I'd love to see the results.)

So I'm going to have to take your word on the differences being there.



However (and going back to deduping for a moment), if I understood the article properly, then the credits don't have to be byte-for-byte exact, as the dedup looks at the bytes themselves rather than the whole MB block of bytes.

Thus there only have to be enough similar grouped bytes for a space saving to occur.

So unless MPEG compression uses some kind of random hash to encode its data, then surely the very fact that the A/V is the same (timestamp or not) must mean that there are SOME similar bytes that can be grouped and indexed?

Score: 2