Linked by Thom Holwerda on Fri 6th Nov 2009 23:42 UTC, submitted by poundsmack
Permalink for comment 393320
To read all comments associated with this story, please click here.





Member since:
2005-07-08
For ZFS to be crash-proof, certain really basic mechanisms must be implemented in a way that adheres to specifications and standards.
Which doesn't always happen in the real world, whether because the devices are buggy, because the devices are great but have a firmware bug, or because they are old and have stopped working properly. So there will be a very, very small, but real, group of users who will have problems.
The cause of the problem is not ZFS's fault, it's the hardware's fault, but the lack of a tool to fix the filesystem or recover data from it is the filesystem's fault, not the hardware's.
The whole post explains very well why ZFS reliability depends a tiny bit on hardware behaviour. That is equivalent to saying that ZFS doesn't rely solely on its own design to avoid every kind of problem. However, the "ZFS doesn't need fsck" attitude assumes that the ZFS design can avoid every kind of problem... that's somewhat contradictory.
The need for helpers to fix things is clearly there, just take a look at the last month on the ZFS lists. Here's one from 6 days ago: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033... And see also the second paragraph of http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033... which pretty much says what I just wrote but in fewer words: "I've no objection to deciding how much recovery tools are needed based on experience rather than wide-eyed kool-aid ranting or presumptions from earlier filesystems, but so far experience says the recovery work was really needed".
BTW: Linux has similar problems with problematic hardware and software, such as components not honoring write barriers.
But it has an fsck, which makes Linux (and Solaris UFS) users think "hey, something can get corrupted due to a bad disk that doesn't handle the sync cache commands correctly, but if I hit the problem at least I can try to fix it".
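To make the "sync cache" point concrete, here is a minimal Python sketch of what software can actually do to request durability. The helper name `durable_write` is made up for illustration. The point is that the last step is only a request: if the drive acknowledges the cache flush without really doing it, nothing in this code can tell.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to stable storage.

    Hypothetical helper for illustration only.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device
        # Whether the bytes are truly on stable media at this point depends
        # on the drive honoring the cache-flush command. Software has no way
        # to verify that from here.


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    durable_write(demo_path, b"hello")
    with open(demo_path, "rb") as f:
        print(f.read())
```

Any filesystem, ZFS included, ends up relying on the same promise from the device, which is why a lying disk can hurt all of them.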
Some may call the results of PSARC 2009/479 something like an fsck tool, but it isn't.
The disk state is inconsistent and the tool fixes it: it's an fsck. The uberblock is a part of the filesystem metadata, and a wrong uberblock is a filesystem inconsistency. Just because that tool is transaction-based doesn't mean it isn't fixing something.
Besides, the kind of corruption that you can hit with bad hardware is not necessarily an uberblock that points to something that hasn't been written to the disk due to bad cache handling. It can be other things. When hardware fails, the resulting behaviour is undefined.
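For anyone unsure what "a wrong uberblock" and "fixing it" mean here, a toy Python model may help. This is not ZFS code and not its on-disk format: the `Uberblock` fields, the dict standing in for a disk, and `pick_uberblock` are all assumptions made up for illustration. The sketch covers only the one failure case discussed above, where the newest uberblock points at data that never reached the disk, and the "repair" is falling back to an older transaction group.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class Uberblock:
    txg: int        # transaction group number (higher means newer)
    root_addr: int  # where the root of the tree for this txg should live
    root_sum: str   # checksum the data at root_addr is expected to have


def usable(ub: Uberblock, disk: Dict[int, bytes]) -> bool:
    """True if the block the uberblock points to exists and checksums OK."""
    data = disk.get(ub.root_addr)
    return data is not None and checksum(data) == ub.root_sum


def pick_uberblock(ubs: List[Uberblock],
                   disk: Dict[int, bytes]) -> Optional[Uberblock]:
    """Return the newest uberblock whose root is intact, else None."""
    for ub in sorted(ubs, key=lambda u: u.txg, reverse=True):
        if usable(ub, disk):
            return ub
    return None  # nothing consistent left: rollback alone cannot help


if __name__ == "__main__":
    disk = {100: b"tree@txg10", 200: b"tree@txg11"}
    ubs = [
        Uberblock(10, 100, checksum(b"tree@txg10")),
        Uberblock(11, 200, checksum(b"tree@txg11")),
        # The drive claimed to flush txg 12, but address 300 was never written.
        Uberblock(12, 300, checksum(b"tree@txg12")),
    ]
    chosen = pick_uberblock(ubs, disk)
    print("rolled back to txg", chosen.txg if chosen else None)
```

Note the `None` branch: if the damage is somewhere other than the newest few transactions, rolling back doesn't get you anywhere, which is the "it can be other things" case.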
First, you should throw the sub-sub-substandard hardware in the next available trash bin, after copying the data to a storage subsystem of better quality and wiping the old disks.
Well, one of the most common ZFS catchphrases is that you can do reliable storage with very cheap disks - so it's quite probable that users and enterprises will do exactly that, don't you think?
Edit: Trying to fix links...
Edited 2009-11-07 01:34 UTC