Linked by Thom Holwerda on Fri 6th Nov 2009 23:42 UTC, submitted by poundsmack
Thread beginning with comment 393364
To read all comments associated with this story, please click here.
The state isn't set back per block device, it's set back per pool. This recovery applies to *all* filesystems and emulated volumes in that pool.
Don't think of ZFS as a regular filesystem. It's a combination of a volume manager and a filesystem. The filesystem is just one view to a data pool via the ZFS Posix Layer. But in the same pool emulated block devices can exist via the ZFS Emulated Volume Layer.
You have to think differently in the context of ZFS. I'm not a ZFS developer, but I've been working with ZFS since 2005 ... so I have some knowledge of it.
Don't think of ZFS as a regular filesystem. It's a combination of a volume manager and a filesystem. The filesystem is just one view to a data pool via the ZFS Posix Layer. But in the same pool emulated block devices can exist via the ZFS Emulated Volume Layer.
Sun did the computing world a *huge* disservice by calling it "ZFS, the filesystem". They really should have called it what it actually is: "ZSMS, the Zettabyte Storage Management System". That would have solved so many of these kinds of issues for people.
Once you start looking at it from a storage management position, instead of an "it's just a fancy fs" position, it becomes a lot simpler to understand and work with.
Unfortunately, it's too late now, and these kinds of misunderstandings and misconceptions are just going to continue to get worse.
ZFS "the filesystem" doesn't need an fsck tool. It has features that make sure data is either written correctly, or not written at all. And if a specific block can't be read or doesn't match its checksum, then ZFS pulls it from a redundant copy.
ZFS "the storage pool manager" manages all the storage transactions. If something goes wrong, it can lead to an unimportable storage pool (i.e., all the filesystems and volumes above it are inaccessible). Previously, one had to manually muck around with dd, zdb, and voodoo to tell the storage pool to load from a previous transaction group. Now, one can do that automatically. No filesystem checking is done. It just picks an older point in time (transaction group), and loads from there. All your data (up to that point in time) is intact.
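To make the "pick an older transaction group" idea concrete, here is a rough Python sketch. It is a toy model, not ZFS code: RootRecord, make_record and pick_importable are names I made up, and SHA-256 just stands in for whatever checksum the pool uses. The only point is that recovery means walking back to the newest pool state that still verifies, with no per-filesystem checking involved. As far as I know, the automatic version on the command line is the recovery option of zpool import (zpool import -F).

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def checksum(data: bytes) -> str:
    """Stand-in checksum (assumption: SHA-256 is used here only for illustration)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RootRecord:
    """Hypothetical pool root, written once per transaction group (txg)."""
    txg: int
    payload: bytes
    stored_checksum: str


def make_record(txg: int, payload: bytes) -> RootRecord:
    """Build a record whose stored checksum matches its payload."""
    return RootRecord(txg, payload, checksum(payload))


def pick_importable(records: List[RootRecord]) -> Optional[RootRecord]:
    """Return the newest record whose payload still matches its checksum."""
    for rec in sorted(records, key=lambda r: r.txg, reverse=True):
        if checksum(rec.payload) == rec.stored_checksum:
            return rec
    return None  # nothing verifies: the pool stays unimportable


records = [make_record(t, f"pool state at txg {t}".encode()) for t in (101, 102, 103)]
records[-1].payload = b"torn write"  # simulate damage to the newest txg

chosen = pick_importable(records)
print(chosen.txg)  # 102
```

Everything written after the chosen transaction group is discarded, which is why the data is intact only up to that point in time.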
Once the pool is imported, and all the filesystems and volumes are available, you have the option of running a background scrub on the pool (the entire pool, not individual filesystems and volumes) to make sure that the data is intact. The scrub will verify the checksum of every single block in the pool, and repair any that are bad via redundant copies.
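A scrub can be pictured roughly like the Python sketch below. Again this is a toy model under my own assumptions: Block, make_block and scrub are made-up names, each block simply carries a list of redundant copies, and SHA-256 stands in for the real checksum. On a real system the work is started with zpool scrub and the results show up in zpool status.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


def checksum(data: bytes) -> str:
    """Stand-in checksum (assumption: SHA-256 is used here only for illustration)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Block:
    """Hypothetical block: one expected checksum, several redundant copies."""
    expected: str
    copies: List[bytes]


def make_block(data: bytes, n_copies: int = 2) -> Block:
    """Create a block with n_copies identical, valid copies."""
    return Block(checksum(data), [data for _ in range(n_copies)])


def scrub(pool: List[Block]) -> Tuple[int, int]:
    """Verify every copy of every block; overwrite bad copies from a good one.

    Returns (repaired_copies, unrecoverable_blocks).
    """
    repaired = 0
    unrecoverable = 0
    for block in pool:
        # Find any copy that still matches the expected checksum.
        good = next((c for c in block.copies if checksum(c) == block.expected), None)
        if good is None:
            unrecoverable += 1  # no valid copy left to repair from
            continue
        for i, copy in enumerate(block.copies):
            if checksum(copy) != block.expected:
                block.copies[i] = good
                repaired += 1
    return repaired, unrecoverable


pool = [make_block(b"alpha"), make_block(b"beta"), make_block(b"gamma")]
pool[1].copies[0] = b"bxta"     # silent corruption of one copy
pool[2].copies = [b"?", b"!"]   # every copy of this block is bad

result = scrub(pool)
print(result)  # (1, 1)
```

Note that the walk covers the whole pool and never looks at which filesystem or volume a block belongs to, and that a block with no valid copy left can only be reported, not repaired.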
Thus, a filesystem-specific tool that checks a single filesystem's on-disk metadata (aka fsck) is not needed. Tools are already available that give a better end result ... just from a different direction.
Edited 2009-11-08 03:25 UTC







It sounds like this situation calls for a consistent per-block-device state that can be rolled back. So transactional at the block device level, not the filesystem level.
So the system can detect that one block device (hard disk, partition, SSD, USB, whatever) didn't save the latest state, and can either roll back to the last transaction that all block devices have or ask the user if that is what they want.
Maybe it already has that; I haven't used ZFS yet and don't know its internals.
From the article and comments I think you could conclude something similar might be going on, but the author isn't a filesystem designer.
I'm not such a person either. ;-)