“There is a discussion at osnews.com about a simple question: ‘Should ZFS Have a fsck Tool?’. The answer is simple: No. I could stop now, as this answer is pretty obvious when you have worked a while with ZFS, but I want to explain my position. And I want to ask a different question at the end.”
To allow ZFS to be crash-proof, certain really basic mechanisms must be implemented in a way that adheres to specifications and standards.
That doesn’t always happen in the real world, be it because the devices are buggy, or because the devices are great but have a firmware bug, or because they are old and stop working properly. So there will be a very small, but real, group of users who will have problems.
The cause of the problem is not ZFS’s fault, it’s the hardware’s fault; but the lack of a tool to fix the filesystem or recover data from it is the filesystem’s fault, not the hardware’s.
The whole post explains very well that ZFS reliability depends a tiny bit on hardware behaviour, which is equivalent to saying that ZFS doesn’t rely entirely on its own design to avoid every kind of problem. However, the “ZFS doesn’t need fsck” attitude assumes that the ZFS design can avoid every kind of problem… that’s somewhat contradictory.
The need for helpers to fix things is clearly there; just take a look at the last month on the ZFS lists. Here’s http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033… from 6 days ago. And also the second paragraph of http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033… which pretty much says what I just wrote, but in fewer words: “I’ve no objection to deciding how much recovery tools are needed based on experience rather than wide-eyed kool-aid ranting or presumptions from earlier filesystems, but so far experience says the recovery work was really needed”
BTW: Linux has similar problems with problematic hardware and software, such as components not honoring write barriers.
But it has an fsck, which makes Linux (and Solaris UFS) users think “hey, something can get corrupted due to a bad disk that doesn’t handle the sync cache commands correctly, but if I hit the problem at least I can try to fix it”
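As an aside, the barrier behaviour on Linux is at least tunable per mount (a hedged sketch; defaults differ between ext3 and ext4 and between kernel versions, and the mount point is just an example):

    # make sure write barriers are enabled on an ext3 mount
    # (ext3 long shipped with barriers off; ext4 enables them by default)
    mount -o remount,barrier=1 /data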
Some may call the result of PSARC 2009/479 something like an fsck tool, but it isn’t.
The disk state is inconsistent and the tool fixes it: that’s an fsck. The uberblock is part of the filesystem metadata, and a wrong uberblock is a filesystem inconsistency. Just because the tool is transaction-based doesn’t mean it isn’t fixing something.
Besides, the kind of corruption you can hit with bad hardware is not necessarily an uberblock that points to something that was never written to the disk due to bad cache handling; it can be other things. When hardware fails, the resulting behaviour is undefined.
First of all, you should throw the sub-sub-substandard hardware into the nearest trash bin, after copying the data to a storage subsystem of better quality and wiping the old disks.
Well, one of the most common ZFS catchphrases is that you can do reliable storage with very cheap disks – so it’s quite probable that users and enterprises will do exactly that, don’t you think?
What would also be interesting is a sort of HCL, or a set of criteria, with which the average user can decide which of his hard disks he should not use ZFS on. It’s not like there are that many hard-disk manufacturers, so how can we decide which of their hard disks we can trust our data to?
The point is: you shouldn’t use such devices with other filesystems either. Just say NO to such disks. With ZFS you just notice those errors. Since I’m running regular scrubs over the datasets on my home fileservers, I’m pretty disappointed by the quality of SOHO drives.
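Running those regular scrubs needs nothing more than a cron entry (a sketch; the pool name “tank” is just an example):

    # kick off a background scrub every Sunday at 03:00
    0 3 * * 0 /usr/sbin/zpool scrub tank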
BTW: When you are using disks directly over SATA or SAS, you won’t see such problems. Those disks are reasonably free of the biggest mistakes. The problems start when you have cheap SATA/PATA-to-FireWire or USB converters.
Ah, but since nobody tells us how to recognize ‘those’ disks, you can say NO as often as you want without the slightest effect. So unless someone comes up with an HCL or a sort of product matrix which tells you how to recognize ‘bad’ disks, there will be a need for a way to restore broken ZFS filesystems to a usable state.
The PSARC case mentioned in the linked text is the method to get around such problems, as it rolls back to a consistent state by simply importing the pool at an earlier transaction group number …
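On builds containing the PSARC 2009/479 putback, that rollback is exposed directly through zpool import (a sketch, assuming a pool named “tank”):

    # a normal import fails because the newest uberblock points at
    # blocks the disk never actually wrote
    zpool import tank

    # -F rewinds to the last importable transaction group,
    # discarding the last few seconds of writes
    zpool import -F tank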
OK, but my point is that the PSARC (whatever that may be) _is_ actually what the original poster was asking for, i.e. a mechanism to allow unimportable pools to be imported.
Despite the fact that he uses a term most of you don’t agree with, you implicitly agree with his original point when you say that this PSARC allows you to do just that. The OP probably just didn’t know about it.
I see a clear distinction between a tool that checks the metadata of a filesystem and fixes it (the fsck) and a method to jump back to a slightly older state. As far as I understand the OP’s article, he explicitly thought of a tool.
In my opinion the filesystem shouldn’t fix any data in such a situation and should just fall back to an older state, as it’s absolutely unknown what the on-disk state really is, with ZFS or any other filesystem.
Ah … one additional point: forget “broken ZFS filesystems”. The filesystem isn’t unimportable; it’s the pool that resists import. A pool can contain several filesystems and emulated volumes.
Many concepts in ZFS are pretty different from any other filesystem. I think this is the problem when people talk about ZFS and try to impose concepts from other filesystems on it.
For example, the transaction rollback doesn’t check and doesn’t fix anything. It just imports the pool at a different transaction group number; that’s pretty much the complete story. If you are still paranoid, you can scrub your pool afterwards and check whether your data is correct. But you don’t have to.
Both are quite different from the concept of an fsck. The transaction rollback does nothing that an fsck would do, and the scrub goes much further than an fsck, as it verifies the checksums of all blocks. Of course you could call it fsck, but it has nothing in common with an fsck for ext4 or XFS.
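To make the distinction concrete (a sketch; “tank” is a placeholder pool name):

    # walk every allocated block in the pool and verify its checksum,
    # repairing bad copies from redundancy where possible
    zpool scrub tank

    # watch the progress and list any files with unrecoverable errors
    zpool status -v tank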
Regarding the “cleaning up after bugs”: I’m not sure an fsck is the correct place for such logic; perhaps it’s better to integrate code that can live with the buggy state and rewrite it correctly as soon as the data changes. The other interesting point: what if the state on disk is correct, but it’s read incorrectly? How do you repair such a problem with an fsck? As the logic of the fsck is similar to the code that reads the data, the same problem would obviously exist in both parts.
For further explanation, I’ll just cite the ZFS FAQ:
ZFS protects the data against almost all failure modes; there is just one left: components lying about the sequence and state of write operations. No filesystem can fully defend against such problems. The advantage of ZFS in conjunction with the mentioned PSARC putback is that at least you can jump back to a state that’s consistent and has validated integrity. From my point of view, that’s much more valuable than pressing the data into a form the filesystem expects, where after an fsck some blocks are old, some are new, and some are deleted. In the end the data is the important stuff, not the filesystem. The filesystem is just a helper construct.
fsck only checks the metadata, but it doesn’t check the actual data, right?
Exactly. The filesystem knows about its own structure, but a block of random noise is as valid as a Word document as long as it fits into the structures of the filesystem. As there is no additional metadata to check the contents of a block, you can’t say whether the data is valid or not. ZFS has its checksums for this task, to check whether a block in a filesystem or an emulated volume is correct.
A filesystem check just checks the filesystem … not the data in the entities (files) managed by the filesystem.
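That per-block checksumming is an ordinary dataset property, so you can inspect or strengthen it (a sketch; the dataset name is an example):

    # show which checksum algorithm a dataset currently uses
    zfs get checksum tank/home

    # use sha256 checksums for all blocks written from now on
    zfs set checksum=sha256 tank/home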
… is that ZFS needs a recovery tool, but that rather than try to repair damage, it should instead find the last valid snapshot and make that snapshot be the current state. However, to do that the tool needs to be able to do enough consistency checks to determine which snapshots are valid. So it does file system checking, and it does recovery. Sounds like fsck. The only distinction is that it doesn’t try to rewrite a bad state into a consistent state.
It doesn’t check. But you can, though that’s a different feature of ZFS: scrubbing. Such a check would take a long time. The rollback just falls back to another transaction group number and tries to import the pool. In essence it does the same as a normal import, just via an earlier uberblock instead of the latest one.
It sounds like this situation asks for a per-block-device consistent state that can be rolled back. So transactional at the block-device level, not the filesystem level.
So the system can detect that one block device (hard disk, partition, SSD, USB, whatever) didn’t save the latest state, but can roll back to the previous transaction that all block devices have, or ask the user if that is what they want.
Maybe it already has that, I’ve not yet used ZFS and/or don’t know the internals of ZFS.
From the article and comments I think you could conclude something similar might be going on, but the author isn’t a filesystem designer.
I’m not such a person either. 😉
The state isn’t set back per block device; it’s set back per pool. The recovery applies to *all* filesystems and emulated volumes in the pool.
Don’t think of ZFS as a regular filesystem. It’s a combination of a volume manager and a filesystem. The filesystem is just one view of a data pool, via the ZFS POSIX Layer. But the same pool can also contain emulated block devices, via the ZFS Emulated Volume layer.
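A minimal sketch of that layering (device, pool, and dataset names are examples):

    # one pool built from a mirrored pair of disks
    zpool create tank mirror c0t0d0 c0t1d0

    # a POSIX filesystem view into the pool (ZFS POSIX Layer)
    zfs create tank/home

    # an emulated 10 GB block device in the same pool (a zvol)
    zfs create -V 10g tank/vol1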
You have to think differently in the context of ZFS. I’m not a ZFS developer, but I’ve been working with ZFS since 2005 … so I have some knowledge about it.
Sun did the computing world a *huge* disservice by calling it “ZFS, the filesystem”. They really should have called it what it actually is: “ZSMS, the Zettabyte Storage Management System”. That would have solved so many of these kinds of issues for people.
Once you start looking at it from a storage-management position, instead of an “it’s just a fancy fs” position, it becomes a lot simpler to understand and work with.
Unfortunately, it’s too late now, and these kinds of misunderstandings and misconceptions are just going to continue to get worse.
ZFS “the filesystem” doesn’t need an fsck tool. It has features that make sure data is either written correctly, or not written at all. And if a specific block can’t be read or doesn’t match the checksum, then it pulls it from a different copy.
ZFS “the storage pool manager” manages all the storage transactions. If something goes wrong, it can lead to an unimportable storage pool (i.e., all the filesystems and volumes above it are inaccessible). Previously, one had to manually muck around with dd, zdb, and voodoo to tell the storage pool to load from a previous transaction group. Now, one can do that automatically. No filesystem checking is done. It just picks an older point in time (transaction group) and loads from there. All your data (up to that point in time) is intact.
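The voodoo mostly meant poking at labels and uberblocks by hand with zdb (a sketch; device and pool names are examples, and the output format varies between builds):

    # dump a vdev's labels, each holding an array of past uberblocks
    zdb -l /dev/dsk/c0t0d0s0

    # show a pool's active uberblock: its transaction group and timestamp
    zdb -u tank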
Once the pool is imported, and all the filesystems and volumes are available, you have the option of running a background scrub on the pool (the entire pool, not individual filesystems and volumes) to make sure that the data is intact. The scrub will compare the checksums on every single block in the pool, and repair any that are bad via redundant copies.
Thus, a filesystem-specific tool that checks a single filesystem’s metadata on disk (aka fsck) is not needed. Tools are already available that give a better end result … just from a different direction.
Another way to look at it is to ask whether or not LVM needs an fsck, since that’s the layer in the ZFS storage system that’s being worked on.
ZFS filesystems themselves rarely need fixing (I’ve never come across one, and haven’t read about any online, but I’ve only been using ZFS for a year). They take care of that automatically using self-healing via checksums and redundancy, transactions, and copy-on-write.
The storage pool could become unimportable, but was usually fixable via arcane voodoo magic commands. Now, it’s made a lot simpler (via the code implemented in the PSARC mentioned above — PSARC is like a support case, or bug report, in Sun-speak).
There are tools for fixing LVM, though. And now there are tools to fix things at the storage pool layer in ZFS.
Asking for “fsck” doesn’t make sense, though, as that’s the wrong layer in the stack.
PSARC has nothing to do with support cases or bug reports. PSARC stands for Platform Software Architecture Review Committee. That’s a group of people in the OpenSolaris design process who discuss and vote on new additions to Solaris whenever a change touches external interfaces or opens new ones (ABIs, command-line commands, etc.). It looks bureaucratic at first, but in the end it’s responsible for things like the effectiveness of the binary compatibility guarantee, and for systemic features like the tight coupling of containers, ZFS snapshots, and the new networking stack aka Crossbow.
Yeah. The problem/blessing with ZFS is that it detects many more errors than other filesystems, as it checks integrity end to end. ZFS being more sensitive than other filesystems is a good thing. Which other filesystem could have caught this?
http://blogs.sun.com/elowe/entry/zfs_saves_the_day_ta
And the problem was not ZFS’s fault; ZFS is just the messenger. Don’t shoot the messenger.
“One: The user has never tried another filesystem that tests for end-to-end data integrity, so ZFS notices more problems, and sooner.
Two: If you lost data with another filesystem, you may have overlooked it and blamed the OS or the application, instead of the inexpensive hardware.”
[quote]First of all, you should throw the sub-sub-substandard hardware into the nearest trash bin, after copying the data to a storage subsystem of better quality and wiping the old disks.[/quote]
I thought we were done with Sun hardware snobbery.
Obviously, if we’re talking mission critical data, a coupon run to Best Buy for some deals on disks isn’t going to cut it.
But the rest of us? Is your argument that we should buy only SAS drives or 10,000 RPM enterprise-grade SATA drives with redundant battery-backed controllers to run our home rigs, media servers, and laptops? That’s not a solution. That’s a cop-out.
Tony, this has nothing to do with Sun hardware snobbery … I’m using several el-cheapo disks with plain standard onboard/PCIe SATA controllers.
We talked about failure modes that are the consequence of buggy implementations. And as data is the most important stuff you have, it’s not snobbery, it’s a necessity to throw such components into the next trash bin … like a floppy disk with a fingerprint on it. And this is even more important when you don’t have a filesystem with end-to-end integrity like ZFS, as you can’t check whether your data is still correct.
I threw out every USB drive from my setup: it’s plain eSATA for all my external enclosure needs (no protocol conversion, fewer problems). I’m just using USB hard drives as bigger floppies … but I would never place important data on them alone.
One additional point: one ZFS developer once said that he would like to integrate a function that rejects devices with such bugs in their implementation. The problem is just that it’s hard to detect them programmatically. I have the same opinion. Data is the most precious asset I have on my server; everything else is replaceable.