Should ZFS Have a fsck Tool?

One of the advertised advantages of ZFS is that it doesn’t need a fsck; replication, self-healing and scrubbing are supposed to be a much better alternative. After a few years of ZFS in the real world, can we say it was the right decision? The reports on the mailing lists are a good indicator of what actually happens out there, and it appears that, once again, reality beats theory. The author analyzes the implications of not having a fsck tool and tries to explain why he thinks Sun will end up adding one.

ZFS doesn’t need a fsck tool because of the way it works. It is designed to always have a valid on-disk structure. It doesn’t matter when you pull the plug on your computer; ZFS guarantees that the file system structure will always be valid. This is not journaling – a journaling file system can and will leave the file system structure in an inconsistent state, but it keeps a journal of operations that allows you to repair the inconsistency. ZFS will never be inconsistent in the first place, so it doesn’t need a journal (and it gains a bit of performance thanks to that).

How does ZFS achieve that? Thanks to COW (copy-on-write) and transactional behaviour. When the file system needs to write something to disk, it never overwrites parts of the disk that are in use; instead, it writes to a free area of the disk. Only after that are the file system “pointers” that point to the old data modified to point to the new data, and that update is atomic: either the pointer is modified or it isn’t. So if you pull the plug, the pointers will point either to the old data or to the new data. Only one of those two states is possible, and since both are consistent, there’s no need for a fsck tool.
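To make the idea concrete, here is a toy model of a copy-on-write update in Python (purely illustrative; the names and structures are made up and have nothing to do with the real ZFS code):

    # Toy copy-on-write store: live data is reachable only through a single
    # root pointer, and updates never overwrite blocks that are in use.
    class CowStore:
        def __init__(self, data):
            self.blocks = {0: data}   # block address -> contents
            self.root = 0             # the one pointer readers follow
            self.next_addr = 1

        def read(self):
            return self.blocks[self.root]

        def write(self, new_data):
            # 1. Write the new version to unused space; the old block is untouched.
            addr = self.next_addr
            self.next_addr += 1
            self.blocks[addr] = new_data
            # 2. Flip the root pointer in one atomic step. Pull the plug before
            #    this line and readers still see the old, consistent data; after
            #    it, they see the new, consistent data. There is no in-between
            #    state, so there is nothing for a fsck to repair.
            self.root = addr

    store = CowStore("old contents")
    store.write("new contents")
    print(store.read())   # -> "new contents"

The whole trick is that step 2 is a single pointer update: whichever side of it a crash lands on, the tree reachable from the root is complete.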

Data loss is not always caused by power failures – hardware errors happen too. All storage devices fail at some point, and when that happens and the file system is no longer able to read the data, you would traditionally use a fsck tool to try to recover as much data as possible. What does ZFS do to avoid the need for fsck in those cases? Checksumming and replication. ZFS checksums everything, so if there’s corruption somewhere, ZFS detects it easily. What about replication? ZFS can replicate both data and metadata in different places in the storage pool, and when it detects that a block (be it data or metadata) is corrupted, it simply restores it from one of the available replicas – a process called “self-healing”.
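A rough sketch of that read path, again as a toy Python model rather than anything resembling the real ZFS code (the function names are invented for illustration):

    import hashlib

    def checksum(data: bytes) -> str:
        # Real ZFS uses checksums such as fletcher4 or SHA-256; SHA-256 here for brevity.
        return hashlib.sha256(data).hexdigest()

    def self_healing_read(replicas, expected):
        """replicas: the stored copies of one block (data or metadata).
        Returns good data and rewrites any copy whose checksum doesn't match."""
        good = next((r for r in replicas if checksum(r) == expected), None)
        if good is None:
            raise IOError("all replicas corrupted: unrecoverable without a backup")
        for i, r in enumerate(replicas):
            if checksum(r) != expected:
                replicas[i] = good      # "self-healing": repair the bad copy
        return good

    block = b"important data"
    copies = [b"important data", b"imp0rtant data"]   # second copy is corrupt
    print(self_healing_read(copies, checksum(block)))
    print(copies)   # both copies now hold the good data

Note the error case: when every copy fails its checksum, there is nothing left to heal from, which is exactly the situation the rest of this article is about.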

Data replication means dedicating half of your storage space to the replicas. Servers don’t have much of a problem with this: they can just buy more disks. But desktop users don’t like it – most of them will choose more storage capacity over reliability. Their data is often not that critical (they aren’t risking money the way enterprises are), and nowadays an increasingly large portion of it lives in the cloud. And while disks can fail, they usually work reliably for years (and no, it’s not really getting worse – a disk vendor whose disks lost data too easily would face bankruptcy). It’s an even bigger issue for laptops, where you can’t add more disks. Even in those cases, ZFS helps: metadata (the file system’s internal structures – the directory structure, inodes, etc.) is duplicated even on single disks, as sketched below. So you can lose the data in a file, but the file system structure stays safe.
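The metadata duplication can be pictured like this: a toy sketch in which metadata gets two copies placed in different regions of a single disk, while ordinary file data gets one (real ZFS calls these extra copies “ditto blocks”, and the number of data copies can also be raised per dataset with the copies property):

    DISK_SIZE = 1000   # toy disk with 1000 block slots

    def allocate(disk, region):
        # Pick the first free slot in the chosen half of the disk, so the
        # two copies of a block end up far apart from each other.
        start = 0 if region == 0 else DISK_SIZE // 2
        for offset in range(start, DISK_SIZE):
            if offset not in disk:
                return offset
        raise IOError("disk full")

    def write_block(disk, data, is_metadata):
        copies = 2 if is_metadata else 1
        offsets = []
        for i in range(copies):
            off = allocate(disk, region=i)
            disk[off] = data
            offsets.append(off)
        return offsets

    disk = {}
    print(write_block(disk, "directory entry", is_metadata=True))    # two copies
    print(write_block(disk, "file contents", is_metadata=False))     # one copy

A localized media error that eats one copy of the directory entry still leaves the other one readable; the file contents have no such safety net unless the user asks for it.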

All this means that ZFS can survive most hardware/software problems (it doesn’t remove the need for backups: your entire disk pool can still be ruined by lightning). In other words, users don’t really need a fsck tool. Self-healing is equivalent to a fsck repair operation. ZFS does have a “scrub” command, which checks all the checksums in the file system and self-heals the corrupted blocks. In practice, it’s the same process as a fsck tool that detects an inconsistency and tries to fix it, except that ZFS self-healing is more powerful because it doesn’t need to “guess” how file system structures are supposed to look; it just checks checksums and rewrites entire blocks without caring what’s in them. What about corruption caused by bugs in the file system code? Developers have ruled out that possibility: the design of ZFS makes such bugs much harder to introduce and, at the same time, much easier to detect. The ZFS developers say that if such a bug appeared they would fix it very quickly, that such an event would be very rare, and that only the first person to hit and report the problem would actually suffer from it.
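Coming back to the scrub for a moment: conceptually it is just a walk over every allocated block, verifying and repairing as it goes. A minimal sketch, with toy structures standing in for the pool (on a real system this is what zpool scrub <pool> starts in the background, with the results reported by zpool status):

    import hashlib

    def checksum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def scrub(pool):
        """pool: block id -> (expected checksum, list of stored replicas).
        Verifies every replica of every block and heals the bad ones."""
        repaired, unrecoverable = 0, 0
        for blk_id, (expected, replicas) in pool.items():
            good = next((r for r in replicas if checksum(r) == expected), None)
            if good is None:
                unrecoverable += 1          # no valid copy left anywhere
                continue
            for i, r in enumerate(replicas):
                if checksum(r) != expected:
                    replicas[i] = good      # self-heal this replica
                    repaired += 1
        return repaired, unrecoverable

    pool = {
        "metadata:/home": (checksum(b"dir entry"), [b"dir entry", b"d1r entry"]),
        "data:file.txt":  (checksum(b"abc"), [b"abc"]),
    }
    print(scrub(pool))   # -> (1, 0): one replica repaired, nothing lost

Unlike a classic fsck, nothing here reasons about what a directory or an inode is supposed to look like; the checksum alone decides whether a block is good.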

When you look at ZFS from this point of view, the fsck clearly becomes unnecessary. And not just unnecessary: it’s an even worse alternative than scrubbing plus self-healing. So why care? It’s a thing of the past, right?

But what would happen if you found a ZFS pool so corrupted that it can’t even be mounted to be scrubbed? “That can’t happen with ZFS”, you say. Well, that’s true: it’s theoretically not possible. But what if corruption does happen? What if you face a situation where you need a fsck? You would have a big problem: a corrupted file system and no tools to fix it! And it turns out that those situations do exist in the real world.

The proof is easily found in the ZFS mailing lists: there are people reporting the loss of entire pools, or discussing it: see this, or this, or this, or this, or this, or this. There are error messages whose job is to tell you that you must “Destroy and re-create the pool from a backup source” (a string with 541 Google search results), and bugs filed for currently unrecoverable situations (see also the “Related Bugs” in that link). There are cases where fixing the file system requires restoring copies of the uberblock… using dd!
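That last trick deserves a closer look, because it shows what a “manual fsck” currently looks like for ZFS: you hunt through the raw device for older copies of the uberblock (the root of the whole pool) and graft one back in by hand. The hunting half can be sketched like this; the magic number and field order follow the published ZFS on-disk format document, but treat it strictly as an illustration, not a recovery tool:

    import struct

    UBERBLOCK_MAGIC = 0x00bab10c   # "oo-ba-bloc", per the on-disk format spec

    def find_uberblocks(image_path, step=1024):
        """Scan a raw vdev image for uberblock candidates by magic number.
        Assumes the documented layout: 8-byte magic, then version, then txg."""
        with open(image_path, "rb") as f:
            data = f.read()
        hits = []
        for off in range(0, len(data) - 24, step):
            for fmt in ("<QQQ", ">QQQ"):                # try both endiannesses
                magic, version, txg = struct.unpack_from(fmt, data, off)
                if magic == UBERBLOCK_MAGIC:
                    hits.append((off, version, txg))
                    break
        # The copy with the highest transaction group (txg) is the newest root;
        # an older one may still describe a fully consistent tree if the newest
        # is the one that got damaged.
        return sorted(hits, key=lambda h: h[2], reverse=True)

That users have to resort to dd and hand-picked uberblocks is precisely the kind of job a supported repair tool should be doing.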

It’s not that ZFS is buggy, badly designed or unsafe – setting aside the fact that any new file system is always somewhat unsafe, ZFS is no more dangerous than your average Ext3/4 or HFS+ file system. In fact, it’s much safer, thanks to the integrated RAID and self-healing. Except for one thing: those file systems have a fsck tool. Their users may run a greater risk of getting their file system corrupted, but when that happens, there’s always a tool that fixes the problem. I’ve had integrity problems with Ext3 a few times (Ext3 doesn’t turn barriers on by default!), but a fsck run always fixed them. Solaris UFS users have had similar problems. ZFS, however, leaves users who face corruption out in the cold. Its designers assumed that corruption couldn’t happen. Well, guess what – corruption is happening! After a few years, we can say that this assumption was wrong (which doesn’t mean the rest of their assumptions weren’t correct, even brilliant!).

In a thread we can find Jeff Bonwick, the leading Solaris hacker and main ZFS designer, explaining a corruption problem:

“Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is fraud to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.”

Well, ZFS was supposed to allow the use of “commodity hardware”, because its design was supposed to work around all the bad things that crappy disks do. But in that message we see cheap, crappy hardware rendering the ZFS anti-corruption design useless. So you aren’t 100% safe on every kind of commodity hardware: if you want reliability, you need hardware that’s not quite so crappy. But even decent hardware can have bugs and fail in mysterious ways. If cheap hardware was able to corrupt a ZFS file system, why should we expect that expensive hardware won’t have firmware bugs or hardware failures that trigger the same kind of corruption, or entirely new kinds?

But (I’ll repeat it again) this is not a design failure of ZFS – not at all! Other file systems have the same problem, or worse. Jeff is right up to a point – bad disks suck. But bad disks are not a new thing: they exist, and they aren’t going away. The current “old” file systems seem to cope with them quite well – they need a fsck when corruption is found, and that fixes the problem most of the time. The civil society Jeff dreams we should live in doesn’t exist, and never has. The Sun/SPARC world has very high quality requirements, but expecting that kind of quality from PC-class hardware is like believing in unicorns.

This means that if you design something for PC hardware, you need to at least acknowledge that crappy hardware exists and is going to make your software abstractions leaky. A good design should not exclude worst-case scenarios. For ZFS, this means acknowledging that disks are going to break things and corrupt data in ways the ZFS design cannot avoid; and when that happens, users will want a good fsck tool to fix the mess or recover the data. It’s somewhat contradictory that the ZFS developers worked really hard to design all those anti-corruption mechanisms, yet left uncovered the extreme corruption cases where a fsck is necessary. There’s an interesting case of ZFS corruption on a virtualized Solaris guest: VirtualBox didn’t honor the sync-cache commands of the virtualized drive. In a virtualized world, ZFS doesn’t only have to cope with hardware, but also with the software and the file system running in the underlying host. In those scenarios, ZFS can only be as safe as the host file system, which means that, again, users will face corruption cases that require a fsck.

I think that pressure from customers will force Sun/Oracle to develop some kind of fsck tool. The interesting thing is that the same happened to other file systems in the past. Theoretically, a journaling file system like XFS can’t get corrupted (that’s why journaling was invented in the first place – to avoid the fsck process), and when XFS was released, SGI used the “XFS doesn’t need fsck” slogan as well. The real world, however, doesn’t care about slogans: customers asked for a repair tool, and SGI had to write one. The same thing is said to have happened to NetApp with WAFL: they said they didn’t need a fsck, but ended up writing one anyway.

Wouldn’t it be much better to accept the fact that users are going to need a fsck tool, and design the file system to support one from day one instead of hacking it in later? As it happens, we have a recent example of exactly that: Btrfs. In the words of its creator: “In general, when I have to decide between fsck and a feature, I’m going to pick fsck. The features are much more fun, but fsck is one of the main motivations for doing this work.” Not only does Btrfs have a fsck tool, the file system was explicitly designed to make the fsck’s job more powerful, with some interesting disk-format choices to make that job easier. It has something called back references: extents and other parts of the file system carry a “back reference” to the structures that reference them. This makes the fsck more reliable at the expense of some performance (you can disable them), and it has also made other things easier or even possible to support, like shrinking a file system. My opinion? This approach seems more trustworthy than pretending that corruption cannot exist.
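The back-reference idea is easy to picture with a toy model (made-up Python structures; real Btrfs keeps this information in its extent tree):

    # Files reference extents (forward refs), and every extent also records
    # who references it (back refs). A checker can validate the two views
    # against each other instead of guessing from one side only.
    files   = {"a.txt": ["extent1"], "b.txt": ["extent1", "extent2"]}
    extents = {"extent1": {"back_refs": ["a.txt", "b.txt"]},
               "extent2": {"back_refs": ["b.txt"]},
               "extent3": {"back_refs": ["c.txt"]}}   # c.txt is gone: damage!

    def check(files, extents):
        problems = []
        # Forward direction: every extent a file points at must point back at it.
        for name, exts in files.items():
            for e in exts:
                if name not in extents.get(e, {}).get("back_refs", []):
                    problems.append(f"{name} -> {e}: missing back reference")
        # Backward direction: every back reference must have a live owner.
        for e, info in extents.items():
            for owner in info["back_refs"]:
                if e not in files.get(owner, []):
                    problems.append(f"{e} claims owner {owner}, which doesn't reference it")
        return problems

    print(check(files, extents))
    # -> ["extent3 claims owner c.txt, which doesn't reference it"]

With both directions recorded on disk, a checker that finds half of a broken link knows what the other half was supposed to be, which is exactly what makes repair less of a guessing game.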

Lesson to learn: all file systems need fsck tools. ZFS tried to avoid one, just like others did in the past, but at some point its developers will have to accept that there are always obscure cases where such tools are needed, and they will build new – and probably great – ones. Worst-case scenarios always happen, and users will always want a fsck tool when those obscure cases hit. Especially enterprise users!

About the author:

Pobrecito Hablador, an anonymous open source hobbyist.
