“SpadFS is a new filesystem that I design and develop as my PhD thesis. It is an attempt to bring features of advanced filesystems (crash recovery, fast directories) and good performance without increasing code complexity too much. Uses crash counts instead of journaling (because journaling is too complex and bug-prone) and uses hash instead of btrees for directory organization.”
I took a look at this project maybe a week ago and was surprised to see it. Seems to me this project has more potential than any other current Linux filesystem near release today (I don’t consider Reiser4 released until I at least know it will patch into a current kernel with a minimum of fuss and no data loss or major slowdowns, which is not the case today, sadly).
I hope some of the major Linux FS developers realize the promise here and join in. Extending Ext3 to Ext4 doesn’t make any sense to me; Ext is running into the law of diminishing returns.
What’s missing (or I didn’t see it) is an examination of the performance of this filesystem. It needs to be on par with current filesystems to be meaningful, and at least as reliable, if not more so.
I hope some of the major Linux FS developers realize the promise here and join in. Extending Ext3 to Ext4 doesn’t make any sense to me; Ext is running into the law of diminishing returns.
No, it doesn’t make sense to me either. I understand the arguments of backwards compatibility etc. etc. regarding the ext line of filesystems, but still, it’s not pushing filesystems forwards in terms of features or performance. A lot of what goes on around ext and prolonging its life just seems to be plain politics to me.
Reiser4 has completely bogged down, no one (commercially anyway) seems to want XFS, and people have neglected JFS quite badly. Unwisely, I think. It’s high time people came up with some new ideas on the landscape, and this kind of stuff is good to see.
In an article on ext3 development on IBM’s developer site, I read that the major factors behind continued ext development are a proven codebase (rock stable), known properties, easy migration to new versions, and a feeling of confidence among CIOs.
People will choose something they know unless they are overwhelmed by direct, undeniable advantages. As long as shortcomings keep being addressed within the framework of ext3, the show goes on.
That thinking worked even back when a young ext3 (which sucked more than it does today) had to compete with the freshly merged journaling filesystems.
That’s the POV of corporate devs. The community is free to be interested in other innovations; that’s the beauty of OSS.
People will choose something they know unless they are overwhelmed by direct, undeniable advantages. As long as shortcomings keep being addressed within the framework of ext3, the show goes on.
That thinking worked even back when a young ext3 (which sucked more than it does today) had to compete with the freshly merged journaling filesystems.
No, please, don’t touch that key again.
Like it or not, ext3 has a lot of users because it is a good fs.
Sure, there are cases where other filesystems perform better, or filesystems that have newer (and better) designs or better code, and so on.
But ext3 has a very well-balanced mix of robustness/performance/reliability and recoverability from hw failure that makes it a good choice in a lot of circumstances.
Like it or not, ext3 has a lot of users because it is a good fs.
No. It’s because it’s been made the default by a few distros. It doesn’t mean it’s better.
But ext3 has a very well-balanced mix of robustness/performance/reliability and recoverability from hw failure…
Please, DO NOT bring up the subject of ext3 being supposedly better with hardware failure ever again. Ever. It’s utterly meaningless. It’s a subject that has threads a million miles long on Gentoo forums and elsewhere, with people complaining like buggery that their filesystem didn’t survive when they yanked the cord out – and this is from people using ext3, Reiser, XFS or JFS.
No filesystem will ever protect you in any way from hardware failure, because in many cases it’s a question of what hardware failure you had and what your hard drive was doing in the event of failure.
Hardware failure is a fanboy argument that even the ext developers and advocates have picked up on, and it doesn’t mean anything. There is simply no evidence for it.
No. It’s because it’s been made the default by a few distros. It doesn’t mean it’s better.
In fact I didn’t write that it’s better. I just said it is good.
Secondly, the fact that it is the default for many distros doesn’t mean that people are not able to choose their own fs.
No filesystem will ever protect you in any way from hardware failure, because in many cases it’s a question of what hardware failure you had and what your hard drive was doing in the event of failure.
In fact I wrote about recoverability in case of hardware failure, not about protection.
Seems that your need to attack ext3 made you a bit sloppy with the details…
In an article on ext3 development on IBM’s developer site, I read that the major factors behind continued ext development are a proven codebase (rock stable), known properties, easy migration to new versions, and a feeling of confidence among CIOs.
Yes, and you can’t tell that about xfs, right? Also, the comment above about nobody wanting xfs anymore. What are you guys smoking? I’d really like to avoid it.
Yes, and you can’t tell that about xfs, right? Also, the comment above about nobody wanting xfs anymore.
I never said that no one wanted XFS anymore. I’m just confused that many distros are following the crowd and picking an inferior filesystem in ext3 when, certainly for very large filesystems, they could support XFS or JFS where it was needed.
I hope some of the major Linux FS developers realize the promise here and join in. Extending Ext3 to Ext4 doesn’t make any sense to me; Ext is running into the law of diminishing returns.
I disagree. Continuously incrementally improving a well-seasoned design is exactly what Sun did with UFS, and, with logging, what they ended up with was a filesystem that was both rock solid and performance competitive (or leading) against more exciting designs. In my mind, the most important aspect of a filesystem is that it is exceedingly unlikely to eat your data. UFS and ext3 have that quality. The fact that UFS is in many (perhaps most) cases faster than ext3 means that there is headroom to improve ext3. I suspect that more users will benefit from a faster ext3 than any other option.
On the other hand, Sun did do exactly what you suggest with ZFS, though ZFS is as much about volume management as it is about a filesystem. My experience with ZFS has been nothing but positive, but I know that, being a decade younger than UFS, it is by definition less time-tested, and other people’s testing shows that, as you would expect, its performance varies more with workload (sometimes much faster, sometimes slower, generally not much slower) than UFS. For things like SATA RAID and home directories, ZFS’s checksums and snapshots are truly awesome. But for /usr or an RDBMS, I feel fine sticking with UFS.
Bottom line is, there is no such thing as a free lunch. Innovation is great, but there is value in well-tested, well-tuned, boring software too.
In my mind, the most important aspect of a filesystem is that it is exceedingly unlikely to eat your data. UFS and ext3 have that quality.
Interesting that you say that. The only filesystem that I have ever had eat itself all by its lonesome (besides DOS, but it’s hardly praiseworthy) is Ext3. Twice, no less. The same hardware that Ext3 ate two filesystems on me ran Ext2 just fine for a while and then ReiserFS for six years (and counting).
YMMV… YMMV…
Personal experience (especially when you’re talking about a limited number of machines), while true in your case, means little outside your own limited scope.
You might have used a problematic kernel patch and/or suffered from some kind of hardware failure that screwed the ext3 super-nodes; there’s no way of knowing.
If you had tested ext3 vs. ReiserFS vs. XFS on >100 machines using different kernel versions/distributions and seen a measurable difference in stability, those numbers would have been valid (again, outside your own personal scope).
FYI, at my previous workplace we deployed around ~200 machines with / (root) running ext3 and /media on XFS (these machines were used for streaming), and we had zero ext3 crashes and way too many XFS crashes.
However, due to the vastly different workloads on the two filesystems, these numbers should not be taken too seriously.
– Gilboa
Interesting that you say that. The only filesystem that I have ever had eat itself all by its lonesome (besides DOS, but it’s hardly praiseworthy) is Ext3. Twice, no less. The same hardware that Ext3 ate two filesystems on me ran Ext2 just fine for a while and then ReiserFS for six years (and counting).
I have no doubt that every filesystem ever written is capable of trashing data. But if I understand your story correctly, I think it is an outlier, for the simple reason that I remember ext3 six years ago, when it was brand new and (according to Wikipedia, not my memory) a year away from being merged into the mainline kernel. I did not trust it then either, or any beta FS for that matter. But six years is a lifetime in software testing. These days, I think it is uncontroversial to call ext3 stable.
As long as you’re talking about desktop machines, people are willing to play chicken with their FS.
Once you start talking about workstations/servers/etc., admins/users/IT managers/etc. (who do not suffer from suicidal tendencies) care less about performance and (much) more about stability.
Losing a departmental server (even if you back up religiously) due to a bad FS selection tends to be a career-altering decision.
– Gilboa
Once you start talking about workstations/servers/etc., admins/users/IT managers/etc. (who do not suffer from suicidal tendencies) care less about performance and (much) more about stability.
Losing a departmental server (even if you back up religiously) due to a bad FS selection tends to be a career-altering decision.
Two “me too’s” here:
First, I suspect many people do not realize how unbelievably long it takes to restore a couple hundred GB of non-large files from backup, even if you used disk to disk backup over, say, gigE. Among the factors involved, I think, are the preponderance of creates and writes, which most FSs (and disks, for that matter) do not shine on, and the fact that most storage setups are way out of balance capacity-to-IOPS-wise. I am a fan of any technique that decreases the likelihood of a restore. A stable platform (and FS) is probably the cheapest of those.
Second, to be fair, when the second-guessing happens, many fools take the road of blaming the unfamiliar component and closing the case: the “career altering” part. Keeps life simple and all. It’s silly, but it certainly informs my purchasing decisions.
First, I suspect many people do not realize how unbelievably long it takes to restore a couple hundred GB of non-large files from backup, even if you used disk to disk backup over, say, gigE
I don’t know what you mean. If you were talking about tape I could understand (and even then I’d have trouble).
But I nearly did what you describe years ago with my poor 4 GB SCSI backup disk and another, faster 10k SCSI disk.
I used to back up my whole system and most of /home on this 4 GB disk, and when I upgraded to another faster PC, I just dumped the backup from the old PC to the new PC.
Of course, everything is compressed on the 4 GB disk, and I use flexbackup (with bzip2), for the record.
The result is that it didn’t actually take lots of time to restore everything. It took less than 2 hours, and the bottleneck was the processor.
And most of the filesystems were ext3 at the time; some small ones that rarely change (/boot) were ext2.
First, I suspect many people do not realize how unbelievably long it takes to restore a couple hundred GB of non-large files from backup, even if you used disk to disk backup over, say, gigE
—-
I don’t know what you mean. If you were talking about tape I could understand (and even then I’d have trouble).
But I nearly did what you describe years ago with my poor 4 GB SCSI backup disk and another, faster 10k SCSI disk.
I used to back up my whole system and most of /home on this 4 GB disk, and when I upgraded to another faster PC, I just dumped the backup from the old PC to the new PC.
Of course, everything is compressed on the 4 GB disk, and I use flexbackup (with bzip2), for the record.
The result is that it didn’t actually take lots of time to restore everything. It took less than 2 hours, and the bottleneck was the processor.
And most of the filesystems were ext3 at the time; some small ones that rarely change (/boot) were ext2.
Unless your /home was filled with 100 GB of files dd’ed straight from /dev/zero, I’d be shocked if you compressed a couple hundred GB onto that 4 GB backup disk.
But seriously, arbitrarily assuming 3x compression, that 4 GB backup is roughly 12 GB restored in just under two hours, so you are not talking more than about 7 GB an hour, or less than 2 MB per second, for that restore.
I wish this guy good luck. I like people who do something _really_ useful for their PhD thesis.
I wish I could work on something like that…
Good to see someone questioning things and developing something simpler.
Good to see this on OSNews; the benchmarks should be interesting.
I’m terribly sorry but this FS doesn’t look like it’s well-designed. I wouldn’t want to use it.
Some examples: user-defined attributes are limited to ~170 bytes, crash counts are just an over-complicated version of COW, directory operations are O(sqrt(n)) instead of a possible O(log(n)), …
Crash counts are NOT similar to COW. In COW, blocks are copied and written out-of-place, then any metadata blocks referencing the modified blocks are updated to point to the new version. The filesystem is always consistent, and the ability to roll back in time (both metadata and data) is virtually unlimited.
In a journaling filesystem, metadata blocks are written out-of-place, then data blocks are written in-place, and finally metadata blocks are copied into place. The filesystem is always consistent in terms of metadata, but data loss is possible, and the journal can only be used to detect and prune corrupted files from the filesystem (actually they go in the lost+found).
Crash counts go one step further in simplifying and limiting the extent of recoverability. Both metadata and data are written in-place, much like a classical UNIX filesystem. As in journaling filesystems, some data loss is possible during a crash, but the filesystem is always recoverable. Performance should be better since metadata is only written to disk once.
Crash counts are just one way of finding inconsistencies from in-place metadata instead of from a journal. A much simpler mechanism would be “clean/dirty flags,” in which blocks are marked dirty before writing to them and marked clean when the write is flushed to disk. That’s really what this concept boils down to.
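To make that concrete, here is a toy Python sketch of the clean/dirty-flag idea (purely my own illustration; real filesystems do this with on-disk bitmaps and write barriers, and this is not SpadFS’s actual mechanism or on-disk format):

# Toy model of the clean/dirty-flag recovery idea: persist a "dirty" mark
# before overwriting a block in place, clear it only after the new contents
# have been flushed. After a crash, any block still marked dirty may hold a
# torn write and must be checked or repaired; clean blocks are trusted as-is.

class ToyBlockStore:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.dirty = [False] * nblocks    # stands in for an on-disk dirty bitmap

    def write_block(self, idx, data):
        self.dirty[idx] = True            # 1. persist the dirty mark first
        self._flush()                     #    (barrier: the mark hits disk before the data)
        self.blocks[idx] = data           # 2. overwrite the block in place
        self._flush()
        self.dirty[idx] = False           # 3. only then clear the mark
        self._flush()

    def recover(self):
        # Blocks still marked dirty were being written when the crash hit.
        return [i for i, d in enumerate(self.dirty) if d]

    def _flush(self):
        pass                              # placeholder for a real fsync/disk barrier

store = ToyBlockStore(8)
store.write_block(3, b"hello")
print(store.recover())                    # [] -- no write was interrupted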
The O(sqrt(n)) runtime refers to block allocation, not directory operations. This part is a big mystery to me, because I’ve never seen a list search algorithm like this. With a plain doubly-linked list, it is impossible to do better than O(n). You need random access for binary/interpolation search, which you can’t get with a linked list unless you allocate a static array of nodes. If they were using skip lists (which are arguably as fast as binary search trees, depending on who you ask), they would have mentioned that.
Directory operations should be constant time according to this document, but I’ve always considered hash-based algorithms to be “slow constant time.”
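To be clear about what I mean by that, here is a generic hash-bucket directory lookup in Python (my own toy example, not SpadFS’s actual directory format): O(1) on average, but every lookup still pays for hashing the full name and scanning a bucket chain.

# Generic hash-bucket directory: constant time on average, but each lookup
# hashes the name and walks a (hopefully short) bucket, which is what makes
# it "slow constant time" compared with a plain array index.
import zlib

NBUCKETS = 256

def bucket_of(name):
    return zlib.crc32(name.encode()) % NBUCKETS

class ToyDirectory:
    def __init__(self):
        self.buckets = [[] for _ in range(NBUCKETS)]   # lists of (name, inode)

    def add(self, name, inode):
        self.buckets[bucket_of(name)].append((name, inode))

    def lookup(self, name):
        for entry_name, inode in self.buckets[bucket_of(name)]:
            if entry_name == name:
                return inode
        return None

d = ToyDirectory()
d.add("README", 42)
print(d.lookup("README"))    # 42
print(d.lookup("missing"))   # None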
I don’t think anyone’s ever going to be flocking to this filesystem, but it has a reasonable design. Linus probably likes it, because embedding inodes in their files’ directory entries (where possible) is one of his long-time favorite filesystem tricks. If they clean up the code and cooperate with the LKML, they have a good shot of getting this merged. We’ll just have to wait and see.
As in journaling filesystems, some data loss is possible during a crash, but the filesystem is always recoverable. Performance should be better since metadata is only written to disk once.
Journalling generally improves the performance of metadata-intensive tasks in particular. This is because the metadata can not only be written asynchronously, it is also written contiguously in the journal, and once it is in the journal it is, to all intents and purposes, written to the filesystem and processing can continue.
If data can be journalled as well, as in ext3, this can further improve performance for the same reasons as metadata.
In a simple test (10000 non-transactioned inserts into a SQLite database):
ext2: 279s
ext3: 297s
ext3 + writeback data: 236s
ext3 + journalled data: 123s
jfs: 366s
xfs: 313s
Here, default ext3 pays the price for ordered data writing, which is a pessimistic writing mode not used by XFS or JFS, which behave more like the writeback data mode of ext3.
I must say, the speed up by using journalled data in ext3 even surprised me! As did the IO overhead of JFS.
All in all, the above quick test shows ext3 in a very good light!
Disclaimer: All tests done using LVM on the same disk, but different LVs. YMMV. All tests are single runs, and not very scientifically done.
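For reference, here is a rough sketch of how a test like the one above could be reproduced (a hypothetical Python script, not the exact one used; the table layout and file name are my own choices, and the database file should live on the filesystem under test):

# 10,000 inserts into a SQLite database, each committed individually
# (no enclosing transaction), so every insert forces its own sync to disk.
# Absolute numbers will depend on the filesystem, mount options and hardware.
import sqlite3
import time

DB_PATH = "bench.db"   # put this on the filesystem you want to measure

conn = sqlite3.connect(DB_PATH, isolation_level=None)   # autocommit mode
conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)")

start = time.time()
for i in range(10000):
    # In autocommit mode each statement is its own implicit transaction.
    conn.execute("INSERT INTO t (payload) VALUES (?)", ("row %d" % i,))
elapsed = time.time() - start

conn.close()
print("10000 non-transactioned inserts took %.1fs" % elapsed)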
Don’t know if I overlooked something, but if crash counts work as described in INTERNALS and the file system crashes after 160 days of continuous operation (which is quite usual for Linux-based systems), then a file system implementing crash counts loses all files of the last half year? I strongly hope I missed something; otherwise the profs approving this thesis should be fired immediately…
Maybe the description is not well worded, but the filesystem only loses versions of files that are older than half a year, not the latest or recent versions of any file.
If you have used VMS, it works the same way.
That part is not explained correctly. At one point it suggests indexing a scalar with an array, which is hopefully the reverse of what the author intended.
Here is my interpretation: The crash count is the number of mounts minus the number of umounts since fs creation. The crash count table increments the transaction count for the current crash count after each transaction (in memory and then on disk after it is flushed). Each directory and allocation structure contains the crash count when last modified and the associated transaction count. If the transaction count stored in a structure is greater than the transaction count at the current crash count index, then a crash must have occurred before the transaction that modified the structure could be flushed to disk.
So, the crash count should never get very high, but the transaction count will overflow every 4.3B transactions. There are a number of ways to make this overflow happen gracefully without risking the consistency of the filesystem. One would be to flush everything to disk, increment the crash count, and zero the transaction count. This way it’s like a crash happened and everything was recovered.
This is just my interpretation of the document, which, as we realize, isn’t very well written.
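To make the interpretation concrete, here is a toy Python version of the check as I read it (the names and layout are mine, not the actual SpadFS structures):

# cc_table[c] holds the highest transaction count known to be flushed for
# crash count c. A structure stamped (crash_count, txn_count) is stale if it
# claims a transaction newer than what the table recorded for that crash
# count, i.e. it was written but never committed before a crash.

def structure_is_valid(cc_table, struct_crash_count, struct_txn_count):
    if struct_crash_count >= len(cc_table):
        return False   # stamped with a crash count the table has never seen
    return struct_txn_count <= cc_table[struct_crash_count]

# Example: two clean mount cycles, then a crash during the third.
cc_table = [120, 57, 9]    # flushed transaction count per crash count
print(structure_is_valid(cc_table, 2, 7))    # True: committed before the crash
print(structure_is_valid(cc_table, 2, 15))   # False: written but never committed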
Here’s a log of the spadfs thread on the LKML in forum format.
On this page is Linus’s response.
http://forum.jollen.org/index.php?s=bf0bece375c0f53b068568e99cd8120…
I read the whole thread with some interest, as I worked on a disk drive controller ASIC design team some years back.
Drives have many auto-calibration, detection and other smart features to greatly improve the reliability of writing data and the confidence that it CAN be read back. When sector reads start to become suspect, the defect management system can copy data to reserved sectors and then map out the suspect ones.
I was surprised that most filesystem developers didn’t have exact knowledge of this but were surmising it from limited experience.
I would hope that vendors would be forthcoming with even a cursory explanation if the FS developers asked, though not proprietary details.
Configuration:
CPU: Pentium 4 3GHz, hyperthreading turned off
Memory: 1GB
Disk: WD Caviar 160GB SATA, 20GB test partition at the end
Filesystems are tested on Linux 2.6 and Spad (a not-yet-released kernel with experimental features).
RATE.PNG, RATE_CPU.PNG:
Write, rewrite and read an 8 GiB file; report the average data rate and CPU consumption in seconds.
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/benchmarks/RATE.PNG
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/benchmarks/RATE_CPU…
TAR.PNG:
Take the file on-src-20060814.tar (OpenSolaris sources), already unpacked, and try:
tar xf; cp -a the whole directory; grep -r on the whole directory; rm -rf
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/benchmarks/TAR.PNG
Thanks for the find.
The numbers look good but I don’t understand what all the numbers mean.
What’s the difference between lin/spad, spad/spad and spad/ext2 on the graphs?
Perhaps we’ll see. This filesystem has been out in the wild for less than a month it seems. Hopefully as it gets more exposure folks who already have benchmarking systems all set up will put up their workload numbers. Also we’ll see what sort of refinements get added and what they do to the speed.
I believe one is the “source” filesystem and the other is the “destination” filesystem.
SPAD was defined as an OS kernel that is also in an experimental stage. Apparently the author is an OS writer, too. So basically he’s comparing the SPAD and Linux kernels with the different file systems running beneath them.
Cool! But I’m going to stick with XFS for now.
Can we please drop filesystems and use databases instead? Using filesystems is holding back the development of the IT industry…
“Can we please drop filesystems and use databases instead?”
Good thing databases don’t need to store their files anywhere…
This is pretty funny.
Databases are a higher level paradigm than filesystems. As mentioned above, you need to put a database onto a filesystem to make it useful. Then of course you end up with more overhead.
BeOS tried a database FS originally and ended up ditching it for a more traditional filesystem. MS played with it for Longhorn and ditched the idea also.
Layers of building blocks work better than monolithic giants. They’re easier to implement, test and keep stable.
What I like most about the spad filesystem is the codebase size. Currently its LOC count is even lower than ext2’s. Generally, more elegant solutions end up being smaller.
The only thing databases require is a block device. They do not need a filesystem. In-memory databases are a testament to that.
The relational model is far superior to the simple byte-oriented, unstructured model of files.