When it comes to dealing with storage, Solaris 10 provides admins with more choices than any other operating system. Right out of the box, it offers two filesystems, two volume managers, an iscsi target and initiator, and, naturally, an NFS server. Add a couple of Sun packages and you have volume replication, a cluster filesystem, and a hierarchical storage manager. Trust your data to the still-in-development features found in OpenSolaris, and you can have a fibre channel target and an in-kernel CIFS server, among other things. True, some of these features can be found in any enterprise-ready UNIX OS. But Solaris 10 integrates all of them into one well-tested package. Editor’s note: This is the first of our published submissions for the 2008 Article Contest.
The details of the whole Solaris storage stack could fill a book, so in this article I will focus only on filesystems. There are four common on-disk filesystems for Solaris, and my goal is to familiarize the reader with each of them, and to mention a few deployment scenarios where each is appropriate.
UFS in its various forms has been with us since the days of BSD on VAXen the size of refrigerators. The basic UFS concepts thus date back to the early 1980s and represent the second pass at a workable UNIX filesystem, after the very slow and simple filesystem that shipped with the truly ancient Version 7 UNIX. Almost all commercial UNIX OSs have had a UFS, and ext3 in Linux is similar to UFS in design. Solaris inherited UFS through SunOS, and SunOS in turn got it from BSD.
Until recently, UFS was the only filesystem that shipped with Solaris. Unlike HP, IBM, SGI, and DEC, Sun did not develop a next-generation filesystem during the 1990s. There are probably at least two reasons for this: most competitors developed their new filesystems using third party code which required per-system royalties, and the availability of VxFS from Veritas. Considering that a lot of the other vendors’ filesystem IP was licensed from Veritas anyway, this seems like a reasonable decision.
Solaris 10 can only boot from a UFS root filesystem. In the future, ZFS boot will be available, as it already is in OpenSolaris. But for now, every Solaris system must have at least one UFS filesystem.
UFS is old technology but it is a stable and fast filesystem. Sun has continuously tuned and improved the code over the last decade and has probably squeezed as much performance out of this type of FS as is possible. Journaling support was added in Solaris 7 at the turn of the century and has been enabled by default since Solaris 9. Before that, volume level journaling was available. In this older scheme, changes to the raw device are journaled, and the filesystem is not journaling-aware. This is a simple but inefficient scheme, and it worked with a small performance penalty. Volume level journaling is now end-of-lifed, but interestingly, the same sort of system seems to have been added to FreeBSD recently. What is old is new again.
UFS is accompanied by the Solaris Volume Manager, which provides perfectly servicible software RAID.
Where does UFS fit in in 2008? Besides booting, it provides a filesystem which is stable and predictable and better integrated into the OS than anything else. ZFS will probably replace it eventually, but for now, it is a good choice for databases, which have usually been tuned for a traditional filesystem’s access characteristics. It is also a good choice for the pathologically conservative administrator, who may not have an exciting job, but who rarely has his nap time interrupted.
ZFS has gotten a lot of hype. It has also gotten some derision from Linux folks who are accustomed to getting that hype themselves. ZFS is not a magic bullet, but it is very cool. I like to think that if UFS and ext3 were first generation UNIX filesystems, and VxFS and XFS were second generation, then ZFS is the first third generation UNIX FS.
ZFS is not just a filesystem. It is actually a hybrid filesystem and volume manager. The integration of these two functionalities is a main source of the flexibility of ZFS. It is also, in part, the source of the famous “rampant layering violation” quote which has been repeated so many times. Remember, though, that this is just one developer’s aesthetic opinion. I have never seen a layering violation that actually stopped me from opening a file.
Being a hybrid means that ZFS manages storage differently than traditional solutions. Traditionally, you have a one to one mapping of filesystems to disk partitions, or alternately, you have a one to one mapping of filesystems to logical volumes, each of which is made up of one or more disks. In ZFS, all disks participate in one storage pool. Each ZFS filesystem has the use of all disk drives in the pool, and since filesystems are not mapped to volumes, all space is shared. Space may be reserved, so that one filesystem can’t fill up the whole pool, and reservations may be changed at will. However, if you don’t want to decide ahead of time how big each filesystem needs to be, there is no need to, and logical volumes never need to be resized. Growing or shrinking a filesystem isn’t just painless, it is irrelevant.
ZFS provides the most robust error checking of any filesystem available. All data and metadata is checksummed (SHA256 is available for the paranoid), and the checksum is validated on every read and write. If it fails and a second copy is available (metadata blocks are replicated even on single disk pools, and data is typically replicated by RAID), the second block is fetched and the corrupted block is replaced. This protects against not just bad disks, but bad controllers and fibre paths. On-disk changes are committed transactionally, so although traditional journaling is not used, on-disk state is always valid. There is no ZFS fsck program. ZFS pools may be scrubbed for errors (logical and checksum) without unmounting them.
The copy-on-write nature of ZFS provides for nearly free snapshot and clone functionality. Snapshotting a filesystem creates a point in time image of that filesystem, mounted on a dot directory in the filesystem’s root. Any number of different snapshots may be mounted, and no separate logical volume is needed, as would be for LVM style snapshots. Unless disk space becomes tight, there is no reason not to keep your snapshots forever. A clone is essentially a writable snapshot and may be mounted anywhere. Thus, multiple filesystems may be created based on the same dataset and may then diverge from the base. This is useful for creating a dozen virtual machines in a second or two from an image. Each new VM will take up no space at all until it is changed.
These are just a few interesting features of ZFS. ZFS is not a perfect replacement for traditional filesystems yet – it lacks per-user quota support and performs differently than the usual UFS profile. But for typical applications, I think it is now the best option. Its administrative features and self-healing capability (especially when its built in RAID is used) are hard to beat.
SAM and QFS
SAM and QFS are different things but are closely coupled. QFS is Sun’s cluster filesystem, meaning that the same filesystem may be simultaneously mounted by multiple systems. SAM is a hierarchical storage manager; it allows a set of disks to be used as a cache for a tape library. SAM and QFS are designed to work together, but each may be used separately.
QFS has some interesting features. A QFS filesystem may span multiple disks with no extra LVM needed to do striping or concatenation. When multiple disks are used, data may be striped or round-robined. Round-robin allocation means that each file is written to one or two disks in the set. This is useful since, unlike striping, participation by all disks is not needed to fetch a file – each disk may seek totally independently. QFS also allows metadata to be separated from data. In this way, a few disks may serve the random metadata workload while the rest serve a sequential data workload. Finally, as mentioned before, QFS is an asymmetric cluster filesystem.
QFS cannot manage its own RAID, besides striping. For this, you need a hardware controller, a traditional volume manager, or a raw ZFS volume.
SAM makes a much larger backing store (typically a tape library) look like a regular UNIX filesystem. This is accomplished by storing metadata and often-referenced data on disk, and migrating infrequently used data in and out of the disk cache as needed. SAM can be configured so that all data is staged out to tape, so that if the disk cache fails, the tapes may be used like a backup. Files staged off of the disk cache are stored in tar-like archives, so that potentially random access of small files can become sequential. This can make further backups much faster.
QFS may be used as a local or cluster filesystem for large-file intensive workloads like Oracle. SAM and QFS are often used for huge data sets such as those encountered in supercomputing. SAM and QFS are optional products and are not cheap, but they have recently been released into OpenSolaris.
The Veritas filesystem and volume manager have their roots in a fault-tolerant proprietary minicomputer built by Veritas in the 1980s. They have been available for Solaris since at least 1993 and have been ported to AIX and Linux. They are integrated into HP-UX and SCO UNIX, and Veritas Volume Manager code has been used (and extensively modified) in Tru64 UNIX and even in Windows. Over the years, Veritas has made a lot of money licensing their tech, and not because it is cheap, but because it works.
VxFS has never been part of Solaris but, when UFS was the only option, it was a popular addition. VxVM and VxFS are tightly integrated. Through vxassist, one may shrink and grow filesystems and their underlying volumes with minimal trouble. VxVM provides online RAID relayout. If you have a RAID5 and want to turn it into a RAID10, no problem, no downtime. If you need more space, just convert it back to a RAID5. VxVM has a reputation for being cryptic, and to some extent it is, but it’s not so bad and the flexibility is impressive.
VxFS is a fast, extent based, journaled, clusterable filesystem. In fact, it essentially introduced these features to the world, along with direct IO. Newer versions of VxFS and VxVM have the ability to do cross-platform disk sharing. If you ever wanted to unmount a volume from your AIX box and mount it on Linux or Solaris, now you can.
VxFS and VxVM are still closed source. A version is available from Symantec that is free on small servers, with limitations, but I imagine that most users still pay. Pricing starts around $2500 and can be shocking for larger machines. VxFS and VxVM are solid choices for critical infrastructure workloads, including databases.
These are the four major choices in the Solaris on-disk filesystem world. Other filesystems, such as ext2, have some degree of support in OpenSolaris, and FUSE is also being worked on. But if you are deploying a Solaris server, you are going to be using one or more of these four. I hope that you enjoyed this overview, and if you have any corrections or tales of UNIX filesystem history, please let me know.
About the Author
John Finigan is a Computer Science graduate student and IT professional specializing in backup and recovery technology. He is especially interested in the history of enterprise computing and in Cold War technology.