File Systems and Databases

The idea of combining a database system (usually a conventional relational database) with a file system, in order to attach metadata (a richer set of attributes) to files, has been a recurring discussion item on this and other sites. Last week's article, Rethinking the OS, discusses under the heading “Where Is It Stored?” the ability to locate a file without knowing its exact name or location.

The Basic Idea

While there are many variations, the essential idea is to increase the number and types of file attributes so that files can be located and accessed without knowing the exact path. BeOS added B-tree indices for additional attributes (such as author, title, etc.), thus enabling fast searching based on those attributes. WinFS seems to use Microsoft’s SQL Server (or a version thereof) to store and index file attributes.

Using RDBMS technology to index file attributes enables fast search of files based on a number of attributes besides file name and path. In effect, we enhance access capability by providing many more new access paths that were not available before. At its core, this idea of adding a database to a file system is about adding many more new access mechanisms to files. The increased flexibility enables users to locate files using different terms, especially domain specific terms that have nothing to do with file name and path.
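As a rough illustration of these extra access paths, here is a minimal sketch using Python's built-in sqlite3 module. It is not how BeOS or WinFS work internally; the table name file_attrs, the columns, and the sample paths are all assumptions made up for the example.

```python
import sqlite3

def build_index(files):
    """Create an in-memory attribute database from (path, author, title) tuples."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the path stays the unique key, attributes sit beside it.
    conn.execute(
        "CREATE TABLE file_attrs (path TEXT PRIMARY KEY, author TEXT, title TEXT)"
    )
    # Each secondary index is one more access path besides name and location.
    conn.execute("CREATE INDEX idx_author ON file_attrs(author)")
    conn.execute("CREATE INDEX idx_title ON file_attrs(title)")
    conn.executemany("INSERT INTO file_attrs VALUES (?, ?, ?)", files)
    return conn

def find_by_author(conn, author):
    """Return the paths of all files by the given author, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM file_attrs WHERE author = ? ORDER BY path", (author,)
    )
    return [row[0] for row in rows]

# Made-up sample data: the files are scattered, but the attributes are not.
SAMPLE = [
    ("/home/ann/docs/q3.doc", "Ann", "Quarterly Report"),
    ("/srv/share/misc/notes.txt", "Bob", "Meeting Notes"),
    ("/home/ann/tmp/draft2.doc", "Ann", "Budget Draft"),
]

conn = build_index(SAMPLE)
print(find_by_author(conn, "Ann"))
```

The user asks for "files by Ann" and never has to know which directories they live in; the file system underneath is untouched.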

The Arguments Against It

Many developers and other technically inclined computer users are very much against this idea. However, that opinion is often formed without enough information to be sound. It is easy to jump to conclusions based on headlines alone or on a short summary in a news item.

The common criticism is that the addition of a database is simply not necessary. After all, modern file systems work very well and already provide fast and efficient access to many files. Why mess with them? This is the “If it ain’t broke, don’t fix it” philosophy. It is a perfectly valid argument, as it is well known that modern file systems are the result of a long evolution and therefore have many excellent characteristics. It is important to note, though, that this idea simply adds a database to an existing file system and doesn’t eliminate it! The standard file system continues to play the same role as it always has. It is not being removed or replaced!

Another argument is that the overhead introduced by a database system on top of (or next to) a file system is simply not necessary. There are several problems with this argument. First, the overhead is not very well understood today. The argument is valid only if the overhead is significant, and naturally all implementations strive to keep it minimal. In the past, when computing resources were scarce, this argument would have made sense. Today, with massive processing, memory and hard disk capacities available, the overhead should be manageable. In fact, future multicore processors should be able to execute database operations on one of the cores in the background with little to no impact on other tasks.

Another argument is that a database simply adds unnecessary complexity to the file system. There are many RDBMSs of different sizes and complexities. In general, the database systems being combined with file systems are of the smaller, simpler variety. We are not talking about a massive Oracle database using massive amounts of memory and disk space. They would typically be some sort of smaller embedded database system that is much lighter on machine resources.

The Perspective

The principal argument against it is that we simply don’t need it. To clarify, we – the readers, programmers, technical users – don’t need it. Of course, there are other users, other perspectives, who may not be well represented here. In particular, the less computer savvy users do, in fact, need it.

An increasing number of people use computers. They both produce and use an increasing number of files. Networking has also caused these files to be scattered across multiple computers, from company file servers to Palm Pilots and iPods. These users are not much interested in efficient algorithms, the overhead of databases and other technical arguments. They are interested in the time they waste trying to locate a file. This problem of trying to organize and locate files will only get worse, not better!

These users operate in non-technical domains which don’t include the concept of paths and file names. Every time they use a file manager or a file open dialog they must perform a domain switch. Worse yet, most of them lack either the desire or the skill to organize their files well. How many times have you seen a novice computer user with a desktop cluttered with file icons? Often they don’t even know or understand directories. My father panicked when a file dropped off the recently used list in Word. He simply never used a file manager.

A full file path is simply a unique key to a file, necessary to be able to reference files without any ambiguity. It serves the same role as a primary key in a database system. There are other system-related attributes (a few timestamps, ACLs, etc.) but no domain-specific attributes. Databases can store and handle many more attributes, so it is simple to add attributes that are domain specific. Besides the usual author, title, and similar, a doctor’s office could add health-related attributes, a lawyer’s office could add legal attributes, and so on.
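To make the primary key analogy concrete, here is one possible sketch, again with Python's sqlite3 module. It keeps the path as the unique key and stores open-ended name/value attributes in a second table, so that each office can define its own. The schema, the helper names add_file and find, and the attribute names (patient_id, case_no and so on) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The full path plays the role of the primary key: one row per file.
    CREATE TABLE files (path TEXT PRIMARY KEY);
    -- Open-ended attributes: any domain can add its own names.
    CREATE TABLE attrs (
        path  TEXT NOT NULL REFERENCES files(path),
        name  TEXT NOT NULL,
        value TEXT NOT NULL,
        PRIMARY KEY (path, name)
    );
    CREATE INDEX idx_attr ON attrs(name, value);
""")

def add_file(path, **attributes):
    """Register a file and attach arbitrary domain attributes to it."""
    conn.execute("INSERT INTO files VALUES (?)", (path,))
    conn.executemany(
        "INSERT INTO attrs VALUES (?, ?, ?)",
        [(path, name, value) for name, value in attributes.items()],
    )

def find(name, value):
    """Return the paths of files whose attribute `name` equals `value`."""
    rows = conn.execute(
        "SELECT path FROM attrs WHERE name = ? AND value = ? ORDER BY path",
        (name, value),
    )
    return [row[0] for row in rows]

# A doctor's office and a lawyer's office use different attribute names.
add_file("/records/2004/scan17.pdf", patient_id="P-1001", exam="x-ray")
add_file("/cases/smith/brief.doc", case_no="C-77", court="district")

print(find("patient_id", "P-1001"))
```

Registering the same path twice raises sqlite3.IntegrityError, which is the uniqueness guarantee the path already provides in a file system.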

In fact, there is no reason why any one attribute type should be the dominant one. The file name should be only one of many equal attributes. Thus a GUI such as a file manager should enable users to browse using other attributes. For instance, it could display a hierarchy where the first level is authors and the second is titles. Many other combinations are possible. That is another article.
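A browsing hierarchy of this kind can be derived from the attributes on demand rather than fixed on disk. The following is only a sketch in plain Python; the function name build_hierarchy, the record layout, and the sample files are assumptions for illustration.

```python
def build_hierarchy(records, levels):
    """Nest file records by the given attribute names, in order.

    `levels` must name at least one attribute. Inner nodes are dicts keyed
    by attribute value; leaves are lists of file paths.
    """
    tree = {}
    for rec in records:
        node = tree
        # Walk (or create) one dict level per attribute except the last.
        for attr in levels[:-1]:
            node = node.setdefault(rec[attr], {})
        # The last attribute maps to the list of matching paths.
        node.setdefault(rec[levels[-1]], []).append(rec["path"])
    return tree

# Made-up records, as an attribute database might return them.
RECORDS = [
    {"path": "/home/ann/docs/q3.doc", "author": "Ann", "title": "Quarterly Report"},
    {"path": "/home/ann/tmp/draft2.doc", "author": "Ann", "title": "Budget Draft"},
    {"path": "/srv/share/misc/notes.txt", "author": "Bob", "title": "Meeting Notes"},
]

by_author = build_hierarchy(RECORDS, ["author", "title"])
by_title = build_hierarchy(RECORDS, ["title", "author"])
print(sorted(by_author))         # first level when browsing by author
print(sorted(by_author["Ann"]))  # second level: Ann's titles
```

The same records yield an author-first or a title-first view just by changing the order of levels, which is the sense in which no attribute needs to dominate.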

Offering the ability to define additional domain attributes and manage them (store, search, …) at the OS level removes the burden from applications. In addition, it offers the possibility of standardized, application-agnostic mechanisms to access file attributes. Compare this to the current situation, where searching for information across many different file types and binary formats is extremely difficult and painful.

Summary

Programmers are inclined and able to develop all sorts of tools to organize their files. We simply solve our problems by creating new tools. The wider user community cannot do this. They need help. From their perspective the problem of managing a large number of files is real and here today. Therefore we should move past the question of whether it is necessary and examine the best design and implementation solutions.

Also see: Dekk, a sample file manager based on these and other ideas.
