Linked by Slobodan Celenkovic on Tue 21st Dec 2004 17:58 UTC
Editorial The topic of combining a database system (usually a conventional relational db system) with a file system to add meta-data, a richer set of attributes to files, has been a recurring discussion item on this and other sites. The article published last week, Rethinking the OS, under the heading "Where Is It Stored?" talks about the ability to locate a file without knowing the exact name or location.
The general direction...
by bobjohnson on Tue 21st Dec 2004 18:18 UTC

I think the major misconception about such paradigm shifts is not the 'I forgot the filename' argument so much as the linking and referencing to other ideas. This is the power of metadata, and the dynamic generation of metadata is where the power will be realized in such systems. This is the focus of my own project, in which abstracted concepts can be related to other abstracted concepts, and which is able to complete metadata based upon statistical analysis, combined with domain narrowing, as well as human interaction where necessary. This is the real usage of metadata: being able to find any concept based upon its metadata. 'I know I got that report from Bob, via email': these concepts could be related by the nature (domain) of the information one is looking for, and the filters (narrowing) based upon how that information has been acquired.

Basically, what these filesystem/database systems should try to achieve is replacing the 'file' paradigm with a 'concept' paradigm. Make all information available via a standardized API that all programs can access and manipulate via a templating system that is dynamically defined.

Admittedly, not an easy task, but I've had pretty good luck with my implementation thus far.

Any ideas, recommendations, caveats, (constructive) criticisms? links?

What about versioning?
by Porter on Tue 21st Dec 2004 18:19 UTC

It occurs to me that no discussion on the file system as a database is complete without mentioning versioning! It would be extremely handy to integrate CVS-like versioning capabilities into the operating system to allow for automatic backups of user files...

Under Windows this might only affect the "My Documents" folder, or be a user settable property of a folder, such that whenever files are saved to that directory it actually treats it as a revision (if a file with the same name already exists). Then, by adding a couple of simple hooks to access revisions - apps could be easily extended to show differences between visual or textual (not sure how to handle audible) data.

For example an image could be versioned and then a "difference" viewed in an image editor that displayed the two images either side-by-side or overlaid with transparency.

Obviously this would have some serious storage, and performance impacts - so one might not want to allow the versioning of video files...
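
A minimal sketch of the "save becomes a revision" idea described above, assuming a hypothetical `.revisions` subfolder as the storage layout (the names `versioned_save` and `read_revision` are illustrative and not part of any real OS API):

```python
import shutil
import tempfile
from pathlib import Path


def versioned_save(folder: Path, name: str, data: bytes) -> int:
    """Write data to folder/name, keeping any previous copy as a numbered revision.

    Returns the number of older revisions stored after the save.
    """
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    history = folder / ".revisions" / name  # hypothetical layout, one dir per file
    if target.exists():
        # A file with the same name already exists: keep it as the next revision.
        history.mkdir(parents=True, exist_ok=True)
        next_rev = len(list(history.iterdir())) + 1
        shutil.copy2(target, history / f"r{next_rev:04d}")
    target.write_bytes(data)
    return len(list(history.iterdir())) if history.exists() else 0


def read_revision(folder: Path, name: str, rev: int) -> bytes:
    """Return the contents of a stored revision (1 is the oldest)."""
    return (folder / ".revisions" / name / f"r{rev:04d}").read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "My Documents"
        versioned_save(docs, "report.txt", b"first draft")
        versioned_save(docs, "report.txt", b"second draft")
        print(read_revision(docs, "report.txt", 1))
```

Storing full copies keeps the sketch simple; it also shows why the storage cost mentioned above grows quickly for large files such as video.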

The key problem I see with adding a "database" to the file system in order to store meta-data about the files is that users rarely enter such metadata. Sure it can be derived from some files - MP3s and ID3 tags. But, those are obvious, and any application that is aware of them already handles them. Perhaps a better approach would be to automatically generate meta-data about files based on user actions.

Example: if they do a right-click -> send as attachment, you might want to backfill the name of the recipient(s). People do clean out their "sent" folder occasionally... But this way you'd still be able to query the filesystem to see if you ever sent a file to a person. Combined with built-in versioning, you'd even know which revision of the file they had.

Windows 2K and XP already let you add in tons of meta-data (it's a lot like BeOS in that fashion) and with indexing on, it actually leverages it. But I know very few people who actually fill it out beyond what applications automate for them. So while I appreciate the convenience of an embedded database I question whether it would ever actually be used.
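
As a rough sketch of that example, the snippet below records a "sent as attachment" action in an in-memory SQLite table and later answers "did I ever send a file to this person, and which revision?". The table layout and the function names (`open_store`, `record_send`, `sent_to`) are assumptions made for illustration only:

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store for metadata derived from user actions."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE sent (path TEXT, revision INTEGER, recipient TEXT, sent_at TEXT)"
    )
    return db


def record_send(db, path, revision, recipients, sent_at):
    """Called by a (hypothetical) 'send as attachment' hook: one row per recipient."""
    db.executemany(
        "INSERT INTO sent VALUES (?, ?, ?, ?)",
        [(path, revision, r.lower(), sent_at) for r in recipients],
    )


def sent_to(db, recipient):
    """Return (path, revision) pairs ever sent to the recipient, oldest first."""
    rows = db.execute(
        "SELECT path, revision FROM sent WHERE recipient = ? ORDER BY sent_at",
        (recipient.lower(),),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_store()
    record_send(db, "/docs/report.doc", 3, ["Bob@example.com"], "2004-12-21T18:19")
    print(sent_to(db, "bob@example.com"))
```

The user never types any metadata here; the record survives even if the mail client's "sent" folder is emptied.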

re:What about versioning?
by bobjohnson on Tue 21st Dec 2004 18:27 UTC

That's a really good point about the application of such concepts: how does this information get completed?

The weakest point of metadata is admittedly the completion of such data. To date, the only reason that metadata has not been fully utilized is because as usual, the tools are not that complete, or fun, or ideal to use. The referencing of knowledge is not centralized to provide a clean way of doing this.

Think about this:
The best example I've seen of a music library sorting application is (sadly enough) Windows Media Player, in which you can select numerous files and drag them to the artist, and the file information is updated to reflect where it was dropped. Why is this not a fundamental approach for all information that is stored on a computer, with a central repository of knowledge that can be used globally?

Example two:
Why is it not simple to use your address book in Outlook, Thunderbird, and Quicken? Why do they all have their own repositories?

The problem isn't the filesystem, or the database; it's the tools that are lacking, and a vision to incorporate such concepts at a root level.

We already have done this
by TheCEOCoder on Tue 21st Dec 2004 18:38 UTC

Heck, why the big philosophical talk? It works great. My company already has built a system like this inside a world class leading database and it really runs well.

versioning...
by looncraz on Tue 21st Dec 2004 18:39 UTC

I am the developer of PhOS based on BeOS (R5.1d0 Dano/Exp), and I see no new ideas in these blurbs about file systems.

Not to say they aren't good ideas :-)

And that includes versioning.

I managed almost 60,000 source code files on my own system. Now, with that many files, it was a big pain creating and organizing backups of the files, marking stable versions, etc. I thought that a system-side CVS-like versioning system would be awesome. So I made one.

I add attributes to each and every file, which contain all changes since I last marked the file as stable. This is not done at the OS level, though I may very well add this into Haiku's oBFS when the time comes, as it is not terribly difficult. I currently have a development_server running which watches for the startup of an entire list of tools I use for development, and then launches NodeMonitors. Nodes are 'dumb': they represent the data in the file, but not the file, and they represent the only way the file system really thinks of files on BeOS. That is the whole 'who cares where a file is' idea :-)

So, when any of the applications are loaded, I simply go into that app from the server and find out what file it is working on, then ensure that the file's attributes are up-to-date with the changes since the stable version. If not, I update them, using cached data from the file, and will do so for every instance the application either opens or closes that file.

Sure, I could very easily do it for every save (I get a notification every time the file is changed), however that is far from needed, as the applications have Undo functions :-)

I can right-click on a file, and select the Tracker add-on I created just for this purpose, entitled 'Version Control'. Version Control gives me a couple of choices. I can force my current code to be the new base-code (mark as stable), I can remove any of the revisions saved, or I can force the creation of a revision at this point, without marking that revision as a stable version, and of course I can back-track (or even forward) to a different revision. I can search the revisions for a specific line, etc...
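
To make the mechanism concrete, here is an illustrative sketch (not the poster's BeOS code) of keeping "changes since the last stable version" alongside a file. It uses Python's standard `difflib`; the class name `RevisionLog` and its methods are hypothetical, and an in-memory list stands in for the file attributes:

```python
import difflib


class RevisionLog:
    """Keeps diffs against a stable base, mimicking per-file revision attributes."""

    def __init__(self, stable_text: str):
        self.stable = stable_text
        self.revisions: list[str] = []  # each entry is a unified diff vs. the stable base

    def record(self, current_text: str) -> None:
        """Store the changes between the stable base and the current text."""
        diff = difflib.unified_diff(
            self.stable.splitlines(),
            current_text.splitlines(),
            fromfile="stable",
            tofile="working",
            lineterm="",
        )
        self.revisions.append("\n".join(diff))

    def mark_stable(self, current_text: str) -> None:
        """Make the current text the new base and drop the older revisions."""
        self.stable = current_text
        self.revisions.clear()

    def search(self, needle: str) -> list[int]:
        """Return indexes of revisions that added a line containing needle."""
        hits = []
        for index, diff in enumerate(self.revisions):
            added = [
                line for line in diff.splitlines()
                if line.startswith("+") and not line.startswith("+++")
            ]
            if any(needle in line for line in added):
                hits.append(index)
        return hits


if __name__ == "__main__":
    log = RevisionLog("int main() {\n  return 0;\n}\n")
    log.record("int main() {\n  puts(\"hi\");\n  return 0;\n}\n")
    print(log.search("puts"))
```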

It seems complex, but it really isn't. It just goes hand-in-hand with how BeOS and databased file systems should act.

And one step beyond that, I use the same mechanism to parse all of my sources to ensure there are no problems in other areas of code (i.e. compatibility problems). It also contains a list of optimizations for code, and can more or less figure out when what I am doing is not the best way, and let me know. But that is the hard part :-)

--The loon

View data in a program via DBMS too
by Bob Smith on Tue 21st Dec 2004 18:55 UTC

A package like Run Time Access lets you view the statistics, status, and configuration in your running programs as data in a database. This, in combination with a DBMS for a file system, can give the user a unified way to view all data on the computer.

Want a DB FS? Say goodbye to OO!
by therandthem on Tue 21st Dec 2004 19:45 UTC

If developers want a database for a filesystem then what almost no one mentions is that you will not be able to use object oriented programming as it currently stands. For a filesystem to be truly useful, the entire structure of any file format must use the database top to bottom.

If you want to search for that SVG file with the "blue circle next to the red square" then the XML structure of an SVG file must be carefully mapped to a universal object structure upon which the filesystem is based. This universal object structure will not look like OO when all is said and done. You'll know that it is not OO when people can actually understand it and explain it to others.

What will this universal object model be? There are some good choices already out there. One is the relational model. Visit www.dbdebunk.com for more information.

RE:File Systems and Databases
by flip on Tue 21st Dec 2004 19:53 UTC

I think this was a really nice and concise article. Thanks!!
This is definitely the way to go for end users. I think overhead and system complexity truly are meaningless to the end user - on an average desktop setup, a lot more time is wasted today on finding documents than waiting for some task to finish.

Something like this was to be my thesis work. That was until Apple came up with Spotlight. ;)

Dependence on one database.
by dpi on Tue 21st Dec 2004 20:46 UTC

My concern is that a filesystem, and thus a user, becomes dependent on one database system, 'free' or not. Filesystems are currently a choose-once-run-always affair unless a reinstall or other drastic changes are made. If the database is open and thus convertible, or there are different backends, then this is less of a problem. But what if the FS only supports one database and the user doesn't want to depend on that one, wants to depend on another, or wants to have the freedom to switch to another easily? Consider that you're using e.g. Oracle or PostgreSQL but you have to use MySQL for your filesystem. That's bloat and limiting, and very likely not wished for. OTOH, one might actually prefer different databases, or at least 2 different services where one is not dependent on the stability of the other. I'm afraid WinFS will create dependence on Microsoft's proprietary software; I hope others will not make the same compatibility and freedom related mistake.

Question
by peragrin on Tue 21st Dec 2004 21:20 UTC

>> While there are many variations, the essential idea is to enhance the number and type of file attributes to enable location and access to files without knowing the exact path. <<

Isn't this part of the concept HFS+ and Mac OS use?

It stores the data, but it also stores a unique identifier. You can use either to locate the file, and if the location changes everything gets updated. That way you can move an application but the aliases still work. The system can compensate. It also stores a resource file, which if I am paying attention is what Spotlight is going to use as part of its search?

Yet there is no real database.

RE: The general direction...
by Slobodan Celenkovic on Tue 21st Dec 2004 21:25 UTC

Yep, exactly the right idea.
There is a wealth of file meta-data that can be automatically collected and stored from file usage context. While working on my gizmo I found that many file formats have a set of attributes, not always properly filled but better than nothing.

Please post a link to your site, where are your docs,...

thanks

DBFS
by Will on Tue 21st Dec 2004 21:26 UTC

The Tool Problem is really key to all of this, and we don't have the tools now simply because we don't have a dominant implementation of such a system, much less a portable standard to work against.

The poster mentioned the things that he can do with BeOS, and I'm sure many of the Be programs can leverage the system the same way, but obviously those programs won't port (at least that part of their functionality) to something like Windows or Unix.

Also, not having a standard API limits porting as well. Simple examples are things like extended file attributes that are supported on most filesystems in some form: NTFS, Linux, Mac, BSD, Solaris all have some form of extended file attributes, but they vary in their implementation and limitations, and no mainstream program that *I* know of leverages these capabilities (save system utilities).

Neither HTTP nor FTP has been upgraded to support these new file systems, for example, so you can't even (directly) transfer a file using extended attributes from one system to another using these protocols. So, those are simple examples of some of the barriers towards adoption of these extended file systems. An open API would help a lot so that developers can start to use these features. For example, I don't think any of the mainstream scripting languages have support for extended file attributes.

But the attributes exist, they're here, now, and actually widespread. There's hope that over the next year or two we'll see some stabilization among the feature sets so that developers can actually rely on them.

Once They(tm) have come to grips on how best to interoperate files using these extended attributes, then we can look at leveraging those attributes using some kind of indexing scheme.

Regarding: "Dependence on one database", that's not even an issue. If you use an extended file system feature that is unique to a particular filesystem, then you're stuck on that filesystem regardless of how the developers implemented the filesystem. If not, then you have no dependence whatsoever on that filesystem and can move your files (and ideally, your application) from one filesystem to another. The actual implementation and internal control structures will be largely irrelevant to your application, just like they are now.

hrm, some issues
by kjs47404 on Tue 21st Dec 2004 21:30 UTC

I've been using swish-e on FreeBSD and Copernic Desktop Search on MS Windows for a while now and I have some concerns and issues.

First of all, I think that a lot of this idea that database file systems will radically solve all of the problems of file management is based on a bit of a fallacy. Databases are only as good as the keywords you put into the database. There seems to be an assumption that people unwilling to organize files into a hierarchical database will be willing to add metadata to files. As anyone who has tried to find something through Google or through a periodical catalog should know, keyword searches, unless constructed well, tend to produce either too many misleading hits or not enough hits.

Secondly, well-constructed file trees are metadata. I know that a collection of documents in a folder called Dissertation/draft1 is likely to be the first draft of a dissertation. images/mushrooms/ohio_10_04/*.jpg are likely to be photographs of specimens sighted in Ohio. images/mushrooms/ohio_10_04_edited/ should be obvious.
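
The point that a file tree already is metadata can be shown with a short sketch that turns a path into queryable attributes. The function name `path_tags` and the attribute names are assumptions for illustration:

```python
from pathlib import PurePosixPath


def path_tags(path: str) -> dict:
    """Derive simple attributes from a path: its folders, base name, and type."""
    p = PurePosixPath(path)
    return {
        "folders": list(p.parent.parts),       # each folder acts like a keyword
        "name": p.stem,                        # file name without extension
        "type": p.suffix.lstrip(".").lower(),  # extension as a crude file type
    }


if __name__ == "__main__":
    print(path_tags("images/mushrooms/ohio_10_04/cap_01.jpg"))
```

A database-backed system could seed its attributes this way, so a careful folder layout is not lost when other attributes are added.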

Third, a part of me really worries about the possibility of vendor lock-in. Copernic Desktop Search does not handle OOo files. swish-e will handle everything as long as you add a parser that will extract the text from a file. So one of my concerns is that applications will only have their data indexed in the database if they use proprietary APIs.

Re: RE: The general direction...
by bobjohnson on Tue 21st Dec 2004 21:33 UTC

well, my site isn't really a site, per se, but there will be content there soon...

openknowledgebase.info

nice domain name, huh? ;)

I actually do have large plans on this sort of app, so keep your eyes open...

I hope that comment doesn't come back to haunt me... ;)

Common Ground
by Anonymouser on Tue 21st Dec 2004 21:43 UTC


One reason UFS is the best we have, right now, is that it is something different vendors can at least agree on for interoperability. Once we get into higher-level discussions about meta-data and databases, everyone has their own idea that is better than everyone else's, and it will be a long time before enough ideas are tried that people come to a consensus on them. We've had UFS for a couple decades or so, so I expect it'll be quite a few years before something better that everyone likes comes along.

Reiser4
by Anonymous on Tue 21st Dec 2004 21:46 UTC

Isn't this already implemented, tested, and ready for inclusion in mainline? The article reminds me of the two-year-old Reiser4 concept on namesys.com.

regarding the issues mentioned here (so far)
by bobjohnson on Tue 21st Dec 2004 21:49 UTC

The MAIN objective of such concepts is to be able to relate things together, and have the meaning relevant to the computer and the human. This I would suggest be carried out via all textual relationships: by this I mean that instead of being able to find something by name, it's actually always referred to via a GUID. This GUID concept has the ability to have related concepts (attributes, metadata). These attributes are really just concepts that also have a GUID.

peep this:
A Concept(Bob Johnson) has Attribute(Concept(Father)='Jim Johnson')

therefore a 'folder' (used loosely) of ConceptType(Person) would hold references (links, if you will) to those Concepts('Bob' & 'Jim')

Individually, Bob exists, as does Jim, but with no relation. Now we can add a relationship between these two abstracted types of information, with a common framework that allows for this type of dynamically generated 'Types' for ALL types of information. Depending on how I've set up my system, this framework extends itself to many different concepts, and allows flexibility to grow, and still be able to link to other concepts.
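
One way to picture the Bob/Jim example is the sketch below, where every concept (including attributes and types) has its own GUID and relationships are stored as GUID-to-GUID links. This is only an illustration under assumed names (`Concept`, `ConceptStore`, `relate`, `lookup`, `members`), not the project's actual design:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str
    guid: uuid.UUID = field(default_factory=uuid.uuid4)
    # Maps the GUID of an attribute concept to the GUID of a value concept.
    attributes: dict = field(default_factory=dict)


class ConceptStore:
    """Central repository: everything is looked up by GUID, never by name."""

    def __init__(self):
        self.by_guid = {}

    def create(self, label: str) -> Concept:
        concept = Concept(label)
        self.by_guid[concept.guid] = concept
        return concept

    def relate(self, subject: Concept, attribute: Concept, value: Concept) -> None:
        """Record that subject has attribute = value; all three are concepts."""
        subject.attributes[attribute.guid] = value.guid

    def lookup(self, subject: Concept, attribute: Concept):
        """Follow an attribute link, returning the related concept or None."""
        return self.by_guid.get(subject.attributes.get(attribute.guid))

    def members(self, type_attr: Concept, concept_type: Concept) -> list:
        """A 'folder' of a type: all concepts whose type attribute points at it."""
        return [
            c for c in self.by_guid.values()
            if c.attributes.get(type_attr.guid) == concept_type.guid
        ]


if __name__ == "__main__":
    store = ConceptStore()
    person, is_a, father = (store.create(x) for x in ("Person", "is a", "Father"))
    bob, jim = store.create("Bob Johnson"), store.create("Jim Johnson")
    store.relate(bob, is_a, person)
    store.relate(jim, is_a, person)
    store.relate(bob, father, jim)
    print(store.lookup(bob, father).label)
```

Because "Father" and "Person" are themselves concepts, new attribute types can be added at run time without changing the store.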

whew, are you still with me?

Now combine that dynamic, abstracted framework with a GUI that would replace all of your current software. This would (via plugins) retrieve all your messages, for example, and create the necessary relationships based on the application's settings, which are also stored in the central db.

Let's say that your messaging app gets an email from Robert Johnson. That email has properties that the computer could parse through, creating metadata based on what is found. Using GUIDs would also make this much easier.

Once you had established that robJohnson@hotmail.com was a known person, they'd be given a GUID, and the email's attribute (From) would be completed with it dynamically.
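
A small sketch of that email step, using Python's standard `email` package to parse the headers. The `PeopleIndex` class and `metadata_from_email` function are hypothetical names; the sketch assumes that an address (compared case-insensitively) identifies a person:

```python
import uuid
from email import message_from_string
from email.utils import parseaddr


class PeopleIndex:
    """Assigns one GUID per known email address (case-insensitive)."""

    def __init__(self):
        self.guid_by_address = {}

    def guid_for(self, address: str) -> uuid.UUID:
        key = address.strip().lower()
        if key not in self.guid_by_address:
            self.guid_by_address[key] = uuid.uuid4()
        return self.guid_by_address[key]


def metadata_from_email(raw: str, people: PeopleIndex) -> dict:
    """Parse a raw message and fill in the 'From' attribute with a person GUID."""
    msg = message_from_string(raw)
    display_name, address = parseaddr(msg.get("From", ""))
    return {
        "from_guid": people.guid_for(address),
        "from_name": display_name,
        "subject": msg.get("Subject", ""),
    }


if __name__ == "__main__":
    people = PeopleIndex()
    raw = "From: Robert Johnson <robJohnson@hotmail.com>\nSubject: Report\n\nAttached."
    print(metadata_from_email(raw, people))
```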

I have to stop now, but I think you get the idea? Post your thoughts, please!

I know that this flies in the face of a lot of things, and much of what exists fulfills this idea, but not with a CENTRALIZED repository of information.

I read once, that in order to make a great leap, one has to ignore a lot of what has been done. Edison didn't get the lightbulb right on the first time, and he said he had learned of thousands of ways NOT to build one. I think the same applies in such ideas of computing also.

cheers!

Smalltalk
by ozten on Tue 21st Dec 2004 21:58 UTC

Have you investigated any Smalltalk implementations? It has many of the aspects which you are thinking about. It is basically like an OS which is image based instead of file based. Everything is stateful including the virtual machine. Check out Squeak or another implementation.

Re: smalltalk
by bobjohnson on Tue 21st Dec 2004 22:01 UTC

I have heard of it, but paying the bills fixing M$ code all the time has not left me enough time to fully investigate it... The more research I've been doing, the more I've been finding references to it, also. I haven't heard it put the way you have in your post. That's the last straw!!! I'm going to check it out right now.

Thanks!

Re: smalltalk
by bobjohnson on Tue 21st Dec 2004 22:05 UTC

Furthermore, the GUI I've been writing is in .Net, but the further along I get, the more seriously I've been considering a platform change because of the 'vendor lock-in' people are always talking about. I've been considering Mono on Fedora Core 3, or a Linux From Scratch alternative: I need the Linux practice (or I've outdorked myself).

Formalization
by DonQ on Tue 21st Dec 2004 22:18 UTC

As always, the main problem with developing such systems is formalization. Human beings neither think nor express themselves in a logical, formalized way; a computer per se cannot understand illogical and contradictory commands.

Well, users can be taught to perform some more or less simple operations, like giving a name to a document (do your letters at your home desk have names?) or saving it (do your handwritten essays need saving?) - but this is not natural.

Mixing file system, database, and metadata should in principle create something more natural. Unfortunately this involves two stages of formalization: first while creating metadata (this must be automated, otherwise it's pointless), second while performing searches on this data (this requires understanding human language at least). All the rest is a purely technical problem and many degrees simpler.

It seems that this task is unsolvable. There can be many partial solutions (many files have extended attributes/tags, created automatically; XP can create indices based on file content; MS has an SQL application able to understand natural language queries, etc.) but no one has developed a complete solution yet (for many decades already).

Microsoft is trying this with WinFS, which has been postponed many times. It seems that they cannot solve these problems either.

RE: What about versioning?
by Slobodan Celenkovic on Tue 21st Dec 2004 22:20 UTC

Versioning capabilities will be added eventually, covering either the full file system or a portion of it. It is just a matter of time, and of the ability to create a simple interface, as the current version control system interfaces are too complex for an average user.

Certainly most applications are capable of processing attributes of the file types they deal with (music players can access music file attributes, for instance). However, to be able to access attributes of many/most file types an application would require knowledge of many file types, which is very cumbersome, expensive and difficult to maintain. For instance, the file manager search command is limited to simple text and similar ASCII-based files because it is too difficult to embed a bunch of filters for other binary types.

Thus storing these attributes in a database that is not file format dependent would enable any application to access attributes of any file format so long as it can access the API.

It is precisely the fact that we are unable to access these file-format-specific attributes from any context (such as generic applications, file manager, ...) that discourages users from entering them. Their scope is too limited. They become truly useful only on a global level, when searching not just a specific set of files from a specific application, but a full file system. Therefore, a standardized API is likely to encourage users to start using the attributes, as they will be accessible from any application.
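
The argument above can be sketched as follows: format-specific extractors are registered once, and every application queries the same format-independent store. The names `AttributeStore`, `register`, `index`, and `find`, as well as the toy "Key: value" header format, are assumptions for illustration and do not correspond to WinFS or any existing API:

```python
from typing import Callable

Extractor = Callable[[bytes], dict]


class AttributeStore:
    """Format-independent attribute database fed by per-format extractors."""

    def __init__(self):
        self._extractors: dict = {}
        self._attributes: dict = {}

    def register(self, extension: str, extractor: Extractor) -> None:
        """Plug in knowledge of one file format; applications never need it."""
        self._extractors[extension.lower()] = extractor

    def index(self, path: str, content: bytes) -> None:
        """Extract and store attributes for a file, if its format is known."""
        ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
        extractor = self._extractors.get(ext)
        self._attributes[path] = extractor(content) if extractor else {}

    def find(self, **criteria) -> list:
        """Return sorted paths whose attributes match every given criterion."""
        return sorted(
            path for path, attrs in self._attributes.items()
            if all(attrs.get(k) == v for k, v in criteria.items())
        )


def header_extractor(content: bytes) -> dict:
    """Toy extractor: reads leading 'Key: value' lines until the first other line."""
    attrs = {}
    for line in content.decode("utf-8", "replace").splitlines():
        if ":" not in line:
            break
        key, value = line.split(":", 1)
        attrs[key.strip().lower()] = value.strip()
    return attrs


if __name__ == "__main__":
    store = AttributeStore()
    store.register("txt", header_extractor)
    store.index("/docs/q4.txt", b"Author: Bob\nTitle: Q4 report\n\nBody text")
    print(store.find(author="Bob"))
```

A file manager, a mail client, and a music player could all call `find` without knowing how any particular format stores its attributes.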

RE: What about versioning?
by bobjohnson on Tue 21st Dec 2004 22:27 UTC

It is precisely the fact that we are unable to access these attributes that are file format specific from any context...

I agree 100%, this is vital to the completion of metadata. Also, creating tools that enable one to create this metadata programmatically, i.e. an 'import plugin' such as BeOS had with its translation system, would be a major boon to such a system. Damn I miss BeOS, my new computer doesn't work with it. There was so much potential there, so long ago.

RE: Dependence on one database.
by Slobodan Celenkovic on Tue 21st Dec 2004 22:35 UTC

Very good points. There are certainly some difficult issues to be resolved.
I believe that RDBMS technology is sufficiently standardized today that it should be easy to switch. Most systems have some version of unload/load or dump/load functionality. There is also the XML format, which can be used to transfer metadata.

The database is hidden behind an API such as WinFS and apps should never have direct access to it. Thus it should be easy to replace one type with another. While MS will probably prohibit anything other than MS SQL, open source systems will likely offer some choice. That shouldn't be a problem.

That being said, there is no standard API for accessing file metadata. Therefore, there is the danger of being locked into a certain API. For instance, to use the Windows platform you have no choice but to use WinFS, hence vendor lock-in. I am hoping that eventually a standard multi-vendor API will emerge as WinFS becomes popular and useful.

Another problem is what happens to metadata when a file travels between 2 file systems. Some sort of accompanying XML content with metadata may be created, or ... Remains to be solved as well.
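
One possible shape for such accompanying XML content is sketched below with Python's standard `xml.etree.ElementTree`. The element and attribute names (`metadata`, `attribute`, `file`, `name`) are made up for this example; no existing sidecar standard is implied:

```python
import xml.etree.ElementTree as ET


def export_metadata(filename: str, attrs: dict) -> str:
    """Serialize a file's attributes to an XML string that can travel with it."""
    root = ET.Element("metadata", {"file": filename})
    for key in sorted(attrs):
        ET.SubElement(root, "attribute", {"name": key}).text = str(attrs[key])
    return ET.tostring(root, encoding="unicode")


def import_metadata(xml_text: str):
    """Read the XML back on the receiving system as (filename, attributes)."""
    root = ET.fromstring(xml_text)
    attrs = {a.get("name"): a.text or "" for a in root.findall("attribute")}
    return root.get("file"), attrs


if __name__ == "__main__":
    xml_text = export_metadata("report.doc", {"author": "Bob", "title": "Q4 <draft>"})
    print(xml_text)
    print(import_metadata(xml_text))
```

The receiving file system would still need to map these names onto its own attribute scheme, which is the unsolved interoperability part.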

There are certainly some implementation and interop issues, no doubt. It is not going to be trivial.

RE: hrm, some issues
by Slobodan Celenkovic on Tue 21st Dec 2004 22:47 UTC

Quality of metadata, or lack thereof, depends on users. The same issue exists today with hierarchies. The difference is that the user is not forced to enter information he/she doesn't have. You can enter title and author if applicable and not bother with a path. Today, you have no choice, hence a lot of people dump most files into a single directory. Having a larger number of available attributes increases the chances of entering higher quality values.

Google is a huge database. A file system database is much smaller, so the number of false hits ought to be proportionally smaller.

>>> Secondly, well-constructed file trees are metadata

Indeed they are. In fact, as I stated in the article they can remain in use, no need to kill hierarchies. In fact, you could define multiple hierarchies, based on different criteria. The main point is that we need more flexibility to be able to use multiple attributes, not just a single hierarchy.
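
As an illustration of defining multiple hierarchies over the same files, the sketch below builds a nested tree from whatever attributes are chosen as levels. The function name `build_hierarchy` and the reserved `_files` key are assumptions of this example:

```python
def build_hierarchy(files: dict, keys: list) -> dict:
    """Group files into a nested tree, one level per attribute name in keys.

    files maps a file name to its attribute dict; files at a node are listed
    under the reserved '_files' key (hypothetical convention).
    """
    tree: dict = {}
    for name, attrs in files.items():
        node = tree
        for key in keys:
            # Files missing an attribute are grouped under 'unknown'.
            node = node.setdefault(attrs.get(key, "unknown"), {})
        node.setdefault("_files", []).append(name)
    return tree


if __name__ == "__main__":
    files = {
        "a.doc": {"author": "Bob", "year": "2004"},
        "b.doc": {"author": "Bob", "year": "2003"},
        "c.doc": {"author": "Jim", "year": "2004"},
    }
    print(build_hierarchy(files, ["author", "year"]))  # browse by author, then year
    print(build_hierarchy(files, ["year"]))            # same files, browse by year
```

The files are stored once; each hierarchy is just a different view computed from the attributes.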

...
by Yamin on Tue 21st Dec 2004 23:14 UTC

Well I think some kind of meta data FS is inevitable, but I really don't see the directory based system going anywhere. It's just too easy to reference stuff when you know what it is. The article makes a good point when it says

" Full file path is simply a unique key to a file, necessary to be able to reference files without any ambiguity. They serve the same role as a primary key in database systems. "

Any db file system will have to use this for exactness and just for people's transitional period. The biggest battle will come when we need to standardize on meta-data names.
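
The quoted point about the full path acting as a primary key can be sketched with SQLite. The two-table layout and the helper names (`create_catalog`, `add_file`, `find_paths`) are assumptions for illustration:

```python
import sqlite3


def create_catalog() -> sqlite3.Connection:
    """Files are keyed by full path; extra attributes hang off that key."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE attributes ("
        " path TEXT NOT NULL REFERENCES files(path),"
        " name TEXT NOT NULL,"
        " value TEXT,"
        " PRIMARY KEY (path, name))"
    )
    return db


def add_file(db: sqlite3.Connection, path: str, **attrs) -> None:
    """Insert a file; a duplicate path raises sqlite3.IntegrityError."""
    db.execute("INSERT INTO files VALUES (?)", (path,))
    db.executemany(
        "INSERT INTO attributes VALUES (?, ?, ?)",
        [(path, name, value) for name, value in attrs.items()],
    )


def find_paths(db: sqlite3.Connection, name: str, value: str) -> list:
    """Locate files by attribute; the result is still a list of exact paths."""
    rows = db.execute(
        "SELECT path FROM attributes WHERE name = ? AND value = ? ORDER BY path",
        (name, value),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = create_catalog()
    add_file(db, "/home/bob/report.doc", author="Bob", title="Q4")
    print(find_paths(db, "author", "Bob"))
```

Attribute searches find candidates, while the path remains the unambiguous reference that programs and people can fall back on.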

Then there's the issue of storing all data as an object that the 'system' understands. That's a far-off goal, especially if the data structures are not constant... I think you'd have to do something like Java's XMLEncoder... but man would your hard drive space perish. Try getting the world to standardize on that.

But for the next little while, I think a database FS with metadata seems reasonable. Toss in some Google-style search technology so we don't all need to know regular expressions and SQL syntax, and it just might turn out a success. Update the FTP / file sharing protocols to include transferring of metadata, and it's all good.

Already done
by Diego on Wed 22nd Dec 2004 00:12 UTC

Database-driven OSs have been done in the past (no, not BeOS), decades ago. I don't remember the name, but it was one of those weird non-Unix operating systems; perhaps it was before Unix.

Old idea being revisited and updated.
by AussieGuy on Wed 22nd Dec 2004 00:23 UTC

Back in the 70's and 80's there was this little OS that was a Database primarily, it was called PICK (Multi-valued database), which lived long before SQL, and even had its own natural query language. It bestowed the virtues of everything being stored as data, including the OS, which was stored in the database.

Things have changed now, and D3 (the latest version of PICK) has followed UNIVERSE and other PICK/multivalued databases, which now live on top of other OS's.

While not as advanced as you are suggesting, the fundamentals are pretty close.

Versioning
by Anonymouser on Wed 22nd Dec 2004 00:54 UTC

Several posts above advocate versioning in the filesystem. Versioning at the filesystem level is useful only in the context of WORM-type storage, because users always become their own worst enemy with regard to keeping data intact. Combining versioning with Solaris ZFS-style expandable partitions might allow this to happen, where disks can be added or replaced as storage needs grow.

Except for hard-core video editors or MP3 pack-rats, 100+GB is almost enough for a true WORM storage, given that most files are only a few KB. I'm still well under 30GB, myself.

Bleh, it's just not necessary.
by Chris on Wed 22nd Dec 2004 02:54 UTC

Individual apps should be handling this sort of thing. People lose their files, fine. Load up the app you used to make it and check through the recent files you worked with (it should store every one you make and view, and let you sort through and search them).
People are going to lose files no matter what you do, and I don't see adding mass overhead as helping. If you restrict it to home directories it will help (eliminate literally tens of thousands of files that really don't need to be searched) however you still often end up with massive directories of stuff. It depends a lot on the user, most will only have large numbers of files in their IE cache.
I honestly think people will simply not fill out most of the extra information on files and it will become useless for those who need it. Those who don't need it may continue to simply organize their directories and pick good filenames. Most people make, save, write, and think later. And typically, non-techie users will see files as something associated with an application and not associated with the computer as a whole. So to find that Word doc they will often open Word to find it (and this is logical, and if apps were written with this in mind it would work too).

The mega-finder app is a dream that's just not gonna solve all our problems in 4.2msec.

And I got news for ya, computing resources are still scarce. It's just that we seem to have infinite text processing capabilities; but you throw in video and graphics and you find the modern PC slowing down and begging for well written code.
Also, using a standard SQL server for this is MASS overkill; but I imagine developers realize this. I hope so anyway; I for one do not run MySQL on my desktop for a reason.

If somebody gets a great implementation of this integrated with a filesystem I will support it; I probably won't use it for years but I'll support it! Good luck.

Your concepts and attributes appear to be elements of a semantic network. AI (Artificial Intelligence) research has been developing these sorts of ideas for a long time now. Unfortunately most of the academic papers are very dense reading. Still, you should examine some summary texts regarding semantic nets in AI.

Long term AI will play a major role, but it is too early today. Only a few brave companies are playing with some semantic processing mostly of documents to extract metadata.

What does he mean "we don't need it"?
by Tuishimi on Wed 22nd Dec 2004 03:03 UTC

I'm a programmer, have been for almost 20 years now. I still get tickled when I can't find something on my BeOS machine and do an attribute search and come back with what I am looking for instantaneously. I cannot imagine how cool Apple's Spotlight will be. Why should I write a tool to do anything like that when the OS can do it (and SHOULD do it) for me? After all, it is the OS's job to manage my devices and allow me easy access to the data contained on those devices. ;)

Mike

RE: Formalization
by Slobodan Celenkovic on Wed 22nd Dec 2004 03:30 UTC

DonQ,
Excellent post! I was hoping to advance the discussion ;)

There is certainly a very large chasm between human languages, which are full of ambiguities, and precise, discrete computer languages and data. At the moment the translation is mostly performed by us, humans. For example, when entering a query in Google, we often revise queries multiple times to add/replace synonyms. Especially skilled users can reformulate queries in many ways to convey the same semantics. This is both tedious and difficult (time consuming).

In the short term there are a few tricks and heuristics that can be programmed to ease the burden. Google already performs word processing such as removing the plural s, removing punctuation, etc., normalizing the search terms. Word comparison can be relaxed to tolerate a 1 or 2 letter difference to account for minor spelling mistakes. A more ambitious programmer could add a thesaurus and perform synonym replacement automatically, attempting a series of queries until results are obtained. And so on.
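
To make those three heuristics concrete, here is a rough Python sketch. The tiny thesaurus, the function names and the plural rule are all made up for illustration; this is not how Google or anyone else actually does it.

```python
import string

# Hypothetical toy thesaurus; a real one would be far larger.
SYNONYMS = {"photo": ["picture", "image"], "car": ["automobile"]}

def normalize(term):
    """Lowercase, strip surrounding punctuation, drop a trailing plural 's'."""
    term = term.lower().strip(string.punctuation)
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matches(query, word, tolerance=1):
    """True if word is within `tolerance` edits of the query or a synonym."""
    q = normalize(query)
    w = normalize(word)
    candidates = [q] + SYNONYMS.get(q, [])
    return any(edit_distance(c, w) <= tolerance for c in candidates)
```

With this, a search for "photos" would also hit a file tagged "pictur" (synonym plus one missing letter), while "car" would not match "boat".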

Long term solid NLP (natural language processing) technology will be essential to translate between the 2 worlds. I believe that Microsoft already has staff working on NLP for use in their Office products (remember the clip?). I believe that it will be a gradual evolution starting with simple heuristics (some of which are already in use today) to full NLP. It will take time though. Unfortunately few companies and OS projects are putting any effort into it. I believe that Seth Nickell (http://www.gnome.org/~seth/) is trying to use a parser in his project.

Having applications automatically extract and store metadata from the file content itself can alleviate both problems: using consistent terms and the effort of metadata collection. By copying the same terms as used in the content into the database, the terms should be consistent. In addition, the user is not burdened with metadata entry. As Bob Johnson mentioned earlier, there is also contextual information that can be interpreted and stored as metadata. Thus, ideally, the user would not even notice that metadata was stored.

While full NLP may not be available today, simple keyword-based searches are very much possible, as Google has already demonstrated. WinFS delays have multiple causes. Keep in mind that they have a huge and mostly conservative user base. Therefore, it takes them much longer to develop stuff, especially brand new technology such as this, because they don't get 2nd chances. The trouble is that WinFS is at such a low level (file system) that they have to be careful not to adversely impact existing applications and users.

For instance, most games may have no need for the new WinFS features, but need top performance. Microsoft has to ensure that games still get great performance. It is a very complex balancing act for them. Any tinkering at such a low level in an OS is inherently risky. Microsoft doesn't like risk, so they probably want to take the extra time to make sure that it works well.

My only real issue
by deathshadow on Wed 22nd Dec 2004 04:17 UTC

rarely comes up, and that's laziness. If people are being too lazy to maintain an orderly file tree and put their files where they need to go to be found - Do you REALLY think these people are going to take the time to fill out all these extra attributes at save time? Sure some of these can be plugged in by the program that created them, but if you end up with 100 files all with identical sub-info, what are you accomplishing besides chewing up disk space for what amounts in functionality to little more than what a three letter extension like .doc or .jpg does?

Look at MP3 files and the ID3 tags, more specifically ID3v2 - how often do you find files that have anything in the ID3 tags that isn't in the file title in the first place? 90% of the reason you ever HAVE proper album, song title, etc is because someone used CDDB or FreeDB to name them at rip time automatically.

In my experience you are LUCKY if your user takes the time to enter more than ten letters as a filename, and on the occasion you find names longer than that it's usually because M$ office auto-named the file after the first sentence of its content.

You really think they are going to spend the time to fill out MORE? People are WAY lazier than that.

On top of which, the hard drive is STILL the biggest bottleneck in a modern computer. Adding more data to parse through on writes, reads, and searches looks really stupid on paper.

The article hits on the most important point - all this should be OPTIONAL. Leave the underlying system intact, and allow files to be created without the overhead... The biggest fear anyone has is being saddled with options they don't want, don't use yet still slow everything down - which is why you see so many vehement attacks against it.

Like "task orientation" - Big fancy word for wizards, and a lot of users HATE wizards. Great for nubes, screws everyone else. Switching to 'task oriented' only pisses people off and wastes time, for god sakes leave the underlying option boxes in place.

RE: Yamin
by Slobodan Celenkovic on Wed 22nd Dec 2004 04:29 UTC

>>> Any db file system will have to use this for exactness and just for people's transitional period. The biggest battle will come when we need to standardize on meta-data names.


Agreed. I have seen some WinFS presentations that showed an elaborate meta-schema. They are developing a system of types, metadata names. I am not sure there will be a battle. WinFS stands alone in scope and may have the 1st mover advantage. Hard to tell.

>>> But for the next little while, I think a databseFS with metadata seems reasonable. Toss in some google style search technology so we don't all need to know regular expressions and SQL syntax, and it just might turn out a success. Update the FTP / file sharing protocols to include transferring of metadata, and its all good.

Exactly.

RE: My only real issue
by Slobodan Celenkovic on Wed 22nd Dec 2004 05:01 UTC

>>> laziness

This used to be my opinion. Those lazy users, why don't they do the proper job of creating directories and organizing files?!? I was annoyed with the graphics designer Jason who kept placing all his files onto the desktop, forget directories. Then I realized that he works and thinks in graphical terms, not text based hierarchies. He works with images all day long and directories simply don't exist in his world, don't interest him.

Indeed, there are many users who either don't understand directory hierarchies and/or lack the skills to create good hierarchies. It has nothing to do with laziness! We, programmers, need to understand and accept this fact. We need to accommodate these users even if we don't like it.

>>> 100 files all with identical sub-info

The current apps make a feeble attempt to automatically enter obvious and useless information, such as Adobe attaching, Created by app: xxxxxxx
Of course such information is not of much use. However, should metadata be stored and managed at the OS level and become much more accessible, both applications and users would make a greater effort to enter useful information. Today there is very little incentive.

For example, note how many music files have ID3 tags with useful info (artist, title, ....) This has happened only because of P2P apps followed by the players that make excellent use of the information. Users followed and made the extra effort to enter the information. They in turn were followed by more apps, such as music file meta databases, etc. The virtuous cycle was created between programmers building apps and users trying to attach and make the most use of metadata. In fact, the entire Napster/P2P phenomenon would not be possible without metadata.

>>> Look at MP3 files and the ID3 tags, more specifically ID3v2 - how often do you find files that have anything in the ID3 tags that isn't in the file title in the first place? 90% of the reason you ever HAVE proper album, song title, etc is because someone used CDDB or FreeDB to name them at rip time automatically.

Most ID3 tags have at least 3 attributes that are important to users: title, author, album
Many applications have options to create file names from these. The result is very long file names, so most users end up with title alone. I still think that ID3 is a huge success; it demonstrates the potential of attaching metadata to files and making it widely accessible.

>>> In my experience you are LUCKY if your user takes the time to enter more than ...

Certainly users should not have to enter all metadata. Applications will have to become more intelligent and fill some of the data automatically. Many apps already do this to a limited extent.

>>> Adding more data to parse through on writes, reads, and searches looks really stupid on paper.

Storing metadata in a database will avoid having to parse/process file content all the time to access the values. The main reason for using a database is precisely to avoid this overhead. Instead, parsing may take place only on file create and write. Even then it is likely to be handled by a separate lower priority background thread. One of the reasons WinFS is late is that Microsoft is making sure that the overhead is not significant, to minimize the need for parsing.
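
As a minimal sketch of that split (extract once on write, then answer searches from the database alone), here is a toy index in Python using the standard sqlite3 module. The table layout and the function names are my own assumptions and have nothing to do with how WinFS is built; the background thread is left out to keep it short.

```python
import sqlite3

def open_index():
    """Create an in-memory attribute table: one row per (file, attribute)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE meta (path TEXT, name TEXT, value TEXT, "
        "PRIMARY KEY (path, name))"
    )
    return db

def on_write(db, path, attributes):
    """Hypothetical hook run when a file is created or rewritten.

    The (possibly expensive) parsing that produced `attributes` happens
    once here, not at search time.
    """
    db.execute("DELETE FROM meta WHERE path = ?", (path,))
    db.executemany(
        "INSERT INTO meta (path, name, value) VALUES (?, ?, ?)",
        [(path, k, v) for k, v in attributes.items()],
    )
    db.commit()

def find(db, name, value):
    """Search by attribute without opening any file."""
    rows = db.execute(
        "SELECT path FROM meta WHERE name = ? AND value = ? ORDER BY path",
        (name, value),
    )
    return [row[0] for row in rows]
```

A search such as find(db, "artist", "X") then touches only the table, which is the overhead saving argued for above.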

>>> all this should be OPTIONAL

I see no reason why the flexibility to enable/disable should not be provided. It could be enabled/disabled based on location (only for home directories) and/or file extensions (data file types), etc. There are certainly files that need to be excluded (browser cache, systems files, etc.)
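
Such a policy could be as small as the following Python sketch. The roots, extensions and excluded directory names are hypothetical values picked for the example, not anything a real system ships with.

```python
from pathlib import PurePosixPath

# Hypothetical policy: index only data files under home directories.
INDEXED_ROOTS = [PurePosixPath("/home")]
INDEXED_EXTENSIONS = {".doc", ".mp3", ".jpg", ".txt"}
EXCLUDED_DIR_NAMES = {"cache", ".cache", "tmp"}

def should_index(path):
    """Decide whether metadata should be collected for this file."""
    p = PurePosixPath(path)
    under_root = any(root in p.parents for root in INDEXED_ROOTS)
    # Look at every directory component, but not the file name itself.
    excluded = any(part.lower() in EXCLUDED_DIR_NAMES for part in p.parts[:-1])
    return under_root and not excluded and p.suffix.lower() in INDEXED_EXTENSIONS
```

System files and browser caches simply fall outside the policy, so they never pay the metadata overhead.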

You still don't explain why you need a database..
by renoX on Wed 22nd Dec 2004 10:29 UTC

OK, what is the need?
The need is for your father to be able to find a file easily.
There are two ways to find a file easily:
- by its name which is done correctly with current implementation.
- by its content, which is a problem.
Note: I didn't say by its metadata, but by its content, as currently MP3 tags are inside the files.

If you want to find a file by its content, either
a) the 'file search' tool must understand the format of the file searched,
or
b) the application which creates the files must extract the content of the file to put it in the metadata for example.

For me the advantage of a) is that
1) it doesn't rely on the capability of the FS; the DB is, as it is now, managed by the search tool.
2) the search tool is able to read the whole content of the file not just the metadata.

The big downside of a) is that the search tool must be very intelligent to understand each format, but in practice very few formats are needed to cover 90% of the need: doc format for documents, MP3 tags for songs, and ASCII files are enough (add image/video tag extraction for 99%).

You advocate modifying the OS for adding capabilities to search tool, I wonder if just modifying the search tool wouldn't be enough?


Re: You still don't explain why you need a database..
by Luke McCarthy on Wed 22nd Dec 2004 10:53 UTC

You advocate modifying the OS for adding capabilities to search tool, I wonder if just modifying the search tool wouldn't be enough?

To cut into the conversation...

Well my first thought is that it would make incremental indexing easier and faster. How to notify the search tool of every modification, making sure nothing is missed, without a costly re-index? What happens when the program is not active when files are added and modified? Storing this in the FS structures is much more efficient and reliable.
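
For comparison, this is roughly what an external tool has to do when it was not running while files changed: rescan everything and compare against what it remembered. A small Python sketch, assuming the tool keeps a snapshot of path to modification time (the function name and the snapshot format are invented here):

```python
def plan_reindex(indexed, on_disk):
    """Compare two {path: mtime} snapshots and list the work to do.

    `indexed` is what the search tool saw last time; `on_disk` is the
    result of a fresh scan. Both are hypothetical in-memory snapshots.
    """
    added = sorted(p for p in on_disk if p not in indexed)
    modified = sorted(
        p for p in on_disk if p in indexed and on_disk[p] > indexed[p]
    )
    removed = sorted(p for p in indexed if p not in on_disk)
    return added, modified, removed
```

Note that building `on_disk` still means walking the whole tree, and a file rewritten with an unchanged timestamp is missed. Those are the costs and gaps that keeping the index in the FS structures would avoid.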

How, also, would other applications be able to take advantage of this searching and present data in convenient ways for the application domain? It's really more than just a program for searching.

The point of a database is for the data to be pulled apart and tagged in a consistent way instead of thousands of incompatible file formats which are very hard to process automatically using general tools.

My thoughts on network interchange... Invent a file format that can represent whatever model you have (please, not XML, we need non-text too). Convert to-and-from. And then wait for HTTP/2.0 ;-)

RE: Luke McCarthy
by renoX on Wed 22nd Dec 2004 13:34 UTC

While I agree that if every application is made 'meta-data' aware, this is probably less costly than having an external tool doing the indexing (even if the indexing is time-aware and only scans modified directories/files), I'm quite wary of having to be sure of the behaviour of each application to make something work correctly.

What happens if an application does not update its metadata?
You can't search those files correctly: so bye-bye to easily porting applications from other OSes.

Of course if the 'search tool' doesn't know how to handle a file type, there is the same problem, but a bit smaller, since if several applications use the same file format you need only one plugin for parsing the file type.

Also you're writing about using non-text in the metadata, I'm not sure what you plan to put inside.. Images?
Sure there are some AI applications which are able to search inside images, but no one uses them.

RE: Luke McCarthy
by bogomipz on Wed 22nd Dec 2004 14:32 UTC

Also you're writing about using non-text in the metadata, I'm not sure what you plan to put inside.. Images?
Sure there are some AI applications which are able to search inside images, but no one uses them.


BeOS stored file-icons as metadata. Thumbnails for hi-res images and perhaps also for video-files makes perfect sense to store as metadata (Haiku will most probably do this for image files). While this is not meant for searching, I still see it as the "right" way to store icons and thumbnails. Also, at some point in the future it will probably be quite normal to do image searches on your home computer to find that photo of your dog eating the neighbour's ice cream.

As for what Luke was thinking of when mentioning non-textual data, he will have to answer for himself. (Representation of objects from an OO language, maybe?)

>>> but on practive very few formats are needed to cover 90% of the need: doc format for document, MP3 tags for songs, ASCII files is enough (add images/video tags extration for 99%)..

Indeed, there are few formats used by the majority that would cover most needs. However, there are many specialized formats as well: CAD/CAM, software design and many other industries have their own formats. Worse yet, while Word defines specific title, author and other meta-attributes, others don't. XML and HTML can encode attributes in many ways. There are many XML DTDs/schemas, and each would need a specific filter.

The industry is dynamic and formats are not static. They keep changing, new ones are created all the time. Therefore, there is a large number of old obsolete formats that still need to be supported as files can be stored for a long time. Thus we have the burden of both maintaining filters for old formats and developing new ones to keep up with the latest popular formats.

Finally, some formats are not public; they are the private property of companies. They can be very difficult or impossible to process. You get into all kinds of legal issues.

While on the surface it seems simple to handle the few most popular formats, in practice it is not good enough.

>>> You advocate modifying the OS for adding capabilities to search tool, I wonder if just modifying the search tool wouldn't be enough?

Make it plural, tools and applications. Some search from a file manager, others from an application such as a word processor, or mail client. All the applications would need this capability, all the filters replicated. That is why placing the parsing/processing of file formats at the OS level would remove redundancy and make it much easier to implement.

RE: Dependence on one database.
by dpi on Wed 22nd Dec 2004 14:49 UTC

I hope you're right, especially on the API / database independence part.

After reading the posts here, there's another thing i'm wondering about: the current metadata. What are you going to do with that? I'm gonna use MP3 ID3 tags as example even though there are multiple versions (ID3v1 and ID3v2). I'll keep numbers low for the sake of simplicity, but they could just as well be higher in another example so please don't apply the solution that migration is easy in this example.

Let's say a user has a number of these MP3 files.
* The user has 10 correct ID3v1 tagged MP3s in the directory "artist/cdname" which he/she ripped from a CD him/herself in 1999 (these defined as 'X').
* The user has 6 correct ID3v2 tagged MP3s in the directory "artist/cdname" which he/she ripped from a CD him/herself in 2004 (these defined as 'Y').
* The user has 4 MP3s which he/she downloaded from the Internet.
** 1 of them has the filename 'a.mp3' and has no ID3 tag (this defined as 'Z1').
** Another one has the filename 'artist - songname' and has no ID3 tag (this defined as 'Z2').
** The third one has the filename 'b.mp3' and has a correct ID3v1 tag (defined as 'Z3').
** The last one has the filename 'c' and has a correct ID3v2 tag (defined as 'Z4').

Now, my questions in this example are:
1) How would you know how to index this correctly (or actually do it)? What would you index? Would you 'use' the ID3v1 tags and/or the ID3v2 tags? What about the filenames then? How do you know which delimiter to use (often used in 'cut' but that's not 'AI' of any kind)?
2) Where would you store it for use? Would the 'system' (DBFS system) take the ID3vX tags and put that into its own database? Would it try to be in sync or would it remove the ID3vX tags and put that metadata into its system? Would it use it, and then simply ignore it?
3) I'm really wondering if this example also does apply on other examples or if the MP3 example is simply just a popular example but not representative.
4) How is the 'system' going to support other 'file formats', if 'it' wants to, that is? Using a library or specification to support that format thus making it extendible? Are users allowed to extend such support in e.g. ~/.dbfs/formats beyond e.g. /etc/dbfs/formats? Or is this flexibility not wished for? (I had the Amiga a bit in mind when I was thinking about this point).
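
To make question 1 concrete, here is one possible merge policy written out in Python: take whatever the filename yields, let ID3v1 override it, and let ID3v2 override both. The priority order, the ' - ' delimiter and the function names are assumptions for the sake of the example; the tags are assumed to arrive already parsed into dicts.

```python
import os

def from_filename(filename, delimiter=" - "):
    """Guess artist/title from names like 'artist - songname(.mp3)'."""
    stem = os.path.basename(filename)
    if stem.lower().endswith(".mp3"):
        stem = stem[:-4]
    if delimiter in stem:
        artist, title = stem.split(delimiter, 1)
        return {"artist": artist.strip(), "title": title.strip()}
    return {}  # names like 'a.mp3' or 'c' carry no usable information

def merge_metadata(filename, id3v1=None, id3v2=None):
    """Later sources win; empty values never overwrite earlier ones."""
    merged = {}
    for source in (from_filename(filename), id3v1 or {}, id3v2 or {}):
        merged.update({k: v for k, v in source.items() if v})
    return merged
```

Under this policy Z1 ends up with no metadata at all, Z2 gets artist and title from its name, and Z3 and Z4 rely on their tags. Whether the tags should then be kept in sync with the database or dropped (question 2) is a separate decision this sketch does not make.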

PS: versioning is nothing new (VMS had it). Afaict, this doesn't have to use a DBFS or architecture at all, but i'll look a bit more into OSF now so take it with a grain of salt. Expandable partitions is nothing new either (XFS has it, LVM allows it easily).

again, the major misconception
by bobjohnson on Wed 22nd Dec 2004 14:52 UTC

is that this is a paradigm shift that WILL happen, it's the natural progression. I think what most people don't think about is that the underlying concepts are intended to replace the traditional file system IN TIME, not right now.

This discussion has brought up some major issues, which are valid:
1. Backwards compatibility
Personally, the software I'm working on will import all of the current information, and pretty much do away with my current software.
2. This is not intended to be a file finder:
the concepts involved (at least in my project, and many others) are intended to create a new way of storing data, so that the current idea of dealing with a file is obsolete. One would deal with concepts that can be related to concepts. In the webcentric world that we are quickly evolving, you mostly just deal with (html, css, etc) formatted database entries. The ideal with systems such as this include application servers more than anything.

3. Versioning: By design or not, one could have thousands of copies of the same concept. (This is not ideal obviously) However this could be good, because in a distributed knowledge storage system, one could then implement a rating system, something similar to slashdot, where moderations are moderated. Thus, the obvious choice becomes clear. For example: let's say that 1000 people took part in creating all of the president information. We're just creating a list of presidents. Well, let's say that The White House participated. Based on who they are, it's assumed that more than likely they'd have the right list, (we'd like to hope at least). They would have a certain amount of clout on such matters, and thus would most likely be the highest rated source of that information. it goes on and on, after this....

Luke put this very well:
The point of a database is for the data to be pulled apart and tagged in a consistent way instead of thousands of incompatible file formats which are very hard to process automatically using general tools.

The way I see it, the metadata should be more accessible to applications and the user. One should be able to dynamically create metadata that is relevant to the object, and there should be a somewhat decentralized definition of the base of an object: a person has a first name, last name, etc....

Basically, the aim of my software is to stop reinventing the wheel, and make my life easier with not using bad, buggy, and non-datacentric software.

in the end, it really doesn't matter what people believe will happen with these concepts of the database as a filesystem. Basically there is no right answer when it comes down to it. Just because there are files, directories and what not and it works doesn't mean there aren't any other good paradigms to find. It just happens that files and directories have done the job for a long, long time. (Longer than i've been around, that's for sure), but i think for truly innovative things to appear, one must disregard common thought.

it reminds me of a creative thinking class I had at engineering school: we had to come up with an invention, and I came up with automatically dimming headlights. The teacher gave me a 'D', and told me that GM tried it in the 50's, and it worked for crap. I replied that perhaps there've been some advances in technology since then, and it might work a little bit better these days. Since then, I've seen it in numerous models of European cars. This is the point: just because it didn't work in the past, or seems like it won't, I won't let someone like that discourage me from proving them wrong;)

cheers!


Re: again, the major misconception
by dpi on Wed 22nd Dec 2004 15:20 UTC

By design or not, one could have thousands of copies of the same concept. (This is not ideal obviously) However this could be good, because in a distributed knowledge storage system, one could then implement a rating system, something similar to slashdot, where moderations are moderated. Thus, the obvious choice becomes clear. For example: let's say that 1000 people took part in creating all of the president information. We're just creating a list of presidents. Well, let's say that The White House participated. Based on who they are, it's assumed that more than likely they'd have the right list, (we'd like to hope at least). They would have a certain amount of clout on such matters, and thus would most likely be the highest rated source of that information. it goes on and on, after this.

A few points to this one.

By default, you show the latest version. VMS had this in the shell, in the form of filename.extension;version. If you did not specify the version, it took the latest. I couldn't find so fast how the VMS FS exactly functions. The user manual doesn't seem to explain that.
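
The naming rule itself is easy to sketch. The Python below only illustrates the filename.extension;version convention described above, with versions held in a plain dict; it says nothing about how the VMS file system stores them, and the function names are made up.

```python
def parse_spec(spec):
    """Split 'name.ext;version' into (name, version or None)."""
    name, sep, version = spec.partition(";")
    return name, (int(version) if sep and version else None)

def resolve(spec, stored):
    """Return (version, content); with no version given, pick the latest.

    `stored` maps a file name to {version_number: content}.
    """
    name, version = parse_spec(spec)
    versions = stored[name]
    if version is None:
        version = max(versions)
    return version, versions[version]
```

So 'report.txt' resolves to the newest copy, while 'report.txt;1' pins the first one.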

What you say applies to the editing of files; as in, creative work from humans. That could be e.g. text, audio, video, code or a combination of that. What you say sounds very much like a Wiki on FS layer, which i read about at Shouldexist.org the other day: http://www.shouldexist.org/story/2004/10/13/32944/857 One really awesome example of usage could be that you're able to 'diff' 2 versions against each other. Besides being able to see what the difference of content is, it would also be interesting to see other differences between these 2 versions: when was it edited ('modified'), who edited it ('username' and/or 'realname'), eventually statistics or allowing a feature such as moderation ('how useful does each $user find this edit?').

As i said, what you say applies to 'editing of files'. Files which are static, such as libraries and binaries, don't necessarily need this. This makes one think it should be possible to turn versioning off, or to turn it on for specific directories and/or files. Logs also don't need this because they already have this function! Actually, compressing old logs, such as with logrotate, is very useful against too much stored data which isn't used, hence this would eventually apply to 'versioning' in the FS as well.

interesting idea about directories
by Hanno on Wed 22nd Dec 2004 15:44 UTC

This is not directly related to this article but I think it might be of some interest: There was an interesting article about the possibility of replacing directories with permanent database queries on the OpenBeOS (now HaikuOS) Newsletter.

I don't know if it's somewhere on their new site, so here is a link to the old one:
http://open-beos.sourceforge.net/nsl.php?mode=display&id=51#183

Uses for Meta Data
by Kramii on Wed 22nd Dec 2004 16:01 UTC

I think that meta-data will be used where it is either relatively easy or very desirable for the users to create the meta-data. This applies particularly to digital archives of music, films, pictures and the like.

Meta-Data entry could be made fairly easy for users in all kinds of ways:

Cameras with built-in GPS could generate tags which indicate where pictures were taken (useful for identifying holiday snaps).

Users could select a set of files and apply the same meta-data to them all, e.g. selecting a whole lot of pictures that have dogs in them and entering "Pictures of dogs" in the meta-data.

Where meta-data is harder to generate, or is less useful, other methods of cataloguing files are required, e.g. full text search of textual data.

Kramii.


RE: dpi
by Slobodan Celenkovic on Wed 22nd Dec 2004 17:12 UTC

>>> Now, my questions in this example are: ...

Indeed, there are certainly a lot of details to be defined. It is not going to be a simple 1-1 mapping from file format attributes to database columns. Some sort of heuristic will have to be employed to filter out junk. For instance, if the attribute Author=a is found then the filter ought to realize that this is junk and should be ignored. Using a dictionary of words and common names could improve accuracy. It won't be trivial. Maybe this is one of the reasons for WinFS delays?
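
A first cut at such a junk filter might look like the Python below. The rules and the placeholder list are guesses at what "junk" means; a real filter would also consult the dictionary of words and names mentioned above, which is left out here.

```python
# Hypothetical placeholder values that carry no information.
PLACEHOLDERS = {"unknown", "untitled", "none", "n/a"}

def looks_like_junk(value):
    """Heuristic: reject values too short or too uniform to be meaningful."""
    text = value.strip().lower()
    if len(text) < 2 or text in PLACEHOLDERS:
        return True                      # e.g. Author=a
    if not any(ch.isalpha() for ch in text):
        return True                      # digits or punctuation only
    if len(set(text)) == 1:
        return True                      # e.g. 'xxxxxxx'
    return False

def clean(attributes):
    """Keep only the attributes that pass the junk heuristic."""
    return {k: v for k, v in attributes.items() if not looks_like_junk(v)}
```

Even this catches Author=a and the 'xxxxxxx' style of filler, but it will obviously let plenty of plausible-looking nonsense through.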

>>> 4) How is the 'system' going to support other 'file formats', if 'it' wants to, that is? Using a library or specification to support that format thus making it extendible? Are users allowed to extend such support in e.g. ~/.dbfs/formats beyond e.g. /etc/dbfs/formats? Or is this flexibility not wished for? (I had the Amiga a bit in mind when i were thinking about this point).

The conventional approach would be to define a plugin mechanism of some sort so that filters for formats can be added and removed as necessary, either by the original creator (Microsoft for WinFS) or developers using the API (apps using WinFS). An alternative is for apps themselves to supply the values via the API.
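
In its simplest form such a plugin mechanism is just a registry keyed by file extension. The Python sketch below is a generic illustration with invented names (register_filter, extract and so on); it is not the WinFS API.

```python
import os

FILTERS = {}  # extension -> function(bytes) returning a dict of attributes

def register_filter(extension):
    """Decorator that installs a format filter for one extension."""
    def decorator(func):
        FILTERS[extension.lower()] = func
        return func
    return decorator

def unregister_filter(extension):
    """Remove a filter, e.g. when its plugin is uninstalled."""
    FILTERS.pop(extension.lower(), None)

@register_filter(".txt")
def text_filter(data):
    """Example plugin: first line as title, plus a word count."""
    text = data.decode("utf-8", errors="replace")
    lines = text.splitlines()
    return {"title": lines[0].strip() if lines else "",
            "words": str(len(text.split()))}

def extract(filename, data):
    """Run the matching filter; unknown formats yield no metadata."""
    _, ext = os.path.splitext(filename)
    handler = FILTERS.get(ext.lower())
    return handler(data) if handler else {}
```

The alternative mentioned above, apps supplying values directly, would bypass extract() and write to the store themselves.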

I've been using this for years
by RiverRat on Wed 22nd Dec 2004 19:54 UTC

Without going into too much detail, let me just point you at two links:
http://www.imanage.com/
http://www.hummingbird.com/products/enterprise/dm/index.html

Both products are widely used in the legal industry (document centric) and do an amazing job at abstracting the filesystem from the user.

All a person has to do is say "I want all briefs from my xxx case with the name John Smith somewhere in the text" and presto!

It seems like the various projects discussed recently are trying to make this type of functionality available to the masses. A great idea if implemented properly, but a terrible idea unless they can get all vendors to change their file open/save dialog boxes...

some thoughts...
by mmu_man on Thu 23rd Dec 2004 12:26 UTC

> Why is it not simple to use your address book in Outlook
BeMail runs a query on People files (each contact is a 0 length file in BeOS), to populate the To: popup menu, and does typeahead on that list.

> What about versionning ?
As others mentioned, VMS has had that for ages.
You might as well have a look at cvsfs-fuse and WayBack:
http://fuse.sourceforge.net/filesystems.html

> That was until Apple came up with Spotlight. ;)
Apple didn't come up with that, they hired an ex-Be engineer and sto^H^H^Hreused what BeOS has had for years.

> Neither HTTP nor FTP have been upgraded to support these new file systems
Good point, I often have to zip up files to ftp them between beos boxen. Time to write an RFC I guess ;)

> An open API would help a lot so that developers can start to use these features.
I've been thinking around about making a compatibility layer for BeOS attributes and Linux ones (POSIX attr), but they aren't even compatible semantically.
POSIX attrs are just a pair (char name[], byte value[]), while BeOS attributes are (char name[], uint32 type, char value[]) (type is not part of the key though; i.e. you can't have 2 attrs of the same name even of different types). Simplest would be storing the type as the first 4 bytes of value in Linux, but I'm sure ppl will object that it makes strings less readable and blah blah (though updated tools could parse them and show that correctly).
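
The "type in the first 4 bytes" idea is only a few lines. A Python sketch, where the big-endian byte order and the four-character code used for strings are choices made for this example rather than an agreed format:

```python
import struct

# Illustrative type code built from four ASCII characters.
STRING_TYPE = int.from_bytes(b"CSTR", "big")

def pack_typed(type_code, value):
    """Prefix the raw value with a 4-byte big-endian uint32 type code."""
    return struct.pack(">I", type_code) + value

def unpack_typed(raw):
    """Split a stored value back into (type_code, value)."""
    if len(raw) < 4:
        raise ValueError("value too short to carry a type prefix")
    (type_code,) = struct.unpack(">I", raw[:4])
    return type_code, raw[4:]
```

The packed bytes are what would be handed to the Linux attribute call as the value. With a four-character code stored big-endian the prefix stays readable in a dump (CSTRhello), which takes some of the sting out of the readability objection.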

> Damn I miss BeOS, my new computer doesn't work with it.
You might want to have a look at bebits, chances are you'll find patches for more than 1GB of RAM, athlonXPs and fast CPUs... and new drivers for new video cards ;)

> There was so much potential there, so long ago.
<advertising>
There *is* potential: http://haiku-os.org - http://yellowtab.com


> Another problem is what happends to metadata when file travels between 2 file systems.
Well, BeOS handles that to some extent, that is, old MacOS resources are published as attributes, so if you copy a file to BFS, then back to HFS, they are preserved at least.

> In fact, you could define multiple hierarchies, based on different criteria.
> The main point is that we need more flexibility to be able to use multiple
> attributes, not just a single hierarchy.
That's something BeOS doesn't provide off-hand but could be possibly implemented in Tracker, by building special type objects which would contain namespace description using attribute names, and filter-outs, like {MAIL:thread}/{MAIL:when}/{name}. Seriously thinking about prototyping something with Tracker.
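
The template part of that idea can be tried out without touching Tracker at all. A Python sketch, assuming each file's attributes are available as a dict; the function name and the rule that a file missing an attribute is filtered out are my own choices for the example:

```python
import re

def virtual_path(template, attributes):
    """Expand {ATTR} placeholders from a file's attributes.

    Returns None when an attribute is missing, meaning the file does not
    appear in this virtual hierarchy at all.
    """
    missing = []

    def substitute(match):
        key = match.group(1)
        if key not in attributes:
            missing.append(key)
            return ""
        # A '/' inside a value would create an extra level, so replace it.
        return str(attributes[key]).replace("/", "_")

    path = re.sub(r"\{([^{}]+)\}", substitute, template)
    return None if missing else path
```

Running several templates over the same set of files gives several hierarchies at once, which is the flexibility asked for in the quote.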

> Toss in some google style search technology so we don't all need to know regular expressions and SQL syntax,
Or just use BeOS, you don't need to know the query language to use it, you can build queries graphically: http://www.birdhouse.org/beos/byte/24-scripting_the_bfs/formula.png
Btw, nice article there http://www.birdhouse.org/beos/byte/24-scripting_the_bfs/ describing what one can already do with bfs.

bookmarked
by mmu_man on Fri 24th Dec 2004 08:39 UTC

I think I'll keep a bookmark on that news, I'll read all the messages later, I'm sure there are nifty ideas I could prototype in BeOS/Zeta ;)
Talking about bookmarks:
http://clapcrest.free.fr/revol/beos/shot_googlefs_005.png
That's no more a prototype ;)
The idea is to integrate Google's search within BeOS.
You can run a query on that volume, and it will publish google's answers as bookmarks (0 length files with url, title, ... attributes).
So you don't need a static bookmark anymore ;)

re: bookmarked
by hobgoblin on Fri 24th Dec 2004 15:39 UTC

wtf!

now that's an insane idea. but can those files/bookmarks be saved so that i can pass them on to someone else as a collection via some kind of storage media?

having the folders in fact be preset database queries is one of the more interesting parts of using a database as a filesystem (or part of one). if it also allows for metadata setting by drag and drop into one of these preset queries/folders (i believe this was a feature presented for winfs) then over time the setting of metadata doesn't have to be as troublesome as it can be at times right now...