“In the past five years, flash memory has progressed from a promising accelerator, whose place in the data center was still uncertain, to an established enterprise component for storing performance-critical data. Its rise to prominence followed its proliferation in the consumer world and the volume economics that followed. With SSDs, flash arrived in a form optimized for compatibility – just replace a hard drive with an SSD for radically better performance. But the properties of the NAND flash memory used by SSDs differ significantly from those of the magnetic media in the hard drives they often displace. While SSDs have become more pervasive in a variety of uses, the industry has only just started to design storage systems that embrace the nuances of flash memory. As it escapes the confines of compatibility, significant improvements in performance, reliability, and cost are possible.”
There’s going to be a revolution. Addresses on disk will be byte-granular and indicated by 64-bit values. The hardware will probably still use blocks underneath, but addresses will be bytes. It could mean an entirely different operating system.
Why? That’s hardly revolutionary stuff, let alone the basis for new operating systems. The existing file APIs are already byte-based – all this needs is a filesystem that understands that it’s not dealing with block-based hardware.
You don’t need to write a whole new OS; just implement new filesystem drivers and update the storage I/O ABIs.
Well, if your OS and its drivers are terribly rigid and poorly designed….
See https://en.wikipedia.org/wiki/IBM_System/38#Data_Storage and https://en.wikipedia.org/wiki/Object_storage_device (To me, OSD always looked a bit like ZFS’ lower storage layer – and I think there’s even a certain overlap in the people working on ZFS and the OSD standard.)
http://linux.die.net/man/3/lseek64
You’re welcome.
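To put it concretely (a minimal sketch; the filename and offset are invented): lseek64() and pread() already take plain 64-bit byte offsets, with no block arithmetic required of the caller.

    /* Minimal sketch: read 16 bytes starting at an arbitrary byte offset.
     * "data.bin" and the offset are made up for illustration. */
    #define _FILE_OFFSET_BITS 64   /* make off_t 64-bit even on 32-bit systems */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[16];
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* pread() takes a byte offset; any block layer is the kernel's problem */
        ssize_t n = pread(fd, buf, sizeof(buf), 1234567);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }

Whether the device underneath addresses bytes or 512-byte sectors is invisible at this layer.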
When the fundamental storage changes from volatile to nonvolatile, a new reality exists and everything changes. The old operating systems can obviously treat it conventionally, but the potential for a big improvement will be there until a new operating system is designed.
Why would we go from volatile to non-volatile?
Flash memory can’t be used as RAM. It can only be erased a limited number of times before wearing out and there’s no quicker way to accidentally wear out flash memory than to put a swap partition on it.
Even if that weren’t the case and we could use Flash memory as RAM, we’ve already got functionality along the lines you’re thinking of in the Linux kernel.
(For example, the ext2 filesystem driver has had “execute in place” support for memory-constrained, flash-based mobile devices for years and there’s also mmap() for userspace apps. There’s a smooth migration path to be made when we’re ready for it.)
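As a rough illustration of the userspace side (a hypothetical sketch; the filename is invented): mmap() already gives a program a file’s bytes in place, which is the same idea execute-in-place applies to code stored in flash.

    /* Hypothetical sketch: map a file and read its bytes directly,
     * with no read() calls and no intermediate buffer. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        struct stat st;
        int fd = open("config.txt", O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror("open/fstat");
            return 1;
        }
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        fwrite(p, 1, st.st_size, stdout);   /* pages are faulted in as touched */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }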
Also, even if Flash memory is faster than hard drives, it’s nowhere near DDR RAM speeds. Swapping on an SSD or SD card remains pretty much as painful as swapping on an HDD, and I can’t believe it’s all the SSD interface’s fault.
There is some cool research going on regarding NVRAM, such as STT MRAM* or memristors, but I don’t think Flash memory will be able to go there.
* STT = Spin Transfer Torque. In short, the main issue regarding MRAM today is that we don’t know how to flip the magnetization of a small magnet without affecting that of neighboring magnets, which limits storage density. STT research is about using spin-polarized (“magnetized”) electrical currents flowing directly into the magnet in order to do that.
… Let alone the implications of having complex firmware that handles garbage collection behind the OS’s back.
In 40 years, we have learned all there is to know about magnetic drives – especially how they fail.
We have yet to reach the same level of maturity when it comes to SSDs. (Let alone the possibility of bricking all the members of a storage pool, all at once, due to a firmware bug.)
SSDs will replace HDDs – there’s no doubt about it.
However, I tend to choose caution over innovation when it comes to data storage…
– Gilboa
The SATA interface chip introduces lag. SATA wasn’t designed for SSDs, so it’s kind of a bottleneck when added to an SSD. A much more direct way to access the drive would help with the speed. The Fusion-IO stuff is a good example of the speeds that could be reached when SATA is eliminated.
The consensus is that SSDs using NAND are a stopgap measure until NVRAM is commercially available in bulk. NAND becomes less efficient as it gets smaller, unlike transistors, which become more efficient, and producers are already starting to see the effects of this. More NAND chips on fewer channels means reduced speed, and the smaller NAND cells wear out faster. It’s a good first-generation solid state disk product, but ultimately there will be a better technology that has a longer run than NAND.
It could happen if it becomes cheap enough – the first mass-market products will ship this year:
http://hardware.slashdot.org/story/13/04/04/016221/non-volatile-dim…
Oh, there’s something I missed this morning too…
I’d say the main reason for doing that would be increased reliability and simplified abstractions.
Reliability would be increased because machines could be smoothly powered off and back on without losing any state, and without a need for hackish “save RAM data to disk periodically” mechanisms. Suspend and hibernate could well cease to exist in less than a decade, replaced by the superior alternative of simply turning hardware on and off with a hardware switch.
Abstractions would be simplified because there wouldn’t be a need to maintain two separate mechanisms to handle application state and data storage and interchange through files. A well-designed filesystem could instead address the use cases of both malloc() and today’s filesystem calls, much to the delight of “everything is a file” freaks from the UNIX world.
(As an aside, the latter could actually already be done today, by allocating all free RAM into a giant ramdisk, mounting it alongside mass storage, and treating process address spaces as a bunch of memory mapped files. It simply doesn’t make sense at this point, since both memories have such different characteristics…)
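As a rough sketch of that emulation (the path and struct are invented; /dev/shm is RAM-backed on Linux): a program can keep its live state in a shared mapping of a file, so the same bytes act as both its working memory and its stored form. It survives the process, but not a power cycle, which is exactly the limitation NVRAM would remove.

    /* Rough sketch of "process state as a memory-mapped file". */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct app_state {            /* whatever would normally live only in RAM */
        int  counter;
        char last_command[64];
    };

    int main(void)
    {
        int fd = open("/dev/shm/app_state", O_RDWR | O_CREAT, 0600);
        if (fd < 0 || ftruncate(fd, sizeof(struct app_state)) < 0) {
            perror("open/ftruncate");
            return 1;
        }
        struct app_state *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (s == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        s->counter++;             /* updating "memory" is updating the "file" */
        printf("run number %d\n", s->counter);
        munmap(s, sizeof(*s));
        close(fd);
        return 0;
    }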
I also think non-volatile ram would make a lot of sense, assuming the technology were feasible and not overly compromising like flash is today.
High-throughput database and file system processes are obvious candidates; they would benefit tremendously by eliminating the need to O_DIRECT/fsync constantly for committing transactions.
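For context, a simplified sketch of what a commit costs today (real databases use write-ahead logs and O_DIRECT, but the durability barrier is the same idea; the filename and record are invented): nothing is safe until fsync() has made its round trip to the media, and that round trip is what byte-addressable NVRAM would remove.

    /* Simplified sketch of a durable "commit" on today's storage stack. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int commit_record(int fd, const char *rec, size_t len)
    {
        if (write(fd, rec, len) != (ssize_t)len)
            return -1;
        return fsync(fd);     /* block until the bytes are on stable media */
    }

    int main(void)
    {
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0)
            return 1;
        const char *rec = "txn 42: debit A, credit B\n";
        int rc = commit_record(fd, rec, strlen(rec));
        close(fd);
        return rc ? 1 : 0;
    }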
By unifying RAM/disk into one concept, we could open up new programming methodologies where programs and/or data can simply exist without having to sync state between disks and RAM. I’d even go further and make stateful objects network-transparent too, so that they “just exist” and never need to be serialized from a programmer’s point of view (such things could be handled automatically by the languages/operating systems). Like you say, this could be emulated today, but it’d necessarily be in a lower-performance and/or less reliable fashion than NV-RAM could achieve.
Modern NAND flash is not ideal; the way it works adds latency and has undesirable addressing properties. NOR flash is technically far closer to a RAM substitute, since it’s truly random access and more reliable than NAND, without needing a whole Flash Translation Layer in front of it. If NOR flash could be made cheaper and denser, it would completely replace NAND.
What you describe is already in place, it’s called “suspend to RAM” (aka “sleep”) and it’s far from simple. There’s tons of runtime state that isn’t in main memory and needs to be stored and restored when a machine changes power states. Just a little food for thought:
* peripherals (graphics cards, displays, mice, scanners, etc.)
* timing circuits (programmable interrupt clocks, watchdogs, etc.)
* environmental dependencies (open network connections, security contexts, etc.)
All of these need to be gracefully taken care of and reinitialized, and if possible made to continue previously interrupted tasks. All of this is already handled by current OSes. And all of this is very, very messy and complicated.
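To give a feel for just one slice of it, here’s the shape of the per-driver work (a hypothetical sketch using Linux’s dev_pm_ops callbacks; the device and function names are invented):

    /* Hypothetical driver sketch: the bookkeeping a device needs around a
     * power transition, because its state is NOT sitting in main memory. */
    #include <linux/device.h>
    #include <linux/pm.h>

    static int example_suspend(struct device *dev)
    {
        /* quiesce the hardware, save registers that live on the card,
         * cancel pending work... */
        return 0;
    }

    static int example_resume(struct device *dev)
    {
        /* reprogram the hardware from the saved copy and restart work */
        return 0;
    }

    static const struct dev_pm_ops example_pm_ops = {
        .suspend = example_suspend,
        .resume  = example_resume,
    };

Multiply that by every device in the machine and you get an idea of why suspend bugs are so common.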
If NVRAM becomes as dirt cheap as DRAM is today, nothing would prevent its use in peripherals and timing circuitry too. The scenario which I describe, getting rid of that suspend kludge at last, can only work if everything that holds state inside of a computer is based on NVRAM.
It’s true that environmental dependencies would still have to be taken into account. But these are handled at a higher level than hardware considerations, and can consequently be taken care of in a much cleaner way. Network connections can time out and be brought back, as an example.
If you think of it, a constantly failing wireless network connection should be much more of a hassle to handle than infrequent suspends, and yet if you aren’t in a hurry modern OSs can handle that.
saso,
“All of these need to be gracefully taken care of and reinitialized, and if possible made to continue previously interrupted tasks. All of this is already handled by current OSes. And all of this is very, very messy and complicated.”
Indeed, however it’s complicated BECAUSE they use volatile RAM. All of that mess could be avoided in the future with NV-RAM. That’s the point: hypothetically, if future NV-RAM could be built to be as practical as normal RAM, then there wouldn’t be a reason to use normal RAM anywhere. Making devices power up into their previous state would be free, or next to it, without any of today’s complications caused by volatile RAM.
It’s looking more and more like the future belongs to SoCs, and RAM will become just another cache on the CPU. To really make that work NVRAM is needed.
I’m not sure I follow you: You mean using flash memory instead of RAM, or RAM instead of flash memory (NAND, NOR, etc)?
I assume you understand the implications of using flash memory to store volatile information, right? (Suggestion: Wikipedia, “write amplification”.)
– Gilboa
I just assumed soon we would have NVRAM in place of RAM — within a decade. What, are you in your 60’s? You sound scared of change. What’s it to you? Lose all your skills?
I certainly hope they do a new operating system if RAM goes from volatile to nonvolatile!
When I was a kid, I was very averse to change. I spent my teen years learning to deal with that and now, in my 20s, I just don’t like change for change’s sake.
Why bother creating a whole new OS with tons of new bugs to find and, possibly, a whole new userland API to port applications to when existing kernels are modular enough to be adapted?
Yes, research OSes are cool, but you don’t need to wait for new hardware to try them out and new hardware won’t force people onto a new OS.
Also, even if I were worried about my skills, why would they become obsolete if they’re probably just gonna layer a POSIX API on top of whatever it is anyway?
I certainly hope they do a new operating system if RAM goes from volatile to nonvolatile!
What kind of newness are you thinking of? The biggest change it would bring, that I can think of, is that sessions can be instantly “saved”. Power off the computer, but have the OS frozen in its current state and ready to go when the computer is powered back on. Apart from that, I don’t see any paradigm-shifting capabilities in having non-volatile memory at the core of a computer.
Then again, my imagination might be severely lacking. If you can enlighten me, you have my listening ear.
I haven’t thought about it much, but wouldn’t a lot change? I have a text/graphics file format in my operating system — think PDF. When I load it from disk, I put it in a doubly-linked-list memory structure. If RAM were nonvolatile, I could just store it there as-is, as long as I didn’t move it. (Work with me, I’m exploring possibilities.)
Doubtful. Formats for storing and exchanging data have very different requirements from in-memory data structures.
On-disk formats need to be clearly-defined, unchanging, space-efficient, and often use checksums and compression.
In-memory formats need to support efficient modification, often incorporate cache data, and are decompressed.
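A tiny illustration of that split (the structs and field names are invented for the example):

    /* The same "text run" as an in-memory node and as an on-disk record. */
    #include <stdint.h>
    #include <stdio.h>

    struct run_node {                 /* in-memory: pointer-linked, mutable */
        struct run_node *prev, *next;
        char            *text;        /* heap pointer, meaningless on disk */
        uint32_t         cached_pixel_width;  /* derived, not worth saving */
    };

    struct run_record {               /* on-disk: flat and self-describing */
        uint32_t length;              /* explicit size instead of a pointer */
        uint32_t crc32;               /* integrity check */
        char     text[];              /* payload, possibly compressed */
    };

    int main(void)
    {
        printf("RAM node: %zu bytes, disk record header: %zu bytes\n",
               sizeof(struct run_node), sizeof(struct run_record));
        return 0;
    }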
Load time will never go away because it’s just not feasible to waste 5.2MiB on disk for data that’d be 418.5KiB if you just PNG-compressed it first. (Those are actual numbers taken from GIMP’s readout for a deviantART pic named “Ganon Style”.)
…and even if you did, there’d still be programs that’d use different in-memory and on-disk layouts because they’ve found a more efficient way to do the in-memory stuff.
PDF is even worse since 99% of the time is spent generating raster renderings of vector (or mostly-vector) data (and re-rendering them every time you change the zoom level).
You could easily end up with a CD’s worth of data (700MiB+) for a 1MiB PDF if you just kept it all around just in case the user decided to reload the file.
The space-time trade-off just doesn’t make sense.
Sure. I could see things like databases benefiting hugely, but they are nothing like PDFs. Database file formats are already basically what you’re talking about (A minimally-serialized memory dump, loaded from disk into a page cache on demand) and you never manually manipulate them directly.
Many databases are already run off RAM. Sometimes via dedicated RAM storage engines, sometimes using more conventional storage engines pointed towards a virtual file system (read: RAM disk).
In fact, for all the instances where it makes sense to run a file system in RAM, we already have RAM disks.
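For example (a minimal sketch that assumes Linux, root privileges, and an existing /mnt/ramdb directory; the size is arbitrary): a tmpfs mount is all a “RAM disk” is today.

    /* Minimal sketch: put a filesystem in RAM with tmpfs. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        if (mount("tmpfs", "/mnt/ramdb", "tmpfs", 0, "size=2g") != 0) {
            perror("mount");
            return 1;
        }
        /* anything a database engine writes under /mnt/ramdb now lives
         * purely in RAM and is gone after a power cycle */
        return 0;
    }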
Of course, but you still either have to battery-back it or sync it to non-volatile storage at some point if you’re not working with transient data.
My point was that, of all the common types of on-disk formats, a database is probably the most suited to an architecture where there’s no distinction between RAM and permanent storage.
Usually you’d run temporary tables in there (eg web session data) so only non-persistent data would be lost. Much like how RAM is used normally in fact.
However, I have seen RAM disks used for persistent data as well. The setup will differ from DBA to DBA, but you’d generally expect a real-time sync (either to a duplicate database server mirrored over a TCP/IP link, or to a local RAID mirror using a battery-backed hardware RAID controller), with the whole lot sat on top of a UPS. (Much of that is standard gear even without running persistent data on a RAM disk, though.)
That depends entirely on the database engine. But in general you’d be right.
The Linux devs are playing around with compressing allocated memory to improve transfer speeds and use RAM more efficiently.
With ECC RAM, RAM already has checksums, and checksummed RAM pages, at the OS level, would increase security.
OS X already renders everything in PDF, and it pre-renders icons and things of different sizes then stores them in an on-disk cache.
Every conceivable version doesn’t have to be pre-rendered. Only the most common sizes need to be pre-rendered, or only the rendered sizes need to be cached. Plus, the rendered versions don’t have to be saved with the original file.
True, but from what I understand, that’s more like swap space and it can’t achieve the same compression ratios as data-aware formats like PNG. (Remember, PNG performs various transforms before DEFLATEing)
Point. I keep forgetting how much acceleration for checksum-related stuff is available in modern CPU instruction sets.
Point. I suppose it could work as long as there’s a strong effort to shame and shun any applications which put their caches somewhere that can’t be expired independent of their cooperation or knowledge.
I was under the impression “the most common size” for a PDF was “Fit Width”.
Also, I’m Canadian. I’ve got 5Mbit/800Kbit Internet and every member of my family (my writer/artist mother, my gamer brothers, my non-gaming, programming self, etc.) can never have enough disk space so wasting space on easily-regenerated caches for rarely-used files is not acceptable from a storage OR a transfer perspective.
“Pre-rendering the most common sizes” is only acceptable for things like icon themes where you can typically count the number installed on one hand and even the biggest ones have a small “per page” size.
Agreed. They’re talking about using XZ compression, so it’s just bulk compression rather than specific compression, like in the case of PNG.
It’s the OS that manages that, so I don’t think applications have any say in the matter, which is how it should be.
And full page in window.
Right, that’s why the render cache shouldn’t be stored with the original file, which is what I was pointing out.
There are lots of techniques and algorithms for cache management. The OS should keep the render cache at a reasonable size. We have enough CPU/GPU power now that re-rendering, or rendering everything on the fly, isn’t the performance hit it used to be.
My original point is that NVRAM wouldn’t change anything here. If Evince or Okular supported a non-volatile render cache, I’d turn it off to save space since I have many many PDFs, use each individual one infrequently, and always have less space than optimal. (I collect things like YouTube videos)
As far as PDFs go, I usually just read a research paper or electronic component’s data sheet, write a library, circuit diagram, or take notes, and then keep it around just in case.
Operating Systems with Paging could not do this — my operating system is single-address-map. In all the cases where I read files and place them into pointer-linked structures… I would never have to serialize them unless exporting to another machine. My crazy heap pointer-linked structures could be spaghetti, with one file mixed in with other files, freely, just like heap memory naturally gets all mixed up.
Filesystems are already spaghetti. That’s what fragmentation means and different filesystems have different approaches to keeping track of which file a fragment belongs to and their ordering.
You just don’t see that because the OS does such a good job of providing an abstraction layer between you and the on-disk format.
Honestly, the most promising use I can think of for what you’re envisioning is what WinFS and GNOME Storage were trying for… A filesystem that, rather than being hierarchical, is an SQL database.
Both attempts failed for lack of performance but, if they can bring non-volatile storage up to the same performance level as RAM, that problem might go away.
You’re missing his point there. I think he’s talking about the future of storage working a bit like this:
Say you have a character array of the alphabet: you can just request the 5th element in that array and the value returned would be “e”. Say that array was actually a file in RAM; then we’d only ever need to pass pointers rather than having to duplicate the file (i.e. a persistent version in storage and a volatile version in RAM).
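A toy sketch of that mindset (the names are invented, and a constant stands in for genuinely persistent memory): “opening” the file just hands back a pointer into storage, not a copy.

    #include <stddef.h>
    #include <stdio.h>

    /* pretend this range of the address space is NVRAM-backed and survives
     * reboots; here it's just a constant baked into the program image */
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz";

    /* hypothetical "open": return the storage itself rather than copying it */
    static const char *open_persistent(const char *name, size_t *len)
    {
        (void)name;                  /* the lookup is elided in this toy */
        *len = sizeof(alphabet) - 1;
        return alphabet;
    }

    int main(void)
    {
        size_t n;
        const char *data = open_persistent("alphabet", &n);
        printf("5th element: %c (of %zu)\n", data[4], n);
        return 0;
    }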
There are a few issues with that mindset, though:
1/ You’d need to know all the memory addresses of each bit of data you’re wanting to retrieve beforehand. That would require an epic centralised database that would fragment in no time at all. This isn’t a show stopper though; there will be ways around it, but it’s certainly a major issue that would need to be resolved. (It’s probably worth noting that we can already jump to specific blocks of data within a file without having to preload the file beforehand. But it’s beyond impractical to do so, which is why we use the much simpler approach we have currently.)
2/ The contents of files are not static. Take XML parsing: the tags will land in different locations within the file depending on the data held (bad example, but the 5th line of a dictionary will differ depending on the language of the dictionary). So most forms of structured data will still need to be stepped through before they can be parsed – which defeats the point of “spaghettifying” pointers.
What wouldn’t be an issue is “in memory” changes not affecting the saved versions (e.g. opening a Word document, making edits, and the version I’m editing not automatically changing the version I have saved unless I explicitly click “save”), as we already have technology that works around that limitation (CoW file systems).
While TempleOS’s vision is interesting, I think it adds waaay too much complexity for comparatively little gain. Particularly when most instances of structured data will still need to be stepped through.
BeFS did the filesystem as a database thing well before those two, and it worked really well.
Even if we never replace RAM with NVRAM… there are machines that never reboot. You could design a crazy system that would be demolished if it ever lost power. You could treat RAM like NVRAM, in an insane sort of way.
Then, imagine your file system directories done with KMAlloc() or whatever, and get rid of the concept of blocks: just bytes. A file is just some memory from KMAlloc().
“KMAlloc” is kernel malloc (kmalloc) in Linux, right? I have a different name for it in my OS.
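A toy userland sketch of that idea (invented names, with plain malloc() standing in for KMAlloc): a “directory” is just a linked list, a “file” is just an allocation, and there isn’t a block in sight.

    #include <stdlib.h>
    #include <string.h>

    struct file {
        struct file *next;            /* sibling in the same directory */
        char         name[32];
        size_t       size;
        void        *data;            /* the file body: plain heap memory */
    };

    static struct file *create_file(struct file **dir, const char *name,
                                    size_t size)
    {
        struct file *f = malloc(sizeof(*f));
        if (!f)
            return NULL;
        strncpy(f->name, name, sizeof(f->name) - 1);
        f->name[sizeof(f->name) - 1] = '\0';
        f->size = size;
        f->data = calloc(1, size);    /* no blocks, just bytes */
        if (!f->data) {
            free(f);
            return NULL;
        }
        f->next = *dir;
        *dir = f;
        return f;
    }

    int main(void)
    {
        struct file *root = NULL;                       /* empty "directory" */
        struct file *f = create_file(&root, "notes.txt", 4096);
        return f ? 0 : 1;
    }

Serialization only becomes necessary the moment the data has to leave that single address space.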