More likely the second point.
What I don't understand myself is this: I thought flash now had an increased number of read/write cycles and a wear-levelling mechanism inside the flash controller itself, so why is it still necessary to include such a mechanism in the FS?
Most likely because on the software side it is a lot easier to change the method, upgrade the filesystem and so on, whereas a generic wear-leveling mechanism inside the firmware just won't be as good when it has to suit a whole range of filesystems. A software side mechanism can take into account even the slightest detail, but a firmware one just has to deal with the lowest possible detail: block level.
External flash devices, like USB sticks, have wear leveling and so forth implemented in the device.
Internal flash devices, especially on things like cell phones, have a simple raw i/o interface to the device.
If Linux is to be competitive on embedded devices, then it needs better support for those sorts of flash devices.
MTD + JFFS2 had severe performance problems, and while YAFFS is pretty good, it has never gained widespread acceptance.
Or is he referring to Flash media needing to erase much larger data blocks possibly containing other data, such that you'd need to move pieces of the File Allocation Table itself to avoid losing things?
Well, I can't say I'm a professional on these matters, and I _might_ be talking plain bulls*it, but I guess he means that, for example, ext2/3 usually uses a block size of 512 bytes, whereas a flash drive uses anything from 16 KB to hundreds of kilobytes. So that'd mean a whole lot of unnecessary moving of data, I guess.
Problem is, there seemed to be a lot of details missing. At the least I expected that a new file system would consolidate writes to different sectors into a single block write. There was also no mention of write caching to collect a large number of write operations before modifying a flash block. Done properly, you can reduce writes to the flash itself by a large amount.
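To make the write-caching idea concrete, here is a minimal sketch in Python. It is not taken from any real filesystem; the class name `WriteCache`, the 512-byte sector size, and the 64 KB erase block size are all assumptions chosen for illustration. It buffers sector writes and counts how many erase-block rewrites a flush would cost.

```python
from collections import defaultdict

SECTOR_SIZE = 512              # assumed sector size in bytes
ERASE_BLOCK_SIZE = 64 * 1024   # assumed erase block size in bytes


class WriteCache:
    """Hypothetical cache that groups pending sector writes by erase block."""

    def __init__(self):
        # erase block index -> {sector number: data}
        self.pending = defaultdict(dict)
        # number of erase-block rewrites performed so far
        self.flash_block_writes = 0

    def write(self, sector, data):
        # Work out which erase block this sector lives in and buffer the write.
        block = (sector * SECTOR_SIZE) // ERASE_BLOCK_SIZE
        self.pending[block][sector] = data  # a later write replaces an earlier one

    def flush(self):
        # One rewrite per dirty erase block, however many sectors changed in it.
        self.flash_block_writes += len(self.pending)
        self.pending.clear()


def demo():
    cache = WriteCache()
    for sector in range(256):  # 256 sectors * 512 B = 128 KB = two erase blocks
        cache.write(sector, b"x" * SECTOR_SIZE)
    cache.flush()
    return cache.flash_block_writes
```

With these assumed sizes, 256 sector writes collapse into two erase-block rewrites, whereas writing each sector through individually could cost up to 256 of them.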
I know I don't know much about file systems, but I was under the impression that most filesystems do allow files to be scattered in pieces across the filesystem, hence fragmentation and the problems that causes, unless lack of that is how ext2 and ext3 avoid needing to defragment...
Actually, scattering files across the medium is what allows ext[2-4], other FFS-inspired filesystems, and most *nix filesystems in general to reduce fragmentation. If you scatter files, you can later expand them into adjacent free space. If you pack them tightly, then you need to find someplace else to store the additional data. This is the root cause of fragmentation on sequential allocation filesystems like FAT and NTFS. However, fragmentation is only an important consideration if seeks are expensive. So on a hard disk, it's important to know about free blocks and allocate them in a smart way.
On flash, seeks are fast, so FAT on an FTL-based flash medium isn't actually that bad. Lookups are still just about as inefficient as practically conceivable, but it's hard to design a filesystem that has bad read latency on flash media, even if that's your primary goal. In addition, some FTL implementations understand enough about FAT to determine whether a given filesystem block is free, allowing for some limited garbage collection. So if your flash medium has an FTL, you may be best off using FAT regardless of whether you're running Windows.
However, FTL is only a short-term solution to aid flash adoption by supporting legacy filesystems that don't properly address the requirements imposed by flash. Only the filesystem fully understands the metadata and the relationships between data blocks. It's far easier to teach the filesystem about flash than to teach the FTL about one or more filesystem types. This is a case where the abstraction layer gets in the way of the solution.
The principal reason FFS-like filesystems won't work on raw flash is that the mapping between an inode number and the offset of the inode on disk is a simple arithmetic function. Since flash requires a block to be moved somewhere else when it is updated, it breaks that assumption. Either they would need to change the inode number to reflect its new position on the medium, requiring updates to inodes and directory entries that refer to the inode, causing further relocations and a rippling series of inode number changes, or they would need to come up with some other way of locating inodes. Both of these changes are more than invasive enough to warrant a brand new filesystem.
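A rough sketch of the difference, in Python. The constants and names here (`INODE_SIZE`, `INODE_TABLE_START`, `fixed_inode_offset`, `InodeMap`) are made up for illustration, and real FFS-style layouts are more involved (block groups and so on), but the point is that the first lookup is pure arithmetic while the second goes through a table that can be updated when an inode moves.

```python
INODE_SIZE = 128            # assumed bytes per on-disk inode
INODE_TABLE_START = 4096    # assumed byte offset of the inode table


def fixed_inode_offset(ino):
    """FFS-style lookup: the location is computed from the inode number alone."""
    return INODE_TABLE_START + (ino - 1) * INODE_SIZE


class InodeMap:
    """Hypothetical indirection layer: inode number -> current offset."""

    def __init__(self):
        self.location = {}

    def relocate(self, ino, new_offset):
        # The inode was rewritten elsewhere; only the map entry changes,
        # so directory entries that hold the inode number stay valid.
        self.location[ino] = new_offset

    def lookup(self, ino):
        return self.location[ino]
```

With the fixed function, moving inode 3 would force its number to change. With the map, `relocate(3, some_new_offset)` is enough.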
Not surprisingly, LogFS chose to decouple inode numbers from locations on the medium. They keep a pair of journals that flip-flop on every journal update. The journal contains the current location of the inode file. The inode file contains every inode in the filesystem, arranged in a tree structure. When any erase block is written, the only other erase blocks written are for the inode file and the journal. This prevents having to update an erase block for each inode up the tree when the inodes dance around in response to a write. It also makes the code for updating inodes the same as for updating data, since they are both the contents of a file.
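For what it's worth, the flip-flop journal idea can be sketched like this in Python. This is only an illustration of alternating between two slots, not LogFS's actual on-medium format. The sequence number used to tell which slot is newer, and the names `FlipFlopJournal`, `commit` and `current`, are my own assumptions.

```python
class FlipFlopJournal:
    """Hypothetical pair of journal slots written alternately."""

    def __init__(self):
        # Each slot holds (sequence, inode_file_location), or None if unwritten.
        self.slots = [None, None]
        self.sequence = 0

    def commit(self, inode_file_location):
        # Each update goes to the slot that was NOT written last time,
        # so the previous record survives if this write is interrupted.
        self.sequence += 1
        slot = self.sequence % 2
        self.slots[slot] = (self.sequence, inode_file_location)

    def current(self):
        # The slot with the highest sequence number is the live one.
        valid = [s for s in self.slots if s is not None]
        if not valid:
            return None
        return max(valid)[1]
```

After two commits both slots are populated, and a third commit overwrites the older of the two while leaving the most recent previous record alone.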
LogFS isn't the perfect solution, but it's pretty good. It currently has to scan each erase block to find free filesystem blocks. So identifying erase blocks for garbage collection isn't as efficient as it could be. Wear leveling is implemented in a rather simplistic way. Keep a count of write cycles for each erase block, and stop using erase blocks when they exceed a certain count. There is no attempt to level the wear, just a guarantee not to exceed manufacturers' specifications. I'm not sure the effect is really any different. Perhaps it could be made better by initially setting this limit quite low and then periodically increasing it in coarse increments when a certain percentage of allocation attempts are rejected for exceeding the cycle limit.
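Here is a small Python sketch of the variation I'm suggesting: start with a low cycle limit and raise it in coarse steps once enough attempts are being rejected. None of this is LogFS code. The class `WearLimiter`, its parameters, and the way the rejection ratio is measured are all assumptions for illustration.

```python
class WearLimiter:
    """Hypothetical per-erase-block cycle limiter with a limit that rises in steps."""

    def __init__(self, num_blocks, start_limit=100, step=100,
                 max_limit=1000, reject_ratio=0.5):
        self.erase_counts = [0] * num_blocks
        self.limit = start_limit          # current soft limit
        self.step = step                  # coarse increment
        self.max_limit = max_limit        # stands in for the manufacturer's rating
        self.reject_ratio = reject_ratio  # fraction of rejections that triggers a raise
        self.attempts = 0
        self.rejections = 0

    def try_erase(self, block):
        """Return True and count the cycle if the block is under the current limit."""
        self.attempts += 1
        if self.erase_counts[block] >= self.limit:
            self.rejections += 1
            self._maybe_raise_limit()
            return False
        self.erase_counts[block] += 1
        return True

    def _maybe_raise_limit(self):
        # Raise the soft limit once too many attempts have been rejected,
        # but never beyond max_limit.
        too_many = self.rejections / self.attempts >= self.reject_ratio
        if too_many and self.limit < self.max_limit:
            self.limit = min(self.limit + self.step, self.max_limit)
            self.attempts = 0
            self.rejections = 0
```

Because heavily used blocks get refused while the soft limit is low, writes are pushed toward less-worn blocks first, which should give a crude form of leveling rather than just a hard cap.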
The more pressing problem is that most consumer flash products have an FTL that cannot be bypassed, and this will continue until Windows and consumer electronics devices have filesystems like LogFS that support raw flash. For flash vendors, selling raw flash is a liability because most consumers won't know if their OS or device supports it or not.
It's no surprise that the primary target hardware for LogFS is the OLPC XO, since they're not constrained by legacy technology. In fact, until recently, the lead developer of LogFS had never run it on an actual raw flash medium, developing only on a simulated medium. He first attempted it on an XO at a conference where he was presenting. He warned that he had no idea if it would actually work on real flash. It failed miserably, producing a light-hearted chuckle from the audience of other Linux developers, but I believe he fixed the bugs soon afterwards.
Edited 2007-05-19 00:48







After that, a relatively large piece of flash needs to be erased. The size of these erase blocks differs; it is usually between 16 and 128 KB.
After this erase, all data is gone and cannot be recovered. So a flash filesystem has to make sure that no important data is in the area before it gets nuked. If there is, the filesystem has to move it elsewhere first.
Moving data elsewhere means there is no fixed relationship between the physical location of data on the device and the logical location of data in terms of file and file offset. Ext2/ext3 and most of the other disk filesystems depend on such a fixed relationship, so they don't work as flash filesystems.
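A toy Python model of the point above: before an erase block is wiped, anything still live in it is copied elsewhere and a logical-to-physical map is updated. The class `TinyFlash`, the four-page block size, and the method names are invented for illustration and don't correspond to any real flash filesystem.

```python
PAGES_PER_BLOCK = 4  # assumed, tiny on purpose


class TinyFlash:
    """Hypothetical flash model: pages are written once, blocks are erased whole."""

    def __init__(self, num_blocks):
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.location = {}  # logical id -> (block, page) of the current copy

    def _free_page(self, exclude):
        # Find an unwritten page outside the excluded block.
        for b, pages in enumerate(self.blocks):
            if b == exclude:
                continue
            for p, content in enumerate(pages):
                if content is None:
                    return b, p
        raise RuntimeError("no free page available")

    def write(self, logical_id, data, exclude=None):
        # Updates never overwrite in place; they go to a fresh page.
        b, p = self._free_page(exclude)
        self.blocks[b][p] = (logical_id, data)
        self.location[logical_id] = (b, p)

    def is_live(self, b, p):
        # A page is live only if the map still points at it.
        content = self.blocks[b][p]
        return content is not None and self.location.get(content[0]) == (b, p)

    def erase(self, victim):
        # Move live data out of the victim block, then wipe the whole block.
        for p in range(PAGES_PER_BLOCK):
            if self.is_live(victim, p):
                logical_id, data = self.blocks[victim][p]
                self.write(logical_id, data, exclude=victim)
        self.blocks[victim] = [None] * PAGES_PER_BLOCK

    def read(self, logical_id):
        b, p = self.location[logical_id]
        return self.blocks[b][p][1]
```

After an `erase`, reads still return the latest data, but the physical location of every surviving item has changed, which is exactly the fixed relationship that ext2/ext3 rely on and flash takes away.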