Linux Not Fully Prepared for 4096-Byte Sector Hard Drives

Recently, I bought a pair of those new Western Digital Caviar Green drives. These new drives represent a transitional point from 512-byte sectors to 4096-byte sectors. A number of articles have been published recently about this, explaining the benefits and some of the challenges that we’ll be facing during this transition. Reportedly, Linux should be unaffected by some of the pitfalls of this transition, but my own experimentation has shown that Linux is just as vulnerable to the potential performance impact as Windows XP. Although this issue has been known about for a long time, basic Linux tools for partitioning and formatting drives have not caught up.

The problem most likely to hit you with one of these drives is very slow write performance. This is caused by improper logical-to-physical sector alignment. Operating systems like Linux use 4K blocks (or multiples of 4K) to store data, which matches well with the physical sector. However, the drive still presents 512-byte logical sectors, eight to a physical sector, and nothing restricts you from creating a partition that starts on a logical sector that is not a multiple of 8. This misalignment causes a performance hit, since every 4K block then straddles two physical sectors and the drive has to read both, merge in whatever 512-byte slices changed, and write them back.
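
The alignment rule comes down to a little arithmetic. The sketch below assumes 512-byte logical sectors and 4096-byte physical sectors, as on the drives described here. The function names are made up for illustration and are not part of any tool.

```c
#include <stdint.h>

#define LOGICAL_SECTOR    512u   /* bytes per logical (LBA) sector */
#define PHYSICAL_SECTOR   4096u  /* bytes per physical sector */
#define LBAS_PER_PHYSICAL (PHYSICAL_SECTOR / LOGICAL_SECTOR)  /* 8 */

/* Returns 1 if the given LBA falls on a physical sector boundary, else 0. */
int lba_is_aligned(uint64_t lba)
{
    return lba % LBAS_PER_PHYSICAL == 0;
}

/* Number of physical sectors overlapped by a write of `len` bytes that
   begins at logical sector `lba`. Assumes len > 0. */
uint64_t physical_sectors_touched(uint64_t lba, uint64_t len)
{
    uint64_t first = lba * LOGICAL_SECTOR;  /* first byte written */
    uint64_t last  = first + len - 1;       /* last byte written */

    return last / PHYSICAL_SECTOR - first / PHYSICAL_SECTOR + 1;
}
```

With these definitions, a 4096-byte write starting at LBA 64 touches one physical sector, while the same write starting at LBA 63 touches two, each of them only partially.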

WD claims to have done some studies and found that Windows XP was hardest hit. By default, the first primary partition starts on LBA block 63, which obviously is not a multiple of 8. They provide a utility to shift partitions by 512 bytes to line them up. WD also tested other operating systems and declared both Mac OS X and Linux to be “unaffected”. I don’t know about Mac OS X, but with regard to Linux, they are not entirely correct. The following are the results of my experimentation.
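
Shifting a partition start from 63 to 64 is just rounding up to the next multiple of 8. Here is a minimal sketch of that calculation, again assuming eight 512-byte logical sectors per physical sector. The helper name is hypothetical, and this is not a description of how WD's utility works internally.

```c
#include <stdint.h>

/* Round a starting LBA up to the next multiple of 8, i.e. to the next
   4096-byte physical sector boundary. An already aligned LBA is returned
   unchanged. */
uint64_t align_lba_up(uint64_t lba)
{
    const uint64_t lbas_per_physical = 8;

    return (lba + lbas_per_physical - 1) / lbas_per_physical
           * lbas_per_physical;
}
```

For the Windows XP default, align_lba_up(63) gives 64, which corresponds to the 512-byte shift mentioned above.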

The first thing I did was test the performance effect itself. It has been suggested that WD might internally offset block addresses by 1 so that LBA 63 maps to LBA 64. This way, Windows XP partitions would not really be misaligned. I performed a test that demonstrates that WD has not done this. I’ve included the source code to my test at the end of the article. This program does random 4K block writes to the drive at a selectable 512-byte alignment. So if I pass 0 to the program, it runs the test on 4K boundaries. If I pass 1, the test is on 4K boundaries plus 512. The effects of this test are amplified by the use of O_SYNC, which insists that all writes hit the disk immediately, but it demonstrates the problem. Note that I realize that all my testing is “quick and dirty,” but I’m just trying to demonstrate a point, not analyze it in painful detail.

1000 random aligned 4K writes consistently take between 7 and 8 seconds.
1000 random unaligned 4K writes consistently take between 22 and 24 seconds.

Now, this just demonstrates the problem we already know about. What about how it affects filesystems? Next, on to formatting the drives.

I have two drives, /dev/sdc and /dev/sdd, both identical Green drives. I partitioned them as follows:

For /dev/sdd, I used fdisk to add a Linux (0x83) primary partition, taking up the whole disk, using fdisk defaults. By default, the partition starts at LBA 63.

For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode (“x”), then setting the start sector (“b”) to 64.
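
To double-check where a partition actually starts, the start sector can be read back from the kernel. The sketch below assumes that sysfs exposes it as a text file such as /sys/block/sdc/sdc1/start, counted in 512-byte units. That path and both function names are assumptions made for illustration, so verify them on your own system.

```c
#include <stdio.h>

/* Returns 1 if a start sector (in 512-byte units) lies on a 4096-byte
   boundary, else 0. */
int start_is_4k_aligned(unsigned long long start)
{
    return start % 8 == 0;
}

/* Reads a start sector from a sysfs-style text file and checks it.
   Returns 1 if aligned, 0 if misaligned, -1 if the file cannot be
   opened or does not contain a number. */
int partition_is_4k_aligned(const char *path)
{
    unsigned long long start;
    FILE *f = fopen(path, "r");
    int parsed;

    if (f == NULL)
        return -1;
    parsed = (fscanf(f, "%llu", &start) == 1);
    fclose(f);

    return parsed ? start_is_4k_aligned(start) : -1;
}
```

Called with the path for the realigned partition, this should report 1 for a start sector of 64 and 0 for the fdisk default of 63.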

Once that was finished, I formatted both drives using the command “time mke2fs /dev/sdc1” (and sdd1).

/dev/sdc, which was aligned, took 5m 45.716s to format.
/dev/sdd, which was not aligned, took 19m 53.609s to format.

That’s a difference of greater than a factor of three!

Now to file tests. I ran two tests. The first test was to copy one large file. I have a Windows XP disk image for qemu-kvm that takes up 18308968KiB. I copied the file (from my much faster 7200 RPM drives in RAID1 configuration) to one drive, then the other, then I reran the first test to avoid buffering effects.

$ time cp winxp.img /mnt/sdc   # ALIGNED
real    5m9.360s
user    0m0.090s
sys     0m20.420s

$ time cp winxp.img /mnt/sdd   # UNALIGNED
real    13m26.943s
user    0m0.110s
sys     0m19.350s

Pretty striking difference. I didn’t really expect this. Since this is one large file, and it can be written linearly to the disk, I expected that we would see a very slight performance hit. I think this is something that itself should be investigated. There’s no reason for long contiguous writes to get hit this hard, and it’s something that the kernel developers need to look into and fix. To complete the testing, I next tried random writes. I have some stuff I’ve been working on for school, lots of small files of all sorts of different sizes. So I decided to copy that stuff recursively.

$ time cp -r "Computer Architecture/" /mnt/sdc   # ALIGNED
real    42m9.602s
user    0m0.680s
sys     1m59.070s

$ time cp -r "Computer Architecture/" /mnt/sdd   # UNALIGNED
real    138m54.610s
user    0m0.660s
sys     2m15.630s

This performance hit of a factor of about 3.3 (about 2.6 for the single large file) is surprisingly consistent across operations. And this is severe. I’ve read people guessing that there would be a 30% performance loss. But taking 230% longer is exceptionally bad.
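
The factors quoted here follow directly from the timings. As a small sketch of the arithmetic, with helper names that are purely illustrative:

```c
/* Convert a `time` reading given as minutes and seconds into seconds. */
double to_seconds(double minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* How many times longer the unaligned run took than the aligned run. */
double slowdown(double aligned_s, double unaligned_s)
{
    return unaligned_s / aligned_s;
}

/* Extra time of the unaligned run, as a percentage of the aligned run. */
double extra_time_percent(double aligned_s, double unaligned_s)
{
    return (unaligned_s - aligned_s) / aligned_s * 100.0;
}
```

Feeding in the "real" times reported above gives a slowdown of roughly 3.45 for mke2fs, 2.61 for the large file copy, and 3.29 for the recursive copy. The last of these is about 229% extra time.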

In conclusion, these drives are on the market now. We’ve known about this issue for a LONG time, and now it’s here, and we haven’t fully prepared. Some distros, like Ubuntu, use “parted”, which has a very nice “--align optimal” option that will do the right thing. But parted is incomplete, and we must rely on tools like fdisk for everything else. But anyone manually formatting drives based on popular how-tos that pop up at the top of Google searches is going to cause themselves a major performance hit, because mention of this alignment issue and how to fix it is conspicuously absent.

I’ve done a lot of googling on this topic, and as far as I can tell, this issue has really not been taken seriously. There’s plenty of discussion on aligning partitions for SSDs and VMware volumes, but nothing about the issue relating to these new hard drives. And no fix for fdisk. Most of the drives still being sold today have 512-byte sectors, so lots of people will say “not my problem”, but it will become your problem soon since all the hard disk manufacturers have been very eager to make the switch. This time next year you may have trouble buying a drive without 4K sectors, and you’re going to want all your Linux distros to handle them properly.

Evaluation setup and methodology:

  • Gentoo ~amd64 system with 2.6.31-gentoo-r5 kernel
  • fdisk version: fdisk (util-linux-ng 2.17)
  • The drives are identical, but I did not try swapping configurations to make sure that one drive isn’t fundamentally slower than the other.
  • Core 2 Quad at 2.33GHz (Q9450), 8GiB of RAM
  • MSI X48 Platinum motherboard — Intel X48 Express + ICH9R

Related articles:

http://lwn.net/Articles/322777/
http://hardware.slashdot.org/article.pl?sid=06/03/24/0619231
http://bugs.gentoo.org/show_bug.cgi?id=304727

About the author

Timothy Miller is a Ph.D. student at The Ohio State University, specializing in Computer Architecture and Artificial Intelligence. Prior to going back to school, he worked professionally as a chip designer. Tim is also the founder of the Open Graphics Project.

Random block write code:

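#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <fcntl.h>

/* WARNING: this writes zero-filled blocks straight to the raw device and
   destroys whatever data is stored on it. */
#define DEVICE "/dev/sdc"

char buffer[4096];

int main(int argc, char *argv[])
{
    int fd, i, off;
    off64_t bk, byte;

    /* Optional argument: alignment offset in 512-byte sectors
       (0 = on 4K boundaries, 1 = 4K boundaries plus 512 bytes, ...). */
    if (argc < 2) {
        off = 0;
    } else {
        off = atoi(argv[1]);
    }

    /* Seed with the offset so each alignment gets a repeatable sequence. */
    srandom(off);

    /* O_SYNC makes every write reach the disk before it returns. */
    fd = open(DEVICE, O_RDWR | O_SYNC);
    printf("fd=%d\n", fd);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* 1000 writes of 4096 bytes each, at random 4K block positions. */
    for (i = 0; i < 1000; i++) {
        bk = random() % 200000000;
        byte = bk * 4096 + (off64_t)off * 512;
        if (lseek64(fd, byte, SEEK_SET) < 0) {
            perror("lseek64");
            break;
        }
        if (write(fd, buffer, 4096) != 4096) {
            perror("write");
            break;
        }
    }

    close(fd);

    return 0;
}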
