Recently, I bought a pair of those new Western Digital Caviar Green drives. These new drives represent a transitional point from 512-byte sectors to 4096-byte sectors. A number of articles have been published recently about this, explaining the benefits and some of the challenges that we'll be facing during this transition. Reportedly, Linux should be unaffected by some of the pitfalls of this transition, but my own experimentation has shown that Linux is just as vulnerable to the potential performance impact as Windows XP. Despite this issue having been known for a long time, basic Linux tools for partitioning and formatting drives have not caught up.
Thanks for the excellently prepared article, I didn’t have to do anything to it.
I really didn’t know about this issue, thanks for investigating and presenting it so clearly. Does this affect disks of a certain size (and above), or any size that are manufactured to this setting?—I certainly want to avoid this problem when replacing HDDs for WinXP machines.
Indeed an excellent article!
I think it would happen on all 4K sector drives regardless of size. If you started at LBA 63, the drive would be forced to update two 4K sectors for every 4K write, because each virtual (filesystem) block would lie across two physical sectors.
P=======P=======P=======
=V=======V=======V======
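A rough sketch of the arithmetic behind the diagram, assuming 512-byte logical sectors emulated on 4096-byte physical sectors and a filesystem that writes 4 KiB blocks. The function name is made up for illustration.

```python
LOGICAL = 512     # bytes per emulated (logical) sector
PHYSICAL = 4096   # bytes per real (physical) sector


def physical_sectors_touched(start_lba, length_bytes=4096):
    """Count the physical sectors covered by a write that begins at start_lba."""
    first_byte = start_lba * LOGICAL
    last_byte = first_byte + length_bytes - 1
    return last_byte // PHYSICAL - first_byte // PHYSICAL + 1


for lba in (63, 64):
    # A partition starting at LBA 63 makes every 4 KiB block straddle two
    # physical sectors; starting at LBA 64 keeps each block inside one.
    print(lba, physical_sectors_touched(lba))
```

With these assumptions a block at LBA 63 touches two physical sectors and a block at LBA 64 touches one, which is why the drive has to read, modify and rewrite two sectors in the misaligned case.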
Edited 2010-02-14 12:20 UTC
Hm, how are LBA addresses specified exactly?
I thought it was specified as multiples of drive sector size, in which case addressing via LBA should always be aligned - but is it in reality always a multiple of 512?
Is there a difference between how LBAs in the MBR are interpreted, and how ATAPI interprets LBAs?
As if gparted (which is just a pretty GUI + parted) can't screw things up too. fdisk and the like are for people who know what they are doing.
In case you don't know there are distros that have text mode installs and use utilities like fdisk/cfdisk. Just because you can use gparted it doesn't mean you have to.
It's not so much Linux, it's the DOS-compatible partition that fdisk creates.
If you don't need DOS compatibility, you wouldn't have a problem.
It's a DOS/Windows-compatibility thing you are trying to attribute to Linux. DOS/Windows has a problem, Linux just tries to be compatible.
As you said, parted does it just fine.
Edited 2010-02-14 11:09 UTC
The fdisk man page (on my machine at least) contains a lengthy warning about how fdisk will quite happily create some pretty dodgy partition layouts and it recommends parted for doing anything even remotely unusual. I guess this falls into that category.
All the major noob-friendly distros use gparted for doing the partition editing, don't they? Will that protect users from these kinds of problem, then?
I do consider Mandriva to be pretty newbie-friendly all in all; it's clear, consistent, and provides an extensive selection of documentation and loads of online help if needed. Also, it's really stable and has excellent control center utility.
But alas, Mandriva doesn't actually use gparted. They use some sort of a tool of their own which apparently uses libparted as its backend. As far as I know quite a few distros actually do it that way. But as the article states, you seemingly have to use "--align optimal" option which does the right thing. It doesn't automatically align the partitions properly without that. And I have no idea if those custom partitioning tools employed by various distros pass such an option to libparted. If they don't then that'll be a very important issue to fix immediately.
I'd actually prefer if distros rolled out an update of some sort which will check the currently installed system and its partitioning scheme and warn if they are misaligned and would provide a way of fixing it; not everyone re-installs their system all the time and as such could be using misaligned partitions for years before next re-install.
gparted gets it right:
"When enabled, Round to cylinders aligns partition boundaries on the cylinder boundaries for the disk device. Enabled is the default setting."
The Ubuntu installer uses Partman from Debian, which uses parted in the background.
So that's a start.
But if parted will do it right when run from partman, I don't know yet.
Edited 2010-02-14 13:19 UTC
Yeah, Mandriva would certainly fall within the category of distros which should sort all of this stuff out automatically without the user having to worry. Distros like Arch, Gentoo & Slackware all generally expect their users to be aware of the technical issues. But for the mainstream distros this does need to be fixed.
Maybe we could take a look at our respective distros and file a bug report if there could be an issue.
Mandriva was the first distro to make partitioning the hard drive mortal friendly.
Mandrake Linux 7.0 was when they first released it and I recall thinking "Holy crap, this thing needs to be sold separately"
I went so far as to use the install disk up to the partition step to repartition my hard drives for a while.
So, since you're using Linux, wouldn't it behoove you to use the GPT (GUID Partition Table) scheme which handles, by design, the new block size?
ref: http://en.wikipedia.org/wiki/GUID_Partition_Table
Well apart from anything, fdisk doesn't support GPT. Parted does, but as he said, you can get parted to automatically solve the problem anyway.
The problem as TFA describes it is that a bunch of the tools & tutorials out there today will give you bad results if you use them with the new block size. They will probably give you bad results if you use them with GPT, too.
Edited 2010-02-14 17:57 UTC
Hi,
No.
Very old disk drives used "CHS" (Cylinders, Heads, Sectors) instead of LBA. Due to limitations this didn't work for drives larger than about 500 MiB, so the industry shifted to LBA; and created a CHS->LBA translation scheme.
Due to BIOS limits, this CHS->LBA translation scheme usually uses "63 sectors per track", which is the highest number of sectors per track that the old BIOS disk interface can handle.
For performance reasons OSs make partitions that start/end on track boundaries (having a few sectors at the start or end of a partition that are on a track by themselves causes more disk head movement).
Basically what I'm saying is that the problem wasn't caused by *any* OS. The problem is caused by 30 years of backward compatibility (and the lack of foresight, from BIOS, disk and OS designers).
The ironic part is that the original IBM design supported floppy disks and hard drives with different sector sizes. It's unfortunate that this aspect of the original design was lost, and unfortunate that these new hard drives need to emulate 512-byte sectors to begin with.
-Brendan
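To make the "63 sectors per track" point in the comment above concrete, here is a small sketch of the usual CHS-to-LBA conversion. The 255-heads figure is an assumption (a commonly used translated geometry), not something stated in the thread.

```python
def chs_to_lba(cylinder, head, sector, heads_per_cylinder=255, sectors_per_track=63):
    """Standard CHS-to-LBA conversion; CHS sector numbers start at 1."""
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


# A DOS-style first partition traditionally starts on the second track:
# cylinder 0, head 1, sector 1.
first_partition = chs_to_lba(0, 1, 1)
print(first_partition, first_partition % 8)
```

Under these assumptions the first partition lands on LBA 63, which leaves a remainder of 7 when divided by 8, so it cannot coincide with a 4096-byte physical sector boundary on a drive whose LBA 0 is aligned.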
Thanks for the interesting and informative article! Should be careful when installing my next pair of drives...
Just one minor observation: can you actually have a "230% performance loss?" That sounds as if the performance of the system turned negative... I think it would be clearer to say 230% overhead (in operation time) or 70% performance loss (in average throughput).
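The conversion between the two figures can be checked with a couple of lines, assuming "230% overhead" means the operation takes 3.3 times as long as in the aligned case:

```python
def throughput_loss(time_overhead_pct):
    """Convert extra operation time (percent) into lost throughput (percent)."""
    slowdown = 1 + time_overhead_pct / 100   # 230% overhead -> 3.3x the time
    return (1 - 1 / slowdown) * 100


print(round(throughput_loss(230)))
```

Taking 3.3 times as long means moving data at roughly 30% of the original rate, hence a loss of about 70%.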
There recently was a discussion about this on the util-linux-ng mailing list:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926
The posts on the fdisk list seem to imply that the version of fdisk you are using will do the right thing provided you're using a .32 kernel that can properly report the disk topology. Could you test with that?
Here's the post I'm referring to:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926
Reporting disk topology requires the hardware to also communicate that topology. AFAIK many of these drives do not. Christoph Zimmermann summarized the same observation in the thread you point to:
- it is a must that the partitions are aligned correctly to 4KiB
boundries. else the drive is unusable slow.
- the drive does NOT report its physical sector size. so doing proper
programming is not enough.
It looks like the discussion in that thread is about aligning partitions by default on a sufficiently large granularity to "get by" (as Vista and above do). Note that other technologies (e.g. SSDs, RAID) benefit from large alignment (larger than 512 bytes or 4 KB).
I did battle with this issue just this morning. Using parted, I had to manually configure the partitions so that both their start sectors and their lengths are multiples of eight.
I didn't actually suffer a major read/write performance loss; only some random reads made the drive go crazy and very slow. Write speed topped out at 70 MB/s, and after aligning the partitions properly it has gone up to 80 MB/s.
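For anyone doing the same by hand, the "multiples of eight" rule from the comment above can be sketched as follows. This assumes 512-byte logical sectors on a 4096-byte physical sector drive whose LBA 0 is itself aligned; the helper name is hypothetical.

```python
SECTORS_PER_PHYSICAL = 4096 // 512  # 8 logical sectors per physical sector


def align_partition(start_lba, end_lba):
    """Round the start up and the length down to multiples of 8 sectors.

    end_lba is exclusive. Returns (start, length) in 512-byte sectors.
    """
    start = -(-start_lba // SECTORS_PER_PHYSICAL) * SECTORS_PER_PHYSICAL
    length = (end_lba - start) // SECTORS_PER_PHYSICAL * SECTORS_PER_PHYSICAL
    return start, length


print(align_partition(63, 1003))
```

The resulting start and length are what you would then hand to the partitioning tool in sector units.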
You're both wrong and wrong. It's a hack mode that can be enabled by connecting pins 7-8 on the HDD. But of course it's not enabled by default, because it would totally screw up every other OS.
Edited 2010-02-14 13:25 UTC
mine is WD15EADS.
Currently the only "Advanced Format" disks (as WD calls it) are the Green models ending in "EARS". WD15EADS is of the previous generation and therefore not affected.
The Advanced Format disks are called WD10EARS, WD15EARS and WD20EARS.
I read that every major HDD manufacturer has agreed to using a 512-byte-emulation mode until the end of 2014.
I also read that there is a way for software to ask the disk about its real physical layout but that Western Digital hasn't implemented such feature in its current line of 4k disks. Therefore no software can detect them and take care to stay aligned.
Edited 2010-02-15 14:11 UTC
I know it is not supposed to be an advanced format drive, but running the code at the end of the article does show performance differences with alignments other than 0 and 8. (I ran it on a partition starting at sector 64.)
Also the disk freezes a lot when doing io.
I am sending it back.
Where did you get this info? If you read WD's site:
http://www.wdc.com/en/products/products.asp?driveid=773
Formatted Capacity 2,000,398 MB
Capacity 2 TB
Interface SATA 3 Gb/s
User Sectors Per Drive 3,907,029,168
That's 512 byte sectors, model # WD20EARS
'Parted Magic' uses core programs of GParted and Parted to handle partitioning tasks with ease, while featuring other useful programs (e.g. Partimage, TestDisk, Truecrypt, G4L, SuperGrubDisk, ddrescue, etc...).
If you ever used PartitionMagic with windows, 'Parted Magic' is a superior linux partitioning tool that you can use from a cd, usb or load it from its own directory on the drive.
http://partedmagic.com/
So using cp is about as braindead as rm -rf /* for testing disk io. It's all about the block size that's read/written, which in the case of cp is 1 character at a time. Something like dd or tar would provide a better metric for streaming writes. tar -cpf - some_path/ | tar -xpf - -C /path/to/final/destination
Or you can use dd which allows you to slice and dice and adjust block sizes trivially, then you can write to a raw block device and see what it can do sans filesystem crap.
An interesting test would use variable block sizes of 512, 768, 1024, 2048, 4096, 8192, 16384, which will show an odd output block size at 768, and the performance of 1 and 2 bit shifts above and below the new block size. Just to show how brain dead a block size of 1 is, I am throwing that in here too.
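# WARNING: this writes straight to /dev/sdc and destroys whatever is on it.
# Note that dd counts seek= in units of bs, not in 512-byte sectors.
for BS in 1 512 768 1024 2048 4096 8192 16384 ; do
  for SKIP in 0 1 2 4 8; do
    dd if=/dev/zero of=/dev/sdc bs=${BS} seek=${SKIP} count=1024k
  done
done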
it's the duty of the blockdevice-driver and/or the filesystem to collect several such manipulations before writing them to the disk
Well, no. cp doesn't do 1 char at a time, it tries to minimize io. It might not be clear from the code though. Even stdio keeps an internal io buffer you can't normally see.
A quick truss/strace on recent FreeBSD/CentOS/Solaris shows 64k buffers/4k buffers/mmap the whole darned file approaches.
I just got a WD15EARS (1.5 TB, SATA 3 Gb/s, 64 MB Cache) a couple of days ago. I formatted it as ext3 and have filled it ~90% at ~20 MB/sec. Is there any way (please-oh-please-oh-please) that I can fix this in-place, i.e. without having to reformat? It must be technically possible.
For /dev/sdd, I used fdisk to add a Linux (0x83) primary partition, taking up the whole disk, using fdisk defaults. By default, the partition starts at LBA 63.
For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode ("x"), then setting the start sector ("b") to 64.
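To double-check what fdisk actually wrote, one option is to read the start sectors straight out of the MBR. The following is a minimal sketch, assuming a classic DOS partition table with 512-byte logical sectors and a drive whose LBA 0 is aligned; the function names are invented for this example, and the device path is just the one mentioned above.

```python
import struct


def mbr_partitions(mbr):
    """Yield (index, type, start_lba, sector_count) for used primary entries."""
    if len(mbr) < 512 or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    for index in range(4):
        # The four 16-byte partition entries start at byte offset 446.
        entry = mbr[446 + 16 * index: 446 + 16 * (index + 1)]
        ptype = entry[4]
        # Start LBA and sector count are little-endian 32-bit values.
        start_lba, count = struct.unpack_from("<II", entry, 8)
        if ptype != 0:
            yield index + 1, ptype, start_lba, count


def report(device="/dev/sdd"):
    """Print each primary partition's start LBA and whether it is 4 KiB aligned."""
    with open(device, "rb") as disk:
        mbr = disk.read(512)
    for index, ptype, start, count in mbr_partitions(mbr):
        state = "aligned" if start % 8 == 0 else "MISALIGNED"
        print(f"partition {index}: type 0x{ptype:02x}, start LBA {start}, {state}")
```

Calling `report("/dev/sdd")` needs read access to the raw device (typically root). It only reads the first sector and does not modify anything.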
http://www.gnu.org/software/parted/faq.shtml
Starting from 1.7, GNU Parted will automatically align partitions to the physical sector size reported by an ATAPI-compliant drive.
Surely you should be using parted to partition drives, and not fdisk?
I believe the whole point is to emulate the 512b sectors so that legacy OS's will work. The drives can't query the OS and then modify how they report themselves depending on what is supported. They do provide a jumper so you can manually turn the legacy emulation on or off, but most people aren't going to mess with that.
Yes it can be dealt with, yet at the same time it seems as though these drives should be able to report the correct geometry when queried properly. That would mean the partitioning tools would be aware of it from the start rather than having to manually deal with the problem. Most people I know, even ones with good technical knowledge, wouldn't have known how to handle this one as they don't delve that deep into drive partitioning. For the sake of avoiding trouble whenever possible the drive should report the geometry properly when queried by an os that knows how to ask for the *real* geometry and not that ridiculous LBA compatibility hack we've had to live with for so long thanks to bios and Windows.
First of all it's important to distinguish between logical block size which is used when sending commands to a device and the physical block size which is used by the device internally.
Linux has supported (SCSI) drives that present 4KB logical block sizes for a long time. For compatibility with legacy OS'es, however, consumer grade ATA drives with 4KB physical blocks continue to present a 512-byte logical block interface. The knob indicating that the drive has 4KB physical blocks is orthogonal to the logical block size reporting, allowing the information to be communicated without interfering with legacy OS'es like XP that only know about 512-byte sectors.
We have worked closely with disk manufacturers for a long time to make sure we were ready. Western Digital have been instrumental in the ATA specification in terms of the alignment and physical block size parameters. The engineering sample drives I have received from WDC have all implemented the physical block size knobs correctly. Which makes it even more baffling that they end up shipping an advanced format drive that gets it wrong. I have no idea why they did that. The location of the block size information in IDENTIFY DEVICE is unlikely to be inspected by legacy systems, so I highly doubt it's a compatibility thing. Brown paper bag time for Western Digital...
It is true that the effects of this particular drive reporting incorrect information could have been mitigated by a 1MB default alignment. However, that would still have caused misalignment for other drives that come wired with 1-alignment to compensate for the legacy DOS sector 63 offset. So blindly aligning to 1MB won't cut it. Windows Vista/7 don't do that either. Like Linux, they compensate based upon what the drive reports.
Linux 2.6.31 and beyond will report device alignment and physical block size for all block devices. It is then up to the userland partitioning utilities etc. to adjust start offsets accordingly. You'll find that both parted and util-linux-ng have been updated to do this. And that modern fdisk will in fact align on a 1MB (+/- drive alignment) boundary by default.
Caveat being that Fedora is the only community distribution I know of that's using the updated bits. I don't think all of them made it into Fedora 12 but I'm sure Fedora 13 will do the right thing.
So I encourage you to work with your distribution vendor to ensure they start shipping recent partition tooling.
Martin K. Petersen
Kernel Developer, Oracle Linux Engineering
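As an illustration of the topology reporting described in the comment above, here is a sketch that reads the values the kernel exposes under sysfs and checks a start sector against them. The attribute names (`queue/logical_block_size`, `queue/physical_block_size`, `alignment_offset`) are the kernel's; the function names and the `sysfs_root` parameter are assumptions made for this example. It obviously cannot help with a drive that reports 512-byte physical blocks when they are really 4096 bytes.

```python
from pathlib import Path


def read_topology(device, sysfs_root="/sys/block"):
    """Read block sizes and alignment offset (all in bytes) for e.g. 'sdc'."""
    base = Path(sysfs_root) / device

    def read_int(relative):
        return int((base / relative).read_text().strip())

    return {
        "logical_block_size": read_int("queue/logical_block_size"),
        "physical_block_size": read_int("queue/physical_block_size"),
        "alignment_offset": read_int("alignment_offset"),
    }


def is_aligned(start_lba, topology):
    """True if start_lba falls on a physical block boundary of the device."""
    start_bytes = start_lba * topology["logical_block_size"]
    shifted = start_bytes - topology["alignment_offset"]
    return shifted % topology["physical_block_size"] == 0
```

For example, `is_aligned(63, read_topology("sdc"))` would be false on a drive that honestly reports 4096-byte physical blocks with no alignment offset, and true on a "1-aligned" drive that reports an alignment offset of 3584 bytes.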
[code]
#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>   /* lseek64, write, close */
#define LIMIT 1000
char buffer[4096];
int main(int argc, char *argv[]) {
    int fd, i, off;
    long bk[LIMIT];
    off64_t byte;
    if (argc < 2) {
        off = 0;
    } else {
        off = atoi(argv[1]);
    }
    srandom(off);
    /* fill array of randoms */
    for (i = 0; i < LIMIT; i++) {
        bk[i] = random() % 2000000000;
    }
    bk[0] = 0; /* goto begin */
    /* stride in bytes: 4096 plus off 512-byte sectors, so it is a
       multiple of 4096 only when off is a multiple of 8 */
    off *= 512;  /* mul */
    off += 4096; /* add */
    /* WARNING: overwrites data on the device */
    fd = open("/dev/sds", O_RDWR | O_SYNC);
    printf("fd = %d\n", fd);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < LIMIT; i++) {
        byte = (off64_t)bk[i] * off;
        lseek64(fd, byte, SEEK_SET);
        write(fd, buffer, 4096);
    }
    close(fd);
    return 0;
}
[/code]
Edited 2010-02-15 22:07 UTC
Wouldn't simply setting proper sectors/tracks option do the job? Like it's described here: http://www.ocztechnologyforum.com/forum/showthread.php?48309-Partit...
Hi!
In the util-linux-ng mailing thread that some commenters have already mentioned here (thread started by myself, BTW), I did a test similar to yours, only fully automated and using a ready-made benchmark named PostMark which is quite well suited for exposing this particular performance problem:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926/fo...
This benchmark script is able to automatically expose the optimal partition offset that offers the best performance.
In the same thread, you can read that the util-linux-ng guys have already committed a fix for this issue in fdisk:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926/fo...
What's left to be fixed now is parted:
http://parted.alioth.debian.org/cgi-bin/trac.cgi/ticket/251




