“Although having this reputation, my question was: was a RAID 1 system too slow? Was it slower than not having any RAID? In fact, there are people that say that it is the other way round – these people say that having software RAID 1 is faster than having just one hard disk drive. In any case, who is right?”
Could he have first used two drives from the same manufacturer? His comment “It may be strange having two different hard disks in a RAID but it is more common than having the same model.” might be applicable to a home user, but every array I have worked on uses the same model and manufacturer.
Why didn’t he experiment with different stripe sizes, and what was the default stripe size? Also, why didn’t he use iozone? It would have been interesting to see the I/O differences graphically between the single-disk, RAID 0 and RAID 1 tests.
“It may be strange having two different hard disks in a RAID but it is more common than having the same model.” might be applicable to a home user, but every array I have worked on uses the same model and manufacturer.
Well, actually, having disks of different brands or from different production batches in an array is better than having all the disks from the same brand and batch.
The more similar the disks are, the higher the probability that they fail within the same time frame, leaving you no time to replace them!
What you do in this case is buy 3 from the same manufacturer – run 2 of them for a few months, and replace one with the 3rd…
Then you have 2 with different life expectancies in the machine, and a 3rd as backup when one of the other 2 dies.
Time in service is but one of many variables which affect MTBF on hard disks. Further, this is pretty unrealistic for real-world scenarios; admins have better things to do with their time than replace hard drives to satisfy statistical anomalies.
The more similar the disks are, the higher the probability that they fail within the same time frame, leaving you no time to replace them!
Err… The risk that the two drives will fail within the same day is as good as non-existent. Using similar drives from different manufacturers doesn’t give you any measurable additional protection from multiple disk failures.
Err… The risk that the two drives will fail within the same day is as good as non-existent.
Evidence says different.
Evidence says different.
Just to clarify, my statement applies to two drives in the same array, where it is the drives themselves that fail, not failures due to controller faults or other things that can happen.
RAID 1 with two drives can be modelled as parallel redundancy, and if they are of the same model, etc. the extra drive means 50% higher MTBF for the RAID 1 array.
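(A minimal reliability-theory sketch of where that 50% comes from, assuming both drives have the same constant failure rate λ and ignoring repair: the mirror only loses data once both drives have died. The expected time until the first of the two fails is 1/(2λ), and the surviving drive then lasts an expected 1/λ more, so MTTF(RAID 1) = 1/(2λ) + 1/λ = 1.5/λ, i.e. 50% higher than a single drive. With prompt replacement after the first failure the array does far better than this no-repair model suggests.)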
Units like the infamous Deathstars obviously cannot be modelled like this, since the data that is available for those units is flawed.
Please state the evidence that you talk about, and clarify exactly what you mean.
Please state the evidence that you talk about, and clarify exactly what you mean.
If a hard disk crashes long before its estimated MTTF, it is often due to a defect in the production chain.
If that is the case, then all the disks produced in that batch will probably suffer from the same failure cause, and will therefore have a high probability of crashing within the same time frame.
This is not a “written rule”, but it’s statistically proven.
So, having redundancy across drives that may crash at the same time would make the redundancy useless!
If a hard disk crashes long before its estimated MTTF, it is often due to a defect in the production chain
So? All it means is that your particular drive’s TIME TO FAILURE is *less* than the MEAN …
MTTFs are not meant to be applied to individual components, only as a statistical measure applied to large numbers of them …
So? All it means is that your particular drive’s TIME TO FAILURE is *less* than the MEAN …
So? You are completely missing the point here.
MTTFs are not meant to be applied to individual components, only as a statistical measure applied to large numbers of them ..
You are still missing the point and answering things not related to what I was trying to say.
Good luck with your disk arrays!
You are still missing the point and answering things not related to what I was trying to say.
Just because you have a disk that crashes early doesn’t necessarily mean a defect; ever heard of the bathtub curve? It can be bad treatment of the disk, or just bad luck.
If you have several bad disks from the same batch, I’d be inclined to agree with you, and in that case MTTFs are out of the window.
Good luck with your disk arrays
Our own disks are in good shape, thanks, as are those of our clients, one of whom has arrays totalling over 350 spindles; the oldest I installed 8 years ago, and we have never had data loss due to a failed disk. But there have been “near misses” where disks in different arrays have failed simultaneously; after all, it’s only chance which disks end up in which arrays.
Have you ever actually been in this situation, or are you just trying to be contrary?
My production environment has just shy of 10TB of disks deployed, and in every array we’re using identical drives. For one, we’ve had very few drive failures; the MTBF on modern drives is very high and generally quite accurate.
Many hardware RAID controllers, which represent the normal configuration for the vast majority of real world raid deployments, require virtually identical or totally identical drives to be able to mirror or stripe them. This is especially true of controllers which target the enterprise arena, which is also where the vast majority of raid deployments are done.
Check link below for recent evidence:
http://blog.fastmail.fm/?p=521
Check link below for recent evidence
That is one isolated case that doesn’t mention some things like what kind of drive failures they were talking about. It may very well have been failures that were caused by a faulty controller in combination with running the rebuild at the same time as the server was using it.
It does not, however, amount to much evidence that two “identical” drives are worse than two from different batches or manufacturers.
This case may have just been very bad luck, but the risk that it would happen was still pretty much non-existent.
Also, redundancy should be done on several levels, like using several RAID controllers, or redundant servers.
Redundancy on drive level doesn’t help much if it’s a controller failure.
It’s like the bad cases where people or companies have used redundant PSUs but connected them to the same power outlet.
Err… The risk that the two drives will fail within the same day is as good as non-existent.
Evidence says different.
Evidence is correct! Though disks can “just” fail, they can also be provoked into failure; e.g. if your aircon has a problem and you start to get heat buildup, you can easily have several disks fail at once 🙁
Marcellus wrote:
Err… The risk that the two drives will fail within the same day is as good as non-existent. Using similar drives from different manufacturers doesn’t give you any measurable additional protection from multiple disk failures.
I had that happen…
On an HP StorageWorks EVA 5000.
In the same RSS group (RAID5). The second drive failed before the hotspare had fully replaced the first failed one.
Lost 24 TB of data. Gone. Took a week to restore.
These were the top-end FC-SCSI disks, best available for any money.
So even if the probability is low, it isn’t zero.
Speed is not the only concern when creating a RAID array. Hardware RAID is always preferred in a server environment: not only does it take a lot of the IO processing off the main CPU, it most often offers high-speed memory for caching. It’s also transparent to the OS, which in my opinion is the best part. The only times I have used software RAID in a serious server environment was to logically group LUNs from large, cache-centric arrays, like EMC 9GB disk targets, since that type of disk is already hardware redundant.
There are valid reasons to want raid to be hidden from the OS, especially if the OS is Windows and is horrible at handling raid. With Linux I’m almost happier to remove the hardware raid controller from the picture because I’m much more comfortable with the software being able to gracefully handle doing its job. Also, while having RAM in a raid controller may speed things up, odds are that if it does, having more system RAM would speed things up a lot more, and you could buy more of that for the same amount of money.
There are lots of tradeoffs with storage but one thing continues to be clear. The medium sized, easy, small, cheap storage devices will continue to move into the capacity and speed ranges typically occupied only by high-end expensive hardware not just because speeds and capacities just naturally increase over time but also because the complicated task of grouping multiple disks together gets easier, more robust, and more common every day. As time goes on it will become harder to justify the costs of the wildly expensive storage solutions that exist in the market today.
I just built myself an 8 drive software raid5 array using Linux, a dual-core CPU (one core can do IO while the other runs programs) in a 3U rack mount case for $2500. It holds 2 terabytes and I can write to the raid5 partition at 120 MB/sec and read at 150 MB/sec. My old 3Ware 6000 series 4-drive hardware array would top out at about 6 MB/s on writes (never tested reads, but they were respectable). Granted, today’s hardware controllers aren’t as slow as the old ones, but in my mind the future lies in well designed software raid solutions. As the features provided by the hardware cards (speed, ease of installation, notification, automatic rebuilding, hot swap) are recreated in the software raid systems, the justification for the expense of the hardware card will go away. The trend towards multi-core CPUs will only accelerate this, since many machines will have a whole lot of CPU cycles left over. No use having a $500 CPU sitting and waiting for your raid controller’s $20 embedded CPU to do something it could do 5 times as fast.
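For anyone who wants to try something similar, a minimal sketch of how such an array can be put together with Linux mdadm; the device names, filesystem and mount point below are illustrative, not the poster’s actual configuration:

  # create an 8-drive software RAID 5 array (device names are examples only)
  mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[b-i]
  # put a filesystem on the array and mount it
  mkfs.ext3 /dev/md0
  mount /dev/md0 /mnt/raid
  # watch the initial build/resync progress
  cat /proc/mdstat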
I’m not saying we’re there yet, though. I seriously thought about using a hardware controller, and I’m still testing the setup and debating it. I just see that down the road, software raid will likely become much more common in server environments than it is today.
“There are valid reasons to want raid to be hidden from the OS, especially if the OS is Windows and is horrible at handling raid.”
That’d be one almost nobody cites, but sure. Nice troll.
“With Linux I’m almost happier to remove the hardware raid controller from the picture because I’m much more comfortable with the software being able to gracefully handle doing its job.”
You must not run a lot of Linux systems. Linux is like every other general purpose OS out there: it crashes, it has bugs, it occasionally does the most braindead thing possible. The reason is that it’s a general purpose OS. Think of it as a toaster/cellphone/vcr/blender/oven. It can’t do any one thing exceptionally well, and often ends up doing many things poorly. Linux isn’t as bad as that, but it’s certainly not bulletproof.
Hardware raid controllers aren’t perfect either, but they tend to be a hell of a lot more reliable because they’re a lot simpler. There isn’t “software” per se, but rather a mixture of firmware and dedicated silicon doing a small, well defined set of tasks. Further, it’s done by companies with dedicated QA environments that can in fact test the vast majority of scenarios, if not all of them. This is why you don’t hear very often about enterprise raid controllers randomly eating a file system. Yes, it happens; no, it doesn’t happen nearly as often as it does on a Linux box.
“The medium sized, easy, small, cheap storage devices will continue to move into the capacity and speed ranges typically occupied only by high-end expensive hardware not just because speeds and capacities just naturally increase over time but also because the complicated task of grouping multiple disks together gets easier, more robust, and more common every day”
Yes, you are in fact seeing the whole market creeping upwards. Yes, formerly enterprise-only technologies are marching into people’s homes. Software RAID isn’t the devil, but it isn’t enterprise grade either. You’re also ignoring a massively important fact about enterprise storage systems: RAID isn’t even really a feature people contemplate anymore. We don’t even “expect” it to be there, since nobody in their right mind would sell such a product without it. We’ve moved on to much more advanced, and much more important, features: things like snapshotting, block-level replication, highly redundant components throughout a system, built-in clustering, built-in support for umpteen specialty data types and configuration situations, and hardware arbitration of write paths to support clustering with shared storage. This is the stuff that matters to enterprise customers, not the difference between 120MB/sec and 130MB/sec.
“I just built myself an 8 drive software raid5 array using Linux, a dual-core CPU (one core can do IO while the other runs programs) in a 3U rack mount case for $2500”
How exactly were you planning on forcing that distribution of work? Or were you actually meaning that with two cores you’re less likely to have CPU contention during heavy IO? Have you done any testing on this, or is it just a theory?
“As time goes on it will become harder to justify the costs of the wildly expensive storage solutions that exist in the market today.”
I genuinely doubt it. Not just because I’ve heard that same thing for almost 10 years now, and it’s remained false the whole time, but because I’ve actually watched the evolution of enterprise storage for almost that long, and watched as high-end feature expansion outpaced price growth by a huge margin. Enterprise storage products are becoming cheaper, not more expensive, and not even staying as expensive. Instead the ROI just keeps getting better.
I’ll give you a personal experience example. Two years ago my company purchased a pair of NetApp F820 filers in a clustered configuration, with just under 2TB of useable storage. Those were almost end-of-lifed products, and factory refurbs to boot. This year we bought a new pair of FAS3050s with an additional 5TB of useable storage, for a couple grand more than the first pair. You might think we got ripped off the first time, but we paid a hair under fair market value for them. Same thing this time. What changed? Storage prices have fallen through the floor, and NetApp keeps driving costs down through economies of scale. I know EMC’s product lines are going through the same cycle, and every other enterprise vendor is mirroring it.
Unless software RAID and linux IO options in general start advancing at an absurd rate, there will remain a market for real enterprise storage technologies for a long, long time.
For what it’s worth, I run software raid on my box at home, because for my home environment it really does have the best cost benefit.
With the cost of these cards, the fact that they are real hardware RAID, and the kernel module being included in distros like CentOS, it’s a no-brainer.
Any form of RAID1 is better protection for your disk data than any form of RAID0. RAID0 only provides a performance benefit when done in hardware, like SCSI, and has a pretty high likelihood of losing all your data. Even if it did come on your motherboard for free 😉
RAID 0 does almost nothing to speed up real-world performance. It only helps when writing large files to disk. Due to high latency it can actually slow down all other disk access.
Now I’m talking real-world tests here, not synthetic benchmarks; those don’t count.
I even decided to setup a RAID 0 array to test it. The results are just as I expected them to be.
RAID0 is commonly used with RAID1: several tens of disks in two striped sets. This gives good performance at relatively low cost; 40 SATA disks are relatively cheap today.
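As a rough sketch of that layout with Linux md (device names and counts are illustrative), you can either nest the levels by hand or use the md raid10 personality:

  # two striped sets, mirrored against each other (RAID 0+1), built by hand
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/md0 /dev/md1
  # or, starting from scratch, the single-level raid10 personality
  mdadm --create /dev/md3 --level=10 --raid-devices=4 /dev/sd[b-e]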
http://en.wikipedia.org/wiki/RAID#RAID_10
“RAID10 is often the primary choice for high-load databases, because the lack of parity to calculate gives it faster write speeds.”
This is something of an oversimplification, because database workloads are not easily categorized. OLTP databases tend to do lots of random IO, whereas reporting databases tend to do lots of linear reads. Both workloads have different optimization strategies at all levels, including filesystem, disk subsystem and so on.
RAID10 is also fairly wasteful of disks, much more so than RAID5. This is a pretty serious consideration in most shops, especially since the performance gap tends to be pretty small with a hardware raid controller and competent admins to do FS and IO tuning.
Multiple different disks in raid will not help boost your performance or reliability. The slowest disk will be the bottleneck. Silly.
As someone who deals with a lot of seasoned administrators, I can tell you that at the enterprise level, software RAID 5 is used frequently. Software RAIDs can auto-expand onto new disks gracefully, while the system is online, with generic controllers. This is especially nice on Linux using EVMS. If your proprietary raid controller dies and the company that made it goes out of business, how do you get your data? With software RAID the data is not tied to the controller.
RAID 0 definitely has performance benefits in software mode. In my own experience with two RAID 0 setups, sequential read and write were almost twice as fast as a single disk from the array. There were only two disks in each RAID 0 setup.
RAID 0 can also give you an advantage in real-world tasks, such as encoding raw AVI to disk. In my own experience I NEEDED a RAID to record raw AVI from TV at certain quality settings, or the sound and video would get all skippy in the recorded version.
RAID 1 isn’t going to give you a performance benefit on write, but it can on read; think about it. You can read from a RAID 1 just as if it were a RAID 0.
RAID 1 isn’t going to give you a performance benefit on write, but it can on read; think about it. You can read from a RAID 1 just as if it were a RAID 0.
It can, but in the case of Linux (which this benchmark was testing) it doesn’t. As of the last time I checked, nobody had written the driver to take advantage of the fact that you could read striped from a mirrored array. This is too bad since it seems like an obvious potential benefit.
This topic actually interests me, but I haven’t been able to reach the server all day.
The article fails to mention the small details about the disks that are visible through s/hdparm, like disk cache, prefetch, multisector transfers and, most importantly, the state of unmaskirq.
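For reference, those settings can be queried (and, with value arguments, changed) via hdparm; /dev/hda is just an example device:

  # show readahead, 32-bit I/O support, multiple sector count, IRQ unmasking and write caching
  hdparm -a -c -m -u -W /dev/hda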
The other thing I consider a big flaw in these benchmarks is that they are filesystem-related benchmarks (take the tar and cp of the Linux kernel as a starting point and ask yourself why there is such a big difference).
I applaud the author for taking the initiative to perform experiments that the rest of us just talk about. These are very interesting conclusions. I never knew RAID1 in software was so slow.
I wish more people would take the effort to find out things for themselves instead of just carping.
Unfortunately his tests only cover Linux; they do not look at Solstice DiskSuite and Solaris Volume Manager (Solaris), Veritas Volume Manager (multi-platform), or the volume managers used in AIX and HP-UX.
So his tests give a general indication of the performance of the volume manager in Debian only on his hardware.
I can’t applaud the author, since he obviously failed to understand what he was actually benchmarking. Read the detailed description of the tests the author invented.
For example: “read the copied files from /dev/null and return them to /dev/null.” This doesn’t test anything except the efficiency of your bit bucket. Either he explained it wrong or he doesn’t understand what he’s testing. Either way it damages credibility.
“Then, we unpack using tar -xf linux.tar inside the RAID (it is almost written because it is copied and the RAM of the system is enough to fit the file).” There’s no such thing as “almost written”. He means that it’s written to cache and scheduled for disk write, which will happen either when dirty buffers exceed a threshold or when a timer expires. With a write of this size that threshold will almost certainly be hit and the write will begin streaming to disk almost immediately.
/dev/null is also a bad data source for disk writes, since it creates sparse files which some systems handle in a special way that adds efficiency. /dev/urandom would be a much better test source since it’s nonblocking and outputs data that can’t be optimised.
While the concept of the article was good, the author seems to have not fully understood what he was testing and therefore ended up putting out more flawed benchmark results.
Hi,
Thanks for your suggestions/criticisms (and to the other people who read it and liked it too).
About /dev/null: it is a mistake in the article! When I wrote it, I wrote /dev/null where I should have written /dev/zero, of course (in some places). dd from /dev/null to a file just gives a new empty file, my mistake. I will correct it now; at the very least I will add a note at the top of the article.
About tar -xf linux.tar: some parts were written to disk, that is 100% sure. Maybe not the whole kernel, but some parts, yes (I checked using vmstat, etc.). Otherwise, how is it possible that it depends on which RAID we are using?
In “real life” we will do tar -xf kernel.tar and it will take more or less time depending on whether we have one RAID or another. Maybe the explanation is not perfect; next time I will unmount the RAID device and add that time to the tar time (to be sure that everything is on disk).
Thanks a lot,
Thanks for noticing; I figured that it had to be a mistake. I still dispute the validity, however, of using /dev/zero as a source for large, contiguous write testing. I still contend that you run the risk of being silently optimized without realizing it. While not true on all filesystems, use of something like /dev/urandom removes all doubt.
If you’re worried about tainting the test with the CPU cycles /dev/urandom might use, you can either remove it from the measurement by benchmarking dd if=/dev/urandom of=/dev/zero on its own, measuring CPU and other system usage, and then subtracting that from your final results.
Or you could mount a second disk in the system and read the file multiple times into /dev/null to guarantee cache locality, then do the write. Even better would be to write a simple C app that mmap()s the source file and then writes it to disk.
If you want to be sure everything gets written to disk, just add && sync to your test command to force a disk flush at the end. Measuring to the return of sync will tell you the real time to write. This is a common usage pattern in real software (postgresql, qmail, etc.).
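Putting those suggestions together, the measurement might look roughly like this; sizes and the /mnt/raid path are illustrative, and the baseline here writes to /dev/null rather than /dev/zero (both simply discard the data):

  # baseline: cost of generating the random data alone, no disk involved
  time dd if=/dev/urandom of=/dev/null bs=1M count=512
  # real test: write the same amount to the array and force it to disk before the clock stops
  time sh -c 'dd if=/dev/urandom of=/mnt/raid/testfile bs=1M count=512 && sync'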
Anyhow, thanks for noticing my comments and taking them constructively, which is how they were intended. I’d love to see you integrate what everyone here said and present new results to see if the changes make a difference.
Striping… it makes sense that it would be faster, since you are breaking the writes up, sending half to each of two disks. It’s like adding platters to a drive.
I would like to see him do some tests with shadowing too. I always wondered if that slowed a system down.
“it makes sense that it would be faster, since you are breaking the writes up, sending half to each of two disks.”
No, you aren’t. You might be, depending on how your filesystem is laid out, what blocks are available for re-allocation, how much data the filesystem contains, the age of the filesystem, how much data turnover the filesystem has experienced, and on and on.
Well, yes, of course. It was a very over-simplified statement, you are right. But the gist of it is that the writes are “hopefully” broken up. That way you not only gain the space of two disks, but the performance gain of “faux” extra platters.
ZFS
I don’t mean to troll, but you had 24TB of data before a backup routine ran? WTF are you collecting? Not trolling, just interested. Oh, and we have HP-branded HDs failing reasonably often; they get 2 years and then they start going south… The R in RAID I understood; the I makes me wonder whose budget it came out of.
Quote:
I don’t mean to troll but, you had 24TB of data before a backup routine ran?
===
How do you surmise that from what I wrote?
Of course it was backed up. Lots of SAN partitions (~150) presented to lots of hosts (~100), all of it backed up.
Full/synthetic full/incrementals.
Backups ran normally for a long time.
Trouble is, with only 2 tape drives, even at 20MB/s it takes a while to recover all the filesystems and all the hosts.
Really, I don’t mind software RAID 1 at home… In a home environment RAID 1 does a good job helping to make sure you don’t lose all of that precious ‘media / data’ you have collected over the years (I don’t have 24TB though).
If you are concerned with speed (or space), put your ‘multiuser media’ directory on the RAID 1 array (and think up a good permissions regime while you’re at it; you don’t want your idiot cousin deleting a couple of gigs of your files), along with a backup of your package lists, and move users’ document directories there (linking them back to their homes). Back up any third-party software your distro doesn’t package (e.g. Loki games). Include the home directories if you are that concerned with people losing their desktop settings. Then keep the rest of the root filesystem on normal (i.e. non-RAID 1) partitions.
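As a concrete (if made-up) example of that shuffle, assuming the RAID 1 array is mounted at /raid and the user is alice, on a Debian-style system:

  # move a user's documents onto the mirror and link them back into the home directory
  mv /home/alice/Documents /raid/alice-documents
  ln -s /raid/alice-documents /home/alice/Documents
  # keep a package list on the mirror too, for painless reinstalls
  dpkg --get-selections > /raid/package-list.txt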
Since every distro worth its salt has a ‘netinst’, and because broadband is readily available, so long as you have a package list, installing to a new drive should be trivial (with free software and pre-rolled binaries).
Really, if you are concerned with the speed of RAID 1, just put the stuff you can’t afford to lose on the RAID 1 array (AKA ‘media’, ‘data’, ‘backups of unpackaged 3rd party software’, and a ‘package list’). Then you’ll get ‘normal’ speed (or save space) for the things you want to be fast… Well, it’s worked well for me that way… YMMV.
nice advertisement. someone mod us down.