IBM has been winning the TPC-C benchmark war against HP for some time and has now really put some pressure on its rival to keep up. Not only did IBM obliterate HP’s Itanium-based server record of 1m transactions per minute, but it also did so with a better price/performance system.
I’m interested in when this technology will flow down to IBM’s consumer chips (980?).
Now we just need to get one of those POWER5 CPUs in our machines
I have one at work.
Single processor, 12GB RAM, and 280GB of disk.
Looks like IBM has done some major work on this core. Performance is significantly higher than POWER4+. Core frequencies of 1.9 GHz… these new procs are monsters. If scaled down into a consumer-level package similar to the PPC970 (…980?), these would give x86-based 64-bit processors a solid kick in the face, so to speak.
It will be interesting to see what kind of performance Sun can squeeze out of the USIV. Its functions are heavily parallelized, making traditional benchmarking schemes insufficient for gauging its probable real-world performance.
Itanium will definitely be crammed into a niche corner.
Competition is a good thing.
Now if I can only get PPC 970 (and hopefully 980) procs and boards without having to give money to Steve Jobs… I would be one happy FreeBSD geek. = )
IBM sells mobos and whatnot with the 970s in them. They cost money, but I seem to remember that a PPC non-Mac system was a good deal… Linux, NetBSD, and FreeBSD would all run superbly on them.
There was even talk, before Apple’s G5s were released, that IBM wanted to sell Linux-based PowerPC workstations at extremely low profit (at cost) to essentially flood the market with the processors, build clientele, and make a lot of Linux users very, very happy.
The 32-CPU POWER4 system already beat the 64-CPU Itanium 2 system.
So it is not a big surprise that the 64-CPU POWER5 system can get three times the performance of the 64-CPU Itanium 2.
Do you have any links about those motherboards? I’d be interested to know how much they cost…
I was looking for them. A quick IBM search shows they advertise PowerPC-based servers, but I couldn’t find the workstations…
*10 minutes of google searching*
Ta DA!
http://slashdot.org/articles/03/07/20/0152245.shtml
I haven’t heard anything about it since, but that is the original article I was recalling. Hope that helps you.
“Now if I can only get PPC 970 (and hopefully 980) procs and boards without having to give money to Steve Jobs… I would be one happy FreeBSD geek. = )”
What’s the problem with giving money to Steve Jobs? The Power Macs are maybe the best systems built around the PowerPC 970, so what? If you want a BSD running on it, well, OS X is the best choice out there, and it’s partly based on FreeBSD. Do you really need to troll…
Whatever. The POWER5 appears to be a monster, and shows the excellent design the PowerPC architecture is built on (the POWER5 is built with two PowerPC cores). It will be great to see the improvements made in the POWER5 cores in a future G5 successor.
This explains some of the performance gains… a 64-CPU box with 2 cores each is similar to a 128-way (theoretically).
The TPC should start noting the # of CPUs __and__ the number of cores per CPU, as this may confuse some people. It confused me, and I’m a computer engineer by trade. Though it is probably noted in the fine print of the PDF.
It’s not really comparing apples to apples when you have a 32-CPU system (1 core per CPU) being compared to a 32-CPU system (2 cores per CPU).
Either way, IBM’s low Price/tpmC ratio is still quite amazing.
I recommend reading this:
http://arstechnica.com/articles/paedia/cpu/POWER5.ars
> IBM sells mobos and whatnot with the 970s in them. They cost money, but I seem to remember that a PPC non-Mac system was a good deal… Linux, NetBSD, and FreeBSD would all run superbly on them.
FreeBSD does not have a ppc64 port. There has been one in progress for a long time, but it has very few developers and resources…
> This explains some of the performance gains… a 64-CPU box with 2 cores each is similar to a 128-way (theoretically).
Well, that doesn’t explain it, because this had 64 cores. IBM doesn’t currently make them any bigger (I don’t think they have plans to, either).
However, it IS somewhat similar to a 128-way in that it has 128 concurrent execution contexts, because the POWER5 CPU has SMT, although I’m not sure whether it was on or not for this test. It seems to be much more successful than Intel’s implementation, though, so I wouldn’t be surprised if it was on.
> The TPC should start noting the # of CPUs __and__ the number of cores per CPU, as this may confuse some people. It confused me, and I’m a computer engineer by trade. Though it is probably noted in the fine print of the PDF.
> It’s not really comparing apples to apples when you have a 32-CPU system (1 core per CPU) being compared to a 32-CPU system (2 cores per CPU).
They have everything in the full disclosure document. Note that even comparing systems with the same number of cores isn’t exactly comparing apples to apples. This thing has six and a half thousand disks, for example, while the now third-placed Superdome had 2,100.
> This thing has six and a half thousand disks, for example, while the now third-placed Superdome had 2,100.
Not to mention double the amount of memory. These two factors (memory and disk) are most likely behind the huge performance.
Interesting trivia: 4TB of memory will set you back a cool $9 mil. Dang!
> They have everything in the full disclosure document. Note that even comparing systems with the same number of cores isn’t exactly comparing apples to apples. This thing has six and a half thousand disks, for example, while the now third-placed Superdome had 2,100.
Looking at just the hardware, it wouldn’t seem a fair comparison. But the TPC-C specification does take this into consideration. Specifically, it requires that the size of the database be proportional to the throughput. That is, if someone wants to run 3 million transactions per minute instead of 1 million, they have to have a database three times bigger. This makes sense; otherwise, I could just build a database that takes up like 1GB of space and produce like 10 million transactions per minute or something…
Thus, in a way, it’s not unreasonable that the IBM machine slightly more than tripled the spindle count and doubled the memory, considering it had a database three times larger.
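To put numbers on it: the spec caps how much throughput a single warehouse may contribute (about 12.86 tpmC per warehouse, if I remember the spec’s constant right), so the minimum database size falls straight out of the throughput target. A quick back-of-envelope sketch in Python (the per-warehouse cap is the only assumption here):

import math

# Sketch of the TPC-C scaling rule. Assumption: the spec caps each
# warehouse's contribution at ~12.86 tpmC (the exact constant is in the spec).
TPMC_PER_WAREHOUSE = 12.86

def min_warehouses(target_tpmc):
    # smallest database (warehouse count) allowed for a given throughput target
    return math.ceil(target_tpmc / TPMC_PER_WAREHOUSE)

for target in (1_000_000, 3_000_000):
    print(f"{target:,} tpmC needs at least {min_warehouses(target):,} warehouses")

Triple the target, triple the warehouses – and each warehouse drags its own slice of disk along with it.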
> Thus, in a way, it’s not unreasonable that the IBM machine slightly more than tripled the spindle count and doubled the memory, considering it had a database three times larger.
… And tripling the disks, not surprisingly, triples the throughput.
The only reasonable part of the test is price/performance.
Tripling the number of disks doesn’t triple the throughput. It’s not a linear relationship. There is a law of diminishing returns in effect here.
1) The i595 system tested has 32 chips & 64 cores. SMT was most likely turned on; IBM’s implementation is to run both streams at the same time, since the POWER system has more execution paths than it can use (I think it’s something like 5-instruction issue per clock and 9 paths, so the second stream could issue instructions to the extra 4 paths; see the sketch after these two points). Intel’s implementation is to run one stream until a branch statement and then switch to the second stream to make sure a branch misprediction doesn’t take place.
2) Disk throughput would be high on the box. IBM likes to use 30GB hard drives in RAID-5 stacks of 14 drives. Each stack has its own controller with about 1250MB of disk cache (I don’t know why they have such an odd-sized cache). The system uses 64-bit PCI-X cards. Also, they had a large number of disk towers; each tower has its own PCI-X backplane, with the base tower having a controller per tower. IBM also likes to place processors on every controller to reduce the overhead on the main CPUs.
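Here’s a toy model of the slot-filling idea (Python; the issue width and the stall behaviour are made-up assumptions, not POWER5 specifics). The point is simply that a second hardware thread soaks up whatever issue slots the first one leaves empty:

import random

random.seed(0)
WIDTH = 5          # issue slots per cycle (the "5 instruction issue" above)
CYCLES = 100_000

def ready():
    # crude stand-in for stalls: each cycle a thread offers 0..WIDTH ready instructions
    return random.randint(0, WIDTH)

single = smt = 0
for _ in range(CYCLES):
    a = ready()
    single += min(a, WIDTH)
    b = ready()                  # the second hardware thread
    smt += min(a + b, WIDTH)     # both threads compete for the same slots

print(f"single thread fills {single / (CYCLES * WIDTH):.0%} of issue slots")
print(f"two SMT threads fill {smt / (CYCLES * WIDTH):.0%} of issue slots")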
This should be no surprise. The Intel chipset and architecture are over 25 years old. And they were substandard then. If it wasn’t for the price of the Intel chipset, it would have been a footnote in history.
IRQs? Hyper-limited registers? A handful of DMA channels? Paged memory? ACPI? No card-supplied driver support? Ugh. 1978 never left.
“Paged memory?”
Hum… hello, I hate to break the news to you, but unless you know of another approach, how would you implement virtual memory? Maybe you meant segmented memory? That hasn’t been an issue since the 386 was introduced, as long as you run code in 32-bit protected mode.
The Intel architecture is not perfect; however, it’s remarkable how much juice Intel’s designers have been able to extract out of it.
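For anyone wondering what “paged memory” actually buys you, here’s the whole idea in a few lines of Python (a model only; the page size is x86’s 4KB, the table contents are made up, and real hardware walks multi-level tables):

PAGE_SIZE = 4096                  # 4KB pages, as on x86

page_table = {0x12345: 0x00042}   # virtual page number -> physical frame (made up)

def translate(vaddr):
    # split the virtual address into page number and offset, then map the page
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        raise MemoryError(f"page fault at {vaddr:#x}")  # the OS would handle this
    return frame * PAGE_SIZE + offset

print(hex(translate(0x12345 * PAGE_SIZE + 0x10)))  # -> 0x42010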
It’s a p595 (can run AIX and Linux, some i5/OS), not an i595 (which can run i5/OS (aka OS/400), AIX, and Linux).
It’s got 8 MCMs (multi-chip modules), each a 5″×5″ package consisting of 4 dual-core chips and the L3 caches:
http://www.theinquirer.net/?article=12145
I’m sure they would have had SMT turned on (so now 8×4×2×2 = 128 logical CPUs, 64 physical). Yes, POWER5 SMT is substantially better than Intel’s implementation: very rarely a performance penalty (at worst it is allegedly 5%), normally a 30% speedup, and maybe as high as 50%.
All the components are _standard_ pSeries parts – it just has _a lot_ of them 🙂
List price was $29m, discounted to $16.7m. 2TB of memory was $9m (list) IIRC.
The backend disk was mostly RAID-0, with RAID-5 for the DB logs (not sure why, since RAID-0 would have been at least as fast and cheaper). The backend storage was DS4500 (aka FAStT900) dual RAID controllers (45, I think) with about 11 drawers of disk attached to each dual controller (14 disks in a drawer): 6400 15Krpm 36.4GB disks + 100 × (10Krpm 145GB?).
They were attached with ninety 64-bit PCI-X adapters. The main CPU unit is attached to the PCI-X adapter drawers via something called RIO (remote I/O) cables, which are very quick (something like 4GB/s – bytes, not bits).
TPC-C is not real life – for starters, you would almost certainly mirror your data. However, it does show how well the hardware can scale should it need to.
“IRQs?”
Practically every computer architecture on the planet has hardware interrupts (and software exceptions).
“Hyper-limited registers?”
Doesn’t tend to matter very much these days, because they have register renaming, and their “hyper-limited” registers are actually backed by a much larger pool of registers available to the rename engine.
Issues can crop up with HPC codes that are really optimised to the bone…
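A toy renamer shows why the small architectural set mostly doesn’t hurt (illustrative only; real hardware has a finite pool and recycles registers at retire):

from itertools import count

phys = count()     # stands in for the much larger physical register pool
rename = {}        # architectural register -> current physical register

def write(arch):
    # every write gets a fresh physical register, killing WAW/WAR hazards
    rename[arch] = f"p{next(phys)}"
    return rename[arch]

def read(arch):
    return rename[arch]

print(write("eax"))   # p0
print(write("eax"))   # p1 - the second write needn't wait on the first
print(read("eax"))    # p1 - readers always see the latest mapping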
“A handful of DMA channels?”
Ugh, those are the old ISA host-initiated DMA things. They don’t get used in modern x86 systems – they just use PCI DMA, with bus-master-initiated DMA.
“Paged memory?”
Yeah, it may be from the ’60s (or late ’50s?), but it is still state of the art.
“ACPI?”
You think ACPI was from 1978? No, it is actually much more recent. It had a shaky start, but these days it is a good system. Those who don’t know the insides of ACPI aren’t really in a position to put it down (and that includes me).
Put it this way: it was developed by people who have a better understanding of the problems and requirements than you or I.
“No card-supplied driver support?”
No thanks. BIOS writers have enough trouble with the basics. I’ll take open specifications and operating-system-provided drivers, which are cross-platform and actually work.
> Tripling the number of disks doesn’t triple the throughput. It’s not a linear relationship. There is a law of diminishing returns in effect here.
Tripling the number of disks triples the number of IOPS that can be performed. The database will be very well spread over all the disks, so diminishing returns doesn’t apply so much, provided you have the CPU power and I/O bandwidth to keep the disks busy.
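Back-of-envelope (Python; the per-spindle figure is a rule-of-thumb assumption for 15Krpm drives, not a measured number):

# Aggregate random IOPS scales with spindle count as long as the load is
# spread evenly and nothing upstream (CPU, controllers, links) saturates.
IOPS_PER_DISK = 175   # assumed ballpark for one 15Krpm spindle

for spindles in (2_100, 6_500):
    print(f"{spindles:>5,} disks -> roughly {spindles * IOPS_PER_DISK:,} random IOPS")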
> “IRQs?”
> Practically every computer architecture on the planet has hardware interrupts (and software exceptions).
Looks like the poster meant vectored interrupts.
The legacy PIC that the x86 line works with is indeed limited: there are very few levels – 16, minus one for the cascade, one for the timer, etc. – which leaves not much.
> “Hyper-limited registers?”
> Doesn’t tend to matter very much these days, because they have register renaming, and their “hyper-limited” registers are actually backed by a much larger pool of registers available to the rename engine.
> Issues can crop up with HPC codes that are really optimised to the bone…
The limitation is the non-orthogonality of the instruction set with respect to the registers.
> “Paged memory?”
> Yeah, it may be from the ’60s (or late ’50s?), but it is still state of the art.
Yes, it’s an improvement over the 8086/286 segmented memory that the 8086 started with.
At the time the IBM PC specs came out, many a developer was dismayed, as they knew the 68000 architecture was much more powerful and easier to program. With its 32-bit registers, it was also future-proof, in a clean way.
“Put it this way: it was developed by people who have a better understanding of the problems and requirements than you or I.”
Probably true. However, the specification those smart people designed appears to have subsequently been implemented, in most cases, by non-housetrained monkeys.
> Looks like the poster meant vectored interrupts.
> The legacy PIC that the x86 line works with is indeed limited: there are very few levels – 16, minus one for the cascade, one for the timer, etc. – which leaves not much.
Doesn’t matter for the vast majority of desktop machines and small servers. Sharing IRQs works fine.
However, on x86 systems with IO-APICs (i.e. SMPs, some UPs), you can have a huge number of interrupts available.
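Sharing works because every handler registered on a line gets called and checks whether its own device actually asserted the interrupt. A model of the dispatch loop in Python (not real kernel code; all the names are made up):

def make_handler(name, raised_irq):
    def handler():
        if raised_irq():               # poll the device: was it us?
            print(f"{name}: handled")
            return True
        return False
    return handler

# two devices sharing one line (made-up example)
irq9 = [make_handler("nic", lambda: False),
        make_handler("usb", lambda: True)]

def dispatch(line):
    handled = [h() for h in line]      # call every registered handler
    if not any(handled):
        print("spurious interrupt")

dispatch(irq9)                         # -> usb: handled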
> Yes, it’s an improvement over the 8086/286 segmented memory that the 8086 started with.
It is an improvement because it is the state of the art (still, today). PPC64 uses it, IA64 uses it; Alpha, SPARC, ARM, MIPS, x86, x86-64, PA-RISC, etc. all use paged memory.
They all use practically the same implementation too (page tables), except for PPC64, which uses a hash table (this is not a superior solution, though).
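The difference between the two, sketched in Python (the sizes, the hash function, and the mappings are all made up for illustration):

# Tree-style translation: walk directory -> table (x86-family style, 2 levels here)
tree = {0x1: {0x23: 0x42}}                 # dir index -> {table index -> frame}

def tree_lookup(vpage):
    hi, lo = vpage >> 8, vpage & 0xFF      # split the page number into indices
    return tree[hi][lo]

# Hash-style translation (PPC64 flavour): probe one bucket, match the full tag
BUCKETS = 16
htab = [[] for _ in range(BUCKETS)]
htab[0x123 % BUCKETS].append((0x123, 0x42))  # (virtual page tag, frame)

def hash_lookup(vpage):
    for tag, frame in htab[vpage % BUCKETS]:
        if tag == vpage:
            return frame
    raise MemoryError("miss: reload the hash table / fault")

print(tree_lookup(0x123), hash_lookup(0x123))   # both -> 66 (0x42)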
> At the time the IBM PC specs came out, many a developer was dismayed, as they knew the 68000 architecture was much more powerful and easier to program. With its 32-bit registers, it was also future-proof, in a clean way.
Ironic that the m68k is dead and gone while (some extension of) the 8086 is still kicking along in some of the most powerful general-purpose chips available today.
> Looks like the poster meant vectored interrupts.
> The legacy PIC that the x86 line works with is indeed limited: there are very few levels – 16, minus one for the cascade, one for the timer, etc. – which leaves not much.
Note, the nForce2’s (e.g. ASUS A7N8X Deluxe) IRQs are numbered up to 22.
> The limitation is the non-orthogonality of the instruction set with respect to the registers.
Expanded to 16 registers in x86-64. Note that the AMD 29000 has 129 registers.
> Whatever. The POWER5 appears to be a monster, (SNIP)
It’s a 276-million-transistor solution.
> This should be no surprise. The Intel chipset and architecture are over 25 years old. (SNIP)
Note that the AMD K7 uses a licensed Alpha EV6 bus.
> If scaled down into a consumer-level package similar to the PPC970 (…980?), these would give x86-based 64-bit processors a solid kick in the face, so to speak.
POWER5 is a ~276-million-transistor chip, while x86-64 parts would only be at the ~210-million-transistor level (the approximate count for a dual-core Opteron) even after dual-core parts are released.