Intel’s forthcoming ‘Montecito’ member of the Itanium processor family will consume 100 watts, a significant drop from the 130 watts of current models and an advantage in an era when power consumption is a top concern. Intel spokesman Scott McLaughlin confirmed the figure at an Itanium Solutions Alliance meeting here. The change means Itanium will have about 2.5 times the performance per watt of the current Itanium 2 9M model.
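A quick back-of-the-envelope check of what those figures imply, taking the 2.5x performance-per-watt claim at face value (a sketch in Python; “performance” is whatever metric Intel used, which the company did not specify):

# Implied raw performance gain from the quoted TDP drop and the
# 2.5x performance-per-watt claim (taken at face value).
old_tdp_watts, new_tdp_watts = 130, 100
perf_per_watt_ratio = 2.5
implied_perf_ratio = perf_per_watt_ratio * new_tdp_watts / old_tdp_watts
print(f"implied raw performance gain: about {implied_perf_ratio:.1f}x")   # ~1.9x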
Can’t wait for the first Itanium notebooks to arrive!<joke>
Itanium is shaping up to be an unmitigated disaster of a CPU. Wasn’t EPIC supposed to make for smaller, simpler CPUs that could scale to higher clockrates? Yet, a 1.6 GHz Itanium2 needs nearly 600m transistors and 130W to deliver SPECint performance at the level of a 2.2 GHz Opteron (which uses ~115m transistors and consumes ~70W)? Yes, I understand Itanium has all sorts of “big iron” features, but they could’ve easily been added to an Opteron-like chip without quadrupling the transistor count! Even the Itanium2’s famous FPU performance is only about 10% faster than an Opteron 254’s.
Montecito doesn’t seem to make things any better. What is up with only 1.6 GHz at 90nm? One of the theoretical “advantages” of VLIW is that it can forgo the complex structures used for out-of-order execution. In modern RISCs, these structures can often be the critical bottleneck preventing the rest of the CPU from running at higher clockspeeds. Obviously, the Itanium2 is being bottlenecked somewhere (perhaps at the massive register file with the enormous number of ports necessary to feed 6 integer units?)
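To make that concrete, here’s a toy sketch in Python (not IA-64 semantics; the 3-slot bundle width is only loosely modeled on Itanium’s bundles) of the VLIW idea: the compiler packs independent operations into fixed-width bundles ahead of time, so the hardware can issue a whole bundle per cycle without the schedulers and reorder buffers an OOO core needs.

# Toy compile-time bundling, illustrating why a VLIW core can, in theory,
# skip out-of-order hardware: dependence checking happens here, not in silicon.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    dst: str
    srcs: tuple

def pack_bundles(instrs, width=3):
    """Greedily pack instructions into bundles of `width`, never putting an
    instruction in the same bundle as one whose result it reads."""
    bundles, current, written = [], [], set()
    for ins in instrs:
        depends = any(s in written for s in ins.srcs)
        if depends or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dst)
    if current:
        bundles.append(current)
    return bundles

prog = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("mul", "r4", ("r5", "r6")),
    Instr("ld",  "r7", ("r8",)),
    Instr("add", "r9", ("r1", "r4")),   # reads r1 and r4, so it starts a new bundle
]

for i, b in enumerate(pack_bundles(prog)):
    print(f"bundle {i}: " + " | ".join(f"{x.op} {x.dst}" for x in b))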
A 1.66GHz Madison 9M has more than 500 million transistors. Most of that is from its caches, and doesn’t represent any additional complexity to the core. The purpose of EPIC is to maximize parallelism and reduce the complexity of the core logic. If Intel wanted to aim for higher clockspeeds or just reduce power consumption, they certainly aren’t making it very obvious with the processes they’ve used to fab the Itanium and Itanium 2.
It has a base SPECfp2000 score about 39% higher than the Opteron 254. It has a base SPECfp_rate score about 20% higher than an Opteron 254. Its integer scores are somewhere between an Opteron 148 and an Opteron 252.
Relying on SPEC scores for making performance predictions would be a serious mistake. The Itanium 2 is very sensitive to compiler performance. You could be buying a pretty expensive disappointment without doing sufficient research.
Most of that is from its caches, and doesn’t represent any additional complexity to the core.
It’s been shown quite explicitly (hah!) that Itanium needs the ridiculously sized caches to perform adequately. It doesn’t matter whether the extra transistors are in the caches or in the core — what matters is the final die size.
The purpose of EPIC is to maximize parallelism and reduce the complexity of the core logic.
The general purpose of VLIW is to improve performance while enabling simpler and smaller processors. However, it appears that the massive cache demands of the EPIC architecture outweigh the moderately reduced complexity of the core CPU.
If Intel wanted to aim for higher clockspeeds or just reduce power consumption, they certainly aren’t making it very obvious with the processes they’ve used to fab the Itanium and Itanium 2.
Intel did want higher clockspeeds for Montecito — it was supposed to run up to 2.0 GHz. They couldn’t achieve it, even at 90nm. In theory, a VLIW should be able to scale to higher clockspeeds than an OOO RISC. The complex structures that enable OOO can often become critical timing paths in the logic that bottleneck clockspeeds. The Itanium design isn’t showing this theoretical advantage in practice.
It has a base SPECfp2000 score about 39% higher than the Opteron 254. It has a base SPECfp_rate score about 20% higher than an Opteron 254.
The highest SPECfp_base score for an Opteron 254 is 2256, for a Sun Fire machine. The highest SPECfp_base score for a 1.66 GHz Itanium2 is 2851. That’s 26% faster, not 39%. Similarly, using the Sun Fire machine as the Opteron reference, I get a rates difference of 16%, not 20%. That’s not a whole lot for a CPU that costs several times more, and has a poorer system interconnect to boot.
It’s been shown quite explicitly (hah!) that Itanium needs the ridiculously sized caches to perform adequately. It doesn’t matter whether the extra transistors are in the caches or in the core — what matters is the final die size.
You attacked the Itanium 2 on the grounds of its complexity running contrary to the design goals of EPIC. The size of a processor does not represent its complexity. The majority of the Itanium 2 that you alluded to is in its caches, and not the actual core. Its core is quite simple.
Intel did want higher clockspeeds for Montecito — it was supposed to run up to 2.0 GHz.
A 2GHz part does not suggest a goal of theoretically higher clockspeeds. A 2GHz part would be “welcome to the club” for the majority of its competitors. Intel obviously desires to increase the clock speed of their processor, but nothing in their actions suggests that obtaining high clock speeds is their goal. They’ve consistently used older processes for fabrication, and completely ignored the clockrate and power consumption of the project for more than a half decade.
This is the entry I used:
Hewlett-Packard Company ProLiant BL25p (AMD Opteron (TM) 254) 1 core, 1 chip, 1 core/chip 2051 2258
http://www.spec.org/cpu2000/results/res2005q3/cpu2000-20050902-0465…
This the Madison 9M I used:
HITACHI HITACHI BladeSymphony (1.66GHz/9MB Itanium 2) 1 core, 1 chip, 1 core/chip 2851 —
http://www.spec.org/cpu2000/results/res2005q4/cpu2000-20051212-0516…
Here’s your selection:
Sun Microsystems Sun Fire X4200 2 cores, 2 chips, 1 core/chip 2256 2518
http://www.spec.org/cpu2000/results/res2005q4/cpu2000-20050906-0467…
This is the entry you want:
Sun Microsystems Sun Fire X4200 1 core, 1 chip, 1 core/chip 2132 2344
That’s 33.7% not 26%.
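For reference, recomputing the gaps being argued over here, straight from the SPECfp_base scores pasted above:

# Percentage leads of the 1.66 GHz Itanium 2 (2851) over the three
# Opteron 254 results quoted in this subthread.
itanium_9m_fp = 2851
opteron_254_fp = {
    "ProLiant BL25p, 1 chip": 2051,     # -> ~39%
    "Sun Fire X4200, 2 chips": 2256,    # -> ~26%
    "Sun Fire X4200, 1 chip": 2132,     # -> ~34%
}
for system, score in opteron_254_fp.items():
    print(f"vs {system}: Itanium 2 ahead by {100 * (itanium_9m_fp / score - 1):.1f}%")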
People interested in buying the Madison 9M will most likely not be doing so based upon SPEC performance, but rather on performance in specific computational tasks, typically in large clusters.
You attacked the Itanium 2 on the grounds of its complexity running contrary to the design goals of EPIC. The size of a processor does not represent its complexity. The majority of the Itanium 2 that you alluded to is in its caches, and not the actual core. Its core is quite simple.
No, I attacked the I2 on the grounds of its enormity running contrary to the design goals of VLIW. The whole point of VLIW is to do more with fewer transistors. The great weakness of Itanium is that while it saves a few transistors (I say “a few” because, discounting SMT, the POWER5 core isn’t that much larger than the I2 core), the savings are erased by the huge instruction and data caches needed to keep the core fed and to hide access latency without the benefit of out-of-order execution.
A 2GHz part does not suggest a goal of theoretically higher clockspeeds. A 2GHz part would be “welcome to the club” for the majority of its competitors. Intel obviously desires to increase the clock speed of their processor, but nothing in their actions suggests that obtaining high clock speeds is their goal.
Itanium2 was supposed to ship at 1.8 GHz to 2.0 GHz (with Foxton). It didn’t. Something is bottlenecking the clockspeed of the design, and with a supposed 100W TDP, it’s probably not heat. So the CPU is likely timing-limited in some critical path, something that VLIWs are supposed to have an easier time with than OOO RISCs.
This is the entry you want:
Sun Microsystems Sun Fire X4200 1 core, 1 chip, 1 core/chip 2132 2344
Why is that the system I want? SPEC is a single-threaded benchmark — it doesn’t matter if the other Sun Fire has two CPUs. I want the significantly higher base score of the second system.
No, I attacked the I2 on the grounds of its enormity running contrary to the design goals of VLIW.
Wasn’t EPIC supposed to make for smaller, simpler CPUs that could scale to higher clockrates?
The Itanium is a simple processor. It simply has an enormous quantity of cache. Yes, its cache is necessary for it to obtain its performance levels. So are the caches on Intel’s mobile line and forthcoming processors.
Itanium2 was supposed to ship at 1.8 GHz to 2.0 GHz (with Foxton). It didn’t.
Which still doesn’t suggest that Intel’s design has been intended to scale to high clock speeds. It’s never been released with higher clock speeds. Intel could produce an EPIC processor with small real estate that could scale to high clockspeeds, and still perform like ass.
Why is that the system I want? SPEC is a single-threaded benchmark — it doesn’t matter if the other Sun Fire has two CPUs. I want the significantly higher base score of the second system.
To compare ‘kind to kind’ to remove any variance for comparison other than compiler and processor. As an aside, while SPEC is a single-threaded benchmark (or more specifically, its subtasks are each unthreaded programs), it should be noted that the rate scores reflect multiple concurrent processes.
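To make that distinction concrete, a minimal sketch in Python (the workload and scoring are stand-ins, not SPEC’s actual benchmarks or formulas) of timing one copy versus measuring the throughput of several concurrent copies:

# Simplified base-vs-rate illustration: "base"-style runs time a single copy,
# "rate"-style runs launch several copies at once and score throughput.
import time
from multiprocessing import Pool

def workload(_):
    # stand-in for one benchmark copy: CPU-bound arithmetic
    return sum(i * i for i in range(2_000_000))

def single_copy_time():
    start = time.perf_counter()
    workload(0)
    return time.perf_counter() - start             # seconds, lower is better

def throughput(copies=4):
    start = time.perf_counter()
    with Pool(copies) as pool:
        pool.map(workload, range(copies))
    return copies / (time.perf_counter() - start)  # copies/second, higher is better

if __name__ == "__main__":
    print(f"single-copy time: {single_copy_time():.2f} s")
    print(f"4-copy throughput: {throughput(4):.2f} copies/s")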
The Itanium is a simple processor. It simply has an enormous quantity of cache.
I said “smaller, simpler”. Itanium 2 is neither small, nor particularly simple. At 25m transistors, its core is not a whole lot smaller than that of the aggressively out-of-order POWER4. And it’s certainly not small, not even relative to CPUs like the POWER5.
Yes, its cache is necessary for it to obtain its performance levels. So are the caches on Intel’s mobile line and forthcoming processors.
Yonah is still a 90mm^2 processor, despite having two cores and being a relatively aggressive OOO design.
Which still doesn’t suggest that Intel’s design has been intended to scale to high clock speeds.
No, it suggests that Intel’s design was intended to scale to higher clockspeeds than it has achieved. Which is my point — it should be easier to get a VLIW to a high clockspeed than the equivalent RISC.
To compare ‘kind to kind’ to remove any variance for comparison other than compiler and processor.
The Sun Fire and the Itanium system are completely different. Replacing one completely different system with another doesn’t make any difference. The Opteron certainly doesn’t benefit in any way from being in a 2-CPU system, and we were comparing SPECfp, not SPECfp_rate.
Jeeez Intel, if $2 billion is burning a hole in your wallet, just make a cashier’s check out to me – $1 million will go to the good people at OSNEWS.COM.
Seriously, focus on VIIV, cut your losses short and dump Itanium – if Windows isn’t going to support it, you’re dead! UNIX cannot help you recover $2 billion because you’ll have to go up against Power and Sparc, and there’s a ton of software packages for those two, so if you want ISVs to port to Itanium, start cutting checks and send them out pasted on top of the free developer Itanium boxes.
Am I the only one in the world who still has faith in Itanium? Or at least the desire to see it succeed?
There’s so much cache on a Montecito that it might as well be considered a memory module with a built in processor. I think Intel and HP can use all the faith they can get.
It’s been shown quite explicitly (hah!) that Itanium needs the ridiculously sized caches to perform adequately. It doesn’t matter whether the extra transistors are in the caches or in the core — what matters is the final die size.
Wrong and wrong. If you check out the specint, fp, and 2-way rate for all Madison variants, you will find that the improvement of Madison 9M (400 MHz FSB) over Madison 3M (533 MHz FSB) is less than 10%, and that the much more recent Madison 9M (667 MHz FSB) is in this range as well.
I do not think that 3 MB is in any way ridiculous in a modern process.
The large cache variants exist for the same reason that Potomac Xeon with 8 MB L3 exists — for multiprocessor machines with a larger than ideal number of CPUs on a single FSB segment. This is a platform issue and has nothing to do with the CPU design.
It seems that you concede this in your last lines, saying that Madison ‘only’ does 20-30 percent better than Opteron, and that’s ‘not a lot’ for a CPU with worse interconnect technology — in fact, it seems like this speaks well of the core Itanium2 technology, that it does that much better with a definitely worse memory subsystem.
Furthermore, your claim that only die size matters, and it does not matter whether the transistors are in the core or cache, is patently false. Both because of the regular structure of caches, and because of (in some cases lots of) redundant rows being built in, the % of rejects due to cache problems is way lower than that from core defects – that is, modern (read: all Intel) cache designs just don’t contribute much to lowering yields. A large die with good yields can easily be at price parity with a smaller one with worse yields.
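To illustrate, a toy comparison in Python using the classic Poisson yield model, yield = exp(-defect density x area); the defect density and areas below are illustrative numbers, not real process data:

# Cache with redundant rows is treated as repairable, so only the "sensitive"
# core-logic area counts against yield (a simplification of the argument above).
from math import exp

def die_yield(sensitive_area_mm2, defects_per_mm2=0.005):
    return exp(-defects_per_mm2 * sensitive_area_mm2)

small_all_core  = die_yield(120)   # smallish die that is nearly all core logic
big_cache_heavy = die_yield(100)   # much larger die, but only ~100 mm^2 of it is
                                   # yield-sensitive core; the cache is repairable
print(f"small all-core die: ~{small_all_core:.0%} yield, "
      f"big cache-heavy die: ~{big_cache_heavy:.0%} yield")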
Wrong and wrong. If you check out the specint, fp, and 2-way rate for all Madison variants, you will find that the improvement of Madison 9M (400 MHz FSB) over Madison 3M (533 MHz FSB) is less than 10%, and that the much more recent Madison 9M (667 MHz FSB) is in this range as well.
3M is a sodding lot of cache, for a single core. Hell, even POWER5+, which gets very good SPECfp figures itself, only has 1.92MB of cache on die! Look at the SPECfp base of the 1.5MB Altix 350 on the database. At 1.4 GHz, it gets a SPECfp_base of 1668. With 3MB of cache, the same system gets 1931, about 16% higher. The 6MB systems are almost 40% higher (normalized to 1.4 GHz), although some of that is probably due to the system itself. That performance puts it on a par with a 2.4 GHz Opteron, a CPU that has only 1MB of cache. And the I2’s integer performance with 1.5MB is depressing — on the order of an old Athlon XP 2.2 GHz.
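Working those numbers through (the 1.5 MB and 3 MB scores are the Altix figures above; the 6 MB score below is a placeholder I picked to reproduce the roughly 40% figure, not an actual spec.org result):

# Cache scaling on the Itanium 2, with the crude linear clock normalization
# I used above for the faster 6 MB parts.
def normalized(score, ghz, target_ghz=1.4):
    return score * target_ghz / ghz

fp_1p5mb = 1668     # Altix 350, 1.4 GHz, 1.5 MB L3
fp_3mb   = 1931     # same system, 3 MB L3
print(f"3 MB over 1.5 MB: +{100 * (fp_3mb / fp_1p5mb - 1):.0f}%")   # ~16%

fp_6mb_at_1p4 = normalized(2650, ghz=1.6)   # placeholder 1.6 GHz / 6 MB score
print(f"6 MB (normalized to 1.4 GHz) over 1.5 MB: "
      f"+{100 * (fp_6mb_at_1p4 / fp_1p5mb - 1):.0f}%")              # ~39%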
I do not think that 3 MB is in any way ridiculous in a modern process.
3MB is a lot of cache (even on a modern process) just for an extra 26% performance.
It seems that you concede this in your last lines, saying that Madison ‘only’ does 20-30 percent better than Opteron, and that’s ‘not a lot’ for a CPU with worse interconnect technology — in fact, it seems like this speaks well of the core Itanium2 technology, that it does that much better with a definitely worse memory subsystem.
It speaks poorly of the I2 core technology, when a (relatively) tiny mass-market CPU can come within spitting distance of it, despite its enormous advantage in cache size. More generally, it speaks poorly of EPIC, that Intel’s best shot at EPIC, a chip that has the luxury of a 400mm^2+ die, a chip that’s had billions of dollars of investment over the better part of a decade, etc, can barely beat a CPU that is used almost unchanged in $1000 “gamer rigs”, and is smaller and uses less power to boot. And even then, it can only beat it in floating-point performance, not integer performance. If EPIC were a good design, it should be embarrassing the Opteron given the amount of resources that have gone into its development. 26% is not embarrassing, not for AMD, anyway.
A large die with good yields can easily be at price parity with a smaller one with worse yields.
That’s a good point, but what I was getting at is that a 400mm^2 die is obviously more expensive to produce than a 120mm^2 one, manufacturing advantages of large cache structures aside. Of course, let’s compare the I2 to a processor in its own class — the POWER5+. The 90nm POWER5+ keeps up quite well in SPEC (beats the I2 in FP), and is a drastically smaller chip than Montecito is projected to be, even considering that it has an integrated memory controller and SMT, and Montecito does not.
Get a Life and jwwf are right. In the end, it does not matter what makes the CPU tick, only the task it has to accomplish – as quickly as possible.
Itaniums are a Nuclear Weapons Research Expert’s dream CPU. In this particular scenario, there’s no need for 10,000+ relatively weak CPUs with 40,000+ relatively small memory modules. No, in this specific instance, a few hundred massive (in FPU performance) CPUs with equally massive (in capacity) memory modules are better suited to the simulation at hand. This is just one task…there are other jobs the Itanium is also capable of executing, better than any other processor.
So all the nay-sayers need to take a step back and realize the unique duty that the Itanium processor has to fulfill. I’m not saying the Itanium is perfect or that its price isn’t outrageous, but it certainly has its role on this planet. I hope that this very specialized CPU can be better understood through my (simplified) example.
So all the nay-sayers need to take a step back and realize the unique duty that the Itanium processor has to fulfill. I’m not saying the Itanium is perfect or that its price isn’t outrageous, but it certainly has its role on this planet. I hope that this very specialized CPU can be better understood through my (simplified) example.
No doubt the I2 is a good CPU for the niche it seems to have found. If you need the extra performance per core it can offer in highly-optimized FP applications, the I2 is your thing. However, the Itanium was not meant to be a CPU for doing nuclear weapons research. It was supposed to be a general purpose CPU for running general purpose programs, a competitor to Alpha, PowerPC, SPARC, etc. At one point, it was even conceived to be an x86 replacement, though such hopes were quickly dashed. The fact that Itanium has had to retreat to a highly specific niche* speaks to the failure of EPIC in fulfilling its original goals. If Itanium can never crawl out of its niche, it will be a failure, not just in regards to its original goals, but monetarily. A tremendous amount of money has gone into Itanium, and it’s doubtful whether just the hardcore scientific computing niche can allow Intel to recoup that investment.
* i.e.: not just scientific apps, but scientific apps in which its performance-per-core advantage over the Opteron leads to enough of a reduction in node count to significantly improve performance through reduced communication overhead.
IA64 was touted as a replacement for IA32. Intel planned on developing IA64 cores for HPC, workstations, and eventually even to replace desktop processors. This was predicated on their ability to write a compiler that could optimize general-purpose code and market adoption of the platform. Neither materialized, and Intel didn’t put forth any sort of concerted effort to this end.
3M is a sodding lot of cache, for a single core. Hell, even POWER5+, which gets very good SPECfp figures itself, only has 1.92MB of cache on die!
….
The 90nm POWER5+ keeps up quite well in SPEC (beats the I2 in FP), and is a drastically smaller chip than Montecito is projected to be…
Yes, but this 1.92 MB on-chip cache is backed by 36 MB off-chip L3. That L3 runs at a significant fraction of the core frequency through a wide interconnect, and the costs of this are quite large versus a somewhat smaller, but equally effective, integrated cache. I can only assume that if IBM could fab something like Montecito, they would.
As a sidenote, I have also read that the closest IBM has come to disclosing the power dissipated in an IBM 4 die + 4 cache POWER* MCM is “in the neighborhood of a kilowatt”. POWER partisans should maybe lose the Itanium power-consumption talking points and realize that, with a few exceptions, the cutting edge in performance has always been bad on the electrical bill. Alpha and older PA-RISC were horrible. My HP B2000 workstation claims to have a max power input of 620 W, and has a single PA-8500 inside. No idea on the average power; I’d like to measure it.
Anyway, at 90 nm, with Intel’s cutting-edge SRAM design, I believe 1 MB of L3 is worth around 15 mm^2. I think I must just respectfully disagree here. This is a bargain, and if you can fab it, I don’t see why not to throw a few megs on die. AMD, being apparently still capacity constrained, can’t.
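To put that in perspective, the silicon cost of the L3 sizes discussed in this thread, using my rough 15 mm^2-per-MB figure (an estimate, not an Intel number):

# Approximate 90 nm die area spent on L3 at ~15 mm^2 per MB.
MM2_PER_MB = 15
for mb in (3, 6, 9, 24):   # Madison 3M/6M/9M; 24 MB = Montecito's two 12 MB L3s
    print(f"{mb:2d} MB of L3 -> about {mb * MM2_PER_MB:3d} mm^2")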
It speaks poorly of the I2 core technology, when a (relatively) tiny mass-market CPU can come within spitting distance of it
Must disagree again. I think 26% (or whatever) is respectable, especially across process generations, but the real question is the different set of goals of the two products. That is, what is the performance of an Itanium 8- or 16-way versus an Opteron of that class? Answer is NaN, since none exist from high-tier vendors, and I’m not buying any $100K+ whitebox (although I’m not claiming some are not crazy enough to do so).
Also, if the 400 mm^2 die worries you, remember that this is an Intel decision, that they almost always fab the biggest chip and just disable cache. A smaller I2 is possible, and may even exist in I2 LV, but I haven’t seen anything on this. And actually, I believe I2 9M is 480 mm^2 – yes, the ol’ reticle size. But for a server CPU, I see no problem with this; indeed, if I’m paying $4K a CPU, I want no skimping!