Linked by Thom Holwerda on Sat 26th May 2007 22:16 UTC
After years of delivering faster and faster chips that can easily boost the performance of most desktop software, Intel says the free ride is over. Already, chipmakers like Intel and AMD are delivering processors that have multiple brains, or cores, rather than single brains that run ever faster. The challenge is that most of today's software isn't built to handle that kind of advance. "The software has to also start following Moore's law," Intel fellow Shekhar Borkar said, referring to the notion that chips offer roughly double the performance every 18 months to two years. "Software has to double the amount of parallelism that it can support every two years."
Thread beginning with comment 243354
rayiner Member since:
2005-07-06

Neither PPC nor Alpha is any more conducive to thread-level parallelism than x86. Indeed, in many respects (a sane memory ordering model, for one), x86 is much more conducive to shared-memory parallel software.

Score: 4

butters Member since:
2005-07-08

Old McDonald had multi-threaded PPC code, eieio.

It stands for Enforce In-order Execution of I/O.

Score: 5

rayiner Member since:
2005-07-06

Maintaining the view that loads/stores execute in-order makes writing multithreaded code easier, not harder. You don't have to clutter your code with fence instructions in x86 as you do on some other architectures.
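
A toy way to see what in-order loads and stores buy you is the classic message-passing pattern: one thread stores data and then sets a flag, another reads the flag and then the data. The sketch below is not a model of any real x86 or PPC memory system; it just enumerates interleavings in Python, and the names (`interleavings`, `outcomes`, `allow_reordering`) are made up for illustration. With program order preserved, "flag set but data stale" never shows up; once each thread's two accesses are allowed to swap, it does, and that is the case a fence has to rule out.

```python
from itertools import permutations

def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def outcomes(allow_reordering):
    """Return the set of (flag_seen, data_seen) pairs the reader can observe."""
    writer = [("store", "data"), ("store", "flag")]  # publish data, then set flag
    reader = [("load", "flag"), ("load", "data")]    # check flag, then read data
    # Toy "weak" model: each thread's two independent accesses may swap.
    writer_orders = set(permutations(writer)) if allow_reordering else {tuple(writer)}
    reader_orders = set(permutations(reader)) if allow_reordering else {tuple(reader)}
    seen = set()
    for w in writer_orders:
        for r in reader_orders:
            for schedule in interleavings(list(w), list(r)):
                memory = {"data": 0, "flag": 0}
                loaded = {}
                for op, addr in schedule:
                    if op == "store":
                        memory[addr] = 1
                    else:
                        loaded[addr] = memory[addr]
                seen.add((loaded["flag"], loaded["data"]))
    return seen

in_order = outcomes(allow_reordering=False)
weak = outcomes(allow_reordering=True)
# (1, 0) means "flag was set but data was stale", the bug a fence prevents.
print((1, 0) in in_order)
print((1, 0) in weak)
```

Running it prints False for the in-order case and True for the reordering case.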

Score: 4

butters Member since:
2005-07-08

Oh, I agree that x86 is right and PPC is wrong in this respect. While Core 2 does speculative reordering of loads ahead of stores, it maintains the semantic intent of the instruction stream it receives, paying a penalty of a pipeline flush if it mis-guesses the uniqueness of memory addresses. The programmer shouldn't have to police the processor's reorder unit wherever the instruction order is critical to the correct operation of the program. The processor should always assume that the instructions are ordered in a given way for a good reason. If it can prove that an alternative ordering is semantically identical, then optimize away. But if it can't, then it must assume that order matters.

One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance. It's interesting because the Core 2 is a wide, aggressively OOO core topping out at 2.93 GHz, whereas POWER6 is a relatively narrow, mostly in-order core at a blistering 4.7 GHz. This says a lot about the trade-off between issue-width, instruction reordering, and clock frequency. One would assume that the POWER6 team hit their transistor budget and frequency target by thinning and simplifying the pipelines. Not only did this help them reach their performance/watt target for multi-threaded server workloads, but they also won the one-to-one pipeline comparison.

Armed with new evidence that clock frequency is inescapably central to single-threaded performance, above all other forms of architectural enhancements, and combined with the brutal relationship between frequency, voltage, and power, we find ourselves in a bit of a pickle. We know how to increase performance/watt on the server, where multi-threading is a natural extension of the workload. But how do we improve performance/watt in client workloads characterized by I/O-bound, latency-sensitive applications that are difficult to parallelize?

Actually, I think that the POWER6 team might have stumbled onto part of the solution. Put the execution units on a strict diet. Keep the load/store reordering, issue everything else in-order. Offset frequency increases by cutting transistors and aggressively pursuing process shrinks to allow lower voltages. Power is linear with frequency and transistor count, while it's quadratic with voltage. Invest the extra die space in caches, which help immensely in reducing noticeable latencies on client systems, while being fairly easy to optimize for power efficiency. Processors spend huge amounts of time waiting on loads. Transistors are much better spent implementing caches than implementing wide and aggressive pipelines.
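
To put rough numbers on that power relationship, here is a small sketch assuming the usual first-order model where dynamic power scales as transistor count times frequency times voltage squared. It ignores leakage, and the function name and the example scale factors are hypothetical, not POWER6 figures.

```python
def relative_power(freq_scale, voltage_scale, transistor_scale=1.0):
    """Dynamic power relative to a baseline of 1.0 (first-order model, no leakage)."""
    return transistor_scale * freq_scale * voltage_scale ** 2

# Clock 20% faster on the same design and voltage: power rises by the same 20%.
faster = relative_power(1.2, 1.0)
# Same 20% clock bump, but a shrink allows 10% lower voltage: roughly break-even.
faster_lower_v = relative_power(1.2, 0.9)
# Also trimming 15% of the transistors puts power below the baseline.
leaner = relative_power(1.2, 0.9, 0.85)
print(round(faster, 3), round(faster_lower_v, 3), round(leaner, 3))
```

The voltage term is the one that pays for the frequency increase, which is why the process shrink matters so much in this argument.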

I think this Intel fellow is way off the mark, quite frankly. Each x86 vendor has started with an architecture from opposing ends of the spectrum and worked toward the center. Intel started with a mobile design and adapted it to a multi-core, wide-issue design. AMD started with a server architecture and adapted it to power efficiency. In the center lies this low-volume, high-margin, prestige market segment populated by gamers, enthusiasts, and multimedia professionals. This is the least power-conscious segment of the entire market. While Intel remains better positioned to provide strong performance/watt in the high-volume mobile sector, they have clearly made compromises in their design in order to win performance benchmarks and the associated enthusiast mindshare.

So look no further than your own choices, Intel. Software has its own problems to deal with without having to worry about thread-parallelism that may not be realistic in many cases. Software is unreliable, insecure, expensive to maintain, and threatened by a broken intellectual property system in some nations. There's way more complexity to software development than winning benchmark contests, and in this day and age, that holds true for CPU vendors as well.

Score: 5

rayiner Member since:
2005-07-06

"Oh, I agree that x86 is right and PPC is wrong in this respect."

I see that. I didn't get the eieio joke...

"One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance."

I'm not convinced that this is really the whole of it. If you look at the SPECint_base for both chips, they're roughly the same (Power6: 17.8, Core 2: 17.5). Apparently the Power6 gets a HUGE boost from profile-guided optimization. I can see why profile-guided optimization would provide such a boost in performance for an in-order chip, but I'm not convinced you're going to see that in real-world code. SPEC is something of an optimal case for PGO, since you basically run the same data set that the profile info was collected for.
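
For a sense of how differently the two chips get to those base scores, divide them by the clock speeds quoted earlier in the thread (4.7 GHz and 2.93 GHz). Whether the SPEC runs used exactly those clocks is an assumption here, and `per_ghz` is just an illustrative helper.

```python
def per_ghz(spec_int_base, clock_ghz):
    """SPECint_base points per GHz of clock (a crude per-clock measure)."""
    return spec_int_base / clock_ghz

# Scores from this post; clocks from the earlier post, assumed to match the runs.
power6 = per_ghz(17.8, 4.7)
core2 = per_ghz(17.5, 2.93)
print(f"Power6: {power6:.2f} per GHz, Core 2: {core2:.2f} per GHz")
```

Under that assumption the Core 2 does noticeably more per clock, and the Power6 reaches parity through frequency.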

That said, I definitely think you're right that the Power6 folks are on to something. Achieving even performance-parity on integer code with the Core 2 is no mean feat. Whatever they're doing in the memory pipeline is clearly a promising technique, and likely to get better as it becomes more mature. There are still a lot of unknowns with regard to Power6 (how do you schedule for it? can the chip cover cache latency or do you have to take that into account? how sensitive is it to scheduling?) but it's clearly something to keep a watch on.

Score: 2

Innominandum Member since:
2005-11-18

I was going after their push to multi-core when there are plenty of gains to be had on single-core systems. There's an undertone that all they have left is adding cores.

Score: 2

kaiwai Member since:
2005-07-06

"I was going after their push to multi-core when there are plenty of gains to be had on single-core systems. There's an undertone that all they have left is adding cores."


There is a limit to how much one can squeeze out of a single processor before the actual returns start to diminish to almost nothing. Sure, Intel could push the clock upwards again, but the issue of diminishing returns would come back to haunt them.

What there needs to be is a greater emphasis at university on teaching youngsters how to code properly the first time: spending time teaching people how to actually design their application before they write it, rather than simply throwing code at a problem and hoping that it works correctly.

Anything is possible once good discipline enters the equation; the problem is, far too many programmers don't have it. Code it now and let some sucker sort out the issues in 20 years' time.

Score: 3

Doc Pain Member since:
2006-10-08

"Neither PPC nor Alpha are any more conducive to thread-level parallelism than x86."

There were other great architectures which had parallelism in hardware, i.e. real CPUs, not just CPU cores. These systems had better "throughput", but this was many years ago.

- Intel Pentium 4 @ 1700 MHz, 575 int, 593 fp, 0.687 per MHz
- AMD Athlon @ 1333 MHz, 482 int, 414 fp, 0.672 per MHz
- DEC Alpha 21264A @ 833 MHz, 518 int, 590 fp, 1.330 per MHz
- HP PA 8700 @ 750 MHz, 569 int, 526 fp, 1.460 per MHz
- MIPS R14000 @ 500 MHz, 410 int, 436 fp, 1.692 per MHz

(SPEC 2000 INT / FP BASE values)
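
The "per MHz" column in the list above looks like (int + fp) divided by the clock in MHz; that reading is inferred from the numbers rather than stated. A quick Python check, using only the figures from the list (the `per_mhz` helper is just for illustration):

```python
# (name, clock in MHz, SPEC 2000 int base, SPEC 2000 fp base), copied from the list
chips = [
    ("Intel Pentium 4", 1700, 575, 593),
    ("AMD Athlon", 1333, 482, 414),
    ("DEC Alpha 21264A", 833, 518, 590),
    ("HP PA 8700", 750, 569, 526),
    ("MIPS R14000", 500, 410, 436),
]

def per_mhz(mhz, spec_int, spec_fp):
    """Combined int + fp score per MHz of clock."""
    return (spec_int + spec_fp) / mhz

results = {name: round(per_mhz(mhz, i, f), 3) for name, mhz, i, f in chips}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

The computed values match the column to three decimal places, so the figures measure work per clock on a single CPU rather than anything about running threads in parallel.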

Of course, this professional equipment was not designed to take any market share in the home computing and entertainment market.

Score: 2

rayiner Member since:
2005-07-06

None of the architectures you mentioned have anything special with regard to thread-level parallelism. Some may or may not be more conducive to instruction-level parallelism than x86, but that's a different bag of cats.

Score: 1