Oh, I agree that x86 is right and PPC is wrong in this respect. While Core 2 does speculative reordering of loads ahead of stores, it maintains the semantic intent of the instruction stream it receives, paying a penalty of a pipeline flush if it mis-guesses the uniqueness of memory addresses. The programmer shouldn't have to police the processor's reorder unit wherever the instruction order is critical to the correct operation of the program. The processor should always assume that the instructions are ordered in a given way for a good reason. If it can prove that an alternative ordering is semantically identical, then optimize away. But if it can't, then it must assume that order matters.
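To make the ordering point concrete, here is a toy sketch of the classic "message passing" test: one thread stores data and then a flag, another loads the flag and then the data. It is only a model that enumerates interleavings in Python, not a description of how Core 2 or any PPC part works internally, and the names (`WRITER`, `READER`, `outcomes`) are made up for illustration. With program order kept, the reader can never see the flag set but the data stale; once the model is allowed to swap each thread's two accesses, as a weakly ordered machine may do when no barrier is present, that outcome appears.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one global schedule against a tiny shared memory."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for op, location, arg in schedule:
        if op == "store":
            mem[location] = arg          # arg is the value to store
        else:
            regs[arg] = mem[location]    # arg is the destination register
    return regs["r_flag"], regs["r_data"]


# Hypothetical two-thread program: publish data, then raise a flag.
WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]


def outcomes(writer_orders, reader_orders):
    """Collect every (r_flag, r_data) pair reachable under the given orders."""
    seen = set()
    for w in writer_orders:
        for r in reader_orders:
            for schedule in interleavings(w, r):
                seen.add(run(schedule))
    return seen


# Program order preserved on both threads.
in_order = outcomes([WRITER], [READER])

# Toy "relaxed" model: each thread's two accesses may also be swapped.
relaxed = outcomes([WRITER, WRITER[::-1]], [READER, READER[::-1]])

print("in order:", sorted(in_order))
print("relaxed: ", sorted(relaxed))
```

The pair (1, 0), meaning "flag seen, data stale", shows up only in the relaxed set. That is exactly the case the programmer has to rule out with a barrier on a weakly ordered machine.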
One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance. It's interesting because the Core 2 is a wide, aggressively OOO core topping out at 2.93 GHz, whereas POWER6 is a relatively narrow, mostly in-order core at a blistering 4.7 GHz. This says a lot about the trade-off between issue-width, instruction reordering, and clock frequency. One would assume that the POWER6 team hit their transistor budget and frequency target by thinning and simplifying the pipelines. Not only did this help them reach their performance/watt target for multi-threaded server workloads, but they also won the one-to-one pipeline comparison.
Armed with new evidence that clock frequency is inescapably central to single-threaded performance, above all other forms of architectural enhancements, and combined with the brutal relationship between frequency, voltage, and power, we find ourselves in a bit of a pickle. We know how to increase performance/watt on the server, where multi-threading is a natural extension of the workload. But how do we improve performance/watt in client workloads characterized by I/O-bound, latency-sensitive applications that are difficult to parallelize?
Actually, I think that the POWER6 team might have stumbled onto part of the solution. Put the execution units on a strict diet. Keep the load/store reordering, issue everything else in-order. Offset frequency increases by cutting transistors and aggressively pursuing process shrinks to allow lower voltages. Power is linear with frequency and transistor count, while it's quadratic with voltage. Invest the extra die space in caches, which help immensely in reducing noticeable latencies on client systems, while being fairly easy to optimize for power efficiency. Processors spend huge amounts of time waiting on loads. Transistors are much better spent implementing caches than implementing wide and aggressive pipelines.
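As a rough back-of-the-envelope sketch of that scaling rule (linear in frequency and transistor count, quadratic in voltage), the snippet below computes power relative to a baseline design. The function name and the scale factors are hypothetical and chosen only for illustration; they are not measurements of POWER6 or Core 2, and leakage power is ignored entirely.

```python
def relative_dynamic_power(freq_scale, voltage_scale, transistor_scale=1.0):
    """Dynamic power relative to a baseline, using P ~ N * f * V^2.

    All three arguments are ratios against the baseline design
    (1.0 means unchanged). Leakage is not modeled.
    """
    return transistor_scale * freq_scale * voltage_scale ** 2


# Hypothetical trade: 50% higher clock, 30% fewer switching transistors,
# and a process shrink that allows 20% lower voltage.
leaner_core = relative_dynamic_power(
    freq_scale=1.5, voltage_scale=0.8, transistor_scale=0.7
)

# Same clock bump with nothing else changed, for comparison.
clock_only = relative_dynamic_power(freq_scale=1.5, voltage_scale=1.0)

print(f"leaner core: {leaner_core:.2f}x baseline power")
print(f"clock only:  {clock_only:.2f}x baseline power")
```

Under these made-up numbers the leaner core runs 50% faster on roughly two thirds of the baseline power, and most of that saving comes from the squared voltage term.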
I think this Intel fellow is way off the mark, quite frankly. Each x86 vendor has started with an architecture from opposing ends of the spectrum and worked toward the center. Intel started with a mobile design and adapted it into a multi-core, wide-issue design. AMD started with a server architecture and adapted it for power efficiency. In the center lies this low-volume, high-margin, prestige market segment populated by gamers, enthusiasts, and multimedia professionals. This is the least power-conscious segment of the entire market. While Intel remains better positioned to provide strong performance/watt in the high-volume mobile sector, they have clearly made compromises in their design in order to win performance benchmarks and the associated enthusiast mindshare.
So look no further than your own choices, Intel. Software has its own problems to deal with without having to worry about thread-parallelism that may not be realistic in many cases. Software is unreliable, insecure, expensive to maintain, and threatened by a broken intellectual property system in some nations. There's way more complexity to software development than winning benchmark contests, and in this day and age, that holds true for CPU vendors as well.
Oh, I agree that x86 is right and PPC is wrong in this respect.
I see that. I didn't get the eieio joke...
One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance.
I'm not convinced that this is really the whole of it. If you look at the SPECint_base for both chips, they're roughly the same (Power6: 17.8, Core 2: 17.5). Apparently the Power6 gets a HUGE boost from profile-guided optimization. I can see why profile-guided optimization would provide such a boost in performance for an in-order chip, but I'm not convinced you're going to see that in real-world code. SPEC is something of an optimal case for PGO, since you basically run the same data set that the profile info was collected for.
That said, I definitely think you're right that the Power6 folks are on to something. Achieving even performance-parity on integer code with the Core 2 is no mean feat. Whatever they're doing in the memory pipeline is clearly a promising technique, and likely to get better as it becomes more mature. There are still a lot of unknowns with regard to Power6 (how do you schedule for it? can the chip cover cache latency or do you have to take that into account? how sensitive is it to scheduling?) but it's clearly something to keep a watch on.
There is a limit to how much one can squeeze out of a single processor before the actual returns start to diminish to almost nothing. Sure, Intel could push the clock upwards again, but the issue of diminishing returns would come back to haunt them.
What there needs to be is a greater emphasis at university on teaching youngsters how to code properly the first time: spending time teaching people how to actually design their application before they write it, rather than simply throwing code at a problem and hoping that it actually works correctly.
Anything is possible once good discipline enters the equation; the problem is, far too many programmers don't have it. Code it now and let some sucker sort out the issues in 20 years' time.
"Neither PPC nor Alpha are any more conducive to thread-level parallelism than x86."
There were other great architectures which had parallelism in hardware, i.e. real CPUs, not just CPU cores. These systems had better "throughput", but this was many years ago.
- Intel Pentium 4 @ 1700 MHz, 575 int, 593 fp, 0.687 per MHz
- AMD Athlon @ 1333 MHz, 482 int, 414 fp, 0.672 per MHz
- DEC Alpha 21264A @ 833 MHz, 518 int, 590 fp, 1.330 per MHz
- HP PA 8700 @ 750 MHz, 569 int, 526 fp, 1.460 per MHz
- MIPS R14000 @ 500 MHz, 410 int, 436 fp, 1.692 per MHz
(SPEC 2000 INT / FP BASE values; "per MHz" is the sum of the int and fp scores divided by the clock in MHz)
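The "per MHz" column appears to be the integer and floating-point scores added together and divided by the clock in MHz. A quick sketch to recompute it from the figures listed above (the variable and function names are my own):

```python
# (name, clock in MHz, SPEC 2000 int base, SPEC 2000 fp base), as listed above
RESULTS = [
    ("Intel Pentium 4", 1700, 575, 593),
    ("AMD Athlon", 1333, 482, 414),
    ("DEC Alpha 21264A", 833, 518, 590),
    ("HP PA 8700", 750, 569, 526),
    ("MIPS R14000", 500, 410, 436),
]


def per_mhz(mhz, int_score, fp_score):
    """Combined int + fp score per MHz of clock."""
    return (int_score + fp_score) / mhz


# Recompute the last column, rounded to three decimals like the list.
table = {name: round(per_mhz(mhz, i, f), 3) for name, mhz, i, f in RESULTS}

for name, value in table.items():
    print(f"{name:<18} {value:.3f} per MHz")
```

The recomputed values match the list to three decimals, so the column is a crude "work per clock" figure and not an official SPEC metric.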
Of course, this professional equipment was not designed to take any market share in the home computing and entertainment market.







Neither PPC nor Alpha are any more conducive to thread-level parallelism than x86. Indeed, in many respects (sane memory ordering model), x86 is much more conducive to shared-memory parallel software.