
After years of delivering faster and faster chips that can easily boost the performance of most desktop software, Intel says the free ride is over. Already, chipmakers like Intel and AMD are delivering processors that have multiple brains, or cores, rather than single brains that run ever faster. The challenge is that most of today's software isn't built to handle that kind of advance.
"The software has to also start following Moore's law," Intel fellow Shekhar Borkar said, referring to the notion that chips offer roughly double the performance every 18 months to two years.
"Software has to double the amount of parallelism that it can support every two years."
Oh, I agree that x86 is right and PPC is wrong in this respect. While Core 2 does speculative reordering of loads ahead of stores, it maintains the semantic intent of the instruction stream it receives, paying a penalty of a pipeline flush if it mis-guesses the uniqueness of memory addresses. The programmer shouldn't have to police the processor's reorder unit wherever the instruction order is critical to the correct operation of the program. The processor should always assume that the instructions are ordered in a given way for a good reason. If it can prove that an alternative ordering is semantically identical, then optimize away. But if it can't, then it must assume that order matters.
One of the interesting early benchmark results for the POWER6 core is that it has taken the crown from Core 2 in terms of synthetic, single-threaded integer performance. It's interesting because the Core 2 is a wide, aggressively out-of-order (OOO) core topping out at 2.93 GHz, whereas POWER6 is a relatively narrow, mostly in-order core at a blistering 4.7 GHz. This says a lot about the trade-off between issue width, instruction reordering, and clock frequency. One would assume that the POWER6 team hit their transistor budget and frequency target by thinning and simplifying the pipelines. Not only did this help them reach their performance/watt target for multi-threaded server workloads, but they also won the one-to-one pipeline comparison.
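To put a rough number on that trade-off, here is a back-of-the-envelope sketch. It assumes single-threaded performance is simply work-per-clock times frequency, and that the two cores score about the same (POWER6 is actually slightly ahead, so the real gap is a bit smaller). The function name and the equal-score default are my own illustration, not anything from the benchmark itself.

```python
def implied_per_clock_ratio(freq_a_ghz, freq_b_ghz, score_ratio_a_over_b=1.0):
    """Work per clock of core A relative to core B.

    Assumes the simple model: score = work_per_clock * frequency.
    score_ratio_a_over_b is A's benchmark score divided by B's
    (1.0 means the two cores score the same).
    """
    return score_ratio_a_over_b * freq_b_ghz / freq_a_ghz


# Clock speeds quoted above: Core 2 at 2.93 GHz, POWER6 at 4.7 GHz.
core2_vs_power6 = implied_per_clock_ratio(2.93, 4.7)
print(f"Core 2 per-clock advantage at equal scores: {core2_vs_power6:.2f}x")
```

Under those assumptions the wide OOO core needs roughly 1.6 times the work per cycle just to draw level with the narrow core's clock advantage.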
Armed with new evidence that clock frequency, more than any other architectural enhancement, is inescapably central to single-threaded performance, and given the brutal relationship between frequency, voltage, and power, we find ourselves in a bit of a pickle. We know how to increase performance/watt on the server, where multi-threading is a natural extension of the workload. But how do we improve performance/watt in client workloads characterized by I/O-bound, latency-sensitive applications that are difficult to parallelize?
Actually, I think that the POWER6 team might have stumbled onto part of the solution. Put the execution units on a strict diet. Keep the load/store reordering, issue everything else in-order. Offset frequency increases by cutting transistors and aggressively pursuing process shrinks to allow lower voltages. Power is linear with frequency and transistor count, while it's quadratic with voltage. Invest the extra die space in caches, which help immensely in reducing noticeable latencies on client systems, while being fairly easy to optimize for power efficiency. Processors spend huge amounts of time waiting on loads. Transistors are much better spent implementing caches than implementing wide and aggressive pipelines.
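The scaling claim above can be sketched with the usual first-order model for dynamic power, P ~ N * V^2 * f (N standing in for switched transistor count). This ignores leakage and everything else that complicates real chips, and the scale factors below are made-up illustrative numbers, not measurements of any actual processor.

```python
def relative_dynamic_power(freq_scale, voltage_scale, transistor_scale=1.0):
    """Dynamic power relative to a baseline design.

    First-order model only: linear in frequency and transistor count,
    quadratic in supply voltage. Leakage power is ignored.
    """
    return transistor_scale * voltage_scale ** 2 * freq_scale


# Hypothetical scenarios, all relative to a baseline of 1.0:
scenarios = {
    "+20% clock, same voltage": relative_dynamic_power(1.2, 1.0),
    "+20% clock, +10% voltage": relative_dynamic_power(1.2, 1.1),
    "+20% clock, -10% voltage, -10% transistors": relative_dynamic_power(1.2, 0.9, 0.9),
}

for name, power in scenarios.items():
    print(f"{name}: {power:.2f}x baseline power")
```

The third scenario is the diet described above: a faster clock paid for by fewer transistors and a lower voltage, which in this toy model ends up below the baseline power budget.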
I think this Intel fellow is way off the mark, quite frankly. Each x86 vendor has started with an architecture from opposing ends of the spectrum and worked toward the center. Intel started with a mobile design and adapted it into a multi-core, wide-issue design. AMD started with a server architecture and adapted it for power efficiency. In the center lies this low-volume, high-margin, prestige market segment populated by gamers, enthusiasts, and multimedia professionals. This is the least power-conscious segment of the entire market. While Intel remains better positioned to provide strong performance/watt in the high-volume mobile sector, they have clearly made compromises in their design in order to win performance benchmarks and the associated enthusiast mindshare.
So look no further than your own choices, Intel. Software has its own problems to deal with without having to worry about thread-parallelism that may not be realistic in many cases. Software is unreliable, insecure, expensive to maintain, and threatened by a broken intellectual property system in some nations. There's way more complexity to software development than winning benchmark contests, and in this day and age, that holds true for CPU vendors as well.