Linked by David Adams on Wed 4th Aug 2010 18:28 UTC, submitted by estherschindler
Hardware, Embedded Systems
Anyone contemplating a new computer purchase (for personal use or business) is confronted with new (and confusing) hardware choices. Intel and AMD have done their best to differentiate the x86 architecture as much as possible while retaining compatibility between the two CPUs, but the differences between the two are growing. One key differentiator is hyperthreading; Intel does it, AMD does not. This article explains what that really means, with particular attention to the way different server OSes take advantage (or don't). Plenty of meaty tech stuff.
Thread beginning with comment 435525
HT vs. multi-cores + some background
by ndrw on Thu 5th Aug 2010 17:50 UTC

It was an interesting read. Thumbs up for the article.

Just to put it in context, there were several stages of CPU development, each marked by different performance characteristics:

Stage 1. CPU. Ages ago, code execution speed was directly related to the CPU clock frequency and the number of clock cycles per instruction. Memory access was either not a concern at all, or could easily be alleviated with a simple cache. At the same time, Moore's law let us scale clock frequencies and transistor densities exponentially. So far so good: speeding up the CPU was the solution.

Stage 2. Going super-scalar. So we scaled the operating frequencies up and added system-level techniques like pipelining and additional caches to boost CPU performance even more, and then we hit a problem: a huge gap between the clock cycle duration and memory latency. When executing typical application code, performance is no longer defined by CPU speed but by memory access delay. The CPU works in bursts: 10 clock cycles of program execution, then 100 cycles of waiting for data. Many more-or-less speculative techniques (the super-scalar architecture) were developed to predict the data so that the CPU can keep running, in the hope that it will not have to discard that work too often. This mechanism, although quite effective, is far from perfect. This is where HT fits in, by allowing another thread to execute while the original one is waiting for memory (at the hardware level).

Stage 3. Copy-paste (multi-cores). All these techniques, combined with pipelining, made it increasingly difficult to raise the clock frequency of the CPU. Design complexity, size and power consumption all scale much worse than linearly with frequency. This is the primary reason multi-cores were proposed (and why the Pentium 4 was a dead end). Except for I/Os, memory access circuits, etc., it's just like taking a CPU and copy-pasting it a few times. If supported by the OS and applications, multi-cores give a close-to-linear improvement in performance relative to the increase in circuit complexity and power consumption.

The cost of multi-threading and HT is the software interface. Most applications are still written as sequential code, and many algorithms can't be parallelized at all. Moreover, where multi-threading gives a measurable and predictable speedup, HT further complicates the already over-complicated bottleneck of the CPU, and its overall speedup can be as low as zero if the code the CPU happens to execute is well optimized (so that it doesn't stall between memory accesses).

Also, elaborate caching schemes generally don't play well with multiple CPUs. It's less of a problem with multi-cores than it used to be with multiple standalone CPUs, but nevertheless, every time we write something to memory we must make sure that all shadow copies of that memory cell are updated before they make their way into a CPU pipeline.

Stage 4. Performance/watt. System performance is now defined not by frequency, not by memory latency, but by power consumption (and sadly that's a physical limitation, not a design one). As the fabrication process scales down, individual transistors become less power hungry (~linearly) and work at slightly higher frequencies, but there are many more of them (~quadratically). So power consumption is bound to deteriorate unless we start using most of the chip area for cache (which beyond some point doesn't improve performance significantly).

Although the single-threaded performance of the super-scalar architecture is still the best, it is slowly becoming a niche application. Many highly parallel applications simply don't care about it at all; they just want maximum data throughput without frying the CPU. Similarly, in the embedded space battery life is everything, and these platforms usually come with specialized hardware anyway. In these applications, the best way to build a CPU is to use a basic (and thus small and low-power) core without any super-scalar extensions (or with just basic ones, like static branch prediction), and put many of them on the same chip with extensive power and clock gating. That's more in line with what modern GPUs are, and this pattern is likely to spread further as performance/watt becomes more important.

Reply Score: 1