Linked by MOS6510 on Fri 17th May 2013 22:22 UTC
Hardware, Embedded Systems "It is good for programmers to understand what goes on inside a processor. The CPU is at the heart of our career. What goes on inside the CPU? How long does it take for one instruction to run? What does it mean when a new CPU has a 12-stage pipeline, or 18-stage pipeline, or even a 'deep' 31-stage pipeline? Programs generally treat the CPU as a black box. Instructions go into the box in order, instructions come out of the box in order, and some processing magic happens inside. As a programmer, it is useful to learn what happens inside the box. This is especially true if you will be working on tasks like program optimization. If you don't know what is going on inside the CPU, how can you optimize for it? This article is about what goes on inside the x86 processor's deep pipeline."
Thread beginning with comment 561952
Comment by Drumhellar
by Drumhellar on Sat 18th May 2013 01:03 UTC
Member since:
2005-07-12

I do believe the terminology used in the article is off.

The 486 didn't introduce a "superscalar pipeline" to x86; it's just a pipeline, meaning multiple instructions at different stages of execution in a single execution unit.

"Superscalar" refers to having multiple execution units, whether pipelined or not.

He also conflates "Core" with "Core 2". They are different chips.
"Core" was derived from the Pentium M, and was 32-bit, and was single or dual core.

The Core 2 was 64-bit, and available in single, dual, or quad versions.

Reply Score: 6

RE: Comment by Drumhellar
by tylerdurden on Sat 18th May 2013 01:51 in reply to "Comment by Drumhellar"
tylerdurden Member since:
2009-03-17

Since we're nitpicking: by your own definition the 486 is superscalar; it had multiple functional units.

Superscalar refers to the ability to run multiple functional units (usually redundant ones) in parallel, which I believe the 486 couldn't do. The usual rule of thumb is that a superscalar microarchitecture can sustain a theoretical IPC greater than 1.

Reply Parent Score: 2

RE[2]: Comment by Drumhellar
by butters on Sat 18th May 2013 03:03 in reply to "RE: Comment by Drumhellar"
butters Member since:
2005-07-08

Superscalar means multiple instructions may be issued in a single cycle. The P5 Pentium was the first superscalar x86 chip. It could dispatch and issue one or two instructions per cycle in program order. The P6 (Pentium Pro through Pentium III) could dispatch three instructions per cycle and issue five instructions per cycle out of program order.
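The pairing idea behind P5-style dual issue can be sketched with a toy model (a hypothetical pairing rule, not Intel's actual U/V-pipe restrictions): each cycle issues up to two instructions in program order, but the second slot is used only when it has no data dependence on the first.

```python
# Toy model of in-order dual issue, illustrative only: up to two
# instructions per cycle, with the second slot used only when it has
# no data dependence on the first.

def dual_issue_cycles(program):
    """program: list of (dest, srcs) tuples in program order."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest0, _ = program[i]
            _, srcs1 = program[i + 1]
            if dest0 not in srcs1:   # independent -> pair in one cycle
                i += 2
                continue
        i += 1                       # dependent or last -> issue alone
    return cycles

# Four independent adds pair up: 2 cycles instead of 4.
independent = [("r1", ()), ("r2", ()), ("r3", ()), ("r4", ())]
# A dependent chain can't pair at all: 4 cycles.
chain = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]
print(dual_issue_cycles(independent))  # 2
print(dual_issue_cycles(chain))        # 4
```

Even this crude model shows why compilers of the era scheduled independent instructions next to each other: pairing only happens when adjacent instructions don't conflict.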

Pentium M and Core (1) are a direct evolution of P6 with the same three dispatch ports and five issue ports. Pentium M added micro-op fusion with an additional two pipeline stages (12 to 14). Core allowed two cores to share a common L2 cache.

Core 2, besides the 64-bit GPRs, is wider than P6, with four dispatch ports and six issue ports. And Haswell is adding another two issue ports for a total of eight.

As for pipeline depth, the entire industry has converged on 12-15 cycles for CPUs designed to be clocked in the 2-4GHz range. Apple A6 and Qualcomm Snapdragon have 12-cycle pipelines. Atom is moving from a 14-cycle pipeline to a brand-new 13-cycle pipeline. ARM Cortex A15 has an eponymous 15-cycle pipeline.

But at clock frequencies below ~1.5GHz, a shorter pipeline is the better choice. The 7-cycle ARM Cortex A7 is the best example of a modern core designed to perform well at low clock frequencies.

Reply Parent Score: 5

RE[2]: Comment by Drumhellar
by theosib on Sat 18th May 2013 03:09 in reply to "RE: Comment by Drumhellar"
theosib Member since:
2006-03-02

I'll admit that there is some variation in terminology, but the defining characteristic of a superscalar processor is that it will fetch, decode, and dispatch (queue up to wait for dependencies) multiple instructions on a cycle. It is a partially orthogonal issue that an out-of-order processor may issue (start actually executing) multiple instructions on a cycle due to multiple dependencies being resolved at the same time. There have been plenty of scalar processors with OoO capability (the IBM 360/91 FPU using Tomasulo's algorithm, the CDC 6600, etc.). And there have been plenty of in-order superscalar processors (original Pentium, original Atom, ARM Cortex A8, various SPARCs, etc.).

I say that these are PARTIALLY orthogonal, because superscalar processors typically have multiple functional units of the SAME type, while scalar processors typically do not. Having the ability to decode multiple instructions at once massively increases the probability that more than one of those will target the same type of functional unit, putting pressure on the execution engine to have redundant functional units to get good throughput.
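The probability argument can be checked with quick arithmetic on an assumed instruction mix (the percentages below are made up for illustration, not measured workload data):

```python
# Quick arithmetic behind the "redundant units" pressure argument,
# using an assumed instruction mix (not real workload data).
mix = {"alu": 0.6, "mem": 0.3, "mul": 0.1}

# With 2-wide decode and independently distributed slots, the chance
# that both instructions in a decode group want the same unit type:
p_both_alu = mix["alu"] ** 2
p_same = sum(p * p for p in mix.values())
print(round(p_both_alu, 2))  # 0.36
print(round(p_same, 2))      # 0.46
```

So even a modest 2-wide front end hits a same-unit conflict on a large fraction of decode groups, which is why superscalar designs tend to duplicate the common units.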

Out-of-order was developed as a way to decouple instruction fetch from instruction execution. Some instruction types naturally lend themselves to taking multiple cycles (e.g. multiply). Fetch and decode are a precious resource, and you don't want to hold them up just because you have a multiply instruction holding up the works. While that's going on, it would be nice to keep the adder busy doing other independent work, for instance.

So OoO was developed to keep fetch and decode productive. But then that opens up another problem where fetching and decoding one instruction per cycle can't keep the execution units busy. Thence came superscalar. It's an interesting optimization problem to find the right dispatch width and the right number of redundant functional units in order to get the best throughput, especially when power consumption constraints come into play.
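The decoupling point can be illustrated with a toy timing model (assumed latencies, and a deliberately pessimistic blocking in-order baseline; real in-order pipelines overlap more than this): a 4-cycle multiply followed by three independent 1-cycle adds.

```python
# Toy timing model: assumed latencies, not real hardware numbers.
LATENCY = {"mul": 4, "add": 1}

def in_order_blocking(instrs):
    # Pessimistic in-order model: each instruction waits for the
    # previous one to complete before it can start.
    t = 0
    for op in instrs:
        t += LATENCY[op]
    return t

def out_of_order(instrs):
    # Scalar issue (one instruction enters the engine per cycle), but
    # independent ops execute concurrently on separate functional units.
    # Assumes the adds are all independent of the multiply.
    return max(i + LATENCY[op] for i, op in enumerate(instrs))

program = ["mul", "add", "add", "add"]
print(in_order_blocking(program))  # 7 cycles
print(out_of_order(program))       # 4 cycles: the adds finish under the mul
```

The adder stays busy while the multiplier grinds away, which is exactly the fetch/execute decoupling described above.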

Note: I'm a professor of computer science, and I teach Computer Architecture.

Edited 2013-05-18 03:28 UTC

Reply Parent Score: 5