“Can you imagine getting a new PC and finding your software runs no faster than before? You probably can’t imagine it running slower. For some types of software, however, that is exactly what is going to happen in the not-too-distant future. Faster processors running slower may sound bizarre, but if you’re using certain types of data structures or code on large-scale (8+) multicore processors, it might actually happen. In fact, it might already be happening today if you were to run legacy software on one of today’s processors that features a large number of cores.”
If the software is legacy to a big degree (think Windows 98, 95, ME, 3.1) then the best solution is probably visualization – things will be more compatible this way
You mean virtualization?
The next frontier, virtual visualization.
No, he means that users who want to run legacy code should sit back and *imagine* how it runs.
Things work better that way. Just don’t let them make feature requests – too many people already live in a dream world when it comes to feature requests.
From the sound of the architecture specifics mentioned in the article, it seems to be talking about Intel CPUs. AMD, for example, has a different way of accessing memory and uses on-chip controllers, making it possible for each core to have its own path to memory. Keep in mind I am by no means hugely knowledgeable about CPU architecture, so I could be wrong. At any rate, I feel more confident that AMD, rather than Intel, will be able to solve the problems mentioned in the article.
I think the issues impact AMD the same way. The memory controller is on-die for AMD, but that doesn’t, to my knowledge, affect the cache relationship.
What he’s talking about has always been a fundamental issue in cache design: how to keep the cache correct and fast.
Basically, multiple cores add another set of ways to organize the cache, meaning there can be more permutations of cache systems and one more nasty problem to solve.
I am, however, not sure that his concerns here affect more than 2% of developers. Most application developers are not concerned with constant factors like cache misses, slow cache, and parallelization (however you spell that!).
It’s a good read though!
Alas, most of this article is FUD.
1) Processors have elaborate cache-coherency state machine protocols implemented in hardware which are optimized to take the penalty out of cache coherency (bus-based ones are called “snooping”). Most programs do not share much memory at all – only programs written to be parallel share memory.
2) Memory bus bottleneck, latency problems: correct for current Intel architecture, but no mention of AMD’s HT & NUMA solution? Most of these problems were solved with the Opteron design several years ago. Opteron is the most exciting NUMA design out right now, why isn’t it mentioned?
3) Sun’s Rock and Niagara: These are server processors – where throughput is king. Of course they aren’t faster – they are designed to be slower and make up the difference in number of cores.
Individual pieces of software may run slightly slower on parallel machines. But having multiple processors provides better load balancing (which means a single CPU can devote 100% of cycles to a program, instead of 95% of cycles and 5% to OS), lower user-visible latency, and a host of other good things that will make the machine seem faster.
Memory bus bottleneck, latency problems: correct for current Intel architecture, but no mention of AMD’s HT & NUMA solution? Most of these problems were solved with the Opteron design several years ago. Opteron is the most exciting NUMA design out right now, why isn’t it mentioned?
The Opteron is a fairly simplistic (i.e., no directory) and unscalable NUMA design. It fixes the bandwidth bottleneck, but not the latency or cache issues. Look at the memory latency for a 1P versus 2P versus 4P versus 8P Opteron box. At 1P, you’re at around 50ns. At 2P, you’re up to 70ns. At 4P, you’re well over 100ns, and at 8P, HyperTransport is choked with all the snoop traffic.
@rayiner
Refer to Hypertransport 3.0
AMD is preparing essentially reverse-Hyperthreading.
That is, the CPU will feature an ultra-high-speed, complex scheduler (much like a super-micro kernel) that will use out-of-order execution techniques to spread single threads over multiple cores within the processor.
For some odd reason many people think this too demanding a task and expect it to yield less speedup than you might hope (say, 50% improvement, factoring in basic sensibility). While there is some credibility in those arguments, I have found such technologies (albeit playing slightly different roles) already in play in nearly every computer in the world. CPUs already have OOE (out-of-order execution), meaning the code that comes in is not executed in the order it was received. So what can you do with the older instructions that are not currently being executed and that we know cannot depend on later instructions for proper execution? Simple: throw those older instructions onto other cores to get done instead of having them sit around.
BUT, we can take it one step further: when some instructions come in, send them to the first available core, re-order normally, and be done with it. Externally the CPU would appear to be a single core executing mighty quickly. No need for special compilers or software modifications to see benefits from multiple cores.
However, complexity is added when providing programmable access to each core individually. This can be handled much like SMP, except you now have to get rid of the single-CPU virtualization so that SMP kernels and the like work as intended.
To get around this, you blend the two together, still throwing instructions to all cores for simultaneous (or nearly simultaneous, more like overlapping) execution, but also allowing direct access to those cores. There will be some hardware locking involved, I’m sure, but the benefits will be worth the relatively small effort.
In fact, a few minor tweaks can create processors for different purposes. Use full virtualization (maybe a BIOS switch?) when you’re running outdated single-threaded software and need the best performance you can muster; use just virtual SMP for file servers and other applications where the instructions, while simple, are wildly threaded and optimized for SMP due to the nature of the work being done. And, of course, have both enabled for a good mixture of single-threaded and threaded performance.
Now, think of a CPU with 8 cores. You’re running Windows Vista; imagine a 2-CPU limit. If you virtualize 8 CPUs, it will not help you much – you will only be using two of them effectively. BUT, why not allow the virtualization of 2 CPUs with 4 cores each?
See where I’m going? Can’t wait for the fun to begin.
–The loon
(With my fifty cents worth)
I wanna mod that down just so I can mod it back up again to 5. Nifty points, sir.
The answer, I suspect, is not going to be larger and fancier cache operations, but rather new memory technology. Fortunately, that appears to be on the horizon.
MRAM (Magnetic RAM) has impressive properties. It is nearly as fast as static RAM, has lower power and competitive density compared to DRAM, and IBM, at least, believes it will be cost-competitive with DRAM in full production. It’s also nonvolatile.
The technology is now at the point where Toshiba and NEC have announced a 16MB MRAM chip running at 1.8V.
This technology, if it continues to prove out, will allow caches to be flattened out (perhaps only an L1 cache) or eliminated altogether while maintaining high per-cpu performance. Further, this would simplify software development, as opposed to approaches such as the Cell architecture.
Jim
No, I don’t buy the MRAM story – ever since the mid 80s it has always been tomorrow. 16M is so far behind 1G DRAM as to be useless.
I do buy the RLDRAM story instead: it offers effectively SRAM performance and SRAM interfacing at 2x DRAM price for 512MB today, but it is only known about and used by the Ciscos etc. for networking. It has 20x the raw throughput of any DDR SDRAM for fully random access and a somewhat higher block transfer rate, i.e. 400MHz DDR on a 64-bit bus. Unfortunately it doesn’t go on a DIMM; it must be on the CPU board. So regular SDRAM DIMMs could be used for bulk DRAM or HD caching.
The idea of multiple banks concurrently fetching data from DRAM implies threaded processors, but the idea can also be pushed into highly banked concurrent SRAM caches running at CPU clocks on the CPU die. I suspect Rock is doing that.
If it must be on the CPU board, with regular (DDR) SDRAM as bulk memory, wouldn’t that effectively make the RLDRAM some kind of cache again? Including all associated problems with cache coherency?
It sounds very interesting though, especially for custom applications that need lots of threads, and do not need the memory to be expandable. Then you could design the hardware with only RLDRAM. Also, video boards come to mind.
Yes, RLDRAM could effectively be the first-stop DRAM, with as many 32MByte parts as might be put on a mobo and SDRAM as the second level. But it only makes sense for highly threaded processors, since they can interleave their memory requests into the RLDRAM or an SRAM equivalent of it. A single-threaded CPU doesn’t get much benefit from RLDRAM’s ability to run 8 accesses concurrently, although a 20ns worst-case full row cycle is still far better than the 60ns of regular parts plus controller. The RLDRAM3 standard is moving to a 533MHz command bus with DDR I/O, and that means 1.9ns issue rates and 15ns access time.
The cache coherency issues pertain to current systems. The RLDRAM MMU model I am looking at is for a message-based threaded CPU, a Transputer. It uses an object-based MMU with an inverted-page-type structure. It would also work for SDRAM, but throughput is around 10-20x worse, so few threads could run on that.
I believe the IBM Power mainframes also use an L3 cache which is, IIRC, a proprietary 5ns DRAM very similar to the RLDRAM core. Fast DRAM has been around a long time, but to be fast, big, and cheap it has to come from the regular DRAM vendors, who are not looking outside networking markets. Samsung also has a similar sort of networking product.
Not to be confused with RDRAM from Rambus, which has much faster signal rates but also very high latencies. One could perhaps take the RLDRAM architecture and add the much more elaborate RDRAM adaptive I/O drivers to allow DIMM use, but that’s another story.
I favour putting the CPU on the DIMM instead, calling it a TRAM or Transputer module, and then having a mobo with slower HT-like interconnects to drop in a number of TRAMs. Um, sounds familiar – looks like an Opteron multi-socket board. BTW, DRC has a 940-socket FPGA coprocessor for the Opteron with added memory on board; one could argue that this 940 drop-in module is a reinvention of the TRAM. In fact, HT links look a lot like old Transputer links with 15 years of development.
Interesting, but as usual no mention of alternate solutions on the memory side, just the same old bigger-caches solution. The cache is really part of the problem now rather than the solution it once was. When SRAM caches get really big (several megabytes) they end up performing far worse than specialized fast DRAMs: bigger in area, leaky as hell, and they make CPU chips too big and too expensive.
Over the last 20 years or so, DRAM has changed relatively little except in a few ways. RAS/CAS address muxing became synchronous, but every access that misses the cache still takes many DRAM bus cycles, very little use is made of multiple banks, and this plays out over hundreds of CPU clocks. DRAM RAS-to-Dout has halved from around 120ns to 60ns (before the controller is added back in); CAS data-out rates have improved much more, but that’s for single blocks of data.
When I write code that searches trees or walks graphs, each pointer hop uses only a few words of store, and all the work done by the cache in bursting large cache lines is wasted. It is better for code fetches and high-locality data, though. Try running always-random memory accesses on an Athlon and the CPU is reduced to 300ns fetch rates.
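To illustrate the point, here is a minimal C++ sketch – the sizes and names are purely my own illustrative assumptions – that compares streaming through an array with chasing a randomly ordered chain of indices through the same amount of memory. On typical hardware the second loop is many times slower, because each hop is a dependent, effectively random fetch that throws away most of the cache line the hardware just burst in.

// Minimal sketch: sequential streaming vs. random pointer chasing.
// Sizes and names are illustrative, not taken from the article.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;  // ~16M elements, far larger than any cache

    // Sequential pass: cache-line bursts and the prefetcher do their job.
    std::vector<int> data(N, 1);
    auto t0 = std::chrono::steady_clock::now();
    long long sum = std::accumulate(data.begin(), data.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();

    // Build a single-cycle random permutation (Sattolo's algorithm) so the
    // chase below visits every element exactly once in random order.
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng{42};
    for (std::size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Pointer chase: each hop depends on the previous one and touches a
    // different cache line, so the rest of the burst is wasted.
    std::size_t p = 0;
    auto t2 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i) p = next[p];
    auto t3 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("sequential: %lld ms (sum=%lld)\n", ms(t0, t1), sum);
    std::printf("pointer chase: %lld ms (end=%zu)\n", ms(t2, t3), p);
}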
There is another type of DRAM called Reduced Latency DRAM (RLDRAM), from Micron, used in the networking industry. It throws away the now-antiquated multiplexed address structure from Mostek’s original mid-70s 4027 and goes back to a synchronous, SRAM-like architecture with address-to-data-out in 20ns, and it allows all 8 banks to operate concurrently since only one clock cycle is used to start each bank. That means a memory issue every 2.5ns if every clock at 400MHz can go to a different bank, with 8 clocks of latency. That means the caches on the other side can be orders of magnitude smaller.
The downside of RLDRAM is that it is really a multithreaded memory solution looking to be matched to a multithreaded processor design, the two designed together. We hear about MT in various forms all the time, such as Niagara, which is all well and good, but we don’t hear about MT on the memory side.
An FPGA Transputer I have in design runs 40-odd threads on 10 or so very simple PEs, each 4-way threaded (a lot like Niagara), but combines that with an MMU that uses RLDRAM’s ability to start a memory cycle every few ns. Every thread load or store is around 4-6 thread clocks (or 8x that in real clocks). These threads are effectively free of the Memory Wall plague, having replaced it with a Thread Wall problem. A paper at wotug.org describes it. Each thread may only give 100 Mips, but 40 of them with no Memory Wall looks pretty good. Of course it scales fairly well since the local caches are quite small.
Now, for single-threaded programmers, especially those who keep dredging up Amdahl’s law, one curse is replaced by another, but there are plenty of people out there who relish the thought of having hundreds of light threads instead of one fast one that’s always dragging.
For instance, in a PC situation I would like to see the entire graphics system go back to using the general-purpose processors: divide the screen into tiles with one thread per tile performing all the graphics. The RLDRAM easily has enough bandwidth to share among lots of smaller PEs, which can then turn around and be used for other parallel operations.
On the programming side I have proposed taking a subset of Verilog, a naturally parallel language used for chip design, and combining it with a smaller subset of C++. It seems odd that the world’s parallel programming community is almost entirely oblivious to the parallel languages used to describe hardware. They will say that software isn’t hardware, etc., so we needn’t bother to look. Ironically, all the embarrassingly parallel apps that one might throw money at to accelerate into hardware must be more like hardware after all. In the end, it is all CSP (communicating sequential processes), whether we dress it up and run it as code or synthesize it into hardware.
Transputer guy
(my 40 processes worth)
On the programming side I have proposed taking a subset of Verilog, a naturally parallel language used for chip design, and combining it with a smaller subset of C++. They will say that software isn’t hardware, etc., so we needn’t bother to look.
Yes, because what we need in this world is more hardware-oriented concepts polluting our programming languages.
Intel has the right idea — make the CPU wider, deeper, and more OOO. Sure, it’s inefficient, but who cares — transistors are cheap; programmers who can write parallel code are not.
I have proposed taking a subset of Verilog, a naturally parallel language used for chip design, and combining it with a smaller subset of C++.
This is exactly what I’m trying to do. But I feel this is too complex a task for my first compiler =]
For a first compiler it is too much of a jump; by the time you have done separate Verilog and C compilers, it will be more obvious. Unfortunately C++ and Verilog pull the C syntax in different directions and the semantics are too different. I would defer to C++, since Verilog has the smaller user base, even though Verilog’s syntax is cleaner.
What I propose for the n’th time is this.
Just as C++/C#/Java add methods and a good deal more to struct to get classes, do the same again: add concurrency to the class with live signal ports and call it a process. This follows much more closely the Verilog notion of a module as a process, but also takes quite a bit of the class machinery out; processes are now objects.
process pname (in foo,,, out bar,,, int i,,,) {
    <class methods & data stuff>
    <verilog always, initial statements>
    <verilog assign expressions>
}
The runtime is now basically a Verilog simulator engine; you could allow event-driven dataflow code mixed with standard C sequential flow. The idea is to improve on occam, which uses channels – use “wires” or “signals” instead.
Once you have this language, you can definitely write general-purpose sequential C code, but if the HW style is used, it can be synthesized into hardware with some slight syntax changes back to Verilog, i.e. using industry-standard synthesis. Many others who propose C-only solutions have to reinvent their own synthesis, which is not smart when it’s already done and more or less freely available.
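For what it’s worth, a rough C++ analogue of the “process with ports” idea might look like the sketch below. The Channel class is my own hypothetical stand-in for the proposal’s in/out signal ports; this only conveys the CSP flavour of composing communicating processes as ordinary threads – it is not the proposed language and certainly not synthesizable.

// Rough sketch: two "processes" joined by a channel, CSP style.
// Channel is a hypothetical stand-in for the proposal's in/out ports.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> receive() {  // blocks until a value arrives or the channel closes
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

// "process producer(out foo)": emits a few values, then closes its port.
void producer(Channel<int>& out) {
    for (int i = 0; i < 5; ++i) out.send(i * i);
    out.close();
}

// "process consumer(in foo)": reacts to each value arriving on its port.
void consumer(Channel<int>& in) {
    while (auto v = in.receive()) std::printf("got %d\n", *v);
}

int main() {
    Channel<int> wire;  // the "signal" connecting the two processes
    std::thread p(producer, std::ref(wire));
    std::thread c(consumer, std::ref(wire));
    p.join();
    c.join();
}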
Instead of “The Downside of Multicore”, the title should be “The Downside of Bad or Lazy Programmers”. Multicore is the best thing that has happened to computers since their beginning; the problem we face now is the inability or unwillingness of programmers to multithread their applications. DCC (Digital Content Creation) applications are a bright, good example of how programmers should develop their source code.
Instead of “The Downside of Multicore”, the title should be “The Downside of Bad or Lazy Programmers”. Multicore is the best thing that has happened to computers since their beginning; the problem we face now is the inability or unwillingness of programmers to multithread their applications.
Much bigger brains than you already know about the enormous complications with developing _properly_ multi-threaded code. http://www.lambda-the-ultimate.com is your friend.
Starting from “scratch” is an excellent approach when there are huge problems, or complications as you like to call them. If multicore technology is the only way to improve the processing speed of today’s computers and tomorrow’s, then applications, and even programming languages, should be rewritten to cope with this change. If either of them stands still, our computing future is doomed; and complaining about obstacles will never be the solution.
The biggest part of maximizing CPU performance has always been fine-grained task management by the end user. Uneducated users just let the CPU idle; nothing new here. The downside is not lazy programmers (they don’t exist), the downside is lazy users who don’t even know what to do with performance.
Feeding the CPU with a workload is (and has always been) a user task, not a programmer task; the programmer’s task is writing quality software.
There is no fundamental difference between multitasking and multicore: if the CPU idles, it’s basically user laziness.
It seems that you never heard of Task Manager.
I always check to see if a process is using both CPU cores, and I have ended up seeing very, very few applications that push the CPU to 100% on both cores; so whose problem is it, the user’s or the programmer’s? Of course the programmer’s. The only dominant way multicore is useful right now is multitasking, when you run many applications at once; maybe in the future we will see bigger benefits of multicore for a single running application.
Another good example: Windows and Linux are highly multithreaded OSs, which show big performance differences when running on multicore CPUs.
If you open Task Manager and enable the column showing the number of threads for each process, you will understand how lazy a lot of programmers out there are; big developers like Microsoft, Autodesk, Adobe, and others are an exception, as they are on the edge of technology.
I have two glasses and one is empty – well, that’s because of lazy wine producers. Good wine should fill all my glasses; I have bought all these glasses and I want them to be filled, or I want my money back.
Never should I fill a glass myself.
You complain the 2nd core is unused.
But think a minute: who uses the first core?
You do.
You run a program and it uses one core.
Run a second one, it will use the second core (or multitask if monocore).
Using every core only makes sense for inherently parallel tasks such as ray tracing, as POV-Ray SMP does.
It’s really not that easy. The DCC folks were handed parallelism on a silver platter; the rest of us have to work for it. Doing vertex transforms in parallel is almost stupidly easy. Doing matrix inversion in parallel requires reading some research papers to implement algorithms CS PhDs have devised. Doing dataflow analysis in parallel requires being a CS PhD with a couple of years of free time in which to find a parallel algorithm.
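To make the “stupidly easy” case concrete, here is a minimal hedged sketch – the vertex layout, matrix type, and chunking are my own illustrative assumptions, not anyone’s real DCC code – that splits a vertex transform across hardware threads. Each thread owns a disjoint slice of the array, so there is nothing to share and nothing to lock.

// Minimal sketch: embarrassingly parallel vertex transform.
// The Vec4/Mat4 layout and the chunking strategy are illustrative assumptions.
#include <algorithm>
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

struct Vec4 { float x, y, z, w; };
using Mat4 = std::array<std::array<float, 4>, 4>;

Vec4 transform(const Mat4& m, const Vec4& v) {
    return {
        m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + m[0][3]*v.w,
        m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + m[1][3]*v.w,
        m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + m[2][3]*v.w,
        m[3][0]*v.x + m[3][1]*v.y + m[3][2]*v.z + m[3][3]*v.w,
    };
}

// Each thread transforms a disjoint slice of the vertex array: no locks needed.
void transform_all(const Mat4& m, std::vector<Vec4>& verts) {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk = (verts.size() + n - 1) / n;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = std::min(verts.size(), lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([&m, &verts, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) verts[i] = transform(m, verts[i]);
        });
    }
    for (auto& th : pool) th.join();
}

int main() {
    Mat4 identity{};
    for (int i = 0; i < 4; ++i) identity[i][i] = 1.0f;
    std::vector<Vec4> verts(1000000, Vec4{1, 2, 3, 1});
    transform_all(identity, verts);
}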
Before I picked up my new laptop, I remember reading that in some instances (heavy integer operations) the PM was faster than the dual core. And doesn’t the PM have a bigger cache?
Cache: Yeah, they’re working on it. This has been a big thing for the K8 and the new Intel chips.
Main memory: DDR2 is getting to have lower latency than DDR, which got to lower latency than SDR, and so on. It’s not keeping up with the processors, but it’s not sacrificing much. Yes, it’s a major bottleneck, but based on current CPUs and memory (and heavily overclocked iterations of them), it’s a bottleneck that is being dealt with fairly well.
Software and parallelism: doesn’t this basically fall under the same domain of turning recursive algorithms into loops? Yeah, you’ve got to get your head around some new stuff, but when there is a real demand, as is beginning now, the devs will take it up and work with it.
Amdahl’s Law: I’ve not seen it called that, but isn’t this also well-known?
“Between Amdahl’s law and a limited front side bus…”
You know, even Intel is going to give that up one day. AMD did, and they sure aren’t missing it.
“In order to keep the power usage at the same level, every time you double the number of cores you need to halve their power consumption, that is not going to be easy.”
Yet somehow, even Intel has managed to get their DC parts using less power than older SC parts. Not bad.
“While I got that prediction wrong I still expect vendors of complex cores to switch to simpler designs simply because of power concerns.”
Because Core Duo is using so much power? And AMD has those smoking-hot Opteron EE chips.
Athlon64 chips have been using less power per unit of performance with each chip revision, and they’re soon to change processes.
Pentium 4 chips have been reducing their power use similarly, though still hot.
Turions, Opteron EEs, and the Pentium M and Core are offering excellent performance at low power. Even if quad-core versions of them used double the power of the duallies, they would still offer excellent performance per watt.
The future is bright (not without concerns or speed bumps, though), and the article is FUD.
We are now reimplementing everything we did in big iron in the 80s, but now in SOCs.
The only difference is that processors are so cheap now, who cares if you throw cycles on the floor?
I’m rather surprised no one has pointed out the rather effective use of co-processors to offload the CPU in modern systems. Of course, ‘co-processor’ is now spelled “GPU”, but the idea of a purpose-built co-processor is the one area that has paid off handsomely.
Maybe it’s time to revisit the IBM/360 I/O architecture for I/O performance offloading ideas and look into other forms of coprocessor specialization.
Speed for different core counts in an application that is 10% serial, 90% parallel:
1 core: 100 (1x)
2 cores: 55 (1.8x)
4 cores: 32.5 (3.1x)
8 cores: 21.3 (4.7x)
16 cores: 15.6 (6.4x)
1) First of all, where did the author get these stats?
These look like the OLD Intel P4 dual-CPU stats. I think the new Core Duo numbers will be much better.
2) Java’s been multi-threaded for years. Where have you guys been? You should have learned Java and taken those concepts back to C/C++ – no wait, in C and C++ you can also write multi-threaded apps! OMG! It must be the programmer!
Places you will see performance improvement:
Databases
IDEs
Web server apps
All Java apps
Probably C# apps as well
>1) First of all, where did the author get these stats?
*Sigh*
He computed them from Amdahl’s law, which says that even with a small serial percentage, as the number of CPUs grows, the serial part of the code limits the acceleration.
This is hardware-independent, i.e. you cannot do better than Amdahl’s law (not totally true – sometimes other effects such as cache size come into play – but it gives the trend).
Google it, or even better read Hennessy and Patterson’s book on CPUs instead of reacting like this; you might learn something (though it’s not an easy book).
Alas, most of this article is FUD.
To sum up the article pretty much says “there are going to be changes, you need to adapt”.
How is that FUD ?
—-
First of all, where did the author get these stats?
These look like the OLD Intel P4 dual-CPU stats. I think the new Core Duo numbers will be much better.
Amdahl’s law – it includes a formula.
They are the maximum possible performance scaling level for any application which has 10% serial code. The formula is:
S + (100-S)/N
S= % of code which is serial
N = number of cores
The speedup % is 1 divided by the above.
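For anyone who wants to check the arithmetic, here is a small C++ sketch that evaluates Amdahl’s law with the serial part written as a fraction, the equivalent of the percentage form above (i.e. 100 divided by (S + (100-S)/N)). It reproduces, up to rounding, the table quoted earlier in the thread for 10% serial code.

// Amdahl's law: speedup(N) = 1 / (s + (1 - s)/N), with s the serial fraction.
// Equivalent to 100 / (S + (100 - S)/N) when S is given as a percentage.
#include <cstdio>
#include <initializer_list>

double amdahl_speedup(double serial_fraction, int cores) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}

int main() {
    const double s = 0.10;  // 10% serial, 90% parallel
    for (int n : {1, 2, 4, 8, 16}) {
        const double speedup = amdahl_speedup(s, n);
        // "Speed" as in the earlier table: time relative to 100 on one core.
        std::printf("%2d cores: time %.2f, speedup %.1fx\n", n, 100.0 / speedup, speedup);
    }
}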
—-
I’m rather surprised no one has pointed out the rather effective use of co-processors to offload the CPU in modern systems. Of course, ‘co-processor’ is now spelled “GPU”, but the idea of a purpose-built co-processor is the one area that has paid off handsomely.
It is mentioned in the article briefly, this is also what AMD are doing with their “accelerators”.
—-
From the sound of the architecture specifics mentioned in the article, it seems to be talking about Intel CPUs.
No, these problems are going to impact all conventional CPU designs from all vendors.
—-
Intel has the right idea — make the CPU wider, deeper, and more OOO. Sure, it’s inefficient, but who cares — transistors are cheap
The ever-wider OOO approach has run out of steam; that’s why everyone is going multicore. Transistors are cheap but power is not. Intel’s new architecture isn’t inefficient – quite the contrary, it’s very efficient. Even then the performance gain isn’t that great, and a big chunk of that gain comes from an improved memory bus.
programmers who can write parallel code are not.
Wanna make some serious money? Learn parallelisation.
Many server type apps are already parallel (e.g. Java/J2EE stuff) but desktop stuff isn’t, it’ll need to change and that’ll require good programmers…
> The ever-wider OOO approach has run out of steam; that’s why everyone is going multicore.
AMD’s Reverse-HT is an OOO and multi-core hybrid.
> The ever-wider OOO approach has run out of steam
Not quite; e.g. refer to AMD’s K8L, i.e. doubled FP units and thus wider FP OOO.
The ever-wider OOO approach has run out of steam; that’s why everyone is going multicore.
Not really. Everyone’s going multi-core because they can (the transistor budget gets larger faster than other things), but OOO, and single-threaded performance in general, has hardly run out of steam. With Conroe, Intel is going to be pushing 3000 SPECint at 3.3 GHz. That’s about double the performance of a 2.5 GHz G5, a CPU that predates it by only about a year. Intel got a 15% performance boost over Yonah, which was already an integer monster. And that’s just with conventional designs. There are a lot of ideas in the pipeline that could push OOO even further. A short summary of the more interesting ones I’ve read about recently:
– Checkpoint recovery architectures. Based on the idea of checkpointing the CPU state at hard-to-predict branches and then replaying from the nearest checkpoint on mispredictions or faults, this allows the processor to keep effectively hundreds or thousands of instructions in flight, compared to the dozens of instructions in flight in modern processors.
– Wakeup-free schedulers. These allow pipelining of the critical schedule step, removing a significant bottleneck in frequency scaling. There are also scheduler designs that allow drastic increases in the size of the issue queues while lowering the delay within the scheduler and thus decreasing the cycle time, and others that segregate long-latency operations and their dependents into a separate “L2 scheduler”, allowing effectively hundreds of instructions to be waiting in the scheduler at any given time.
– Hierarchical load-store queues. The load-store queue is a significant bottleneck in cycle time, and bigger load-store queues can also substantially increase program performance. Hierarchical load-store queues allow for multiple levels of LSQs (like multi-level caches), which can reduce cycle time as well as increase IPC.
– Segmented register files. The large multi-ported register files required to support modern superscalar CPUs are an impediment to both cycle time and further superscalarity. Segmented register files allow these critical structures to be substantially simplified, with a substantial decrease in cycle time and a minimal decrease in IPC.
And these are just improvements on the existing models of OOO. I’m not even going to get into the potential of things like priority-queue versus register-oriented CPUs. And that’s completely discounting process improvements. One of the big impediments to wide superscalar CPUs today is that wire delays aren’t keeping pace with transistor switching speeds. Who is to say that this trend will continue? As process technology advances, the crucial bottlenecks in the designs will change as well.
Wanna make some serious money? Learn parallelisation.
Many server type apps are already parallel (e.g. Java/J2EE stuff) but desktop stuff isn’t, it’ll need to change and that’ll require good programmers…
I’ve worked on moderately parallelized codebases before (40 threads per node, several nodes per machine, a couple of dozen machines per network), and it’s not fun. Even when there is a natural unit of concurrency, as there was in that case, it’s a drain on productivity.
Poor programmers aren’t even the beginning of the problem. The problem is the lack of analytic tools for concurrent programming. When good programmers sit down to solve a problem, they try to formalize it. They’ll say, “oh, this can be modeled as a shortest-path problem on a graph, and I can use Dijkstra’s algorithm to solve it.” Unfortunately, even good programmers just cannot apply that sort of analysis to most parallel algorithms. In many cases, a parallel algorithm may exist for a task, but it is very complex or not well known. In other cases, a parallel algorithm does not exist for the task, and then the programmer is stuck using ad hoc methods.
I’m not saying that parallel programming will never happen. What I’m saying is that single-threaded performance has a long way to go, and as long as it does, the incentive to parallelize hardware and software is minor outside of fields that either really need the performance or in which parallelization is very simple. Eventually, computer science will evolve to the point where a parallel version of Dijkstra’s algorithm is as familiar to a reasonably educated programmer as the classic sequential one. Eventually, it’ll reach a point where there is a mature formal theory of concurrent computation, one that is taught to CS majors just as the lambda calculus is taught to them today. However, that time is still a ways away.
Intel, AMD, and the rest realize that software is going to drive the parallelization of hardware, not the other way around. That’s why even the latest “multi core trend” involves 2 100mm^2 CPUs on a die instead of 20 10mm^2 ones. Basically, Intel, AMD, and IBM are taking advantage of the fact that for any given process, the optimal core die size is limited by the trade-off between IPC and cycle time. Thus, they maximize single-threaded performance per core, and only then put multiple cores on the die because they have die-space left over. That’s not “going multicore because there is no other way”, that’s “going multicore because you can”.
The speedup % is 1 divided by the above.
Oops! “%” should not be there.