Linked by Thom Holwerda on Sat 11th Mar 2006 21:23 UTC
Thread beginning with comment 103878
Member since:
2005-07-06
Languages with automatic parallelism extraction: NESL, LUCID, GLUT, Scout, several flavours of FORTRAN, some visual languages and several more.
The things you mention strike me as being a lot like APL. They don't extract parallelism from sequential code, but provide constructs that can be parallelized without resorting to a threading model. I think those designs have a lot of merit, but what I was talking about was more along the lines of IBM's Octopiler compiler. However, I don't know how well Cell will deal with them. Context-switching the SPE is rather slow, so Cell was really meant for something more like an agent-oriented or producer-consumer model.
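To make the producer-consumer shape concrete, here is a minimal sketch. It is plain Python with an ordinary thread and a bounded queue, not Cell code; the names (`producer`, `consumer`, `run_pipeline`, `SENTINEL`) are made up for illustration. The point is only that the worker stays parked on one job and is fed data, rather than being switched between tasks.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def producer(out_q, items):
    """Feed work items into the queue, then signal end of stream."""
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)


def consumer(in_q, results):
    """Long-lived worker: stays on one job and drains the queue until the sentinel."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # stand-in for the real compute kernel


def run_pipeline(items):
    """Run one producer (the calling thread) and one consumer thread."""
    q = queue.Queue(maxsize=4)  # small bounded buffer between the two stages
    results = []
    worker = threading.Thread(target=consumer, args=(q, results))
    worker.start()
    producer(q, items)
    worker.join()
    return results
```

With a single consumer the results come back in input order, so `run_pipeline([1, 2, 3])` gives `[1, 4, 9]`.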
What's the problem with these numbers? These are quite common for >3 GHz CPUs, except the 2-issue width. Which is not so bad, as issue != IPC, and IPC < issue (always).
The numbers are horrible! I don't know where you find that these are common for 3 GHz CPUs. The only ~3 GHz processor with numbers that bad is the P4, and at least its L1 cache latency and simple integer latency are good. Not to mention the fact that it's highly OOO and has a giant branch predictor table.
Other ~3 GHz CPUs, like the Opteron and G5, have much better numbers. The Opteron's branch misprediction latency is 11 cycles, the L1 cache latency is 3 cycles, the L2 is 15 cycles. The G5's branch misprediction is 15 cycles, the L1 cache is 3 cycles, the L2 is 13 cycles. Both have much bigger branch predictors.
The local memory is not an L1, obviously. You can NOT miss any access to a local memory, you never get a penalty.
The 6-cycle penalty is for load-to-use. Since there is no L1 cache, that's the minimum latency for any memory operation, and is thus comparable to the L1 latency in a cached architecture. Even IBM's docs admit that this penalty is quite bad for integer code.
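A rough back-of-the-envelope for why load-to-use latency matters in pointer-heavy code, using only the latencies quoted in this thread (6 cycles for the SPE local store, 3 cycles for the Opteron/G5 L1). The list length is a made-up number, and the model assumes every access hits and nothing overlaps the chain, so treat it as a lower bound rather than a measurement.

```python
def dependent_load_cycles(n_loads, load_to_use_latency):
    """Lower bound on cycles for a chain of loads where each address
    comes from the previous load (e.g. walking a linked list).
    Assumes every access hits and nothing else overlaps the chain."""
    return n_loads * load_to_use_latency


SPE_LOCAL_STORE = 6  # cycles, figure quoted in this thread
CACHED_L1 = 3        # cycles, Opteron/G5 L1 figure quoted in this thread

nodes = 1000  # hypothetical list length
spe = dependent_load_cycles(nodes, SPE_LOCAL_STORE)
cached = dependent_load_cycles(nodes, CACHED_L1)
print(spe, cached, spe / cached)  # prints: 6000 3000 2.0
```

A dependent chain like this cannot hide the latency behind other work, which is why the comparison with an L1 hit is the relevant one.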
This model has been used in many consoles (PS1, PS2, Dreamcast, GBA, Xbox360) and I have programmed for it several times. It's extremely effective.
None of those things are meant to run integer code with any speed, and at least some of those have an L1 cache with less than a 6 cycle latency!
The 18 cycle misprediction is the same or smaller than a Pentium4 (except you are getting 8 CPUs for the price of one).
Yes, but at least the Pentium 4 has enormous branch prediction resources to mitigate it. On an SPE, you have 3 cycles up front for ANY branch, and if it's not statically predictable, you pay 18 cycles every single time. Something like AI code will spend all its time paying branch penalties.
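To put rough numbers on that, here is a small sketch. The penalties (18, 11 and 15 cycles) are the ones quoted in this thread; the branch count and the misprediction percentages are invented purely for illustration, not measured, and the 3-cycle up-front cost is not modelled.

```python
def mispredict_stall_cycles(branches, mispredict_percent, penalty_cycles):
    """Cycles lost to mispredicted branches (integer arithmetic throughout)."""
    return branches * mispredict_percent * penalty_cycles // 100


BRANCHES = 200_000  # hypothetical: branches in a run of branchy integer code

# Penalties are from the thread; the miss percentages are assumptions.
stalls = {
    "SPE (unpredictable branches, assumed 50% wrong)": mispredict_stall_cycles(BRANCHES, 50, 18),
    "Opteron (dynamic predictor, assumed 5% wrong)": mispredict_stall_cycles(BRANCHES, 5, 11),
    "G5 (dynamic predictor, assumed 5% wrong)": mispredict_stall_cycles(BRANCHES, 5, 15),
}

for name, cycles in stalls.items():
    print(f"{name}: {cycles} stall cycles")
```

Under these assumed rates the similar-looking penalty is not what hurts; the difference comes almost entirely from how often it gets paid.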
When I talk about HLL, I talk about languages tailored to create parallelizable programs without doing low level grunt work. Languages which support structures like arrays and sets natively.
I am not talking about languages with run time type checking and things like that, holy jesus!
So why did you reply to my original post? The whole point of my post was that the Cell might very well be nice to program if you're used to working with the metal, but makes no sense for the type of programming most people are doing. Even people who program in C spend most of their time writing rather abstract code. On an SPE, even something basic like a virtual function call in C++ becomes an extremely expensive operation.
mmm? Prefetching is useless in integer code? What are you talking about? Prefetching is useful in any code which can request data beforehand, especially with sequential access.
Maybe you mean useless in code with frequent branching...
When I say integer code, I don't mean a triangle rasterizer. I mean conventional integer programs (what you'd find in SPECint). That means lots of branches and random load/store.
OOO is a silly patch for a bad software habit. Wasting 40 million transistors to execute 3 instructions/cycle (on a good day) is anti-economical.
Transistors are cheap. Billable programmer hours aren't.
I respectfully disagree.
A serious compromise is having to translate x86 code (with several extensions) in an already large pipeline.
The overhead of x86 translation is ~5% on modern CPUs. A few million transistors here or there, who cares?
The problems the Cell will face are very similar to the ones a multicore x86 will face: memory synch, memory bandwidth, the speed of concurrency mechanisms, and the inability of current languages to automatically parallelise the code.
They'll face it in an entirely different way. Multi-core x86 will be able to attack the problem with high single-thread performance cores (so you have far less parallelism you need to extract), and with the benefit of truly high-level languages. Cell programmers will have to deal with a primitive in-order architecture in addition to all the concurrency headaches. x86 programmers will get to work with a shared memory model. Cell programmers will have to deal with the obscure local memory model. The basic problems will be similar, but x86 programmers will get a lot more help from the hardware in dealing with them.
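A toy illustration of the difference between the two memory models, for anyone who hasn't worked with a local store. This is a simulation in plain Python, not a real Cell API: `dma_get` and `LOCAL_STORE_WORDS` are made-up names, and the local store size is tiny on purpose.

```python
LOCAL_STORE_WORDS = 4  # hypothetical capacity, tiny on purpose


def dma_get(main_memory, start, count):
    """Stand-in for an explicit transfer from main memory into the local store."""
    return main_memory[start:start + count]


def sum_shared(main_memory):
    """Shared-memory style: just touch the data; the caches do the staging."""
    return sum(main_memory)


def sum_local_store(main_memory):
    """Local-store style: the program itself moves data in chunks that fit."""
    total = 0
    for start in range(0, len(main_memory), LOCAL_STORE_WORDS):
        local = dma_get(main_memory, start, LOCAL_STORE_WORDS)
        total += sum(local)
    return total


data = list(range(10))
print(sum_shared(data), sum_local_store(data))  # prints: 45 45
```

Both give the same answer; the second version just has to spell out the data movement that a cached, shared-memory machine does behind your back. For a sequential sum that is easy, but for random access it is up to the programmer to work out what to fetch and when.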