Big Blue announced it will use the Cell processor it designed with Sony and Toshiba in a range of high-end servers. The target is applications that need serious processing power and can also render graphics. The first systems will become available in the third quarter of this year.
I eagerly followed the development of Mac OS’s preparation for a chip like the Broadband Engine, years before it existed or had a name. All the way back to the early days of QuickTime it was clear the single-core MHz race would come to a crashing halt and the future would be in Cell-like designs.
And now that it has arrived, Apple’s (Jobs’, actually) incompetence in dealing with IBM has locked them out of the chip they could have leveraged better than any other computing company in the world. When IBM dumped Apple last year, I’m afraid it was for good. I doubt IBM would take Apple back as a customer no matter how hard Jobs begged.
I don’t know whether to laugh or cry when I read about Apple trying to get people excited about Viiv from Intel.
And one more thing… spare us the “Apple looked at Cell and blah, blah, blah” damage-control mantra.
The chip business isn’t run by a bunch of whiny kids who hold grudges; these are businesses that will make money where they can. If using Cell was profitable for both Apple and IBM I’m sure it would have happened.
BTW, Apple isn’t promoting Viiv.
The single-core paradigm has existed for decades and will continue to exist for the foreseeable future. To make architectures like Cell usable in the general case, we’re going to need a revolution in computer science theory, one that brings the theory of concurrent computation to the same level of maturity as the current theories of sequential computation.
To adapt an oft-used phrase: “it’s the programmers, stupid!” It’s the software that makes hardware necessary and useful at all. Over the past decades, hardware has become more general-purpose in order to make software easier to write. Programmers have started taking advantage of the flexibility of modern hardware by adopting higher-level languages and programming techniques that reduce their work. A move back to more special-purpose designs is a step backwards, not a step forwards.
From a programmer’s point of view, Cell sucks. No out-of-order execution? You mean I have to care about code scheduling again? That’s so 1990s! Crappy branch prediction? You mean I have to start unrolling loops again? Heck, even loops are becoming passé — how is Cell going to handle recursive lambdas? Two-issue? How is a two-issue processor going to handle languages that result in lots of type checks and null pointer checks? The SPEs don’t even do dynamic branch prediction — how well are they going to handle a language in which every statement results in a function call? And what the heck is this “local store”? I don’t care where my objects get allocated, as long as the garbage collector cleans up afterwards! On top of all that — how antiquated is the idea of tying your code to a specific processor? How is a Cell-oriented program going to run on an Opteron or a Pentium-M?
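For anyone who hasn’t had to do it in a while, “unrolling loops again” looks roughly like this. A minimal sketch with made-up function names: the kind of hand-holding an in-order core without branch prediction pushes back onto the programmer (or the compiler).

    /* Straightforward loop vs. a hand-unrolled version that exposes four
       independent multiplies per iteration to an in-order pipeline. */
    void scale(float *dst, const float *src, float k, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }

    void scale_unrolled(float *dst, const float *src, float k, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            dst[i]     = src[i]     * k;
            dst[i + 1] = src[i + 1] * k;
            dst[i + 2] = src[i + 2] * k;
            dst[i + 3] = src[i + 3] * k;
        }
        for (; i < n; i++)              /* leftover elements */
            dst[i] = src[i] * k;
    }

    int main(void) {
        float src[5] = {1, 2, 3, 4, 5}, dst[5];
        scale_unrolled(dst, src, 2.0f, 5);
        return dst[4] == 10.0f ? 0 : 1;
    }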
Beyond that, what programming paradigms is Cell well-suited for? On a traditional CPU, I can write everything from functional, to object-oriented, to agent-based code. On Cell, all I get is producer/consumer, and maybe agent-based if I can keep the message-passing to a minimum. Everything else will perform like crap. How many programmers are going to want to fit their round programs into the square hole of the producer-consumer model? Would you like to be the guy at Adobe forced to rewrite Photoshop to adapt its algorithms to a glorified stream processor?
It seems to me that the people who see Cell as the second coming of Christ aren’t programmers. Which is telling, because ultimately it’s the programmers who will get to decide whether Cell becomes a success outside the embedded market. Given that Cell increases how much work they have to do by an order of magnitude, they can kill it simply by ignoring it. How useful will Cell be in the desktop/workstation space if nobody codes specially for it?
Single-threaded software design is dead. Processors for the foreseeable future are not going to increase dramatically in single-core performance, so any design that starts off now expecting to operate in an unthreaded environment in the future is a waste of money. What isn’t dead is the maintenance of single-threaded codebases, which will continue for decades to come in much the same way as legacy code from previous decades still operates happily today.
The work on concurrent computing doesn’t have to just start now, because it’s been happening for a long time (CSP, CCS, pi-calculus just for example topics to start with). What hadn’t happened was mass-market adoption. Programmers write single-threaded code because they are tied to antiquated programming languages that make threading absurdly difficult to manage without introducing errors, and because the mass-market PC hardware remained mostly uniprocessor (because most mass-market software remained single-threaded…).
The lack of OoO execution and absence of branch prediction doesn’t really impact programmers except for compiler authors. All of the high-level languages you allude to have no guaranteed means of expressing instruction order, or expressing how branches are dealt with. It obviously has performance tradeoffs, but it’s not the source of frustration for development. The real source of frustration is the tradeoff between obtaining any of the performance benefits of the Cell’s design and having code that is generally portable. Making use of the SPEs transparent will involve a lot of tool work and leave the SPEs as batch processors for performing specific jobs. That’s fine (that’s what we use GPUs for, after all), but when you pair it with the otherwise mediocre performance of the rest of the processor, what’s the point of using one of these processors?
Single-threaded software design is dead.
Except most software doesn’t take very good advantage of more than a few cores. Even on OS X, where SMP has been the standard for years, most apps are not highly multithreaded.
Processors for the foreseeable future are not going to increase dramatically in single-core performance, so any design that starts off now expecting to operate in an unthreaded environment in the future is a waste of money.
Conroe is going to see a very significant increase in single-threaded performance. The clock-speed stagnation appears to be receding into the past. If the projected 3 GHz can be achieved at introduction, scaling close to 4 GHz should be possible at 65nm. And 45nm is just around the corner. What isn’t going to happen in the foreseeable future is a Cell-like increase in the number of cores. Dual-core is just starting to propagate in the mainstream, and quad-core likely won’t appear in the mass market for years. Yes, taking full advantage of dual-core CPUs is going to require work, but it’s not going to be the same level of effort as an 8-CPU Cell design, or designs with even more cores. Basic locking is “good enough” for a 2-core CPU. It’s when you get into 8+ cores that multithreading becomes really hard, and you have to start doing fancy synchronization stuff like RCU.
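To make “basic locking” concrete, here is a minimal pthreads sketch (names are illustrative) where one coarse mutex protects the shared state. With two cores the contention on that single lock is tolerable; with eight or more it serializes everything, which is exactly when the fancier techniques start to matter.

    /* Compile with e.g. cc -pthread. Two threads bump a shared counter
       under a single coarse-grained mutex. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);     /* every access serializes here */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        printf("%ld\n", counter);          /* 2000000 */
        return 0;
    }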
The work on concurrent computing doesn’t have to just start now, because it’s been happening for a long time (CSP, CCS, pi-calculus just for example topics to start with). What hadn’t happened was mass-market adoption.
These ideas have been around for a while, but are not nearly as mature as good old lambda calculus. There is a large body of existing theory on the design of sequential algorithms. The body of theory regarding parallel algorithms is minuscule by comparison. Moreover, the body of practical experience with highly concurrent languages is small compared to the body of practical experience with sequential languages. At the end of the day, parallel languages and parallel programming won’t take off until some variant of the pi calculus is taught in every CS program, right alongside quicksort.
The lack of OoO execution and absence of branch prediction doesn’t really impact programmers except for compiler authors.
That’s the theoretical consideration. In reality, there is an engineering tradeoff between the sophistication of the high-level optimizer and the sophistication of the code generator. With limited programmer resources, it makes far more sense to use highly OOO processors (which exist) than to make super compilers (which don’t). At the end of the day, the programmer is forced to deal with that engineering tradeoff, in the form of the performance characteristics of his favorite compilers.
All of the high-level languages you allude to have no guaranteed means of expressing instruction order, or expressing how branches are dealt with.
High-level languages tend to result in code with high branch density, lots of loads/stores, and control flow that is difficult to analyze statically. This means that a good branch predictor and a deep OOO instruction window help performance enormously.
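To put a rough picture on that claim: here is what one innocuous line of managed code expands to once the runtime’s implicit checks are spelled out. The structs and function names below are invented for illustration, not any particular VM’s internals; the point is just how many branches and loads hide in a “single statement.”

    /* Roughly what  n = a[i].length  costs once the implicit checks are
       written out by hand. */
    #include <stdlib.h>

    struct string { int length; };
    struct string_array { int count; struct string *items[]; };

    static void trap(void) { abort(); }   /* stand-in for throwing an exception */

    int element_length(struct string_array *a, int i) {
        if (a == NULL) trap();                  /* implicit null check   -> branch */
        if (i < 0 || i >= a->count) trap();     /* implicit bounds check -> branch */
        struct string *s = a->items[i];         /* load */
        if (s == NULL) trap();                  /* another null check    -> branch */
        return s->length;                       /* load */
    }

    int main(void) {
        struct string hello = { 5 };
        struct string_array *a = malloc(sizeof *a + sizeof(struct string *));
        a->count = 1;
        a->items[0] = &hello;
        return element_length(a, 0) == 5 ? 0 : 1;
    }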
It obviously has performance tradeoffs, but it’s not the source of frustration for development.
Programs have certain performance requirements. If high-level languages don’t perform well on a given processor, that means lower-level ones will have to be used, which results in frustration during development.
Except most software doesn’t take very good advantage of more than a few cores.
A nontrivial chunk of Win32 software is multithreaded. The large majority of it is not pervasively concurrent, because it was designed to operate most efficiently in a uniprocessor environment, and even more often it was simply moved piecemeal to a multithreaded design by logical partitioning of existing code. PC software in general won’t take advantage of more than a few cores, because it hasn’t been able to, due to the practical limitations of >4-CPU x86 systems. It’s also not particularly interesting to observe that most software available in such a situation isn’t even designed to operate on more than one processor, because “most software” for the platform comprises decades of development. In ten years “most software” wouldn’t be pervasively concurrent even if new development started today with concurrency in mind.
Even on OS X, where SMP has been the standard for years, most apps are not highly multithreaded.
And again, this is another uninteresting property. Most software for OS X is aged in origin. There is software, especially CPU-bottlenecked software, that is multithreaded and does take advantage of concurrency for a noticeable performance improvement.
Conroe is going to see a very significant increase in single-threaded performance.
Conroe is a dual-core processor. Whatever increase it sees over Yonah or Presler in single-threaded performance, it is going to see an even bigger one in multi-threaded performance.
If the projected 3 GHz can be achieved at introduction, scaling close to 4 GHz should be possible at 65nm.
You can’t project a 1 GHz clock increase on the same process for a processor that hasn’t even been released yet. There has been plenty of disappointment all around in clock scaling, and little evidence of improvement without making sacrifices. Is there a 1/3-overclocked Yonah or Presler with air cooling somewhere I should look at?
And 45nm is just around the corner.
And through the woods.
What isn’t going to happen in the foreseeable future is a Cell-like increase in the number of cores.
So what? Cell gets its core count by making most of its cores vector processors attached to a small local memory. That doesn’t make single-threaded development a dead end. Every single non-budget processor (not Millville, for example) in Intel’s roadmap is multicore, along with every other major processor vendor’s.
Dual-core is just starting to propagate in the mainstream, and quad-core likely won’t appear in the mass market for years.
Intel will be selling quad-core desktop processors with Kentsfield next year. Hypothetically, anyway.
Basic locking is “good enough” for a 2-core CPU. It’s when you get into 8+ cores that multithreading becomes really hard, and you have to start doing fancy synchronization stuff like RCU.
RCU has nothing to do with making multithreading easy; it provides certain performance properties in time and space. STM is a much better example for dealing with the conceptual complexity of reasoning about concurrent programs.
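For the curious, the performance point looks roughly like this. A deliberately simplified, RCU-flavoured sketch using plain C11 atomics: the names (struct config, read_timeout, update_config) are invented, and the grace-period machinery real RCU implementations provide for safely reclaiming old copies is left out entirely.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct config { int timeout_ms; int retries; };

    static _Atomic(struct config *) current;

    int read_timeout(void) {
        /* Read side: one atomic load, no lock, scales with the reader count. */
        struct config *c = atomic_load_explicit(&current, memory_order_acquire);
        return c->timeout_ms;
    }

    void update_config(int timeout_ms, int retries) {
        /* Update side: allocate a new copy, fill it in, publish it. */
        struct config *fresh = malloc(sizeof *fresh);
        if (!fresh) abort();
        fresh->timeout_ms = timeout_ms;
        fresh->retries = retries;
        struct config *old =
            atomic_exchange_explicit(&current, fresh, memory_order_release);
        /* Real RCU waits for pre-existing readers before freeing 'old';
           this sketch simply leaks it. */
        (void)old;
    }

    int main(void) {
        update_config(1000, 3);            /* publish before any reader runs */
        return read_timeout() == 1000 ? 0 : 1;
    }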
These ideas have been around for a while, but are not nearly as mature as good old lambda calculus.
Nothing after the untyped lambda calculus will ever be as mature in the sense of age (the 1930s for the untyped lambda calculus vs. 1973 for the actor model, for comparison). In the sense that Milner provided an encoding of the lambda calculus within the pi-calculus, you can continue to reason about programs in the limited parallelism the lambda calculus provides, or in the extended nondeterminism provided by any number of extensions of the lambda calculus with things like futures.
Since most programmers don’t actually rely on any formalisms for reasoning about their programs, this is all academic. Which is interesting for me, but has little applicability for the Java programmers writing concurrent programs interfacing with concurrent databases on concurrent operating systems.
At the end of the day, parallel languages and parallel programming won’t take off until some variant of the pi calculus is taught in every CS program, right alongside quicksort.
Presumably you don’t actually mean at the same time, since the lambda calculus proper isn’t studied alongside the quicksort algorithm in the CS curriculum. You could easily teach a parallel quicksort alongside the normal implementation, but whether that is a good idea or not is not immediately obvious. Universities offer a number of classes covering topics in parallel computing. Given the degree to which undergraduates already seem to lose the information they should be acquiring in early data structures and algorithms classes, mixing parallel and sequential algorithms into the same time period will simply push other things out. The real bottleneck for wider adoption of concurrent programming is tools. The most mainstream pervasively-threaded programmers are Java and C# programmers, and their threading model is difficult to use without introducing errors while retaining scalability.
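For what it’s worth, the parallel version next to the sequential one isn’t much extra material. A minimal sketch using OpenMP tasks; the 1000-element cutoff is an arbitrary illustrative threshold, not a tuned value, and you’d compile with something like cc -fopenmp.

    #include <stdio.h>

    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(&a[i++], &a[j]);
        swap(&a[i], &a[hi]);
        return i;
    }

    static void quicksort_seq(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quicksort_seq(a, lo, p - 1);
        quicksort_seq(a, p + 1, hi);
    }

    static void quicksort_par(int *a, int lo, int hi) {
        if (hi - lo < 1000) {               /* small subarray: stay sequential */
            quicksort_seq(a, lo, hi);
            return;
        }
        int p = partition(a, lo, hi);
        #pragma omp task                    /* sort the two halves concurrently */
        quicksort_par(a, lo, p - 1);
        #pragma omp task
        quicksort_par(a, p + 1, hi);
        #pragma omp taskwait
    }

    int main(void) {
        int a[8] = {5, 2, 7, 1, 9, 3, 8, 4};
        #pragma omp parallel
        #pragma omp single
        quicksort_par(a, 0, 7);
        for (int i = 0; i < 8; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }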
That’s the theoretical consideration. In reality, there is an engineering tradeoff between the sophistication of the high-level optimizer and the sophistication of the code generator.
In reality, IBM’s compiler group has already provided significant performance improvements on the Cell: the voodoo of loop unrolling, using conditional instructions, and scheduling for its merely two-issue cores. You’ll have to forgive my sarcasm.
At the end of the day, the programmer is forced to deal with that engineering tradeoff, in the form of the performance characteristics of his favorite compilers.
At the end of the day, the programmer has little say over what his code runs on. He actually has little say about what compiler he uses, and less about what language he uses. The pervasively used “high-level languages” are interpreted scripting languages with poor computational characteristics because of their simplistic implementation strategies. Next in line are “not really high-level high-level languages” like Java and C#, where compiler complexity is folded into the quality of a given platform’s JIT support, which picks off as much low-hanging local optimization as it can with respect to branching and inlining. Then come C++ and C, and it’s all downhill from there in terms of actual usage. Interesting languages are essentially irrelevant, not that the implementation concerns differ significantly.
None of this has anything to do with the discussion at hand, though. General-purpose code in high-level languages has no concern over OoO execution by the processor. It’s not generally performance-critical, and either it will perform well on the platform given the available compilers or that’s just something people using that build have to live with (for example the original Pentium, which very many people managed to live with for quite some time in an era of OoO processors). If it’s performance-critical, then it ceases to be general-purpose.
High-level languages tend to result in code with high branch density, lots of loads/stores, and control flow that is difficult to analyze statically.
“Bad” code runs badly on any processor, causing expensive stalls, limiting ILP, and making abysmal use of prefetching potential. The degree to which it can be made to run less badly varies, but poor code hurts the Pentium 4 despite its implementation strategy. That’s part of the whole purpose of profile-driven optimization, or, probably more sensibly long-term, of projects like Dynamo.
I’ve been an embedded/console/STB programmer for years, and for me the Cell is a very tasty piece of hardware. Simple, fixed architecture with maximum raw power. Local store. These words are music to my ears =)
Cell is optimized for high-performance media crunching, not for a language where “every statement results in a function call” or even for a garbage collector.
Elephants can’t fly.
Anyway, nothing can stop IBM from adding OoO to the PPE or improving branch prediction in the SPEs.
> how antiquated is the idea of tying your code to a specific processor?
Haha. So we have software that does nothing but burn CPU cycles.
Tying your code, for example, to P4/SSE2 is an antiquated idea. Don’t do it, guys. Use 386+x87 and Intel will do all the work for you!
Anyway, nothing can stop IBM from adding OoO to the PPE or improving branch prediction in the SPEs.
a) Today’s Cell doesn’t have those things.
b) If it did have them, it couldn’t have as many SPEs (at the same price).
Tying your code, for example, to P4/SSE2 is an antiquated idea. Don’t do it, guys. Use 386+x87 and Intel will do all the work for you!
With those, as with other evolving processor lines, it’s only a question of compiler switches and possibly some assembler-coded routines.
With Cell it’s a question of tying your whole data model and algorithms to its particular local-memory arrangement.
Also, apps won’t be able to take advantage of future Cells with more SPEs or bigger local memory unless programmers put in extra effort now to design their apps in such a way that they can adapt automatically.
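The “adapt automatically” part doesn’t have to be exotic; the simplest form is sizing your worker pool from whatever the runtime reports at startup rather than hard-coding a count. On Cell you would ask the SPE runtime how many SPEs you actually got; here a generic POSIX-ish query stands in for that. A sketch, nothing more:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Ask the system how many processors are online right now. */
        long workers = sysconf(_SC_NPROCESSORS_ONLN);
        if (workers < 1)
            workers = 1;                  /* fall back to a single worker */
        printf("partitioning work across %ld workers\n", workers);
        /* ...split the data set into 'workers' chunks and hand them out... */
        return 0;
    }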
it’s only a question of compiler switches and possibly some assembler-coded routines.
Will SSE3-specific code run on anything else? So that code is tied to one CPU.
SIMD code may require rearranged data and different algorithms than scalar code.
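A small sketch of what “rearranged data” means in practice: the same “scale every x coordinate” job against an array-of-structures layout and against a structure-of-arrays layout that SSE can consume four floats at a time. The struct names, and n being a multiple of 4, are just assumptions for the example.

    #include <xmmintrin.h>

    struct vertex_aos   { float x, y, z, w; };        /* scalar-friendly layout */
    struct vertices_soa { float *x, *y, *z, *w; };    /* SIMD-friendly layout   */

    void scale_x_aos(struct vertex_aos *v, float k, int n) {
        for (int i = 0; i < n; i++)
            v[i].x *= k;                              /* one float per iteration */
    }

    void scale_x_soa(struct vertices_soa *v, float k, int n) {
        __m128 kk = _mm_set1_ps(k);
        for (int i = 0; i < n; i += 4) {              /* four floats per iteration */
            __m128 x = _mm_loadu_ps(&v->x[i]);
            _mm_storeu_ps(&v->x[i], _mm_mul_ps(x, kk));
        }
    }

    int main(void) {
        float xs[4] = {1, 2, 3, 4}, ys[4] = {0}, zs[4] = {0}, ws[4] = {0};
        struct vertices_soa v = { xs, ys, zs, ws };
        scale_x_soa(&v, 2.0f, 4);
        return xs[3] == 8.0f ? 0 : 1;
    }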
Also, apps won’t be able to take advantage of future Cells with more SPEs
With the right programming model this will be done by the OS automatically.
or bigger local memory
This will not happen in the near future. They didn’t leave any space between the SPE local memory area and the registers, so they have only one choice: breaking the SPE memory area into two pieces.
I just wanted to mention that the branch prediction and other things you seem to think will become the programmer’s problem will probably actually be handled in the compiler. This seems to be a trend. Intel’s Itanium (1 and 2?) requires much more of this kind of work to be done in the compiler, which to be honest can probably do a better job when targeted at a specific processor, but probably not as well when having to deal with multiple generations, or future yet-to-be-designed CPUs.
Cell’s first incarnation, at the very least, is not suitable for competing with x86 for general-purpose code. The way some people look at processor designs like the T1 and the Cell is “more cores = better,” which is frankly stupid in the same way “higher clock rate = better” was. These designs have very impressive performance-for-complexity results in _specific areas_.
The Cell isn’t just a “We’re a massively multicore processor! We can get so much done!” scenario. The Cell is a _heterogeneous_ computing environment requiring code to be compiled and loaded for the SPEs separately from the PPE. Each of those SPEs is little more than a vector processor with a local store for instructions and data. It has no branch prediction (it assumes a branch is not taken unless hinted), it’s only dual-issue when rotate/shift/load/store instructions are properly interleaved with arithmetic (with a variety of latencies to schedule around), and it has middling integer op latencies. To obtain the advantages of the Cell architecture, code needs to:
1. accept the limited single-precision float format
2. make effective use of SIMD
3. eliminate as much branching as possible
4. minimize ‘task’ switches on the SPEs
A compiler can help a lot with 3; 1 and 4 are design matters; and 2 can be aided with autovectorization, but that’s limited and does not make full use of the processing capacity that would make dealing with all of the tool and logistic problems of heterogeneous-processor development worthwhile. Different strategies can be used for utilizing the SPEs and making their use less of a burden on the programmer, but they come at the cost of the very capacity people point at when they bring up the Cell as interesting.
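To give point 3 a concrete face: the same clamp-to-zero written with a branch and written branch-free. A tiny sketch; the branch-free form assumes 32-bit ints and relies on arithmetic right shift of a negative value, which mainstream compilers do but which is technically implementation-defined. On the SPEs the compiler (or the programmer, via select-style operations) applies this kind of transformation wherever it can.

    #include <stdio.h>

    int clamp_branchy(int x)    { return x < 0 ? 0 : x; }          /* one branch   */
    int clamp_branchfree(int x) { return x & ~(x >> 31); }         /* sign-bit mask */

    int main(void) {
        printf("%d %d\n", clamp_branchy(-7), clamp_branchfree(-7)); /* 0 0   */
        printf("%d %d\n", clamp_branchy(42), clamp_branchfree(42)); /* 42 42 */
        return 0;
    }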
Cell basically takes all of the complexity concerns of transitioning to typical homogeneous shared-memory multiprocessing, and increases them by expanding the necessary complexity in compilers, debuggers, linkers, operating systems, and libraries. When looking at it from the perspective of having a large amount of legacy code, it looks more like a waste of time than something to be optimistic about. On the other hand, when looking to develop specific applications, it can appear as promising as ‘shader’ development.
Absolutely. Cell is very interesting for certain applications, but not general applications. For example, in the communications field, there are software-defined radios that have a board with 2-4 G4 (MPC74xx) processors and maybe a couple of Virtex FPGAs. All that hardware is used to do signal processing on the data coming in from the receiver. Needless to say, the things are neither cheap nor very energy efficient. All that hardware could probably be replaced by a single Cell chip running at 2-3GHz. The software developers wouldn’t really care, since the code has to be specially written for the hardware anyway, and the hardware could be substantially smaller and cheaper. Ultimately, I think these places are where you’re going to see Cell take off, not in the general purpose computing sector.
Cell is great, but what OS do they plan to run on it? And where are all the software developers rushing to the new Cell architecture? MCA all over again?
Don’t you read the news? I mean seriously, I don’t think you have. They are going to have native Linux support for the Cell chips in the mainline kernel. IBM has already said as much.
Don’t you read the news? I mean seriously, I don’t think you have. They are going to have native Linux support for the Cell chips in the mainline kernel. IBM has already said as much.
Wow, that’s a huge mistake.
Presumably IBM has ported Linux to Cell? I think it is a little scary that IBM has targeted military AND medical for first generation applications of Cell. This is a first gen chip with a brand new OS port. This does not sound like a recipe for high reliability right out of the box. What are they thinking???
Presumably IBM has ported Linux to Cell?
Yes. With a pseudo file system for accessing the SPEs.
This is a first gen chip with a brand new OS port. This does not sound like a recipe for high reliability right out of the box.
Good point. It also remains to be seen how keen software vendors are to redesign their apps to fit them to the Cell’s peculiar memory model.
“It also remains to be seen how keen software vendors are to redesign their apps to fit them to the Cell’s peculiar memory model.”
http://www.businessweek.com/technology/content/feb2006/tc20060208_0…
“Raytheon plans on using Cell in its entire family of sensor-based products.”
That’s just one example.
Sony, Toshiba, IBM, Mercury Systems, and Raytheon are just the first batch of known companies migrating to Cell.
That makes a lot of sense for Raytheon, given that they likely need Cell to do a lot of signal processing in an embedded sensor application. However, that amounts to using Cell as a glorified DSP. Where are all the programmers rushing to use Cell in general-purpose applications? When is Adobe going to release Photoshop Cell Edition?
That’s easy to answer – ANYTHING that currently runs on POWER/PowerPC. Duh. People seem to forget that Cell is just a PowerPC with a bunch of vector processors. You can run PPC software unchanged on it. IBM and Sony have been working on libraries to make more/better/full use of the vector processors, but it isn’t necessary to use them – you just won’t get full speed until you do.
AIX, Linux, and Windows NT all have PowerPC variants. I don’t think there’s any issue with “what will run on Cell” outside a few folks here who aren’t familiar with the Cell architecture.
Cell will run general PowerPC code — poorly. If you don’t use the vector coprocessors, there is absolutely no point in using a Cell CPU. The PPE itself will perform like a 3.2 GHz Pentium-1 (actually, worse, given that it has 3x the branch misprediction penalty).
I really hope you aren’t suggesting that because they’re both in-order 2-issue processors, because that would be a really inaccurate basis for such a comparison.
Inaccurate, sure, but I was going for an order-of-magnitude approximation here. With no OOO, simple branch prediction, etc., MIPS (issue rate * clock rate) is a much more reliable indicator of performance than it is for modern processors. The point is that Cell at 3.2 GHz won’t be like the G5 at 3.2 GHz, or even the P4 at 3.2 GHz. The basic design is much simpler than that.
I’ll be eager to benchmark these against the latest and greatest 64-bit sweetness from AMD when they come out. I guess only time will tell.