IBM is expected to demonstrate a blade server next week based on the Cell processor the company is developing with Toshiba and Sony. The Cell processor, which also is the brains of Sony’s upcoming PlayStation 3 videogame console, has a PowerPC processing core supplemented by eight special-purpose cores to boost the chip’s calculation abilities.
Not many people know this, but Apple never specifically said it had ruled out using IBM processors.
Supposedly it was considered and rejected because of the added difficulty of subjecting developers to a brand-new architecture on top of the Intel switch.
Perhaps Apple will take care of it themselves under the banner of “universal binaries” with Xcode.
I just don’t see how Apple, which makes a ton of money offering high-powered machines for the CPU-intensive media market, will ignore something like the Cell.
I don’t think that Apple will switch back anytime soon.
The only quick switch I could see Apple making is to AMD. They could have had 64 bits everywhere, right now, with AMD: on the desktop and server (Athlon 64 or Opteron) and the laptop (Turion 64). Even with dual core for the Athlon and Opteron, and I’m sure AMD will be there soon with a Turion 64 X2…
Why move back to PPC?
Cell would suck as a PowerMac CPU. Its performance running general-purpose code is going to be very low, even at 3.2 GHz. Universal binaries would never take advantage of Cell’s power, since Cell doesn’t appear to the software as a conventional SMP system. Code has to be specially written to submit work to the SPEs in a batch-oriented fashion, and the power of the Cell will be wasted without such specially written code. At that point, it makes more sense to use the GPU as a coprocessor hidden behind an API like CoreImage/CoreVideo.
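To give a concrete sense of what “specially written” means, here is a rough sketch of the PPE-side boilerplate, assuming libspe2-style calls (the exact API in a given SDK may differ, and “image_kernel” is a made-up kernel name). None of this happens automatically for a universal binary:

    /* ppe_main.c -- sketch of batch-style work submission to one SPE,
       assuming libspe2-style calls; "image_kernel" is a hypothetical
       SPE program linked into the PPE binary. */
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t image_kernel;   /* hypothetical SPE kernel */

    /* Work block the SPE kernel would DMA into its local store. */
    static char work[4096] __attribute__((aligned(16)));

    int main(void)
    {
        unsigned int    entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;

        spe_context_ptr_t spe = spe_context_create(0, NULL);
        spe_program_load(spe, &image_kernel);    /* load the kernel */

        /* Blocks until the SPE program stops; real code would spawn
           one pthread per SPE to keep all eight busy at once. */
        if (spe_context_run(spe, &entry, 0, work, NULL, &stop) < 0)
            perror("spe_context_run");

        spe_context_destroy(spe);
        return 0;
    }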
If Apple’s CoreImage/CoreVideo can tap the power of the Cell SPEs without exposing applications to Cell specifics, why wouldn’t it be suitable? Are you saying a GPU is better suited? These APIs look like Apple is trying to free itself of single-platform SIMD by providing a blessed system interface to GPUs (plus FPGAs and DSPs?) and perhaps Cell. What do you think, rayiner?
I think Cell would be very suitable for CoreImage/CoreVideo, and I think you’re absolutely right that Apple (and the industry in general) is trying to free itself from SIMD tied to the main CPU. However, I think that all the talk about Apple using Cell as the CPU for a PowerMac is quite silly — right up there with using the GPU as the CPU in the machine!
I wonder which Apple machine will run the “rumored” Avid killer, Final Cut Extreme?
Anyway, the Cell is a pretty awesome chip. So it doesn’t heat up like the 970?
The big trouble with Cell: only IBM and Mercury are shipping it.
Cell must become widely licensed (by hardware vendors) to become widely used (by software vendors & end users).
HP has been touting its relationship with Intel and the Itanium 2 and would rather go bankrupt than license a chip design from its main competitor (IBM, that is). Mercury, on the other hand, only worries about differentiation, and as such can easily switch from one architecture to another as long as it is not ‘commonly available’.
Much more likely, the “HPs & the Dells” (whoever that might be, beyond HP & Dell) will start shipping servers with FPGAs in them (see SGI’s FPGA add-on product), providing acceleration for special purposes when and if needed. Obviously, the FPGAs carry a fairly hefty price, but they are at least commonly available from various vendors, with no worries about being locked in to a single vendor and platform.
-CEO
“Its performance running general-purpose code is going to be very low, even at 3.2 GHz.”
I beg to differ. Really, name one task performed by the average workstation that needs both high computation and a single thread. There aren’t any. What are the major tasks performed by workstations?
Word processing (no extreme power needed here; a 1 GHz chip can do this fine)
DTP (Cell would be able to accelerate this greatly)
Video Editing (same here)
3D work (same thing)
Computer Programming (could benefit from the Cell)
Server Admin (could benefit as well)
My theory is that tasks which cannot be made multi-threaded don’t need the power offered by multi-threading. There are simply no high-powered single-threaded tasks these days! Unless you’re calculating pi or something like that.
I beg to differ. Really, name one task performed by the average workstation that needs both high computation and a single thread.
It’s not just the single-threading issue. It’s the combination of the weird memory model and poor integer performance. On top of all that, applications have to be explicitly recoded to work on the Cell architecture. The motivation for doing that is very minimal, given how expensive it is to develop workstation applications, and how relatively small the target market is.
Word processing (no extreme power needed here; a 1 GHz chip can do this fine)
Eh. Word for OS X is not exactly snappy even on my 2.3 GHz G5. It lags very heavily on a 1.6 GHz G5. I shudder to think what it’d be like on Cell.
DTP (Cell would be able to accelerate this greatly)
What makes you think that? A lot of the good page layout algorithms (eg: TeX’s algorithm for paragraph layout) are formulated in single-threaded terms. Indeed, page layout is a pretty single-threaded task, since the layout of successive elements depends on the layout of previous ones.
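To make the sequential dependence concrete, here’s a toy first-fit line breaker (illustration only; TeX’s actual Knuth-Plass algorithm is a global optimization, but it shares the property that each decision depends on the ones before it). Where one line ends determines where the next begins, so you can’t hand line N and line N+1 to different SPEs:

    #include <stdio.h>
    #include <string.h>

    /* Greedy line breaking: each line's start depends on where the
       previous line ended, which is what makes layout sequential. */
    static void break_lines(const char *words[], int n, int width)
    {
        int used = 0;                  /* characters on the current line */
        for (int i = 0; i < n; i++) {
            int len = (int)strlen(words[i]);
            if (used && used + 1 + len > width) {
                putchar('\n');         /* this decision feeds the next one */
                used = 0;
            }
            if (used) { putchar(' '); used++; }
            fputs(words[i], stdout);
            used += len;
        }
        putchar('\n');
    }

    int main(void)
    {
        const char *w[] = { "page", "layout", "is", "an", "inherently",
                            "sequential", "task" };
        break_lines(w, 7, 16);
        return 0;
    }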
Video Editing (same here)
True, although, this assumes that apps would be recoded to fit the Cell architecture.
3D work (same thing)
It depends. The Cell is hobbled by the fact that it’s only got 256KB of local store, and extremely high latency to main memory. Cell could accelerate the rendering of scenes using simple algorithms, but then again so could a GPU. Doing more advanced stuff like raytracing would be complicated, given the size of the datasets for complex scenes. The parallelization could be done, but you’d need very clever code to work around the memory bottleneck. The memory wall is surprisingly significant. Despite its great FPU performance, the G5 runs apps like POVray a good deal slower than a comparably clocked Opteron, because the fairly random access patterns into the large scene database cause a large performance hit.
Computer Programming (could benefit from the Cell)
Simple compilers could be parallelized to run concurrently on the SPEs, but the lack of dynamic branch prediction, the long memory latency, and the long pipeline would kill performance. More sophisticated whole-program compilers are hard to parallelize, since they use optimization algorithms that are hard to formulate in parallel terms.
Server Admin (could benefit as well)
How?
My theory is that tasks which cannot be made multi-threaded don’t need the power offered by multi-threading. There are simply no high-powered single-threaded tasks these days!
Your theory is wrong. Moreover, it doesn’t matter whether there are lots of tasks that could theoretically be parallelized. The simple fact is that most current implementations are not highly parallelized, and the incentive to reprogram a huge amount of existing code is minimal, especially in the workstation realm. In my field (aerospace engineering), we use FORTRAN programs dating from the 1970s and 1980s. There are huge bodies of existing code that are in desperate need of modernization, but are not updated because they have well-understood characteristics and generate well-understood results. Nobody is going to rewrite all that code just for Cell.
I love this. One side bashes, one side praises. Shouldn’t you both wait for this thing to actually come out? So far only speculation exists.
It’s not just the single-threading issue. It’s the combination of the weird memory model and poor integer performance. On top of all that, applications have to be explicitly recoded to work on the Cell architecture. The motivation for doing that is very minimal, given how expensive it is to develop workstation applications, and how relatively small the target market is.
I love this one: “poor integer performance.” Against what? Did you test it? “Poor integer performance” can have two meanings: against single ops (which are optimized on Cell), or against other architectures like Intel and AMD? Those are two completely different stories; one could mean insanely fast, the other insanely slow. Not one word is out yet about what reality says.
Eh. Word for OS X is not exactly snappy even on my 2.3 GHz G5. It lags very heavily on a 1.6 GHz G5. I shudder to think what it’d be like on Cell.
So, according to your dream benchmarks it is just a bit slower than a 233 MHz G3. And according to the parent post it is the new definition of the speed of light.
What makes you think that? A lot of the good page layout algorithms (eg: TeX’s algorithm for paragraph layout) are formulated in single-threaded terms. Indeed, page layout is a pretty single-threaded task, since the layout of successive elements depends on the layout of previous ones.
Since when has page layout been a performance problem? Photoshop, on the other hand, always is. And this is what the parent meant, and this one can be optimized to insanity on Cell.
It depends. The Cell is hobbled by the fact that it’s only got 256KB of local store, and extremely high latency to main memory.
Did dream benchmarks show that too? Not a single fact has been available yet. But from Wikipedia:
Cell contains a dual-channel next-generation Rambus XDR controller that is incorporated on-die. Using a 16-bit wide data bus and one channel, the overall peak memory bandwidth is 2.6 GB/s (1 channel × 1 device per channel × 1 byte per device × 2.6 GHz). The system interface used in Cell, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 “lanes,” each lane being a unidirectional 8-bit wide point-to-point path. Five of the lanes are inbound to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typically at 3.2 GHz. 4 inbound + 4 outbound lanes support memory coherency.
The Pentium 4 tops out at 3.97 GB/s, single-sided only.
AMD HyperTransport: 3.2 GB/s in both directions.
How the hell can this crazy-fast memory access be slow? But then again, nobody has tested it yet.
Your theory is wrong. Moreover, it doesn’t matter whether there are lots of tasks that could theoretically be parallelized. The simple fact is that most current implementations are not highly parallelized, and the incentive to reprogram a huge amount of existing code is minimal, especially in the workstation realm. In my field (aerospace engineering), we use FORTRAN programs dating from the 1970s and 1980s. There are huge bodies of existing code that are in desperate need of modernization, but are not updated because they have well-understood characteristics and generate well-understood results. Nobody is going to rewrite all that code just for Cell.
None of the code is Cell-optimized, yes. But it seems you plan to live in the ’70s and not move forward, doing math with abacus beads.
Nobody is going to optimize all that code. Yes. You don’t need to. Optimize the bottlenecks.
What will definitely be optimized, whether you start doing it or not? At least 3D libs, audio libs, and video libs. Way more than sufficient for what the parent said. Except the compiler stuff: from how it works, this arch has no better approach than regular PPC, and PPC sucks at compilers.
I for one am all for modifying my code to fit a new approach if it shows progress. And in my eyes Cell does show progress, because it seems like God’s gift to my needs (that is, after reading the speculative papers).
I love this. One side bashes, one side praises. Shouldn’t you both wait for this thing to actually come out? So far only speculation exists.
You can glean a lot from just a theoretical consideration of the chip. The Cell’s SPEs are two-issue, in-order designs with no dynamic branch prediction and minimal integer resources. They have a 3-cycle latency to the register file, and a 7-cycle latency to what is effectively their L1 cache. From a theoretical standpoint, the original Pentium will have better per-clock performance than an SPE.
I love this one: “poor integer performance.” Against what? Did you test it?
I don’t have to. If you build a car with square wheels, you can already get a good idea of how it will perform. Cell might very well be excellent for pushing a lot of single-precision floating-point operations within a limited power envelope. That’s what it was designed to do. However, it will not be a good general-purpose workstation processor, and that is not what it was designed to be. Beyond that, there are already initial reports suggesting the PPE’s integer performance is very lacking. The PPE is a highly sophisticated design compared to the SPE, yet both are markedly inferior to even decade-old PowerPC chips (the 604e) on general-purpose integer code.
Since when has page layout been a performance problem?
Ever run TeX on a massive document? I’d presume InDesign is the same way.
Photoshop, on the other hand, always is. And this is what the parent meant, and this one can be optimized to insanity on Cell.
Plugins and image effects, perhaps, but again, special purpose code will have to be written for it. However, Photoshop isn’t all plugins. You’ve got lots of logic for history manipulation, UI, etc. That code will run like crap on Cell. Given all that, it makes more sense to accelerate plugins using something like CoreImage — the GPU can certainly offer performance comparable to Cell on such a task.
Did dream benchmarks show that too? Not a single fact has been available yet.
The SPE’s memory model is fairly well documented. See Ars Technica’s series of articles on the subject.
How the hell can this crazy-fast memory access be slow? But then again, nobody has tested it yet.
Your “facts” about the Pentium 4 and the Opteron are wrong. The Opteron’s memory bandwidth is 6.4 GB/sec. The P4 can hit up to 8.5 GB/sec. Of course, on memory-intensive benchmarks, the Opteron is usually faster despite the P4’s higher bandwidth. Bandwidth is far from the only consideration. Cell has a lot of bandwidth, but the SPEs also cannot randomly access memory. Instead, they operate out of their 256KB of local store, and DMA data blocks from memory as needed. The DMA operation is very high-latency, despite the fact that it’s high-bandwidth. The primary issue here is that Cell’s local store design is completely alien to current programs. Current code (and indeed most current algorithms) is designed with the assumption of memory as a linear, randomly-accessible space. Large, fast caches are used to make real memory approximate this ideal. The Cell has no such abstractions, which means code not specially written for its memory model will suck. Some code (some media algorithms) will be adaptable to this memory model. A lot of code will not be.
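For illustration, here’s roughly what the SPE side of that dance looks like, assuming the spu_mfcio.h intrinsics from IBM’s SPU toolchain (“process”, “CHUNK”, and the addresses are made up; real code would also double-buffer so the DMA overlaps the compute):

    /* Sketch of the SPE local-store model: DMA a block in, compute,
       DMA the result out. There is no transparent caching here. */
    #include <spu_mfcio.h>

    #define CHUNK 16384                /* 16KB, the maximum single DMA */
    static char buf[CHUNK] __attribute__((aligned(128)));

    void process(unsigned long long src_ea, unsigned long long dst_ea)
    {
        unsigned int tag = 0;

        /* Pull a block of main memory into local store: high
           bandwidth, but also high latency. */
        mfc_get(buf, src_ea, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();     /* stall until the DMA completes */

        /* ... compute on buf entirely within the 256KB local store ... */

        /* Push the results back to main memory. */
        mfc_put(buf, dst_ea, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }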
None of the code is Cell-optimized, yes. But it seems you plan to live in the ’70s and not move forward, doing math with abacus beads.
The world is what it is. Hardware exists to serve the needs of software, and programmers are much more expensive than computers. Most engineering firms will not care if a $2000 Cell workstation can give them the performance of a $10000 traditional machine, if they’re going to need to pay several $70k/year programmers and engineers to change the existing code.
Nobody is going to optimize all that code. Yes. You don’t need to. Optimize the bottlenecks.
Cell isn’t set up that way. It’s not set up to just accelerate your “inner loop”. In order to get good performance from a Cell program, you have to buy into its weird SPE model. Some fields (eg: signal processing) will be fine with this — they’ve had to program specialized DSPs for years to do their jobs. Most fields won’t be. Hell, is there even a FORTRAN compiler for the SPEs?
At least 3D libs, audio libs, and video libs.
Again, the Cell is not set up to just accelerate audio/video libraries. Cell changes the fundamental data flow model within the program. If you try to plug Cell into an existing design (eg: upload all your image data to the SPEs each frame, download it afterwards), your performance is going to be sub-par. Heck, the initial reports suggest performance is sub-par even with specially designed code — apparently, the PPE isn’t fast enough to keep the SPEs busy. The thing is hard enough to program even when designing code specially for it — retrofitting existing code will be a massive PITA.
Way more than sufficient for what the parent said.
The parent suggested putting this in a PowerMac. In a PowerMac, the Cell will appear, and perform, like a single G4-class CPU.
I for one am all for modifying my code to fit a new approach if it shows progress. And in my eyes Cell does show progress, because it seems like God’s gift to my needs (that is, after reading the speculative papers).
Have fun wasting your time, because as soon as the “next big thing” comes out, you’ll have to modify your code all over again. Meanwhile, Intel and AMD have the right idea, designing processors that can run general purpose code quickly.
Ever run TeX on a massive document? I’d presume InDesign is the same way.
You assume wrong. People still use a 233 MHz G3 without any performance loss in Quark or other layout software.
Again, the Cell is not set up to just accelerate audio/video libraries. Cell changes the fundamental data flow model within the program. If you try to plug Cell into an existing design (eg: upload all your image data to the SPEs each frame,…
In simple translation: the current model doesn’t work for Cell. A well-known fact. I never said to use the same app. I said: optimize.
The parent suggested putting this in a PowerMac. In a PowerMac, the Cell will appear, and perform, like a single G4-class CPU.
If that is the predicted non-optimized performance, then Cell has already won for my basic needs. My HP needs fit Cell like a baby in a cradle (but read the next part too before answering).
Have fun wasting your time, because as soon as the “next big thing” comes out, you’ll have to modify your code all over again. Meanwhile, Intel and AMD have the right idea, designing processors that can run general purpose code quickly.
So far, they are both well under my current needs. I don’t mind rewriting if the customer is prepared to pay, so why would I look at this as a waste of time? I will either be paid or I won’t do it. In the end the customer will decide whether to throw money into development and hardware.
General purpose? As I said, what I’m interested in and what I need are its special purposes. Otherwise I would not be looking at another arch.
The second option I’m considering is moving to Power completely. And since I’ve got time, I’ll just wait and see what time brings.
I love this. One side bashes, one side praises.
I didn’t see any “bashing”; those were all reasonable arguments.
Shouldn’t you both wait for this thing to actually come out? So far only speculation exists.
No. If the marketing departments hype it up before it’s released and fans are happy to cheer it on, then of course people are entitled to criticise the thing, even if the criticism has to be based on the incomplete information that IBM & Co provide.
I didn’t see any “bashing”; those were all reasonable arguments.
OK, I need to be a bit clearer here; my bad.
One side is treating Cell as God’s next revelation of speed.
The other is talking about the same thing, but saying it will be as slow as a 486 DX33.
I don’t like either. Wait and see. Then bitch or praise. But until then, speculate without posting speed results.
No. If the marketing departments hype it up before it’s released and fans are happy to cheer it on, then of course people are entitled to criticise the thing, even if the criticism has to be based on the incomplete information that IBM & Co provide.
Yeah, I don’t mind that. I meant benchmarking from specification papers, or at least that people should stay away from talking about speed. Personally, I like what Cell provides. But I plan to test whether my personal fanboyism was for real or not. This is why I don’t like people speaking about tech before it’s even out in the same tone as if they had been intensively testing the thing for the last year.
This thread has been taken off topic by all the Mac fanboys arguing about whether it would be good or not for running an OS X desktop. The article is about IBM’s new Cell blade servers.
These machines are not for desktops and workstations, nor are they likely to run OS X. They are intended for the datacentre, and they will run Linux. “Linux is popular in computing clusters, and IBM, Sony and Toshiba researchers released a version of Linux for Cell. That version includes the ability to load and communicate with simple programs in the special-purpose cores… the special-purpose engines use different instructions, so IBM built support for the chip into the GCC used in software development.”.
I would be interested in a discussion about this and what sort of tasks these blade servers would be useful for in the datacentre and the associated technical advantages and disadvantages.
Linux is popular in computing clusters, and IBM, Sony and Toshiba researchers released a version of Linux for Cell. That version includes the ability to load and communicate with simple programs in the special-purpose cores…
Presumably that means spufs, the pseudo-filesystem for accessing the SPEs. It’s a fairly primitive interface that provides detailed control but little convenience.
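For a sense of how primitive: the raw interface boils down to a pair of syscalls, roughly like the sketch below (powerpc-only syscalls; the mount point and context name are illustrative, and libspe wraps all of this for you):

    /* Sketch of raw spufs usage: spu_create() makes a context
       directory (with files like mem, regs, mbox) under the spufs
       mount; spu_run() runs the SPE and blocks until it stops. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        unsigned int npc = 0, status = 0;

        int ctx = syscall(__NR_spu_create, "/spu/myctx", 0, 0755);
        if (ctx < 0) { perror("spu_create"); return 1; }

        /* A real program would first copy the SPE executable into the
           context's "mem" file before starting it. */
        if (syscall(__NR_spu_run, ctx, &npc, &status) < 0)
            perror("spu_run");

        printf("SPE stopped, status 0x%x\n", status);
        return 0;
    }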
the special-purpose engines use different instructions, so IBM built support for the chip into the GCC used in software development.
This refers to a GCC backend for generating SPE code. AFAIK it does not mean that GCC itself runs on the SPEs. (Given the memory architecture of the Cell and the complex structure of GCC, that would be a VERY difficult thing to do.)
With the limited integer performance of the Cell’s PPE, cross-compiling is probably the way to go.
what sort of tasks would these blade servers be useful for in the datacentre?
Good question. Number crunching is the only thing I can think of.
Web servers and the like would need to be specially redesigned and rewritten to make use of the SPUs with their distributed memory model. And even if you did that, the vector and floating-point units would go largely unused, so you’re much better off with something like Sun’s Niagara.
From all I’ve read, many of the dumbed-down aspects of the Cell were chosen to limit complexity, so as to save on transistors and instead simply increase the clock speed. I’d imagine that in the future, just as has happened with RISC since its early days, as technology gets better the Cell architecture will be able to reintroduce those more complex aspects.
The Cell blade was demonstrated months ago, yet the summary doesn’t mention the new information in the article, like the rumored BladeCenter H.