IBM Discloses Cell-Based Blade Server Board Prototype
Submitted by Peter Stöckli 2005-05-27 IBM 35 Comments

IBM Corp. has revealed a prototype blade server board featuring the Cell microprocessor jointly developed with the Sony Group and Toshiba Corp. The company demonstrated the prototype to only a few clients in a hotel room outside Los Angeles, US, at the 2005 Electronic Entertainment Expo (E3) game trade show.
I wonder if IBM is working on Apple's PPC products, or if Apple will switch to Cell processors. I WONDER!
Wow … I'd love Sony to release a Linux kit for the PS3, or Apple to start selling Cell-powered Macs.
It seems that the Cell's capabilities go beyond multimedia processing. With Linux ported to a Cell-based blade server, it will probably be possible to do the same on the Sony PS3. I can't wait for 2006; I want my PS3 now, with Linux on it!
I would say that the PS3 already runs Linux as its OS.
The Cell-based server board makes no sense: too much heat, and too many low-precision FLOPS that are tuned for gaming engines.
It might be okay for a Hollywood render farm, but for scientific computing the industry wants 64-bit FPUs that can be useful. There is way too much hype in this, but what would you expect from a market driven by teens?
The Niagara and Raza chips look more promising: no FPU, but serious computing throughput. I am not yet convinced by the Cell; the more I hear, the more the spamming by its fanboys turns me off.
Exactly what is the intended purpose of this server? Is it a multimedia server? Is it a video or voice-over-IP server? What is the deal here? I kind of wonder about the price/performance of a Cell-based server for hosting something like osnews.com (low bandwidth, low graphics).
There is actually a company (I forget which) that is trying to run audio effects on the spare processing power of Nvidia graphics cards. That is kind of cool, and it sounds like the Cell might be perfect for a similar application in Macs or PCs.
First, if you let the opinions of others, or how they express them, influence your judgement, then blame yourself, not the product. There is no rational reason for that sort of thinking.
Second, I disagree that a Cell-based server board makes no sense. A single Cell has a theoretical throughput of 20+ gigaflops doing double-precision IEEE 754 math (at 3.2GHz). A single G5 has a theoretical peak of 4.5 gigaflops doing double-precision math (at 2.3GHz). Sure, double-precision math is a huge hit on the Cell (10x slower than single-precision math), but that's still pretty damn fast. And I have no idea what you mean about "too much heat": the Cell is designed to be low power, and should run at a Vdd of 1.0 to 1.1 volts. It shouldn't dissipate any more power at those settings than the original 130nm G5s.
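The peak figures quoted above can be sketched as back-of-the-envelope arithmetic. This assumes 8 SPEs, 4-wide single-precision SIMD, a fused multiply-add counted as 2 flops per lane, and the 10x double-precision penalty mentioned in the post; none of these parameters come from the article itself.

```python
# Rough peak-FLOPS sketch for the numbers quoted above.
# Assumed (not from the article): 8 SPEs, 4-wide SP SIMD, FMA = 2 flops/lane,
# and the poster's "10x slower" double-precision penalty.

def cell_sp_gflops(clock_ghz=3.2, spes=8, simd_width=4, flops_per_lane=2):
    """Theoretical single-precision peak for one Cell, in GFLOPS."""
    return clock_ghz * spes * simd_width * flops_per_lane

def cell_dp_gflops(clock_ghz=3.2, dp_penalty=10):
    """Double precision, assuming the quoted 10x slowdown."""
    return cell_sp_gflops(clock_ghz) / dp_penalty

print(cell_sp_gflops())  # ~204.8 GFLOPS single precision
print(cell_dp_gflops())  # ~20.5 GFLOPS double precision, matching the "20+" claim
```

Under these assumptions the double-precision number lands right around the 20 gigaflops cited above.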
As for "too much heat": if you look closely at the board, and if the article is correct about the Cell processors being the ones on the left in the first picture, then it's interesting to note that it's the south-bridge chips that have the huge heatsink/fan. The Cell processors (if they were on the left in the first picture, they are on the bottom in the second) have quite small heatsinks/fans, especially for something running at 2.4-2.8GHz!
Actually, scratch that, I'm a dumbass. The smaller heatsinks are actually on the RAM. Holy crap, that's something new; I've never seen RAM that needed an active heatsink!
If you look carefully you can see that the XDR RAM has the small heatsinks and the Cells have the big heatsinks.
A 130nm PowerPC 970 can use up to 80W @ 2GHz, according to published measurements.
Do you think the Cell's RAM needs a heatsink because of the extreme speed of XDR DRAM? I believe it runs at 2GHz on the PS3.
Also, do they scale down the clock for server boards? The PS3, again, runs at 3.2GHz. I wonder if it is fabbed at 130nm or 90nm.
Does anyone know what a dual POWER5 system can get, GFLOPS-wise? They say this dual-Cell box is 400 GFLOPS, but I think the POWER5 is more powerful and runs cooler. Does anyone know of any benchmarks you can compare this Cell system to at this point?
I think Cell makes sense in servers mostly because of its flexible design. It will be easy and cheap for IBM to crank out all kinds of "different" Cell CPUs, which of course will differ in little more than the number of cores. This is great for covering a wide range of markets with minimum effort. Plus, the kind of games Sony and IBM envision aren't that far from real scientific workloads either…
@Wes Felter: Yeah, I noticed that right after I posted the first time. Man, I've never seen an HSF on RAM before.
@Andrewg: I'm pretty sure the Cell's RAM runs at 400MHz. It transfers 8 times per cycle, though, so with a 64-bit bus, you get 400x8x8 = 25.6GB/sec (the published memory bandwidth for the PS3). That's fast (DDR400 RAM runs at 200MHz), but it shouldn't require a heatsink. I've got a 6600GT on my desk, whose GDDR3 RAM runs at 500MHz (its DDR rating is 1GHz), and it doesn't have any active cooling (or even a heatsink) on its RAM chips.
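The bandwidth arithmetic above follows one formula: clock rate times transfers per cycle times bus width in bytes. A quick sketch, using the figures as quoted in the thread rather than from any datasheet:

```python
# Peak memory bandwidth = clock * transfers/cycle * bytes per transfer.
# Figures are the ones quoted in the thread, not datasheet values.

def bandwidth_gb_s(clock_mhz, transfers_per_cycle, bus_bits):
    """Theoretical peak bandwidth in GB/s."""
    return clock_mhz * 1e6 * transfers_per_cycle * (bus_bits / 8) / 1e9

# XDR as described: 400MHz clock, 8 transfers per cycle, 64-bit bus
print(bandwidth_gb_s(400, 8, 64))   # 25.6 GB/s
# DDR400 for comparison: 200MHz clock, 2 transfers per cycle, 64-bit bus
print(bandwidth_gb_s(200, 2, 64))   # 3.2 GB/s
```

That reproduces both the 25.6 GB/sec PS3 figure and the roughly 8x gap over DDR400 discussed later in the thread.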
First, the POWER5 does not run cooler. The POWER4 was 125 watts at 130nm. POWER5+ should be a lot less at 90nm, but it's also a bit bigger and runs at a higher clock speed, so it won't be *that* much less.
Second, the performance depends on what you're running. If you have an easily parallelizable floating-point numeric problem that doesn't use too much memory and doesn't branch much, then the Cell will beat the POWER5 (by a large amount). The POWER5 has only 2 FPUs running at ~2GHz, and no vector units. Now, if you're talking about branchy integer code that uses lots of memory, the POWER5 will win, again by a large margin. With huge caches, multiple wide cache buses, and a ridiculous memory bus, it's built precisely for that sort of code.
Thanks for the explanation.
So basically, when the specs state that the memory is at 3.2GHz, they mean 8×400MHz for an effective 3.2GHz? The VRAM on the PS3 is GDDR3 and listed at 700MHz, though. Is it listed at 700MHz because it only transfers once per cycle? My understanding is that XDR is considerably faster than any other memory available.
Some info I found is below.
IGN's facts on the PS3 state:
256MB XDR Main RAM @ 3.2GHz
256MB GDDR3 VRAM @ 700MHz
Engadget has XDR at 12x the speed of DDR400; extract follows:
At transmission speeds up to 9.6GB per second, these chips will fly about 12 times faster than DDR400 memory.
An older link I found stated the following
Initially XDR DRAM will be offered at 3.2GHz with a roadmap to 6.4GHz and beyond, enabling memory system bandwidths up to 100GB/s
The Xbox 360 specs state the following:
512 MB of 700 MHz GDDR3 RAM
I've always hated the "X MHz DDR = 2X MHz" thing. But to answer your question: you're right about the XDR, wrong about the GDDR3. The GDDR3 actually runs at 700MHz (the GDDR3 in the G70 is projected to hit 900MHz). It transfers twice per clock, so it's got an effective rate of 1.4 gigabits per pin. The PS3 uses a 128-bit implementation, so you're talking about 22.4 GB/sec for the whole implementation.
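The GDDR3 arithmetic above can be checked the same way as the XDR numbers; a quick sketch, again using the clock and bus width as quoted in the thread rather than from a datasheet:

```python
# GDDR3 as quoted: 700MHz clock, 2 transfers per clock, 128-bit bus.

def gb_per_s(clock_mhz, transfers_per_clock, bus_bits):
    """Theoretical peak bandwidth in GB/s."""
    return clock_mhz * 1e6 * transfers_per_clock * (bus_bits / 8) / 1e9

per_pin_gbit = 700e6 * 2 / 1e9   # effective rate per pin: 1.4 Gbit/s
total = gb_per_s(700, 2, 128)    # whole 128-bit implementation: 22.4 GB/s
print(per_pin_gbit, total)
```

Both the 1.4 gigabit-per-pin figure and the 22.4 GB/sec total fall straight out of the formula.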
Thanks for the quick lesson. Basically, I now have a clue.
I'm wondering why XDR is not run at a higher clock, and also why it's on a 64-bit implementation. It probably has to do with trade-offs, and XDR may have more room for growth.
I believe that XDR has better latency than GDDR3. Is that true? If so, is the lower latency a result of transferring 8 times per cycle instead of 2 times per cycle?
The 64-bit implementation is probably to keep costs down. God knows, with a 220M-transistor CPU, plus a 300M-transistor GPU, plus 512MB of very expensive RAM in a $500 box, IBM and Sony have to cut corners somewhere. Same reason they use a 128-bit memory interface for the GPU instead of the new-standard 256-bit interface.
As for XDR having low latency, your post was the first time I heard that. I looked around some more, and sure enough, that's what Rambus claims. If this is true, that would be awesome. I actually expected it to have very high latency. Serialization (transferring 8 times per cycle over narrow links) tends to hurt latency rather than help it.
Wasn’t one of the selling points of the Cell architecture that it could use other Cells on the network to do supplemental processing? It seems to me that this would be a good reason to have Cell servers.
I actually expected it to have very high latency.
That was what I thought because the old RDRAM had higher latency.
It seems that the reason for the lower latency is the FlexIO technology, and in particular FlexPhase.
Here is an extract I found
FlexPhase allows precise on-chip alignment of data with the clock. This doesn't mean as much to consumers as it does to motherboard manufacturers: they don't need to worry about matching PCB trace lengths and PCB timing constraints. Ever wonder why, when you look at motherboards, the traces seem to run in funny patterns? This is because with current memory technologies, every trace from the memory to the memory controller needs to be the same length. If all the traces are the same length, then every bit of info will arrive in the right order. If the traces are different lengths, data on a shorter route will get there faster than data on a longer trace. For consumers this means that motherboards can be a lot simpler, thus smaller, and maybe cheaper.
Bingo. It is supposed to be easy to network together into a cluster system, and that's exactly what blade racks are designed for: clusters.
Also, IIRC from the article about the Cell over on Ars Technica, the vector units of the Cell look to be easy to replace with other units if IBM feels like it. So they can build other kinds of specialized Cells for different environments, or so it seems.
Sorry here is a better link
About those fans: this is a prototype rig. If I understand the article right, they are pushing the Cell beyond its published specs in the lab. And as all overclockers know, overclocking needs a lot of cooling.
Hey, thanks for the link; I hadn't seen that before. There's an inaccuracy on page 2 (DDR2 is only 2 transfers per cycle externally, not 4 like the article says). However, it's the third page that's really interesting. They cite a latency spec for XDR that looks like:
Now, I don’t know what that’s supposed to represent, but it looks a whole lot like timings to me! In comparison, really good DDR400 RAM (TCCD rated at 2-2-2-5 with a cycle time of 5ns) would be rated 10/10/10/25ns.
Seems almost too good to be true (which means it probably is), but man, that’d be awesome.
I knew it; it was too good to be true. Those numbers don't seem to be the timings. I *did*, however, find some actual timings, here: http://www.rambus.co.jp/events/Track1.4_Rambus_Echevarria.pdf on page 4. It lists:
tRP = 15ns, tRAS = 25ns, tRCD-R = 12.5ns, tRCD-W = 2.5ns
Now, normally memory timings are given in the format CAS-tRCD-tRP-tRAS:
Where each number is in clock-cycles. Really good DDR400 is rated 2-2-2-5, while good value RAM is rated 3-3-3-8. Since DDR400 has a 200MHz clock with a 5ns cycle time, we’re talking about a rating of (for the value RAM):
tRP = 15ns, tRCD = 15ns, tRAS = 40ns.
Note that while the XDR memory has different timings for read (tRCD-R) and write (tRCD-W), regular SDRAM has just one timing for both (tRCD). So XDR really does look like it has lower latency than DDR, which is good news, in a way. It has lower latency in absolute time, but not relative to its clock speed. In cycles, XDR's timings look like:
? – 6 – 5/1 – 10 (note two values for tRCD and unknown CAS).
So in absolute terms it's faster, but the latency/bandwidth ratio continues to get worse. Of course, this will probably improve as time goes on. High-clock-speed memory is difficult to operate at tight timings, at least initially. IIRC, DDR400 originally started at timings of 3-4-4-8 or higher (almost twice as high as good new chips).
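The cycle numbers above come from dividing each nanosecond timing by the clock period. A quick sketch, assuming the 400MHz (2.5ns) XDR clock discussed in the thread:

```python
# Convert the quoted XDR timings from nanoseconds to clock cycles,
# assuming a 400MHz clock (2.5ns period) as discussed above.

CYCLE_NS = 2.5  # one 400MHz clock period

def to_cycles(ns):
    """Nanoseconds -> clock cycles at the assumed 400MHz clock."""
    return ns / CYCLE_NS

timings_ns = {"tRP": 15, "tRAS": 25, "tRCD-R": 12.5, "tRCD-W": 2.5}
for name, ns in timings_ns.items():
    print(name, to_cycles(ns))
# tRP = 6, tRAS = 10, tRCD-R = 5, tRCD-W = 1, i.e. the "? - 6 - 5/1 - 10" above
```

The same division applied to DDR400's 5ns cycle reproduces the 2-2-2-5 versus 10/10/10/25ns equivalence mentioned earlier.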
I have two separate theories on why there are some pretty massive coolers on the RAM chips. Heck, they look bigger than the original heatsinks on early 133MHz Pentiums, but… anyways.
Correct me if I'm wrong, but the Cell is meant to operate with its own external SRAM cache, isn't it? 256KB per chip, correct? Or is that just the PS3 implementation? Granted, there are far more chips than that 256KB could account for, but one would think that onboard SRAM operating at a much lower latency/chip speed would require additional cooling.
More realistically, though, the RDRAM just runs hot. The original RDRAM required heatspreaders and got pretty toasty to the touch, and that was transferring far less data per second. It wouldn't surprise me now if it requires active cooling above and beyond just a heatspreader.
The really interesting thing, though, is that it's a really inefficient heatsink design: they're attaching heatspreaders to the RAM chips, then heatsinks to the spreaders, but the heatsinks barely cover half the spreaders. If they actually increased the size of the heatsinks to cover the rest of the spreaders, they might very well be able to get away without the fan.
Granted, it is a prototype so…;)
No. Each SPE has its own internal 256KB of SRAM, and that's per SPE (or 2MB per Cell). Also, it's XDR DRAM, not RDRAM. But you're probably right about it running hot. 400MHz might get pretty toasty.
That Rambus link was a really interesting read. Not that I understood it all, but still a relatively accessible technical document.
Can IBM, with less work, make an evolution of the current Cell? Like replacing the current PPE with a more powerful dual-core POWER5-derived PPC?
If Cell achieves all that it promises, then I predict further woes for Intel (and AMD as well).
With the Cell in the PS3, as well as PowerPCs in the Xbox 360 and, in all probability, in Nintendo's next console, IBM and its partners have made a brilliant strategic move. Of course, one never knows how things might turn out, but it'll sure be exciting to watch from the sidelines.
If folks want low-latency DRAM, the only company delivering it is Micron/Infineon with RLDRAM II (I assume it is shipping). I am told the premium is 50% over regular DRAM, but it's only used by the networking guys so far.
RLDRAM II can start a command cycle every 2.5ns to any of 8 independent banks, each of which has a full 20ns SRAM-like simple latency. This is for 256M and 512M parts. The I/Os are also DDR, so that's 800MHz on 8/16/36 I/O bits. The whole issue of bank/CAS/RAS/precharge is reduced to a much-simplified single command on a 23-bit address path. The I/Os can be common or split, depending on usage patterns.
As long as each request does not hit the same bank within 20ns of a previous hit, the request can proceed. Now, that's the kind of DRAM I would like to see used in 4-way threaded CPUs, and it is getting used by the Raza MIPS chip. The 2.5ns command clock cycle can only fall slowly over time, but more interesting is that the core latency (or full random cycle) is falling to 15ns, and even better, the number of independent banks could go up significantly; networking customers have asked for 128 banks. Of course, the DRAM core is not the same as your commodity SDRAM core: much shorter bitlines, i.e. 32 bits per line vs., say, 128 or 256 bits per line in SDRAM.
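The bank rule described above can be illustrated with a toy scheduler: a new command may issue every 2.5ns, as long as the target bank wasn't hit in the last 20ns. This is a sketch under the quoted timings and 8-bank count, not a model of the real part:

```python
# Toy model of the RLDRAM II access rule: one command slot every 2.5ns,
# but a bank is busy for 20ns after each hit. Illustrative only; the
# timings and bank count are as quoted in the thread.

COMMAND_NS = 2.5
BANK_BUSY_NS = 20.0

def schedule(bank_requests, banks=8):
    """Return the issue time (ns) for each request in a sequence of bank numbers."""
    last_hit = [-BANK_BUSY_NS] * banks  # so every bank starts ready
    t = 0.0
    issue_times = []
    for b in bank_requests:
        # stall until this bank's 20ns busy window has passed
        t = max(t, last_hit[b] + BANK_BUSY_NS)
        issue_times.append(t)
        last_hit[b] = t
        t += COMMAND_NS  # advance to the next command slot
    return issue_times

# Round-robin across 8 banks never stalls: one command every 2.5ns
print(schedule(list(range(8)) * 2))
# Hammering a single bank degrades to one command every 20ns
print(schedule([0, 0, 0]))
```

This is exactly why the poster pairs it with a threaded CPU: as long as requests spread across banks, the interface sustains the full 2.5ns command rate.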
The XDR and RDRAM, AFAIK, are both mostly similar to SDRAM cores; only the interfaces have been moving to faster and faster serial schemes, which increase bandwidth but hurt relative latency to some extent. Rambus, I believe, does not do DRAM design; they do interface design, although they probably have some input into the core. XDR at least has 8 banks, but I don't get the impression they can be concurrent the same way the RLDRAM interface allows. IIRC, RLDRAM II can hit a theoretical 3.2GB/sec (400×2×4) using DDR and a 32-bit bus, vs. the 4GB I saw in the XDR reference above. I'd far rather have the 2.5ns command rate with a matched threaded CPU to hide the 8 cycles of DRAM latency. But the XDR signal-matching scheme looks interesting too.
My conclusion remains that the Cell will be one almighty media/vector engine, and if you can make your app look like a vector or a highly scheduled app, then it could get a big boost over current CPUs. Oh yeah, you heard about the Toshiba story of 40 MPEG streams (HDTV?) running at the same time; that kind of proves the point.
Funny thing, though: none of my chip-design work looks like it can be vectorized, not that I use any FPU. I suspect my random-pattern integer code will kill it the same way it kills my XP2400, assuming I could ever get a dev platform to compare.
Thanks for the great summary of RLDRAM. I’ve seen vague references to it, but nobody seems to go into much detail with it when discussing new memory technologies.
XDR at least has 8 banks, but I don't get the impression they can be concurrent the same way the RLDRAM interface allows.
I don’t think XDR has it, but Rambus has a technology called microthreading that allows for concurrent access as well as scatter-gather access from memory.
IIRC, RLDRAM II can hit a theoretical 3.2GB/sec (400×2×4) using DDR and a 32-bit bus, vs. the 4GB I saw in the XDR reference above.
If my math is right, on a 32-bit bus (XDR goes up to 128 bits), XDR hits 12.8 GB/sec at 400MHz.
My conclusion remains that the Cell will be one almighty media/vector engine, and if you can make your app look like a vector or a highly scheduled app, then it could get a big boost over current CPUs.
There is an interesting dimension to Cell that most people don't touch upon. Yes, it's going to be awesome if your app processes 128-element vectors in parallel, but that's not the only sort of concurrency it'll work with. It should be pretty good for simulations too. I worked on a network simulation that'd map pretty well to Cell, even though a lot of it is integer logic. While an SPE might not be a demon at integer code, that'd be more than made up for by the fact that eight network nodes could be simulated in parallel.
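The kind of concurrency described above is task parallelism over independent node states rather than SIMD over vectors. A minimal sketch of the idea, run serially here; the node logic and names are purely illustrative, not from the poster's simulation:

```python
# Sketch of simulating independent network nodes in parallel, one per SPE.
# The node-update logic below is a made-up, integer-only stand-in; on Cell,
# each SPE would run this routine on its own node's state concurrently.

def update_node(state):
    """Toy integer-only node step: count non-empty packets in the queue."""
    queued, processed = state
    processed += sum(1 for p in queued if p > 0)
    return ([], processed)  # queue drained, counter advanced

# Eight independent nodes, matching the eight SPEs mentioned above
nodes = [([64, 0, 128], 0) for _ in range(8)]
nodes = [update_node(n) for n in nodes]  # serial here; parallel on Cell
print([processed for _, processed in nodes])
```

Because each node's state is self-contained, the eight updates have no shared data and could run on eight SPEs with no synchronization beyond collecting results.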
The purpose of IBM's Cell blade is to bring to market an ultra-high-performance blade server. Quite frankly, I was totally caught off guard. I am very surprised at how fast development of the Cell processor is progressing. What is really amazing is that this is only the first generation of the Cell. We are going to see the 16-core iteration of the Cell come out later on.
But I am waiting to see if Microsoft decides to support NT on PowerPC again.
No wonder Apple and Sony have been getting along lately! Maybe Apple is secretly part of the developing party. It would be typical for Apple, or IBM, not to mention Apple's involvement until the very end, when it's released in the Mac.