If you thought Sun’s chip division had already gone mad when they announced and built the Niagara (the UltraSPARC T1), you’ll be happy to know that, with the first Niagara servers out the door, they haven’t exactly been resting on their laurels. Niagara II is on its way: like the T1, it has 8 cores, but now with 8 threads each instead of 4, adding up to a total of 64 threads (the T1 has 32). And, instead of the much-critizised single floating point unit per processor, the Niagara II will feature one floating point unit per core. The chip is set to be released in 2007, at an initial speed of 1.4 GHz.
1.4ghz pah!
my celeron beats that by a mile
There is no appropriate mod to mod this post down. If you choose “I disagree with this poster’s opinion” you are informed that you should instead reply to their post and correct them. Well, frankly, yet another condemnation of the MHz myth isn’t going to help anyone (we’ve all seen it explained a million times before), and it’s not in the forum’s best interest, so I will start modding these types of posts off-topic.
It could also be a joke. I believe that option needs to be added.
10 points to jeremy 🙂
CPU frequency does not directly correspond to performance.
For example, at a given speed, say 1.5 GHz, for a specific task you might see something like this:
Worse____________________________Better*
Celeron < Pentium 4 < Athlon 64 < Itanium
There are many other factors, such as the memory controller, cache sizes, pipeline depth, internal registers, etc., that affect performance.
Also, the higher the CPU clock speed, the more heat is generated, so for thermal reasons it might be wise to run a chip at a lower speed so you can make it bigger.
*Note: I do not want to argue about the performance ranking shown above; it's just a concept example.
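To put the point another way: a rough first-order model is performance ≈ (instructions per cycle) × (clock frequency), so a lower-clocked chip with a better IPC can come out ahead. A minimal sketch of that arithmetic; the clock and IPC numbers here are made up purely for illustration, in the same spirit as the footnote above:

```python
# Rough model: useful work per second ~ IPC * clock frequency.
# The IPC and clock figures below are invented purely for illustration;
# they are not measurements of any real chip.
chips = {
    "high clock, low IPC": {"clock_ghz": 3.0, "ipc": 0.6},
    "low clock, high IPC": {"clock_ghz": 1.5, "ipc": 1.5},
}

for name, c in chips.items():
    mips = c["clock_ghz"] * 1000 * c["ipc"]  # millions of instructions per second
    print(f"{name}: ~{mips:.0f} MIPS")
```

With these made-up numbers the 1.5 GHz chip retires more instructions per second than the 3.0 GHz one, which is all the "MHz != performance" point amounts to.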
“much-critizised”!?
Anyway, I hope it’s not too little, too late. I myself had (and occasionally still have) the pleasure of working with a handful of UltraSPARC workstations and they were really neat. Sun kind of missed the momentum to build critical mass on their own platform and went x86/Opteron. Stupid mistake, IMO.
But then again, here’s Apple going Intel… And IBM building console processors.
This is a small speed bump for 2007. Wouldn’t 65 or 45 nanometer make it easier to bump up the speed – or are they furiously trying to keep the pipeline short?
I was always under the impression that they were trying to sell this on sheer load and reliability. If that’s the case, why bother with the investment?
I’d be interested to see non-Sun benchmarks on the Niagara, and how well it stacks up against faster, but not so core/thread endowed, competitors.
Clock speed is not everything; you're just too used to Intel. Plus, SUN wants their processors to remain cool and consume less power. The space gained by moving to 65 or 45 nanometers will be used to add the extra FPUs and the extra threads.
in a word, impressive
congrats SUN, please make 2006 as great as 2005
There isn't really a reason to push the clock much higher, since RAM latency isn't really changing at all. Going out to RAM costs approximately 100 cycles. So if every operation were a memory access, a 100 MHz computer and a 10 GHz CPU would be about equally fast, except that the 10 GHz one would be spinning its tires waiting for the memory.
Thus, in a lot of applications, memory is the bottleneck.
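To attach rough numbers to that claim (treating DRAM latency as a fixed ~100 ns, which is my own assumption for illustration, and pretending every single operation misses all the way out to memory):

```python
# Back-of-envelope version of the point above: if every operation has to
# wait on DRAM, the (roughly fixed) memory latency dwarfs the cycle time,
# so clock speed barely matters. The ~100 ns latency is an assumption.
DRAM_LATENCY_NS = 100.0

for clock_hz in (100e6, 10e9):               # 100 MHz vs 10 GHz
    cycle_ns = 1e9 / clock_hz                # length of one clock cycle in ns
    per_op_ns = cycle_ns + DRAM_LATENCY_NS   # fully serialized memory accesses
    print(f"{clock_hz / 1e9:5.1f} GHz: ~{per_op_ns:.1f} ns per memory-bound op")
```

Under those assumptions the 100 MHz machine takes about 110 ns per operation and the 10 GHz machine about 100 ns, i.e. a 100x clock advantage buys almost nothing when memory dominates.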
If you look at the details behind AMD's performance rating you'll see that often they just bump the cache size instead of (or along with) increasing the clock speed, to limit the trips to memory.
If you look at CPU over-clocking favourites like Super Pi, I'm assuming those workloads are small enough to fit inside the CPU's cache and thus do not need to access memory, nicely isolating the CPU itself.
However, for many real-world workloads caching is very difficult and sometimes not very practical, if not completely useless: for example, if you are walking through a large array, or grabbing 'random' addresses (random from the perspective of the memory addresses the CPU sees), like in a database or a web/file server.
In those cases more cycles give us practically nothing. If you get a die shrink, you are probably better off investing it in methods to minimise memory latency rather than in methods to increase frequency.
If you have a parallelisable algorithm then frequency does not matter, but throughput does. Take ((the number of cycles per second) * (the number of instructions per cycle)) / (a factor approximating how often you are doing nothing at all because every CPU thread is waiting for I/O) and try to maximise that.
Sun could probably push their clock rate a lot higher if they wanted to, but would it improve performance? With more threads per core it is virtually guaranteed that every core will have something to process during any given cycle.
Also, another added benefit is a reduction in heat. The faster you run a processor, the hotter it gets. When you have a lot of servers the heat adds up, both in the cost of the electricity to power the cores and the cost to cool them all down.
Frankly, until memory latency decreases by a factor of at least 10, if not 100, increasing clock rates will yield practically no benefit for almost all applications.
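A toy model of the threads-per-core point above; the work and stall cycle counts are assumptions picked only to show the shape of the effect, not Niagara specifics:

```python
# Toy model of latency hiding: each thread does a few cycles of useful work,
# then stalls on memory for a long time. With enough threads per core, the
# core almost always has a ready thread to switch to. All numbers here are
# illustrative assumptions, not real Niagara figures.
def core_utilization(threads, work_cycles=25, stall_cycles=100):
    runnable = work_cycles / (work_cycles + stall_cycles)  # one thread's duty cycle
    return min(1.0, threads * runnable)                    # capped at a fully busy core

for t in (1, 2, 4, 8):
    print(f"{t} threads/core -> ~{core_utilization(t):.0%} busy")
```

With these assumed numbers a single thread keeps the core only ~20% busy, while 8 threads are enough to keep it saturated, which is exactly the trade Niagara makes instead of chasing clock speed.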
Thanks Cap’n! Someone who knows what they are talking about and can explain it well.
MHZ != Performance
I was always under the impression that they were trying to sell this on sheer load and reliability. If that’s the case, why bother with the investment?
I’d be interested to see non-Sun benchmarks on the Niagara, and how well it stacks up against faster, but not so core/thread endowed, competitors.
I can’t actually tell if you are joking or not. This topic is about N2, not the released T1 chip. N2 has substantial improvements over T1 and is not just a speed bump. And can you explain which benchmarks you would like to see? T1 is only going to do well on thread-rich loads. It’s an 18-wheel artic, not an F1 car.
Um, I was referring to the parent's suggestion of moving to 65/45 nm for increased speed. My point is: why bother, when they can create a server CPU that can process more data in parallel, which is exactly what I want out of a server.
And as the first generation of Niagara is already with us, I’d love to hear real world examples of how useful it is; and that would give me a better idea of what to expect from N2.
For the past several years, Intel et al. have been doubling their transistor counts on things like micro-op fusion, branch prediction, massive instruction windows, and double-pumped adders. Moore's Law has been holding with regard to transistor count and gate length, but what about real-world performance? Has anything besides Quake 3 Arena framerates scaled with Moore's Law?
I think Sun is realizing that CPU design is not as complicated as Intel thinks it is. Start with a good architecture, add a robust and scalable memory subsystem, and pile on the pipelines. As many pipes and memory channels as the customer cares to pay for. Keep the pipes full of code and data, and no one can touch you in performance/watt/price.
If your workload isn’t parallelized, then the best bang for your buck might truly be to rewrite your software.
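In that spirit, the kind of rewrite that pays off on a chip like this is splitting the work into many independent units. A minimal sketch, where handle_request is a hypothetical stand-in for whatever independent per-request work a server actually does:

```python
# Sketch: fan independent requests out over a pool of worker threads, the
# embarrassingly parallel shape of work a throughput-oriented chip rewards.
# handle_request is a hypothetical placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor

def handle_request(req_id):
    # Stand-in for real per-request work (parsing, a DB lookup, rendering, ...).
    return f"request {req_id} done"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=32) as pool:
        for result in pool.map(handle_request, range(100)):
            print(result)
```

The point is only the shape: many small, independent tasks in flight at once, rather than one long serial computation that a single fast core would have to chew through.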
“I think Sun is realizing that CPU design is not as complicated as Intel thinks it is. ”
Exactly!
Actually, processor design is much easier when you start with the memory system and design it to handle truly random memory accesses at rates closer to these "slower CPU clocks", i.e. a few ns rather than the 100-300 ns typically seen on x86 for full cache/TLB misses.
Once you have this memory design, using those memory slots as an L2 cache that's as big as your usual DRAM isn't very difficult, especially with 4+ MTA. I am surprised they are moving to 8-way threading, though. I thought the current consensus was that 4-way is optimal: 4-way costs less to implement and should clock faster with the smaller SRAMs needed for the register files. Then again, 8-way threading will make the main DRAM look relatively 2x faster, but it will also split the cache over more threads.
I also like the idea of a shared FPU, which can be designed to go as fast as possible in a dumb fashion: one entry per clock, many cycles of latency before a result is available or ready to be accepted. Just like memory! Quite a few threads can easily share an FPU and all think they have their own private FPU box that runs just fast enough.
transputer guy
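A rough sketch of the shared-FPU idea above; the latency and issue-rate numbers are assumptions for illustration, not Niagara II figures. A fully pipelined FPU that accepts one operation per clock keeps up with many threads as long as each thread only issues an FP op every several cycles, with each thread simply seeing a long but tolerable latency:

```python
# Toy model of a shared, fully pipelined FPU: it accepts at most one op per
# cycle and returns each result FPU_LATENCY cycles later. If T threads each
# issue an FP op only once every ISSUE_GAP cycles, the pipe never backs up
# while T <= ISSUE_GAP. All numbers are illustrative assumptions.
FPU_LATENCY = 30   # cycles until a result comes back (assumed)
ISSUE_GAP = 8      # cycles between FP ops from any single thread (assumed)

for threads in (2, 4, 8, 16):
    demand = threads / ISSUE_GAP             # FP ops per cycle offered to the FPU
    verdict = "fits" if demand <= 1.0 else "oversubscribed"
    print(f"{threads:>2} threads: {demand:.2f} ops/cycle -> {verdict} "
          f"(each result still takes {FPU_LATENCY} cycles)")
```

In this model each thread perceives a private FPU that is "fast enough" in throughput terms, even though every individual result takes many cycles, which is the same trick the memory system plays.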
About the FPU: maybe this is their vision as well, as the eventual replacement for SPARC VI, so that one can have big SUN systems with thin clients loaded off them. Couple that with the eventual improvement and scalability of desktop-based software, and Niagara will be well placed to take advantage of the future direction of software.
Hopefully, over the next several years, SUN will spend time threading every possible aspect of their software stack so as to take advantage of their new line of processors.
“…include support for multiprocessor servers and probably arrive in 2007.”
Imagine going off to SUN and purchasing a 64-way Niagara II system: 4096 threads running simultaneously. Now that's what I call a data cruncher, and hopefully, with the vastly improved FPU, its customer scope will widen out further to include those in the scientific community, etc.
It’s definitely a bold stroke by SUN to go massively multithreaded in their processor designs. They’re sacrificing single-thread performance and instead choosing throughput. The processor is aimed at the server market, where throughput is most likely the desired performance measurement. The data passing through will be a lot of independent threads, which is just perfect for this kind of processor. The likelihood that one program can utilize all these cores and threads is low, but put a lot of instances of programs in there instead, to serve multiple users, and you'll be quite happy.
I’d be interested to see non-Sun benchmarks on the Niagara, and how well it stacks up against faster, but not so core/thread endowed, competitors.
SPEC benchmarks have to be submitted for independent validation with full disclosures. Go to the SPEC site and read the benchmark results for yourself.
Appserver
http://www.spec.org/jAppServer2004/results/res2005q4/
JBB
http://www.spec.org/jbb2005/results/res2005q4/
The web server one does not appear to be published yet, but they could not have mentioned it in their press releases if it had not been at least submitted.
The point of the SPEC benchmarks is that if you configure everything the same way as they say in the disclosures, you should be able to get the exact same result.
8 cores x 8 threads, “one floating point unit per core”
I don’t know, maybe it’s just too early in the morning for me, but I find this stuff fairly cool (and not in the sense of low power consumption, but that will probably be true as well).
I think that recently (within the last year or so) Sun has pushed out quite a lot of great stuff, hardware- and software-wise. It seems to me that they have managed quite well to gather some speed, and I once again have the feeling that Sun is a dynamic company instead of a slow aged granny.
Making a smaller core also means you have more space to fill with more cores or more cache; then it would be a good thing. But going smaller just to speed up... I don't think that's a good idea. Anyway, how long have "we" been expecting a 4 GHz Intel/AMD CPU? Half the CPU speed with double (or more) the cache size is a better way.
T1/Niagara is not as sensitive to cache size as you might think. Sure, it helps, but cache is predominantly there to solve the latency problem on single-threaded behemoths. If a load stalls, Niagara just does something else. So the latency problem on Niagara becomes a bandwidth problem. Fortunately, this is an easier problem to solve than making memory faster.
True, but unfortunately for many, the assumption is that if one wishes to have a bandwidth of 2100 MB/s, the memory must be 266 MHz DDR, when in reality SGI showed that 2100 MB/s could be achieved with their (albeit proprietary) memory running at 66 MHz.
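One way to sanity-check those numbers: peak bandwidth is just transfers per second times bytes per transfer, so a slower clock can deliver the same bandwidth over a wider path. The bus widths below are my own assumptions chosen to make the arithmetic land near 2100 MB/s, not documented designs:

```python
# Peak bandwidth = (millions of transfers per second) * (bytes per transfer).
# The bus widths here are assumed for illustration, not documented designs.
def peak_mb_per_s(mega_transfers_per_s, bus_bytes):
    return mega_transfers_per_s * bus_bytes

# DDR-266 ("PC2100"): 266 million transfers/s over a 64-bit (8-byte) bus.
print(peak_mb_per_s(266, 8))    # ~2128 MB/s
# Roughly the same at a 66 MHz clock with an assumed 256-bit (32-byte) path.
print(peak_mb_per_s(66, 32))    # ~2112 MB/s
```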
For me, I’m interested in seeing SUN push performance forward using their smarts in the engineering department rather than simply doing the easy, inefficient, half-baked thing of constantly pushing the clock speed higher.
We all now see the inevitable side effects of higher clock rates; and for many, it is now a time when people are willing to say that maybe SUN was right all along: clock speed alone isn’t the panacea for performance that Intel would like to make it out to be.
What I would like to see from SUN, however, is a greater effort to either upgrade their workstation line, especially the Blade 150, or at least offer the general public a motherboard and processor kit for a modest price of around $600-$700 with a basic built-in video card (something like a Radeon 9000 would do the trick), to at least keep the ecosystem going rather than simply being relegated to a niche.
Hi folks,
If you want to see a lot of information on how these new systems perform, please review the various blog entries by Sun engineers. Rich McDougal has them summarized at http://blogs.sun.com/roller/page/rmc#welcome_to_the_cmt_era1 – I think you’ll find many of the entries very informative. In particular, read http://blogs.sun.com/roller/page/bmseer, which talks about performance on typical server apps like SAP, appservers, etc.
Niagara 2 will also integrate an 8x PCI Express interface, instead of using a J-Bus connection to another chip for PCIe. It will also support SMP, so dual Niagara 2 systems are possible. It’ll use a different FB-DRAM memory, and will probably have roughly 50 GB/sec(*) of main memory bandwidth. Also, multiple 10 Gbit Ethernet connections are likely(*).
Currently, the FPU on a Niagara has a 40-cycle latency, due to access synchronization and the need to move bytes around the chip. With each core having its own FPU, the latency should drop quite significantly, so expect a really big gain in floating point stuff.
If all this comes to pass, all I can say is, well, nothing, since my jaw will be on the floor.
*(This is speculation, suggested at http://www.aceshardware.com )