Sun announced plans to publish specifications for the UltraSPARC-based chip, including the source of the design expressed in Verilog, a verification suite and simulation models, the instruction set architecture specification (UltraSPARC Architecture 2005), and a Solaris OS port. The goal is to enable community members to build on proven technology at a markedly lower cost and to innovate freely. The source code will be released under an Open Source Initiative (OSI)-approved open source license. The 'older' SPARC architectures were also open.
There are some really good comments this time on Slashdot about this move by Sun. Now, there are always some thick heads out there who are complaining that SUN still holds the patents for most of the technology they used for producing their processor.
My stand on this is: they didn't open-source the design for Intel and AMD to copy and release as their own product. I think this move from Sun is to help compiler writers and operating system designers take full advantage of their processor's special features.
This is an excerpt from my blog: http://amjith.blogspot.com
"My stand on this is: they didn't open-source the design for Intel and AMD to copy and release as their own product. I think this move from Sun is to help compiler writers and operating system designers take full advantage of their processor's special features."
You don't need Verilog (or VHDL) listings for a particular CPU in order to write compilers for it. In fact, they are almost useless for such a purpose. Have you seen a Verilog listing for anything more complex than a flip-flop? I didn't think so.
Yep, the core docs are "useless" for that. OS devs would need chipset documentation, which Sun isn't providing (yet). [Just docs, not circuit info to etch your own boards.]
“SUN still holds the patents for most of the technology they used for producing their processor.”
If they end up licensing under a model similar to CDDL, it will include a patent grant in the license. This, in my mind, is what makes CDDL the best license out there (go on... flame away). I mean, how can you argue with that? A patent grant in a license... who would have thought?
"They didn't open-source the design for Intel and AMD to copy and release as their own product."
I don't think their competition is going to pick up the designs and go to fab any time soon... there is a human condition called ego which must be overcome before something like that will happen.
…I think this is a dumb move. With Niagara, Sun finally had a clear edge in the CPU arena against its competitors. An 8-core, 4-HW-threads-per-core, 64-bit architecture (and these are general-purpose CPU cores, not like the Cell architecture); I mean, this thing super-duper-rocks for databases and web surfing. And Sun just gives it away?
“I think this is a dumb move”
It is a long way from a Verilog description to a working processor coming out of the fab.
Furthermore, with this, Sun might get a firm foothold in universities, which can use this design in their lectures.
And even if some Chinese company starts to sell el-cheapo T1-like devices, it will only be good for Sun at this moment, because anything that broadens the SPARC user base is good for Sun.
For me, I think it's Xilinx's turn now: I need waaaaaay bigger FPGAs to put a stripped-down T1 version into one.
It will certainly be interesting to see what comes out of this
Sun has the Rock processor coming out soon. The T1 is nice in that it puts Sun back on the map, but the processor that is going to bring their systems back to life will be the Rock. Why not give out Niagara and, like one poster said, get back into universities' minds, and show that you are committed to giving people the knowledge and support they need to use your product as a platform for innovation.
I think that Sun is very smart in making all these moves. This is the first time in years that I have found myself feeling good about Sun again. I used to be a die-hard Sun guy 10 years ago, and now I feel those feelings coming back.
“…I think this is a dumb move.”
I would just love for everyone to copy this design and let x86 finally die! And guess what: even if AMD and Intel copy this chip, Sun still benefits from the support they can sell for Solaris, as Solaris fits Niagara like a glove.
Go Sun! I look forward to seeing more of their other great products.
I know the webcast mentioned app server and web server usage, but I remember reading that the chip only has one floating-point unit so it can't handle floating-point-heavy loads, something along those lines; can someone clear this up?
Thanks
From what I understand it is not designed for floating point performance. It is designed for server apps which rely more on integer performance.
You might find more details on Sun’s website:
UltraSPARC T1 (aka Niagara):
http://www.sun.com/processors/UltraSPARC-T1/
T2000 server:
http://www.sun.com/servers/coolthreads/t2000/
Benchmark results:
http://www.sun.com/servers/coolthreads/t1000/benchmarks.jsp
Surely they don’t automatically synthesize their processor cores from Verilog!? Although that would explain the Niagara’s disappointing clock rates.
Part of the secret of the Alpha’s speed was that the engineers would manually design and tweak it down to gate and even transistor level.
That's silly. Verilog, like VHDL, is just a hardware description language. It -can- be used to synthesize directly to an FPGA or such, but it doesn't have to be. It is just like a schematic, except that you can simulate it, something you can't do with a schematic directly.
Niagara's clock rates are part of the total design, not a restriction of the process. You clearly don't understand the design principles involved. Why are you disappointed by the clock rate? Are you one of those numpties who believe high clock rate = fast CPU?
The UltraSPARC T1 has a 7-stage pipeline, whereas the Pentium 4 has more than 30 pipeline stages. Given that, the clock rate is state of the art.
pica
"The UltraSPARC T1 has a 7-stage pipeline, whereas the Pentium 4 has more than 30 pipeline stages. Given that, the clock rate is state of the art."
Never mind the P4, the Athlon pipeline gets to 2.8 GHz with 12 stages, even though with out-of-order execution it’s a lot more complex.
And what’s stopping Sun from adding a few more stages anyway? The more stages you have, the more the core is going to benefit from SMT. And the Pentium M shows that you can still have decent power consumption at 2 GHz.
…and you end up with a 5 m² die which requires 500 W to run and needs 1000 GB/s of memory bandwidth to keep it going. You are completely missing the point of Niagara. Clock rate is insignificant – go and see the benches if you don't believe me. You're trying to compare a man in an F1 car with a scythe cutting a field of wheat against a combine harvester.
The point was to balance memory access latency and throughput with execution throughput. Increasing the frequency requires compromises in other areas such as memory speed, power consumption, cooling, etc.
Never mind the P4, the Athlon pipeline gets to 2.8 GHz with 12 stages, even though with out-of-order execution it’s a lot more complex.
And what’s stopping Sun from adding a few more stages anyway? The more stages you have, the more the core is going to benefit from SMT. And the Pentium M shows that you can still have decent power consumption at 2 GHz.
You clearly don't understand TLP designs at all. All the CPUs you mentioned at higher clock rates are plagued by wait times for memory on a load miss. So you get a single stream of instructions done and wait for memory, which is hundreds of clock cycles away. By increasing clock rates all you have done is reduce the processing time, but the memory wait is still hundreds of clocks away.
A TLP processor like Niagara switches to another thread when a stall occurs and can switch to three more before the first thread's data arrives. So in effect it gets more work done (throughput) while the faster-clocked OOOE processors wait for memory. The complexity of this design is not in the core but in the highly parallel memory subsystem and caches. Niagara has to have high memory bandwidth in order to supply the 8 cores with data. It has a banked L2 cache and 4 dual-channel (144-bit) memory controllers to achieve that end.
Niagara is intended for network-facing loads like web services and transaction processing. It isn't an HPC processor. And it blows away the other chips you mentioned at doing what it was designed for. A single Niagara chip will have better TPC numbers than 1- or 2-CPU Xeon or Opteron boxes at the same price point from IBM or Dell, at 1/4 or 1/3 the power consumption. So a rack of these will blow away a rack of Dells and IBMs, and data centers can squeeze more of these systems into an environment to grow as performance demand increases.
I would like to have a CPU with 32 ARM Cortex cores @ 1 GHz+ and <15 W power consumption instead of Niagara.
How about 3000 realtime mp3 streams decoding in software? =)
Clock speed isn't everything; anyway, I doubt they'd be able to keep it to 70 W if it were 8 cores running at 2 GHz instead of 8 cores at 1 GHz 😉
The bandwidth on it is quite impressive: 220 GB/s for the internal crossbar, and the 4 DDR2 memory controllers (on die) add up to over 20 GB/s of memory bandwidth.
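Back-of-the-envelope, that "over 20 GB/s" figure works out if you assume something like DDR2-400 on four 128-bit (dual-channel) controllers; the actual memory speed isn't stated here, so treat these numbers purely as an illustration:

    # Rough sketch of peak memory bandwidth for 4 dual-channel DDR2 controllers.
    # The DDR2-400 data rate and 128-bit data width are assumptions for illustration,
    # not figures from Sun's documentation.
    controllers = 4
    bytes_per_transfer = 128 // 8       # dual channel = 128 data bits = 16 bytes
    transfers_per_second = 400e6        # assumed DDR2-400: 400 MT/s

    peak_bw = controllers * bytes_per_transfer * transfers_per_second
    print(f"aggregate peak: {peak_bw / 1e9:.1f} GB/s")   # ~25.6 GB/s, i.e. "over 20 GB/s"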
What will be interesting is what Sun eventually adds into the mix; given that the recent release of Solaris has a new TCP/IP stack, it will be interesting to see if they produce an offload engine for it, etc.
The more stages you have, the more a pipeline bubble is going to negatively impact your performance. 30 stages is probably excessive. A branch at the wrong time will invalidate 30 stages of the pipeline. Better hope you’ve got a really good compiler.
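Just to put rough numbers on that (the branch frequency and misprediction rate below are made up for illustration, not measured figures):

    # Expected pipeline-flush cost per instruction for different pipeline depths.
    # Assumes (illustratively) that a mispredicted branch flushes the whole pipeline.
    def flush_cost_per_instruction(stages, branch_freq=0.20, mispredict_rate=0.05):
        return branch_freq * mispredict_rate * stages

    for stages in (7, 12, 30):
        print(f"{stages:2d} stages: ~{flush_cost_per_instruction(stages):.2f} wasted cycles per instruction")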
SPARC chips have always been about RISC over CISC. Anyone who has done much UltraSparc assembler would know this.
It’s a long way from netlist to workable layout.
The endless hand-tweaking of the Alpha was part of its demise from a business point of view.
FYI, if you look here
http://dmoz.org/Computers/Hardware/Components/Processors/Open_Sourc…
you will find links to several open source SPARCs, like LEON SPARC (GPL) or Sun's own MicroSPARC-IIep (Sun Community Source License, not OSI-approved).
Sun has already had open CPU designs for years.
(And if you're talking about "open spec", this is the company you should look to: UNIX, TCP/IP, NFS, OpenBIOS, the Java language, the JVM, etc. They're not all "open source"/OSI-approved, but "open spec"/"open standard" has been at the core of this company from its beginning.)
Well, 6 years ago?
… that's about when Sun was at its peak (before the .com bubble burst, unfortunately).
Hope Sun does it better this time. Go!
I’m a very happy guy right now
Unlike most people here, I love Sun. As a matter of fact I just got the semen stains off my ultra 10 the other day and it's been purring along quite nicely with Solaris 10. Now, all I've got left is to install Sun Studio and all will be well.
“I just got the semen stains off my ultra 10”
your a sick nerd.
Looks impressive. Really.
So this processor cannot be beneficial then in apps like video editing and scientific visualization and folding and multithreaded game engines and so on?
It sounds like this exact processor won't be, but I would hazard an uneducated guess that a great deal of the engineering for Niagara will be applicable to a similar chip that includes CPU cores with better floating-point capabilities (provided such cores would "fit" on the die with good yields and whatnot).
“I just got the semen stains off my ultra 10”
your a sick nerd.
“you’re” a sick nerd.
heh heh
I'm gonna try to stay out of the discussion about the relative value to the community of actually having this released, since I don't know too much about hardware design, but…
It seems like a CPU design isn't of much use to the open source community. Obviously no one is going to take this and start making their own Linux/BSD-friendly CPUs. The precedent is kinda interesting though, since there are other things that would be valuable to have released. There's that open graphics chip design effort that could benefit tremendously from actual companies making their designs open (or the whole effort would be made needless, which would be even better). I'm sure the same applies to many of the other pieces that are driver trouble spots: networking chips, RAID controllers… Comparatively, CPU support seems easy.
Look, the reason Sun released the RTL description and simulation models is so that developers and academics can simulate the whole T1 processor and poke and prod at it… at no cost other than a workstation to do the simulation.
As everyone here has suggested, there’s no *economically feasible* path from the RTL to the wafer fab. But that’s not how microprocessors are designed and improved anyway. CPU designs are conceived, born, and raised in RTL on a simulator. So while Intel/AMD can’t turn this information into a cheaper/better competing processor, anyone (with an advanced CE degree) can implement extensions/improvements/optimizations to the design, simulate/test/benchmark the modifications, and submit them back to Sun (possibly for profit).
Sun remains the only entity with enough of an initial investment to turn the design into a real chip, but they open the doors for collaboration on the design. It remains to be seen how productive this opportunity will prove to be, but it cannot conceivably harm Sun's bottom line, and it will put some positive PR spin around the new line of CPUs.
Cheap x86 chips are taking market share from UltraSPARC in the server space and are used in most desktops. PowerPC, MIPS, and ARM are used in the embedded space in various SoC chips. Who except Sun is using UltraSPARC? Who will port software to UltraSPARC?
No one. If there is no software for UltraSPARC, there is no need for UltraSPARC. They have no way other than to open it; otherwise it will die slowly.
"Cheap x86 chips are taking market share from UltraSPARC in the server space and are used in most desktops. PowerPC, MIPS, and ARM are used in the embedded space in various SoC chips. Who except Sun is using UltraSPARC? Who will port software to UltraSPARC?"
You mean SPARC. All the other chips you quoted aren't vendor specific. For example, Power 4/5/970 are PowerPC implementations from IBM. UltraSPARC is a SPARC implementation from Sun. Fujitsu also makes SPARC chips called SPARC64.
SPARC has plenty of installed base. More than IBM's Power in the UNIX market, or even HP's. Their revenue hasn't been up because they are discounting heavily to sell, much more than before. But rest assured their unit volumes are higher than IBM's. So more people are buying SPARC hardware than IBM hardware but are paying considerably less.
Thanks for the rather unnecessary lesson in TLP. Nothing in your post really addresses the points I was making.
I’m not saying that its design doesn’t make sense or that it isn’t very fast at what it’s designed for. I’m just saying that it’s disappointing that it isn’t even faster.
The Niagara would benefit a lot from higher clock rates precisely because it can hide memory latencies behind fast thread switches.
Let’s compare it with the Dual-Opteron 880 at 2.4GHz. In order to factor out the clock rate vs. pipeline stages issue, let’s look at instruction latency, i.e. number of stages divided by clock rate. (This isn’t a speed benchmark, it’s just looking at how long the pipeline logic takes to process an instruction.)
Opteron: 12 / 2.4 GHz = 5 ns
Niagara: 7 / 1.2 GHz ≈ 5.8 ns
So an instruction takes longer to travel through the Niagara’s pipeline than through the Opteron’s. That’s in spite of the additional stages in the Opteron adding some extra latency due to extra latches, and more importantly in spite of the Opteron’s out-of-order pipeline being a lot more complicated.
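(The same arithmetic as a couple of lines of Python, using the stage counts and clock rates quoted above, purely as a restatement:)

    # Pipeline latency = stages / clock rate (GHz = cycles per nanosecond).
    def pipeline_latency_ns(stages, clock_ghz):
        return stages / clock_ghz

    print(f"Opteron: {pipeline_latency_ns(12, 2.4):.1f} ns")    # 5.0 ns
    print(f"Niagara: {pipeline_latency_ns(7, 1.2):.2f} ns")     # ~5.83 ns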
This suggests that either the Niagara’s implementation isn’t as good as it could be, or Sun are actually holding back on the clock rate for other reasons.
And independently of that you’ve got to wonder why they didn’t add a few more stages to the pipeline anyway.
The reasons why it didn’t clock higher are:
Power. More clock = more power = more heat = more power again.
If you increase the core speed but not the memory speed then your ability to hide memory access latency is reduced. The architecture is about balance – balance pipeline speed with memory access latency in order to maximise the amount of time spent doing useful work.
Experiments were actually done with a higher clock but performance wasn't much increased because more time was spent waiting for memory again. It's just a case of keeping the execution rate in line with the amount of data that can be fed into the cores in order to have the best efficiency.
Thanks, Jon, for a more thoughtful and interesting reply.
The Xbox 360's Xenon CPU has three similarly simple in-order cores but with a much longer pipeline at 3 GHz. So I guess the "cumulative" clock frequency is about the same while power consumption is in the same ballpark too, so that makes sense. (But does power consumption go up linearly with clock frequency?)
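To a first approximation it doesn't: dynamic CMOS power is roughly P ≈ C·V²·f, so power is linear in frequency only if the voltage stays put, and higher clocks usually need a voltage bump too. A sketch with purely illustrative numbers:

    # First-order dynamic power model: P ≈ C * V^2 * f.
    # The capacitance and voltage figures are illustrative, not real chip data.
    def dynamic_power(switched_cap_f, voltage_v, freq_hz):
        return switched_cap_f * voltage_v**2 * freq_hz

    base = dynamic_power(1e-9, 1.1, 1.2e9)
    same_v = dynamic_power(1e-9, 1.1, 2.4e9)    # double the clock, same voltage
    more_v = dynamic_power(1e-9, 1.3, 2.4e9)    # double the clock plus a voltage bump
    print(f"same voltage:   {same_v / base:.1f}x power")   # 2.0x (linear)
    print(f"higher voltage: {more_v / base:.1f}x power")   # ~2.8x (superlinear)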
The Niagara would benefit a lot from higher clock rates precisely because it can hide memory latencies behind fast thread switches.
There you're thinking of traditional OOOE single-thread CPUs again. Niagara makes a latency problem a bandwidth problem. For the clocks on Niagara to get faster, memory has to get faster.
So an instruction takes longer to travel through the Niagara’s pipeline than through the Opteron’s. That’s in spite of the additional stages in the Opteron adding some extra latency due to extra latches, and more importantly in spite of the Opteron’s out-of-order pipeline being a lot more complicated.
It doesn't matter how long the instruction takes to go through the pipeline.
Let's say on the Opteron an instruction takes 5 ns and memory access is 25 ns. The Opteron completed one instruction and stalled on the next and is waiting 25 ns.
Niagara completed one in 5.8 ns, the next instruction stalled, and memory is still 25 ns away. Niagara switched to the next thread and executes that instruction stream; assume that instruction stalls as well. Niagara switched to the next, and so on. So 25 ns later the first instruction's data arrived and Niagara can execute that instruction stream. So Niagara has done more work while the Opteron was waiting. Shortening the instruction processing time without decreasing memory latency only means that if all threads stall then the core is waiting. It is a delicate balance of efficiency. You could burn more watts waiting (increase the clock rate) or get work done and not waste energy.
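Here is a toy version of that bookkeeping, using the 5 ns / 5.8 ns instruction times, the 25 ns memory access, and 4 threads per core from the example; the assumption that every instruction misses is of course an exaggeration for illustration (and a real Opteron, being out of order, hides some of this too):

    # Toy throughput model: a core round-robins `threads` hardware threads and
    # switches on every memory stall; here every instruction is assumed to miss.
    def throughput_instr_per_ns(instr_ns, miss_ns, threads=1):
        if (threads - 1) * instr_ns >= miss_ns:
            return 1.0 / instr_ns                  # stall fully hidden by other threads
        return threads / (instr_ns + miss_ns)      # stall only partially hidden

    single = throughput_instr_per_ns(5.0, 25.0)              # Opteron-like, 1 thread
    niagara = throughput_instr_per_ns(5.8, 25.0, threads=4)  # Niagara-like core, 4 threads
    print(f"single-threaded core: {single:.3f} instr/ns")
    print(f"4-thread core:        {niagara:.3f} instr/ns")   # ~4x more work per core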
And independently of that you’ve got to wonder why they didn’t add a few more stages to the pipeline anyway.
My guess is die space and other design constraints. Adding more stages to the pipeline increases core size for all the branch prediction and stall handling logic. Meaning fewer cores per chip, and less overall system throughput. So Niagara isn't designed to be a fast single-thread CPU. The T in T1 is for Throughput. Sun isn't marketing Niagara as a single-thread monster either. The T1 is supposed to be used for transaction processing, web service loads and largely multithreaded, integer-based loads. That is typical business processing.
Niagara makes a latency problem a bandwidth problem. For the clocks on Niagara to get faster, memory has to get faster.
That depends on the workload. If you’ve got a lot of cache hits or not enough threads, then latency remains the problem and higher clock rates would help.
But as Jon pointed out, experiments showed that the resulting increase in power consumption wasn’t worth it for the workloads Sun were targeting.
Let's say on the Opteron an instruction takes 5 ns and memory access is 25 ns. The Opteron completed one instruction and stalled on the next and is waiting 25 ns.
Being out-of-order it can work around that to some degree, although I do understand your point.
But as I’d already pointed out, looking at minimum instruction latencies wasn’t meant as a speed benchmark but as a comparison of the pipeline logic implementations.
And that showed that the Niagara should be able to be clocked faster.
Adding more stages to the pipeline increases core size for all the branch prediction and stall handling logic.
Not by much. Mostly you need extra latches for separating the stages. And branch prediction stalls are handled in the same way as memory stalls: switch to another thread. Therefore I doubt whether the Niagara has much if any dynamic branch prediction hardware at all.
That depends on the workload. If you’ve got a lot of cache hits or not enough threads, then latency remains the problem and higher clock rates would help.
Not really. The L1 and L2 caches on Niagara aren't big enough for that to occur on business-processing workloads with 8 cores. Niagara is not intended to be a high-clock-rate, high-single-thread-performance chip. It is designed for throughput. If you are saying that Niagara could have been a better single-thread chip, you are right, it could have been. But that is not the goal.
You would get higher single-thread performance but not enough throughput. Look at the benchmarks: Niagara beats the single-thread performance monsters in throughput at much lower power consumption.
But as Jon pointed out, experiments showed that the resulting increase in power consumption wasn’t worth it for the workloads Sun were targeting.
You misunderstood what Jon said. To quote Jon:
"If you increase the core speed but not the memory speed then your ability to hide memory access latency is reduced."
This is exactly what I said in my response to you.
Not by much. Mostly you need extra latches for separating the stages. And branch prediction stalls are handled in the same way as memory stalls: switch to another thread. Therefore I doubt whether the Niagara has much if any dynamic branch prediction hardware at all.
Exactly. Niagara doesn't need the branch prediction logic because it isn't trying to be a fast single-threaded chip. So in exchange more cores could be squeezed onto the chip for more throughput in lieu of the extra circuitry needed for branch prediction.
What you are trying to say is that adding more stages would increase performance because the clock can be increased.
What Jon and I are saying is: not unless memory becomes faster, so the chip isn't waiting for data. Otherwise the increased energy expenditure makes it wasteful.
Whoa, slow down there. Apples != oranges. You're comparing instruction latency of SPARC code with x86 code. This is completely pointless.
"Yeah, my '86 Chevy Citation is a superior piece of equipment to your Harley Davidson because I've got more tires than you."
1 SPARC instruction != 1 x86 instruction.
Sorry. Also, SPARC forces stuff like branch prediction into the assembly, so instruction latency is less of an issue, since you're less likely to have to flush the pipe. Pick up a computer hardware 101 book.
You're comparing instruction latency of SPARC code with x86 code. This is completely pointless.
Kneejerk reaction. You have made no effort to understand the point I was making.
Instruction latencies are obviously no benchmark for overall processor speed, but were used here to compare the pipeline implementations.