Linked by Thom Holwerda on Thu 26th Jul 2007 22:01 UTC
Oracle and SUN
Sun's latest Niagara and Rock details have reached El Reg, and they confirm that the hardware maker is up to some very ambitious stuff. First off, Sun looks set for the imminent release of its first Niagara II-based servers - the T5120 and T5220 systems. Customers will see 1U and 2U boxes, respectively, each with one of the 'Niagara II' or (more formally) UltraSPARC T2 chips. It looks like the eight-core, 64-thread chip will arrive at 1.5GHz.
2048 threads.. in hardware..
by helf on Thu 26th Jul 2007 22:06 UTC
helf
Member since:
2005-07-06

omfg...

0_o

Browser: Mozilla/4.0 (compatible; MSIE 6.0; Windows 95; PalmSource; Blazer 3.0) 16;160x160

Reply Score: 2

JonathanBThompson Member since:
2006-05-26

How long before you add one of those beasts to your menagerie, helf? ;)

Where I could see that being utilized in a non-server space currently is with neural networks, if someone really wants to get into them. That, or ray tracing with something like POV-Ray, which AFAIK doesn't currently divide a frame into tiles across threads but can render one frame per thread; for that, it'd be a great machine to play with ;) Of course, most people don't feel a need to do computer animation to that degree at home...

Of course, for other computer geeks, this would be an interesting system (2048 hardware threads?) to run a true microkernel OS in...

Reply Score: 3

Lazarus Member since:
2005-08-10

Well, I don't know about Helf, but I'd sell my house for a couple of those ;^)

Reply Score: 3

DoctorPepper Member since:
2005-07-12

I'm thinking you'd pretty much have to. Along with your car. Maybe your wife and first-born child too ;-)

Reply Score: 1

hobgoblin Member since:
2005-07-06

hell, running a microkernel on that may even get some on-par speed out of the kernel ;)

Reply Score: 2

chekr Member since:
2005-11-05

Browser: Mozilla/4.0 (compatible; MSIE 6.0; Windows 95; PalmSource; Blazer 3.0) 16;160x160

Sorry, but I just have to hit this on the head; why is it that I need to know what browser and platform a person is posting from?

Reply Score: 3

bsharitt Member since:
2005-07-07

I think it's some retarded thing built into the OS comments that automatically puts the useragent of mobile devices in the comments. While I can understand their desire to keep stats on mobile devices, it is rather pointless to post the useragent on the end of every comment.

Reply Score: 5

smitty Member since:
2005-10-13

Yes, OSNews inserts it. But why? I've always wondered too.

Reply Score: 5

But...
by Xaero_Vincent on Fri 27th Jul 2007 00:17 UTC
Xaero_Vincent
Member since:
2006-08-18

Is it as powerful as a 150 HP Case steam tractor with 4000+ pounds of torque at zero RPM? :-P

Reply Score: 1

Market
by tony on Fri 27th Jul 2007 04:07 UTC
tony
Member since:
2005-07-06

Fascinating concept, but it will be interesting to see if it gets traction in the market. From what I can tell, the Niagaras haven't exactly gone gangbusters yet.

Reply Score: 1

RE: Market
by flanque on Fri 27th Jul 2007 09:46 UTC in reply to "Market"
flanque Member since:
2005-12-15

As enterprise consolidation projects and virtualisation gain traction, I'd expect this sort of hardware to gain traction too.

Edited 2007-07-27 09:46

Reply Score: 3

RE: Market
by kaiwai on Fri 27th Jul 2007 12:15 UTC in reply to "Market"
kaiwai Member since:
2005-07-06

Fascinating concept, but it will be interesting to see if it gets traction in the market. From what I can tell, the Niagaras haven't exactly gone gangbusters yet.


On what basis is that assumption being made? Sure, Sun doesn't run around every time they make a sale, but I am sure, given the sales volume so far, that things are going well.

It's going to take a while for Sun to get back on track - they made the first good move, getting rid of Scott, and now their focus is on products and addressing customer needs rather than senselessly bashing Microsoft.

Reply Score: 2

RE[2]: Market
by tony on Fri 27th Jul 2007 13:16 UTC in reply to "RE: Market"
tony Member since:
2005-07-06

On what basis is that assumption being made? Sure, Sun doesn't run around every time they make a sale, but I am sure, given the sales volume so far, that things are going well.


I haven't been able to find any sales numbers on the T1s, but even in Sun's blogs I don't see much about it, nor do I see Sun enthusiasts going crazy about them (they seem primarily focused on all things [Open]Solaris). In the general market, I haven't seen any deployment, nor have I seen anything in the non-Sun sphere regarding them. Anecdotal, to be sure, but it still seems that it hasn't given any existing platform (non-Sun or Sun) a run for its money. Yet, at least. It may be an idea ahead of its time.

Reply Score: 1

RE[3]: Market
by kaiwai on Fri 27th Jul 2007 13:57 UTC in reply to "RE[2]: Market"
kaiwai Member since:
2005-07-06

I haven't been able to find any sales numbers on the T1s, but even in Sun's blogs I don't see much about it, nor do I see Sun enthusiasts going crazy about them (they seem primarily focused on all things [Open]Solaris). In the general market, I haven't seen any deployment, nor have I seen anything in the non-Sun sphere regarding them. Anecdotal, to be sure, but it still seems that it hasn't given any existing platform (non-Sun or Sun) a run for its money. Yet, at least. It may be an idea ahead of its time.


I'd say it's a deliberate decision on Sun's part to make the operating system itself sell the hardware - if they started going on about their hardware they risk zapping any possible momentum out of OpenSolaris.

They'll mention the new hardware but you'll find that most of the emphasis will be on the operating system and what it brings to the customer as a whole package rather than just singling out the hardware for special attention.

Reply Score: 2

RE[4]: Market
by Robert Escue on Fri 27th Jul 2007 14:24 UTC in reply to "RE[3]: Market"
Robert Escue Member since:
2005-07-08

Is that why Sun has this site:

http://cooltools.sunsource.net/

It takes more than the OS and nice hardware to make an application perform well.

Reply Score: 2

RE[5]: Market
by kaiwai on Fri 27th Jul 2007 14:52 UTC in reply to "RE[4]: Market"
kaiwai Member since:
2005-07-06

Hence the reason I said it's the 'whole package' - it's more than just hardware and operating system.

Reply Score: 2

RE[6]: Market
by Robert Escue on Fri 27th Jul 2007 15:06 UTC in reply to "RE[5]: Market"
Robert Escue Member since:
2005-07-08

Unfortunately, it's not that simple. I just went through a three-month debacle with a customer who said they were experiencing performance problems with their application on our hardware. Until they brought in an outside consultant, who determined that the hardware we were using was not the issue, they insisted that we jump through hoops to fix their problems. As it turns out, the performance problem was due to a complex security policy that they had created without considering how much of a performance hit it would cause.

While Sun sells products and services and can assist in these cases, it is also up to the people who write and maintain applications to assist in the performance improvement process. The OS and hardware vendor can only go so far.

Reply Score: 2

RE: Market
by DoctorPepper on Fri 27th Jul 2007 18:35 UTC in reply to "Market"
DoctorPepper Member since:
2005-07-12

While I certainly can't vouch for the total sales numbers of the Sun systems with the Niagara processors in them, I can say that my company has 120 Sun T-2000 servers in production right now, each of them with an eight-core Niagara processor in it. I don't know how many T-2000s we have that aren't customer-facing, just the 120 in our datacenters.

They run pretty darned well too!

Reply Score: 1

i can't remember where i read it but
by matthekc on Fri 27th Jul 2007 11:52 UTC
matthekc
Member since:
2006-10-28

I read it in one of this month's Linux magazines at a local Barnes & Noble. If anyone can find it, there was an interesting article about the potential power of stream processing with GPUs. You have to use special compilers to optimize the code, but the benefits were in the 2x to 3x range with one GPU. With a good computer and two graphics cards you may not have to mortgage the house to get good computing power.

Reply Score: 1

I don't know why I should be impressed.
by aliquis on Fri 27th Jul 2007 12:33 UTC
aliquis
Member since:
2005-07-23

Since it's still probably not very fast (the 64-thread one, that is, but really any of them compared with anything else similarly specced or with as many sockets). I don't care very much that each core can run multiple threads if they run very slowly anyway.

Reply Score: 1

Robert Escue Member since:
2005-07-08

This is the result of running sysbench on MySQL on a "try and buy" T2000 I had for testing. This is without the benefit of using Sun's CoolTools or the tweaks mentioned by Luojia Chen in this article:

http://developers.sun.com/solaris/articles/mysql_perf_tune.html

sysbench v0.4.4: multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
WARNING: Preparing of "BEGIN" is unsupported, using emulation
(last message repeated 39 times)
Running the test with following options:
Number of threads: 40

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Maximum number of requests for OLTP test is limited to 10000
Threads started!
Done.

OLTP test statistics:
queries performed:
read: 140196
write: 50070
other: 20028
total: 210294
transactions: 10014 (404.07 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 190266 (7677.38 per sec.)
other operations: 20028 (808.14 per sec.)

Test execution summary:
total time: 24.7827s
total number of events: 10014
total time taken by event execution: 981.9074
per-request statistics:
min: 0.0402s
avg: 0.0981s
max: 1.1433s
approx. 95 percentile: 0.1733s

Threads fairness:
events (avg/stddev): 250.3500/1.53
execution time (avg/stddev): 24.5477/0.04

This is with MySQL on the root disk (mirrored) and MySQL compiled with --prefix=/usr/local/mysql(version) --enable-thread-safe-client and sysbench compiled with defaults using gcc 3.4.3 that ships with Solaris 10.
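
The rates in the sysbench output above can be cross-checked from the raw counts. The short sketch below only re-derives numbers already printed in the report; the variable names are made up for illustration. It also applies Little's law (requests in flight = throughput x average latency) as a sanity check, which should land close to the 40 client threads used in the run.

```python
# Figures copied from the sysbench report above.
transactions = 10014
rw_requests = 190266
total_time_s = 24.7827
avg_latency_s = 0.0981
total_event_time_s = 981.9074

# Throughput is simply count divided by wall-clock time.
tps = transactions / total_time_s
rw_per_s = rw_requests / total_time_s

# Little's law: average requests in flight = throughput x average latency.
in_flight = tps * avg_latency_s

# The same quantity computed from the summed per-event execution time.
in_flight_alt = total_event_time_s / total_time_s

print(f"transactions/s:       {tps:.2f}")
print(f"read/write req/s:     {rw_per_s:.2f}")
print(f"in flight (Little):   {in_flight:.1f}")
print(f"in flight (sum/time): {in_flight_alt:.1f}")
```

Both estimates come out just under 40, which suggests the 40 client threads were busy for nearly the whole run and that the reported per-second rates are consistent with the raw totals.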

Do I think the performance could be better? Yeah, I could probably tweak a few more transactions out of it. The bottom line is the CPU wasn't even breathing hard to produce these results; sar -u data from the test is below:

time      %usr  %sys  %wio  %idle
17:05:00    21     9     0     70
17:10:00    25     9     0     67
17:15:00    25     9     0     67
17:20:00    26     8     0     65

My testing got cut short by an outside event, but in any case the results proved to me that the UltraSPARC T1 is a serious processor. In the buy we made for updating one of our networks I asked for two T1000s for web and database servers. For any future purchases we make, these machines will be at the top of our list for consideration.

Do I think the T2 CPUs will perform any better? You bet! Just because they don't run at 3+ GHz doesn't mean they don't perform well.

Reply Score: 3

Luminair Member since:
2007-03-30

I don't know how that compares to anything else, but the size is impressive!

Reply Score: 1

Slow but many cores?
by Kebabbert on Fri 27th Jul 2007 13:26 UTC
Kebabbert
Member since:
2007-07-27

Why wouldn't you be impressed by many slow threads? The answer is simple:

Every CPU will have cache misses, and conventional designs therefore spend lots of logic and die area on cache. Studies by Intel show that a normal server CPU idles around 60% of the time under full load, because of cache misses.

Sun deals with this ancient problem in another way in its new Niagara CPUs. A thread is run until a cache miss, and then immediately it switches to another thread in ONE clock cycle, which a normal CPU cannot do. Therefore a Niagara idles around 5% or less. Therefore the Niagara at 1.4GHz easily outperforms dual Opterons at 2.5GHz in threaded apps, like web servers etc. For a few threads, the Niagara sucks. But with many threads, it excels, which benchmarks and reports confirm.

The Niagara has generated quite many sales, and generates lots of money for Sun.
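
The latency-hiding argument above can be put into a rough back-of-the-envelope model. The sketch below is an idealized toy, not a description of the real Niagara pipeline: it assumes every thread alternates a fixed number of compute cycles with a fixed memory stall, that switching is free, and that a core issues from one thread at a time. The function name and the cycle counts are hypothetical.

```python
def core_utilization(threads: int, run_cycles: int, stall_cycles: int) -> float:
    """Fraction of cycles a core does useful work in an idealized model.

    Each thread computes for run_cycles, then waits stall_cycles on memory.
    While one thread waits, the core runs another ready thread at no cost.
    """
    if threads < 1 or run_cycles < 1 or stall_cycles < 0:
        raise ValueError("threads and run_cycles must be >= 1, stall_cycles >= 0")
    # One thread alone keeps the core busy run/(run+stall) of the time;
    # extra threads fill the gaps until the core saturates at 100%.
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))


# Hypothetical workload: 20 cycles of work per 60-cycle memory stall.
for n in (1, 2, 4, 8):
    print(n, "threads ->", core_utilization(n, run_cycles=20, stall_cycles=60))
```

In this toy model a single thread leaves the core idle three quarters of the time, while four threads are enough to cover the stall completely. It also shows why a lightly threaded workload gains nothing from the extra hardware threads.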

Reply Score: 5

RE: Slow but many cores?
by diegocg on Fri 27th Jul 2007 13:50 UTC in reply to "Slow but many cores?"
diegocg Member since:
2005-07-08

This machine however has 64 threads per core. I fail to see how SMT can improve the situation much here.

I mean, you can already keep Niagaras busy with the current number of threads. With 64, most of the threads are going to be waiting for a cache miss from the other threads.

Reply Score: 2

RE[2]: Slow but many cores?
by Luminair on Fri 27th Jul 2007 15:36 UTC in reply to "RE: Slow but many cores?"
Luminair Member since:
2007-03-30

That 64 thread hardware is being released in the future... just like increasingly threaded software is being deployed in the future? ;)

OpenSolaris Xen development is active. One might expect a focus on virtualization when 2048 threads ship ;)

Reply Score: 2

RE[3]: Slow but many cores?
by Robert Escue on Fri 27th Jul 2007 16:21 UTC in reply to "RE[2]: Slow but many cores?"
Robert Escue Member since:
2005-07-08

Hopefully the new hardware will support Logical Domains (LDoms) like the UltraSPARC T1's do now:

http://www.sun.com/bigadmin/hubs/ldoms/

Reply Score: 3

RE[2]: Slow but many cores?
by Wes Felter on Fri 27th Jul 2007 22:12 UTC in reply to "RE: Slow but many cores?"
Wes Felter Member since:
2005-11-15

Niagara II has 8 cores and 8 threads per core. Because the Niagara II core is twice as wide as the Niagara core, they doubled the number of threads per core from 4 to 8, maintaining a ratio of 4 threads per integer execution unit.

Reply Score: 3

RE: Slow but many cores?
by FunkyELF on Fri 27th Jul 2007 17:57 UTC in reply to "Slow but many cores?"
FunkyELF Member since:
2006-07-26

Sun deals with this ancient problem in another way in its new Niagara CPUs. A thread is run until a cache miss, and then immediately it switches to another thread in ONE clock cycle, which a normal CPU cannot do.

Is that what Intel's Hyperthreading chips do? Switch between 2 threads in one clock cycle?

Reply Score: 1

RE[2]: Slow but many cores?
by bariole on Fri 27th Jul 2007 19:07 UTC in reply to "RE: Slow but many cores?"
bariole Member since:
2007-04-17

It's conceptually similar technology and used for the same end effect.

But it is better implemented in Niagara: instead of being an afterthought, it has been Niagara's modus operandi from day one.

Edited 2007-07-27 19:08

Reply Score: 2

Questions about new SPARC architecture
by james_parker on Fri 27th Jul 2007 20:34 UTC
james_parker
Member since:
2005-06-29

Some questions I have around the new SPARC architecture:

- To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.

- How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.

- With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?

- How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?

I don't know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.

Reply Score: 2

renox Member since:
2005-07-06

I'm not from Sun, but at least two of your questions are not very useful: the answer to them being 'it depends'.

>how much will software architecture and design have to change?

Obviously software must be threaded to use this kind of computer efficiently (which is also true for multicore CPUs). Some problems are already 'embarrassingly parallel', so no design change is needed; others have to be recoded from scratch.

>With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?

Well, once memory latency is not the bottleneck anymore (due to thread interleaving), then the next bottleneck is memory bandwidth or CPU usage or IO, etc.
It's not possible to answer your question in a general way as it depends on the cache usage of the code: if it has high locality, CPU becomes the bottleneck; otherwise it's memory bandwidth.

Reply Score: 2

james_parker Member since:
2005-06-29

<quote>I'm not from Sun, but at least two question of yours are not very useful: the answer of these being 'it depends'. </quote>

While I certainly agree that the short answer to both of these questions is "it depends," I disagree about their usefulness.

The first could have been expressed "How and how much..."; it was intended as an opening to draw out the general guidelines and principles involved (beyond "write concurrently as much as possible"). Creating a large number of concurrent threads might perform well with this many cores but perform very poorly on other architectures (including other Sun platforms).

Those of us who must design code to perform efficiently on all supported platforms, with a minimum of "#ifdef" code will need to understand how to do this, preferably without having to discover it all independently.

In addition, the cost of mutexes relative to other operations may change, and understanding this will matter (as will knowing whether alternate atomic instructions that may work better are available, e.g., a single "spinlock" instruction that could trigger a thread switch on waiting).

The second question, on memory scalability, gets to the interface between cache and TLB registers, since larger memory can result in a greater stress on TLB registers and/or TLB miss handling (which could be a TLB/cache operation without accessing memory). It is also connected to selection of new threads in case of cache misses; it is better, for example, if the CPU schedules a new thread with low probability of having immediate TLB register loads, which will make the switch more than a single cycle.

Generally, I/O is not in this scope, other than the need to flush caches to memory before the I/O occurs; I don't see new issues here (although it could be a blind spot on my part).

These are both the sorts of questions that lead to dissertations as responses. If folks at Sun have already done that research and written the dissertations (or equivalent), that greatly adds to the value of these new CPUs.

They are also the sort of questions I have run into in writing highly concurrent portable Unix software (specifically a main-memory DBMS with very small latency restrictions in soft real time), which is why I thought them worth asking.

Reply Score: 2

Wes Felter Member since:
2005-11-15

For a glimpse of the problems this will cause, read about Azul Vega programming. They have 384 threads today, and Java programs have to be modified in non-trivial ways (e.g. lock striping) to use all those threads. I suspect such optimizations will degrade performance on more conventional systems.
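
For readers who have not met the term, lock striping means guarding a shared structure with several locks, each covering a slice of the keys, so that threads touching different slices do not queue on one global lock. The sketch below is a generic illustration in Python, not Azul's or Sun's code; the class name, the stripe count and the demo workload are all made up. In CPython the global interpreter lock hides any speed-up, so the example only shows the structure of the technique.

```python
import threading


class StripedCounter:
    """Per-key counters guarded by several locks instead of one global lock."""

    def __init__(self, stripes: int = 16) -> None:
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._tables = [{} for _ in range(stripes)]

    def _stripe(self, key) -> int:
        # Every key maps to exactly one stripe, so one lock protects it.
        return hash(key) % len(self._locks)

    def increment(self, key, amount: int = 1) -> None:
        i = self._stripe(key)
        with self._locks[i]:  # only threads using this stripe wait here
            table = self._tables[i]
            table[key] = table.get(key, 0) + amount

    def get(self, key) -> int:
        i = self._stripe(key)
        with self._locks[i]:
            return self._tables[i].get(key, 0)


def hammer(counter: StripedCounter, keys, rounds: int) -> None:
    """Increment every key once per round."""
    for _ in range(rounds):
        for key in keys:
            counter.increment(key)


counter = StripedCounter(stripes=8)
keys = [f"user{i}" for i in range(32)]
workers = [threading.Thread(target=hammer, args=(counter, keys, 1000))
           for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter.get("user0"))  # 4 workers x 1000 rounds
```

The trade-off hinted at in the comment above is visible here: more stripes mean less contention on a machine with many hardware threads, but also more locks to allocate and more indirection on a machine with only a few.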

Reply Score: 4

james_parker Member since:
2005-06-29

Thanks, Wes. There is certainly some overlap here -- Sun is suing Azul Systems over IP issues. They have avoided (it appears) one problem, however; by limiting themselves to Java VMs, they avoid all the address space management issues I raised.

Reply Score: 1

Arun Member since:
2005-07-07

- To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.


That's where virtualization comes in. You can run many smaller virtual machines with fewer number of threads per OS instance. You can consolidate many boxes into a 1U or 2U server.

- How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.

Each core has a shared L1 cache and there is a shared L2 cache for the whole socket. This is on an UltraSPARC T1. Each core has a TLB shared by the threads in the core. Each thread looks like a SPARC CPU to the OS; address spaces are shared only if the MMU partition ID is the same for a TLB entry. Each core runs 4 threads; threads are switched on a long pipeline stall, like a cache miss or TLB miss.


- With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?

Don't follow.

- How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?

See above.

I don't know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.

I work for Sun and on the CMT CPUs and LDoms virtualization technology. We are very open about our CMT processors, so much so that we even open-sourced the design and RTL.

http://www.opensparc.net/

You can download the RTL source code for the T1 processor and get all the specifications at the opensparc page.

Edited 2007-07-28 01:20

Reply Score: 4

bariole Member since:
2007-04-17

To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.

It depends on load and type of software being run.

For example, a typical app or web server will scale linearly with the number of available hardware threads. Maya or Word will not. That's why Niagara is a server CPU and marketed as a throughput processor.

With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?

Yes, that's true. That's why Niagara has massive memory bandwidth and an extreme number of pins to implement that wide bus.

However, never forget the principal idea of this design. CMT and SMT are used not to improve the performance of a single unit of work, but to make better use of memory. The idea is to run multiple units of work and idle any of them when its data is unavailable. By the next time a stalled thread is dispatched to run, enough time should have passed that the data is surely available. As memory is the slowest part of the system, this design should lead to better overall performance than a comparable conventional design with similar memory throughput.

Niagara wasn't the first piece of hardware built around these ideas. Cray had a barrel processor CPU with 256 threads (SMT) and no cache.

Edited 2007-07-28 16:51

Reply Score: 1

Wes Felter Member since:
2005-11-15

For example, a typical app or web server will scale linearly with the number of available hardware threads.

Only if you have a huge number of concurrent requests and no lock contention, both of which are questionable assumptions.
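
One common way to put a number on the lock-contention caveat is Amdahl's law: if a fraction s of each request is serialized (for example, time spent holding a shared lock), the speed-up on n threads is at most 1 / (s + (1 - s) / n). The sketch below is a generic illustration; the serial fractions are hypothetical and not measurements of any real server.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speed-up when serial_fraction of the work cannot overlap."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


# Hypothetical serial fractions: none, 1% and 5% of each request.
for s in (0.0, 0.01, 0.05):
    row = [round(amdahl_speedup(s, n), 1) for n in (8, 32, 64)]
    print(f"serial={s:.2f}: speed-up on 8/32/64 threads = {row}")
```

With no serialized work the speed-up is linear, but with just 5% of each request serialized, 64 threads give only about a 15x speed-up in this model, which is why the "no lock contention" assumption matters more as thread counts grow.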

Reply Score: 2

bariole Member since:
2007-04-17

Only if you have a huge number of concurrent requests and no lock contention, both of which are questionable assumptions.

That’s true. I certainly won’t deny that.

However, as both Niagara and Niagara II have only 8 cores, the whole limitation is more theoretical than practical – in practice, at any given moment there will always be more than 8 threads which are ready to run.

And truthfully, even 2k+ requests at any given moment seems like next to nothing for a typical web or app server in production, and locking in those systems usually happens on database writes – a quite rare event in those environments.

Reply Score: 1

fortress
by broken_symlink on Sun 29th Jul 2007 22:58 UTC
broken_symlink
Member since:
2005-07-06

isn't fortress the language that guy steele, one of the inventors of scheme, is working on? it will be interesting to see how scheme/lisp influence fortress.

Reply Score: 1