I'm not from Sun, but at least two of your questions are not very useful, since the answer to each is 'it depends'.
>how much will software architecture and design have to change?
Obviously, software must be threaded to use this kind of computer efficiently (which is also true for multicore CPUs). Some problems are already 'embarrassingly parallel', so no design change is needed; others have to be recoded from scratch.
>With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
Well, once memory latency is no longer the bottleneck (thanks to thread interleaving), the next bottleneck is memory bandwidth, CPU usage, I/O, etc.
It's not possible to answer your question in a general way, as it depends on the cache usage of the code: if it has high locality, the CPU becomes the bottleneck; otherwise it's memory bandwidth.
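To make the 'it depends' a bit more concrete, here is a back-of-the-envelope sketch in Python. It uses a roofline-style bound (my choice of model, nothing Sun-specific): throughput is capped either by the compute peak or by memory bandwidth times the work done per byte fetched. The peak and bandwidth numbers are made-up placeholders, not measurements of any real chip.

```python
def attainable_gops(peak_gops, mem_bw_gbs, ops_per_byte):
    """Upper bound on throughput: the lower of the compute and bandwidth limits."""
    return min(peak_gops, mem_bw_gbs * ops_per_byte)


def bottleneck(peak_gops, mem_bw_gbs, ops_per_byte):
    """Name the resource that sets the bound for a given ops-per-byte ratio."""
    if mem_bw_gbs * ops_per_byte >= peak_gops:
        return "CPU"
    return "memory bandwidth"


# Hypothetical machine parameters, for illustration only.
PEAK_GOPS = 8.0    # billions of operations per second
MEM_BW_GBS = 20.0  # gigabytes per second

# High locality means many operations per byte fetched from memory.
for label, ops_per_byte in (("high locality", 4.0), ("low locality", 0.125)):
    bound = attainable_gops(PEAK_GOPS, MEM_BW_GBS, ops_per_byte)
    limit = bottleneck(PEAK_GOPS, MEM_BW_GBS, ops_per_byte)
    print(f"{label}: at most {bound:.1f} Gops/s, limited by {limit}")
```

With these placeholder numbers the high-locality case is CPU-bound and the low-locality case is bandwidth-bound, which is the split described above.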
<quote>I'm not from Sun, but at least two of your questions are not very useful, since the answer to each is 'it depends'. </quote>
While I certainly agree that the short answer to both of these questions is "it depends," I disagree about their usefulness.
The first could have been expressed "How and how much..."; it was intended as an opening to draw out the general guidelines and principles involved (beyond "write concurrently as much as possible"). Creating a large number of concurrent threads might perform well with this many cores but perform very poorly on other architectures (including other Sun platforms).
Those of us who must design code to perform efficiently on all supported platforms, with a minimum of "#ifdef" code, will need to understand how to do this, preferably without having to discover it all independently.
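One common way to avoid per-platform conditionals is to size the worker pool at run time from whatever the OS reports. The following is only a sketch in Python of that pattern; `hardware_threads`, `parallel_map`, and `square` are illustrative names, and in CPython the global interpreter lock means threads will not speed up CPU-bound work, so read it as a pattern rather than a benchmark.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def hardware_threads(default=1):
    """Number of logical CPUs the OS reports, or `default` if unknown."""
    count = os.cpu_count()
    return count if count else default


def parallel_map(func, items, max_workers=None):
    """Apply func to items with a pool sized to the machine, not hard-coded."""
    workers = max_workers or hardware_threads()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def square(x):
    return x * x


print("logical CPUs reported:", hardware_threads())
print(parallel_map(square, range(8)))
```

The same source then creates many workers on a machine that reports many hardware threads and few on a two-core box, which is the portability concern raised above.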
In addition, the cost of mutexes relative to other operations may change. Understanding this matters, as does knowing whether alternate atomic instructions that may work better are available (e.g., a single "spinlock" instruction that could trigger a thread switch while waiting).
The second question, on memory scalability, gets to the interface between cache and TLB registers, since larger memory can put greater stress on TLB registers and/or TLB miss handling (which could be a TLB/cache operation without accessing memory). It is also connected to the selection of new threads in case of cache misses; it is better, for example, if the CPU schedules a new thread with a low probability of needing immediate TLB register loads, since those would make the switch take more than a single cycle.
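The TLB pressure can be estimated with simple arithmetic: the "reach" of a TLB is its entry count times the page size, and any working set beyond that reach will take misses. A rough sketch in Python, where the entry count, page sizes, and working-set size are assumptions chosen for illustration rather than figures taken from a data sheet:

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def tlb_reach_bytes(entries, page_size_bytes):
    """Memory that can be mapped at once without taking a TLB miss."""
    return entries * page_size_bytes


def fraction_covered(entries, page_size_bytes, working_set_bytes):
    """Share of the working set the TLB can map at once, capped at 1.0."""
    reach = tlb_reach_bytes(entries, page_size_bytes)
    return min(1.0, reach / working_set_bytes)


# Assumed values for illustration only.
ENTRIES = 64
WORKING_SET = 8 * GIB

for page_size in (8 * KIB, 4 * MIB, 256 * MIB):
    reach_mib = tlb_reach_bytes(ENTRIES, page_size) / MIB
    covered = fraction_covered(ENTRIES, page_size, WORKING_SET)
    print(f"page {page_size // KIB} KiB: reach {reach_mib:g} MiB, "
          f"covers {covered:.4%} of the working set")
```

Under these assumptions, small pages cover a tiny fraction of a multi-gigabyte working set, so growing real memory without larger pages or more TLB entries pushes more work onto miss handling.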
Generally, I/O is not in this scope, other than the need to flush caches to memory before the I/O occurs; I don't see new issues here (although it could be a blind spot on my part).
These are both the sorts of questions that lead to dissertations as responses. If folks at Sun have already done that research and written the dissertations (or equivalent), that greatly adds to the value of these new CPUs.
They are also the sort of questions I have run into in writing highly concurrent portable Unix software (specifically a main-memory DBMS with very small latency restrictions in soft real time), which is why I thought them worth asking.
- To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.
That's where virtualization comes in. You can run many smaller virtual machines with fewer threads per OS instance. You can consolidate many boxes into a 1U or 2U server.
- How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.
Each core has a shared L1 cache and there is a shared L2 cache for the whole socket. This is on an UltraSPARC T1. Each core has a TLB shared by the threads in the core. Each thread looks like a SPARC CPU to the OS; address spaces are shared only if the MMU partition ID is the same for a TLB entry. Each core runs 4 threads, and threads are switched on a long pipeline stall, such as a cache miss or TLB miss.
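As a toy illustration of why switching away from a stalled thread keeps the pipeline busy, here is a small Python simulation. It is a deliberately simplified model, not the actual T1 pipeline: each thread issues `run_len` instructions, then stalls for `stall_len` cycles, and the core picks the next ready thread round-robin at no cost. All names and numbers are made up for the example.

```python
def simulate_core(n_threads, run_len, stall_len, cycles):
    """Return the fraction of cycles in which some thread issued an instruction."""
    ready_at = [0] * n_threads            # first cycle each thread may issue again
    issued_since_stall = [0] * n_threads  # instructions issued since the last stall
    next_thread = 0                       # where the round-robin search starts
    busy = 0
    for cycle in range(cycles):
        for offset in range(n_threads):
            t = (next_thread + offset) % n_threads
            if ready_at[t] <= cycle:
                busy += 1
                issued_since_stall[t] += 1
                if issued_since_stall[t] == run_len:
                    # Model a long stall such as a cache or TLB miss.
                    issued_since_stall[t] = 0
                    ready_at[t] = cycle + 1 + stall_len
                next_thread = (t + 1) % n_threads
                break
    return busy / cycles


for n in (1, 2, 4):
    utilization = simulate_core(n, run_len=1, stall_len=3, cycles=400)
    print(f"{n} thread(s): pipeline busy {utilization:.0%} of cycles")
```

In this toy setup one thread keeps the pipeline busy a quarter of the time, two threads half the time, and four threads all the time, because another thread is always ready when one stalls.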
- With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
Don't follow.
- How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?
See above.
I don't know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.
I work for Sun and on the CMT CPUs and LDoms virtualization technology. We are very open about our CMT processors, so much so that we even open-sourced the design and RTL.
http://www.opensparc.net/
You can download the RTL source code for the T1 processor and get all the specifications at the opensparc page.
Edited 2007-07-28 01:20
It depends on the load and the type of software being run.
For example, a typical app or web server will scale linearly with the number of available hardware threads. Maya or Word will not. That's why Niagara is a server CPU and is marketed as a throughput processor.
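A quick way to see the difference is Amdahl's law, which bounds the speedup from n threads by the fraction of the work that can actually run in parallel. The sketch below is in Python; the parallel fractions are guesses picked for illustration, and 32 is used as an example hardware thread count.

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


N_THREADS = 32  # example hardware thread count

# Hypothetical parallel fractions, for illustration only.
workloads = (
    ("server handling independent requests", 0.99),
    ("mostly single-threaded desktop app", 0.20),
)

for name, parallel_fraction in workloads:
    speedup = amdahl_speedup(parallel_fraction, N_THREADS)
    print(f"{name}: about {speedup:.1f}x on {N_THREADS} threads")
```

A workload made of independent requests gets close to the thread count, while an application with a large serial part barely moves no matter how many hardware threads are available.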
Yes, that's true. That's why Niagara has massive memory bandwidth and an extreme number of pins to implement that wide bus.
However, never forget the principal idea of this design. CMT and SMT are used not to improve the performance of a single unit of work, but to make better use of memory. The idea is to run multiple units of work and idle any of them when its data is unavailable. By the next time a stalled thread is dispatched to run, enough time should have passed that its data is surely available. As memory is the slowest part of the system, this design should lead to better overall performance than a comparable conservative design with similar memory throughput.
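The idea can be put into a one-line formula. If each thread computes for C cycles and then waits L cycles on memory, n interleaved threads keep the core busy for roughly min(1, n*C/(C+L)) of the time. This is a textbook-style simplification (free switching, identical threads), sketched in Python with made-up cycle counts:

```python
import math


def core_utilization(n_threads, compute_cycles, stall_cycles):
    """Idealized fraction of time the core does useful work."""
    busy_share = n_threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, busy_share)


def threads_to_hide_latency(compute_cycles, stall_cycles):
    """Smallest thread count at which the idealized core never idles."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)


# Hypothetical numbers: 10 cycles of work per 100-cycle memory stall.
COMPUTE, STALL = 10, 100

for n in (1, 4, 8):
    print(f"{n} thread(s): {core_utilization(n, COMPUTE, STALL):.0%} utilization")
print("threads needed to hide the stall:", threads_to_hide_latency(COMPUTE, STALL))
```

The single-thread time does not improve at all in this model; only the total amount of work finished per unit of time goes up, which is the throughput trade-off described above.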
Niagara wasn't the first piece of hardware built around these ideas. Cray had a barrel processor CPU with 256 threads (SMT) and no cache.
Edited 2007-07-28 16:51
Some questions I have around the new SPARC architecture:
- To take advantage of the massive number of new threads, how much will software architecture and design have to change? The more code needs to be written to this architecture, the harder it will be to build portable software.
- How is single cycle thread switching accomplished? I would think that the new thread must share a lot of address space (in its current working set) with the old thread, or else the number of TLB registers must have been increased substantially.
- With many more CPU cores and similar levels of memory sharing, the real memory needs of the machine will grow substantially. How well does this scale?
- How are the Ln caches architected to promote effective cache synchronization? Will this synchronization result in a lot of wait states or is some form of lazy evaluation available/possible to minimize cache synchronization activity?
I don't know if these questions have publicly available answers yet, but I would at least hope that any relevant Sun folks reading this think about them, and how to provide good, truthful answers to them.