Linked by Eugenia Loli on Sun 29th Oct 2006 05:58 UTC
Hardware, Embedded Systems As Moore's Law continues to hold, IC designers are finding that they have more and more silicon real estate to play with. David Chisnall hazards some guesses as to what they might do with it.
Thread beginning with comment 176703
RE: not bad
by Phloptical on Sun 29th Oct 2006 23:50 UTC in reply to "not bad"

Wow! Could you translate that into English? Then again, don't bother. Reading stuff like that is like listening to a conversation between theoretical physicists waxing on about string theory and bubbles. It's just best left not understanding, and appreciating the fact that people like you have probably forgotten more than people like me will ever know.

Score: 1

RE[2]: not bad
by transputer_guy on Mon 30th Oct 2006 05:21 in reply to "RE: not bad"

I like to watch Nova as much as the next guy, and I am just as baffled by string theories and the so-called quantum dot computers that pop up every so often.

Actually it's really not too hard to understand at all. Look at my bio and get the paper I gave at the WoTUG conference on parallel computing regarding building a modern Transputer in FPGA. This design puts memory first and processor throughput second. Google for wotug fpga transputer R16. A modern Transputer with the best CSP thinking from the old design can take advantage of modern RLDRAM (Micron Inc), FPGAs, and multiple RISC (SPARC/Niagara-like) processor elements.

However, there is a widespread belief that the Memory Wall (google it, wiki it too) cannot be solved, only managed by ever-increasing cache sizes. This is now known to be a complete fallacy: SRAM is a good way to build chips that get hot, while DRAM pretty much doesn't, because the leakage has been designed out of it. There is also a fallacy spread by us chip designers that DRAM is many orders of magnitude slower than SRAM, but that's also mostly nonsense these days for "large" arrays. If you want megabytes of RAM at high speed, the most important thing is the interface speed, and to use separate I/O data paths with no muxed interfaces; hence DRAM peripheral logic can go just as fast as SRAM interfaces. DRAM arrays, though, can be highly banked, with each bank running out of step and used by multiple slower interleaved threads that hide the latency.
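To make the banking point concrete, here is a minimal sketch in Python. It is a toy timing model, not the R16 design: the function name, the one-issue-per-clock rule, and the 8-clock bank busy time are assumptions chosen for illustration. Access i is sent to bank i % num_banks, and a bank cannot accept a new access until its previous one has finished.

```python
def interleaved_finish_times(num_accesses, num_banks=8, bank_busy_clocks=8):
    """Return the clock at which each access completes.

    Toy model (assumed, not taken from the post): at most one access
    issues per clock, access i targets bank i % num_banks, and a bank
    stays busy for bank_busy_clocks after it accepts an access.
    """
    bank_free_at = [0] * num_banks
    finish = []
    for i in range(num_accesses):
        bank = i % num_banks
        # Issue no earlier than clock i, and not before the bank is free.
        start = max(i, bank_free_at[bank])
        bank_free_at[bank] = start + bank_busy_clocks
        finish.append(start + bank_busy_clocks)
    return finish


banked = interleaved_finish_times(16, num_banks=8)
single = interleaved_finish_times(16, num_banks=1)
print("8 banks, last access done at clock", banked[-1])
print("1 bank,  last access done at clock", single[-1])
```

In this model every single access still takes 8 clocks, but with 8 banks running out of step one access completes per clock, which is the sense in which many slower interleaved threads can hide the latency.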

// long sentences warning
The essential idea is that if you have an even load of at least 30-40 threads to run on a multithreaded processor, constructed pretty much any way on any instruction set, there arises the possibility that most all DRAM references (which occur on load and store ops, typically every 5th-8th opcode) can be handled by a special MMU. This MMU does associative mapping via a hash on an object ID (sort of a file/memory handle) and array index, using a special DRAM from Micron called RLDRAM that can do 8 interleaved accesses every 8 DRAM clocks, i.e. 1 every 2.5ns for a 400MHz bus. The effect is that a small DRAM chip of 32MBytes, at 2x the cost of regular DRAM, can have much better effective performance than L2 SRAM, because most all 8 banks can be kept continuously in flight over any 20ns window. In regular DRAM, typically only 1 bank goes into flight, and that's over a 60ns period; when you add in typical x86 MMU, TLB and OS overhead, true random access is closer to several hundred ns for worst-case cache misses. Regular TLBs use real associative CAMs to do the translation and are limited to a range of 256-1K ways of possible mappings. A hash can be almost entirely associative; it requires pretty fast RAM cycle rates, but it doesn't require low latency if the accesses are for multiple threads.
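The timing figures above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch in Python using the numbers quoted in the post (400MHz bus, 8 accesses per 8 clocks, 60ns for a regular DRAM access); the function names are made up for illustration, and the ratio is a peak issue-rate comparison, not a measured result.

```python
def ns_per_clock(bus_mhz):
    """Clock period in nanoseconds for a bus running at bus_mhz."""
    return 1000.0 / bus_mhz


def ns_per_access(bus_mhz, accesses, clocks):
    """Average time between accesses when `accesses` issue every `clocks` clocks."""
    return ns_per_clock(bus_mhz) * clocks / accesses


# Figures quoted in the post.
rldram_ns = ns_per_access(400, accesses=8, clocks=8)  # one access per clock
window_ns = 8 * ns_per_clock(400)                     # time for 8 banks to cycle
regular_ns = 60.0                                     # one bank in flight

# Peak issue-rate ratio only; it ignores contention and collisions.
peak_ratio = regular_ns / rldram_ns

print(rldram_ns, window_ns, peak_ratio)
```

So 8 accesses fit inside each 20ns window, against one access per 60ns for the regular DRAM case, before any MMU, TLB or OS overhead is added to the latter.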

There is bank contention and also the hash collision penalty, but those can be managed so that 2/3 of the accesses are useful. I am modelling the project on a regular PC to see what an OS and apps would be like to write for it, so in effect the hardware MMU C model is also a private software memory management package. It's also worth looking up persistent storage too.
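For a feel of what a software model of such a hashed MMU could look like, here is a minimal sketch in Python. It is not the C model mentioned above, which is not shown in this thread: the class name HashedMMU, the linear-probing collision policy, and the useful_fraction measure are all assumptions made for illustration.

```python
class HashedMMU:
    """Toy software model of a hashed memory map (hypothetical, not the R16 MMU).

    A (object_id, index) pair is hashed to a slot in a flat table.
    Collisions are resolved by linear probing, an assumed policy.
    """

    def __init__(self, num_slots=1024):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # each entry is (key, value) or None
        self.accesses = 0                # loads and stores requested
        self.probes = 0                  # slot reads, including collision retries

    def _find(self, key):
        """Return the slot holding key, or the first empty slot on its probe path."""
        home = hash(key) % self.num_slots
        for step in range(self.num_slots):
            slot = (home + step) % self.num_slots
            self.probes += 1
            entry = self.slots[slot]
            if entry is None or entry[0] == key:
                return slot
        raise MemoryError("hash table is full")

    def store(self, object_id, index, value):
        self.accesses += 1
        key = (object_id, index)
        self.slots[self._find(key)] = (key, value)

    def load(self, object_id, index):
        self.accesses += 1
        key = (object_id, index)
        entry = self.slots[self._find(key)]
        if entry is None:
            raise KeyError(key)
        return entry[1]

    def useful_fraction(self):
        """Requested accesses divided by slot reads actually performed."""
        return self.accesses / self.probes if self.probes else 1.0


mmu = HashedMMU(num_slots=64)
for i in range(32):
    mmu.store(object_id=7, index=i, value=i * i)
print(mmu.load(7, 5), mmu.useful_fraction())
```

The useful_fraction figure only counts collision retries, so it is a loose analogue of the "2/3 of the accesses are useful" estimate; it does not model bank contention.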

I am really not the first or last to get into this; people 20-30 years before me already did some of this, but they sure never had access to RLDRAM and FPGAs, so it never really took off. Any rational communications or DSP guy would do it this way, but you do have to deal with 30+ threads per MMU to get the latency well hidden.

I will probably be taking this thesis to my grave though before anyone else builds such a machine.

Score: 2