Linked by Thom Holwerda on Mon 23rd Jan 2012 11:29 UTC
Hardware, Embedded Systems
"The CPU design firm Venray Technology announced a new product design this week that it claims can deliver enormous performance benefits by combining CPU and DRAM on to a single piece of silicon. We spent some time earlier this fall discussing the new TOMI (Thread Optimized Multiprocessor) with company CTO Russell Fish, but while the idea is interesting; its presentation is marred by crazy conceptualizing and deeply suspect analytics."
scary, maybe
by transputer_guy on Mon 23rd Jan 2012 21:57 UTC
Member since: 2005-07-08

Before I even read the article I was thinking about the Forth chips and Chuck Moore.

The last section, though, was pretty scary, but futurologists like Ian Pearson make it sound like pretty lame stuff.

There are DRAMs that are literally 20 or more times faster than regular DRAM: they can start a full, almost random access every 2.5ns, rather than the usual 60ns of today's commodity chips.

With Micron RLDRAM, you can sustain certain types of compute processing at up to 400M I/Os per sec. It is based on 8 concurrent banks of 8-cycle, 20ns-latency DRAM blocks sharing a split I/O bus structure in a 1Gbit DRAM process. It has full address and data I/O lines on dedicated pins, like an SRAM, with modern DDR pin speeds. The networking industry uses them; in fact Atiq Raza, the NexGen/AMD architect, used these RLDRAMs in a custom network processor at RMI, now NetLogic.
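The timing figures above hang together arithmetically. Here is a minimal sketch of that arithmetic, assuming ideal interleaving across the banks (no conflicts); the function names are made up for illustration and are not from any Micron datasheet.

```python
def issue_interval_ns(bank_latency_ns, num_banks):
    """Time between access starts if the banks are perfectly interleaved."""
    return bank_latency_ns / num_banks


def peak_maccesses_per_sec(bank_latency_ns, num_banks):
    """Peak access rate in millions per second (1000 ns per microsecond)."""
    return 1000 / issue_interval_ns(bank_latency_ns, num_banks)


# Figures from the comment: 8 banks, 20 ns bank latency.
interval = issue_interval_ns(20, 8)          # 2.5 ns between access starts
rate = peak_maccesses_per_sec(20, 8)         # 400 M accesses/sec
speedup_vs_commodity = 60 / interval         # vs. a 60 ns commodity DRAM cycle

print(interval, rate, speedup_vs_commodity)
```

So one access every 2.5ns is simply 20ns divided over 8 banks, and 60ns / 2.5ns is where the "20 or more times faster" comes from.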

The question is whether you can build a useful general-purpose computer that gets 5 or so operations for each memory cycle, at 2000M ops/sec.

You can only do this on highly threaded designs, and you have to pay for the effect of making the 8 banks look like one address space.
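To see why a single address space over 8 banks costs something, here is a toy model. It is only a sketch under loud assumptions: one in-order stream of accesses, at most one issue per cycle, each bank busy for 8 cycles after an access, and a stall whenever the target bank is still busy. The name `simulate_issue` and the model itself are mine; a real threaded design can reorder around busy banks, so this is not meant to reproduce any particular loss figure.

```python
import random


def simulate_issue(bank_ids, num_banks=8, busy_cycles=8):
    """Return accesses issued per cycle for an in-order stream of bank ids."""
    free_at = [0] * num_banks              # cycle at which each bank is free again
    cycle = 0
    for bank in bank_ids:
        cycle = max(cycle, free_at[bank])  # stall until the target bank is free
        free_at[bank] = cycle + busy_cycles
        cycle += 1                         # at most one issue per cycle
    return len(bank_ids) / cycle


n = 1000
round_robin = [i % 8 for i in range(n)]    # best case: banks visited in turn
same_bank = [0] * n                        # worst case: every access hits bank 0
rng = random.Random(1)                     # fixed seed so runs are repeatable
scattered = [rng.randrange(8) for _ in range(n)]

print(simulate_issue(round_robin))
print(simulate_issue(same_bank))
print(simulate_issue(scattered))
```

Round-robin access keeps every issue slot full, hammering one bank drops to about one access per 8 cycles, and random addresses land somewhere in between. Having many threads in flight is what lets a design pick an access whose bank is free and so claw back most of the gap.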

In practice, with an FPGA you have to use the slower version at 300MHz, and the penalty for the single address space is that about 1/3 of the memory bandwidth is lost. So you are left with about 1000M ops/sec, and it takes around 20 to 40 threads, which will need some communication between them and with other nodes. Such a processor can be built in FPGAs like the Virtex series, and you can effectively get 40 simple 25 MIPS cores per node. The 40 threads are actually spread over 10 or so 4-way cores.
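The throughput budget above is just a few multiplications. A minimal sketch, using the numbers from this comment (the function name `node_budget` is hypothetical, and `Fraction` is used only to keep the 1/3 exact):

```python
from fractions import Fraction


def node_budget(mem_clock_mhz, ops_per_mem_cycle, bandwidth_lost, threads):
    """Return (raw M ops/sec, usable M ops/sec, M ops/sec per thread)."""
    raw = mem_clock_mhz * ops_per_mem_cycle
    usable = raw * (1 - bandwidth_lost)    # what survives the address-space penalty
    return raw, usable, usable / threads


# Ideal part: 400 MHz, 5 ops per memory cycle, no penalty.
ideal = node_budget(400, 5, Fraction(0), 40)

# FPGA case: 300 MHz, 1/3 of bandwidth lost, 40 threads.
fpga = node_budget(300, 5, Fraction(1, 3), 40)

print([int(v) for v in ideal])
print([int(v) for v in fpga])
```

The FPGA case works out to 1500M raw, 1000M usable, and 25M ops/sec per thread, which matches the "40 simple 25 MIPS cores" figure.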

Is that useful to anyone? Probably not to the usual punters, but I wouldn't mind having one. The big advantage is that there is effectively no Memory Wall on any memory cycle; you get a big Thread Wall instead. If you can deal with that, then you can also scale the system up many times, at the cost of more Thread Wall.

If you could implement the RLDRAM and processor on the same chip, then the clock rate could go up a few times, and the whole node could be replicated as DRAM capacity allows. The processor could then get a decent FPU as well.

In my MVC analysis of graphics apps I have written, I found that the more complex Control part needs very few cycles and can happily run with a few MIPS, while the Model part usually does need cycles. The View part can usually be partitioned quite nicely over dozens of small tiles; it is a question of organizing the graphics into parallel pipelined structures.
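As a rough illustration of partitioning the View over tiles, here is a sketch that cuts a view into rectangles and deals them out to threads. The 640x480 view, 64-pixel tiles, round-robin assignment, and both function names are assumptions for the example, not something taken from my apps.

```python
def split_into_tiles(width, height, tile_size):
    """Return (x, y, w, h) rectangles that exactly cover a width x height view."""
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            w = min(tile_size, width - x)    # clip tiles at the right edge
            h = min(tile_size, height - y)   # clip tiles at the bottom edge
            tiles.append((x, y, w, h))
    return tiles


def assign_round_robin(tiles, num_threads):
    """Deal tiles out to threads in turn, one list of tiles per thread."""
    buckets = [[] for _ in range(num_threads)]
    for i, tile in enumerate(tiles):
        buckets[i % num_threads].append(tile)
    return buckets


tiles = split_into_tiles(640, 480, 64)
per_thread = assign_round_robin(tiles, 40)
print(len(tiles), [len(b) for b in per_thread][:4])
```

Each thread then only ever touches its own tiles, so the slow simple cores can work side by side without fighting over the same pixels.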

Since we already have a Thread Wall with typically 4 x86 cores, we might as well go the whole hog. I have 2 Intel quad-core PCs, and 99.xx% of the time the spare cores are never used.

Perhaps Venray is thinking along the same lines, dunno.

Score: 2