Linked by Eugenia Loli on Mon 14th Aug 2006 05:12 UTC, submitted by Shean Terry
Hardware, Embedded Systems A start-up called Movidis believes a 16-core chip originally designed for networking gear will be a ticket to success in the Linux server market. The Revolution x16 models from Movidis each use a single 500MHz or 600MHz Cavium Networks Octeon CN3860 chip.
Thread beginning with comment 152559
RE[4]: unlocking multicore
by RandomGuy on Tue 15th Aug 2006 13:29 UTC in reply to "RE[3]: unlocking multicore"

Hmm, sounds really interesting.
I read a few sites on the topic but I somehow fail to understand how an improved addressing scheme could force anybody to use slower threads.

Could you please explain this in greater detail or point me to a site where it's explained? That would be great!

Reply Parent Score: 1

RE[5]: unlocking multicore
by transputer_guy on Tue 15th Aug 2006 18:16 in reply to "RE[4]: unlocking multicore"

Google <R16 Transputer wotug> and <RLDRAM CSP occam>.

Those take you to a paper on a processor design that exploits threads in both the processor and the memory to knock down the Memory Wall, plus a number of posts to various groups.

The idea is essentially quite simple: slow the CPU down and speed the memory up until they match closely enough that ld/st/br opcodes are only modestly slower than register opcodes (about 3x) for all addresses.

RLDRAM barely satisfies the memory side: it gives 8 independent memory cycles across 8 banks, limited by the interface bus, currently at 400MHz and going to 533MHz.

Processor Element slow-down is done by 4-way multithreading, very similar to Niagara. Ten PE cores now give around 1000-1500 integer MIPS, but spread over 40 or so threads, so each thread is like a 25-35 MIPS engine.
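The per-thread figure follows directly from that arithmetic; a quick sketch (the aggregate MIPS numbers are the post's own rough estimates, not measurements):

```python
# 10 PE cores, each 4-way multithreaded, sharing ~1000-1500 aggregate
# integer MIPS -> roughly 25-37 MIPS available to each thread.
pe_cores = 10
threads_per_pe = 4
total_threads = pe_cores * threads_per_pe   # ~40 threads

for aggregate_mips in (1000, 1500):
    per_thread = aggregate_mips / total_threads
    print(f"{aggregate_mips} MIPS / {total_threads} threads "
          f"= {per_thread:.1f} MIPS per thread")
```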

Such Processor Elements are very cheap and need no out-of-order (OoO) or superscalar design; R16 uses only about 500 flip-flops for a small ARM-like ISA. Actually, any ISA can be used, even x86, at more cost. Since PEs are cheap, the performance is high for the logic used. The main idea is to use the PEs to load up the MMU; it is okay for PEs to go idle, since they are cheap.

The paper describes how this all works in an FPGA design. The key is really the MMU, which uses hashing in an inverted-page-table style of search. As long as the memory is not excessively full, the MMU effectively has full associativity over the DRAM address space (unlike a typical TLB).
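A minimal sketch of that style of lookup, assuming a hashed inverted page table with linear probing (the hash function, table size, and class names here are illustrative assumptions, not the R16 design). Because any virtual page can land in any physical page slot, the structure is effectively fully associative over DRAM until the table fills up:

```python
# Toy hashed inverted page table: one entry per physical page,
# each holding the virtual page number (VPN) that owns it.
NUM_PHYS_PAGES = 1 << 10          # toy DRAM size, in pages

def hash_vpn(vpn: int) -> int:
    # Simple multiplicative hash; the real design's hash is not specified.
    return (vpn * 2654435761) % NUM_PHYS_PAGES

class InvertedPageTable:
    def __init__(self):
        self.table = [None] * NUM_PHYS_PAGES

    def map(self, vpn: int) -> int:
        """Allocate a physical page for vpn; probe past collisions."""
        slot = hash_vpn(vpn)
        while self.table[slot] is not None:
            slot = (slot + 1) % NUM_PHYS_PAGES
        self.table[slot] = vpn
        return slot               # physical page number

    def translate(self, vpn: int) -> int:
        """Look up vpn; an empty slot on the probe path is a page fault."""
        slot = hash_vpn(vpn)
        while self.table[slot] != vpn:
            if self.table[slot] is None:
                raise KeyError("page fault")
            slot = (slot + 1) % NUM_PHYS_PAGES
        return slot

ipt = InvertedPageTable()
phys = ipt.map(0xABCDE)
assert ipt.translate(0xABCDE) == phys
```

Probe chains stay short only while the table is lightly loaded, which matches the post's caveat that the memory must not be excessively full.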

All banks must be equally loaded for this to work, and they must be operated concurrently. The MMU resolves bank collisions by reordering threads as needed, sorting on 3 bits of the physical address.
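That reordering can be sketched as round-robin issue over per-bank queues, with the bank selected by 3 bits of the physical address (the position of those bits, `BANK_SHIFT`, is an assumption for illustration):

```python
BANK_BITS = 3                     # 8 RLDRAM banks
BANK_SHIFT = 3                    # assumed: bank bits sit above a burst offset

def bank_of(phys_addr: int) -> int:
    return (phys_addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)

def schedule(requests):
    """Reorder (thread_id, phys_addr) requests so consecutive issues
    hit distinct banks where possible: group by bank, then drain the
    bank queues round-robin."""
    queues = [[] for _ in range(1 << BANK_BITS)]
    for req in requests:
        queues[bank_of(req[1])].append(req)
    order = []
    while any(queues):
        for q in queues:
            if q:
                order.append(q.pop(0))
    return order
```

When every pending request targets a different bank, all 8 issue back to back; when threads collide on a bank, the later ones simply slip to the next rotation, which is why having many threads in flight keeps the banks busy.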

This can be sped up in an on-chip, smaller SRAM version to run the MMU bus at current clock speeds. Start with a 1MByte L1 cache split into M banks, with N of those working concurrently over L cycles. M should be many times N to reduce bank collisions and idle banks. N should be > L so that the MMU can keep N issues in flight. Of course, no single-threaded processor could ever keep N issues in flight, but 40 or so threads on 10 PEs can keep such a memory quite busy. Relatively small caches are still used for register files and instruction queues.
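The "M should be many times N" rule can be checked with the classic birthday-style calculation: the probability that N random requests all land in distinct banks out of M rises quickly as M grows relative to N (purely illustrative; the real design's address streams are not uniformly random):

```python
# Probability that n uniformly random requests hit n distinct banks
# out of m: (m/m) * ((m-1)/m) * ... * ((m-n+1)/m).
def p_no_collision(m: int, n: int) -> float:
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

for m in (16, 64, 256):
    print(f"M={m:3d}, N=8: P(no bank collision) = {p_no_collision(m, 8):.3f}")
```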

This isn't the first processor to work like this, but it may well be the first practical design built with off-the-shelf parts. The threads must be explicitly programmed with something like an occam- or CSP-based Par C using message-passing objects. The paper only describes the hardware side of the work; much remains to be done.

In DSP speak, it's all just commutation.

Reply Parent Score: 3