Linked by Thom Holwerda on Mon 11th Jan 2010 15:50 UTC, submitted by PLan
Hardware, Embedded Systems "Among the many great chips that have emerged from fabs during the half-century reign of the integrated circuit, a small group stands out. Their designs proved so cutting-edge, so out of the box, so ahead of their time, that we are left groping for more technology cliches to describe them. Suffice it to say that they gave us the technology that made our brief, otherwise tedious existence in this universe worth living."
Chuck Moore, of Forth Fame
by sarahannalien on Mon 11th Jan 2010 16:13 UTC

Here's what Chuck Moore is working on these days...

A 144-core CPU that natively runs colorForth in hardware and claims 3mA per node... sounds like fun.

Reply Score: 1

no transputer?
by kamil_chatrnuch on Mon 11th Jan 2010 16:50 UTC

no transputer = list fail.

Reply Score: 4

RE: no transputer?
by PLan on Mon 11th Jan 2010 17:14 UTC in reply to "no transputer?"

I was half expecting the Transputer as well. But although computer geeks thought the Transputer might shake the world, it never quite lived up to its potential, did it?

Reply Score: 2

Transputer
by transputer_guy on Mon 11th Jan 2010 19:11 UTC

I was familiar with almost all of those chips, and they well deserve to be on that list.

It seems the list is mostly US-centered, except for Micronas from Europe and NAND flash from Japan.

So the Sh-Boom was the first to patent the use of a PLL/DLL to boost the internal clock, but it never went anywhere itself. Something fishy about that.

The Transputer ran off a 5MHz xtal clock but executed at 20-25MHz internally in 1984, so it predates the Sh-Boom by a couple of years too. Perhaps the PLL/DLL clock block wasn't patented well enough. The Transputer also had a full 32b FPU on board long before Intel or anyone else, and it had the DRAM controller on board about two decades before Intel. So by those counts it should be on the list. The rest is probably known by the other fans here too.

I still think the Transputer could have done better, but the failure to execute a decent working replacement, plus some inherent flaws in the original design like the variable-length byte codes, was destined to make faster versions a difficult prospect. I was reading the paper by Andrew Tanenbaum on that subject and was pushing those ideas around the office. I wish we had gone with a variable-length 16b instruction word instead: far easier to do later versions and just as code dense. At the time we had no clue it would become a huge hindrance; it just seemed that 8b coding would work well enough. The high cost of the T800 and the special CMOS process didn't help either. Inmos never saw the coming of open fabs like TSMC, which could have made it so much nimbler.

In my Transputer work around 2005, the variable-length 16b instruction decoder was incredibly simple and could run at 300MHz in an FPGA. With 10 simple processor tiles it executed around 40 threads at about 25 MIPS each, for a total of about 1000 MIPS. It used RLDRAM rather than an SRAM cache with SDRAM, so there was no Memory Wall. Each memory access added only 1 or so microcycles to Load, Store and Branch codes. If only I'd had more lab resources.
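The headline numbers above are easy to sanity-check; a quick sketch using only the figures given in the post (tile count, thread count, and per-thread rate):

```python
# Sanity check of the claimed R16 throughput (all figures from the post).
tiles = 10            # simple processor tiles in the FPGA
threads = 40          # hardware threads across all tiles (4 per tile)
mips_per_thread = 25  # approximate per-thread instruction rate claimed

total_mips = threads * mips_per_thread
print(total_mips)  # 1000, matching the "about 1000 MIPS" total
```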

Thanks, Thom, for the link.

Reply Score: 6

RE: Transputer
by tylerdurden on Tue 12th Jan 2010 03:10 UTC in reply to "Transputer"
RE[2]: Transputer
by transputer_guy on Tue 12th Jan 2010 04:42 UTC in reply to "RE: Transputer"

If you have questions you can write offline; my email is in the contact info.

Reply Score: 2

RE[3]: Transputer
by tylerdurden on Thu 14th Jan 2010 20:46 UTC in reply to "RE[2]: Transputer"

Or you can simply point at the project publication/repository site.

Some of us actually work on computer architecture. You claim to have gotten over 300MHz in an FPGA implementation of a full Transputer pipeline in 2005. That was a big red flag, among others... esp. since the only other project I know of that implemented a Transputer in an FPGA got it to run at around a few tens of MHz in the same time frame.

I would be fairly interested in reading about your results, esp. since, if you were doing that in 2005, you certainly managed to stay under the radar of most people in the computer architecture community.

Edited 2010-01-14 20:47 UTC

Reply Score: 1

RE[4]: Transputer
by transputer_guy on Fri 15th Jan 2010 15:30 UTC in reply to "RE[3]: Transputer"

I gave a paper on the R16 at CPA in 2005 in Eindhoven. Just google for RLDRAM R16 Transputer. My OSNews post gave the RLDRAM and FPGA clues, and my bio mentions the same thing.

You are certainly right about getting the original Transputer to run at 300MHz in an FPGA; it would be impossible. One Japanese prof did implement most of a T800 as a straight clone at about 25MHz. Also, ST did an ASIC redesign at very good speeds for the set-top market; I'm not sure if it became a product. Indeed, top-of-the-line Virtex parts typically only run classic MIPS-like ISAs (MicroBlaze or NIOS2) at 100MHz or so, and I had no interest in doing another one of those with GNU software on top, the antithesis of Transputing.

For me the whole concept was to implement a processor drawing upon the Transputer's general concurrency ideas while avoiding all the things that slowed it down, such as the byte codes, the push-pop stack, and the memory model of SRAM caches over regular DRAM.

In 2001 the RLDRAM came to my attention, which made me realize that a barrel or threaded processor built around it could have very good performance. The RLDRAM can accept memory issues at up to about 500MHz with an ASIC controller, and the Xilinx Virtex Pro could still drive it at 300MHz or so. The RLDRAM runs 8 banks concurrently behind a shared SRAM-like clocked interface, with 8 cycles of latency per bank.

My time spent on DSP cores naturally led me to a 4-way barrel design. It is possible to get a processor that runs the core at 300MHz or so but actually executes 4 thread instructions every 8 clocks, with a 16b data path and 16b variable-length instructions. The 3ns cycle is constrained to 3 LUT levels of logic, which means that every 8 clocks we have up to 24 LUTs of logic depth, more than enough to resolve all dependencies. A LUT is usually worth 2-3 simple gates of logic depth.
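The logic-depth budget implied here works out as follows; a sketch using only the figures stated above (nothing measured):

```python
# Logic-depth budget for the 4-way barrel PE (figures from the post).
clock_mhz = 300
cycle_ns = 1000 / clock_mhz        # ~3.33 ns per core clock
luts_per_cycle = 3                 # LUT levels that fit in one cycle
clocks_per_issue = 8               # each thread issues once every 8 clocks

lut_budget = luts_per_cycle * clocks_per_issue
print(lut_budget)                  # 24 LUT levels to resolve one instruction

# A LUT is worth roughly 2-3 simple gates of depth:
gate_depth_range = (lut_budget * 2, lut_budget * 3)
print(gate_depth_range)            # (48, 72) equivalent gate levels
```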

Placing 10 of these PEs with the MMU allows the RLDRAM bandwidth to be matched well to the PEs' demand for Load, Store, Branch and instruction-buffer refills. Since the PEs' latencies are matched to the 8-cycle latency of the RLDRAM, the PEs appear to have almost no Memory Wall over a GB-sized address space. Register codes usually take 1 microcycle (2 clocks), memory and branch codes 2 or more microcycles. So 10 PEs give 40 peak ops every 8 clocks, or 5 IPC; in practice it is much closer to 3 IPC allowing for memory and branch codes. Since around every 5th opcode is memory-related, the memory bus load is 60%, which works well with the hash page mapping. The pages are only 8 words each, which means different threads interleave evenly across the 8 banks. With hashing, the MMU also supports a more object-like store, with new and delete in hardware.
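The IPC and bus-load arithmetic above can be checked directly; a sketch using the post's figures, and assuming (as the post implies) that the RLDRAM can start at most one access per core clock:

```python
# Peak vs. practical throughput and memory bus load for the 10-PE array
# (all figures from the post; one RLDRAM issue per clock is assumed).
pes = 10
threads_per_pe = 4
clocks_per_round = 8               # one issue slot per thread every 8 clocks

peak_ops = pes * threads_per_pe    # 40 ops per 8-clock round
peak_ipc = peak_ops / clocks_per_round
print(peak_ipc)                    # 5.0 instructions per core clock, peak

practical_ipc = 3.0                # allowing for multi-microcycle memory/branch codes
mem_fraction = 1 / 5               # roughly every 5th opcode touches memory

# With at most one memory issue per clock, the bus utilisation is:
bus_load = practical_ipc * mem_fraction
print(round(bus_load, 2))          # 0.6, i.e. the 60% bus load quoted
```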

The PEs are really quite simple; they only use about 300 LUTs each, so the FPGA is hardly touched. Each PE was hand-placed next to a BlockRAM. The design of course has no FPU either.

The logic design was created in Verilog and cycle-accurate C, and edited in the WebPack tool. I did not complete the final MMU in Verilog, though, and my hardware was actually limited to a Spartan board with a tiny SRAM on board; getting PCBs made up for an RLDRAM was beyond my resources. I was also working on the C compiler, so I was rather stretched. I could have modeled the RLDRAM inside a tiny 1MB SRAM.

This project was certainly under the radar, but the Sun Niagara was doing much the same thing on the processor side, though not on the memory side. I still think the use of RLDRAM could have been a big game changer for CPU architecture.

So who do you work for?

Reply Score: 2

RE[5]: Transputer
by transputer_guy on Fri 15th Jan 2010 15:45 UTC in reply to "RE[4]: Transputer"

Forgot to say that doing this in full CMOS would be a walk in the park. I read that Atiq Raza went on to do a 1600MHz threaded processor for the networking market; it probably also used RLDRAM, but not in the way I suggested.

I would speculate that the proposed RLDRAM nPE model could also work several times faster with an equivalent 8-way split L1 SRAM on chip (each bank only needs 8 clocks to perform) fronting the main L2 RLDRAM memory, with regular DRAM behind that.

All threads would still appear to be free of the Memory Wall, trading it instead for a Thread Wall.

Reply Score: 2

RE[6]: Transputer
by tylerdurden on Fri 15th Jan 2010 16:28 UTC in reply to "RE[5]: Transputer"

No offense, but that is a far cry from what you claimed in your previous post. My initial skepticism was well founded, then. The abandon with which you make assumptions and extrapolations is not warranted given how little was produced in terms of actual implementation.

Having some design specs on paper and implementing a few structures on a Spartan is a far cry from having full Transputer tiles running at 300+ MHz.


Reply Score: 1

RE[7]: Transputer
by transputer_guy on Fri 15th Jan 2010 18:21 UTC in reply to "RE[6]: Transputer"

I never said it was a reimplementation of the original, only that it was inspired by it. If you want a Transputer to run at such speeds, it has to look very different.

Any assumptions and extrapolations I made were no different from those usually made in such texts as the Hennessy book on computer architecture.

What was implemented in the Virtex was the entire integer and instruction fetch/decode unit, and it did meet timing at 300MHz. The Spartan version just runs much slower. The memory interface was left for later and did not contain any known critical paths on the PE side. Getting the physical RLDRAM side up would have been "challenging", though, as would the PCB.

The Verilog/cycle-C PE model ran small compiled programs, although the memory-cycle logic was still procedural. Before the Transputer came out there was one silicon prototype, the S42; many things were missing there too, and I don't recall it doing much more.

As for the occam support (the process scheduling, message support and so on), none of those things had to run particularly fast since they are used infrequently. There was still a lot of architecture to work out: what could run as firmware, and what minimum of hardware to add.

I think you missed the main point of the whole exercise: the use of RLDRAM, the ability to run a large number of threads, to start a full-address-space memory cycle every 3ns or so, and to make use of that bandwidth for 40 or so threads. As soon as you use a conventional DRAM-with-SRAM-cache architecture, everything goes down a well-trodden path, and processors today have none of the support for occam processes either.


Reply Score: 2

Uh, Transmeta? Really?
by Delgarde on Mon 11th Jan 2010 20:27 UTC

Surprised to see Transmeta and their Crusoe chip on the list. How does that count as "world shaking"? Despite a great deal of hype, the only thing of note they achieved was to give Linus Torvalds a salary for a few years...

Reply Score: 2

RE: Uh, Transmeta? Really?
by fatjoe on Tue 12th Jan 2010 09:01 UTC in reply to "Uh, Transmeta? Really?"

Actually, the hardware emulation (or whatever you call it) that Crusoe used has revolutionized the industry. It has been the main technology used in all later Pentium versions (starting with the P3 or P4, I think) and has allowed multi-gigahertz and low-power devices.

(If you don't believe me, read the P4 programmer's guide, especially the section about the instruction decoder and register renaming.)

Too bad Transmeta did not make any money, but I hope they can get some from the Intel lawsuit.

Reply Score: 2

RE[2]: Uh, Transmeta? Really?
by John Bayko on Tue 12th Jan 2010 14:21 UTC in reply to "RE: Uh, Transmeta? Really?"

No, Transmeta used software emulation, not hardware emulation. That has been tried many times, including emulating x86 on Alpha, 68K on PowerPC, and PowerPC on x86-64, but usually only for legacy compatibility. There was also experimental run-time optimisation in the HP Dynamo and IBM DAISY projects, which never really went anywhere. Its only real "mainstream" application was the Java Virtual Machine (JVM) "Just-In-Time" (JIT) recompilation and, to a lesser extent, the Microsoft Common Language Runtime (CLR), neither of which is performance oriented (the virtual machines are used for security and stability).

Transmeta's emulation suffered from the same VM limitations compared to native hardware, namely excessively slow start-up times before runtime profiling and optimisation could be done. It works best for servers, where code can remain running for hours, but Transmeta chips were aimed at low-power notebooks and similar applications with mostly short-lived code execution (web browsers and email are the only applications that typically run long enough for optimisation to occur, but perception happens in the first few minutes, disappointing most customers).

In any case, Java JIT (and, for that matter, LISP and Smalltalk JIT, and Macintosh 68K JIT emulation) predated Transmeta, so despite its impressive technology, it wasn't really that influential. The low-power aspects of its design were what had an impact, completely reversing the direction of Intel's low-end designs.

Reply Score: 2

And 2001-2009?
by Eddyspeeder on Tue 12th Jan 2010 17:07 UTC

Major strides were made in the '70s and '80s; there's no doubt about that. But have we really not had any remarkable progress since 2000? There isn't anything from the past decade on that list! That seems quite unlikely.

Reply Score: 1

RE: And 2001-2009?
by transputer_guy on Tue 12th Jan 2010 21:23 UTC in reply to "And 2001-2009?"

Chips in themselves are never huge for long, always getting obsoleted within years.

I will nominate a few technology entries that impress me greatly because of their sheer elegance; some may not work, but any that do will be huge, and some are just downright crazy.

While being a total chip head, I also follow most display technologies; most of them are unfamiliar here since they never worked out.

Unipixel is to LCD what CMOS is to NMOS, at least in terms of power consumption (in complexity it's the reverse). Samsung has a license and is supposed to be out this year.

In LCD, the pixel cell is now quite complex, taking about 120 manufacturing steps, and although the modern LCD is quite beautiful to look at, its power consumption and pixel response are not that great (forget plasma). Panels get too warm for the light they emit.

What Unipixel TMOS does is replace 3 inefficient liquid-crystal switches with a single, nearly ideal optical shutter that is about 10-20x more efficient. That means an order of magnitude less power for the light source and for the overall display, at any size from cell phone to jumbo TV. It also rapidly frame-switches RGB, giving very fast frame rates.

And of course everyone knows OLED, a fallback if Unipixel is late. It is all up to Samsung which comes first. OLED has a blue wear-out liability for now.

Besides that, EEStor is interesting; I'd bet 50/50 on that one. Neither is a chip proper, but both overlap the chip space and have a huge impact on energy savings.

For the crazy, I don't quite bet on Blacklight Power, which turns physics on its head by giving us the hydrino, but it makes for an interesting side show.

On batteries, keep an eye out for nickel-lithium, which uses two separate anode/cathode chemistries separated by a type of glass barrier, giving best case 10x the power of Li-Ion. There are a couple of others, like lithium-air, zinc-air, silicon-nanotube-enhanced Li-Ion, and more I'm sure. I think battery tech will now follow a Moore's law of its own, doubling every 5 years until it's done. Capacity can be traded for cost. In 20 years we will be good to go to completely electric transport.

For fission power I am following the return of thorium, the nuclear energy source that was set aside in the '60s in favor of uranium, which is inherently friendly to plutonium and weapons. Thorium, on the other hand, is free of Pu and is 4x more common than uranium and many other common metals: enough to power the world for millennia. Look for the thorium LFTR and perhaps the Energy Amplifier. Uranium will still make a big comeback against stiff resistance from anti-nukes and those who don't understand the perils of diffuse renewable land use.

For fusion power I follow the work of Prof. Bussard (RIP; Star Trek TNG gave him good credit) on his Polywell reactor, and Dr. Lerner on his Focus Fusion reactor. Both of these are really quite elegant and modestly funded, but either could work this decade with luck. The first is backed by the US Navy and could power large ships; the latter could be an energy source that scales down. Both sort of fit on an industrial table top. Both also produce electrical power straight from the fusion energy, with no intermediate heat-to-steam-turbine step needed. Both use the proton-boron cycle, which is almost free of neutron contamination and waste.

I have pretty much turned my attention away from chips to the technology needed to power those chips. I don't follow hydrogen (dead end), and solar, wind and especially biofuels are just too diffuse for the terawatts needed by the world economy. Some interesting reading can be found in "Sustainable Energy — without the hot air" by Prof. MacKay.

Reply Score: 3