“Among the many great chips that have emerged from fabs during the half-century reign of the integrated circuit, a small group stands out. Their designs proved so cutting-edge, so out of the box, so ahead of their time, that we are left groping for more technology cliches to describe them. Suffice it to say that they gave us the technology that made our brief, otherwise tedious existence in this universe worth living.”
25 Microchips that Shook the World
About The Author
Follow me on Twitter @thomholwerda
2010-01-11 5:14 pm PLan
I was half expecting the Transputer as well. But although computer geeks thought the Transputer might shake the world, it never quite lived up to its potential, did it?
I was familiar with almost all of those chips, and they well deserve to be on that list.
It seems the list is mostly US-centered, except Micronas from Europe and NAND flash from Japan.
So the Sh-Boom was the first to patent the use of a PLL/DLL to boost the internal clock, but it never went anywhere itself. Something fishy about that.
The Transputer ran off a 5MHz crystal clock but executed at 20-25MHz internally in 1984, so it predates the Sh-Boom by a couple of years too. Perhaps the PLL/DLL clock block wasn't patented well enough. The Transputer also had a full 32b FPU on board long before Intel or anyone else, and it had the DRAM controller on board about 2 decades before Intel. So by those counts it should be on the list. The rest is probably known to the other fans here too.
I still think the Transputer could have done better, but the failure to execute a decent working replacement, plus some inherent flaws in the original design like the variable-length byte codes, was destined to make faster versions a difficult prospect. I was reading the paper by Andrew Tanenbaum on that subject and pushing those ideas around the office. I wish we had gone with a variable-length 16b instruction word instead; it would have been far easier to do later versions and just as code dense. At the time we had no clue that the byte coding would become a huge hindrance; it just seemed that 8b coding would work well enough. The high cost of the T800 and the special CMOS process didn't help either. Inmos never saw the coming of open fabs like TSMC, which could have made it so much nimbler.
In my Transputer work around 2005, the variable-length 16b instruction decoder was incredibly simple and could run at 300MHz in FPGA. With 10 simple processor tiles it executed around 40 threads at about 25 MIPS each, for a total of about 1000 MIPS. It used RLDRAM rather than an SRAM cache backed by SDRAM, so there was no Memory Wall. Each memory access added only a microcycle or so to Load, Store and Branch codes. If only I'd had more lab resources.
Thanks Thom for the link
2010-01-12 3:10 am tylerdurden
2010-01-12 4:42 am transputer_guy
If you have questions you can write offline; my email is in the contact info.
2010-01-14 8:46 pm tylerdurden
Or you can simply point at the project publication/repository site.
Some of us actually work on computer architecture. You claim to be getting over 300MHz in an FPGA implementing a full Transputer pipeline in 2005. That was a big red flag, among others, especially since the only other project I know of that implemented a transputer in FPGA got it to run at around a few tens of MHz in the same time frame.
I would be fairly interested in reading about your results, especially since, if you were doing that in 2005, you certainly managed to stay under the radar of most of the computer architecture community.
2010-01-15 3:30 pm transputer_guy
I gave a paper on the R16 at CPA 2005 in Eindhoven; just google for RLDRAM R16 Transputer. My OSNews post gave the RLDRAM and FPGA clues, and my bio mentions the same thing.
You are certainly right that getting the original Transputer to run at 300MHz in FPGA would be impossible. One Japanese professor did implement most of a T800 as a straight clone at about 25MHz. ST also did an ASIC redesign at very good speeds for the set-top market; I'm not sure if it became a product. Indeed, top-of-the-line Virtex CPUs typically run classic MIPS-like ISAs (MicroBlaze or NIOS2) at only 100MHz or so, and I had no interest in doing another one of those with GNU software on top; that's the antithesis of Transputing.
For me the whole point of the Transputer exercise was to implement a processor drawing upon the general concurrency ideas while avoiding all the things that slowed it down, such as the byte codes, the push-pop stack, and the memory model of SRAM caches over regular DRAM.
In 2001 RLDRAM came to my attention, which made me realize that a barrel or threaded processor could be built with it that would have very good performance. RLDRAM can take memory issues at up to about 500MHz with an ASIC controller, and the Xilinx Virtex Pro could still drive it at 300MHz or so. RLDRAM runs 8 banks concurrently behind a shared SRAM-like clocked interface with 8 cycles of latency per bank.
My time spent on DSP cores naturally led me to a 4-way barrel design. It is possible to build a processor that runs the core at 300MHz or so but actually executes 4 thread instructions every 8 clocks, with a 16b data path and 16b variable-length instructions. The 3ns cycle is constrained to 3 LUT levels of logic, which means that every 8 clocks we have up to 24 LUTs of logic depth, more than enough to resolve all dependencies. A LUT is usually worth 2-3 simple gates of logic depth.
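The barrel issue pattern being described can be sketched in a few lines (a toy Python model; the constants come from the post, the names are mine):

```python
# Toy model of the 4-way barrel issue pattern: each PE rotates through
# 4 thread slots, one microcycle (2 clocks) per slot, so a given thread
# issues at most once every 8 clocks. Constants are the post's figures.

CLOCK_MHZ = 300
THREADS_PER_PE = 4
CLOCKS_PER_MICROCYCLE = 2

def issue_schedule(n_clocks):
    """Which thread slot owns each microcycle over n_clocks."""
    return [(clk // CLOCKS_PER_MICROCYCLE) % THREADS_PER_PE
            for clk in range(0, n_clocks, CLOCKS_PER_MICROCYCLE)]

# Over 8 clocks (4 microcycles) each thread issues exactly once,
# so the per-thread issue rate is 300 MHz / 8 = 37.5 M slots/s.
print(issue_schedule(8))   # [0, 1, 2, 3]
print(CLOCK_MHZ / (THREADS_PER_PE * CLOCKS_PER_MICROCYCLE))  # 37.5
```

The point of the rotation is that by the time a thread comes around again, 8 clocks (24 LUT levels) have elapsed, so its previous result is always ready.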
Placing 10 of these PEs behind the MMU allows the RLDRAM bandwidth to be matched well to the PEs' demand for Load, Store, Branch and instruction-buffer refills. Since the PE latencies are matched to the 8-cycle latency of the RLDRAM, the PEs appear to have almost no Memory Wall over a GB-sized address space. Register codes usually take 1 microcycle (2 clocks); memory and branch codes take 2 or more microcycles. So 10 PEs give 40 peak ops every 8 clocks, or 5 IPC. In practice it is much closer to 3 IPC, allowing for memory and branch codes. Since around every 5th opcode is memory related, the memory bus load is 60%, which works well with the hash page mapping. The pages are only 8 words each, which means different threads interleave evenly across the 8 banks. With hashing, the MMU also supports a more object-like store, with new and delete in hardware.
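The throughput figures above check out as simple arithmetic (the constants are the post's assumed values, not measurements of any real hardware):

```python
# Back-of-envelope check of the IPC and bus-load figures in the post.

PES = 10                   # processor tiles
THREADS_PER_PE = 4         # 4-way barrel per PE
CLOCKS_PER_MICROCYCLE = 2  # so one full rotation = 8 clocks

# Peak: each PE retires 4 thread ops every 8 clocks, so 10 PEs give
# 40 ops per 8 clocks = 5 instructions per clock across the array.
peak_ipc = (PES * THREADS_PER_PE) / (THREADS_PER_PE * CLOCKS_PER_MICROCYCLE)

# Sustained is closer to 3 IPC once memory and branch codes are allowed
# for. With every 5th opcode memory-related and the RLDRAM accepting one
# issue per clock, the fraction of RLDRAM issue slots consumed is ~0.6,
# i.e. the 60% bus load mentioned in the post.
sustained_ipc = 3.0
mem_bus_load = sustained_ipc * (1 / 5)

print(peak_ipc, mem_bus_load)
```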
The PEs are really quite simple: they only use about 300 LUTs each, so the FPGA is hardly touched. Each PE was hand-placed next to a BlockRAM. The design of course has no FPU either.
The logic design was created in Verilog/cycle C and edited in the WebPACK tool. I did not complete the final MMU in Verilog, though, and my hardware was actually limited to a Spartan board with a tiny SRAM on board. Getting PCBs made up for an RLDRAM was beyond my resources, and I was also working on the C compiler, so I was rather stretched. I could have modeled the RLDRAM inside a tiny 1MB SRAM.
This project was certainly under the radar, but the Sun Niagara was doing much the same thing on the processor side, though not on the memory side. I still think the use of RLDRAM could have been a big game changer for CPU architecture.
So who do you work for?
2010-01-15 3:45 pm transputer_guy
Forgot to say that doing this in full CMOS would be a walk in the park. I read that Atiq Raza went on to do a 1600MHz threaded processor for the networking market; it probably also used RLDRAM, but not in the way I suggested.
I would speculate that the proposed RLDRAM nPE model could also run several times faster with an equivalent 8-way split L1 SRAM on chip, each bank needing only 8 clocks, fronting the main L2 RLDRAM memory, with regular DRAM behind that.
All threads would still appear to be free of the Memory Wall, trading it instead for a Thread Wall.
2010-01-15 4:28 pm tylerdurden
No offense, but that is a far cry from what you claimed in your previous post. My initial skepticism was well founded, then. The abandon with which you make assumptions and extrapolations is not warranted, given how little was produced in terms of actual implementation.
Having some design specs on paper and implementing a few structures on a Spartan is a far cry from having full transputer tiles running at 300+ MHz.
2010-01-15 6:21 pm transputer_guy
I never said it was a reimplementation of the original, only that it was inspired by it. If you want a Transputer to run at such speeds, it has to look very different.
Any assumptions and extrapolations I made were no different from those usually made in texts such as Hennessy and Patterson's book on computer architecture.
What was implemented in the Virtex was the entire integer and instruction fetch/decode unit, and it did meet timing at 300MHz. The Spartan version just runs much slower. The memory interface was left for later and did not contain any known critical paths on the PE side. Getting the physical RLDRAM side up would have been "challenging", though, as would the PCB.
The Verilog/cycle C PE model ran small compiled programs, although the memory-cycle logic was still procedural. Before the Transputer came out there was one silicon prototype, the S42; many things were missing there too, and I don't recall it doing much more.
As for the occam support (the process scheduling, message support and so on), none of those things had to run particularly fast, since they are used infrequently. There was still a lot of architecture to work out: what could run as firmware, and what minimum of hardware to add.
I think you missed the main point of the whole exercise: the use of RLDRAM, the ability to start a full-address-space memory cycle every 3ns or so, and how to make use of that bandwidth for 40 or so threads. As soon as you use a conventional DRAM-plus-SRAM-cache architecture, everything goes down a well-trodden path, and processors today have no support for occam processes either.
2010-01-15 10:38 pm tylerdurden
I did not miss anything. However, when you say "…doing the design in CMOS would be a walk in the park…" I really stopped trying to take you seriously. You do realize that is exactly the hardest part of every design, right?
Every design looks great on paper. However, one should be more careful when making the wild claims you were making, especially since it is clear all you have to go on are some fairly extreme extrapolations. Checking out your paper, it is nothing but a list of specifications, and all you have produced are some structures running on a Xilinx Spartan. That is a far, far cry from having a tile of transputers running at 300+ MHz, as you claimed initially.
In theory, theory and practice are the same; in practice they are not. The fact that you wrote that paper over 5 years ago and have yet to produce a working implementation of it should have pointed toward that fact.
2010-01-15 11:25 pm transputer_guy
Let's just say I did CMOS circuit design, layout, architecture and modeling for some 20 years; I even used to sleep on it. Going to FPGA has a few advantages: no masks or huge upfront costs, and fast turnaround. But FPGAs typically run processor-like pipelines at about 1/5 of true CMOS speeds for any type of circuit technique; they are simulations of logic. So using FPGAs lets you do stuff at 300MHz and below, while any CMOS team with a UMC or TSMC or better process would call that the bottom. Running a pipeline with a maximum of 3 LUTs of logic depth, which is about 9 gate levels, is not considered difficult, especially when the design is small.
Let's set aside the processor architecture and the above statement. All the PEs are really doing on one side of the system is generating memory cycles, for which the RLDRAM has to sustain the load evenly across 8 banks. Since the 40-odd threads are not typically correlated, they each submit base[index] pairs to be hashed onto the RLDRAM address bus. Only the bottom 5 bits are linear; another 3 bits select the bank and the rest select a micro page within the bank. Since the threads are uncorrelated, the 3-bit bank select is essentially a random sequence and can be sorted in order to line up with the finishing banks. The net result is that 40 slowish threads get to see a large flat address space with little apparent Memory Wall.
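A toy model of that hashed mapping (my own illustrative stand-in, including the hash function, and not the actual R16 MMU) looks something like this:

```python
# Toy model of the hashed bank mapping described above: the bottom
# 5 address bits stay linear within a micro page, a hash of the upper
# bits picks one of the 8 RLDRAM banks, and the remainder names a
# micro page within that bank. The hash below is a stand-in.

import random

N_BANKS = 8
LINEAR_BITS = 5

def map_address(addr):
    offset = addr & ((1 << LINEAR_BITS) - 1)   # linear within a micro page
    upper = addr >> LINEAR_BITS
    bank = (upper ^ (upper >> 3) ^ (upper >> 6)) % N_BANKS  # stand-in hash
    page = upper >> 3                           # micro page within the bank
    return bank, page, offset

# Uncorrelated thread addresses should spread evenly over the 8 banks,
# which is what lets the controller keep all banks busy.
random.seed(1)
hits = [0] * N_BANKS
for _ in range(8000):
    bank, _, _ = map_address(random.randrange(1 << 24))
    hits[bank] += 1
```

With uncorrelated upper address bits, the stand-in hash spreads the 8000 sample accesses roughly evenly, about a thousand per bank, which mirrors the claim that random thread traffic keeps the 8 banks uniformly loaded.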
The hashing technique is well known and there is no magic there, but it does consume about 30% of the bandwidth and also requires tag space.
This is relevant to a Transputer in that you would want lots of threads anyway, and now they can run quite fast too. If each thread computes at about 25 MIPS, then overall it is about 40x faster than the T800. I stopped working on it because I got such a tepid response and went on to other projects. Not many people seem to understand all the fields involved; software and hardware types don't mix much.
Surprised to see Transmeta and their Crusoe chip on the list. How does that count as "world shaking"? Despite a great deal of hype, the only thing of note they achieved was giving Linus Torvalds a salary for a few years…
2010-01-12 9:01 am fatjoe
Actually, the hardware emulation (or whatever you call it) that Crusoe used has revolutionized the industry. It has been the main technology used in all later Pentium versions (starting with the P3 or P4, I think) and has allowed multi-gigahertz and low-power devices.
(If you don't believe me, read the P4 programmer's guide, especially the section about the instruction decoder and register renaming.)
Too bad Transmeta did not make any money, but I hope they can get some from the Intel lawsuit.
2010-01-12 2:21 pm John Bayko
No, Transmeta used software emulation, not hardware emulation. That has been tried many times, including emulating x86 on Alpha, 68K on PowerPC, and PowerPC on x86-64, but usually only for legacy compatibility. There was also experimental run-time optimisation in the HP Dynamo and IBM DAISY projects, which never really went anywhere. Its only real "mainstream" application was the Java Virtual Machine (JVM) "Just-In-Time" (JIT) recompilation, and to a lesser extent the Microsoft Common Language Runtime (CLR), neither of which is performance oriented (the virtual machines are used for security and stability).
Transmeta's emulation suffered from the same VM limitations compared to native hardware, namely excessively slow start-up before runtime profiling and optimisation could kick in. That works best for servers, where code can remain running for hours, but Transmeta chips were aimed at low-power notebooks and similar applications with mostly short-lived code execution (web browsers and email are about the only applications that typically run long enough for optimisation to occur, and perception is formed in the first few minutes, disappointing most customers).
In any case, Java JIT (and for that matter LISP and Smalltalk JITs, and Macintosh 68K JIT emulation) predated Transmeta, so despite its impressive technology it wasn't really that influential. The low-power aspects of its design were what had an impact, completely reversing the direction of Intel's low-end designs.
Major strides were made in the '70s and '80s; there's no doubt about that. But have we really had no remarkable progress since 2000? There isn't anything from the past decade on that list! That seems quite unlikely.
2010-01-12 9:23 pm transputer_guy
Chips in themselves are never huge; they always get obsoleted within years.
I will nominate a few technology entries that impress me greatly because of their sheer elegance; some may not work, but any that do will be huge, and some are just downright crazy.
While being a total chip head, I also follow most display technologies, most of them are unfamiliar here since they never worked out.
Unipixel is to LCD what CMOS is to NMOS, at least in terms of power consumption, though the reverse in complexity. Samsung has a license and is supposed to ship this year.
In an LCD the pixel cell is now quite complex, taking about 120 manufacturing steps, and although the modern LCD is quite beautiful to look at, its power consumption and pixel response are not that great (forget plasma). Panels get too warm for the light they emit.
What Unipixel TMOS does is replace 3 inefficient liquid-crystal switches with a single near-ideal optical shutter that is about 10-20x more efficient. That means an order of magnitude less power for the light source and the overall display, at any size from cell phone to jumbo TV. It also rapidly frame-switches RGB, so very fast frame rates.
And of course everyone knows OLED, a fallback if Unipixel is late. It is all up to Samsung which comes first. OLED has a blue wear-out liability for now.
Besides that, EESTOR is interesting; I'd put 50/50 odds on that one. Neither is a chip proper, but both overlap the chip space and would have a huge impact on energy savings.
For the crazy, I don't quite bet on Blacklight Power, which turns physics on its head, giving us the hydrino; it makes for an interesting side show.
On batteries, keep an eye out for Nickel-Lithium, which uses 2 separate anode and cathode chemistries separated by a type of glass barrier, giving a best case of 10x the power of Li-Ion. There are a couple of others like Lithium-Air, Zinc-Air, silicon-nanotube-enhanced Li-Ion, and more I'm sure. I think battery tech will now follow its own Moore's law, doubling every 5 years until it's done. Capacity can be traded for cost. In 20 years we will be good to go for completely electric transport.
For fission power I am following the return of Thorium, the nuclear energy source that was set aside in the 60s in favor of Uranium, which is inherently friendly to Plutonium and weapons. Thorium, on the other hand, is free of Pu and is 4x more common than Uranium, and more common than many other familiar metals; enough to power the world for millennia. Look for the Thorium LFTR and perhaps the Energy Amplifier. Uranium will still make a big comeback against stiff resistance from anti-nukes and those who don't understand the perils of diffuse renewable land use.
For fusion power I follow the work of Prof Bussard (RIP; Star Trek TNG gave him good credit) on his Polywell reactor, and Dr Lerner on his Focus Fusion reactor. Both are really quite elegant and modestly funded, but could work this decade with luck. The first is backed by the US Navy and could power large ships; the latter could be an energy source that scales down. Both roughly fit on an industrial table top. Both also produce electrical power straight from the fusion energy; no intermediate heat-to-steam turbine is needed. Both use the proton-Boron cycle, which is almost free of neutron contamination and waste.
I have pretty much turned my attention away from chips to the technology needed to power those chips. I don't follow hydrogen (a dead end), and solar, wind and especially biofuels are just too diffuse for the terawatts needed by the world economy. Some interesting reading can be found in "Sustainable Energy - Without the Hot Air" by Prof MacKay.
Here’s what Chuck Moore is working on these days…
A 144-core CPU that natively runs colorForth in hardware and claims 3mA per node… sounds like fun.
no transputer = list fail.