As expected, processor licensor ARM Holdings and Handshake Solutions NV, a Royal Philips Electronics subsidiary, have developed an asynchronous processor based on the ARM9 core. The ARM996HS is thought to be the first commercial clockless processor and is being described as particularly suited to use as an automotive microcontroller. Because clockless processors consume zero dynamic power when there is no activity, they can significantly extend battery life compared with clocked equivalents.
“consume zero dynamic power when there is no activity” Sounds really good.
Now combine this with the “raw chip” technology, and we have the ultimate CPU.
Tom
>Now combine this with the “raw chip” technology, and we have the ultimate CPU.
What’s that?
Never heard of it? Well, it’s pretty exciting. There are two types of processors: all-purpose CPUs like those from AMD or Intel, which can pretty much do anything – it just depends on the software. In contrast, there are specialized processors intended for one single purpose, like the sound chip on your sound card, for example. A “raw chip” combines both: there’s one layer of transistors which runs the software. Instead of hard-wired connections between these transistors (like today), there’s a second layer of transistors, and these transistors can form any desired wiring between the first layer of transistors. In other words, the chip is not hard-wired, it’s dynamically wired.
A compiler would not translate a C++ program (as an example) into machine code; it would translate a C++ program into an optimized wiring scheme for the chip. If you start an application, say Firefox, then you would effectively have a “Firefox chip” in your computer, which is hardware-optimized to run Firefox. If you start a second application, then some free space on the chip can be used to run this second application, and so forth. Well, I hope you know what I mean (sorry, English is not my native language).
The chip would dynamically and constantly transform itself to be optimized for whatever applications are running. A chip that is designed at the hardware level to run a certain application can run that application faster than an all-purpose chip which can run anything but has to do everything in software. I’ve read about this in “Spektrum der Wissenschaft”, the German sister publication of “Scientific American”. They said that even though the “compiler” they used to create this dynamic wiring was totally unoptimized, this new chip ran all tests faster than a standard CPU at a comparable clock speed (MHz).
Tom
It seems you have discovered FPGAs and Reconfigurable Computing. You most definitely cannot compile Firefox into a chip, nor any old C++. You can compile or synthesize some specialized C languages, which are really crippled C with some crippled HDL features, into an FPGA. Most engineers would use an HDL like Verilog (C’ish) or VHDL (Ada’ish), but there is some interest from software types who avoid learning real electronics and try to make a lazy pass with C-like HDLs; I am not a big fan of those.
“Raw chip” doesn’t really mean anything; it might mean unpackaged dies, but nobody uses that term.
The Scientific American article to which you refer is vastly overstating the case for RC & FPGAs; the comparison is much more even when you try to do math on a couple of fast P4s versus the same $ worth of FPGA hardware. That’s because general-purpose CPUs can run about 10-20x faster and have great floating-point capacity. FPGAs today have zero FP capacity, other than using up a lot of FPGA resources to get a poor man’s FPU. On the other hand, FPGAs can be very good at mostly logical operations that don’t map onto CPUs, such as crypto, pattern matching, and almost anything DSP and integer, especially if the word sizes are odd.
See Xilinx, Altera, Lattice, Atmel for vendor sites.
Hope that helps.
Okay, in the context of your 1st post: if you have a couple of $M you can indeed buy some Cray supercomputers (Octiga Bay originally) that basically combine some Opterons with, you guessed it, some Xilinx FPGAs, and also note that both are fully synchronous clocked systems. FPGAs will never be async and wouldn’t work well with the clockless ARM in the main article. In fact, ARM hasn’t made much of a dent in FPGAs: Altera used to and barely still includes an older ARM core in one of their FPGA families, and Xilinx uses the PPC core, but only at about 300MHz with no FPU.
Sounds interesting, but how would it compare (performance-wise) to a clocked processor?
I feel a little dubious about this technology. Sure, it will save lots of power, but there’s never really a time that a CPU is completely idle. Sure, this is being used for an automotive microcontroller, but I’m sure that it’ll *try* to make it into the PC market. I guess we’ll have to wait and see the progress of the tech.
ZaNkY
At the very least it will make its way into handheld devices (where ARM have a virtual monopoly), and that is a place where the ability to save power is very important.
I don’t think they’ll ever try to go for the PC market.
What they don’t tell you is that asynchronous designs all suffer from a lot of startup delay. These chips are probably not meant for interactive applications like we use the PC for – so in those cases it’s totally fine for them to take a little time to ramp up.
That’s exactly what I was thinking. The whole startup process would be the kicker. I mean, don’t get me wrong, I think it would be great if it could get into PC tech, but I don’t see it happening without major adjustments.
Good point (a post above) that this is targeted mainly at mobile devices, and most probably will stay that way.
And I must say, even if it appears your CPU is idle, it’s nearly ALWAYS computing something. Whether it’s detecting minuscule mouse movements (mouse interrupt calls), or it’s processing background DHCP requests or other TCP stack stuff, it’s always working.
We could say that the CPU may at times be doing negligible work, but I doubt it ever really “stops”.
ZaNkY
>We could say that the CPU may at times be doing negligible work, but I doubt it ever really “stops”.
It does. Perhaps only for 10 milliseconds or so at a time, but there are huge swaths of time where there is nothing to do, and thus the idle thread is usually run. Typically the idle thread in most OSes does a little bit of housekeeping and then just halts the CPU until the next interrupt (network packet, keypress, timer, etc.). On mobile devices this is doubly important, to save power.
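A minimal C sketch of that pattern (the helper functions are hypothetical stand-ins for a real scheduler, not any particular OS's code, and "sti; hlt" is x86-specific and only legal in kernel mode):

    #include <stdbool.h>

    static bool runnable_tasks_exist(void) { return false; } /* stub scheduler query */
    static void do_housekeeping(void) { /* e.g. reap finished tasks */ }
    static void schedule(void) { /* context-switch to a runnable task */ }

    static void idle_thread(void) {
        for (;;) {
            if (runnable_tasks_exist()) {
                schedule();          /* hand the CPU to real work */
                continue;
            }
            do_housekeeping();
            /* Re-enable interrupts and halt in one step, so a wakeup
             * (timer, keypress, network packet) cannot slip in between.
             * The CPU draws minimal power until the next interrupt. */
            __asm__ volatile ("sti; hlt");
        }
    }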
But, IIRC, only in Windows NT or maybe 2000 did Microsoft add proper support for this. In Windows 9x and older, they would just send dummy commands to the CPU when there was nothing to do…
A very basic difference between synchronous and asynchronous (here self-timed, I guess) logic is that the first one polls its inputs at every clock tick, while the second is strobed by its inputs directly (each data signal is a “clock”).
So, for asynchronous circuits there is no notion of explicit “stops”, “idle states” or “waking up”, because these were invented to stop or limit “polling” activity in synchronous circuits. Asynchronous circuits respond to their input events instantly (i.e. latency depends on processing time only). They are active only when there is something to do, and the level of their activity adjusts automatically to the activity of the input signals, temperature, supply voltage and process skew. It basically means that (at least for a CMOS implementation) you can control the maximum processing power and maximum operating temperature by adjusting the supply voltage _only_. Internal handshaking structures will take care of signal integrity, synchronisation and providing maximum performance.
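To make the handshaking concrete, here is a toy C model (my own sketch, not from the article; a real circuit does this with concurrent events on wires, not sequential function calls) of the classic four-phase request/acknowledge protocol that takes the place of the clock:

    #include <stdio.h>

    /* One channel: a data bundle plus the two handshake wires. */
    struct channel { int req, ack, data; };

    /* Sender side: put data on the bundle, then raise the request. */
    static void send(struct channel *ch, int value) {
        ch->data = value;
        ch->req = 1;                /* phase 1: "data is valid" */
    }

    /* Receiver side, folding the rest of the four-phase cycle into
     * one call for brevity. */
    static int receive(struct channel *ch) {
        while (!ch->req) { }        /* wait for "data valid" */
        int v = ch->data;           /* consume the data */
        ch->ack = 1;                /* phase 2: "data consumed" */
        ch->req = 0;                /* phase 3: sender withdraws request */
        ch->ack = 0;                /* phase 4: receiver withdraws ack */
        return v;
    }

    int main(void) {
        struct channel ch = {0, 0, 0};
        for (int i = 1; i <= 3; i++) {
            send(&ch, i * 10);
            printf("got %d\n", receive(&ch));
        }
        return 0;
    }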
So what is the problem with deploying these circuits more widely? There are at least two:
– circuit overhead – present synchronous CMOS circuits are ~2x smaller. Also, processes are tuned to synchronous logic architectures. This often results in better performance, although the theory says the opposite.
– design methodology – synchronous logic is very simple to design. There are numerous problems with VLSI implementation (mostly related to clock and power management), but with time we have learned how to cope with them. Again, without a robust design tool-chain, IP macros and support there is no reasonable time-to-market, which disqualifies asynchronous logic in typical applications.
>I feel a little dubious about this technology. Sure, it will save lots of power, but there’s never really a time that a CPU is completely idle.
You betcha there is time that a CPU is completely idle. It depends on the use of the system, of course, but most CPUs in the world spend most of their time doing nothing. They’re usually halted, waiting on the next interrupt.
Sounds interesting.
Well, it sounds useful for laptops/PDAs and other battery-operated computers.
The University of Manchester developed a clockless (asynchronous) ARM processor which was fabricated in 1993 – http://www.cs.manchester.ac.uk/apt/projects/processors/amulet/ Leading the development was a principal designer of the ARM processor, Steve Furber – http://en.wikipedia.org/wiki/Steve_Furber .
“the first commercial clockless processor” – emphasis on “commercial”.
I wonder how they made flip-flops without a clock…
>I wonder how they made flip-flops without a clock…
Flip-flops are synchronous elements. So there ain’t much use for them in asynchronous design. What I can think of is either to generate a per-flip-flop clock from some other signals, or to use latches instead.
It’s possible to build a clockless state machine, and a CPU is basically a complex state machine.
Before the era of chipsets, a CPU connected to the outside world using I/O chips such as the Intel 8255 http://developer.intel.com/design/archives/periphrl/docs/7190.htm
which is an asynchronous design. In college, I built a hardware emulator of the 8255.
Asynchronous circuits are very hard to design, because they can go into metastable states and become unstable. Synchronous designs, on the other hand, are very stable, because the metastable transients happen between clock edges.
I forgot to mention that they use two NAND/NOR gates with feedback to create a clockless flip-flop.
>I forgot to mention that they use two NAND/NOR gates with feedback to create a clockless flip-flop.
That sounds like a latch. Add an enable signal with an edge detector and you’ll get a flip-flop.
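For the curious, a small C simulation of that cross-coupled NAND latch (a sketch of my own; the signal names are mine). The inputs are active-low: driving s_n low sets Q, driving r_n low resets it, and with both high the feedback loop holds the stored value:

    #include <stdio.h>

    /* Two cross-coupled NAND gates: Q = NAND(s_n, Qn), Qn = NAND(r_n, Q). */
    typedef struct { int q, q_n; } sr_latch;

    static void sr_update(sr_latch *l, int s_n, int r_n) {
        /* Iterate the feedback loop a few times until it settles,
         * mimicking how the real gates reach a stable state. */
        for (int i = 0; i < 4; i++) {
            int q   = !(s_n && l->q_n);
            int q_n = !(r_n && l->q);
            l->q = q;
            l->q_n = q_n;
        }
    }

    int main(void) {
        sr_latch l = {0, 1};
        sr_update(&l, 0, 1); printf("set:   Q=%d\n", l.q); /* 1 */
        sr_update(&l, 1, 1); printf("hold:  Q=%d\n", l.q); /* 1 */
        sr_update(&l, 1, 0); printf("reset: Q=%d\n", l.q); /* 0 */
        return 0;
    }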
How would such a processor tell the time? Or periodically measure input – such as engine temperature?
Well, if something other than basing time on passing CPU clock cycles is used, that’s not a problem. Don’t most CPUs do that anyway*? A separate clock generator, or dedicated diagnostic equipment, would handle timing.
The point is that most processors ‘waste’ time, power, and transistors keeping everything marching in step with the clock.
* I’m not being rhetorical. I don’t know, but it seems like a common sense way to do it.
>How would such a processor tell the time?
With a separate real-time clock circuit. That can have a much lower frequency, and thus lower power consumption, than your usual processor clock.
Systems with clocked processors often use an RTC as well, because that allows switching off the processor clock when there’s nothing to do. And RTCs can be more accurate, too.
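A rough C sketch of pacing periodic work off such an RTC (read_rtc_ticks() is a hypothetical stand-in for whatever counter register the platform exposes, stubbed here so the example runs; a real async CPU would sleep and be woken by the RTC interrupt instead of polling):

    #include <stdint.h>
    #include <stdio.h>

    #define RTC_HZ 32768u            /* common watch-crystal rate */

    /* Hypothetical free-running RTC tick counter, stubbed: each call
     * pretends half a second has passed. */
    static uint32_t read_rtc_ticks(void) {
        static uint32_t fake = 0;
        return fake += 16384;
    }

    int main(void) {
        uint32_t next = read_rtc_ticks() + RTC_HZ;   /* one second ahead */
        for (int i = 0; i < 3; i++) {
            while ((int32_t)(read_rtc_ticks() - next) < 0)
                ;                    /* wait for the deadline */
            printf("periodic work, e.g. sample engine temperature\n");
            next += RTC_HZ;          /* schedule the next second */
        }
        return 0;
    }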
To tell time, they poll the (ultra low-power) clock chip, and use maybe 100 cycles to paint a new face on the display.
Yes, CPUs -do- have times when they do nothing; your multicore CPU surely often has the second core in ‘chill mode’ unless you absolutely must have badly written Flash webpages updating in the background… perhaps in a bid to have the most ephemeral P2P network?
Many yarns ago, Java was going to take 0 CPU and run on little Java chips you sort of poured in next to the graphics card; and you would have good metrics on the watts those used, and multiple VMs, without paying $32k (about 20 euro… oh, thousand euro, yes) for a specially compiled JVM, because that would be in Java. Perhaps when the Java chips (and mayhaps this is it) become necessary to support the heatsinks in 4U and 8U or tower configurations, next week, they will be introduced commercially.
The chip would periodically do things by taking an interrupt signal from a watchdog timer (i.e. a clock). Alternatively, it could be fabbed with great analog peripherals so it only woke up when the temperature input changed sufficiently for it to need to do something!
Years ago, when I was actually designing digital circuits, I used asynchronous logic every chance I got. That is, I used latches instead of clocked flip-flops, and the latches only switched when their latch-enable signals toggled. But I worked in a company where most engineers believed synchronous design was THE RIGHT WAY, so they looked askance at me.
Now along comes ARM and produces a whole clockless processor. YES! I don’t know exactly what Handshake Solutions’ clockless IC design technology is, but it’s got to be using individually-enabled latches.
Although reduced power consumption, due to the lack of clock circuitry, is one benefit, the clockless design also produces a low electromagnetic signature because of the diffuse nature of digital transitions within the chip.
That makes perfect sense. The outputs of some latches feed the logic that produces the next tier of enables, so the signals ripple through the circuit. They don’t all switch at once.
>Years ago, when I was actually designing digital circuits, I used asynchronous logic every chance I got. That is, I used latches instead of clocked flip-flops, and the latches only switched when their latch-enable signals toggled. But I worked in a company where most engineers believed synchronous design was THE RIGHT WAY, so they looked askance at me.
As well they should have. Using asynchronous logic as you were probably doing invites obscure bugs and maintenance nightmares. I’ve recently joined the workforce and had to help update some circuits where asynchronous logic was used in small areas, and we discovered numerous bugs. In fact, the most common comment from my coworkers was “Why, it’s a miracle it worked at all!”
There is a way to do asynchronous logic that is perfectly safe, but it’s completely different from most other logic design, and it’s very difficult because there are NO tools to help with the design, so you must lay out the transistors yourself.
If you’d like to see how asynchronous design should be done, look at the Cornell ECE 574 website:
http://www.csl.cornell.edu/courses/ece574/
(a course I took a few years ago).
>…and it’s very difficult because there are NO tools to help with the design, so you must lay out the transistors yourself.
You may have put your finger on the problem. It’s not that asynchronous design is bad; it’s tool availability, which of course is driven by demand, which of course is driven by tool availability. This is the way we do it. Why? Because this is the way we do it.
My designs used discrete and programmable logic, so my most important tool was my brain. I never got to lay out transistors. And I’m not aware of anybody finding logic bugs in my designs later.
>My designs used discrete and programmable logic, so my most important tool was my brain. I never got to lay out transistors. And I’m not aware of anybody finding logic bugs in my designs later.
Of course, it is easy to do the logic for something asynchronous, but it is much harder to take glitches into account.
I over-clocked mine to 0.5 MHz.
Heh.
How do you make sure that it doesn’t overheat or stuff like that?
Just lower the voltage.
There are several variants of clockless design, but in a CPU scheme I would venture clocks are still used throughout the chip, just locally generated.
One can argue about whether the storage element is a true latch with a short enable window or a true edge-triggered register; either can be used in such schemes, as long as the clock is being conditionally generated when needed. Even in regular clocked systems, either can be used, although latch-based design has always been more tricky.
That brings up edge-detection circuits. A chip can be full of these, using changes on the inputs of a block of asynchronous logic to produce a carefully delayed clock signal (or latch enable) that is guaranteed to arrive only after the logic block has stabilized. That, by definition, can be difficult: how do you know when A+B is done? It depends on the position of the final carry-ripple stage. Either you don’t care and use a worst-case delay, which is what most fixed-clock systems do, or you try to figure it out locally, so that 1+1 always produces a strobe much earlier than 1000+1000, etc. That can complicate the logic, even double it up, since essentially you have 3-level logic requiring 2 wires for every signal (00/01/10) instead of 1 (0/1). In good async design only certain paths have to be built with twice as much logic, so the impact may be only 5% on the logic area.
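A small C model of that two-wires-per-signal encoding (a sketch; the names are mine, using the standard dual-rail convention): 01 encodes logic 0, 10 encodes logic 1, and 00 means “not settled yet”, so completion detection is just checking that every bit has left the 00 state:

    #include <stdio.h>

    /* One dual-rail bit: (t,f) = (1,0) encodes 1, (0,1) encodes 0,
     * (0,0) means "no value yet", and (1,1) is illegal. */
    struct drbit { int t, f; };

    static struct drbit dr_encode(int b) {
        struct drbit r = { b ? 1 : 0, b ? 0 : 1 };
        return r;
    }

    /* Completion detection: the word is done when every bit has
     * left the (0,0) "empty" state. */
    static int dr_done(const struct drbit *v, int n) {
        for (int i = 0; i < n; i++)
            if (!v[i].t && !v[i].f) return 0;
        return 1;
    }

    int main(void) {
        struct drbit bus[4] = { {0,0}, {0,0}, {0,0}, {0,0} };
        printf("done? %d\n", dr_done(bus, 4));              /* 0 */
        for (int i = 0; i < 4; i++) bus[i] = dr_encode(i & 1);
        printf("done? %d\n", dr_done(bus, 4));              /* 1 */
        return 0;
    }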
BTW, this may be the 1st “clockless” commercial CPU offered, but there have been many prototypes besides Amulet; Philips has been a big fan of this for embedded use for some time and has demonstrated async 8051s.
Also, most SRAMs have traditionally used this technique, looking for address, OE, and CE edges and producing internal clocks from there to rush the data through to the outputs and then power themselves down, avoiding an explicit clock. But that’s relatively easy compared to CPUs.
This async clockless logic, though, is going to have very limited appeal in the long run even if it can be done in a provable fashion, and the reason is very simple: very few ASIC designs are being done any more, because nobody except the largest companies can afford the mask sets and the NREs involved.
Also, the tools for doing this style of design are not readily available for your typical ASIC process, and it remains a boutique design flow limited to a few companies, almost more in the vein of full analog design.
Further, with the switch of many ASIC designs to FPGAs, there is no room for this async design, since FPGAs are not at all friendly to that method.
So these async designs will be limited to systems which are totally power-sensitive and low chip count, i.e. PDAs, cell phones, etc. As soon as you add other regular clocked parts into the equation, the power savings will disappear.
>There are several variants of clockless design, but in a CPU scheme I would venture clocks are still used throughout the chip, just locally generated.
Though I think that’s the scheme the U. of Manchester used, I think these guys are using something closer to what Cornell and Caltech were researching, which is fairly simple to design, but implementation is a bugger (as I said above, there are no tools). I think they’re using that style of async design because of the involvement of Handshake Solutions, and part of the design process involved Handshaking Expansions (OK, maybe that’s a stretch of a connection, but as a student of that style I think it’s superior anyway ;p )
Edit: Hmm… looking at the Handshake Solutions website, it looks like they may be doing that self-timed circuitry crap after all. Oh well, I guess it’s too much to hope that everything I studied in college would be commercially successful.
>One can argue about whether the storage element is a true latch with a short enable window or a true edge-triggered register; either can be used in such schemes, as long as the clock is being conditionally generated when needed. Even in regular clocked systems, either can be used, although latch-based design has always been more tricky.
You are right about latch-based designs being more tricky. But the difference between a latch and a register is clear: the register requires a circuit that is edge-driven, while a latch can easily be implemented in straight combinatorial logic. Using ‘*’ for AND, ‘+’ for OR, and ‘!’ for NOT:
OUTPUT = (INPUT * LE) + (OUTPUT * !LE) + (INPUT * OUTPUT)
The tricky part is that the last term is logically redundant, so some software will eliminate it and not even tell you. Of course, LE must be deasserted sufficiently before INPUT changes, which can also be tricky if INPUT and LE are being generated from the same event.
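A quick C check of the equation’s steady-state behavior (a sketch; note the hazard that the redundant term covers is an analog transient during the LE transition, which a step-by-step model like this cannot exhibit, but the truth table itself is easy to verify):

    #include <stdio.h>

    /* OUTPUT = (INPUT * LE) + (OUTPUT * !LE) + (INPUT * OUTPUT) */
    static int latch(int input, int le, int output) {
        return (input && le) || (output && !le) || (input && output);
    }

    int main(void) {
        int q = 0;
        q = latch(1, 1, q); printf("LE=1 IN=1 -> Q=%d\n", q); /* 1: transparent */
        q = latch(1, 0, q); printf("LE=0 IN=1 -> Q=%d\n", q); /* 1: held */
        q = latch(0, 0, q); printf("LE=0 IN=0 -> Q=%d\n", q); /* 1: held */
        q = latch(0, 1, q); printf("LE=1 IN=0 -> Q=%d\n", q); /* 0: transparent */
        return 0;
    }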
“Because clockless processors consume zero dynamic power when there is no activity, they can significantly extend battery life compared with clocked equivalents.”
It seems completely misleading.
Power = Dynamic Power + Leakage.
– When there is no activity (the CPU waiting for an event), clocks are usually gated in a synchronous design, and this dynamic advantage disappears.
– The amount of extra logic needed to do signal handshaking in an asynchronous CPU not only impacts frequency, but implies that the design is a few times bigger than its synchronous equivalent (let’s check die sizes). And this is where leakage power appears: it is linear in design size. On recent silicon processes (starting at 90nm, worse at 65nm), half of the consumed power is leakage.
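For reference, the standard first-order CMOS power model behind this argument, in the same plain notation as above (a = activity factor, C = switched capacitance, V = supply voltage, f = clock frequency, Ileak = leakage current):

Power = a * C * V^2 * f (dynamic) + V * Ileak (leakage)

The V^2 term is also why adjusting the supply voltage, as described in an earlier comment, is such an effective knob on power.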
Having said that, there is an interesting opportunity for asynchronous logic in some areas outside a CPU, or even in a “secure” CPU (it defeats power/timing analysis, but is it worth the cost?).
I was just thinking that self-clocking at a local logic-block level (adders/flip-flops/latches, etc.) should be easy. You generate your local clock when the two rails of a signal do not share the same value.
HALF-ADDERs and NOT gates are easy to build using dual-rail logic.
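A C sketch of what that might look like (the gate equations are the standard dual-rail forms; the names are mine, not from any particular design). The local strobe fires exactly when both outputs’ rails disagree, i.e. when the adder has finished:

    #include <stdio.h>

    /* Dual-rail bit: (t,f) = (1,0) encodes 1, (0,1) encodes 0. */
    struct dr { int t, f; };

    static struct dr dr_and(struct dr a, struct dr b) {
        struct dr r = { a.t && b.t, a.f || b.f };
        return r;
    }

    static struct dr dr_xor(struct dr a, struct dr b) {
        struct dr r = { (a.t && b.f) || (a.f && b.t),
                        (a.t && b.t) || (a.f && b.f) };
        return r;
    }

    /* The local "clock": fires once the two rails of each output
     * signal no longer share the same value. */
    static int local_strobe(struct dr sum, struct dr carry) {
        return (sum.t != sum.f) && (carry.t != carry.f);
    }

    int main(void) {
        struct dr one = {1, 0};
        struct dr sum   = dr_xor(one, one);   /* 1 + 1: sum   = 0 */
        struct dr carry = dr_and(one, one);   /*        carry = 1 */
        printf("sum=%d carry=%d strobe=%d\n",
               sum.t, carry.t, local_strobe(sum, carry));
        return 0;
    }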
Caltech was first.
1989, Caltech Asynchronous Microprocessor:
http://www.async.caltech.edu/cam.html
1998, Caltech MiniMIPS:
http://www.async.caltech.edu/mips.html