Linked by Tony Bourke on Thu 22nd Jan 2004 21:29 UTC
Benchmarks: When running tests, installing operating systems, and compiling software for my Ultra 5, I came to the stunning realization that, hey, this system is 64-bit, and all of the operating systems I installed on this Ultra 5 (can) run in 64-bit mode.
another side
by JJ on Fri 23rd Jan 2004 02:17 UTC

You can get an even better feel for these things by sitting down and designing a CPU core. That's not very practical if you chase the big guys at 1-3 GHz cycle rates, but if you pull down to 200 MHz it's doable in an FPGA now. I will refer my comments to the Spartan-3 family being introduced by Xilinx. I am using the free WebPACK tool download to synthesize my design in Verilog. That limits me to the smaller parts, up to about 400K equivalent logic gates, which is more than enough to hold several small 16-64b CPU cores.

The core I am developing is parameterized so that I can choose the CPU width, the register file height, how deep the ALU pipeline is, and the size of the internal SRAM (which might be cache or fixed memory).

The main limiting factor of most CPUs, and one that is very hard to work around, is the speed of first-level memory or cache. In the FPGA case it can cycle at about 200 MHz on random access and is also dual-ported. Each independent block RAM can be configured as 512 by 32, or 16K by 1, or somewhere in between. Using the wider widths allows more bandwidth, so I use the 32b form. Further, the smallest FPGA has 4 of these (SP3-50, about $3) and the largest about 104 (SP3-5000, about $100); those are high-volume prices. Block RAMs can be ganged into super blocks to make bigger RAMs with little speed impact, but for N CPUs I would want to limit it to one block RAM per CPU core. A more parallel uber-core might instead use these block RAMs for super-deep register files, TLBs, MMUs, etc.
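The aspect-ratio range and the bandwidth argument above can be checked with a little arithmetic (a sketch assuming a 16 Kbit data block with power-of-two widths, ignoring the parity bits real parts carry):

```python
# One block RAM holds 512 x 32 = 16384 data bits; the same bits can be
# arranged at narrower widths, down to 16K x 1, per the comment above.
BLOCK_BITS = 512 * 32  # 16384

def aspect_ratios(total_bits=BLOCK_BITS, max_width=32):
    shapes, width = [], 1
    while width <= max_width:
        shapes.append((total_bits // width, width))
        width *= 2
    return shapes

print(aspect_ratios())
# [(16384, 1), (8192, 2), (4096, 4), (2048, 8), (1024, 16), (512, 32)]

# Why the 32b shape wins on bandwidth: dual-ported access at 200 MHz.
def peak_bandwidth_bits(width, ports=2, mhz=200):
    return width * ports * mhz * 1_000_000

print(peak_bandwidth_bits(32))  # 12800000000 bits/s, i.e. 12.8 Gbit/s
```

The 16K by 1 shape delivers 1/32 of that bandwidth from the same silicon, which is why the wide configuration is the natural choice for a first-level store.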


Next is the width of the ALU, or adder. In ASIC/VLSI design the delay cost of width reduces to roughly logarithmic, so the 64b vs. 32b delay difference is similar to the 32b vs. 16b difference, thanks to propagate-generate schemes forming N-way trees. This starts to break down a bit beyond 64b, as further doubling of width becomes wire-limited.
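That logarithmic claim can be made concrete with a toy delay model (unit = one logic level; the constants are illustrative, not from any datasheet): a parallel-prefix propagate-generate adder needs about log2(N) levels, versus N for ripple carry.

```python
import math

# Toy gate-level delay model: prefix-tree (propagate-generate) adders
# grow logarithmically in depth, ripple-carry adders linearly.
def prefix_levels(n_bits):
    return math.ceil(math.log2(n_bits))

def ripple_levels(n_bits):
    return n_bits

for n in (16, 32, 64):
    print(n, "ripple:", ripple_levels(n), "prefix:", prefix_levels(n))

# Doubling the width costs only one extra prefix level, so the
# 64b-vs-32b delta matches the 32b-vs-16b delta:
assert prefix_levels(64) - prefix_levels(32) == prefix_levels(32) - prefix_levels(16) == 1
```

The model deliberately ignores the wire-delay term that dominates past 64b, which is exactly where the comment says the scaling breaks down.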

In FPGAs, general logic is about 5x slower than in 1 GHz-class ASICs, BUT built-in logic that is simply instantiated can be just as fast, since it is still circuit-level designed by the FPGA company. In FPGAs, carries are almost always ripple-carry type, since propagate-generate circuits are irregular and not in the FPGA fabric.

So a 16b adder can cycle at 200 MHz, a 32b at 150 MHz, and a 64b at 100 MHz; i.e., the cycle time follows the width. In all these cases the latency through the datapath is the same in cycle count. One way to speed up the wider CPU is to break the carry every 16b and add another pipeline stage. You can see now why CPUs start to get very deep. Now a CPU can perform at 200 MHz for any width if an extra N/16 cycles of latency are accepted. But that introduces hazard headaches, which in turn have to be compensated for by various schemes.
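The carry-break idea can be sketched behaviorally: a W-bit add done as W/16-bit chunks, one chunk per pipeline stage, with the carry handed from stage to stage. Each stage only ever ripples 16 bits, so the clock stays at the 16b rate (~200 MHz in the figures above) while a W-bit add takes W/16 cycles of latency.

```python
# Behavioral model of a carry-broken, pipelined adder: one 16b chunk is
# summed per stage (i.e. per cycle), carry forwarded to the next stage.
CHUNK = 16

def chunked_add(a, b, width):
    carry, result = 0, 0
    mask = (1 << CHUNK) - 1
    for stage in range(width // CHUNK):   # one pipeline stage per cycle
        shift = stage * CHUNK
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        carry = s >> CHUNK                # passed to the next stage
        result |= (s & mask) << shift
    return result & ((1 << width) - 1), width // CHUNK  # (sum, latency)

print(chunked_add(0xFFFF_FFFF_FFFF_FFFF, 1, 64))  # (0, 4): wraps, 4-cycle latency
```

Note the result itself is identical to a flat add; only the latency in cycles grows with width, which is the hazard problem the next paragraph deals with.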

Hazards can be reduced by making the compiler work harder, by adding hazard detection and register forwarding logic, and finally by adding multithreading to make nearby opcodes independent. There are costs associated with all of these as well; I will be using some of each.
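The detection half of that is simple in principle. A minimal sketch, assuming a made-up (dest, src1, src2) instruction format purely for illustration: flag any instruction that reads a register written by one of the previous pipe_depth instructions, since that pair needs a forward, a stall, or another thread's opcode slotted in between.

```python
# Minimal RAW (read-after-write) hazard scan over a straight-line
# instruction list; instructions are hypothetical (dest, src1, src2) tuples.
def raw_hazards(instrs, pipe_depth):
    hazards = []
    for i, (_, s1, s2) in enumerate(instrs):
        for j in range(max(0, i - pipe_depth), i):
            dest = instrs[j][0]
            if dest in (s1, s2):
                hazards.append((j, i))  # instr i depends on instr j
    return hazards

prog = [(1, 2, 3),   # r1 = r2 op r3
        (4, 1, 5),   # r4 = r1 op r5  <- reads r1 just written
        (6, 7, 8)]   # independent
print(raw_hazards(prog, pipe_depth=2))  # [(0, 1)]
```

Multithreading attacks the same list from the other side: interleave opcodes from independent threads so no pair within pipe_depth ever conflicts.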

The height of the register file is likely to be 16 or 32. The penalty here is not speed but the sheer number of logic cells that form the dual-ported 16x1 teeny RAMs. These can be ganged into 4-ported register files with 1 write and 3 read paths. A 64b CPU with 64 registers sucks up 64x64x2x3/16 LUTs, or 1536 cells, out of a few thousand. If a CPU is going to have a really large register file in both width and height, then it should use the block RAMs, but there are few of those as well.
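Reproducing that arithmetic as a function (the factor-of-2 term is taken directly from the comment's formula; my reading is that it reflects how the dual-ported 16x1 primitives are paired, but that interpretation is an assumption):

```python
# LUT budget for a LUT-RAM register file: width x height bits, times 2
# per the formula quoted above, times one copy per read port, packed
# 16 bits of depth per LUT.
def regfile_luts(width, height, read_ports=3, lut_depth=16):
    return width * height * 2 * read_ports // lut_depth

print(regfile_luts(64, 64))  # 1536, matching the figure above
print(regfile_luts(32, 16))  # 192: a 32b/16-register file is cheap
```

So halving both width and height cuts the cost by 4x, which is why a 16- or 32-high file is the sweet spot for a small core.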

The piece de resistance is that since the CPU is a message-passing transputer core in spirit, it can be scaled up for N-way supercomputers with far less overhead than shared-memory designs. Now if each CPU node, even a 64b CPU, is close to $1 per instance, it begs the question: would I rather have N Opterons that do not scale in cost, albeit 10x faster per node, or 100x cheaper but 10x slower nodes? There's far more to the story than that, i.e. FPUs are going to be fairly weak and scarce in FPGA, and there is a compiler to support C, Occam & HDL, etc....
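A back-of-the-envelope version of that trade-off, using only the ratios quoted (100x cheaper, 10x slower); the absolute prices and "perf units" are illustrative, and this naive model credits neither side with scaling overhead, which the comment argues favors the message-passing design:

```python
# Aggregate throughput for a fixed budget: buy as many nodes as the
# budget allows, multiply by per-node performance. Numbers are toy values.
def aggregate_perf(budget, node_cost, node_perf):
    return (budget // node_cost) * node_perf

budget = 10_000
opteron = aggregate_perf(budget, node_cost=100, node_perf=10)  # fast, costly node
fpga    = aggregate_perf(budget, node_cost=1,   node_perf=1)   # 100x cheaper, 10x slower
print(opteron, fpga)  # 1000 10000: 10x the aggregate perf per dollar, if it scales
```

The "if it scales" is the whole point: shared-memory overhead erodes the fast nodes' advantage as N grows, while message passing is claimed to hold up.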

Hope this tidbit is of interest.

johnjaksonATusaDOTcom