Linked by Tony Bourke on Thu 22nd Jan 2004 21:29 UTC
Benchmarks When running tests, installing operating systems, and compiling software for my Ultra 5, I came to the stunning realization that hey, this system is 64-bit, and all of the operating systems I installed on this Ultra 5 (can) run in 64-bit mode.
Compiling 64bits vs writing 64bits
by Uruloki on Thu 22nd Jan 2004 21:36 UTC

There's a huge difference between simply compiling 32-bit software with a 64-bit compiler and actually writing it to make the best use of the enhanced capabilities.
There is also no need at all to use 64-bit ints in cases where a 16-bit int would more than suffice.

TheMatt likes 64-bit...
by TheMatt on Thu 22nd Jan 2004 22:01 UTC

Of course, I use quantum chemistry programs. Believe me, you can do a lot more interesting and bigger things with 64-bit compared to 32-bit. And the programs are much faster as the blokes who code these babies use every iota of register if they can.

And, I suppose, I'm colored by my main 64-bit exposure being on my Alpha XP1000 and Digital/Compaq/HP's F90 compiler. Oooh...so nice, so well made. But, I'm guessing Opteron will be my next...then it'll be gfortran (someday) or pgf90 for me.

While we're adding stuff! :)
by Dawnrider on Thu 22nd Jan 2004 22:05 UTC

This set of benchmarks is actually very nice, since it compares pure 32-bit code against 64-bit code on a system without too many other differences between the 64-bit and 32-bit implementations.

On x86-64, however, recompiling to 64-bit also gives the compiler access to additional registers, so it doesn't have to waste cycles shuffling contents around the limited number of x86 registers. This offers a performance boost beyond simple recompilation, quite apart from the benefits of 64-bit operations where they are needed.

Therefore, readers considering Opterons and Athlon64s should be aware that the figures here do not represent the results they would discover from 64bit recompilation. ;)

Additional 32/64 bit benchmarks
by nitrile on Thu 22nd Jan 2004 22:16 UTC

A synthetic test suite I've used for benchmarking unix systems is nbench ( http://www.tux.org/~mayer/linux/bmark.html ). It's available as source and has a small variety of individual tests which, though each is reported individually, aren't reproduced here.

Ultra-10, USIIi 333MHz/2MB:

--------
gcc -s -static -Wall -O2 -mcpu=v9 -m32
nbench: ELF 32-bit MSB executable, SPARC32PLUS, V8+ Required, version 1 (SYSV), statically linked, stripped

==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.4.21
C compiler : gcc-3.2
libc : ld-2.3.2.so
MEMORY INDEX : 1.076
INTEGER INDEX : 1.257
FLOATING-POINT INDEX: 1.760

--------
gcc -s -static -Wall -O2 -mcpu=v9 -m64
nbench: ELF 64-bit MSB executable, SPARC V9, version 1 (SYSV), statically linked, stripped

==============================LINUX DATA BELOW===============================
CPU :
L2 Cache :
OS : Linux 2.4.21
C compiler : gcc-3.2
libc : ld-2.3.2.so
MEMORY INDEX : 1.102
INTEGER INDEX : 1.203
FLOATING-POINT INDEX: 1.661

--------

Running tests multiple times produces some variation, but in general they are all fairly close.

Sun compiler
by Al Dente on Thu 22nd Jan 2004 22:16 UTC

Since 64-bit support is rather young in gcc, it may have been fairer to use Sun's compiler for the tests.

Unless things have changed recently, I believe you can demo Sun's C compilers without buying them.

Still...
by BFG on Thu 22nd Jan 2004 22:18 UTC

It is nice to see the different binary formats executed on the same system; this gives a good idea of how much overhead there is in 64-bit compiled binaries (vs. 32-bit) and how it affects performance given the same number of registers.

v7
by nitrile on Thu 22nd Jan 2004 22:19 UTC

Another factor of interest, and one much more relevant to system performance, is that Debian on sparc is actually compiled for SPARC v7, which is broadly equivalent to compiling an x86 distribution for a 386.

The most visibly represented inefficiencies are on floating point code, which is done under FPU emulation regardless of whether or not an actual FPU is present (as it effectively always is nowadays, even on the oldest generally available sparc systems second hand).

Typically it's not a major issue as integer arithmetic dominates the programs supplied with the packaging system, but those running Debian on sparc v8 or above (SPARCStation 4-20, Ultras) can benefit from recompiling OpenSSL, which will speed up anything that depends on it fourfold.

x86-64 bit binaries
by CaptainPinko on Thu 22nd Jan 2004 22:23 UTC

I'm pretty sure though that there will be a bigger difference when compiling x86 binaries as x86-64 binaries, because there will be more registers available to the compiler. I'm guessing the reason the 32 vs. 64-bit difference is so small in nitrile's case is that the SPARC was already RISC, so the UltraSPARC just added 32 bits... x86-64, IIRC, adds new registers. So as interesting as this is, I'd like to see the same tests performed on Solaris x86 and x86-64 when it comes out.

DISCLAIMER: I didn't RTFA but scanned it. I've never looked at the UltraSPARC ISA but have programmed in both x86 and SPARC assembly.

when you benefit from 64bit
by Gandalf on Thu 22nd Jan 2004 22:29 UTC

Wonder why OpenSSL is faster in 64-bit than 32-bit? And why it's the opposite for the other applications?

One hint: floating point registers. They are usually 64bits and therefore applications that make heavy use of floating point variables/registers directly benefit from a 64bit binary.

Well, of course you'll benefit from 64bits if you are dealing with more than 4GB of RAM, or memory-mapped space (e.g. dealing with files larger than 4GB).

On the other hand, using 64-bit registers/pointers takes up more memory. I.e., if you are dealing with 32-bit integer variables, then you are essentially wasting 32 bits of memory for each variable, and loading/storing from/to memory also takes more time. Therefore your applications slow down if you use 64-bit builds that only deal with 32-bit (integer) data.

OpenSSL is using lots of floating point calculations, therefore performance is better with a 64bits build: the overhead of the 64bit pointers is not as bad as the benefit from using 64bit floating point registers directly.

MySQL/gzip on the other hand don't use much floating point stuff, and therefore the overhead of using 64-bit pointers slows them down.
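
For illustration, a minimal C sketch of the point about data sizes (assuming the usual ILP32/LP64 models that gcc's -m32/-m64 select; the exact figures are ABI-dependent):

#include <stdio.h>

int main(void)
{
    /* Typical output: 4/4/4 for a 32-bit (ILP32) build, 4/8/8 for a 64-bit (LP64) build. */
    printf("sizeof(int)    = %lu\n", (unsigned long) sizeof(int));
    printf("sizeof(long)   = %lu\n", (unsigned long) sizeof(long));
    printf("sizeof(void *) = %lu\n", (unsigned long) sizeof(void *));
    return 0;
}

A 64-bit build only pays the extra memory for the types that actually widen (longs and pointers), which is why pointer-heavy data structures feel it the most.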

Registers
by Chris on Thu 22nd Jan 2004 22:34 UTC

I'm sure this is mostly concerning the recent Athlon64, as it is the first 64 bit x86 chip that I have heard of.
The benefit with it isn't 64 bit integers, as stated above it's that it has 16 registers per unit where Athlon has 8. Pentium 4 has 8, and I believe I read somewhere it has 112 other invisible registers. Doubling the registers is definitely a good thing as one can lose clock cycles even when requesting information from cache; and with an 18/20 stage pipeline losing a cycle is a big deal.

Also, we have to move eventually as 32bits can only hold the epoch till 2038 if I remember correctly. There are of course other reasons why a 64bit machine is cool. And a less than 20% performance loss in those tests is a small price to pay, as die size decreases we will make up for those speed issues in increased clock speeds within a year or two.
The benchmarks seem pretty predictable though, and it's true that for most people's needs there is no point in having a 64bit integer unit. But in 5 years there most likely will be, so we might as well start upgrading now.
If possible, I'd love to see a correct benchmark of an Athlon64, since all I have seen so far have been 32-bit comparisons. I want to see its 64-bit mode compared to its 32-bit compatibility mode.

RE: when you benefit from 64bit
by Inhibit on Thu 22nd Jan 2004 22:38 UTC

You might want to check the article again. I believe it shows 32bit compiles as being faster on SSH.

RE:when you benefit from 64bit
by JCS on Thu 22nd Jan 2004 22:39 UTC

Strictly speaking, almost all 32-bit processors support at least 64-bit floats - in fact, the P6 and above support 80-bit floats internally to reduce rounding errors. What you do see with a "64-bit" system is a 64-bit *integer* (long long) type in hardware. I wouldn't be surprised if OpenSSL made use of that fact.

Apart from tricks like PAE and whatever is in Panther that does the same thing, your memory addressing comment is quite correct. An Opteron running Linux x86-64 is a 64-bit system. A Xeon with 8 GB of RAM - and using it - is not.

But if the software isn't optimized for 64 bit...
by sifu on Thu 22nd Jan 2004 22:49 UTC

It's irrelevant.

Responses
by Rayiner Hashem on Thu 22nd Jan 2004 22:50 UTC

@Uruloki A 64-bit machine doesn't necessarily have "enhanced capabilities" over a 32-bit machine. It can use 64-bit integers, and address > 4GB of memory. Unless you are doing either of those, there is no real way to write your program to be faster on a 64-bit machine. Precisely what sort of code changes are you thinking about? Besides, in this day and age, you should almost never micro-optimize your code for a specific CPU architecture. Unless you are comfortable holding all the details of a 20-stage pipeline, different latencies for dozens of instructions, complex branch prediction, and the state of 128 rename registers in your head at once, the compiler will do a better job than you. And all your micro-optimizations will be useless when the next Intel chip comes out with different performance characteristics.

PS> Using a 16-bit integer is often a bad idea. CPUs like word-aligned data. 16-bit integers are quite often slower than 32-bit integers unless you are working with them in a way where the CPU can load two of them at a time. Often, if you put a 16-bit integer in a struct, the compiler will ignore you and pad it out to 32-bits, to maintain alignment for the fields after it.
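
For what it's worth, a small sketch of that padding effect (sizes are ABI-dependent, so the 8 below is the typical result, not a guarantee):

#include <stdio.h>

struct sample {
    short a;    /* 16-bit field */
    int   b;    /* 32-bit field, wants 4-byte alignment */
};

int main(void)
{
    /* Usually prints 8, not 6: two bytes of padding follow 'a'
       so that 'b' stays aligned. */
    printf("sizeof(struct sample) = %lu\n",
           (unsigned long) sizeof(struct sample));
    return 0;
}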

@Gandalf
What you said is not true. On everything >= SuperSPARC and everything >= i387, the FPU is at least 64-bits. So whether or not you use a 64-bit build, double-precision (64-bit) floating-point math will run at the same speed as single-precision (32-bit) floating-point math.

@sifu
by Rayiner Hashem on Thu 22nd Jan 2004 22:51 UTC

Praytell, how do you "optimize" for 64-bit processors?

Thnx
by tumbak on Thu 22nd Jan 2004 22:51 UTC

thank you for a nice article, it gives noobs like myself good insight into the debate of 32v64 ;)

RE: when you benefit from 64bit
by Gandalf on Thu 22nd Jan 2004 22:53 UTC

> You might want to check the article again. I believe it shows 32bit compiles as being faster on SSH.

Yeah, sorry, I was wrong on that point.
But as you can see, the difference in performance is smaller than for gzip. As for MySQL, there are obviously some operations that benefit from 64-bit as well (maybe he has a large database with more than 4GB), but things like 'connect' don't benefit at all (as you can see in the graphs).

An Introduction to 64-bit Computing and x86-64
by julienp on Thu 22nd Jan 2004 23:09 UTC

Very interesting article on 64bit computing http://arstechnica.com/cpu/03q1/x86-64/x86-64-1.html
gives some answers to the questions...

Re: when you benefit from 64bit
by Megol on Thu 22nd Jan 2004 23:30 UTC

You are wrong.
Almost all processors that support floating point support native 64-bit floats even in 32-bit implementations (also 80-bit for IA32/AMD64).

The reason that OpenSSL (and much other software that encrypts/decrypts data) can be faster is that it can use 64-bit integer registers to implement the algorithms in fewer steps than with 32-bit registers. There are very few algorithms besides encryption that need 64-bit operations, therefore very few applications gain as much.
Example (Add two 64bit numbers):

32bit IA32:
add eax, ebx ; add lower half
adc ecx,edx ; add upper half

64bit AMD64:
add rax,rbx ; add whole number

Not only does the 32bit version take two instructions instead of one, the second instruction is dependent on the first so they can not execute in parallel. Another problem is that the 32bit version uses 4 registers to represent the numbers while the 64bit uses only two.
If we do the same thing on a RISC processor (MIPS-like in this example) the 32bit version would be even slower as we don't have hardware support for carry:

32bit MIPS-like:
addu r10,r10,r05 ; add lower half
sltu r09,r10,r05 ; r09=1 if carry was generated (unsigned: sum < operand)
addu r11,r11,r06 ; add upper half
addu r11,r11,r09 ; add "carry"

64bit MIPS-like:
addu r10,r10,r05 ; add whole number

The performance difference is even bigger if we compare 64x64->64bit multiplications that many encryption algorithms need.
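
In C terms, the same contrast (hedged: the exact instruction sequences depend on the compiler, but the shape matches the assembly above):

#include <stdint.h>

/* On a 32-bit target this becomes an add/add-with-carry pair (or the
   longer carry sequence on a carry-less RISC); on a 64-bit target it
   is a single register-width add. */
uint64_t add64(uint64_t a, uint64_t b)
{
    return a + b;
}

/* 64x64->64 multiplication is where the gap widens further: a 32-bit
   target needs several multiplies and adds, a 64-bit target just one. */
uint64_t mul64(uint64_t a, uint64_t b)
{
    return a * b;
}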

Compiler
by dpi on Thu 22nd Jan 2004 23:37 UTC

"Also, building a 64-bit capable compiler can be an experience to behold. That is to say, it can suck royally. After spending quite a bit of time getting a 64-bit compiler built for (one of my many) Linux installs, I ended up just going with the pre-compiled version from http://www.sunfreeware.com"

Pre-compiled version of what? GCC?

Re: Responses
by Megol on Thu 22nd Jan 2004 23:39 UTC

"Besides, in this day and age, you should almost never micro-optimize your code for a specific CPU architecture. Unless you are comfortable holding all the details of a 20-stage pipeline, different latencies for dozens of instructions, complex branch prediction, and the state of 128 rename registers in your head at once, the compiler will do a better job than you. And all your micro-optimizations will be useless when the next Intel chip comes out with different performance characteristics."

Do you really believe that compilers model processors to that detail? They don't. There is no need to do that (and it is in reality impossible) to get good performance out of the processor, and that applies to human coders also.

64-bit OpenSSL
by TonyB on Thu 22nd Jan 2004 23:41 UTC

You might want to read the article again, OpenSSL 32-bit actually beat OpenSSL 64-bit. The only place 64-bit beat 32-bit was a couple of the MySQL operations.

And while none of these applications seemed to be specially written for 64-bit (I couldn't find any), in many cases they do use 64-bit integers, depending on how the header files worked out. For PostgreSQL (I didn't end up doing benchmarking for that) I actually had problems compiling 32-bit because it kept trying to use 64-bit long long integers. I had to go in and manually edit the header files to get it to compile.

I also went out of my way, I think, to explain that these tests were limited.

Compiler
by TonyB on Thu 22nd Jan 2004 23:45 UTC

Yeah, the compiler (explained a bit earlier in the article) was GCC 3.3.2 from Sunfreeware.com.

Are we there yet?
by root on Thu 22nd Jan 2004 23:58 UTC

To begin witnessing the effect of 64-bit processing, you need a large CPU cache (>=1MB), a large amount of very fast memory (>=2GB), wider and faster buses, and most definitely SCSI subsystems. When we talk 64 bits, the amount of data we are manipulating is very large. The speed of manipulating/processing such data is not the problem. The speed of moving and storing such data is where the problem lies.

You'd better have a pretty souped-up workstation to even begin thinking about a 64-bit desktop. I'm talking gigabytes of dual DDR2 RAM, SCSI RAID, the fastest processor on the market, and chipsets that support >800MHz FSB. Then we can begin talking 64-bit computing. I agree with Intel when they say 64-bit desktop computing is still way off, unless of course you are a power user or gamer.

D'oh, rude :)
by TonyB on Fri 23rd Jan 2004 00:06 UTC

Sorry, I didn't mean to make my comments sound rude. I hate rude comments, so I apologize if they came off as such.

Re: Responses
by Gandalf on Fri 23rd Jan 2004 00:18 UTC

> What you said is not true. On everything >= SuperSPARC and everything >= i387, the FPU is at least 64-bits.

I didn't deny that. Actually quite the opposite.
Often it's 80bit or even 128bit.

FPU Datatypes (differ from platform to platform):

float: most often 32bit
double: >= 64bit (eg. on ix86 it's 80bit internally)
long double: 80bit or 128bit

But ... the load/store of the 64bit/128bit values is slower for 32bit builds than for 64bit builds, amongst other things (e.g. conversion, etc.)

The performance for >=64-bit floating point arithmetic (at least double precision) is *slower* on 32-bit systems than on 64-bit systems. It's not a huge difference, but noticeable.

Re: when you benefit from 64bit
by Gandalf on Fri 23rd Jan 2004 00:28 UTC

> You are wrong.

Ummm, I disagree. Read on.


I posted earlier:

> One hint: floating point registers.
> They are usually 64bits and therefore applications that make
> heavy use of floating point variables/registers directly
> benefit from a 64bit binary.

Note "that make heavy use of floating point variables/registers directly"

Maybe not the best wording; maybe I should've said
"that often directly use floating point registers"
or even
"that often load/store floating point registers"

Your examples are brilliant, as they show exactly what I wanted to express.

Replies
by ChocolateCheeseCake on Fri 23rd Jan 2004 01:05 UTC

RE: Chris (IP: ---.student.iastate.edu) - Posted on 2004-01-22 22:34:05
I'm sure this is mostly concerning the recent Athlon64, as it is the first 64 bit x86 chip that I have heard of.
The benefit with it isn't 64 bit integers, as stated above it's that it has 16 registers per unit where Athlon has 8. Pentium 4 has 8, and I believe I read somewhere it has 112 other invisible registers. Doubling the registers is definitely a good thing as one can lose clock cycles even when requesting information from cache; and with an 18/20 stage pipeline losing a cycle is a big deal.


True; however, RISC chips have historically always had a huge number of registers. Also, regarding the "112 other invisible registers", I assume you're referring to the renaming registers, which are a nice hack for providing more registers without actually providing more.

The thing with RISC is that, in theory, if an application is tweaked to the max, that is, using every possible feature of the chip (the huge number of registers, ISA enhancements like VIS) and a top-notch compiler, the RISC chip should perform better. In reality, however, one cannot spend the amount of time required to actually get software performing that well.

Also, we have to move eventually as 32bits can only hold the epoch till 2038 if I remember correctly. There are of course other reasons why a 64bit machine is cool. And a less than 20% performance loss in those tests is a small price to pay, as die size decreases we will make up for those speed issues in increased clock speeds within a year or two.

One also has to take into account that we've reached a point of diminishing returns. Instead of the Intels of the world working on making their processors more efficient, that is, doing more work per clock cycle, they push the pipeline out to the moon and back just so they have the ability to boast in the clock-speed hype-a-thons that occur in the local computer rags like ZDNet.

The benchmarks seem pretty predictable though, and it's true that for most people's needs there is no point in having a 64bit integer unit. But in 5 years there most likely will be, so we might as well start upgrading now.
If possible, I'd love to see a correct benchmark of a Athlon64, since all I have seen so far have been 32bit comparisons. I want to see it's 64bit mode compared to it's 32bit compatibility mode.


IIRC, x86-64 has compatibility mode and long mode. Long mode is the native x86-64 mode. Long mode covers both 32-bit and 64-bit, meaning one can have a long-mode 32-bit application featuring all the benefits of the x86-64 ISA without the need to re-tune the application.

RE: JCS (IP: ---.schaferhsv.com) - Posted on 2004-01-22 22:39:30
Apart from tricks like PAE and whatever is in Panther that does the same thing, your memory addressing comment is quite correct. An Opteron running Linux x86-64 is a 64-bit system. A Xeon with 8 GB of RAM - and using it - is not.

What is not spoken about is the fact that there are limitations and issues with PAE. Firstly, applications must be able to understand PAE; without that, applications will still only see the old maximum. Also, there is a performance penalty for it too. BTW, a Xeon with 8GB of memory uses PAE with 36-bit addressing, thus giving a max of 64GB of addressable memory. x86-64, on the other hand, IIRC addresses around 42-44 bits.

@Gandalf
by Rayiner Hashem on Fri 23rd Jan 2004 01:54 UTC

But ... the load/store of the 64bit/128bit values is slower for 32bit builds than for 64bit builds
-------------
All CPUs >= SuperSPARC have 64-bit internal datapaths, even running in 32-bit mode. That means that loads/stores for the FPU take exactly the same amount of time. Conversions between integers and floating-point values are extremely rare, so the performance of conversions isn't a big factor.

There is a very good primer on 32-bit vs 64-bit SPARC performance at:
http://www.sun.com/sun-on-net/itworld/UIR951101perf.html

another side
by JJ on Fri 23rd Jan 2004 02:17 UTC

You can get an even better feel for these things by sitting down and designing a cpu core. That's not very practical if you chase the big guys for 1-3GHz cycle rates, but if you pull down to 200MHz it's doable in an FPGA now. I will refer my comments to the Spartan-3 family being introduced by Xilinx. I am using the free-download WebPack tool to synthesize my design in Verilog. I am limited to the smaller parts, up to about 400K logic gates equivalent, which is more than enough to hold several small 16-64b cpu cores.

The core I am developing is parameterized so that I can choose the cpu width, register file height, and how deep the alu pipeline is, as well as the size of internal sram memory (might be cache or fixed).

The main limiting factor of most cpus that is very hard to work around is the speed of 1st-level memory or cache. In the FPGA case it can cycle at about 200MHz on random access and is also dual ported. Each independent block ram can be 512 by 32 or 16K by 1 or somewhere in between. Using the wider widths allows more bandwidth, so I use the 32b form. Further, the smallest FPGA may have 4 of these (sp3-50, about $3) and the largest about 104 (sp3-5000, about $100). Prices are in high volume. Block rams can be ganged into super blocks to make bigger rams with little speed impact. But for N cpus, I would want to limit it to 1 ram per cpu core. A more parallel uber core might use these block rams for super-deep register files, tlbs, mmus, etc.


Next is the width of the alu or adder. In ASIC/VLSI design the width delay can reduce to a mostly log cost, so the 64 v 32 delay difference is similar to the 32 v 16 one, due to propagate-generate schemes forming N-way trees. This starts to break down a bit after 64b as the doubling widths become wire limited.

In FPGAs general logic is about 5x slower than that found in 1GHz-level ASICs, BUT built-in logic that is simply instanced can be just as fast, since it is still circuit-level designed by the FPGA company. In FPGAs, carries are almost always ripple-carry type, as propagate-generate circuits are irregular and are not in the FPGA fabric.

So a 16b adder can cycle at 200MHz, 32b at 150MHz and 64b at 100MHz, i.e. the cycle time follows the width. In all these cases the latency through the datapath is the same in cycle count. One way to speed up the wider cpu is to break the carry every 16b and add another pipeline stage. You can see now why cpus start to get very deep. Now a cpu can perform at 200MHz for any width if extra N/16 latency cycles are accepted. But that introduces hazard headaches, which in turn have to be compensated for by various schemes.

Hazards can be reduced by making the compiler work harder, by adding hazard detection and register forwarding logic, and finally by adding multithreading to make nearby opcodes independent. There are more costs associated with all of these as well; I will be using some of each.

The height of the register file is likely to be 16 or 32. The penalty here is not speed but the sheer number of logic cells that form dual-ported 16x1 teeny rams. These can be ganged into 4-way-ported regfiles with 1w, 3r paths. A 64b cpu with 64 registers sucks up 64x64x2x3/16 luts, or 1536 cells out of a few thousand. If a cpu is going to have a really large reg file w & h, then it should use the blockrams, but there are few of those also.

The pièce de résistance is that since the cpu is a message-passing transputer core in spirit, it can be scaled up for N-way supercomputers with far less overhead than the shared-memory designs. Now if each cpu node, even a 64b cpu, is close to $1 per instance, it begs the question: would I rather have N Opterons that do not scale in cost, albeit 10x faster per node, or would I rather have 100x cheaper but 10x slower nodes? There's far more to the story than that, i.e. FPUs are going to be fairly weak and scarce in FPGA, and there is a compiler to support C, Occam & HDL etc.

Hope this tidbit is of interest.

johnjaksonATusaDOTcom

@Megol
by Rayiner Hashem on Fri 23rd Jan 2004 02:27 UTC

Do you really believe that compilers model processors to that detail? They don't. There is no need to do that (and it is in reality impossible) to get good performance out of the processor, and that applies to human coders also.
----------
Compilers do model the processor to that level of detail. GCC's code generator uses a DFA (Deterministic Finite Automaton) instruction scheduler. Each processor has a DFA description in its machine description (MD) file. These MD files describe the details of the processor's pipelines. The MD file for the i386 architecture is 23,000 lines, plus another 1000 lines for each specific CPU model. Several thousand additional lines of code are dedicated to GCC's register allocator. And the i386 is a relatively lenient architecture!

Precise modeling of the processor is even more important for processors like IA-64 and the PPC 970 that have complex rules for instruction grouping and instruction dispatch. It's especially important on IA-64, which doesn't do any internal reordering or optimization.

See http://kegel.com/crosstool if you need to build
an x86 -> x86_64 cross-compiler for Linux. Might
work for other 64 bit chips for Linux, I haven't checked.

lets switch to 16 bit
by tim hawkins on Fri 23rd Jan 2004 06:55 UTC

Bits obviously make no difference - let's all switch back to 16 bit. J/k

Oh yeah, and Linux doesn't run well on a lot of 64-bit systems, and a lot of the 64-bit systems like Solaris run slow on 32-bit systems... example:

x86 - Linux vs. FreeBSD vs. Solaris
Winners are 1) FreeBSD 2) Linux 3) Solaris
It's expected that Linux will overtake FreeBSD due to its rapid development.

64-bit - Linux vs. FreeBSD vs. Solaris
Winners are 1) Solaris 2) FreeBSD 3) Linux

well..well, well

Omitted Details
by MJ on Fri 23rd Jan 2004 06:56 UTC

I appreciate the author's attempt to provide an objective and unbiased review of the performance aspects concerning 32-bit vs. 64-bit computing on his Ultra 5. However, there are a number of details that he did not explore, and a few critical mistakes concerning his testing methodology. I'm going to start with the larger issues and work down from there:

First, his comparison between 32 and 64 bit applications is not correct. From the article:
# file openssl

openssl: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, UltraSPARC1 Extensions Required, dynamically linked, not stripped


While the binary format of this executable is ELF-32, the application in question is not a true 32-bit application. The SPARC32PLUS, V8+ required indicates that this application is compiled to use the SPARC v8plus architecture. V8plus uses 32-bit addresses but allows an application to registerize its data in 64-bit quantities, so realistically these comparisons are between programs that use 32 vs 64 bit addresses but all have 64-bit registers. This distinction isn't explored in the article, but it is important. To get a true characterization of 32-bit addresses and registers, the benchmarks ought to also be compiled to the v7 architecture. I think this may make differences more observable between pure 32-bit and pure 64-bit applications.

The v8plus benchmarks show the obvious benefit of 64-bit registers to compute intensive applications while not suffering from the drawbacks of having a 64-bit address space. My suspicion is that if these tests are re-run for the v7 architecture, the results will find that the 32-bit applications perform better on workloads characterized by lots of load/store behavior, while the v9 applications trump the v7s at computations. This is because there's more register space on v9, allowing more data to be computed at once.

The reasons for 64-bit apps to slightly lag in performance are various, but there are some important things to keep in mind when examining these kinds of problems. With 64-bit addresses, you've doubled the size of your pointers, so this is one reason why the size of the compiled binaries increases. These addresses have to go somewhere. Also, since you have larger addresses, your cache footprint increases which means you get fewer lines in the cache. More cache misses == poorer performance as you have to go further down the memory hierarchy to satisfy your requests. As a point of fact, the SPARC v9 architecture only allows you 22 bits for an immediate operand, so to construct a 64-bit constant you have to issue more instructions. SPARC uses register windows, and when you take a register spill/fill trap in a 64-bit address space, you're going to have more information in a 64-bit trap than in a 32-bit one. These are just a number of factors that characterize the behavior between 32 and 64 bit address spaces.

I also have some concerns about the author's static vs. dynamic linking. In two cases the author compares v8plus vs. v9 using completely dynamically linked binaries, and in the other cases, he compares v8plus to v9 using mostly dynamically linked applications only statically linking to libcrypto and libssl. The problem here, is that there is still dynamic linker overhead both as the application is started up, and as it runs. While the "statically" linked binaries obviously benefit from having to take fewer detours through the PLT, these apps are still dynamically linked to libc, libthread, and probably others. So, the full benefit of statically linking them is lost. The 64-bit dynamically linked apps take longer than their 32-bit counterparts for reasons which include more instructions in the PLT to generate the function address to which to jump.

I'm sure there are plenty of other performance aspects that I forgot to touch upon, but my biggest frustration with this article is that it fails to tease out the details about which applications perform better on 32-bits and which perform better on 64-bits and why. I hope my comments were able to fill in some of those gaps. By running his benchmarks on a v8plus architecture, the author has successfully demonstrated what an effective compromise 32-bit addresses and 64-bit registers can be, but he hasn't characterized actual 32-bit application performance. That said, I do appreciate his fair, factual, and un-evangelical approach to the benchmarking. It certainly provided a good starting point for discussions on 32-bit vs. 64-bit performance.

@Rayiner Hashem
by Sander Stoks on Fri 23rd Jan 2004 08:35 UTC

Praytell, how do you "optimize" for 64-bit processors?

By designing your huge-dataset algorithms around being able to address, day, each cell in a volume reconstruction, instead of using a technique similar to overlays from the old 16-bit days.

Even in "common" tasks like video editing, 32 bit addressing has reached its limits. It's not for nothing that serious file systems have 64 bit off_t's; but it would be cool to be able to mmap() such a file...

@Sander Stoks
by Rayiner Hashem on Fri 23rd Jan 2004 09:23 UTC

I don't think that's the sort of "optimization" the original poster was talking about, especially since the technique is entirely irrelevant to the benchmarks presented in the article.

Plus, what you are talking about is not so much optimization as it is getting rid of hacks that are no longer necessary because of a more permissive CPU architecture. Optimization takes general code and tunes it for a specific situation. What you are describing is actually the reverse --- taking code tuned for a specific situation (low-memory) and generalizing it.

@MJ
by Raptor on Fri 23rd Jan 2004 10:06 UTC

Also, since you have larger addresses, your cache footprint increases which means you get fewer lines in the cache. More cache misses == poorer performance as you have to go further down the memory hierarchy to satisfy your requests. As a point of fact, the SPARC v9 architecture only allows you 22 bits for an immediate operand, so to construct a 64-bit constant you have to issue more instructions.

Let's think for a second. The bigger the pointers, the fewer the cache lines; does that make any sense? The number of cache lines is the same regardless of the bits in an address. A cache line is identified by a tag, and 32-bit and 64-bit addresses will eventually hash down to similar tags, thus occupying all the cache lines in the cache. Line size and the number of cache lines are always constant for caches.

Sparc uses a 22-bit immediate field only for the sethi instruction. There are more ways to construct a 64-bit constant. At max you will need 3 instructions to build a 64-bit constant.

Re: Let's switch to 16bit
by Janne on Fri 23rd Jan 2004 11:03 UTC

"Oh yeah, and Linux dosent run well on alot of 64 systems and alot of the 64 bit systems like like solaris run slow on 32 bit systems... example..

64-bit - Linux vs. FreeBSD vs. Solaris
Winners are 1) Solaris 2) FreeBSD 3) Linux

well..well, well"

I have data that disputes your rankings. See here:

64bit: Linux vs. FreeBSD vs. Solaris

1. Linux
2. FreeBSD
3. Solaris

As you can clearly see, I can make up rankings as well! Come talk to me when you have some REAL comparisons to show as proof, OK? If you don't I'll just assume that you pulled your rankings from thin air.

v7
by TonyB on Fri 23rd Jan 2004 15:37 UTC

Actually, compiling with v7 is pretty worthless. The v7 instruction set doesn't include integer divide or integer multiply, nor does it use quad-precision floats (nor does v8 by itself, for quad floats), which can mean a significant slowdown in several apps, including OpenSSL:

compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DOPENSSL_NO_KRB5 -O3 -fomit-frame-pointer -Wall -mcpu=v7 -DB_ENDIAN -DBN_DIV2W
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
sign verify sign/s verify/s
rsa 512 bits 0.0163s 0.0016s 61.5 623.8
rsa 1024 bits 0.0947s 0.0052s 10.6 190.9
rsa 2048 bits 0.6176s 0.0187s 1.6 53.4
rsa 4096 bits 4.3433s 0.0694s 0.2 14.4
sign verify sign/s verify/s
dsa 512 bits 0.0149s 0.0183s 67.3 54.6
dsa 1024 bits 0.0503s 0.0614s 19.9 16.3
#


The result: OpenSSL performance is reduced by about half.

Using -mcpu=ultrasparc (or -xarch=native) as a compiler flag is what most people would use (and many build scripts automatically use) for UltraSPARC-based systems. It represents the highest level of optimization for an UltraSPARC system, and that's what should be used if you don't have to worry about running your binaries on older systems. And if you do, it's probably worth your time to make two sets of binaries: UltraSPARC-optimized, and binaries for the older processors you're using (hopefully v8 and not v7).

The test was done to see if 32-bit binaries run faster on 64-bit capable systems than 64-bit binaries, so using the maximum available optimizations for 32-bit on the UltraSPARC is entirely appropriate.

Granted, the tests were limited in scope. I hope this brings about a new round of performance tests done by others, representing other scenarios and methodologies.

I'm tired of seeing conjecture, and conjecture-taken-as-fact in regards to OS and platform performance. Even if it's backed up by sound computer science theory (64-bit data paths, cache misses, etc), it's still pure conjecture until it's tested.

Misconceptions
by jizzles on Fri 23rd Jan 2004 19:50 UTC

@Gandalf

"On the other hand using 64bit registers/pointers take up more memory. IE if you are deadling with 32bit integer variables, then you are essentially waisting 32bits of memory for each variable, and loading/storing from/to memory takes also more time. Therefore your applications slow down if you use 64bit builds that only deal with 32bit (integer) data."

Generally not true. There are multiple levels of the memory hierarchy, and traffic between main memory and caches generally happens in blocks (i.e. 32 or 64 bytes) at a time, which can be read and written in burst modes. The on-die L1 cache is what the processor does actual reads and writes from, and generally this connection is as wide as a register (i.e. 64 bits). Whether you read 8 bits or 32 bits or 64 bits to/from the L1 cache doesn't make a damn bit of difference.

What does contribute to the slowdown of a 64-bit application is that its data (both static data and heap data) takes up more space, reducing the overall effectiveness of the cache.

"OpenSSL is using lots of floating point calculations, therefore performance is better with a 64bits build: the overhead of the 64bit pointers is not as bad as the benefit from using 64bit floating point registers directly."

Generally floating point registers on modern processors are 64 or 80 bits wide (sometimes 128), regardless of the size of an integer register. There is a benefit to having a larger bus between the register file and the L1 cache, however.

@Megol

"Do you really belive that compilers model processors to that detail? They don't. There is no need to do that (and is in reality impossible) to get good performance out of the processor, and that applies for human coders also."

As Rayiner said, yes they do. GCC has a fairly general and configurable processor description language. Other compilers for specific architectures can go into even greater detail and perform optimizations that would be impossible to express in this language or mean nothing on a different architecture. There is a need to do that in some (very limited) situations. Generally a person will profile the code to identify hot parts of the code and begin removing bottlenecks in those areas. When two or three extremely hot paths of code are identified and the codebase is fairly stable and mature, then it may be worthwhile to optimize those few dozen or couple hundred lines to death for the best performance.



Just a few more things...
by MJ on Fri 23rd Jan 2004 19:59 UTC

Let's think for a second. The bigger the pointers, the fewer the cache lines; does that make any sense? The number of cache lines is the same regardless of the bits in an address. A cache line is identified by a tag, and 32-bit and 64-bit addresses will eventually hash down to similar tags, thus occupying all the cache lines in the cache. Line size and the number of cache lines are always constant for caches.

Eh, kind of. You're right that there will always be the same number of lines in the cache, but frequently caches are designed so that it is possible to put a number of blocks in one cache line. These blocks are selected by a tag, as you pointed out, and also an index. I should have been more specific; my comments were mostly in regard to the TLB, where you will certainly be able to hold fewer blocks if you're using a 64-bit VA instead of a 32-bit VA.

Sparc uses a 22-bit immediate field only for the sethi instruction. There are more ways to construct a 64-bit constant. At max you will need 3 instructions to build a 64-bit constant.

Indeed. However, 22 bits is the biggest immediate value you get in the SPARC instruction set. This was more to point out the obvious benefit of being able to do the load directly in x86-64.

The test was done to see if 32-bit binaries run faster on 64-bit capable systems than 64-bit binaries, so using the maximum available optimizations for 32-bit on the UltraSPARC is entirely appropriate.

You've missed my point entirely. The goals you stated at the beginning of the article were that you wanted to compare the performance of 32-bit applications versus 64-bit applications. However, I point out that there are some caveats since your testing methodology, in a number of cases, mixes 64-bit things with 32-bit things. You don't even address the issues or the drawbacks and instead insist that every possible optimization should be valid. If this is true, you should state that the purpose of your article is not to divine whether 32-bit is faster than 64-bit, but how users should go about getting their apps to run the fastest on an UltraSPARC chip. Or, that your goals were to determine the difference in speed between applications that used 32-bit pointers and 64-bit pointers. If those were your claims I would have no issue. However, you've set out with a general goal and only tested a few cases. I argue that the cases you've tested are not sufficiently general that you can offer a correct conclusion on the entirety of 32 vs. 64 bit performance.

I'm tired of seeing conjecture, and conjecture-taken-as-fact in regards to OS and platform performance. Even if it's backed up by sound computer science theory (64-bit data paths, cache misses, etc), it's still pure conjecture until it's tested.

I'm not sure what you're trying to say here. However, going into a laboratory and running experiments is never going to cure cancer, AIDS, SARS, whatever. It is experimental results coupled with a body of knowledge that allow us to make conclusions that advance the sciences. If you don't know anything about viral pathology, anatomy, physiology, etc., and you go run a medical experiment, you're unlikely to learn much from it. Do you insist that your waiters prove that real numbers exist before you pay your bill at a restaurant? Do you make people prove that gravity exists when they want to talk to you about it? This statement sounds awfully ridiculous. There are plenty of things that are theoretical that we accept as fact because it facilitates the ease with which we communicate about other things. Like any other field, there is an accepted body of knowledge relating to Computer Systems which is important to understand if you hope to reach meaningful conclusions. Simply dismissing everyone else's points as conjecture because they have not proven them to _you_ is silly. However, if you think you can prove that application performance increases if you increase your cache miss rate, please feel free to test it out. I think you'll find that this is much more solid than conjecture.

@MJ
by Raptor on Fri 23rd Jan 2004 21:48 UTC

Eh, kind of. You're right that there will always be the same number of lines in the cache, but frequently caches are designed so that it is possible to put a number of blocks in one cache line. These blocks are selected by a tag, as you pointed out, and also an index. I should have been more specific; my comments were mostly in regard to the TLB, where you will certainly be able to hold fewer blocks if you're using a 64-bit VA instead of a 32-bit VA.

The blocks you mention are always the same size regardless of the address size. An n-way associative cache on a 64-bit processor will have the same line size and block size regardless of 32-bit or 64-bit addresses being used. The index is the line: say two VAs hash down to line 0, their index is 0.

TLBs are nothing special; they are fully-associative caches for the MMU, if you will. TLBs cache address translations and not data or instructions. The tag portion of the TTE is always the same size regardless of 32/64 bit VAs. The data section which holds the physical address is also the same size no matter the size of the address.

gcc? Gag me with a spoon...
by Bascule on Fri 23rd Jan 2004 22:11 UTC

Just for fun I tried building our grid analysis software with gcc 3.x instead of Sun's C compiler from the Forte Compiler Collection. The software makes extensive use of 64-bit integer math (and consequently executes much faster on a 900MHz UltraSPARC III+ than it does on a 2GHz Athlon running Linux) as all values in the grid files are either 64-bit integers (representing fixed point values) or 80-bit IEEE floats.

Unfortunately I don't have specific numbers offhand, and I'm sure gcc has matured quite a bit since I tried this (about a year and a half ago), but the binary built with gcc performed at around 60% of the speed of the binary built with Forte 7.

Compared to the Sun compiler, gcc on sparcv9 is a joke for performance critical applications.

@Raptor
by MJ on Fri 23rd Jan 2004 22:21 UTC

The blocks you mention are always the same size regardless of the address size. An n-way associative cache on a 64-bit processor will have the same line size and block size regardless of 32-bit or 64-bit addresses being used. The index is the line: say two VAs hash down to line 0, their index is 0.

Yes, of course. I think we've misunderstood each other. All I'm saying is that if you've got blocks in your cache that are a fixed size and you're using a 64-bit quantity as opposed to a 32-bit quantity, you'll use up more blocks holding the 64-bit data. Assume that your cache line holds a number of 64-byte blocks. If you're using 64-bit quantities you can only hold 8 words/block (64-bit words), but if you're using 32-bit quantities your cache block can hold 16 words (32-bit words). If you could previously store 16 32-bit items, now you can only store 8 64-bit items, so your per-item ability to use the cache has decreased. You're right that the size itself has not decreased, but you can now hold fewer items. Obviously the 8 64-bit quantities are size-equivalent to the 16 32-bit ones, but you've halved your ability to access objects in the cache.

Sun Compiler
by TonyB on Fri 23rd Jan 2004 22:50 UTC

The next article (I believe it will go up on Monday) is on GCC (both 2.95 and 3.3.2) versus Sun's compiler. There are some surprises in the results.

@MJ
by Raptor on Fri 23rd Jan 2004 23:28 UTC

Yes, of course. I think we've misunderstood each other. All I'm saying is that if you've got blocks in your cache that are a fixed size and you're using a 64-bit quantity as opposed to a 32-bit quantity, you'll use up more blocks holding the 64-bit data. Assume that your cache line holds a number of 64-byte blocks. If you're using 64-bit quantities you can only hold 8 words/block (64-bit words), but if you're using 32-bit quantities your cache block can hold 16 words (32-bit words). If you could previously store 16 32-bit items, now you can only store 8 64-bit items, so your per-item ability to use the cache has decreased. You're right that the size itself has not decreased, but you can now hold fewer items. Obviously the 8 64-bit quantities are size-equivalent to the 16 32-bit ones, but you've halved your ability to access objects in the cache.


Yes, what you just described is true to a certain extent. However, in practice not every data type in a 64-bit binary is 64-bit. 64-bit binaries might have 8-, 16-, or 32-bit data objects in them, and caches do allow you to address a byte in a cache line. All I am getting at is that it is not very accurate to say that 64-bit addressing automatically yields poorer performance due to higher cache misses than a 32-bit binary. It is possible in the scenario you describe above.

In reality, not everyone who codes a 64-bit program makes all the data 64-bit quantities.

@Raptor
by MJ on Fri 23rd Jan 2004 23:47 UTC

Yes, what you just described is true to a certain extent. However, in practice not every data type in a 64-bit binary is 64-bit. 64-bit binaries might have 8-, 16-, or 32-bit data objects in them, and caches do allow you to address a byte in a cache line. All I am getting at is that it is not very accurate to say that 64-bit addressing automatically yields poorer performance due to higher cache misses than a 32-bit binary. It is possible in the scenario you describe above.

In reality, not everyone who codes a 64-bit program makes all the data 64-bit quantities.


I don't dispute those statements at all. I didn't think that I said that this was automatically the main reason that performance degrades, but that it is one possible aspect to consider when keeping track of issues affecting the performance of 64-bit applications. Certainly this is only going to apply to a subset of objects in an application. I didn't mean to give the impression that this was the primary cause of a performance difference between 32 and 64 bit apps. It sounds like we're in agreement, though...?

@MJ
by Raptor on Sat 24th Jan 2004 01:24 UTC

Well, I reread your original post. You did make some really great points overall, and I certainly think those are issues for newcomers to the 64-bit world to consider. However, I still got the impression that you meant 64-bit addresses implied more cache misses. But I am glad you clarified your statement.

No harm done, I am glad we had a good discussion here. It is seldom that a good discussion happens here at OSNews. Many seem to turn into a flamewar at the blink of an eye.

@Raptor
by MJ on Sat 24th Jan 2004 03:07 UTC

However, I still got the impression that you meant 64-bit addresses implied more cache misses. But I am glad you clarified your statement.

Yeah, that original statement was not particularly clear. But you also made a lot of good points, where I was required to go, "oh crap, I was wrong...how do I salvage something correct from this..." but in the end I think we both managed to come to some consensus about that.

I am glad we had a good discussion here. It is seldom that a good discussion happens here at OSNews. Many seem to turn into a flamewar at the blink of an eye.

Hehe...me too. Some people take these sorts of things awfully personally, but I'm glad neither one of us seemed to. Hopefully the rest of Tony's articles will provide equally engaging material to discuss.

nbench on Mac OS X
by AMSR on Sat 24th Jan 2004 04:16 UTC

Does this nbench thing run on Mac OS X? There seems not to be a "configure" script, and when I run make as it says to do in the README I get: "ld: can't locate file for: -lcrt0.o"

The source is here:

http://www.tux.org/~mayer/linux/bmark.html

Assembly in 32 bit OpenSSL but not the 64 bit version...
by Jim Jones Freaky on Sat 24th Jan 2004 06:02 UTC

-DMD5_ASM

It looks like the 32 bit version of OpenSSL has some assembly code in it whereas the 64 bit version was compiled from C. So the performance gap isn't as large as what your graphs show in those tests.

@Janne
by tim hawkins on Sat 24th Jan 2004 06:03 UTC

"As you can clearly see, I can make up rankings as well! Come talk to me when you have some REAL comparisons to show as proof, OK? If you don't I'll just assume that you pulled your rankings from thin air."

I'm talking about the SPARC, and I did not make up benchmarks; go research it yourself. FreeBSD performs better than Linux in scalability -- it's widely known, but it's expected Linux will exceed that due to its rapid development.

Go look for yourself. It's better that you search the internet yourself for this information, because 1) I don't have the time, and 2) it will give you a better understanding of the performance from a variety of benchmarks and tests rather than a biased one.

I was just browsing this article and saw this and decided to reply. Now it's time for me to get back to work. Because you know, customers are #1, not sitting around trying to pull up thousands of documents proving what you said is wrong. Oddly enough, it appears your information came from less than thin air... That's all I have to say.


This benchmark is a sham
by Big Gums on Sat 24th Jan 2004 06:04 UTC

The author's methodology is jive.

(just kidding, I couldn't help myself as everyone here's tried to remain civil...)

Very nice. Thanks! for the work!
by Des on Sat 24th Jan 2004 06:34 UTC


interesting results

Well *I* appreciate your effort!
by Prog3K on Sat 24th Jan 2004 07:40 UTC

I think it can be used like you said: to ask further questions.

I think you can never have too much analysis, really.

Every sampling method you can devise can tell a story if you know how to read it.

A hearty WELL-DONE and keep up the good work.

PS - I recently installed Gentoo Linux from stage 1, which I expect is analogous to your set-up: starting with the compiler, the kernel and every executable on the system gets compiled specifically for your chip, using all your chip's optimizations or CISC tricks.

It really runs super-smooth and quick on my 2.4 GHz AMD.

Had to scrap it because I eventually messed it up totally trying to get the soundcard to work, but I'll be giving it another shot soon!

Is 512MB of RAM the only storage offset in your Ultra 5?!
by Anonymous on Sat 24th Jan 2004 09:15 UTC

The author makes the point that the Ultra 5 can only have up to 512MB of RAM, so that addressing beyond 2GB is never going to be an issue. What the author seems to be completely missing is that the Ultra 5 has a *hard drive* that is larger than 2GB. Anyone who has really tried to performance-tune a production NewsNet server or *production* mysql database knows that when buffers/databases grow beyond 2GB, it is nice when fseek() is being called from a truly 64-bit binary. How large were Tony's mysql database files? He didn't say. I keep track of customer trouble tickets for at least a year and the database very quickly exceeds 2GB. I would make a bet that Tony was using something more along the lines of 20MB.
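
To make the point concrete, a hedged sketch (the file name and offset are invented; in a 64-bit binary off_t is naturally 64-bit, while a 32-bit build needs large-file support such as -D_FILE_OFFSET_BITS=64 to do the same):

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    FILE *f = fopen("spool.dat", "rb");
    if (f == NULL)
        return 1;

    /* Seek roughly 3GB in -- past what a 32-bit off_t can represent. */
    off_t where = (off_t) 3 * 1024 * 1024 * 1024;
    if (fseeko(f, where, SEEK_SET) != 0) {
        fclose(f);
        return 1;
    }

    /* ... read records that live beyond the 2GB boundary ... */

    fclose(f);
    return 0;
}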

Slashdot anti-OSNews?
by Zachary on Sat 24th Jan 2004 09:16 UTC

Is it just me, or does Slashdot do everything in their power to not advertise OSNews? This article is linked as "Tony Bourke decided" with no mention of OSNews in the post. I seem to recall other recent OSNews articles going unnamed as well.

64 vs. 32 bit benchmarks on the web
by Damon Lynch on Sat 24th Jan 2004 11:00 UTC

Aces Hardware has some examples e.g.:

http://www.aceshardware.com/read.jsp?id=60000279

and

http://www.aceshardware.com/read.jsp?id=60000256 (unfortunately only Windows for this page)

RE: Omitted Details
by Jonathan Adams on Sat 24th Jan 2004 11:20 UTC

The problem here, is that there is still dynamic linker overhead both as the application is started up, and as it runs. While the "statically" linked binaries obviously benefit from having to take fewer detours through the PLT, these apps are still dynamically linked to libc, libthread, and probably others. So, the full benefit of statically linking them is lost. The 64-bit dynamically linked apps take longer than their 32-bit counterparts for reasons which include more instructions in the PLT to generate the function address to which to jump.

True, but Solaris has never shipped 64-bit static libraries for libc or libthread. And in Solaris Next, it will not have any 32-bit static libraries. So the comparisons he did are really the only reasonable ones to do.

Various comments
by mia on Sat 24th Jan 2004 12:56 UTC

A few remarks/corrections:

SPARC v7 _has_ floating point instructions, but it lacks integer multiplication/division. (hint: 'man gcc')

The reason why there is focus on the executable size when discussing 32/64 bit is not disk space but CPU cache size. When a larger portion of the executable fits within the CPU cache, the executable will operate faster.

It would have been interesting to see a comparison against Sun CC. We have found that since GCC 2.95.3, Sun has not been able to match the efficiency and executable footprint of GCC.

Comparing 32 and 64 bit benchmarks
by Michael Meissner on Sat 24th Jan 2004 13:00 UTC

Assuming you don't have additional instructions or registers in 64-bit mode that you don't have in 32-bit mode, and don't heavily use 64-bit integer data types, the main effect of running in 64-bit mode is that your cache is less effective. This is because the cache is a fixed size, and you are now putting in larger variables (namely larger ints and pointers). You would see this effect more if you were just on the edge of thrashing the cache in 32-bit mode; 64-bit mode would completely thrash.

Ultra 5 memory limitation
by mia on Sat 24th Jan 2004 13:02 UTC

And another note:

The Sun Ultra 5 will accept just as much memory as the Ultra 10 (2GB?) - the only constraint is that you have to remove the floppy drive from your Ultra 5 to make physical space for the DIMMs.

Amd64 FFTW bench results
by Joe Georger on Sat 24th Jan 2004 14:12 UTC

I have an Opteron 246 and ran the fftw benchmark in both 32 and 64-bit modes. The 64-bit mode was about 20% faster. If I'm not mistaken I believe that's because the processor has twice as many registers available when running in 64-bit mode.

gunzip performance
by Jake on Sat 24th Jan 2004 15:12 UTC

Just a thought regarding the unzipping performance. They are likely to be even (32 vs. 64) because unzipping is a very low-CPU process. The algorithm is designed to make decompression much easier than compression, and therefore you are limited by I/O. You didn't mention whether you wrote the decompressed output to disk or to /dev/null. Of course, making that change would only remove one part of the I/O; reading the compressed file would still be done.
Thanks for the interesting article.

RE: nbench on Mac OS X
by Anthony on Sat 24th Jan 2004 15:30 UTC

Does this nbench thing run on Mac OS X? There seems not to be a "configure" script, and when I run make as it says to do in the README I get: "ld: can't locate file for: -lcrt0.o"


vi the Makefile and uncomment lines 67 and 68. It should then compile with a simple 'make'.

AMD64 SPECint2000 64 contra 32 Bit test
by Lightkey on Sat 24th Jan 2004 17:05 UTC

The German magazine c't did a comparison back in October 2003, also with gcc 3.3.
The 64-bit versions had a worse result on only 3 of the 12 tests and were 6% faster on average. Here is exactly how the 64-bit versions compared to the 32-bit ones:
164.gzip + 6%
175.vpr - 6%
176.gcc +16%
181.mcf - 1%
186.crafty + 4%
197.parser +10%
252.eon +25%
253.perlbmk - 1%
254.gap + 5%
255.vortex +10%
256.bzip2 + 4%
300.twolf + 1%

But the reason is probably the 16 registers instead of 8.
They say the PPC970 should be better suited, if only there were a true 64-bit OS for it.

Give us the info already
by Majik Fox on Sat 24th Jan 2004 20:27 UTC

I wish this guy would get to the point and stop being so fluffy.

LD_LIBRARY_PATH for 64-bit
by Anonymous Solaris User on Sat 24th Jan 2004 23:11 UTC

FYI, Solaris's ld (the dynamic linker) supports two entirely separate LD_LIBRARY_PATH variables for 32- and 64-bit executables. I would suggest that if you are going to mix executing 64- and 32-bit binaries and need their respective dynamic libraries to get loaded automatically, read the manpage for ld. It was very helpful when I once wanted to do this.

Curious
by Matt Day on Sat 24th Jan 2004 23:15 UTC

When I do the same tests using Forte Developer 7 on Solaris 8, the 64s are faster, and quite a bit. Try building the 64s with, say, -xarch=v9a and see how you go. From years of experience, gcc on Solaris is crap.

64 versus 32 bits FUD?
by Robert M. Stockmann on Sun 25th Jan 2004 05:38 UTC

Is this dude sponsored by Intel? Why would someone go through these kinds of efforts to find out a 64-bit app is a tiny bit slower than the 32-bit version? Maybe some people were shocked to find Opteron in 64-bit mode was a lot faster than when running in 32-bit mode. Why would someone create FUD about 64-bit being slower than 32-bit when Opteron is currently pulling _all_ the bricks out of Intel's backyard??

remember this quote? :

"Windows [n.]
A thirty-two bit extension and GUI shell to a sixteen bit patch to an eight bit operating system originally coded for a four bit microprocessor and sold by a two-bit company that can't stand one bit of competition."
(Anonymous USEnet post)

Here's another one :

"Itanium [n.]
a.k.a. Itanic. An incompatible sixty-four bit extension to a thirty-two bit Pentium 4 CPU created by a company whose previous CPU was called Pentium 5 and presumably also cannot count upwards in performance."

Robert