Linked by Tony Bourke on Thu 22nd Jan 2004 21:29 UTC
Benchmarks When running tests, installing operating systems, and compiling software for my Ultra 5, I came to the stunning realization that hey, this system is 64-bit, and all of the operating systems I installed on this Ultra 5 (can) run in 64-bit mode.
Omitted Details
by MJ on Fri 23rd Jan 2004 06:56 UTC

I appreciate the author's attempt to provide an objective and unbiased review of the performance aspects concerning 32-bit vs. 64-bit computing on his Ultra 5. However, there are a number of details that he did not explore, and a few critical mistakes concerning his testing methodology. I'm going to start with the larger issues and work down from there:

First, his comparison between 32 and 64 bit applications is not correct. From the article:
# file openssl

openssl: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required, UltraSPARC1 Extensions Required, dynamically linked, not stripped

While the binary format of this executable is ELF-32, the application in question is not a true 32-bit application. The "SPARC32PLUS, V8+ Required" in the output indicates that this application is compiled for the SPARC v8plus architecture. V8plus uses 32-bit addresses but allows an application to keep its data in registers as 64-bit quantities, so realistically these comparisons are between programs that use 32-bit vs. 64-bit addresses but all have 64-bit registers. This distinction isn't explored in the article, but it is important. To get a true characterization of 32-bit addresses and registers, the benchmarks ought to also be compiled for the v7 architecture. I think this may make the differences between pure 32-bit and pure 64-bit applications more observable.

The v8plus benchmarks show the obvious benefit of 64-bit registers to compute-intensive applications while not suffering from the drawbacks of having a 64-bit address space. My suspicion is that if these tests are re-run for the v7 architecture, the results will show that the 32-bit applications perform better on workloads characterized by lots of load/store behavior, while the v9 applications trump the v7 ones at computations. This is because there's more register space on v9, allowing more data to be computed at once.
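
To make the computation point concrete, here is a rough sketch in Python (not SPARC code) of why wider registers help arithmetic on 64-bit values. With only 32-bit registers, a 64-bit add has to be done as two adds with a carry passed between them; with 64-bit registers it is a single add. The function names and the operation counts are my own illustration, not anything measured in the article.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_with_32bit_regs(a, b):
    """Add two 64-bit values using only 32-bit pieces.

    Returns (result, add_operations). Each operand occupies two
    32-bit "registers" (low half and high half).
    """
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32

    lo = a_lo + b_lo               # first add: low halves
    carry = lo >> 32               # carry out of the low half
    lo &= MASK32

    hi = (a_hi + b_hi + carry) & MASK32   # second add: high halves plus carry
    return (hi << 32) | lo, 2


def add64_with_64bit_regs(a, b):
    """Add two 64-bit values held in single 64-bit registers."""
    return (a + b) & MASK64, 1


if __name__ == "__main__":
    x, y = 0x00000001FFFFFFFF, 0x0000000000000001
    print(add64_with_32bit_regs(x, y))
    print(add64_with_64bit_regs(x, y))
```

Both functions give the same answer; the difference is that the 32-bit version needs twice the adds and twice the registers per operand, which is the sort of cost that adds up in bignum-heavy code like OpenSSL.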

There are various reasons for 64-bit apps to lag slightly in performance, and a few are worth keeping in mind when examining these kinds of problems. With 64-bit addresses, you've doubled the size of your pointers, which is one reason why the size of the compiled binaries increases. These addresses have to go somewhere. Also, since you have larger addresses, your cache footprint increases, which means less of your working set fits in the cache. More cache misses == poorer performance, as you have to go further down the memory hierarchy to satisfy your requests. As a point of fact, the SPARC v9 architecture allows at most 22 bits for an immediate operand (in sethi), so to construct a 64-bit constant you have to issue more instructions. SPARC uses register windows, and when you take a register spill/fill trap in a 64-bit address space, you have more data to save and restore than in a 32-bit one. These are just some of the factors that distinguish the behavior of 32-bit and 64-bit address spaces.
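
As a back-of-the-envelope illustration of the pointer-size argument, the following Python sketch models a linked-list node holding one pointer and one 4-byte integer, and counts how many such nodes fit in a cache. The 16 KiB cache size, the node layout, and the natural-alignment rule are assumptions chosen for the example, not figures from the article or from the Ultra 5.

```python
CACHE_BYTES = 16 * 1024   # hypothetical cache size, chosen only for illustration


def align_up(n, alignment):
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def node_size(pointer_bytes, payload_bytes=4):
    """Size of a node laid out as { pointer next; 4-byte int value; }.

    Assumes the struct is padded to a multiple of its largest member,
    as a typical C compiler would do.
    """
    raw = pointer_bytes + payload_bytes
    return align_up(raw, max(pointer_bytes, payload_bytes))


def nodes_per_cache(pointer_bytes):
    """How many whole nodes fit in the hypothetical cache."""
    return CACHE_BYTES // node_size(pointer_bytes)


if __name__ == "__main__":
    for label, ptr in (("32-bit addresses", 4), ("64-bit addresses", 8)):
        print(label, node_size(ptr), "bytes/node,", nodes_per_cache(ptr), "nodes")
```

Under these assumptions the node grows from 8 to 16 bytes, so only half as many nodes fit in the same cache. Real data structures are rarely this pointer-heavy, so treat it as a worst-case flavor of the effect rather than a prediction.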

I also have some concerns about the author's static vs. dynamic linking. In two cases the author compares v8plus vs. v9 using completely dynamically linked binaries, and in the other cases he compares v8plus to v9 using mostly dynamically linked applications, with only libcrypto and libssl linked statically. The problem here is that there is still dynamic linker overhead, both as the application starts up and as it runs. While the "statically" linked binaries obviously benefit from having to take fewer detours through the PLT, these apps are still dynamically linked to libc, libthread, and probably others, so the full benefit of statically linking them is lost. The 64-bit dynamically linked apps take longer than their 32-bit counterparts for reasons that include more instructions in the PLT to generate the function address to which to jump.
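
The "more instructions to generate the address" point comes down to building a wide value out of small immediates. Here is a minimal Python sketch of that idea, assuming a 22-bit "set high bits" step followed by a 10-bit "or in the low bits" step, repeated for each 32-bit half of a 64-bit value. The step names and the six-step sequence are one illustrative way to do it; actual PLT entries and compiler output will differ, so take the counts as indicative only.

```python
MASK22 = (1 << 22) - 1
MASK10 = (1 << 10) - 1
MASK32 = 0xFFFFFFFF


def build_32(value):
    """Build a 32-bit constant from a 22-bit and a 10-bit immediate.

    Returns (constructed_value, list_of_steps).
    """
    high22 = (value >> 10) & MASK22
    low10 = value & MASK10
    reg = high22 << 10        # "sethi"-style step: upper 22 bits, low 10 cleared
    reg |= low10              # "or"-style step: fill in the low 10 bits
    return reg, ["sethi", "or"]


def build_64(value):
    """Build a 64-bit constant as two 32-bit halves joined together.

    Returns (constructed_value, list_of_steps).
    """
    upper, upper_steps = build_32((value >> 32) & MASK32)
    lower, lower_steps = build_32(value & MASK32)
    reg = (upper << 32) | lower   # shift the upper half up, then merge
    steps = upper_steps + ["shift-left-32"] + lower_steps + ["or"]
    return reg, steps


if __name__ == "__main__":
    for v, builder in ((0xDEADBEEF, build_32), (0x0123456789ABCDEF, build_64)):
        result, steps = builder(v)
        print(hex(result), len(steps), "steps:", steps)
```

In this model a 32-bit address takes two steps and an arbitrary 64-bit address takes six, which gives a feel for why every trip through a 64-bit PLT entry can cost a little more.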

I'm sure there are plenty of other performance aspects that I forgot to touch upon, but my biggest frustration with this article is that it fails to tease out the details about which applications perform better on 32-bits and which perform better on 64-bits and why. I hope my comments were able to fill in some of those gaps. By running his benchmarks on a v8plus architecture, the author has successfully demonstrated what an effective compromise 32-bit addresses and 64-bit registers can be, but he hasn't characterized actual 32-bit application performance. That said, I do appreciate his fair, factual, and un-evangelical approach to the benchmarking. It certainly provided a good starting point for discussions on 32-bit vs. 64-bit performance.