posted by Rayiner Hashem on Tue 15th Nov 2005 17:44 UTC

"The Snappy"

OS X's performance has always been a contentious issue. I myself found early versions of the OS to be unusable, given the machines on which it was expected to run. Luckily, 2.3 GHz will light a fire under even the fattest software. In a week of usage, I never found the machine to be slow. It doesn't feel super-fast, the way BeOS used to, but it never lags under load or otherwise freezes up. I was quite impressed by the speed of the OS X widget set, which feels very fast and smooth. For example, Spotlight's result list always scrolls and resizes smoothly, even when it is displaying hundreds of hits. Of course, another advantage of OS X's graphics system is that redraw is always flicker-free, since the desktop is double-buffered. This feature lends an incredible feel of solidity to the desktop, one I find far preferable to Windows' faster but "twitchier" behavior.

The performance picture isn't all rosy, however. Darwin really is slow, and it's not just something a server administrator will notice. Basic things I do in the CLI, such as expanding large archives, feel much slower on the G5 than on the X2. Though I did not include compile performance in the benchmark results below, because compiling to PowerPC and to AMD64 are two different things, I found that the G5 is perhaps two-thirds as fast at compiling software as the X2. The blame for this issue can be distributed among the G5 CPU, for its mediocre integer performance, and Darwin, for its mediocre file I/O performance.

At this point, I would like to present some benchmarks I've conducted, but want to preface them with a warning. The benchmarks below are not designed to show off the G5 or the X2. They are not designed to show the absolute best performance achievable on each platform. The SPEC benchmarks, the results of which are easily available, do that very well. What these benchmarks are designed to do is to give an idea of how the machines will actually behave when running real software. Therefore, I didn't use XLC on the G5 or Intel C++ on the X2. I used good old GCC 4.0.1, which is the standard compiler on both OS X 10.4 and Ubuntu 5.10, and is the one with which most applications on these two platforms will be compiled. I should also point out that the use of GCC gives an advantage to the X2, not so much because GCC optimizes better for x86, but because the Opteron architecture is much more forgiving of mediocre code generation. I consider this a fair arrangement, because the reliance on magic compiler technology for good performance is as much a design flaw as a crappy FPU or slow memory bus. In the world of SPEC, the processor has the privilege of running highly-tuned code. In the real world, it runs whatever the user wants to run, which, more often than not, will be mediocre code from a commodity compiler.

I would also like to make a comment about benchmarking in general. I am not someone who is impressed by small constant factors. I consider differences of 5%, the kind that gamers get excited over, to be statistical noise. I do not think most people will even notice a difference of 10%. At 20%, the differences become noticeable, if one is looking, but I can't say I've ever been in a situation where 10 minutes would have been too long, but 8 minutes would have been acceptable. From my perspective, unless I'm trying to show off, the difference has to be in the 30% range before it matters.
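As a rough illustration of those thresholds, here is a minimal Python sketch. The function name and the exact band boundaries are assumptions made for the example; the text above only names the 5%, 10%, 20%, and 30% points, so the 10% case is folded into the "unlikely to be noticed" band here.

```python
def describe_difference(slow_time, fast_time):
    """Classify a speed difference using the rough thresholds from the text.

    Both arguments are run times in the same unit. The band edges are an
    assumption; the article only names the 5%, 10%, 20% and 30% points.
    """
    # Fraction of the slower run time that the faster machine saves.
    saved = (slow_time - fast_time) / slow_time
    if saved <= 0.05:
        return "statistical noise"
    if saved < 0.20:
        return "unlikely to be noticed"
    if saved < 0.30:
        return "noticeable if one is looking"
    return "large enough to matter"


# The 10-minute versus 8-minute case from the text is a 20% difference.
print(describe_difference(10, 8))
```

Under these assumed bands, the 10-versus-8-minute case lands in the "noticeable if one is looking" category, short of the 30% range.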

The benchmark lineup below is designed to reflect the type of programs I use on my machines. I use my computer for web browsing, listening to music, writing code, writing reports, and running engineering software and simulations. As I mentioned earlier, all the benchmarks were compiled using GCC 4.0.1. In the case of the G5, it was the version in Xcode 2.2, while for the X2 it was the default version in Ubuntu 5.10. The compiler options used were as follows:


G5: -O3 -mcpu=G5 -mtune=G5 -mpowerpc64 -mpowerpc-gpopt -funroll-loops 
X2: -O3 -march=k8 -funroll-loops

The X2 doesn't really care about the compiler flags, but the G5 does. Using the above flags improved the performance of the G5 noticeably compared to the "-O2" option I normally use. It should also be noted that I didn't use Apple's '-fast' metaflag, for a number of reasons. First, it is partially redundant: it specifies several options that are on by default anyway. Second, it slowed down benchmarks, relative to the above flags, as often as it sped them up. Third, in the one case where it did show noticeable improvements over the above flags, nbench, it also generated code that could not complete the Neural Net portion of the test. The problem flag in question, -ftree-loop-linear, caused Neural Net to hang on both the G5 and X2.

At this point, an astute reader will notice that, using the aforementioned compilers and compile flags, the following benchmarks were run in 32-bit mode on the G5, and 64-bit mode on the X2. This difference was intentional. The G5 is fastest when running 32-bit code, and the X2 is fastest when running 64-bit code. Moreover, OS X and its apps are almost completely 32-bit, while Ubuntu and its apps are almost completely 64-bit. Not only was each processor running in its fastest mode, but each was running the type of code it would run during normal usage.

One last note before getting to the benchmarks. Each benchmark description specifies the units in which the results are expressed, as well as whether higher or lower values are better. Listed along with each result is a ratio specifying the relative performance of the G5 to the X2. Regardless of the unit of the benchmark, "ratio" values over 1.0 mean the G5 was faster, while "ratio" values below 1.0 mean it was slower.
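To make the ratio convention concrete, here is a minimal Python sketch of one way to compute such a ratio. The function name g5_ratio and its higher_is_better parameter are illustrative assumptions, not part of any benchmark tool. The sample figures are the LAME and TSCP results reported further down this page, with the LAME times converted to seconds.

```python
def g5_ratio(g5, x2, higher_is_better):
    """Relative performance of the G5 to the X2; above 1.0 means the G5 won.

    For rate-style results (higher is better) the ratio is g5 / x2.
    For time-style results (lower is better) it is inverted to x2 / g5,
    so that "over 1.0" always means the G5 was faster.
    """
    return g5 / x2 if higher_is_better else x2 / g5


# LAME: 4m41s = 281 s on the G5, 4m51s = 291 s on the X2 (lower is better).
lame = round(g5_ratio(281, 291, higher_is_better=False), 2)

# TSCP: thousands of nodes per second (higher is better).
tscp = round(g5_ratio(303.3, 388.7, higher_is_better=True), 2)

print(lame, tscp)  # 1.04 0.78
```

Inverting the division for time-based results is what lets a single "ratio" column be read the same way for every benchmark, whatever its unit.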

The first benchmark is a test of MP3 encoding performance using LAME. This benchmark should test the processor's integer performance on large streaming data sets. The version of LAME used was 3.96.1. The source data was Live's "Birds of Pray" CD, ripped by iTunes to WAV format. The results below are the time, in minutes and seconds, taken to encode the entire CD. Lower values are better, and the average of three trials is reported.
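For readers who want to reproduce this kind of measurement, averaging three trials can be sketched in Python as follows. This is only an assumed harness: the text does not say which timing tool was actually used, and average_wall_time and run_once are hypothetical names. In practice run_once would launch the encoder on the ripped WAV files.

```python
import time
from statistics import mean


def average_wall_time(run_once, trials=3):
    """Call run_once() `trials` times and return the mean wall-clock seconds."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return mean(samples)


# Stand-in workload; a real run would invoke the benchmark program instead.
elapsed = average_wall_time(lambda: sum(range(100_000)))
print(f"average of three trials: {elapsed:.4f} s")
```

Wall-clock time is used here because it is what a user waiting on the encode actually experiences.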


                G5      X2    Ratio
LAME:         4m41s   4m51s    1.04

This particular benchmark is a good showing for the G5. For all intents and purposes, the two processors achieved the same result. That said, given its significantly greater memory bandwidth, the G5 should have come out slightly further ahead on a streaming workload like this one. That it did not suggests its integer performance is weaker than the X2's.

The next benchmark is TSCP, a chess AI. This test should be a good indicator of the system's performance on branch-heavy integer code, such as many AIs and other forms of decision logic. The version of TSCP used was 1.81. The results below are in thousands of decision nodes visited per second, with higher values being better and the average of three trials being reported.


                G5      X2    Ratio
tscp:         303.3   388.7    0.78

This result shows what happens to the G5's long 16-stage pipeline when many branch mispredictions occur. Like the Pentium 4, it suffers a significant penalty on such types of code.

The next benchmark consists of a pair of neural network simulations. Please see the benchmark's site for a description of the kernels. The results below are in seconds required to complete each test, with lower values being better and the average of three trials being reported.


                 G5     X2   Ratio
BPN:          5.16s  5.63s   1.09
SOM:          1.34s  1.42s   1.06

Interestingly enough, the G5 beats the X2 in this integer benchmark. The reason appears to be a combination of several factors. First, while this benchmark is integer-heavy, it has very few branches and consists mainly of a large number of memory accesses to multi-dimensional arrays. Second, since these arrays fit in cache, the G5's slightly lower-latency L2 and 5% clock-speed advantage allow it to edge out the X2.

The next benchmark is typesetting a large TeX file using pdfeTeX. This benchmark is a purely integer benchmark that tests the processor's performance in manipulating tree-like data in cache. The specific TeX program used was pdfeTeX version 3.141592. The results below are the time taken to produce a PDF from the source file, in seconds. Lower values are better, and the average of three trials is reported.


                G5      X2    Ratio
physics.tex: 1.241s  0.950s    0.77

These results are consistent with what we saw regarding the G5's integer performance in the tscp benchmark. It should be noted that the source file was small enough to fit into cache. Had the source data been large enough to spill into memory, the results for the G5 would likely be a bit worse, given the random memory access patterns involved in the benchmark and what we can see of the G5's memory performance in the later benchmarks.

Table of contents
  1. "The Story"
  2. "The Interface"
  3. "The Software"
  4. "The Software, Continued"
  5. "The Snappy"
  6. "The Snappy, Continued"
  7. "The Inevitable"