posted by Rayiner Hashem on Tue 15th Nov 2005 17:44 UTC

"The Snappy, Continued"

The next benchmark is SciMark2. Two variants of the benchmark were used, one with a small data set and another with a large data set. Note that I do not report the composite score, because it makes no more sense to me to sum the results of completely different benchmark kernels than it does to sum vectors and kumquats. Please see the SciMark FAQ to see what each test kernel does, and then compare each result individually. The results below are in MegaFLOPs, with higher values being better and the average of three trials being reported.


                G5      X2    Ratio
Small FFT:   583.4   585.3     1.00
Large FFT:    32.0    56.0     0.57
Small SOR:   434.0   515.2     0.84
Large SOR:   394.1   505.9     0.78
Small  MC:    86.3   261.9     0.33
Large  MC:    86.6   260.6     0.33
Small SMM:   744.7   722.2     1.03
Large SMM:   312.2   351.9     0.89
Small  LU:  1202.9   837.6     1.44
Large  LU:   411.5   424.6     0.97

Well, these results are all over the map. The most noticeable thing is that the X2's performance degrades much more gracefully when we go from the small, in-cache data set to the large, in-memory data set. Clearly, the X2's low-latency memory subsystem gives it a leg up in these benchmarks. Even with 33% more memory bandwidth and 5% more clock speed, the G5 loses most of the tests. In most of these cases, the G5 is 80% to 90% as fast as the X2, and it is 44% faster in one case, but in three of the benchmarks it is only about one-half to one-third as fast. Poor code generation by GCC on a couple of these kernels is a likely explanation.
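One way to put a number on "degrades more gracefully" is to compute, for each kernel, the fraction of its small-data-set throughput that each machine keeps on the large data set. The sketch below does this with the figures from the table above. The names SCIMARK, retention and retention_table are only illustrative and are not part of SciMark2.

```python
# MegaFLOPs copied from the SciMark2 table above.
# Layout: kernel -> machine -> (small data set, large data set).
SCIMARK = {
    "FFT": {"G5": (583.4, 32.0), "X2": (585.3, 56.0)},
    "SOR": {"G5": (434.0, 394.1), "X2": (515.2, 505.9)},
    "MC": {"G5": (86.3, 86.6), "X2": (261.9, 260.6)},
    "SMM": {"G5": (744.7, 312.2), "X2": (722.2, 351.9)},
    "LU": {"G5": (1202.9, 411.5), "X2": (837.6, 424.6)},
}


def retention(small, large):
    """Fraction of small-data-set throughput kept on the large data set."""
    return large / small


def retention_table(data):
    """Apply retention() to every kernel and machine."""
    return {
        kernel: {machine: retention(*scores) for machine, scores in row.items()}
        for kernel, row in data.items()
    }


if __name__ == "__main__":
    for kernel, row in retention_table(SCIMARK).items():
        print(f"{kernel:>3}: G5 keeps {row['G5']:.0%}, X2 keeps {row['X2']:.0%}")
```

Monte Carlo is essentially unchanged on both machines. On the other four kernels the X2 keeps a larger share of its small-data-set score than the G5 does.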

The next benchmark is a C version of the classic LINPACK benchmark. LINPACK is an extremely simplistic benchmark that measures the raw FP throughput of the processor by doing large matrix multiplications. It represents the best case for FPU performance on a particular processor. The variant below is a double-precision version of LINPACK. The results below are in MegaFLOPs, with higher values being better and the average of three trials being reported.


                G5      X2    Ratio
LINPACK:      1391     876     1.59

The results of this benchmark are great for the G5. It is almost 60% faster than the X2 in this benchmark. This result is to be expected, since the G5 has a floating-point multiply-accumulate (FMAC) instruction, which is an important part of the matrix multiplication algorithm. On the X2, the compiler must generate two separate instructions to have the same effect as one FMAC.
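To make the instruction-count argument concrete, here is a rough sketch that models a dot product, the inner loop of this kind of code, in two ways: with a separate multiply and add per element, and with one combined multiply-accumulate per element. It only counts modeled operations. Python gives no control over the instructions a CPU actually issues, and a real FMAC also rounds once rather than twice, which this model ignores. The function names are made up for illustration.

```python
def dot_with_separate_ops(xs, ys):
    """Dot product modeled as one multiply plus one add per element."""
    acc = 0.0
    ops = 0
    for x, y in zip(xs, ys):
        product = x * y      # modeled as one multiply instruction
        ops += 1
        acc = acc + product  # modeled as one add instruction
        ops += 1
    return acc, ops


def dot_with_fused_ops(xs, ys):
    """Dot product modeled as one multiply-accumulate per element."""
    acc = 0.0
    ops = 0
    for x, y in zip(xs, ys):
        acc = acc + x * y    # modeled as a single FMAC instruction
        ops += 1
    return acc, ops


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [4.0, 5.0, 6.0]
    print(dot_with_separate_ops(xs, ys))
    print(dot_with_fused_ops(xs, ys))
```

Both versions compute the same sum, but the modeled FMAC version issues half as many arithmetic operations for it.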

The next benchmark is a Fourier transform, specifically the one distributed with the FFTW 3.0.1 source code. Four variants were used here: a small and a large double-precision FFT, and a small and a large single-precision FFT. All are in-place, complex, forward transforms. In this benchmark the small data-set result is more important. I don't know of any particular uses for 1 million point FFTs, though I'm sure there are some. In comparison, 4096-point FFTs are quite common in signal processing, and JPEG2000 uses a related algorithm on similarly sized blocks for image compression. Since AltiVec cannot do double-precision math, both AltiVec and SSE2 were disabled for the double-precision tests. For the single-precision tests, AltiVec is enabled, but for some reason FFTW generates improper assembly for SSE2 (on GCC 4.0), so in fairness no numbers are reported for the X2. However, the G5 results are reported anyway, because they show that the G5 makes one heck of a digital signal processor! Moreover, these results are likely quite representative of highly tuned AltiVec-friendly algorithms. As before, the results below are in MegaFLOPs, with higher values being better and the average of three trials being reported.


                G5      X2    Ratio
Double (4K):  2594    2044     1.27
Double (1M):   777     662     1.17
Float (4K):   3709       *        *
Float (1M):    915       *        *

These results support my belief that GCC generated poor code on the G5 for the FFT in SciMark. The G5 wins by quite a margin in both double-precision cases, even though we can again see the X2 closing the gap when memory performance comes into play.

The next benchmark is raytracing with POVray, using the built-in POVray benchmark. This benchmark tests the performance of the CPU on an essentially floating-point algorithm with a significant integer component. Random access memory performance also comes into play for accessing many of the bookkeeping data structures necessary to perform the actual raytracing step. The version of POVray used was 3.6.1. The results below report the time required to render the 384x384 test image. Lower values are better, and the average of three trials is reported.


                G5      X2    Ratio
POVray:      32m32s  25m31s    0.78

We see here that despite its very impressive theoretical floating-point performance, the G5's integer performance hurts it in this real-world application. While some would argue that POVray isn't optimized for the G5, it should be noted that POVray isn't exactly optimized for the X2 either. That said, I think it is safe to say that the G5 has more untapped potential in situations like these than the X2 does.

The next benchmark is a Blender render using the pseudo-standard Blender benchmark file test.blend. A word of warning about this benchmark: I could not get Blender to compile properly on OS X with GCC 4.0, so I used the standard binary for OS X. I do not consider these results fully representative of what the G5 can do, but have included them anyway, because in the real world it is rarely the case that all of one's applications will be using the vendor's absolute-latest compiler. This test stresses a number of different aspects of the system, since rendering involves both floating-point and integer operations on large in-memory data sets. This test should be a good indicator of how the G5 can be expected to run media applications not specifically optimized for a particular processor. Note that this test is dual-threaded, so it uses both cores on each CPU. The results below are in minutes and seconds, with smaller values being better and the average of three trials being reported.


                G5      X2    Ratio
test.blend:   1m34s   1m15s    0.80

These results are about as expected, and very similar to the POVray result above.
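For the two timed tests, the Ratio column appears to be the X2's time divided by the G5's time, so that a value below 1.00 still means the G5 is slower, just as in the throughput tables. That reading is inferred from the numbers rather than stated anywhere. A small sketch that reproduces both ratios from the times as printed (the helper names to_seconds and time_ratio are hypothetical):

```python
import re


def to_seconds(text):
    """Convert a time such as '32m32s' into a whole number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def time_ratio(g5_time, x2_time):
    """Relative G5 speed for a timed test.

    Lower times are better, so the X2 time goes on top. This keeps the
    convention that a ratio below 1.00 means the G5 is the slower machine.
    """
    return to_seconds(x2_time) / to_seconds(g5_time)


if __name__ == "__main__":
    print(f"POVray:     {time_ratio('32m32s', '25m31s'):.2f}")
    print(f"test.blend: {time_ratio('1m34s', '1m15s'):.2f}")
```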

Moving on to the mixed-code benchmarks, we have FreeBench, a free cross-platform benchmark suite. The version used was 1.0.3, compiled from the UNIX tarball for both machines. Some scripts were edited on OS X to make the suite build properly, but no code was changed. A description of the test kernels can be found here. The results below are speedup relative to a 333MHz Sun Ultra 10, with higher values being better and the average of three trials being reported.


                 G5     X2   Ratio
Analyzer:      7.71   3.28    2.35 
4 in a Row:   12.01  16.97    0.71
Mason:         8.76  11.42    0.76
pCompress2:    6.59   9.68    0.68
PiFFT:         5.72   6.71    0.85
DistRay:       7.59   8.88    0.85
Neural:        3.56   7.89    0.45

Aside from the two outlying values, these results are fairly consistent with what we've seen so far. The G5 performs around 75% as well as the X2 on integer code, and 85% as well on general floating-point code.

The last benchmark is nbench, which is a total processor benchmark based on the original ByteMark. The version of nbench used was 2.2.2. Again, I think combining the results into a composite score is meaningless, so I do not report the final average. Please read the description of each kernel here and evaluate each result individually with consideration to what performance parameters are tested by each kernel. The results below are speedup relative to a K6-233, higher values are better, and only one test was run since nbench has its own trial-repetition and averaging logic.


                 G5     X2   Ratio
Numeric Sort:  7.19   9.43    0.76 
String Sort:  25.28  13.92    1.82
Bitfield:      7.03  11.97    0.59
FP Emulation: 17.42  16.88    1.03
Fourier:      17.14  12.79    1.34
Assignment:   26.18  35.59    0.74
IDEA:         19.18  24.08    0.79
Huffman:      12.48  14.71    0.85
Neural Net:    2.03  28.10    0.07
LU Decomp:    34.59  45.43    0.76

All in all, this benchmark is a decent showing for the G5. It loses two benchmarks by a large margin, wins two by a large margin, ties one benchmark, and is between 75% and 85% as fast as the X2 on the rest. The neural net result is incredibly bad, but that is partly the result of compiler optimizations. When using "-Os", the G5 achieves a score of around 14 on this trial, though its scores in all the other kernels suffer significantly.
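The tally in the paragraph above can be reproduced by sorting the Ratio column into buckets. The cut-offs used here (a tie within 5% of 1.00, a big win at 1.25 or more, a big loss at 0.60 or less) are my own assumptions, chosen only so that the buckets match the description. The text does not define them.

```python
from collections import Counter

# Ratio column (G5 relative to X2) from the nbench table above.
NBENCH_RATIOS = {
    "Numeric Sort": 0.76,
    "String Sort": 1.82,
    "Bitfield": 0.59,
    "FP Emulation": 1.03,
    "Fourier": 1.34,
    "Assignment": 0.74,
    "IDEA": 0.79,
    "Huffman": 0.85,
    "Neural Net": 0.07,
    "LU Decomp": 0.76,
}


def classify(ratio, big_win=1.25, big_loss=0.60, tie_band=0.05):
    """Bucket one ratio; all three thresholds are illustrative assumptions."""
    if abs(ratio - 1.0) <= tie_band:
        return "tie"
    if ratio >= big_win:
        return "big win"
    if ratio <= big_loss:
        return "big loss"
    return "in between"


def tally(ratios):
    """Count how many kernels fall into each bucket."""
    return Counter(classify(r) for r in ratios.values())


if __name__ == "__main__":
    for bucket, count in tally(NBENCH_RATIOS).items():
        print(f"{bucket}: {count}")
```

With these cut-offs the remaining five kernels land between 0.74 and 0.85, which is roughly the 75% to 85% band quoted above.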

The above benchmarks present a lot of numbers to go through. However, a few trends clearly present themselves. For general integer code, the G5 shows itself to be about 75% as fast as the X2. For general floating-point code, it seems to be 85% as fast. For code that really works to its strengths, it can be 50% or more faster. Of course, on code that really hits its weaknesses, or hits bad compiler optimizations, as may be the case, it can be 50% or more slower. One observation we can make is that the G5 seems to be quite picky about code generation, given the many cases above where its performance drops to a fraction of that of the X2. We can also see that its memory controller is relatively poor compared to the X2's, since its performance drop in the large data tests is always larger than the X2's performance drop. The G5 seems to have an excellent FPU, which shines in some particularly suitable benchmarks but is more often than not hampered by its mediocre integer performance and memory subsystem.
