“The other day I posted an article to highlight our new benchmark application Geekbench. It received a lot of attention, but there was some concern about the machines used in the testing (mainly, the Athlon 64 and Pentium 4c are kind of old). Despite the fact that the article was meant to be more about the testing than the results, some people refused to let it pass. Luckily, a few people responded to my request for results from more machines, so here is a smaller article comparing just three machines: a PowerMac G5 Quad, an Athlon 64 X2 and a Pentium D.”
http://www.geekpatrol.ca/article/103/geekbench-comparison-redux …
Forbidden
You don’t have permission to access /article/103/geekbench-comparison-redux on this server.
It’s still a useless closed-source synthetic benchmark. Why post this rubbish again?
Since I could not access the article when it was originally posted, I found it interesting that there was no mention of how the machines were prepared to run the benchmark. How many times were the tests run on each system? And why is there so much variance in the system memory of the machines? Shouldn’t the test machines have memory configurations as close to equal as possible?
I’m inclined to agree with nimble about the practicality of the benchmark: a whole lot of numbers and graphs that don’t say much about how the systems actually perform.
People need to stop with these non-real-world benchmarks. I don’t care how many INTs per second it can process; can it apply a Photoshop effect faster than another machine? Can it render a video faster? Why is this hard to understand?
Buy loads of RAM, get the fastest ints-per-second CPU your money can buy, with the widest bus (that really matters). Oh, maybe that’s old news…
Anyway, I agree: raw performance numbers are rarely interesting to a normal user. Perceived responsiveness is often more important: if a task looks easy in the eyes of a user, it should execute fast; if it doesn’t, the computer is “slow”.
That kind of research is much more interesting in my eyes.
In my eyes Geekbench is yet another (geek) toy, with the exact same flaws as the other ones, and maybe more because it’s multiplatform. Have fun (:
Better than the original article, but it’s still comparing a quad G5 to a single dual-core x86 chip, and the lowest-end dual-cores at that. So I’m not really sure what I’m meant to take from these results, since the machines are impossible to compare to one another.
That, and the fact that the Pentium D was beating the A64 in all the memory tests, shows that it’s really only testing memory bandwidth. The PD would have been killed if latency had been involved at all.
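For what it’s worth, here is a rough sketch of the difference (my own illustration, nothing from the article): a bandwidth test streams through a big array, while a latency test chases dependent pointers so the prefetcher can’t hide main-memory latency. The NetBurst chips with fast RAM tend to look fine on the first and ugly on the second.

    /* Sketch only: bandwidth-bound vs. latency-bound memory access. */
    #include <stdlib.h>
    #include <stdint.h>

    #define N (16u * 1024 * 1024)   /* 64 MB of uint32_t, well past any cache */

    /* Bandwidth-style: sequential streaming; the hardware prefetcher hides latency. */
    static uint64_t stream_sum(const uint32_t *a)
    {
        uint64_t s = 0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Latency-style: every load depends on the previous one, so each access
       pays close to the full main-memory latency. */
    static uint32_t pointer_chase(const uint32_t *next, size_t hops)
    {
        uint32_t i = 0;
        while (hops--)
            i = next[i];
        return i;
    }

    int main(void)
    {
        uint32_t *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++)
            a[i] = (uint32_t)((i + 1) % N);   /* a real test would use a random cycle */
        volatile uint64_t sink = stream_sum(a) + pointer_chase(a, N);
        (void)sink;
        free(a);
        return 0;
    }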
Once again, these tests are useless. Reasoning:
1. The author is not testing the three systems himself, but rather relying on results sent in by others. There is no consistency in this method.
2. We have *no* idea what kind of motherboards, chipsets, and BIOS options are being used on the P4/A64. This is a biggie, people. The Athlon 64 owner could be running PC2700 3.0-4-4-10 2T RAM for all we know, BECAUSE THIS INFORMATION ISN’T SPECIFIED. Look, I hold a job as a hardware reviewer and writer, so I’m qualified enough to say that memory timing options are freaking *critical* when you want to do a comparison between platforms. What’s the use in saying the Athlon64/P4 are faster at so-and-so, when you haven’t made an effort to standardize any of the other subsystems in your test rig?
3. He’s testing compiler + C library performance more than anything. He even says that the benchmark is not optimized for any particular CPU. Do we know whether his FPU tests use x87, SSE, SSE2, or SSE3 code? This is also a critical piece of information, because it would lend insight into why the Pentium 4 beats the Athlon 64 in some of the FPU tests. The Athlon 64’s x87 FP is much faster than the Pentium 4’s, while the P4 has the better-implemented SSE2/SSE3 units, so which is it? (A quick illustration follows below.)
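To make the x87-vs-SSE question concrete, here is a hypothetical two-line example (my own sketch, not the benchmark’s actual code, since we can’t see it). The same source compiles to completely different FP instructions depending on the flags, and the two CPUs favour opposite variants:

    /* fpu.c (hypothetical file name)
     *
     *   gcc -O2 -S fpu.c                      -> x87 fmul/fadd (the 32-bit GCC
     *                                            default; where the A64 shines)
     *   gcc -O2 -msse2 -mfpmath=sse -S fpu.c  -> mulsd/addsd (scalar SSE2;
     *                                            where the P4 shines)
     *
     * Visual C++ has an analogous switch in /arch:SSE2.
     */
    double axpy(double a, double x, double y)
    {
        return a * x + y;   /* one multiply and one add; the codegen differs per flag */
    }

Unless the authors tell us which variant they ship, the FPU scores are impossible to interpret.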
At our lab here we’ve got a huge variety of different processors, motherboards, and RAM types. If he were willing to send over the source code to his silly program, I could find the time next week to compile it and put together some standardized test rigs in order to give some reasonable scores. As it stands right now, this is like doing a video card review by asking your friends with GF2MXs, 6600 GTs, 9600 XTs, 9800 Pros, and 7800 GTXs to install some games, benchmark at 1024×768, and then give you the scores.
Yes, it’s *that* ambiguous. Chipset driver versions? Video driver versions? Start-up items that could be clobbering CPU/memband? Game settings to use? How many runs to do? It’s all up to the individual testers to decide, making the scores useless to anyone but the person who ran the test.
> …I could find the time next week to compile it and put together some standardized test rigs…
Mail him/them if you have all that free time (:
> 3. He’s testing compiler + C library performance more than anything.
I have no problem with that, particularly since the compilers chosen for each platform are the ones most widely used for that platform’s apps (Visual C++ 200X on Windows & GCC 4 on Mac).
Basically, this is neither a purely synthetic test nor a purely “apps testbed” benchmark.
* Purely synthetic benchmarks compare the theoretical limits of processor performance. Drawback: they can be misleading, since 99.9% of applications are never fully optimized for a single processor or processor model.
* Purely app-testbed benchmarks give more useful data, as long as the apps selected represent what you use every day. Drawback: the opposite of the synthetic benchmark; you can’t assume one processor is slower than another just because a Photoshop bake-off said so. Why? Again, optimization: an app with a long history of being x86-only will, when ported to PPC, lose many optimized codepaths that will take time to be adapted to take full advantage of PPC’s strengths. The same applies when going from traditional PPC apps to x86 ports of the same apps.
1. Less synthetic, more real. Big applications, with real uses and real bottlenecks, need to be tested and their results timed.
2. More specs are needed for the systems.
3. More information on how the tests were created. Apple has used trickery here to cook its own statistics.
Using more computers doesn’t change the real failing of this: it’s not testing useful software.
Not this again. If anything (like Kroc said), many people would rather see real-world performance comparisons. These number games are stupid. Compare common software or tasks on each platform, for example.
Let’s look at the results one by one:
Mandelbrot: Okay, I’ll buy that; the G5 has a good FPU with hardware square root and a 500 MHz clock rate advantage, so it should do fine.
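For reference, here is roughly what a Mandelbrot score presumably times; this is just the standard escape-time loop, since the actual code is closed:

    /* Standard escape-time inner loop (a sketch, not Geekbench's code).
     * Note that this formulation needs no square root at all: you compare
     * the squared magnitude against 4.0. Whether the benchmark does it this
     * way, or calls sqrt() somewhere the 970's hardware fsqrt would help,
     * is exactly the kind of detail we can't check without the source. */
    static int mandel_iters(double cr, double ci, int max_iter)
    {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < max_iter && zr * zr + zi * zi < 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            n++;
        }
        return n;
    }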
Blowfish: The G5 scores 4x higher than the X2 on this test, even in the single-threaded test. The only explanation for this is that even in the “cache” test, the code on the X2 falls out of cache. This is supported by the fact that the G5’s results fall off dramatically in the memory test, and the X2’s results do not. If you’re going to do a cache test, at least make sure the test fits in cache on the CPUs in question…
Emulation: This result makes sense. The X2’s scores put it at the level of a 2.3 GHz 970MP, which sounds about right from my experience.
Memory Performance: These results just don’t make any sense, and I’m not even sure the author understands what he’s talking about. In the description, he says it’s a single-threaded test, and immediately after, he attributes the G5’s dominance to the fact that it has four cores.
Memory + Operation Performance: These make sense, since they are bandwidth-limited and the 970MP has 33% faster RAM.
Overall, I don’t even understand why they are bothering to build this benchmark. The tests are too simplistic, and some of the results are way out there. One of the key ways to validate a new benchmark is to show that it reproduces trends that are already known. This benchmark suggests that the G5 has awesome integer and memory performance, which is known to be untrue. Since the source code to the benchmark is unavailable, it’s impossible to tell why the benchmark is giving incorrect answers, and it’s not possible to recompile and test for yourself either.
> it’s not possible to recompile and test for yourself either.
Well, you could download the executables from the link below, if you’re inclined to run such untrusted code:
http://www.geekpatrol.ca/article/97/geekbench-preview
But I just don’t understand why they don’t publish the source and the compiler settings. What do they have to hide?
Given that the Mac costs at least $3000 or so, couldn’t they at least have used an AMD machine with an Athlon or Opteron dual-core that has 1 MB of L2 cache per core? The lowest-end Athlon with that much cache is the $500 (CPU only) Athlon X2 4400+, I believe.
The Blowfish cache test is run on two 32-bit values. That’s 8 bytes, for those of you keeping track. The whole operation takes up less than 6 KB when you add it all up. I’m no CPU expert, but how do you suppose that 6 KB doesn’t fit into the cache on an X2?
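For anyone who wants to check the arithmetic, the textbook Blowfish state adds up like this (a generic sketch of the usual layout, not necessarily our exact code):

    /* Generic Blowfish key schedule, per the textbook definition. */
    #include <stdio.h>
    #include <stdint.h>

    struct blowfish_ctx {
        uint32_t P[18];        /* P-array of subkeys:  18 * 4      =   72 bytes */
        uint32_t S[4][256];    /* four S-boxes:         4 * 256 * 4 = 4096 bytes */
    };

    int main(void)
    {
        /* 72 + 4096 = 4168 bytes, plus the 8-byte block being worked on:
           comfortably inside an L1 data cache, never mind the X2's
           512 KB or 1 MB of L2 per core. */
        printf("%zu bytes\n", sizeof(struct blowfish_ctx));
        return 0;
    }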
We could have run the tests on an Opteron or a newer, higher end Athlon, but we’re working with the kindness of strangers at this point.
Here’s something to consider for a moment: despite what you may think, this is still about the testing rather than the results right now. The fact that you may not believe that does not make it any less true. The only real reason for the second article is that a whole bunch of people wanted to see results for newer x86 CPUs. Also, we’re not trying for the theoretical values in any of our tests, we’re just running regular code through the machines. I guess you could say that we’re working towards a real world synthetic benchmark.
Anyway, now back to your regularly scheduled commenting.
> The Blowfish cache test is run on two 32-bit values. That’s 8 bytes, for those of you keeping track. The whole operation takes up less than 6 KB when you add it all up. I’m no CPU expert, but how do you suppose that 6 KB doesn’t fit into the cache on an X2?
I presume you’re encrypting more than 8 bytes. How big is the data set that you’re encrypting?
> Here’s something to consider for a moment: despite what you may think, this is still about the testing rather than the results right now.
First, if it wasn’t about the CPUs, you wouldn’t be commenting on their relative performance. People see your benchmark and start making comments like “it looks like it was a mistake for Apple to switch to x86”, and that just doesn’t make sense. Moreover, part of the reason I’m pointing out the flaws in the results is that they reflect on the benchmark itself. If the benchmark tells you 2 + 2 = 5, there is something wrong with the benchmark. Your results for integer performance and memory performance are at odds with nbench, SPEC, and a whole bunch of proven benchmarks. Clearly, something is wrong with the benchmark.