Linked by Christopher W. Cowell-Shah on Thu 8th Jan 2004 19:33 UTC
General Development This article discusses a small-scale benchmark test run on nine modern computer languages or variants: Java 1.3.1, Java 1.4.2, C compiled with gcc 3.3.1, Python 2.3.2, Python compiled with Psyco 1.1.1, and the four languages supported by Microsoft's Visual Studio .NET 2003 development environment: Visual Basic, Visual C#, Visual C++, and Visual J#. The benchmark tests arithmetic and trigonometric functions using a variety of data types, and also tests simple file I/O. All tests took place on a Pentium 4-based computer running Windows XP. Update: Delphi version of the benchmark here.
Author's Reply, part 2
by Chris Cowell-Shah on Fri 9th Jan 2004 09:01 UTC

...continued from previous post

These results are not indicative of anything. Real programs do more than just math and I/O. What about string manipulation? What about object creation? etc.
The short answer: of course you're right. But most programs do some math and some I/O, so these results will be at least somewhat relevant to virtually any program. Besides, I made liberal use of the phrase "limited scope" and even titled the article "Math and File I/O" so no one could claim false advertising!

The longer answer is more interesting, but probably also more controversial. I think it's fair to say that there are two camps when it comes to benchmarking: the "big, full-scale application benchmark" camp and the "tiny building block benchmark" camp. The arguments used by members of each camp go like this. Big is more accurate in that it exercises more of the language and tests complex interactions between its various parts; that's why only large applications like the J2EE Pet Store (later copied by Microsoft to demonstrate .NET performance) are helpful. But wait, says the other camp: small is more accurate because it tests common components that all programs share. Big is useless because it measures performance for your program, not mine; my program may use very different parts of the language than yours, and hence show very different results. Performance results gleaned from a database-heavy application like Amazon's online catalogue can tell us nothing about which language to use when coding a CPU-intensive Seti@Home client. No, no, the big camp retorts: small is useless because it doesn't really do much, and what it does do reduces to near-identical calls to the OS or basic CPU operations. Small doesn't let differences between languages show through, because the aspects that are unique to each language are never tested.

My own take on the issue is this: all of these points are true, and they suggest that the only worthwhile benchmarking is lots of different benchmarks, written on different scales, testing different things. Throw together enough different sorts of benchmarks and you'll end up with something useful.
The benchmark I presented here falls within the "small benchmark" camp simply because small benchmarks are a whole lot quicker and easier to write than big ones. But I've presented just one benchmark (or two, if you split up math and I/O). These results are not useless by any means, but they become far more useful when combined with other benchmarks of different scopes, testing different aspects of languages (string manipulation, object creation, collections, graphics, and a gazillion others). And while my project can certainly be criticized for being "too small," keep in mind that different languages do produce different results under this benchmark, so it is showing some sort of difference between them. In other words, I don't think it's too small to be at least a little helpful.

The compile time required for JIT compilers (like a JVM) approaches zero when it's amortized over the time that a typical server app (for example) runs. Shouldn't you exclude it from your test?
Good point; I hadn't thought of that. Next time I will probably exclude it by calling each function once before starting the timer.
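A minimal sketch of that fix in Python (the function and workload names here are mine, not from the benchmark): calling the function once before starting the timer lets a JIT such as Psyco, or a JVM's HotSpot compiler, absorb its one-time compile cost, so the timer sees only steady-state execution.

```python
import time

def benchmark(func, *args):
    """Time a function, excluding one-time JIT/compile overhead."""
    func(*args)                  # warm-up call: absorbs compile cost
    start = time.perf_counter()
    result = func(*args)         # timed call: steady-state only
    elapsed = time.perf_counter() - start
    return result, elapsed

# illustrative workload, similar in spirit to the benchmark's loops
def int_arithmetic(n):
    total = 0
    for i in range(n):
        total += i
    return total

value, seconds = benchmark(int_arithmetic, 100_000)
```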

Java should perform about the same as C++, and an unmanaged C program should perform better than a managed .NET program. Why run benchmarks when we all know how they'll turn out?
Because theory isn't always borne out in reality.

The sorting criterion (using the total instead of a geometric mean) is unusual, and favors languages that optimize slow operations.
I did not know about the geometric mean technique, but am very interested in hearing more about it. I had no idea how best to weight the various components of the benchmark, so figured the easiest thing to do was to weight them equally and just add them all up. Some may complain that since the trig component is relatively small, it should be given less weight in the final tally. But I would respond that it's not small for all languages: the trig component for Java 1.4.2 takes longer than all of that language's other components combined. The real answer to the problem of sorting and analyzing the results is simple, though: if people want to massage the raw data differently (maybe you never use trig in your programs, so want to exclude the trig results entirely), go for it! And be sure to tell us what you come up with.
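For anyone curious about the technique, here is a short Python sketch (the component times are invented for illustration): because the geometric mean averages ratios rather than summing seconds, a language can't climb the rankings just by shaving time off its single slowest component.

```python
import math

def geometric_mean(times):
    """Geometric mean of per-component benchmark times (seconds)."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

# hypothetical per-component times for two languages
lang_a = [10.0, 1.0, 1.0]   # very slow trig, fast everything else
lang_b = [4.0, 4.0, 4.0]    # uniformly middling

# totals are identical (12.0 each), so a sum can't tell them apart...
assert sum(lang_a) == sum(lang_b)

# ...but the geometric mean rewards lang_a's broadly fast components
print(geometric_mean(lang_a))   # about 2.15
print(geometric_mean(lang_b))   # exactly 4.0
```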

You should use more than 3 runs, and you should provide the mean and median of all scores.
I actually did more like 15 to 20 runs of each benchmark, with at least 3 under tightly controlled conditions. I was a little surprised to find that there were virtually no differences in results regardless of how many other processes were running or how many resources were free. I guess all the other processes were running as very low priority threads and didn't interfere much. I deliberately included only the best scores rather than the median because I didn't want results skewed by a virus scanner firing off in the background, or some Windows file system cache getting dumped to disk just as the I/O component started. I figured the best case scenario was most useful and most fair.
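The best-of-N approach can be sketched in a few lines of Python (helper name is mine); taking the minimum of repeated runs is the same strategy Python's timeit module recommends, since one-off background noise can only make a run slower, never faster.

```python
import time

def best_of(runs, func, *args):
    """Run func several times and keep the fastest wall-clock time.

    The minimum filters out transient interference (a virus scanner
    firing, a cache flush) instead of letting it skew the score.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best

fastest = best_of(5, sum, range(100_000))
```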

Why didn't you use a high-speed math package for Python, such as numpy or numeric?
I didn't know about numpy or numeric. I probably should have used a high-speed math package, assuming it would be something that a new Python programmer could find out about easily and learn quickly.
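To illustrate the suggestion (using modern NumPy, assuming it is installed; the loop shown is an illustrative stand-in for the benchmark's arithmetic component, not its actual code), the idea is to replace a per-element Python loop with a single array operation that runs in compiled C:

```python
import numpy as np

# The integer-addition loop written in pure Python...
def int_add_python(n):
    total = 0
    for i in range(n):
        total += i
    return total

# ...and the same work pushed down into NumPy's compiled loops,
# which is where the large speedups come from.
def int_add_numpy(n):
    return int(np.arange(n, dtype=np.int64).sum())

assert int_add_python(10_000) == int_add_numpy(10_000)
```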

Shouldn't stuff like this be peer reviewed before being posted?
This ain't Nature or Communications of the ACM--I figure the 100+ comments I received do constitute a peer review! ;) Nevertheless, I like your idea of a two-part submission, with methodological critique after part 1 and results presented in part 2. I'll remember that for next time.

Your compile-time optimization is inconsistent. E.g., why omit frame pointers with Visual C++ but not gcc C?
Because Visual Studio had an easy checkbox to turn on that optimization, whereas the 3 minutes I spent scanning the gcc man page revealed -O3 but not -fomit-frame-pointer. Similarly, I compiled Java with -g:none to strip debugging code, but didn't mess with memory settings for the JVM. Someone who programs professionally in C/C++ (or knows more about Java than I do) could have hand-tuned the optimizations more successfully, I'm sure.

Your C++ program is really just C! What gives?
I don't know C++. I taught myself just enough C (from an O'Reilly book) to code the benchmark. So yes, the C++ benchmark is running pure C code. From my rudimentary knowledge of C vs. C++, I assumed that no C++ extensions would produce significantly different performance for low-level operations like these, so I stuck to straight C. I called it a "Visual C++" benchmark because it was compiled by Microsoft's C++ compiler. And if C++ really is a superset of C (please correct me if that's not the case; I could be very wrong), then a C program is also a C++ program.

Your trig results are meaningless because you don't check the accuracy of the results. You could be trading accuracy for speed.
Mea culpa--I did sample the trig test results to compare accuracy across languages; they're all equally accurate (at least, to 8 decimal places or so). I forgot to explain that in the article.
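That kind of spot check is easy to automate; here is a hypothetical Python version (the function name and the sample values are mine, standing in for numbers copied from another language's output), which treats agreement to about 8 decimal places as "equally accurate":

```python
import math

def check_sin_accuracy(samples, places=8):
    """Compare reported sin(x) values against Python's math.sin.

    `samples` pairs each input x with the value another language's
    benchmark reported for sin(x); agreement within 10**-places
    counts as equally accurate.
    """
    tolerance = 10.0 ** -places
    return all(abs(math.sin(x) - reported) <= tolerance
               for x, reported in samples)

# hypothetical values as another language might have printed them
ok = check_sin_accuracy([(1.0, 0.84147098), (2.0, 0.90929743)])
```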

Again, thanks for all of the comments. I've learned a lot from your suggestions, and future benchmarks I may run will certainly benefit from the collective experience of all of the posters.

-- Chris Cowell-Shah