Linked by Michael on Tue 29th Mar 2011 23:53 UTC
Benchmarks "Version 4.6 of GCC was released over the weekend with a multitude of improvements and version 2.9 of the Low-Level Virtual Machine is due out in early April with its share of improvements. How though do these two leading open-source compilers compare? In this article we are providing benchmarks of GCC 4.5.2, GCC 4.6.0, DragonEgg with LLVM 2.9, and Clang with LLVM 2.9 across five distinct AMD/Intel systems to see how the compiler performance compares."
Thread beginning with comment 468265
To read all comments associated with this story, please click here.
Another useless phoronix article
by tylerdurden on Wed 30th Mar 2011 00:43 UTC
tylerdurden
Member since:
2009-03-17

What were the flags, data sets used, etc, etc, etc?

Reply Score: 2

Valhalla Member since:
2006-01-24

What were the flags, data sets used, etc, etc, etc?
While the data sets shouldn't matter in any way I can only agree that the lack of information concerning the compiler flags makes it sort of a black box benchmark.

From what I've gathered these tests are done with the flags that have been set upstream (if it's those in the original source packages or those chosen by some repository package maintainer I have no idea) but those flags may very well be set to very low levels of optimization and/or contain debugging flags which impact performance.

Other packages like x264 enables tons of handwritten assembly for x86,x64 by default which pretty much renders the tests worthless as a comparison between compilers, and afaik Phoronix's tests do not disable this assembly code when doing their benchmarks.

A (imo) proper test would be to compile all packages at atleast -O3 (and possibly -O2) and compare the corresponding results.

As it is now, an upstream package may come with compiler settings either intentionally tailored to a specific compiler or one that by chance suits one compiler better than it does another which may not reflect the performance of each compiler when told to generate their fastest code (usually -O3).

Not comparing the compilers at a specific (or several specific) optimization level (preferably the highest if only one is to be used) means that the test-results may often be a poor reflection of the actual compiler capacity.

I can see that Phoronix may shy away from testing a large set of optimization levels but then they should atleast settle on -O3 which is the level which from the compilers standpoint *should* generate the fastest code. As it is now, the packages they benchmark may have -Os, or -O2 for all we know and since there's really no fair way of measuring performance of 'middle' settings (how can you decide if -O2 on one compiler corresponds to -O2 on another? maybe -O2 is closer to -O3 in compiler A, while -O2 is closer to -O1 on compiler B) the only 'fair' way would be to either compare several optimization levels against eachother or only the highest.

As it is now, I find Phoronix test results interesting but I do take them with a large grain of salt. I will continue to rely on my own tests and tests where all relevant data is presented for easy verification.

I do applaud Phoronix for doing these benchmarks, I just wish they weren't done in (again imo) such a sloppy manner.

Reply Parent Score: 5

tylerdurden Member since:
2009-03-17

Indeed, data sets and their characteristics do not matter, it is not like a computer program's main function is to process data or anything.


For example, different schedulings can have vastly diverging behaviors in performance, esp. given to the cache and load/store queue architectures of modern out-of-order multicore processors. And the characteristics of the data set are fundamental to properly understand the behavior being observed and benchmarked. Similarly, it is fundamental to understand the type of data distribution, to understand for example if the compiler is scheduling efficiently to keep the the multiple functional units busy. Etc, etc.

So yes, understanding the data sets, as well as the instruction mixes is fundamental when benchmarking different compilers and their performance properly.

I don't consider studies done with such huge omissions to be useful at all, a waste of time if anything. Although I understand it is a relatively easy way of filing up 8+ pages of contents with graphs.

Reply Parent Score: 2

AnyoneEB Member since:
2008-10-26

Note that -O2 is used in GCC because it often (usually?) produces faster code than -O3. For some discussion on the topic, see Gentoo's documentation page on optimization flags: http://www.gentoo.org/doc/en/gcc-optimization.xml . Basically, it sounds like the GCC optimization levels are the separated by the amount of work the compiler has to do to optimize the code, and the extra work done by the -O3 optimizations tends to increase code size (and therefore hurt caching) so it often slows down programs.

That said, testing compilers at multiple optimization levels would likely be more informative about how good their optimizations actually are.

Edited 2011-03-30 20:34 UTC

Reply Parent Score: 1