Linked by Michael on Tue 29th Mar 2011 23:53 UTC
Benchmarks "Version 4.6 of GCC was released over the weekend with a multitude of improvements and version 2.9 of the Low-Level Virtual Machine is due out in early April with its share of improvements. How though do these two leading open-source compilers compare? In this article we are providing benchmarks of GCC 4.5.2, GCC 4.6.0, DragonEgg with LLVM 2.9, and Clang with LLVM 2.9 across five distinct AMD/Intel systems to see how the compiler performance compares."
Thread beginning with comment 468283
Valhalla
Member since:
2006-01-24

What were the flags, data sets used, etc, etc, etc?
While the data sets shouldn't matter in any way, I can only agree that the lack of information about the compiler flags makes this something of a black-box benchmark.

From what I've gathered, these tests are done with the flags that were set upstream (whether those are the ones in the original source packages or ones chosen by some repository package maintainer, I have no idea), but those flags may very well be set to very low levels of optimization and/or contain debugging flags which impact performance.
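As a rough sketch of how a benchmarked program could at least report what it was built with: GCC and Clang predefine `__OPTIMIZE__`, `__OPTIMIZE_SIZE__` and `__VERSION__`, so a build can label itself. The function names below are made up for illustration, and these macros cannot tell -O1, -O2 and -O3 apart, so this narrows the black box rather than opening it.

```c
#include <stdio.h>

/* Hypothetical helper: returns a label for the optimization mode this
 * translation unit was compiled with, based on macros predefined by
 * GCC and Clang. It cannot distinguish -O1, -O2 and -O3. */
const char *build_opt_label(void)
{
#if defined(__OPTIMIZE_SIZE__)
    return "optimized for size (e.g. -Os)";
#elif defined(__OPTIMIZE__)
    return "optimized (-O1 or higher)";
#else
    return "not optimized (e.g. -O0)";
#endif
}

/* Hypothetical helper: returns the compiler's version string if the
 * compiler provides one. */
const char *build_compiler_label(void)
{
#if defined(__VERSION__)
    return __VERSION__;
#else
    return "unknown compiler";
#endif
}

/* Prints one line that a benchmark log could include next to its results. */
void print_build_info(FILE *out)
{
    fprintf(out, "compiler: %s, build: %s\n",
            build_compiler_label(), build_opt_label());
}
```

Calling `print_build_info(stdout)` at the start of a benchmark run would put the compiler and a coarse optimization label next to the numbers. The exact flags would still have to come from the build system.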

Other packages, like x264, enable tons of handwritten assembly for x86/x64 by default, which pretty much renders the tests worthless as a comparison between compilers, and afaik Phoronix's tests do not disable this assembly code when doing their benchmarks.

A (imo) proper test would be to compile all packages at -O3 at least (and possibly also at -O2) and compare the corresponding results.

As it is now, an upstream package may come with compiler settings that are either intentionally tailored to a specific compiler or that by chance suit one compiler better than another, which may not reflect the performance of each compiler when told to generate its fastest code (usually -O3).

Not comparing the compilers at a specific (or several specific) optimization level (preferably the highest if only one is to be used) means that the test-results may often be a poor reflection of the actual compiler capacity.

I can see that Phoronix may shy away from testing a large set of optimization levels, but then they should at least settle on -O3, which is the level that from the compiler's standpoint *should* generate the fastest code. As it is now, the packages they benchmark may have -Os or -O2 for all we know, and since there's really no fair way of measuring performance of 'middle' settings (how can you decide if -O2 on one compiler corresponds to -O2 on another? maybe -O2 is closer to -O3 in compiler A, while -O2 is closer to -O1 in compiler B), the only 'fair' way would be to either compare several optimization levels against each other or only the highest.

As it is now, I find Phoronix's test results interesting, but I do take them with a large grain of salt. I will continue to rely on my own tests and on tests where all relevant data is presented for easy verification.

I do applaud Phoronix for doing these benchmarks, I just wish they weren't done in (again imo) such a sloppy manner.

Reply Parent Score: 5

tylerdurden Member since:
2009-03-17

Indeed, data sets and their characteristics do not matter, it is not like a computer program's main function is to process data or anything.


For example, different schedulings can have vastly diverging behaviors in performance, esp. given the cache and load/store queue architectures of modern out-of-order multicore processors. And the characteristics of the data set are fundamental to properly understand the behavior being observed and benchmarked. Similarly, it is fundamental to understand the type of data distribution, to understand for example if the compiler is scheduling efficiently to keep the multiple functional units busy. Etc, etc.

So yes, understanding the data sets, as well as the instruction mixes, is fundamental when benchmarking different compilers and their performance properly.
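A minimal sketch of the kind of data dependence meant here, with made-up function names: the loop below contains a branch whose predictability depends entirely on the order of the input. The result is the same for sorted and unsorted data, but the run time may differ, and whether it differs can depend on the compiler and flags, since an optimizer may replace the branch with a conditional move or vectorize the loop.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sums the elements that are >= threshold. The branch inside the loop is
 * data dependent: how well the CPU predicts it depends on input order. */
long sum_at_least(const int *data, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Three-way comparison for qsort that avoids overflow from subtraction. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts the array in place; afterwards the branch in sum_at_least is
 * false for a prefix of the array and true for the rest. */
void sort_ints(int *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_int);
}
```

Timing `sum_at_least` on a large random array before and after `sort_ints` is one way to see whether a given compiler and flag combination leaves the branch in place. No timings are claimed here, since they depend on the machine and the build.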

I don't consider studies done with such huge omissions to be useful at all; a waste of time, if anything. Although I understand it is a relatively easy way of filling up 8+ pages of content with graphs.

Reply Parent Score: 2

vodoomoth Member since:
2010-03-30

What on earth are you talking about? They are testing compilers: that means feeding them some source code and inspecting the resulting machine code for a specific aspect, and here they've chosen to inspect either the speed of the compiled code or the time taken by each compiler to compile.

What do you expect as "data sets"? The source code of each program that was compiled? Or, for instance, the data used as input to the compiled program? Like the video used for the x264 encoding? Or the files used for the 7-zip compression?

Reply Parent Score: 2

AnyoneEB Member since:
2008-10-26

Note that -O2 is used in GCC because it often (usually?) produces faster code than -O3. For some discussion on the topic, see Gentoo's documentation page on optimization flags: http://www.gentoo.org/doc/en/gcc-optimization.xml . Basically, it sounds like the GCC optimization levels are separated by the amount of work the compiler has to do to optimize the code, and the extra work done by the -O3 optimizations tends to increase code size (and therefore hurt caching), so it often slows down programs.

That said, testing compilers at multiple optimization levels would likely be more informative about how good their optimizations actually are.

Edited 2011-03-30 20:34 UTC

Reply Parent Score: 1

Valhalla Member since:
2006-01-24

Note that -O2 is used in GCC because it often (usually?) produces faster code than -O3.

Well, the fact that -O2 does beat -O3 sometimes is why I wrote *should*, but in my experience -O3 usually beats -O2 on both GCC and LLVM. Which is as it should be, since -O0 is no optimization, -O1 is slight optimization, -Os favours code size over speed, -O2 tries to strike a balance between code size and speed, and -O3 will opt for maximum speed at the cost of code size.

The reason -O2 sometimes beats -O3 is most likely flawed heuristics in some of the more advanced optimizations enabled by -O3, resulting in cache misses, failed branch prediction, etc. Cache optimization is sensitive to cpu platform settings, so using '-march=native' would be a good choice for code to perform as well as possible on your machine.

It's interesting though that while I've found -O2 to beat -O3 on certain tests using GCC and LLVM, when I've tried Open64, -O3 has always performed much better than -O2, so in a -O2 test between GCC, LLVM and Open64, Open64 would likely be at a disadvantage, hence why I think it's apt to go for the option that is *meant* to generate the fastest code (-O3), OR benchmark compilers across several optimization levels.

Also note that what once was faster with -O2 may not be faster with the next iteration of that compiler, given that heuristics improve (sadly they also sometimes regress). This is a very difficult part of compiler technology, which is why optimizations such as PGO (profile guided optimization) are so effective. It is also why programs like the Linux kernel make use of C extensions like __builtin_expect and __builtin_prefetch to guide the compiler when optimizing for branch prediction and cache prefetching.
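To make those two extensions concrete, here is a small sketch for GCC and Clang. The `likely`/`unlikely` macros follow the style the Linux kernel uses. The function name and the prefetch distance are made up for illustration, and on a simple linear scan like this the hardware prefetcher may already do the job, so treat it as a demonstration of the syntax rather than a tuning recipe.

```c
#include <stddef.h>

/* Branch hint macros in the style of the Linux kernel; they rely on the
 * GCC/Clang builtin __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical prefetch distance, in elements; a real value would have
 * to be found by measurement on the target machine. */
#define PREFETCH_DISTANCE 16

/* Sums the non-negative elements, telling the compiler that negative
 * values are expected to be rare and asking for data further ahead in
 * the array to be pulled into cache. */
long sum_non_negative(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Second argument 0 = prefetch for reading; third argument 1 =
         * low temporal locality. Only prefetch addresses inside the array. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 1);

        if (unlikely(data[i] < 0))
            continue;   /* rare path: skip negative values */
        sum += data[i];
    }
    return sum;
}
```

The hints change only how the compiler lays out and schedules the code, not what the function returns. If the hint is wrong for the actual data, it can make things slower, which is the same heuristics problem in manual form.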

Reply Parent Score: 2