An ongoing series of articles that track the quality of programming tools for Linux, including Opteron and Pentium 4 tests for the GNU Compiler Collection (GCC) and Intel C++.
I strongly disagree with the test results, especially for POV-Ray. My experience (backed by published results from magazines like c’t and iX from Germany) is that POV-Ray gains enormous speed when using Intel’s interprocedural optimization (IPO), since only in this mode are a significant number of loops vectorized for SSE/SSE2, and the execution time is between 30% and 50% better than with code generated by GCC. However, POV-Ray is one of the rare programs that gains this much speed when the Intel C++ Compiler is used properly.
The compile times in the tables suggest that the author did not even try to use the -ipo flag (or perhaps does not even know about it), while still claiming “full optimization”. The typical compile time of POV-Ray with IPO should be more than three times that without the -ipo flag.
Therefore, this test is pretty much flawed, as the author does not seem to have significant experience with the Intel C++ Compiler. I would recommend that the author read the documentation, and perhaps study how the SPEC test suite is compiled, before doing comparative tests or claiming “full optimization” (which typically means -O3 with correct vectorized code generation, plus IPO and possibly profile-guided dispatch).
P.S.: I found the passage where he states that POV-Ray does not link with IPO. This is not true.
It is necessary to understand how IPO works and how linking is done. For full POV-Ray optimization, you have to rewrite the Makefile and compile everything from source code (the executable depending on *.c rather than *.o) with -ipo.
It is very unfortunate that you cannot simply take the typical Makefile (or, even worse, automake), use it with icc, and hope for a performance gain. But that is life.
All the test proves is that 64-bit didn’t give anything.
I wonder what the results would be on an “uncompromised” 64-bit platform, e.g. Intel Itanium 2 or Compaq Alpha.
Why don’t I use the Intel compiler on my Opteron system? The simple answer is that Intel’s compiler does not produce code that takes advantage of the Opteron’s features.
This is simply not true. Intel C++ Compiler 8.1 for Intel EM64T produces x86_64 code. The author does not have access to the EM64T version of the compiler only because he does not have a commercial license for the Intel C++ Compiler (note: like the IA-64 version, it is a completely separate product from the IA-32 one).
If you do not believe me, here is a quote from Intel’s mailing list announcement of July 24th:
* The Intel(R) C++ Compiler for Linux*, Extended Memory 64 Technology (EM64T), version 8.1 compiler is now available. The compiler runs on Intel® EM64T Linux systems and generates executables for Intel® EM64T Linux systems. Included in this package, the Intel® Debugger 7.4 for Linux* provides native debugging of applications on Intel® EM64T platforms using a command-line or GUI interface.
Installation of this compiler package requires that you have a current, valid commercial license for the Intel C++ Compiler for Linux* Systems.
Now I managed to compile the newest POV-Ray 3.6.1 with Intel C++ Compiler and IPO activated:
Intel C++ Compiler 8.0, Build 20040412Z
Total Scene Processing Times
Parse Time: 0 hours 0 minutes 2 seconds (2 seconds)
Photon Time: 0 hours 0 minutes 53 seconds (53 seconds)
Render Time: 0 hours 1 minutes 8 seconds (68 seconds)
Total Time: 0 hours 2 minutes 3 seconds (123 seconds)
real 2m3.066s
user 1m47.860s
sys 0m0.040s
Official POV-Ray binary (GCC 3.4.1 according to binary’s version string)
Total Scene Processing Times
Parse Time: 0 hours 0 minutes 2 seconds (2 seconds)
Photon Time: 0 hours 1 minutes 21 seconds (81 seconds)
Render Time: 0 hours 1 minutes 28 seconds (88 seconds)
Total Time: 0 hours 2 minutes 51 seconds (171 seconds)
real 2m51.169s
user 2m46.800s
sys 0m0.020s
That is a speedup of 28.1%, not as “marginal” as the article claims. The test was performed on an Intel Pentium M 1.5 GHz running Red Hat Enterprise Linux 3 WS Update 3.
> All the test proves is that 64-bit didn’t give anything.
Wow, smart guy, read the whole article next time.
“This article is not a comparison of the Pentium 4 and Opteron processors; my test systems are far too different for any such comparison to have meaning.”
Just to make the comparison complete, the result using IPO and PGO (profile-guided optimization):
Total Scene Processing Times
Parse Time: 0 hours 0 minutes 2 seconds (2 seconds)
Photon Time: 0 hours 0 minutes 51 seconds (51 seconds)
Render Time: 0 hours 0 minutes 57 seconds (57 seconds)
Total Time: 0 hours 1 minutes 50 seconds (110 seconds)
real 1m50.185s
user 1m46.010s
sys 0m0.000s
The code now runs 35.6% faster than with GCC 3.4.1.
Well, you can’t really blame a guy benchmarking compilers for not coming up with his own patch to make POV-Ray compile, can you? (Yes, he did patch GCC, but that patch already existed.)
IMHO you should send your patch to the POV-Ray maintainers; that way, the next time he writes such a comparison article, the results will be more accurate.
His Opteron machines use the 240 processor (i.e. 1.4 GHz). This means that the Xeons have a 1.4 GHz (100%) clock-speed advantage over it.
It seems like every time GCC has a major release, they break everything that already works. I haven’t checked out the code myself, but based on the benchmarks, it looks like they are planning on keeping the tradition alive.
[quote]
Some folk may object to my use of -ffast-math — however, in numerous accuracy tests, -ffast-math produces code that is both faster and more accurate than code generated without it. Yes, -ffast-math has other aspects that make for interesting debate; however, such discussions belong in another article.[/quote]
Can anyone confirm this? It goes against what the GCC documentation states, and in my own tests, enabling -ffast-math seems to slow down SciMark on a G4.
There’s an error in the article about GCC: Scott says that tree-ssa will be included in 4.0, but it’s already present in 3.4.
Am I correct in noting that the new gcc version is 4.0 and not 3.5? This is somewhat important to me as gfortran (the brand-spanking-new Fortran 95 compiler) was to be included in the 3.5/tree-ssa branch.
What optimization options did you use on the GCC version?
gcc -O3 misses out a large number of expensive optimisations.
Actually, let me correct myself: -O3 misses loop unrolling, which makes quite a difference on POV-Ray, and it still requires you to use the -mtune and -mcpu options.
-O3 can slow down the program due to the increased code size, cache misses etc.
Of course tree-ssa isn’t included in GCC 3.4. The command-line options for SSA have been present since about GCC 3.2, but they’re not really functional. See gcc.gnu.org:
May 13, 2004
The tree-ssa branch has been merged into mainline.
April 20, 2004
GCC 3.4.0 has been released.
Tree-ssa was in a sub-branch of the 3.4 tree. It was then merged into the mainline (3.5 development, not the 3.4 release).
Also note, they just renamed 3.5 into 4.0. They are also working very hard on cleaning up the code base. The old code layout was FE – RTL – BE; now with tree-ssa installed it’s FE – tree_ssa – RTL – BE. They are still adding code to the tree_ssa section so it’ll support more optimization. They also need to review a lot of the RTL stuff and kill it. When they get finished, the code should be something more like FE – tree_ssa – BE.
Since I do not have GCC 3.4.1 myself, that is the official binary from the POV-Ray folks. The automake system defaults to -g -O2, so I assume those are the flags used (since they have no particular interest in aggressive optimizations).
Well, the issue is just that the standard Automake-generated mechanism is the wrong approach in almost all cases (i.e. except for small test programs). In fact, he does not have to patch anything; he just should not use “make”.
If you have worked with ICC for a while, you will notice this. I just think that benchmarking a compiler with very little experience of working with it is a bad approach.
BTW, the correct approach:
CC=icc CXX=icpc ./configure COMPILED_BY="your name <email@address>"
cd unix
icpc -O3 -xW -tpp7 -ipo -DHAVE_CONFIG_H -I. -I.. -I../source \
    -I../source/base -I../source/frontend -I../unix -L/usr/X11R6/lib \
    -o povray svga.cpp unix.cpp xwin.cpp ../source/*.cpp \
    ../source/base/*.cpp ../source/frontend/*.cpp \
    -lpng -ltiff -lz -ljpeg -lXpm -lSM -lICE -lX11 -lm
(the CC/CXX variables passed to configure are just there to give the binary a correct compiler version string)
When doing profile-guided optimization, first append “-prof_genx” after “-ipo”, run the compiled binary on the benchmark once, then recompile with “-prof_use” instead of “-prof_genx”; the resulting binary will then have profile-guided optimization in those places of the source code that are covered by benchmark.pov.
GCC 4.0 scores very poorly because it is very much a Work In Progress.
They’re building a completely new optimization infrastructure called tree-ssa, and only a very few optimizations have been ported over to the new structure so far. They do run SOME of the old optimizations, but there are regressions, and they know that. The results can be unpredictable. This also explains the very poor compile times: the code tree is being run through TWO optimizers, neither of which was meant to work with the other.
Just to make it clear: 4.0 is NOT intended for doing Real Work right now. This exact same set of benchmarks, run three months from now with 4.0, will give significantly different results.
Everyone involved is REALLY excited about the new optimization infrastructure – it shows a lot of promise.
----
About the Intel compiler building for x86_64: that’s probably true. You see, Intel makes processors that (sort of) implement the x86_64 architecture. The only problem is, it doesn’t even PRETEND to optimize for the AMD64 chips. So, while the compiler can produce 64-bit code, it’s really not a fair comparison to pit it against GCC, which WILL optimize for AMD64.
@Mike: If GCC breaks binary compatibility (which it does only for C++), it’s because there was a bug in the ABI that compromised correctness. If it breaks source compatibility, it’s because your code was not compliant, and non-compliant code is less than worthless. The primary goals of a compiler are correct code generation and precise language conformance. Nothing should compromise these goals, not even compatibility.
@et al: It’s important to note that these are primarily floating-point benchmarks, and the Intel/GCC comparison is done on the P4. On integer code, and on CPUs that aren’t as sensitive to instruction scheduling as the P4 (Opteron, P6, P-M), GCC is quite competitive with Intel C++. That said, Intel C++ really is an excellent compiler (though it tends to generate *large* binaries), and it deserves a place on your system if only for its excellent error messages and language conformance (courtesy of the EDG frontend). Testing your code with more than one C++ compiler is always a good way to ensure its standards conformance.
(backed with published results from magazines like c’t and iX from Germany)
Do you remember which issue of c’t this was?
If anyone has a link for other benchmarks on other architectures (MIPS/MIPSPro, SPARC/Sun’s compiler, Alpha) i’d like to see them.
I think the first benchmark (when GCC’s quality was really poor) was in c’t 23/01, p. 222, with possibly an earlier benchmark appearing in iX, but I forget which issue it was. The most recent test is in c’t 7/03, p. 226, by which time GCC had improved its performance a lot, especially in g++.
In fact, it would be interesting to compare the Intel C++ Compiler’s performance against the Sun Studio 9 compiler on IA-32 and possibly EM64T. But with such competition in the IA-32/IA-64/EM64T market, I think Sun’s pricing (all platforms for US$ 3000) is simply somewhere between totally unacceptable and outrageous. I wonder who is rich enough to buy this thing, especially the Linux version they have started to offer.
And in other news, stupid people are more likely to say stupid things.
I thought GCC held up respectably; I’ve never expected it to be the best, just stable.
As for the gentleman above arguing Intel over AMD: yes, Intel makes an overall better chip, but AMD makes a better gaming chip, and frankly I prefer AMD’s use of Moore’s law over Intel’s. Plus, it saved me $100 the last time I built a machine (late 2002).
You seem to say the article brings nothing, but at least it tells you by how much the Intel compiler is faster than GCC in some cases, and it also helps in seeing the differences between various versions of GCC.
If you look carefully, Sun is using the PathScale compilers for Opteron systems…
Since I don’t frequent these forums, I’ll try to answer all of the questions and comments at one time:
Intel EMT64 compiler on Opteron
I have a commercial Intel license, as mentioned in the article. It is not a matter of the icc compiler running on the Opteron; it is a matter of Intel turning off certain optimizations for processors that are not “Genuine Intel.” Also, as I state in the article, I don’t think it is reasonable to run Intel’s compiler on their competitor’s chips.
POV-Ray benchmarks
If you’re going to quote numbers, use the same official benchmark from the POV-Ray web site; there’s a link to it in the article. Otherwise, there’s no way to compare your numbers to mine.
Use of -ffast-math
Check the GCC mailing list archives; you’ll find several recent discussions where people such as myself have shown that -ffast-math is a non-issue in terms of accuracy. I use William Kahan’s PARANOIA as my baseline for proper numerical behavior; since he’s one of the inventors of IEEE 754, I’d say he knows his business.
tree-ssa
The “tree-ssa” in 3.4 is experimental, and is not the same thing as the tree-ssa that forms the middle layer of GCC 4.0. The tree-ssa branch was developed completely separately and was integrated into GCC mainline development only recently. The details are available at the main GCC web site.
..Scott
Didn’t I mention that I also used benchmark.pov? And you are clearly avoiding the central problem: you switched off IPO. It is not a matter of comparability; your benchmark does not reflect the true ability of the Intel compiler.
Even if you do not know how to compile correctly with IPO without the Automake mechanism, that does not change the fact that most people who truly use aggressive optimizations with the Intel compiler simply do not use Automake (and essentially compile the way I do).
My fault: I did not explicitly mention that I tested using benchmark.pov. But why do you automatically assume that I did not use benchmark.pov, when you have not even successfully built POV-Ray with IPO?
I am the author of an arbitrary-precision math library, MAPM. Arbitrary-precision math (performing common math functions to thousands or millions of digits) can be a good benchmark, since it is very number-crunching intensive but has very little disk/file I/O.

I have used GCC and the Intel compiler for testing, and I find the Intel compiler is 10-15% faster (going from memory here…).
Just a possible suggestion for future benchmarks.
http://www.tc.umn.edu/~ringx004/mapm-main.html
Mike