Linked by ebasconp on Fri 10th Jun 2011 22:22 UTC
Benchmarks "Google has released a research paper that suggests C++ is the best-performing programming language in the market. The internet giant implemented a compact algorithm in four languages - C++, Java, Scala and its own programming language Go - and then benchmarked results to find 'factors of difference'."
Thread beginning with comment 476998
RE[3]: GCC isn't all that great
by Valhalla on Sat 11th Jun 2011 20:21 UTC in reply to "RE[2]: GCC isn't all that great"
Valhalla Member since:
2006-01-24

Valhalla,

"Examples?"

Maybe GCC should have been able to optimize one of the len comparisons away, but really it's (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp 'the bigger picture'.

Secondly it seems obvious that gcc doesn't choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization (I seriously doubt ICC would do so either with this snippet). Case in point: I changed the 'len' variables to a constant (100) in your snippet, which resulted in the following (note: 64-bit):

leaq 16(%rdx), %rax
cmpq %rax, %rdi
jbe .L11
.L7:
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rdx,%rax), %xmm1
movdqu (%rdi,%rax), %xmm0
paddd %xmm1, %xmm0
movdqu %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $400, %rax
jne .L4
rep
ret
.L11:
leaq 16(%rdi), %rax
cmpq %rax, %rdx
ja .L7
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movl (%rdx,%rax), %ecx
addl %ecx, (%rdi,%rax)
addq $4, %rax
cmpq $400, %rax
jne .L2
rep
ret

Setting the constant to 5 or above made GCC use vectorization for the loop.

Granted, calls to this function would not be using constants in a 'real program', but with the rest of the program in view the compiler would likely have a much better chance of deciding whether it should vectorize the loop or not.
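
For readers without the parent comment at hand: the snippet under discussion is not quoted in this part of the thread, so the following is only a guessed reconstruction in C, based on the description (two length checks in the loop) and the assembly above. The names add_var, add_const and N are made up for illustration. The first function is the runtime-length form; the second is the constant-length variant, with 100 four-byte ints matching the cmpq $400 bound in the listing.

```c
#include <stddef.h>

/* Hypothetical reconstruction: the original snippet is not shown in this
 * thread excerpt, so the signature and names are assumptions. */

/* Runtime lengths: the trip count is unknown at compile time and the
 * loop condition tests both lengths on every iteration. */
void add_var(int *a, size_t a_len, const int *b, size_t b_len)
{
    for (size_t i = 0; i < a_len && i < b_len; i++)
        a[i] += b[i];
}

/* Constant length: the trip count is known at compile time, which is
 * the change described above (the 'len' variables replaced by 100). */
#define N 100

void add_const(int *a, const int *b)
{
    for (size_t i = 0; i < N; i++)
        a[i] += b[i];
}
```

Compiling both with something like gcc -O3 -S and comparing the two listings is the way to see what a given compiler version actually does; the result will differ between versions and targets.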

As for loop unrolling, it is not turned on by -O3 because it is difficult to estimate its efficiency without runtime data (it's turned on automatically when PGO is used). Does any compiler turn this on by default? (GCC and Clang/LLVM don't.)

Reply Parent Score: 4

Alfman Member since:
2011-01-28

"Maybe GCC should have been able to optimize one of the len comparisons away, but really it's (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp 'the bigger picture'."

Stupid code? Sure it was a trivial example, but that was deliberate. You'll have to take my word that GCC has the same shortcomings on more complex code from real programs.

Even if you want to blame the programmer here, a seasoned programmer will have no reasonable way of knowing if GCC has optimized the loop correctly without looking at the assembly output. Do we really want to go down the route of saying programmers need to check up on GCC's assembly output?

Should students be taught to avoid legal C constructs which give the GCC optimizer a hard time?


"Secondly it seems obvious that gcc doesn't choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization"

I have no idea why GCC chose not to use SSE, but the result is still that the assembly language programmer would be able to beat it.

"(I seriously doubt ICC would do so either with this snippet)"

I wish someone could test this for us, Intel boasts very aggressive SSE optimization.

Reply Parent Score: 3

moondevil Member since:
2005-07-08

You are forgetting something in your examples.

It used to be that most humans could beat compiler-generated code. In this day and age that is only true for small code snippets or simple processors.

Most up-to-date processors use out-of-order execution with superscalar processing units, and translate CISC instructions into RISC-like micro-operations. And this varies from processor model to processor model, even within the same family!

It is very hard for most humans to keep all processor features in their head while coding assembly and still beat the code generated by high-performance compilers. Not GCC, but the ones you pay several thousand euros/dollars for, with years of research put into them.

Reply Parent Score: 2

Valhalla Member since:
2006-01-24


"Stupid code? Sure it was a trivial example, but that was deliberate."

Having two comparisons within the loop was obviously poor coding in this example, which you expected the compiler to fix for you.
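
A sketch of what removing the second comparison by hand could look like, assuming the loop originally tested both lengths on every iteration. The original snippet is not shown here, so the signature and the names add_hoisted and add_hoisted_restrict are hypothetical. The shorter length is computed once before the loop. The restrict variant additionally promises the compiler that the two arrays do not overlap, which is the condition the leaq/cmpq prologue in the assembly earlier in the thread checks at run time. Whether a particular compiler then vectorizes either version is not something this thread establishes.

```c
#include <stddef.h>

/* Hypothetical names and signature, for illustration only. */

/* Compute the shorter length once, so the loop has a single bound. */
void add_hoisted(int *a, size_t a_len, const int *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Same loop, with restrict (C99): the caller guarantees that a and b
 * do not overlap, so no runtime overlap check is needed for safety. */
void add_hoisted_restrict(int *restrict a, size_t a_len,
                          const int *restrict b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}
```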


"Do we really want to go down the route of saying programmers need to check up on GCC's assembly output?"

If you find that the performance is not what you'd expect out of the given code, you will profile and look at the assembly output of the performance hotspots; it doesn't matter if it's GCC, VC, ICC or Clang/LLVM.


"Should students be taught to avoid legal C constructs which give the GCC optimizer a hard time?"

Legal constructs do not equal efficient code. Compilers have never been able to turn shitty code into good code. If you know of one, please inform me, I'd buy it in a second.


"I wish someone could test this for us, Intel boasts very aggressive SSE optimization."

I compiled your snippet with Clang 2.9; it didn't vectorize it either until I replaced the len vars with constants, just like in the case with GCC. Again I doubt ICC would do it either.

Again, this function within the context of a whole program would likely yield a different result (definitely if PGO was used). For instance it would be interesting to dissect the output of some of the micro-benchmarks over at language-shootout.

On a slightly off-topic note, does anyone have any experience with the EKOPath 4 compiler suite? It appears that it is to be released as open source (GPLv3), and judging by the performance benchmarks it appears to offer some GPGPU solution:

http://www.phoronix.com/scan.php?page=article&item=phoronix_dirndl_...

Reply Parent Score: 2

f0dder Member since:
2009-08-05

"Do we really want to go down the route of saying programmers need to check up on GCC's assembly output?"

When you need maximum performance for a piece of code, yes. When you don't, simply don't worry about it - not generating optimal code doesn't matter if the code isn't a hotspot.

Reply Parent Score: 1