Linked by ebasconp on Fri 10th Jun 2011 22:22 UTC
Benchmarks "Google has released a research paper that suggests C++ is the best-performing programming language in the market. The internet giant implemented a compact algorithm in four languages - C++, Java, Scala and its own programming language Go - and then benchmarked results to find 'factors of difference'."
Thread beginning with comment 476914
To view parent comment, click here.
To read all comments associated with this story, please click here.
RE[2]: GCC isn't all that great
by Alfman on Sat 11th Jun 2011 07:54 UTC in reply to "RE: GCC isn't all that great"
Alfman
Member since:
2011-01-28

Valhalla,

"Examples?"


That's a fair question, but please understand I don't have very much time to allot to it.

Anyway, here's a trivial example.

void vector_add(unsigned int*a, unsigned int a_len, unsigned int*b, unsigned int b_len) {
unsigned int i;
for(i=0; i<a_len && i<b_len; i++) {
a[i] += b[i];
}
}


Here is the GCC output using -O3 and march=core2


vector_add:
push ebp
mov ebp, esp
push edi
mov ecx, DWORD PTR [ebp+8]
push esi
mov edi, DWORD PTR [ebp+20]
push ebx
mov esi, DWORD PTR [ebp+16]
mov ebx, DWORD PTR [ebp+12]
test ebx, ebx
je .L5
test edi, edi
je .L5
xor edx, edx
.L3:
mov eax, DWORD PTR [esi+edx*4]
add DWORD PTR [ecx+edx*4], eax
inc edx
cmp ebx, edx
jbe .L5
cmp edi, edx
ja .L3
.L5:
pop ebx
pop esi
pop edi
leave
ret


Now I'm not sure how many people read asm, but it's short and not difficult to follow.

GCC failed to vectorize the code with SSE2 instructions. This would be an obvious way to significantly optimize the example (both in terms of memory bandwidth and in terms of parallel computation).

Secondly, GCC failed to recognize that is isn't necessary to carry two compare operations into the loop. It should be using a cmov to calculate the minimum number of loops upfront, and then there would only be a single condition in the loop.

Thirdly, although there is no performance penalty in this case, GCC generally manipulates indexes instead of converting loops to use pointer arithmetic. Pointer arithmetic is far more efficient in the case when a shortage of general purpose registers forces variables onto the stack. I've seen GCC load both the base pointer and index from the stack and add them inside the loop at every iteration.

Fourthly, it didn't unroll any loops. Personally I will usually trade off in favour of instruction size anyway, but it's interesting because I specified -O3. I'm sure we could get GCC to unroll loops by manually setting the heuristics, but an assembly programmer would have yet another easy edge over the default compiler output.


"however I very much disagree with it 'not being all that difficult to beat it', you certainly need to be a very good assembly programmer to do so."

You may be assuming that GCC does a good job in the first place. It's usually 'good enough', but it isn't great.


"Hence very few programmers out there can actually beat the optimized output of compilers."

Maybe the example above will change your mind?

I don't have ICC on hand to compare, but it is said to be excellent at SSE2 and is probably also better at optimization all around.

Reply Parent Score: 5

acobar Member since:
2005-11-15

Well, I will repeat to you what I do to everyone I met and that would ask my advise (not your case) on programming: first look for a good, tested, implementation. Many people try to code fragments that are already optimized for ages. They don't look at Donald Knuth's ACP, fftw, lapack, atlas and many other well-thought code before they start and, as such, write subpar code.

For example, if you change your code to:

void vector_add(unsigned int*a, unsigned int a_len, unsigned int*b, unsigned int b_len)
{
unsigned int i;
for (i = a_len < b_len ? a_len : b_len ; i > 0 ; ) {
i--;
a[i] += b[i];
}
}

You get a pretty decent optimized code:

vector_add:
.LFB0:
.cfi_startproc
cmpl %esi, %ecx
cmovbe %ecx, %esi
testl %esi, %esi
je .L1
.p2align 4,,10
.p2align 3
.L4:
decl %esi
mov %esi, %eax
movl (%rdx,%rax,4), %ecx
addl %ecx, (%rdi,%rax,4)
testl %esi, %esi
jne .L4
.L1:
rep
ret
.cfi_endproc

As can be clearly seen, compilers are not black-magic wizards and, regretfully, depend on the code quality we give to them, and we should never forget that.

Reply Parent Score: 4

Alfman Member since:
2011-01-28

acobar,

"Well, I will repeat to you what I do to everyone I met and that would ask my advise (not your case) on programming: first look for a good, tested, implementation. Many people try to code fragments that are already optimized for ages. They don't look at Donald Knuth's ACP, fftw, lapack, atlas and many other well-thought code before they start and, as such, write subpar code."


I actually don't encourage people to optimize things at such small levels or in assembly. I'm just disturbed that so many people are under the false impression that their compilers are so difficult to beat.


"You get a pretty decent optimized code:"

Yes, I am aware that we can coerce the compiler to generate better instructions by rewriting the C code.

(Although it still isn't as good for the other reasons already discussed).

It concerns me that GCC converts our C code to assembly so literally instead of generating more optimal paths.

Should we really encourage devs to adjust thier programming style to compensate for GCC optimizer deficiency?

Does this mean that C devs will have to understand CPU internals in order to have GCC produce optimal code?

I really don't think we should be making excuses for GCC here. There is nothing unreasonable about expecting the original code to be optimized by the compiler.


Edit: By the way I am guilty of rewriting C code at the micro level in order to help GCC optimizer along. But we shouldn't have to do this in the first place.

Edited 2011-06-11 10:08 UTC

Reply Parent Score: 5

Neolander Member since:
2010-03-08

I'd argue that the C/C++ developer shouldn't have to do black magic in order to optimize its code either.

Alfman's use of for loops is perfectly normal, easy to understand, and you'll find it in every book on C. One shouldn't have to sacrifice code clarity like you did in order to get better performance, especially on such trivial code.

Reply Parent Score: 2

Timmmm Member since:
2006-07-25

for (i = a_len < b_len ? a_len : b_len ; i > 0 ; ) {
i--;
a[i] += b[i];
}


I'm pretty sure this is a terrible idea because it will mess up all the cache lines and pre-fetching if you walk through memory backwards.

Reply Parent Score: 2

Valhalla Member since:
2006-01-24

Valhalla,

"Examples?"

Maybe GCC should have been able to optimize one of the len comparisons away, but really it's (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp 'the bigger picture'.

Secondly it seems obvious that gcc doesn't choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization (I seriously doubt ICC would do so either with this snippet), case in point I changed the 'len' variables to a constant (100) in your snippet which resulted in the following (note 64bit):

leaq 16(%rdx), %rax
cmpq %rax, %rdi
jbe .L11
.L7:
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rdx,%rax), %xmm1
movdqu (%rdi,%rax), %xmm0
paddd %xmm1, %xmm0
movdqu %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $400, %rax
jne .L4
rep
ret
.L11:
leaq 16(%rdi), %rax
cmpq %rax, %rdx
ja .L7
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movl (%rdx,%rax), %ecx
addl %ecx, (%rdi,%rax)
addq $4, %rax
cmpq $400, %rax
jne .L2
rep
ret

Setting the constant to 5 or above made GCC use vectorization for the loop.

Granted, calls to this function would not be using constants in a 'real program' but the compiler would likely have a much better chance of deciding if it should vectorize the loop or not.

As for loop unrolling, it is not turned on by -O3 due to being difficult to estimate it's efficiency without runtime data (it's turned on automatically when PGO is used), does any compiler turn this on by default? (GCC and Clang/LLVM doesn't).

Reply Parent Score: 4

Alfman Member since:
2011-01-28

"Maybe GCC should have been able to optimize one of the len comparisons away, but really it's (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp 'the bigger picture'."

Stupid code? Sure it was a trivial example, but that was deliberate. You'll have to take my word that GCC has the same shortcomings on more complex code from real programs.

Even if you want to blame the programmer here, a seasoned programmer will have no reasonable way of knowing if GCC has optimized the loop correctly without looking at the assembly output. Do we really want to go down the route of saying programmers need to check up on GCC's assembly output?

Should students be taught to avoid legal C constructs which give the GCC optimizer a hard time?


"Secondly it seems obvious that gcc doesn't choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization"

I have no idea why GCC chose not to use SSE, but the result is still that the assembly language programmer would be able to beat it.

"(I seriously doubt ICC would do so either with this snippet)"

I wish someone could test this for us, Intel boasts very aggressive SSE optimization.

Reply Parent Score: 3

burnttoys Member since:
2008-12-20

That code can not be vectorised as it is written as 'a' and 'b' may point to overlapping areas of memory.

The 'restrict' keyword is necessary.

Reply Parent Score: 1

Alfman Member since:
2011-01-28

burnttoys,

"That code can not be vectorised as it is written as 'a' and 'b' may point to overlapping areas of memory.

The 'restrict' keyword is necessary."

Wow, thanks burnttoys, you are correct, some constructive criticism!

It should be using restrict just like 'memcpy' does.

Unfortunately I tried it and it didn't change the output at all. Additionally I see a lot of devs claiming that the restrict keyword doesn't have any effect under GCC, can you confirm that?

Reply Parent Score: 2

Carewolf Member since:
2005-09-08

After identifying that gcc didn't perform these optimization have you considered trying gcc with the options to enable them? -funroll-loops -ftree-vectorize.

It is not fun blaming the compiler for not doing optimizations it hasn't been asked to do. I know the default optimizations suck, but that is a well known problem with gcc.

The auto-vectorizer isn't very good with integers, but give it a try.

Edited 2011-06-13 09:46 UTC

Reply Parent Score: 2

Neolander Member since:
2010-03-08

Aren't those supposed to be automatically enabled by O2 or O3 ?

Reply Parent Score: 1

Alfman Member since:
2011-01-28

Carewolf,

Thanks for the feedback.

For me GCC does use SSE for me with and without '-ftree-vectorize' (when I tweak the C source code).

I couldn't get GCC to do vector math without changing the source file to calculate the number of loops for GCC. In a case as simple as this, the compiler should have been able to handle it.

I know Valhalla is complaining about this specific example (not sure why?), but I do frequently come across issues like this in much more complex code where GCC misses an equally trivial optimization.

Reply Parent Score: 2