Linked by ebasconp on Fri 10th Jun 2011 22:22 UTC
Benchmarks "Google has released a research paper that suggests C++ is the best-performing programming language in the market. The internet giant implemented a compact algorithm in four languages - C++, Java, Scala and its own programming language Go - and then benchmarked results to find 'factors of difference'."
Thread beginning with comment 476901
RE: GCC isn't all that great
by Valhalla on Sat 11th Jun 2011 04:12 UTC in reply to "GCC isn't all that great"
Valhalla
Member since:
2006-01-24

I tend to agree that it's not usually worth optimizing code in assembly, however GCC's code is often less efficient than it should be.

Examples?


We as programmers can do a better job than GCC at anticipating certain states and algorithmic conditions which give GCC trouble. Under these circumstances, it's not all that difficult to beat it.

Certainly we programmers can do a better job, given that we understand the overall purpose of the code much better than any compiler. However, I very much disagree that it's 'not all that difficult to beat it'; you need to be a very good assembly programmer to do so.

Expert programmers who are well-versed both in assembly and in the problem they are solving will certainly at least equal, and most often beat, a compiler. A case in point would be encoders like x264: they are full of hand-optimized assembly that easily beats the output of GCC, ICC, LLVM, etc. when the hand-written assembly is disabled and compiler optimization is relied on entirely. But those developers are very good assembly programmers and experts at what they are implementing, which is hardly the norm amongst programmers. Hence very few programmers out there can actually beat the optimized output of compilers.

Reply Parent Score: 5

RE[2]: GCC isn't all that great
by Alfman on Sat 11th Jun 2011 07:54 in reply to "RE: GCC isn't all that great"
Alfman Member since:
2011-01-28

Valhalla,

"Examples?"


That's a fair question, but please understand I don't have very much time to allot to it.

Anyway, here's a trivial example.

void vector_add(unsigned int *a, unsigned int a_len, unsigned int *b, unsigned int b_len) {
    unsigned int i;
    for (i = 0; i < a_len && i < b_len; i++) {
        a[i] += b[i];
    }
}


Here is the GCC output using -O3 and -march=core2:


vector_add:
        push    ebp
        mov     ebp, esp
        push    edi
        mov     ecx, DWORD PTR [ebp+8]
        push    esi
        mov     edi, DWORD PTR [ebp+20]
        push    ebx
        mov     esi, DWORD PTR [ebp+16]
        mov     ebx, DWORD PTR [ebp+12]
        test    ebx, ebx
        je      .L5
        test    edi, edi
        je      .L5
        xor     edx, edx
.L3:
        mov     eax, DWORD PTR [esi+edx*4]
        add     DWORD PTR [ecx+edx*4], eax
        inc     edx
        cmp     ebx, edx
        jbe     .L5
        cmp     edi, edx
        ja      .L3
.L5:
        pop     ebx
        pop     esi
        pop     edi
        leave
        ret


Now I'm not sure how many people read asm, but it's short and not difficult to follow.

GCC failed to vectorize the code with SSE2 instructions. This would be an obvious way to significantly optimize the example (both in terms of memory bandwidth and in terms of parallel computation).
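
For illustration, a hand-vectorized version could look roughly like this using SSE2 intrinsics (just a sketch, not what GCC emits, and it assumes the two arrays don't overlap):

#include <emmintrin.h>  /* SSE2 intrinsics */

void vector_add_sse2(unsigned int *a, unsigned int a_len,
                     unsigned int *b, unsigned int b_len) {
    unsigned int n = a_len < b_len ? a_len : b_len;
    unsigned int i = 0;
    /* add four 32-bit integers per iteration */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_add_epi32(va, vb));
    }
    /* scalar tail for the remaining elements */
    for (; i < n; i++)
        a[i] += b[i];
}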

Secondly, GCC failed to recognize that it isn't necessary to carry two compare operations into the loop. It should use a cmov to calculate the minimum number of iterations up front, and then there would only be a single condition in the loop.
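
In C terms, the rewrite would be something like this (a sketch; the min is what a compiler can lower to a cmov):

    unsigned int n = a_len < b_len ? a_len : b_len;  /* minimum of the two lengths */
    for (i = 0; i < n; i++) {                        /* single loop condition */
        a[i] += b[i];
    }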

Thirdly, although there is no performance penalty in this case, GCC generally manipulates indexes instead of converting loops to use pointer arithmetic. Pointer arithmetic is far more efficient in the case when a shortage of general purpose registers forces variables onto the stack. I've seen GCC load both the base pointer and index from the stack and add them inside the loop at every iteration.
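
A pointer-walking version of the same loop, for comparison (again just a sketch):

    unsigned int n = a_len < b_len ? a_len : b_len;
    unsigned int *end = a + n;   /* one past the end of the region to update */
    while (a < end) {
        *a++ += *b++;            /* advance both pointers, no index register needed */
    }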

Fourthly, it didn't unroll any loops. Personally I will usually trade that off in favour of smaller code size anyway, but it's interesting because I specified -O3. I'm sure we could get GCC to unroll loops by manually tweaking the heuristics, but an assembly programmer would have yet another easy edge over the default compiler output.
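
A manual 4x unroll would look something like this (sketch only; the remainder is handled separately):

    unsigned int n = a_len < b_len ? a_len : b_len;
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {   /* four elements per iteration */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; i++)             /* 0-3 leftover elements */
        a[i] += b[i];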


"however I very much disagree with it 'not being all that difficult to beat it', you certainly need to be a very good assembly programmer to do so."

You may be assuming that GCC does a good job in the first place. It's usually 'good enough', but it isn't great.


"Hence very few programmers out there can actually beat the optimized output of compilers."

Maybe the example above will change your mind?

I don't have ICC on hand to compare, but it is said to be excellent at SSE2 and is probably also better at optimization all around.

Reply Parent Score: 5

acobar Member since:
2005-11-15

Well, I will repeat to you what I tell everyone who asks my advice (not your case) on programming: first look for a good, tested implementation. Many people try to code fragments that were already optimized ages ago. They don't look at Donald Knuth's TAOCP, FFTW, LAPACK, ATLAS and much other well-thought-out code before they start and, as such, write subpar code.

For example, if you change your code to:

void vector_add(unsigned int *a, unsigned int a_len, unsigned int *b, unsigned int b_len)
{
    unsigned int i;
    for (i = a_len < b_len ? a_len : b_len; i > 0; ) {
        i--;
        a[i] += b[i];
    }
}

You get pretty decent optimized code:

vector_add:
.LFB0:
        .cfi_startproc
        cmpl    %esi, %ecx
        cmovbe  %ecx, %esi
        testl   %esi, %esi
        je      .L1
        .p2align 4,,10
        .p2align 3
.L4:
        decl    %esi
        mov     %esi, %eax
        movl    (%rdx,%rax,4), %ecx
        addl    %ecx, (%rdi,%rax,4)
        testl   %esi, %esi
        jne     .L4
.L1:
        rep
        ret
        .cfi_endproc

As can be clearly seen, compilers are not black-magic wizards; regretfully, they depend on the quality of the code we give them, and we should never forget that.

Reply Parent Score: 4

Valhalla Member since:
2006-01-24


Maybe GCC should have been able to optimize one of the len comparisons away, but really it's (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp 'the bigger picture'.

Secondly, it seems obvious that gcc doesn't choose to vectorize since it has no idea how many integers are to be added, and depending on that number it could very well be less efficient to vectorize (I seriously doubt ICC would do so with this snippet either). Case in point: I changed the 'len' variables to a constant (100) in your snippet, which resulted in the following (note: 64-bit):

        leaq    16(%rdx), %rax
        cmpq    %rax, %rdi
        jbe     .L11
.L7:
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L4:
        movdqu  (%rdx,%rax), %xmm1
        movdqu  (%rdi,%rax), %xmm0
        paddd   %xmm1, %xmm0
        movdqu  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $400, %rax
        jne     .L4
        rep
        ret
.L11:
        leaq    16(%rdi), %rax
        cmpq    %rax, %rdx
        ja      .L7
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movl    (%rdx,%rax), %ecx
        addl    %ecx, (%rdi,%rax)
        addq    $4, %rax
        cmpq    $400, %rax
        jne     .L2
        rep
        ret

Setting the constant to 5 or above made GCC use vectorization for the loop.
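
For reference, the constant-length variant amounts to something like this (a sketch, not my exact code):

void vector_add(unsigned int *a, unsigned int a_len, unsigned int *b, unsigned int b_len) {
    unsigned int i;
    for (i = 0; i < 100; i++) {   /* both length checks replaced by the constant */
        a[i] += b[i];
    }
}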

Granted, calls to this function would not be using constants in a 'real program', but there the compiler would likely have a much better chance of deciding whether or not it should vectorize the loop.

As for loop unrolling, it is not turned on by -O3 because it's difficult to estimate its benefit without runtime data (it is turned on automatically when PGO is used). Does any compiler turn this on by default? (GCC and Clang/LLVM don't.)

Reply Parent Score: 4

burnttoys Member since:
2008-12-20

That code cannot be vectorised as written, since 'a' and 'b' may point to overlapping areas of memory.

The 'restrict' keyword is necessary.
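
i.e. something along these lines (a sketch):

void vector_add(unsigned int *restrict a, unsigned int a_len,
                unsigned int *restrict b, unsigned int b_len) {
    unsigned int i;
    /* restrict promises the compiler that a and b never alias,
       so it can vectorize without a runtime overlap check */
    for (i = 0; i < a_len && i < b_len; i++)
        a[i] += b[i];
}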

Reply Parent Score: 1

Carewolf Member since:
2005-09-08

After identifying that gcc didn't perform these optimizations, have you considered trying gcc with the options that enable them? -funroll-loops -ftree-vectorize.

It is not fair to blame the compiler for not doing optimizations it hasn't been asked to do. I know the default optimizations suck, but that is a well-known problem with gcc.

The auto-vectorizer isn't very good with integers, but give it a try.
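
Something like the following invocation, for example (illustrative only, check the flags against your GCC version's docs):

    gcc -O3 -march=core2 -ftree-vectorize -funroll-loops -S vector_add.c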

Edited 2011-06-13 09:46 UTC

Reply Parent Score: 2

RE[2]: GCC isn't all that great
by f0dder on Sat 11th Jun 2011 18:22 in reply to "RE: GCC isn't all that great"
f0dder Member since:
2009-08-05

...not to mention that they're dealing with pretty darn specific algorithms that can take advantage of SSE and benefit from hand-rolled register allocation.

Not so much with your typical run of the mill software ;)

Reply Parent Score: 1

RE[2]: GCC isn't all that great
by _xmv on Sun 12th Jun 2011 11:57 in reply to "RE: GCC isn't all that great"
_xmv Member since:
2008-12-09

this is hardly the norm amongst programmers. Hence very few programmers out there can actually beat the optimized output of compilers.

i think you're confusing "programmers" and "i fired up a shell and a compiler and i think i can program stuff"

Seriously, in fact.

Sure, few will code in asm for the hell of it (except the MenuetOS people ;-), but anyone who's actually a real programmer will insert asm wherever it's properly needed.
Coding everything in asm is just too time-consuming and can be major hair-pulling due to the size of the projects, unless there's a strong incentive to manage things (as Menuet has).

Reply Parent Score: 2

Alfman Member since:
2011-01-28

_xmv,

"Sure few will code in asm for the hell of it (except MenuetOS people ;-), but any who's actually a real programmer will insert asm wherever he needs it properly.
Coding everything asm is just too time consuming and can be major hair pulling due to the size of the projects without a strong incentive to manage things (like menuet does)"

GCC could improve in the future, but until then, the common perceptions don't do justice to the skill of assembly level programmers.

I still think that for most programmers/projects there are very strong reasons not to go down the ASM route, regardless of asm skill.

It's usually too much of a maintenance burden, and I don't have the time or hardware to test all the hardware combinations that GCC/C can support.

There are some times when I'm not left with much choice though.

In the last performance-critical project I worked on, I needed to divide a 64bit number by a 32bit number such that the result would always fit in a 32bit register. Obviously this is exactly what the x86 DIV instruction has done since the 386.

Unfortunately, GCC (on x86) cannot optimize this case and instead emulates a full 64bit division (numerator, divisor, dividend) in a hidden call to "__divdi3", which has terrible performance.

GCC does what's required of the C spec (full 64bit division, and then demoting to 32bit), but it doesn't bother trying to optimize the case.
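
For what it's worth, the kind of inline asm that forces the DIV instruction looks roughly like this (a sketch only, and the helper name is made up; the caller must guarantee the quotient fits in 32 bits, otherwise DIV faults):

#include <stdint.h>

static inline uint32_t div64_32(uint64_t n, uint32_t d) {
    uint32_t quot, rem;
    /* DIV divides EDX:EAX by the operand; quotient ends up in EAX, remainder in EDX */
    __asm__("divl %4"
            : "=a"(quot), "=d"(rem)
            : "a"((uint32_t)n), "d"((uint32_t)(n >> 32)), "rm"(d));
    return quot;
}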

Conversely, GCC does optimize the case where two 32bit variables are multiplied into a 64bit variable:

uint32_t a,b;
uint64_t ab;
...
ab = (uint64_t)a*(uint64_t)b;

GCC does not perform a full 64bit multiplication above; it emits the x86 MUL instruction, which multiplies two 32bit numbers and produces a 64bit result.


Fun times!

Reply Parent Score: 2