Linked by ebasconp on Fri 10th Jun 2011 22:22 UTC
Benchmarks "Google has released a research paper that suggests C++ is the best-performing programming language in the market. The internet giant implemented a compact algorithm in four languages - C++, Java, Scala and its own programming language Go - and then benchmarked results to find 'factors of difference'."
Permalink for comment 476914
To read all comments associated with this story, please click here.
RE[2]: GCC isn't all that great
by Alfman on Sat 11th Jun 2011 07:54 UTC in reply to "RE: GCC isn't all that great"
Alfman
Member since:
2011-01-28

Valhalla,

"Examples?"


That's a fair question, but please understand I don't have very much time to allot to it.

Anyway, here's a trivial example.

void vector_add(unsigned int*a, unsigned int a_len, unsigned int*b, unsigned int b_len) {
unsigned int i;
for(i=0; i<a_len && i<b_len; i++) {
a[i] += b[i];
}
}


Here is the GCC output using -O3 and march=core2


vector_add:
push ebp
mov ebp, esp
push edi
mov ecx, DWORD PTR [ebp+8]
push esi
mov edi, DWORD PTR [ebp+20]
push ebx
mov esi, DWORD PTR [ebp+16]
mov ebx, DWORD PTR [ebp+12]
test ebx, ebx
je .L5
test edi, edi
je .L5
xor edx, edx
.L3:
mov eax, DWORD PTR [esi+edx*4]
add DWORD PTR [ecx+edx*4], eax
inc edx
cmp ebx, edx
jbe .L5
cmp edi, edx
ja .L3
.L5:
pop ebx
pop esi
pop edi
leave
ret


Now I'm not sure how many people read asm, but it's short and not difficult to follow.

GCC failed to vectorize the code with SSE2 instructions. This would be an obvious way to significantly optimize the example (both in terms of memory bandwidth and in terms of parallel computation).

Secondly, GCC failed to recognize that is isn't necessary to carry two compare operations into the loop. It should be using a cmov to calculate the minimum number of loops upfront, and then there would only be a single condition in the loop.

Thirdly, although there is no performance penalty in this case, GCC generally manipulates indexes instead of converting loops to use pointer arithmetic. Pointer arithmetic is far more efficient in the case when a shortage of general purpose registers forces variables onto the stack. I've seen GCC load both the base pointer and index from the stack and add them inside the loop at every iteration.

Fourthly, it didn't unroll any loops. Personally I will usually trade off in favour of instruction size anyway, but it's interesting because I specified -O3. I'm sure we could get GCC to unroll loops by manually setting the heuristics, but an assembly programmer would have yet another easy edge over the default compiler output.


"however I very much disagree with it 'not being all that difficult to beat it', you certainly need to be a very good assembly programmer to do so."

You may be assuming that GCC does a good job in the first place. It's usually 'good enough', but it isn't great.


"Hence very few programmers out there can actually beat the optimized output of compilers."

Maybe the example above will change your mind?

I don't have ICC on hand to compare, but it is said to be excellent at SSE2 and is probably also better at optimization all around.

Reply Parent Score: 5