“Google has released a research paper that suggests C++ is the best-performing programming language in the market. The internet giant implemented a compact algorithm in four languages – C++, Java, Scala and its own programming language Go – and then benchmarked results to find ‘factors of difference’.”
I didn’t see anything about which compilers they chose to compare.
Performance characteristics are not the result of language syntax so much as of the implementation of the compilers and supporting libraries. Google should have known this from the get-go, so why didn’t they compare implementations?
The significant variation in performance between C/C++ compilers clearly proves (as if any proof were needed) that compiler implementations do matter.
Another point is that some language implementations (Eiffel) compile down to C and are then compiled/optimized by the C compiler.
The question of which language syntax performs better is meaningless.
Certain language characteristics have obvious performance consequences; for instance, Java, Scala and Go are all garbage collected, which certainly comes with a performance penalty.
Eiffel wasn’t measured here was it? Nor was any other language (I know of Vala) which compiles to C?
Either way, I doubt it was news to anyone that C++ was faster than Java, Scala and Go. As for the compilers of these languages, Go certainly has a very immature compiler suite given its young age and the fact that it is still heavily in development; Gccgo (which relies on GCC’s backend for code optimization) is said by the Go devs to produce much faster code than the official compilers.
I doubt that those exact features were put to test – garbage collection has nothing to do with this paper – and I doubt that it had anything to do with memory allocation. As a side note, I found that garbage collection does work better than manual memory management in memory-intensive programs.
Read the last part of the paper.
(myself) Manual deletes are faster as long as the destructor is simple. You can have a pseudo-fast GC by deriving all objects from a GC class, using it to increment/decrement the object reference count, and using a templated, inlined accessor-like function instead of = (or overriding = manually for every class) to keep the count. Then, when the ref count reaches 0, the object deletes itself.
But this is neither “safe” nor complete. Real-time ref analysis is slow. Static/dynamic analysis also requires keeping/managing a complete list of all memory allocations; this takes management time, interpreter time (for VM-based languages) and a lot of memory. At least, that’s how I see it.
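(A minimal sketch of the intrusive ref-count idea described above — all names here are made up and it is deliberately bare-bones:)

struct RefCounted {                          // hypothetical GC base class
    long refs;
    RefCounted() : refs(0) {}
    virtual ~RefCounted() {}
};

template <typename T>
class Ref {                                  // templated, inlined accessor
    T *p;
    void acquire(T *q) { p = q; if (p) ++p->refs; }
    void release()     { if (p && --p->refs == 0) delete p; }
public:
    Ref() : p(0) {}
    explicit Ref(T *q) : p(0) { acquire(q); }
    Ref(const Ref &o) : p(0)  { acquire(o.p); }
    Ref &operator=(const Ref &o) {           // overriding '=' keeps the count
        if (this != &o) { T *q = o.p; release(); acquire(q); }
        return *this;
    }
    ~Ref() { release(); }
    T *operator->() const { return p; }
};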
How do you handle cyclic references with that approach? (root-tracing GC helps a lot in handling those).
How do you suspect it performs in heavily threaded code? (you need mutexing both on the refcount and on your heap).
GC isn’t the end-all-be-all, but for a lot of scenarios it works very well – and it’s going to be a lot of effort emulating in C++, and won’t be a full solution as long as you need to interface with legacy code.
Don’t get me wrong, I’m a sucker for C++ and RAII, but it’s not the only-and-best solution for all scenarios.
Faster than GC if it was to free memory at the same rate.
GC performs better than manual memory management only in a small subset of scenarios. It is true that it is good enough for many more, and that it keeps dumb programmers from freeing objects that are still in use, and the like.
However, as soon as their design assumptions aren’t met, they fail badly. Generational GC aces object creation and destruction benchmarks, but it is equally true that it bloats up until virtual memory dries up when faced with real-world apps that don’t follow the general rule.
GC typically performs better in situations where lots of fairly small memory allocations and deallocations happen in a relatively short period of time. The typical (and usually recommended) practice in C or C++, of course, is to free memory as soon as you are done with it. In this type of situation, that can create a lot of overhead. Garbage collection will use more memory, of course, but will be significantly faster because it optimizes memory deallocation. The infilife algorithm is probably one of the simplest test cases where this pattern can be seen, and where GC runs circles around manual memory management.
I don’t think it’s fair to use the word “dumb”, really. Everyone makes mistakes, or sometimes forgets to free memory when they are done with it. And anyone who tells you they have never de-referenced a pointer that they thought should point at something, but was actually null, is flat out lying. Especially in today’s high-pressure “release or die” environments, where software release cycles tend to be very fast paced.
And of course, I’m sure I don’t have to repeat it here because it is common knowledge these days. But with RAM and CPUs as cheap as they are these days, it’s far more economical to throw another CPU, or throw more RAM at the problem, than it is to spend even a couple of hours trying to heavily optimize code to get just a little more speed or a little less memory consumption out of it.
And of course, that was one of the flaws with this benchmark, which Google admitted flat out. In the real world, no one spends that much time optimizing their C++ code. Not with development cycles as fast paced as they are today.
A little more speed? A little less memory consumption?
A popular Java program these days is Minecraft. I love it myself. But it has cases where, when memory consumption hits 700 MB or so, the game gets insanely choppy. This is the garbage collection happening. It sucks. It’s unusable.
To fix it, the recommendation is to run the 64-bit server JVM and let it use a gigabyte of RAM.
A whole gigabyte, to hold a chunk of active blocks which, if stored in a well-thought out C++ data structure, would only take a hundred megabytes or so.
That sounds like bad programming to me. Probably excessive object creation, which could likely be found and fixed with a profiler actually.
I’ve done a fair amount of Java3D programming myself, and don’t have this problem with things I have written.
Also, this problem is not unique to Java. I’ve seen C++ games that have similar problems for the same reason. Excessive object creation. Without careful planning, that’s an easy trap to fall into in any object oriented language when doing game development with it.
See what I said above. The exact same thing is true in Java. A well-thought out data structure in Java could avoid this problem. Don’t blame Java for bad programming technique on the part of the developer. You could get yourself into the exact same trouble with C++ if you naively used a separate object to hold each block, and then ended up with excessive object creation because of it. This is bad programming technique. Not the fault of Java.
It seems to me that most of the fixes for slow, choppy garbage collection and excessive memory usage in Java are to avoid using the memory allocator. Allocating an array of objects and reusing them, for example. Or discarding and creating a whole new array instead of one object at a time.
This is almost exactly the equivalent of using a pool allocator in C++.
So it always strikes me that to write really efficient Java the programmer has to do as much work as in C++ and sacrifices the easy programming Java is supposed to offer.
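(For what it’s worth, here is a minimal sketch of that pool-allocator idea in C++ — a hypothetical fixed-size pool, ignoring alignment and object construction, just to show how little machinery the equivalent takes:)

#include <cstddef>

template <typename T, std::size_t N>
class Pool {                                  // hypothetical fixed-size object pool
    union Slot { Slot *next; char storage[sizeof(T)]; };
    Slot  slots[N];
    Slot *free_list;
public:
    Pool() : free_list(&slots[0]) {
        for (std::size_t i = 0; i + 1 < N; ++i)
            slots[i].next = &slots[i + 1];    // chain every slot onto the free list
        slots[N - 1].next = 0;
    }
    void *allocate() {                        // O(1): pop a slot off the free list
        if (!free_list) return 0;
        Slot *s = free_list;
        free_list = s->next;
        return s->storage;
    }
    void release(void *p) {                   // O(1): push the slot back
        Slot *s = reinterpret_cast<Slot *>(p);
        s->next = free_list;
        free_list = s;
    }
};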
Being “easy” is not one of the benefits I would actually associate with Java. I mean, other than garbage collection, which can bite if misused, it’s nearly as difficult as C++.
The benefits of Java are more in the cross-platform capabilities, its strong built-in threading and networking support, and its security model.
One of the problems with GC is that it can hide potential performance problems that would be immediately obvious in C++. The classic Java example is probably the String class. New Java programmers often get themselves into trouble because they don’t realize that strings are immutable in Java. So they loop over strings, modify them, change them, append to them, etc. Of course, it works fine. But they wonder why their programs are slow and eat tons of memory. The reason, of course, is that behind the scenes Java is creating a new String each time a change is made, and marking the old one for GC. But unless you are aware that Strings are immutable in Java, you’d never know this from looking at the code. The solution, of course, is to use a StringBuffer or StringBuilder (depending on whether you need thread safety or not) — a lesson new Java programmers often learn the hard way.
Problem with GC is that your applications take a performance penalty at unpredictable times – whenever the GC decides to clean-up. That can adversely affect the rest of the application and you have little to no control over it. Yes, some GC algorithms allow you to specify a good time to do GC clean-up, but AFAIK they are neither popular nor the ‘top dog’ algorithms.
Java’s default GC behavior (you can tweak it with command line switches) is to try to run GC as an idle task. It will try to wait until no other higher-priority threads in the VM are scheduled to run before it does GC. Of course, this isn’t always possible if the application is very busy. In those cases, if memory starts to reach the maximum allowed heap space that is configured for the JVM, then the GC might have to run at a less opportune time.
But ideally, it will try to do it as an idle task. Unless it is forced to do otherwise because it is getting close to hitting max heap size.
…and the whole point is that you aren’t doing it that way when you’re dealing with GC, leading to scenarios where you get much better performance. Right tool for the job and all that, ever heard about it?
Of course you can spend a lot of time developing a different approach in C++, or you can spend time writing a custom memory allocator with different characteristics, but that’s not always time effective.
You generally have to be more careful wrt. deallocation in GC languages since they typically only deal with memory rather than resources, whereas in C++ we can utilize the wonderful RAII. If you think Java and C# programmers are “dumb” and/or lazy because of GC, I suspect you haven’t done much development in either of those languages.
Valhalla,
“Certain language characteristics have obvious performance consequences; for instance, Java, Scala and Go are all garbage collected, which certainly comes with a performance penalty”
That’s an assumption, which may or may not be true in practice.
Generational garbage collection algorithms can provide excellent data locality. C/C++ objects are generally forced to live out their lives at a fixed address, which is often not ideal after address space becomes fragmented.
Malloc/free are more expensive than allocation under certain garbage collection algorithms, such as semi-space collectors, where allocation is absolutely trivial.
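(A minimal sketch of why that allocation path is trivial — it is just a pointer bump; the copying/evacuation phase is left out and all names here are illustrative:)

#include <cstddef>

struct SemiSpace {              // hypothetical two-space heap
    char *bump;                 // next free byte in the active half-space
    char *limit;                // end of the active half-space

    void *allocate(std::size_t n) {
        if (bump + n > limit)
            collect();          // evacuate live objects to the other half, reset bump
        void *p = bump;
        bump += n;              // allocation is just a pointer bump
        return p;
    }
    void collect();             // copying phase omitted in this sketch
};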
The tradeoffs are too complex to make absolute statements about which is better. I wouldn’t be comfortable saying which is better without benchmarking them on a specific project.
The following link advocates garbage collection over malloc/free for performance reasons (although I think their conclusions are biased and overgeneralised)
http://www.ibm.com/developerworks/java/library/j-jtp09275/index.htm…
All this said, I personally prefer to control new/delete myself.
“Eiffel wasn’t measured here was it? Nor was any other language (I know of Vala) which compiles to C?”
If we compare modern languages which are on more equal footing feature-wise, then which is faster: perl or ruby or python?
I’m saying that it doesn’t make sense to compare the *language* (as google claims to have done), one has to compare the *implementation*.
Technically, garbage-collected languages should be faster but just use more memory (faster as long as the extra memory is available). In practice this is almost never the case, but you can construct many examples where Java runs faster than a C++ program because there is no “inline” overhead of memory management.
faster than programs coded in, say, assembly? I kind of find that hard to believe.
Depends on who codes the assembly. I’ve heard once that 90% of ASM developers can’t beat the output of a modern optimizing C/C++ compiler.
That’s very true, it all really depends on the quality of the code. And with Intel’s C++ compiler, a well coded C++ program can get pretty fast depending on the operation.
You talk about ICC as if it had been sent by God from heaven.
In reality it is just +-5% difference on average between ICC and GCC4 on performance side.
And even when you can beat the compiler, often it isn’t worth the time sunk into the asm coding to beat the compiler.
Very few people can write code as good as a modern C++ compiler.
I work with a guy who kept bragging about how great his LU factorization that he hand-coded directly in assembler is. Well, it turns out this clown wrote it using the ancient x87 FP stack; he had never heard of SSE (this was a few months ago).
So, I threw together a quick and dirty LU factorization in C++ (OK, copied it out of Numerical Recipes), compiled it with gcc 4.4 with SSE enabled, and it was about 20 freaking times faster than his “hand tuned assembler”.
And when the next great SSE 4,5,6… comes out, all I do is re-compile:)
Unless they have a really really good reason for doing it, when someone tells you they want to write something in assembler, you should probably back hand them.
Agree completely with you.
Assembler has THE POTENTIAL to produce the fastest algorithms; but only real geniuses get full advantage of that potential.
If you have 1000 men, all of them will be able to cut wood into pieces; 20% of them will be able to build a chair with the wood; but just one or two will be able to create a beautiful, durable, reliable chair.
I’ve done some work in Assembly and I tend to agree with you. It is possible to make Assembly code that’s smaller and more efficient than C/C++, but it is rarely worth it. And most people aren’t going to write asm that’s better than a modern compiler with optimizations turned on.
But it is nice to have the option of falling back on Assembly code when you really need to cut corners.
What did the guy say when you showed him your results?
He actually said: “That occurred because you executed my program in client mode instead of server mode. That makes an enormous difference. Running it in server mode will almost always run circles around client mode.”
Not much, he just kind of stared at me, then started saying some nonsense about how in Russia everything is written in assembler because the compilers are crippled by the CIA, or something to that effect.
Just a small note: it’s assembly (the language), not assembler (the software that creates “object files” from assembly code).
If he’s French, he has half an excuse though : here, we say “assembleur” (which sounds like “assembler”) for both
This paper has some pretty serious issues as has been discussed extensively https://groups.google.com/forum/#!topic/golang-nuts/G8L4af-Q9WE and http://www.reddit.com/r/programming/comments/hqkwk/google_paper_com…
For instance nearly all the optimization for the C++ code could have been applied to the Java code. Also according to Ian Lance Taylor:
Despite the name, the “Go Pro” code was never intended to be an example
of idiomatic or efficient Go. Robert asked me to take a look at his
code and I hacked on it for an hour to make a little bit nicer. If I
had realized that he was going to publish it externally I would have put
a lot more time into making it nicer.
I’m told that the program as compiled by 6g is significantly faster now
than it was when Robert did his measurements.
So yeah, in general a fairly flawed benchmark; see the threads I linked for more details. I’m sure there are equivalent Java- and Scala-biased threads floating around as well!
I agree the paper has multiple problems.
Another issue with the Java portion is that they don’t state whether they ran the JVM in client mode or server mode. That makes an enormous difference. Running it in server mode will almost always run circles around client mode.
Also, it’s a rather theoretical benchmark that says next to nothing about real world performance. As they admit themselves, they didn’t test threading performance or anything. The benchmark is not very indicative of the kinds of operations that most real world applications actually perform.
You sir, are a comedic god.
You, sir. Added absolutely nothing of value to this conversation with this stupid comment.
Well, he did. By calling you a “comedic god” he implied that your comment was pure comedy. Hence he adds value by directing unsuspecting readers of your original comment to the fact that it is COMPLETELY STUPID and could perhaps be interpreted as comedy and not informed discourse.
Well, considering it got voted up several times, apparently most readers didn’t think it was stupid. So maybe that says more about your level of stupidity if you thought it was stupid? Or perhaps it says something about your lack of knowledge of Java, or your lack of knowledge of how real world applications typically behave?
Or it simply points out which programmers are in the majority here. It’s an unfortunate flaw of democracy.
Guys – relax
Well, again. The server vs. client JVM mode makes a very big difference in performance. And, using a different algorithm, I can produce results where Java is 5x faster than C++. So the points I made were completely valid. Also, given that the test was intended specifically to test Scala, the fact that they didn’t test concurrency at all is kind of mind-boggling to me.
They used server mode, as should be obvious from the first paragraph of section V:
“The benchmark consists of a driver … and performs loop recognition on it for 15.000 times. The purpose is to force Java/Scala compilation, so that the benchmark itself can be measured on compiled code.”
The client-mode VM compiles the code right away and then leaves it at that. It’s the server mode that needs a few thousand executions to recompile and optimize critical parts.
While the benchmark is rather theoretical, the issues discovered are real and DO affect real world performance. Everybody uses collections, recursions, etc. Also, I would definitely consider mathematical computations part of the “real world”.
Not true. Both the client and server mode JVMs do JIT compilation. Also, it typically doesn’t take a few thousand executions before JIT can be done.
Some of the issues do, sure. But if I use a different test, I can make Java be 5 times faster than C++. For example, try doing a benchmark with infilife in Java vs. C++. The Java version is literally almost 5 times faster than the C++ version. The reason is because infilife is one of those tests where garbage collection provides a major advantage in that it optimizes memory allocation and deallocation. Sure, if you really want to spend the time on it, you could probably optimize the C++ version to be as fast as the Java version. But the code would be many times larger, and much more complex than the Java version.
The point is, using different code and different operations, I can very easily produce very different results. You can’t take one sample like this and try to say that C++ is the fastest language.
Google even admitted in the paper that optimizing the C++ version was very difficult, and that most times those types of optimizations would not be done.
I tend to agree that it’s not usually worth optimizing code in assembly, however GCC’s code is often less efficient than it should be.
We as programmers can do a better job than GCC at anticipating certain states and algorithmic conditions which give GCC trouble. Under these circumstances, it’s not all that difficult to beat it.
I know ICC is generally better than GCC.
Examples?
Certainly we programmers can do a better job given that we understand the overall purpose of the code much better than any compiler, however I very much disagree with it ‘not being all that difficult to beat it’, you certainly need to be a very good assembly programmer to do so.
Expert programmers who are both well-versed in assembly and in the things they are solving with it will certainly at least equal and most often beat a compiler. A case in point would be encoders like x264 etc.; they are full of hand-optimized assembly which easily beats the generated output of GCC, ICC, LLVM etc. when you disable the hand-optimized assembly and rely entirely on compiler optimization. But these guys are very good assembly programmers and experts at what they are trying to implement with it; this is hardly the norm amongst programmers. Hence very few programmers out there can actually beat the optimized output of compilers.
Valhalla,
“Examples?”
That’s a fair question, but please understand I don’t have very much time to allot to it.
Anyway, here’s a trivial example.
void vector_add(unsigned int *a, unsigned int a_len, unsigned int *b, unsigned int b_len) {
    unsigned int i;
    for (i = 0; i < a_len && i < b_len; i++) {
        a[i] += b[i];
    }
}
Here is the GCC output using -O3 and -march=core2:
vector_add:
push ebp
mov ebp, esp
push edi
mov ecx, DWORD PTR [ebp+8]
push esi
mov edi, DWORD PTR [ebp+20]
push ebx
mov esi, DWORD PTR [ebp+16]
mov ebx, DWORD PTR [ebp+12]
test ebx, ebx
je .L5
test edi, edi
je .L5
xor edx, edx
.L3:
mov eax, DWORD PTR [esi+edx*4]
add DWORD PTR [ecx+edx*4], eax
inc edx
cmp ebx, edx
jbe .L5
cmp edi, edx
ja .L3
.L5:
pop ebx
pop esi
pop edi
leave
ret
Now I’m not sure how many people read asm, but it’s short and not difficult to follow.
GCC failed to vectorize the code with SSE2 instructions. This would be an obvious way to significantly optimize the example (both in terms of memory bandwidth and in terms of parallel computation).
Secondly, GCC failed to recognize that it isn’t necessary to carry two compare operations into the loop. It should be using a cmov to calculate the minimum number of loops upfront, and then there would only be a single condition in the loop.
Thirdly, although there is no performance penalty in this case, GCC generally manipulates indexes instead of converting loops to use pointer arithmetic. Pointer arithmetic is far more efficient in the case when a shortage of general-purpose registers forces variables onto the stack. I’ve seen GCC load both the base pointer and index from the stack and add them inside the loop at every iteration. (A pointer-walking version of the loop is sketched below, after these points.)
Fourthly, it didn’t unroll any loops. Personally I will usually trade off in favour of instruction size anyway, but it’s interesting because I specified -O3. I’m sure we could get GCC to unroll loops by manually setting the heuristics, but an assembly programmer would have yet another easy edge over the default compiler output.
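(As mentioned in the third point, here is a pointer-walking form of the same loop — identical semantics, shown only to illustrate the kind of rewrite an assembly programmer would do by hand; how much it gains depends on register pressure:)

void vector_add_ptr(unsigned int *a, unsigned int a_len,
                    unsigned int *b, unsigned int b_len)
{
    unsigned int n = a_len < b_len ? a_len : b_len;
    unsigned int *end = a + n;
    while (a < end) {
        *a++ += *b++;           /* walk both arrays directly, no index register */
    }
}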
“however I very much disagree with it ‘not being all that difficult to beat it’, you certainly need to be a very good assembly programmer to do so.”
You may be assuming that GCC does a good job in the first place. It’s usually ‘good enough’, but it isn’t great.
“Hence very few programmers out there can actually beat the optimized output of compilers.”
Maybe the example above will change your mind?
I don’t have ICC on hand to compare, but it is said to be excellent at SSE2 and is probably also better at optimization all around.
Well, I will repeat to you what I say to everyone I meet who asks my advice (not your case) on programming: first look for a good, tested implementation. Many people try to code fragments that have already been optimized for ages. They don’t look at Donald Knuth’s ACP, FFTW, LAPACK, ATLAS and much other well-thought-out code before they start and, as such, write subpar code.
For example, if you change your code to:
void vector_add(unsigned int *a, unsigned int a_len, unsigned int *b, unsigned int b_len)
{
    unsigned int i;
    for (i = a_len < b_len ? a_len : b_len; i > 0; ) {
        i--;
        a[i] += b[i];
    }
}
You get pretty decent optimized code:
vector_add:
.LFB0:
.cfi_startproc
cmpl %esi, %ecx
cmovbe %ecx, %esi
testl %esi, %esi
je .L1
.p2align 4,,10
.p2align 3
.L4:
decl %esi
mov %esi, %eax
movl (%rdx,%rax,4), %ecx
addl %ecx, (%rdi,%rax,4)
testl %esi, %esi
jne .L4
.L1:
rep
ret
.cfi_endproc
As can be clearly seen, compilers are not black-magic wizards and, regrettably, depend on the quality of the code we give them, and we should never forget that.
acobar,
“Well, I will repeat to you what I say to everyone I meet who asks my advice (not your case) on programming: first look for a good, tested implementation. Many people try to code fragments that have already been optimized for ages. They don’t look at Donald Knuth’s ACP, FFTW, LAPACK, ATLAS and much other well-thought-out code before they start and, as such, write subpar code.”
I actually don’t encourage people to optimize things at such small levels or in assembly. I’m just disturbed that so many people are under the false impression that their compilers are so difficult to beat.
“You get a pretty decent optimized code:”
Yes, I am aware that we can coerce the compiler to generate better instructions by rewriting the C code.
(Although it still isn’t as good for the other reasons already discussed).
It concerns me that GCC converts our C code to assembly so literally instead of generating more optimal paths.
Should we really encourage devs to adjust their programming style to compensate for GCC optimizer deficiencies?
Does this mean that C devs will have to understand CPU internals in order to have GCC produce optimal code?
I really don’t think we should be making excuses for GCC here. There is nothing unreasonable about expecting the original code to be optimized by the compiler.
Edit: By the way I am guilty of rewriting C code at the micro level in order to help GCC optimizer along. But we shouldn’t have to do this in the first place.
Perhaps I should have said it in a more direct style: I do not encourage people to write optimized code; I tell them to look for good algorithms and, better yet, for good libraries whenever they are available.
To me, actually, all this talk about the best language to code in is a bit misleading; we should clearly concentrate on algorithms. We don’t know what we will be coding in 20 years from now, but we know that good algorithms will probably survive, like they did in the past.
Sincerely, for most applications it does not matter a dime whether you spend 1 millisecond or 1 microsecond doing something, even though there is a factor of 1000 involved. And for the ones where it does matter, there is a good chance that the problem is already well known and that a wise compromise has been devised.
I see people everyday praising languages that make them write less code, or type less, or write more comfortable (understandable) code, or whose syntax they like more. And all those arguments are sensible.
I particularly like languages that have all the tools I feel I may need: GUI generators, automation by scripts, code templates (not necessarily like in C++), profilers (for the very rare occasions where they are needed), many modern OO concepts and, of course, the ability to link against good, well-known, tested libraries.
To be even more direct: people should learn more before they code, unlike what we so repeatedly see. This would save time on rewrites and later tuning.
OT: I enjoyed this thread… Though I cannot contribute here, the only thing I can say is that I like OSnews because it encourages technical discussions like this. Thank you guys!!!
acobar,
“Perhaps I should have said it in a more direct style: I do not encourage people to write optimized code; I tell them to look for good algorithms and, better yet, for good libraries whenever they are available.”
I’m not sure what this has to do with my original post, but we seem to be on the same track.
“To me, actually, all this talk about the best language to code in is a bit misleading; we should clearly concentrate on algorithms.”
Ditto.
“Sincerely, for most applications it does not matter a dime whether you spend 1 millisecond or 1 microsecond doing something, even though there is a factor of 1000 involved.”
Well I actually disagree here, but that’s just me.
“And for the ones where it does matter, there is a good chance that the problem is already well known and that a wise compromise has been devised.”
Yes, by and large, it is very difficult to write truly original algorithms because it’s extremely likely thousands of other developers have tried to solve the same problems before us. However, just because someone else has already done it doesn’t strictly mean that we need to copy and paste every single line of code either.
“I see people everyday praising languages that make them write less code…And all those arguments are sensible.”
Yes they are.
I guess I read too much into your arguments on:
I understand that what you gave was just a very simple example about code optimization and the common-sense idea people have that compilers outsmart our flesh brains, which we both see as unconnected to reality. But we should not forget that compilers are coded by biological primates, possibly very smart ones, and built over hours orders of magnitude greater than the time we have available to complete our work; they are still creations of human brains and, as such, will always leave room for a little helping hand.
Anyway, I was just trying to emphasize a very common fault in our CS education. People feel a push toward optimizing but, at the same time, lack proper knowledge about algorithms, computer architecture and code generation and how they are linked. Because of that, they frequently write subpar code that they should not. That is why I vehemently push them towards libraries.
acobar,
“Anyway, I was just trying to emphasize a very common fault in our CS education. People feel a push toward optimizing but, at the same time, lack proper knowledge about algorithms”
You really think CS people feel a push towards optimizing? I’ve tried promoting my talent for optimizing algorithms and there’s been zero interest in it from employers or clients.
As such, I end up losing the argument that optimization is important on the grounds that if it were important, then employers would be willing to pay for it.
“they frequently write subpar code that they should not. That is why I vehemently push them towards libraries.”
That sounds like generally good advice. The quality of libraries isn’t always top notch though.
For example, programmers are better off writing their own arbitrary precision math code than relying on Java’s own classes (is this still true today?)
http://people.cis.ksu.edu/~rhowell/calculator/comparison.html
I know what you are saying though. I’d say it depends.
I think code clarity is more important than performance most of the time.
Are you saying that GCC should have turned the code you wrote into the code by @acobar? Or that it should magically guess the intentions in some code section?
I don’t see a compiler failing to do that qualify as a “deficiency”.
vodoomoth,
“Are you saying that GCC should have turned the code you wrote into the code by @acobar? Or that it should magically guess the intentions in some code section?”
Not exactly, no.
It should have recognized that it would achieve the exact same logic semantics had it precomputed the number of iterations up front instead of testing array bounds on each iteration. This is a fairly common optimization for many use cases.
I’d argue that the C/C++ developer shouldn’t have to do black magic in order to optimize their code either.
Alfman’s use of for loops is perfectly normal, easy to understand, and you’ll find it in every book on C. One shouldn’t have to sacrifice code clarity like you did in order to get better performance, especially on such trivial code.
Sorry, but I fail to find where the code I showed is “less understandable”. It is actually a very well known technique, i.e., reduce the number of unneeded comparisons.
Anyway, it was used only to point out one of the many gross assumptions people do about code efficiency, compiler optimizations and related stuff. We clearly should teach more about algorithms than about languages like we do today.
I was thinking about how you used the ternary operator and put the loop’s decrement inside the loop’s body instead of putting it within the for() itself.
But then wouldn’t there be a feeling among both students and their future employers that they have learned lots of pretty theory, but not how to actually do things in practice?
Also, is there such a thing as a language-independent algorithm? Different languages have very different ways of expressing things, and algorithmics courses generally teach students an imperative algorithmic language pretty close to the structure of Pascal and C.
I hear this often in the professional world, as if theory was the natural antonym of “practice”. But do you think that Oracle made the piles of money they’ve made through the years without using Codd’s relational algebra? Instead of relying on “indexes”, hints and DBAs to make SQL queries more efficient, it is sometimes faster to rewrite a query, which means you know about relational algebra, you understand that it is always better to reduce the number of tables in a join, you know when nested queries are a valuable contribution to the overall efficiency. In that specific real-life case I’m implicitly referring to, the query was rewritten by me and the response time that was above the 10-minute timeout fell to something like 30 seconds on average.
In practice, the goal is to query the database (or execute code on the CPU). Any elements that get me closer to that goal, be they elements of theory, are welcome.
My point is: knowing the theory behind things doesn’t take a thing away from people. If I was to hire someone to fill a position, I’d rather take the guy who knows the theory. And, as a side note, the “professionals” I’ve worked with who disparaged “nice theory” not only have terrible justifications for their bad practices, but they are resistant to thinking outside the box. Err, “their” box.
I’m still waiting to see an example of theory ruining or impeding some work instead of making it better IN PRACTICE.
vodoomoth,
Speaking of Oracle’s optimizations:
I was terribly disappointed with the Oracle 9i cost based optimizer, or was it 10GRAC? Things tend to blur after a few years.
After upgrading, many of the resulting queries went from nearly instantaneous to 30s or several minutes long. It was the version which started preferring hash joins for everything, even when /*+FIRST*/ was requested. Updating statistics for all tables had no effect. There may have been a few DB links in there, but I don’t recall them being the cause.
In theory the cost based optimizer should have been better than the rule based optimizer.
At the time we were told to either add /*+RULE*/ or manually specify an index hint, either of which solved the problem. Oracle always had us running patched versions to solve one glitch or another.
I do remember tracking down some tricky Oracle logic bugs, such as the time it thought to_date('2099-12-31')+1 > to_date('2099-12-31') was false.
This bug manifested itself as broken business rules, we ultimately traced the 2099-12-31 value to bad user data.
Ah oracle, still my favorite RDB though! These days I work predominantly with mysql (yuck), also owned by oracle.
If you are not a wizard you should not be programming C/C++. There are other languages that are much better suited for high-level programming. C/C++ are only superior languages in enabling more power to a powerful wizard, in unskilled hands they are at best teaching/testing tools.
Yup, but you can do low-level programming (which makes the use of HLLs much harder, even though some projects like MS Singularity try to use them anyway) and still wish you didn’t have to care about the internals of the compiler’s optimizer and to do compiler-specific wizardry in order to optimize your code.
I’m pretty sure this is a terrible idea because it will mess up all the cache lines and pre-fetching if you walk through memory backwards.
No. Prefetchers (in every processor I’ve touched) support negative strides, and working backwards has no performance problems.
Maybe GCC should have been able to optimize one of the len comparisons away, but really it’s (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp ‘the bigger picture’.
Secondly it seems obvious that gcc doesn’t choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization (I seriously doubt ICC would do so either with this snippet). Case in point: I changed the ‘len’ variables to a constant (100) in your snippet, which resulted in the following (note: 64-bit):
leaq 16(%rdx), %rax
cmpq %rax, %rdi
jbe .L11
.L7:
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L4:
movdqu (%rdx,%rax), %xmm1
movdqu (%rdi,%rax), %xmm0
paddd %xmm1, %xmm0
movdqu %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $400, %rax
jne .L4
rep
ret
.L11:
leaq 16(%rdi), %rax
cmpq %rax, %rdx
ja .L7
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L2:
movl (%rdx,%rax), %ecx
addl %ecx, (%rdi,%rax)
addq $4, %rax
cmpq $400, %rax
jne .L2
rep
ret
Setting the constant to 5 or above made GCC use vectorization for the loop.
Granted, calls to this function would not be using constants in a ‘real program’ but the compiler would likely have a much better chance of deciding if it should vectorize the loop or not.
As for loop unrolling, it is not turned on by -O3 because it is difficult to estimate its efficiency without runtime data (it’s turned on automatically when PGO is used). Does any compiler turn this on by default? (GCC and Clang/LLVM don’t.)
“Maybe GCC should have been able to optimize one of the len comparisons away, but really it’s (deliberately?) stupid code imo and there will always be these cases where the compiler fails to grasp ‘the bigger picture’.”
Stupid code? Sure it was a trivial example, but that was deliberate. You’ll have to take my word that GCC has the same shortcomings on more complex code from real programs.
Even if you want to blame the programmer here, a seasoned programmer will have no reasonable way of knowing if GCC has optimized the loop correctly without looking at the assembly output. Do we really want to go down the route of saying programmers need to check up on GCC’s assembly output?
Should students be taught to avoid legal C constructs which give the GCC optimizer a hard time?
“Secondly it seems obvious that gcc doesn’t choose to use vectorization since it has no idea of how many integers are to be added, and depending on that it could very well be less efficient to use vectorization”
I have no idea why GCC chose not to use SSE, but the result is still that the assembly language programmer would be able to beat it.
“(I seriously doubt ICC would do so either with this snippet)”
I wish someone could test this for us, Intel boasts very aggressive SSE optimization.
You are forgetting something in your examples.
It used to be so that most humans could beat compiler generated code. In this day and age it is only true for small code snippets or simple processors.
Most up-to-date processors use out-of-order execution with superscalar processing units, and translate CISC instructions into RISC-like microcode. And this varies from processor model to processor model within the same family even!
It is very hard for most humans to keep all processor features in their head while coding assembly and still be able to beat the code generated by high-performance compilers. Not GCC, but the ones you pay several thousand euros/dollars for, with years of research put into them.
moondevil,
“You are forgetting something in your examples.”
Fair enough, but what?
“It used to be so that most humans could beat compiler generated code. In this day and age it is only true for small code snippets or simple processors.”
I was asked for an example, which I provided. Then I provided another example with division. In these examples, I’m not aware of any processor for which GCC generated optimal code, which is what I set out to demonstrate; Nothing more, nothing less.
“Most up-to-date processors use out-of-order execution with superscalar processing units, and translate CISC instructions into RISC-like microcode. And this varies from processor model to processor model within the same family even!”
I have some observations:
1. In practice x86 binary code is precompiled and shared among many models. If you have an Intel Core2 Q6600, you cannot simply purchase/download software specifically for that model.
2. I don’t believe GCC even allows model-specific compilation (I doubt ICC does either, but I could be wrong). You can specify a processor family, but that’s it.
“It is very hard for most humans to keep all processor features in their head while coding assembly and still be able to beat the code generated by high-performance compilers. Not GCC, but the ones you pay several thousand euros/dollars for, with years of research put into them.”
Well there you go, in conclusion it seems that you do agree with my argument that we can beat GCC? I’m not moving goalposts here, this is what I said from the beginning – it’s even in the title of the thread.
Having two comparisons within the loop was obviously poor coding in this example, which you expected the compiler to fix for you.
If you find that the performance is not what you’d expect out of the given code, you will profile and look at assembly output of the performance hotspots, doesn’t matter if it’s GCC, VC, ICC, Clang/LLVM.
Legal constructs do not equal efficient code. Compilers have never been able to turn shitty code into good code. If you know of one, please inform me, I’d buy it in a second.
I compiled your snippet with Clang 2.9, it didn’t vectorize it either until I exchanged the len vars with constants just like in the case with GCC. Again I doubt ICC would do it either.
Again, this function within the context of a whole program would likely yield a different result (definitely if PGO were used). For instance, it would be interesting to dissect the output of some of the micro-benchmarks over at the language shootout.
On a slightly off-topic note, does anyone have any experience with the ekopath4 compiler suite? It appears that it is to be released as open source (GPLv3) and, judging by the performance benchmarks, it appears to offer some GPGPU solution:
http://www.phoronix.com/scan.php?page=article&item=phoronix_dirndl_…
“Having two comparisons within the loop was obviously poor coding in this example, which you expected the compiler to fix for you.”
I think it surprised you that I was able to come up with an example, and now you’re grasping at straws… I won’t hold you to your original statements; don’t feel the need to defend them.
“If you find that the performance is not what you’d expect out of the given code, you will profile and look at assembly output of the performance hotspots, doesn’t matter if it’s GCC, VC, ICC, Clang/LLVM.”
The difference is, there is no need to audit the compiler if it can be trusted to do a great job in the first place. The fact that we can reveal shortcomings by looking at GCC’s asm dump implies that we are able to do better.
“Legal constructs do not equal efficient code. Compilers have never been able to turn shitty code into good code. If you know of one, please inform me, I’d buy it in a second.”
Still more excuses. Why does GCC reorder and optimize some code paths but not others? The developer shouldn’t have to mess with clean code just to make it perform better under GCC.
“I compiled your snippet with Clang 2.9, it didn’t vectorize it either until I exchanged the len vars with constants just like in the case with GCC. Again I doubt ICC would do it either.”
That’s not really a sufficient answer.
Hardly, depending on the number of integers to add and the memory alignment of the integer data (both of which are unknown to the compiler in this example), vectorizing this loop may very well turn out slower afaik. There’s a reason all the compilers support sse intrinsics, they’re anything but general purpose registers. Both GCC and Clang/LLVM are considered strong compilers, neither of them vectorized this snippet. You claim this proves them ‘not all that great’, I say this 6-line example is anything but conclusive.
You are basically saying that you know that a vectorized loop will outperform a non-vectorized loop no matter what the data length and data alignment is?
Because that’s what this example entails, the compiler knows nothing about the data length and the data alignment at compile-time. And not knowing this, the compilers (GCC and Clang) chose not to vectorize. When I gave it the data length both compilers vectorized the loop (as I showed in an earlier post).
Because no compiler is perfect, and again clean code does not equal efficient code. There’s a reason you don’t start questioning the compiler optimizations until you’ve questioned the actual algorithm.
It was a statement: neither GCC nor Clang/LLVM vectorized your snippet, and like I said I doubt ICC would either.
Despite our bantering (or perhaps because of it!), I have to say it’s fun discussing technical stuff here on OSNews once in a while. So thanks alfman, acobar and others for participating in this (imho) interesting discussion
btw I’m surprised f0dder hasn’t weighed in, I seem to recall him from win32 assembly forums way back in the day (maybe my memory is playing tricks on me)!
Valhalla,
“There’s a reason all the compilers support sse intrinsics, they’re anything but general purpose registers.”
In theory, one could create intrinsics for every single assembly opcode available and then claim that it is the developer’s fault that they don’t get used. However C is used by devs who don’t want to program at the opcode level.
If you can get away with using intrinsics instead of inline assembly, then sure, go ahead. But they are not very portable between compilers nor architectures.
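(Just to be concrete, a vectorized vector_add written with SSE2 intrinsics might look roughly like this — a sketch only, with a hypothetical name, using unaligned loads so no alignment guarantee is needed, and keeping the original semantics:)

#include <emmintrin.h>   /* SSE2 intrinsics */

void vector_add_sse2(unsigned int *a, unsigned int a_len,
                     unsigned int *b, unsigned int b_len)
{
    unsigned int n = a_len < b_len ? a_len : b_len;
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {                    /* 4 ints per iteration */
        __m128i va = _mm_loadu_si128((__m128i const *)(a + i));
        __m128i vb = _mm_loadu_si128((__m128i const *)(b + i));
        _mm_storeu_si128((__m128i *)(a + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)                              /* scalar tail */
        a[i] += b[i];
}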
And not every opcode we want to optimize has intrinsics. You haven’t addressed the division example, I’d be very grateful if you could find a way to optimize 64bit / 32bit -> 32bit without using assembly.
“Both GCC and Clang/LLVM are considered strong compilers, neither of them vectorized this snippet.”
Well if you say so, GCC usually doesn’t score highly on benchmarks.
“Hardly, depending on the number of integers to add and the memory alignment of the integer data (both of which are unknown to the compiler in this example)”
Even so, GCC did a bad job.
Change the example so that the function only accepts one length, and GCC produces the SSE code. So it’s clear that an unknown length was not the factor.
“Because that’s what this example entails, the compiler knows nothing about the data length and the data alignment at compile-time. And not knowing this, the compilers (GCC and Clang) chose not to vectorize. When I gave it the data length both compilers vectorized the loop (as I showed in an earlier post).”
This is not strictly true, GCC can tell exactly how long the arrays are by looking at the rest of the program, it simply chooses not to optimize that way.
But this brings up another good point, which speaks to my first post (repeated here) “We as programmers can do a better job than GCC at anticipating certain states and algorithmic conditions which give GCC trouble.”
There may be times when the developer knows things which we cannot reasonably expect the compiler to derive, nor does the C language provide the means for us to tell it. The result of this uncertainty is poorer optimization.
“It was a statement: neither GCC nor Clang/LLVM vectorized your snippet, and like I said I doubt ICC would either.”
We’ll have to leave it to the unknown.
Your view is too extreme for my liking. What if an optimizer does not eliminate loop invariants because the developer’s code calculated them each time? We could blame the “shitty code” instead of the compiler there too. In fact any time the compiler failed to optimize but did a literal translation of logic, we could argue the developer is to blame, right?
If that is not your view, then what is the criteria for non-optimizations which are the developer’s fault versus those which are the compiler’s fault?
(and I won’t accept this answer: if GCC doesn’t handle it, then it’s the developer’s fault)
“Despite our bantering (or perhaps because of it!), I have to say it’s fun discussing technical stuff here on OSNews once in a while.”
I much prefer technical stuff to gadgetry hype, but maybe I’m just jealous that I cannot afford a lifestyle where many gadgets come into play.
Sure, in theory, but in practice we don’t need that; however, pretty much every compiler out there supports SSE intrinsics, and there’s a reason for that.
I haven’t seen such an example? Show me the code!
I haven’t seen any fresh benchmarks outside Phoronix in quite some time (and GCC does very well there), sadly multimedia mike stopped doing his smackdowns, do you know of any fresh online comparisons between gcc/icc?
Is this back again to the two comparisons or are you suggesting that GCC should magically know the length of these arrays at compile-time? I’m not sure I follow.
Funny, I find your views too extreme; I think you need to be a good assembly programmer in order to beat a current optimizing compiler. Obviously there will be specific instances where the compiler generates subpar code; if that weren’t the case, the compiler devs wouldn’t have to work so hard on improving the optimization release after release. But generally I believe it holds true, and this 6-line function did not convince me otherwise.
Heh, man, if there’s one thing I don’t need it’s another gadget to distract me. Instead of writing this post I should be finishing up as much work as I can before going on vacation, and yet I find myself here.
Please don’t, I’ve found discussing this with you very interesting even if we hold what seems like opposing views, and it’s not as if any of the comment discussions here on OSNews are anything but inconsequential.
Again thanks for this discussion Alfman.
Valhalla,
“Sure, in theory, but in practice we don’t need that; however, pretty much every compiler out there supports SSE intrinsics, and there’s a reason for that.”
Intrinsics are specific to the compilers and architecture; some people consider it bad practice to use non-compliant C code, but maybe they’re living in the past.
“Is this back again to the two comparisons or are you suggesting that GCC should magically know the length of these arrays at compile-time? I’m not sure I follow.”
I don’t consider it black magic at all. Every low level assembly optimization we make by hand could be codified into rules for GCC to use as well. GCC’s bag of optimizations seems to be missing a few tricks.
“Funny, I find your views too extreme; I think you need to be a good assembly programmer in order to beat a current optimizing compiler. Obviously there will be specific instances where the compiler generates subpar code”
I don’t think most programmers should use assembly either, I’m just pointing out deficiencies. This was a simple case, no black magic here. The GCC compiler does other optimizations which are far more convoluted.
Should there be zero overlap between the optimizations which could have been done by the C programmer and those done by the compiler? Hopefully not. Many are specific to the architecture.
I was hoping you’d answer this question:
“…then what is the criteria for non-optimizations which are the developer’s fault versus those which are the compiler’s fault?”
“I haven’t seen such an example? Show me the code!”
I mentioned it verbally already, but now it’s been codified below:
uint64_t a;
uint32_t b;
uint32_t c;
scanf("%lld %d", &a, &b);
c = a/b; // 64bit divide by 32bit store into 32bit
printf("%lld / %d = %d\n", a, b, c);
return 0;
call __isoc99_scanf
mov ebx, DWORD PTR [esp+32]
mov esi, DWORD PTR [esp+36]
mov edi, DWORD PTR [esp+44]
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+8], edi
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+12], 0
call __udivdi3
mov DWORD PTR [esp+16], edi
mov DWORD PTR [esp+8], ebx
mov DWORD PTR [esp+12], esi
mov DWORD PTR [esp+20], eax
mov DWORD PTR [esp+4], OFFSET FLAT:.LC1
mov DWORD PTR [esp], 1
call __printf_chk
Obviously I want to use the x86 DIV instead of emulating an unnecessary 64bit division with udivdi3.
A single DIV opcode is *exactly* what *I* need. However, arguably the compiler cannot be sure that there won’t be an overflow (so they’re semantically different). Yet even when I guarantee there’ll be no overflow, GCC still won’t optimize it.
scanf("%lld %d", &a, &b);
if ( (a>>32) < b ) {
    c = a/b; // result must fit in 32bits
} else {
    c = -1; // overflow
}
printf("%lld / %d = %d\n", a, b, c);
call __isoc99_scanf
mov ebx, DWORD PTR [esp+36]
mov esi, DWORD PTR [esp+44]
mov edi, DWORD PTR [esp+32]
mov eax, -1
cmp ebx, esi
jae .L25
mov DWORD PTR [esp+8], esi
mov DWORD PTR [esp+12], 0
mov DWORD PTR [esp], edi
mov DWORD PTR [esp+4], ebx
call __udivdi3
.L25:
mov DWORD PTR [esp+16], esi
mov DWORD PTR [esp+8], edi
mov DWORD PTR [esp+12], ebx
mov DWORD PTR [esp+20], eax
mov DWORD PTR [esp+4], OFFSET FLAT:.LC1
mov DWORD PTR [esp], 1
call __printf_chk
I’d be very happy if you could fix GCC’s output, but I haven’t found a way. This case has particularly poor performance implications.
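(For the record, the obvious escape hatch is a bit of inline asm. A minimal sketch — GCC extended asm on 32-bit x86, with a made-up helper name, and it assumes the caller has already guaranteed the quotient fits in 32 bits, otherwise DIV faults:)

#include <stdint.h>

static inline uint32_t div64_32(uint64_t a, uint32_t b)
{
    uint32_t quot, rem;
    uint32_t lo = (uint32_t)a;
    uint32_t hi = (uint32_t)(a >> 32);
    /* DIV takes the 64-bit dividend in edx:eax and leaves the
       quotient in eax and the remainder in edx. */
    __asm__ ("divl %4"
             : "=a" (quot), "=d" (rem)
             : "0" (lo), "1" (hi), "r" (b));
    return quot;
}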
Maybe GCC has the same limitation with 64 bit?
“I’ve found discussing this with you very interesting even if we hold what seems like opposing views, and it’s not as if any of the comment discussions here on OSNews are anything but inconsequential.”
Oh, I enjoy discussing it too. But pushing unpopular views can also be an unrelenting labor. It seems that the drive to “optimize stuff” is a legacy with more footholds in the past than the future.
When you need maximum performance for a piece of code, yes. When you don’t, simply don’t worry about it – not generating optimal code doesn’t matter if the code isn’t a hotspot.
That code cannot be vectorised as it is written, since ‘a’ and ‘b’ may point to overlapping areas of memory.
The ‘restrict’ keyword is necessary.
burnttoys,
“That code cannot be vectorised as it is written, since ‘a’ and ‘b’ may point to overlapping areas of memory.
The ‘restrict’ keyword is necessary.”
Wow, thanks burnttoys, you are correct, some constructive criticism!
It should be using restrict just like ‘memcpy’ does.
Unfortunately I tried it and it didn’t change the output at all. Additionally I see a lot of devs claiming that the restrict keyword doesn’t have any effect under GCC, can you confirm that?
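(For reference, the variant I tried looked something like this — plain restrict under C99; GCC also accepts __restrict__ when compiling as C++ or pre-C99 C:)

void vector_add(unsigned int *restrict a, unsigned int a_len,
                unsigned int *restrict b, unsigned int b_len)
{
    unsigned int i;
    for (i = 0; i < a_len && i < b_len; i++) {
        a[i] += b[i];   /* 'restrict' promises a and b never alias */
    }
}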
I couldn’t tell you more about GCC vectorisation in any great detail.
I have seen restrict et al work (that is generate vector code) on ARM for the NEON instruction set but right now I can’t remember if that was RVCT or GCC. If ARM is your thing it might be worth checking the linaro.org builds of GCC. For x86 – I’m afraid I’ve no idea. I just hit “build” in Qt Creator as none of my x86 code is really that performance critical anymore!
burnttoys,
There seems to have been a small army of devs reporting issues with ‘restrict’ since 2002, which have been marked as dups of each other. They ultimately point to this bug, which indicates a fix in GCC 4.5.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14187
After identifying that gcc didn’t perform these optimizations, have you considered trying gcc with the options to enable them? -funroll-loops -ftree-vectorize.
It is not fun blaming the compiler for not doing optimizations it hasn’t been asked to do. I know the default optimizations suck, but that is a well known problem with gcc.
The auto-vectorizer isn’t very good with integers, but give it a try.
Aren’t those supposed to be automatically enabled by -O2 or -O3?
Depends on the gcc-version.
I just double-checked: According to info:gcc the most recent version (4.6) has enabled -ftree-vectorize on -O3, but still not -funroll-loops.
Unrolling is only enabled by default if you compile with profiling data that helps the compiler unroll the correct loops.
Thanks for the clarification.
I think I can understand the rationale behind that decision: loop unrolling sounds like a rather costly optimization if mindlessly applied to all loops without PGO coming into play. But in this case, where PGO is not used, it indeed reduces the performance of the GCC binary.
Carewolf,
Thanks for the feedback.
GCC does use SSE for me, with and without ‘-ftree-vectorize’ (when I tweak the C source code).
I couldn’t get GCC to do vector math without changing the source file to pre-calculate the number of loop iterations for GCC. In a case as simple as this, the compiler should have been able to handle it.
I know Valhalla is complaining about this specific example (not sure why?), but I do frequently come across issues like this in much more complex code where GCC misses an equally trivial optimization.
I also rarely see GCC vectorizing integer loops (it seems to choke on signed vs unsigned). GCC does slightly better with floating-point loops, and it helps if you allow it to break some strict math rules.
As I noted in another comment the newest version of gcc now has -ftree-vectorize in -O3, so I was not fully up to date in my first reply.
If you compile for AMD64, SSE is automatically used for all floating-point math (not vectorized, one value at a time). You can also use SSE for math on IA32 using -mfpmath=sse. SSE math is generally much faster but removes the 80bit-temporaries quirk that x87 math has.
Carewolf,
“I also rarely see GCC vectorizing integer loops (it seems to choke on signed vs unsigned). GCC does slightly better with floating-point loops, and it helps if you allow it to break some strict math rules.”
Really? I’d be surprised if signed vs unsigned was the reason it chokes. In two’s complement, the calculation is identical regardless of sign.
0x43 + 0xfe (this is -2) = 0x41 (+carry flag)
0x43 - 2 = 0x41
0xf0 (this is -0x10) + 0xfd (this is -0x3) = 0xed (this is -0x13) (+carry flag)
0xfe (this is -2) * 0xfb (this is -5) = 0x0a (0xf9 carry)
In other words, GCC doesn’t even have to care whether a variable is signed or unsigned in order to do “+ – *”.
unsigned int x=-2;
unsigned int y=-10;
unsigned int z=49;
printf("%d\n", x*y*z); // gives us 980
GCC merely ignores the carry (this is another one of my optimization gripes, in fact: assembly programmers can use the carry flag, while C programmers have to do computations using the next larger word size, sometimes wasting a register).
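To make the gripe concrete (my own toy example, nothing from the paper): to observe the carry of a 32bit addition in portable C you have to widen to 64 bits, even though the CPU already has that bit sitting in the carry flag.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a = 0xFFFFFFF0u, b = 0x20u;
    uint64_t wide  = (uint64_t)a + b;          /* widen just to observe the carry */
    uint32_t sum   = (uint32_t)wide;           /* low 32 bits: 0x00000010 */
    uint32_t carry = (uint32_t)(wide >> 32);   /* the carry bit: 1 here */
    printf("sum=0x%08" PRIx32 " carry=%" PRIu32 "\n", sum, carry);
    return 0;
}

An assembly programmer gets that same bit for free in the flags after the 32bit add.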
“If you compile for AMD64, SSE is automatically used for all floating-point math (not vectorized, one value at a time). You can also use SSE for math on IA32 using -mfpmath=sse. SSE math is generally much faster but removes the 80bit-temporaries quirk that x87 math has.”
Yes, it’s good to get away from quirky FP design.
Edit:
This C/assembly debate seems to come up all the time, the next time it does, I might just sit it out. It takes so much time to make the case, and at the end it’s totally inconsequential.
…not to mention that they’re dealing with pretty darn specific algorithms that can take advantage of SSE and benefit from hand-rolled register allocation.
Not so much with your typical run-of-the-mill software.
I think you’re confusing “programmers” and “I fired up a shell and a compiler and I think I can program stuff”.
Seriously, in fact.
Sure, few will code in asm for the hell of it (except MenuetOS people ;-), but anyone who’s actually a real programmer will insert asm wherever he properly needs it.
Coding everything in asm is just too time-consuming and can be major hair-pulling due to the size of the projects without a strong incentive to manage things (like Menuet does).
_xmv,
“Sure, few will code in asm for the hell of it (except MenuetOS people ;-), but anyone who’s actually a real programmer will insert asm wherever he properly needs it.
Coding everything in asm is just too time-consuming and can be major hair-pulling due to the size of the projects without a strong incentive to manage things (like Menuet does).”
GCC could improve in the future, but until then, the common perceptions don’t do justice to the skill of assembly level programmers.
I still think that for most programmers/projects, there are very strong reasons not to go down the ASM route, regardless of asm skill.
It’s usually too much of a maintenance burden, and I don’t have the time or hardware to test all the hardware combinations that GCC/C can support.
There are some times when I’m not left with much choice though.
In the last performance-critical project I worked on, I needed to divide a 64bit number by a 32bit number such that the result would always fit in a 32bit register. Obviously this is exactly what the x86 DIV instruction has done since the 386.
Unfortunately, GCC (on x86) cannot optimize this case and instead emulates a full 64bit division (64bit numerator, divisor and quotient) in a hidden call to “__divdi3”, which has terrible performance.
GCC does what’s required of the C spec (full 64bit division, and then demoting to 32bit), but it doesn’t bother trying to optimize the case.
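For the curious, this is roughly the kind of workaround I ended up with; a hedged sketch only (the function names and constraints are my own, assuming GCC-style inline asm on x86, not the exact code from the project):

#include <stdint.h>

/* Plain C: on 32bit x86, GCC turns this into a call to __udivdi3. */
uint32_t div64_32_c(uint64_t n, uint32_t d)
{
    return (uint32_t)(n / d);
}

/* One DIV instruction instead. The caller must guarantee that the
   quotient fits in 32 bits, i.e. (n >> 32) < d, or DIV will fault. */
uint32_t div64_32_asm(uint64_t n, uint32_t d)
{
    uint32_t quotient, remainder;
    __asm__("divl %[divisor]"
            : "=a"(quotient), "=d"(remainder)
            : "0"((uint32_t)n), "1"((uint32_t)(n >> 32)), [divisor] "rm"(d));
    (void)remainder;
    return quotient;
}

Whether the loss of portability is worth it obviously depends on how hot the code path is.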
Conversely, GCC does optimize the case where two 32bit variables are multiplied into a 64bit variable:
uint32_t a,b;
uint64_t ab;
…
ab = (uint64_t)a*(uint64_t)b;
GCC does not perform a full 64bit multiplication above; it emits the x86 MUL instruction, which multiplies two 32bit numbers and produces a 64bit result.
Fun times!
Perhaps I expect too much, but after reading the paper, it makes me wonder what kind of peer review process Google has. Not that this paper does not have potential, but considering the obvious flaws discussed here and elsewhere, and the general quality of the writing, I feel like I am reading a draft of this paper.
I can’t imagine many advisors in an academic setting signing off on this paper as is; I would imagine a faculty advisor would have a lot of constructive criticism and would help the author create a great paper. This paper is not horrible, but it is just surprising considering the source: Google.
P.S. Those who are interested in this topic might also enjoy the following online database: http://shootout.alioth.debian.org/
Can Google really be serious about a performance benchmark and not include C? I’m not implying that they need to use it to build gmail but it’s a pretty common baseline.
C++ has a lot of issues in a big project that make any micro analysis moot. RE: Boost? STL? Did the framework you’re using pay attention to performance… everywhere? How about the heap vs. stack issue that overtakes bigger projects.
Who cares about a loop? Yup, it’s speedy; that’s great until you run into this guy:
“Umm, all my functions are like virtual and stuff, ya know, like, in case I want to like inherit somethin’.”
Brilliant… Good feelings gone.
Scala, Java, and.. Go… why not PHP or Python while you’re at it? Odd.
I agree, C++ looks ancient compared to new languages. Reflection, for example, is a great way to implement runtime-evolvable systems. I have seen software systems where people tried to build reflection by hacking g++. What an overhead…
Besides that, its two-level compilation scheme and user-extensible syntax really complicate things like syntax-highlighting editors and aspect-oriented development. Aspects are proven to reduce development time for certain tasks. And we know what an editor like Eclipse can do for development time.
Besides this, as an academician I can say this paper is a direct reject. But certainly, it is good to see companies coming to conferences. We will learn something from them, and maybe they will take our ideas home and implement something usable.
However, if my PhD student brought this paper to me (an asst. prof of software engineering) or to my prof, our reply would be: rewrite and think more. In software engineering, you have other problems than simple performance: maintainability, evolvability, reliability… The language of choice depends on most of these requirements. Oh, and also, on some occasions multi-language (multi-paradigm) programming is really helpful. Some problems simply don’t fit into OO.
Well, Google might be one of the few companies where a one percent performance increase can lead to multi-million-dollar savings. That’s due to the scale of their products.
I’m not really convinced about reflection in large-scale systems. It usually seems like a good idea at first, but when you actually come back some years later and try to understand what’s happening, it just adds another layer of complexity.
Almost all the other components of software engineering you mention – Maintainability, reliability – are fully achievable with C++ if you have competent programmers and some guidelines (like avoid pointer arithmetic and all the horrible things you can do). It might take a bit more work than with Java, but it’s certainly doable and once again, if you’re at Google-scale, taking 20% more development time to develop something 2% faster might be worth it.
What two-level compilation scheme are you talking about?
Maintainability, reliability and all that stuff are off topic for the article’s paper: it just talks about performance, and though C++ is not as clean or simple as Java, it is (together with C) dominating industrial software development.
ebasconp,
<offtopic>
“What two-level compilation scheme are you talking about?”
I think he’s referring to include files and the C preprocessor. If so, I agree with him that it’s a regrettable mess.
“though C++ is not as clean or simple as Java, it is (together with C) dominating the industrial software development.”
C/C++ are often the defacto languages of choice for embedded and systems development.
I like the C logic syntax a lot (more than Ada/Pascal-style languages). But both C and C++ have usability issues which would have been handled differently if they were designed today. I’m not sure any new language can get enough momentum to take over from C in systems development.
</offtopic>
Is anyone really surprised by this? You have a language whose executable is native machine code vs. languages that are run by a bytecode interpreter. I’m not sure benchmarks were needed to say that C++ is going to produce a faster-executing program.
Whaaa? are you trying to imply that Java is not compiled or as fast as C++?? OMG Heretic! Heretic! Heretic! Seize the infidel!
In *general* the fastest language is (yukk) FORTRAN. This is because F77 lacks the pointer aliasing that hampers optimization in C and C++ compilers.
But who wants to develop in FORTRAN? or C++ for that matter?
James Gosling reports that the French scientific computing institute INRIA already evaluated Java for High-Performance Computing (HPC) in 2008 and found it faster than C (which is generally faster than C++):
http://blogs.oracle.com/jag/entry/current_state_of_java_for
They found the principal limitation for HPC was the Java networking library.
This study was in 2008 and the JVM has not gotten slower in that time.
Turbo Pascal was pretty fast. Never really caught on, except for Delphi and a few other niche languages.
I think that Free Pascal typically ranks very high in performance benchmarks.
I programmed in Delphi for a few years, back when it was still fairly popular. Borland’s Pascal compilers were very fast as far as compilation speed goes… Compared to pretty much any C/C++ compiler it was at least an order of magnitude faster (or even more, mostly due to its single-pass nature).
However, it did very little as far as optimizing the performance of the resulting executable (also at least partially due to it being a single-pass compiler). It was fairly dumb, to be honest. It did default to using registers for parameter passing and avoided building a stack frame when it could, but other than that it didn’t do much of anything interesting.
I don’t know much about Free Pascal, but I see it at least has SSE support, something that Delphi never had back in the day (it was purely x87 unless you did SSE explicitly using 3rd-party libs). I would be quite surprised to see a single-pass Pascal compiler doing the kind of transformative optimization that you get from GCC or other modern compilers. There are many types of optimization that are simply not possible without multiple passes.
Turbo Pascal, fast? Heh.
As noted below, the compiler was fast, especially in the TP6 days; compiling directly to RAM was a nice productivity boost on the PIO hard drives back then.
But the compiler was naïve, and with the slow CPUs back then you often needed to resort to inline BASM for even simple stuff… anybody remember Sally’s Peephole Optimizer? And anybody else around who rewrote (parts of) the CRT?
Things didn’t improve much with Delphi either, except CPUs had become fast enough that it wasn’t so relevant anymore.
Love to see them benchmark the D programming language.
I’ve been following D for a while and am rapidly falling in love with its syntax. Similar in many ways to C++, but generally has a *much* cleaner design.
C++ has a lot of “private this, friend that” noise.
The syntax in general is quite “noisy” and busy.
I’ve very rarely seen that kind of “syntax noise” in D. In D, you can throw templates around as much as you like, and it never seems to get as messy as C++.
Ok, D has its warts (the main couple being the separate Tango and Phobos libraries, and the “three char types, three string types” thing). However, the former is apparently being solved in D 2.0, and the latter is only a minor thing IMO.
What I like about C++: that you can use it like a slightly improved C, or use RTTI, STL templates, garbage collection, virtual function lookups, but the tradeoffs aren’t there unless you choose to use them. On the other hand, the carpal-tunnel-inducing syntax has led me to use Python for most of my work.
Now, the closest projects to offering the speed and ease of programming of both, in my mind, are PyPy with its compiler, and Genie, the alternative preprocessor for the Vala project.
http://codespeak.net/pypy/dist/pypy/doc/faq.html
http://live.gnome.org/Genie