Linked by Thom Holwerda on Mon 12th Mar 2012 19:00 UTC, submitted by yoni
Privacy, Security, Encryption "And just when you thought the whole Stuxnet/Duqu trojan saga couldn't get any crazier, a security firm who has been analyzing Duqu writes that it employs a programming language that they've never seen before." Pretty crazy, especially when you consider what some think the mystery language looks like: "The unknown c++ looks like the older IBM compilers found in OS400 SYS38 and the oldest sys36. The C++ code was used to write the tcp/ip stack for the operating system and all of the communications."
Thread beginning with comment 510439
Neolander
Member since:
2010-03-08

Take a look at the AMD64 calling convention then... It seems that they put so much effort into making it faster through increased register use that now only optimizing compilers can understand the logic behind it...

Reply Parent Score: 2

Alfman Member since:
2011-01-28

Neolander,

I haven't done asm for amd64, but it'd make sense that they've done something more optimal than passing via stack considering the extra registers.
http://en.wikipedia.org/wiki/X86_calling_conventions

"The registers RCX, RDX, R8, R9 are used for integer and pointer arguments (in that order left to right), and XMM0, XMM1, XMM2, XMM3 are used for floating point arguments. Additional arguments are pushed onto the stack (right to left). Integer return values (similar to x86) are returned in RAX if 64 bits or less. Floating point return values are returned in XMM0. Parameters less than 64 bits long are not zero extended; the high bits contain garbage."

(more info about the stack omitted)
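
To make the quoted convention concrete, here is a rough sketch (my own reading of the Wikipedia excerpt, i.e. the Microsoft x64 convention it describes, not verified compiler output) of where the arguments of a simple call would end up:

long long sum3(long long a, long long b, double c);

long long caller(void)
{
    /* a -> RCX, b -> RDX, c -> XMM2 (argument slots are positional),
       and the integer result comes back in RAX */
    return sum3(1, 2, 3.0);
}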


However the point I was trying to get at is that any fixed calling convention is always going to require more shuffling simply for the sake of getting parameters in the right place.

Here's a pointless example:

int F1(int a, int b) {
    int r = 0;
    while (b-- > 0) r += F2(a, b);
    return r;
}
int F2(int a, int b) {
    while (a--) b += F3(b);
    return b;
}
int F3(int a) {
    return a * (a + 3);
}

Obviously in this case it makes the most sense to inline the whole thing, but maybe we're using function pointers or polymorphism which makes inlining impractical. It should be fairly easy to make F2 work without stepping on F1's registers, and the same goes for F3 so that no runtime register shuffling is needed at all between the three functions.
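
For instance (my own illustration, reusing F2 from the snippet above), the moment one of these functions becomes reachable through a function pointer, the compiler has to keep an ABI-conforming copy of it around even if it also inlines the direct calls:

typedef int (*binop)(int, int);

int apply(binop op, int a, int b)
{
    return op(a, b); /* indirect call: some fixed calling convention is required here */
}

int demo(int a, int b)
{
    return apply(F2, a, b); /* F2 now needs an out-of-line, ABI-compliant body */
}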

The moment any calling convention is imposed, however, moving/saving/restoring registers becomes an unavoidable necessity.

Of course, today's pipelined processors are good at register renaming and the like to reduce the overhead of such shuffling. However, one inefficient scenario has always stood out like a sore thumb, and it perturbs me whenever I program in high-level languages: the inability to return more than one unit of data from a function call. The CPU has no such limitation, and BIOS programmers routinely return as many data points as needed, even using CPU flags which the caller can use for conditional jumps. I find this model works extremely well in ASM, but alas, C programmers are forced to overload the return value (using the sign bit) and/or return extra values through memory pointers.
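
To spell out the two usual C workarounds mentioned above (a quick sketch of my own, nothing more):

#include <ctype.h>

/* 1. Overload the return value: a non-negative result is the parsed value,
      a negative one doubles as an error code. */
long parse_digit(const char *s)
{
    if (!isdigit((unsigned char)*s)) return -1; /* error folded into the result */
    return *s - '0';
}

/* 2. Smuggle the extra result out through a pointer argument. */
int parse_digit2(const char *s, int *value)
{
    if (!isdigit((unsigned char)*s)) return 0;  /* status */
    *value = *s - '0';                          /* the "second return value" */
    return 1;
}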

Reply Parent Score: 2

yoursecretninja Member since:
2006-01-02

I don't have anything directly on topic to contribute to this, but... I want to say that this thread is very interesting and informative; exactly the kind of thing that made me a regular reader of OSNews.

Reply Parent Score: 1

acobar Member since:
2005-11-15

Indeed, you raised very interesting points about the drawbacks of having a calling convention (CC).

Disclaimer: it has been more than 15 years since I last coded in asm.

About multiple data return (MDR): perhaps it would create a nightmare for compiler writers for not so much benefit? We should also note that one of the key points of a CC is to allow code efficiency. For example, if a function returns an integer, the only thing you need to do before calling it is save the return register, e.g. eax.

You do:
push eax            ; save eax, since the return value will come back in it
push dword [ff0+4]  ; 2nd arg - an 8-byte float, pushed as two dwords
push dword [ff0]
push ebx            ; 1st arg
call randomf
add esp, 12         ; remove the 12 bytes of arguments
mov [edi], eax      ; store the return value
pop eax             ; restore eax

Suppose you had an MDR operator, like =* for example, and you could declare a function like int : float getboth(int i, float f).

You write:
m:q =* getboth(1, 2.0);


Everything is nice, but what are the implications if you write the following?
m : q =* getboth(1, 2.0) * getboth(2, 1.2);

You now must extend the syntax of the whole language so that this kind of construction can be useful and, to make the code efficient, you would need to reserve two registers to cope with the return values. Now imagine you would like to return, say, 16 values on a processor with few resources: you would run out of registers.

Also, with C compilers nowadays you just use a reference, and the compiler may try to eliminate the associated pushes and pops altogether.
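
A small sketch of that pattern (my own example; whether the pushes and pops really disappear depends on the compiler and the optimization level):

static void divmod(int a, int b, int *q, int *r)
{
    *q = a / b;
    *r = a % b;
}

int use(int a, int b)
{
    int q, r;
    divmod(a, b, &q, &r); /* with the callee visible, an optimizer can usually
                             promote q and r to registers and inline the call */
    return q + r;
}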

Reply Parent Score: 2

Neolander Member since:
2010-03-08

I haven't done asm for amd64, but it'd make sense that they've done something more optimal than passing via stack considering the extra registers.
http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling...

Sure, I was just arguing that the set of registers which they have picked seems to only make sense in the context of specific compiler implementations. Why do they use R8 and R9, for example? Why RAX, RCX, RDX, but not RBX? How is a regular C compiler supposed to figure out what is a system call and what isn't in order to use R10 properly, and why does only one syscall parameter get that optimization? The set of registers they have picked has no apparent internal logic, and I cannot see how an ASM dev could remember it all except by keeping the doc at hand at all times or memorizing it by brute force.

However the point I was trying to get at is that any fixed calling convention is always going to require more shuffling simply for the sake of getting parameters in the right place.

Here's a pointless example:

int F1(int a, int b) {
    int r = 0;
    while (b-- > 0) r += F2(a, b);
    return r;
}
int F2(int a, int b) {
    while (a--) b += F3(b);
    return b;
}
int F3(int a) {
    return a * (a + 3);
}

Obviously in this case it makes the most sense to inline the whole thing, but maybe we're using function pointers or polymorphism which makes inlining impractical. It should be fairly easy to make F2 work without stepping on F1's registers, and the same goes for F3 so that no runtime register shuffling is needed at all between the three functions.

The moment any calling convention is imposed, however, moving/saving/restoring registers becomes an unavoidable necessity.

A possible problem which I would spontaneously see with the examples is that, in the cases you mention, unless I have misunderstood, inlining is not performed because the compiler is unable to detect the relationship between F1, F2, and F3 at compile time. If so, how could it make sure that the functions are not stepping on each other's registers?

Besides, I am not sure that compilers have to follow calling conventions for anything but external library calls, for which some kind of standard is necessary since the program and the library are compiled separately. As an example, when inlining is performed, calling conventions are violated (or rather bypassed), and no one cares.
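
As a concrete illustration of that (purely my own sketch): giving a helper internal linkage tells the compiler that no outside caller can ever see it, so it is free to inline it or invent its own register assignment for it, ABI or not.

static int helper(int a) { return a * (a + 3); }

int visible(int a, int b)   /* only this one must honour the platform ABI */
{
    while (a--) b += helper(b);
    return b;
}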

Of course, today's pipelined processors are good at register renaming and the like to reduce the overhead of such shuffling. However, one inefficient scenario has always stood out like a sore thumb, and it perturbs me whenever I program in high-level languages: the inability to return more than one unit of data from a function call. The CPU has no such limitation, and BIOS programmers routinely return as many data points as needed, even using CPU flags which the caller can use for conditional jumps. I find this model works extremely well in ASM, but alas, C programmers are forced to overload the return value (using the sign bit) and/or return extra values through memory pointers.

Indeed, the inability of C and C++ to return any status information other than "operation failed" without resorting to fancy tricks has bothered me more than once too. I typically use structures to get around that, but that too can quickly become a bother.

Ideally, any language would support tuples like Python's, where you can shove a set of inhomogeneous objects into the returned "value" of a function without caring what happens under the hood. But I suspect that this can be hard to optimize properly.
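
The closest plain C equivalent I can think of is returning a small struct; as far as I know, the common 64-bit ABIs hand such a struct back in registers (e.g. RAX and RDX under the System V AMD64 ABI), so the "tuple" need not go through memory at all:

struct divmod_pair { long long quot; long long rem; };

struct divmod_pair div_both(long long a, long long b)
{
    struct divmod_pair out = { a / b, a % b };
    return out; /* 16 bytes, two integer fields: typically returned in RAX/RDX */
}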

Edited 2012-03-14 04:47 UTC

Reply Parent Score: 2