Linked by Thom Holwerda on Mon 12th Mar 2012 19:00 UTC, submitted by yoni
Privacy, Security, Encryption "And just when you thought the whole Stuxnet/Duqu trojan saga couldn't get any crazier, a security firm that has been analyzing Duqu writes that it employs a programming language that they've never seen before." Pretty crazy, especially when you consider what some think the mystery language looks like: "The unknown C++ looks like the older IBM compilers found in OS400 SYS38 and the oldest sys36. The C++ code was used to write the TCP/IP stack for the operating system and all of the communications."
Thread beginning with comment 510449
To view parent comment, click here.
To read all comments associated with this story, please click here.
Alfman
Member since:
2011-01-28

Neolander,

I haven't done asm for amd64, but it'd make sense that they've done something more optimal than passing via stack considering the extra registers.
http://en.wikipedia.org/wiki/X86_calling_conventions

"The registers RCX, RDX, R8, R9 are used for integer and pointer arguments (in that order left to right), and XMM0, XMM1, XMM2, XMM3 are used for floating point arguments. Additional arguments are pushed onto the stack (right to left). Integer return values (similar to x86) are returned in RAX if 64 bits or less. Floating point return values are returned in XMM0. Parameters less than 64 bits long are not zero extended; the high bits contain garbage."

(more info about the stack omitted)


However the point I was trying to get at is that any fixed calling convention is always going to require more shuffling simply for the sake of getting parameters in the right place.

Here's a pointless example:

int F1(int a, int b) {
int r=0;
while(b-->0) r+=F2(a,b);
return r;
}
int F2(int a, int b) {
while(a--) b+= F3(b);
return b;
}
int F3(int a) {
return a*(a+3);
}

Obviously in this case it makes the most sense to inline the whole thing, but maybe we're using function pointers or polymorphism which makes inlining impractical. It should be fairly easy to make F2 work without stepping on F1's registers, and the same goes for F3 so that no runtime register shuffling is needed at all between the three functions.

The moment any calling convention is imposed, however, moving/saving/restoring registers becomes an unavoidable necessity.

Of course, today's pipelined processors are good at doing register renaming and whatnot to reduce the overhead of such shuffling. However, one inefficient scenario has always stood out like a sore thumb, and it perturbs me when I program in high-level languages: the inability to return more than one unit of data from a function call. The CPU has no such limitation, and BIOS programmers routinely return as many data points as needed, even using CPU flags which the caller can use for conditional jumps. I find this model works extremely well in ASM, but alas C programmers are forced to overload the return value (using the sign bit) and/or return extra values through memory pointers.

Reply Parent Score: 2

yoursecretninja Member since:
2006-01-02

I don't have anything directly on topic to contribute to this, but... I want to say that this thread is very interesting and informative; exactly the kind of thing that made me a regular reader of OSNews.

Reply Parent Score: 1

Alfman Member since:
2011-01-28

yoursecretninja,


Yeah, I love the technical topics and understanding the OS internals... Optimizing stuff is a challenge I always enjoy, but it's an archaic concept these days. I only wish I could land a decent job where my skills were actually appreciated; it's a struggle. On my own I'm working on a secure deduping P2P backup protocol, which is a lot more interesting than my day job.

Reply Parent Score: 2

acobar Member since:
2005-11-15

Indeed, you raised very interesting points about the drawbacks of having a calling convention (CC).

Disclaimer: it has been more than 15 years since I last coded in asm.

About the multiple data return (MDR): perhaps it would create a nightmare for compiler writers for, perhaps, not so much benefit? We should also note that one of the key points of a CC is to allow efficient code. For example, if a function returns an integer, the only thing you need to do before calling it is save the return register, for example eax.

You do:
push eax       ; save it, as eax will be used for the return value
push ff0       ; 2nd arg - 8 bytes
push ebx       ; 1st arg - 4 bytes
call randomf
add esp, 12    ; pop the 12 bytes of arguments
mov [edi], eax ; store the return value
pop eax        ; restore eax

Suppose you had a MDR operator, like =* for example, and you could declare a function to be like int : float getboth(int i, float f).

You write:
m : q =* getboth(1, 2.0);


Everything is nice, but what are the implications if you write:
m : q =* getboth(1, 2.0) * getboth(2, 1.2);
?

You would now have to extend the syntax of the whole language so that this kind of construction can be useful and, to make the code efficient, reserve two registers for the return values. Now imagine you would like to return, say, 16 values on a processor with few resources: you would run out of registers.

Also, on C compilers now you just use a reference, and the compiler may altogether try to eliminate the associated pushes and pops.

Reply Parent Score: 2

Alfman Member since:
2011-01-28

acobar,

"About the multiple data return (MDR), perhaps, it would create a nightmare for compilers writers for, perhaps, not so much benefit?"

I guess the multiple return has pros and cons on two fronts:
1. What would be the necessary impact on compiler implementations and calling conventions under the hood?
2. How would this language feature change the way high level software is written?

I won't speak towards #1 since that would deserve a much more in-depth analysis than either of us can commit to for this conversation.

As for #2, there's at least one extremely common use case that crops up over and over again, and that's a function which returns both a status and a data value. This pattern is so common I wouldn't mind a language addressing it specifically.


long pos=ftell(file);
if (pos<0) {
printf("error %s\n", strerror(errno));
}

This convention, which is so common on Linux, has some problems. For one, pos cannot distinguish between a high bit being part of a legitimate position and an ftell error; because of this overloading, the usable range is half of what it should be. Another is the use of the thread-local global errno to return a status. Maybe it's a necessary evil, but it's still not pretty, and it is still compulsory to flag the error through the returned value. Other times the return value/type cannot be overloaded for the error case at all.

All these problems can be easily/efficiently solved in ASM using output flags and registers, but as you rightly observed the question is how to create a clear syntax to deal with it.

One approach is to adapt the perl error syntax which I find pretty clear.

my $pos = ftell(FILE) or die($!); # what to do on error

Of course perl supports multi-value returns directly too.
my ($a, $b) = ftell(FILE); # this requires just one input register to be occupied, leaving the rest free
# The syntax may be rough for "one-liners", but the $a and $b temp variables can reference the registers as-is without any copying.

C forces us to offload the value temporarily into memory (imagine a hypothetical ftell variant taking an out-parameter):
int pos;
if (!ftell(FILE, &pos)) { error... } // This burns one more input register than the prior version, and also wastes two memory accesses.


"Also, on C compilers now you just use a reference and the compiler may altogher try to eliminate the associated pushes and pops."

I think C only has leeway to do this for inlined functions. Inter-procedural code cannot be optimized without breaking calling conventions.

Reply Parent Score: 2

Neolander Member since:
2010-03-08

"I haven't done asm for amd64, but it'd make sense that they've done something more optimal than passing via stack considering the extra registers."
http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling...

Sure, I was just arguing that the set of registers they have picked seems to only make sense in the context of specific compiler implementations. Why do they use R8 and R9, for example? Why RAX, RCX, RDX, but not RBX? How is a regular C compiler supposed to figure out what is a system call and what isn't in order to use R10 properly, and why does only one syscall parameter get that optimization? The set of registers they have picked has no apparent internal logic, and I cannot see how an ASM dev could remember it all except by keeping the doc at hand at all times or memorizing it by brute force.

"However the point I was trying to get at is that any fixed calling convention is always going to require more shuffling simply for the sake of getting parameters in the right place."

Here's a pointless example:

int F1(int a, int b) {
int r=0;
while(b-->0) r+=F2(a,b);
return r;
}
int F2(int a, int b) {
while(a--) b+= F3(b);
return b;
}
int F3(int a) {
return a*(a+3);
}

"Obviously in this case it makes the most sense to inline the whole thing, but maybe we're using function pointers or polymorphism which makes inlining impractical. It should be fairly easy to make F2 work without stepping on F1's registers, and the same goes for F3 so that no runtime register shuffling is needed at all between the three functions."

"The moment any calling convention is imposed, however, moving/saving/restoring registers becomes an unavoidable necessity."

A possible problem which I would spontaneously see with your examples is that, in the cases you mention, unless I've misunderstood, inlining is not performed because the compiler is unable to efficiently detect the relationship between F1, F2, and F3 at compile time. If so, how could it make sure that the functions are not stepping on each other's registers?

Besides, I am not sure that compilers have to follow calling conventions for anything but external library calls, for which some kind of standard is necessary since the program and the library are compiled separately. As an example, when inlining is performed, calling conventions are violated (or rather bypassed), and no one cares.

"Of course, today's pipelined processors are good at doing register renaming and whatnot to reduce the overhead of such shuffling. However, one inefficient scenario has always stood out like a sore thumb, and it perturbs me when I program in high-level languages: the inability to return more than one unit of data from a function call. The CPU has no such limitation, and BIOS programmers routinely return as many data points as needed, even using CPU flags which the caller can use for conditional jumps. I find this model works extremely well in ASM, but alas C programmers are forced to overload the return value (using the sign bit) and/or return extra values through memory pointers."

Indeed, the inability of C and C++ to return any status information other than "operation failed" without using fancy tricks has bothered me more than once too. I typically use structures to get around that, but that too can quickly become a bother.

Ideally, any language would support tuples like Python's, where you can shove a set of heterogeneous objects into the returned "value" of a function without caring what happens under the hood. But I suspect that this can be hard to optimize properly.

Edited 2012-03-14 04:47 UTC

Reply Parent Score: 2

Alfman Member since:
2011-01-28

Neolander,

"Sure, I was just arguing that the set of registers which they have picked seems to only make sense in the context of specific compiler implementations. Why do they use R8 and R9, as an example ? Why RAX, RCX, RDX, but not RBX ?"

Ah, well, now I can't answer that (or your other questions). Back with real-mode addressing the choice of registers was more significant, but now... it may be somewhat arbitrary? I'm not sure about the conventions for the special AMD64 cases.


"...unless I'm misunderstood, inlining is not performed because the compiler is unable to efficiently detect the relationship between F1, F2, and F3 at compile time. If so, how could it make sure that the functions are not stepping on each other's registers ?"

My counterargument is that if a human programmer can see the relationship, so too should an ideal compiler. Of course, it can only prove relationships for internal dependencies which are available at compile time, but I think that's a given.

As for reasons not to inline, besides the two I already listed (function pointers and polymorphism), one is mutual recursion. Another obvious one is size/cache optimization. Yet another is "tail calling", where a function jumps directly into another function instead of doing a call followed by a ret. Sometimes these end up being 100% free in the context of conditional logic that would require a jump anyway, so nothing is saved by the inlined code path.

Note: GCC (with optimizations enabled, e.g. -O2) is already able to optimize away tail calls, so that the function below will run indefinitely without running out of stack.

#include <stdio.h>

int forever(int x) {
printf("%d\n", x);
return forever(x+1);
}


"Besides, I am not sure that compilers have to follow calling conventions for anything but external library calls, for which some kind of standard is necessary since the program and the library are compiled separately. As an example, when inlining is performed, calling conventions are violated (or rather bypassed), and no one cares."

Yes, that's the theory, but in practice GCC always uses the calling convention. I think C functions default to being externally callable to aid in external linking. In fact, the whole methodology of compiling to objects and then linking them together into a static binary is an obstacle for any compiler that would like to perform interprocedural optimization.


"Ideally, any language would support tuples like Python's, where you can shove a set of inhomogeneous objects into the returned 'value' of a function without caring what happens under the hood. But I suspect that this can be hard to optimize properly."

Well, I don't know about Python's implementation. However, sometimes I find it helpful to stop looking at things as well-defined mathematical functions and instead see a sequence of code blocks, always moving forward by "jumping" to the next block with more parameters for it. Call and ret are simply different mechanisms for locating the next block, but otherwise do the same thing. So there's no difference in what I can pass from one block to the next.

To highlight this:
function A() {
{blockA1}
B()
{blockA2}
}

function B() {
{blockB1}
}

We can 'Unthink' the function abstraction to get:
{blockA1} -> {blockB1} -> {blockA2}
There's no fundamental reason the parameters from B1 to A2 cannot be just as rich as the parameters from A1 to B1.

Edit: I wanted to highlight that input and output parameters are really different perspectives on the same thing, and they would not need two different optimization mechanisms if it weren't for differences in higher-level function semantics.

Addendum: It would be pretty cool to have a language where a "return" would be syntactically the same as a function call with type-checking and everything.

function A() {
B() returns(int a, char b);
printf("%d %c\n", a,b);
}

function B() {
return(4,'c');
}

Edited 2012-03-14 15:10 UTC

Reply Parent Score: 2