“For years, PC programmers used x86 assembly to write performance-critical code. However, 32-bit PCs are being replaced with 64-bit ones, and the underlying assembly code has changed. This white paper is an introduction to x64 assembly. No prior knowledge of x86 code is needed, although it makes the transition easier.”
…I haven’t had to use Assembly Language in something like 16 years but I have fond memories of it. And then it was VAX/Alpha Assembly Language, not x86.
Me neither, but I have been having fun lately by porting a compiler done at University back in the 90’s to a workable state and cleaning the code to be GPL compliant.
So this article is quite refreshing.
Yeah, too much legacy crap.
Just the register naming mess alone is enough to put you off.
Architectures without all that legacy are so much cleaner.
I enjoyed 68000 and Cell-SPU assembly a lot.
I wonder if I even kept my VAX assembler language reference; that was almost a high-level-language architecture right there, with instructions that could be up to 20 bytes long or so. Pascal and other structured statements were almost right in the instruction set, the antithesis of the coming RISC chips from Stanford (MIPS) and Berkeley (SPARC) and others.
My true love was the 68000 and its successors, so easy to do 32-bit code without any fuss or muss, even self-modifying code until memory protection came along.
I never got into x86 until I got books like “Inner Loops” by Booth and some others. That finally made a case to look it over and learn a bit about Pentium pipeline pairing, but by then the Athlons and newer Intel chips made any hand-optimization attempts moot. No predictability at all, very un-RISC to me.
Plain C code now runs as fast as any assembler because the opcodes are essentially free and memory references are free as long as they hit cache. Assembler is likely only worth doing for DSP codecs.
With each new processor generation extending the ISA for more capability (on its own, a good thing) while never dropping support for older features (on its own, also a good thing), x86 is a horrible mess.
I mean, apart from changing the names of registers as they went from 16- to 32- to 64-bit (which is cool), between Intel and AMD there’s x87, MMX, SSE, SSE2, SSE3, SSE4, AVX, 3DNow! and 3DNow!+ (unofficial name), XOP, FMA4, and CVT16, all for math. Some registers overlap between extensions and some don’t.
Non-math instruction extensions include PAE, AMD-V and AMD-Vi, Intel VT and VT-d, VT-c, and EPT.
Note that this may not be an exhaustive list, though listing them was exhausting.
Also, poo on you, Intel, for using “x64”. It’s AMD64, and just because you decided to do a dick move and purposely made your 64-bit extensions slightly incompatible in a few corner cases so you could call it something other than AMD64 doesn’t mean you have a different 64-bit implementation. No. All it means is that you have a broken AMD64 implementation.
Well, I’d say that the fix for x86 extension proliferation is easy: find that forgotten corner of the AMD64 spec where it states that the architecture mandates the presence of SSE2, NX and CPUID extensions among a few other things, then leave the rest to compilers, unless you really need a feature (like VT) or have to get the most out of a specific range of hardware for some reason.
I don’t think it’s a very good idea to rely on hardware-specific features which are not guaranteed to be present on future processor generations, even if that is the case in practice for now. Just look at how few of these x86 extensions Atom processors support: tomorrow, these little chips will likely be good enough for the average Joe’s computing needs…
This is one of the reasons why it is so hard to use Assembly nowadays compared with the old 8- and 16-bit CPUs.
Not only are the CPUs doing lots of manipulation of your code as if they were compilers, the number of instruction sets to remember is just too big.
The nice thing about the early 32-bit x86 RISC-style books like “Inner Loops” was that they made it quite clear which instructions should be used in assembler and which to ignore completely. So several hundred opcodes were reduced to a very small set of basic ops, almost all reg-to-reg plus load/store. Basically the Pentium was an improved 486.
As for 64 bit codes, I’ll have to look into that.
Do you remember the Pentium programming series in Dr Dobbs from Michael Abrash?
They were all about teaching x86 Assembly developers how to write code to minimize processor stalls, wrong branch predictions, and cache misses.
Issues that weren’t a problem before.
That could very well be because up to the 386, the x86 family had been unpipelined in-order stack machines for all intents and purposes.
All the issues you mention are intrinsic to most in-order, pipelined, superscalar designs.
I did get Dr. Dobb’s from time to time, but I also have the Michael Abrash books too (Zen of Code Optimization, plus the graphics programming one), a lot like Inner Loops. I like the latter because I was only interested in certain types of asm code like JPEG DCTs and, well, inner loops. It’s always nearby.
Ultimately I let the C compiler do the work, compiling C fragments that map 1:1 to asm opcodes, all inline. It just looks nicer than opting into the uglier asm syntax. I never learnt to use the MMX or SSE stuff at all; I copped out.
On the toy compiler that I am cleaning up, referenced in another post, I am actually porting the runtime from C to Assembly.
It is not big, so the porting effort for new architectures is not much bigger than porting the compiler’s backend; it removes the dependency on a C compiler for the runtime library, and it is fun anyway.
This is not a commercial product, so I can allow myself to do this type of stuff.
Are you calling my compiler a toy and accusing me of not writing it?
It was a toy when I started in 2004, but now it’s good.
http://web.archive.org/web/20040606212724/www.simstructure.hare.com…
I optimized by merging little instructions into big x86_64 ones, but the CPU didn’t care because it breaks them back down into little micro-ops anyway.
x86 has “1.5 args” per instruction — it’s CISC. You might as well use 2 instructions for every one and it won’t matter.
I started with a stack machine.
1+(2+3)*4
PUSH 1        ; stack: 1
PUSH 2        ; stack: 1 2
PUSH 3        ; stack: 1 2 3
POP A         ; A=3
POP B         ; B=2
A=A+B         ; A=5
PUSH A        ; stack: 1 5
PUSH 4        ; stack: 1 5 4
POP A         ; A=4
POP B         ; B=5
A=A*B         ; A=20
PUSH A        ; stack: 1 20
POP A         ; A=20
POP B         ; B=1
A=A+B         ; A=21
PUSH A        ; stack: 21
POP RESULT    ; RESULT=21
Here’s a mini x86_64 compiler:
http://www.templeos.org/Wb/Demo/Lectures/MiniCompiler.html
You have to work on your personality, who the hell was talking about your stuff?
I was referring to my own work,
http://www.osnews.com/permalink?557794
Make sure you read the comments properly before replying.
No. On my Internet, everything is about me. You are CIA and I am in prison.
You think this guy worries about being on the dole?
http://www.youtube.com/watch?v=VuCCgQsyq8s
Oops, read page 2 before answering. Ignore this.
Well, Atom has this problem as well. All Atoms support up to SSE3, some support SSSE3 (an extension to SSE3), and the newest support Intel VT-x.
Obviously, they don’t support the AMD extensions (3DNow!, XOP, FMA4, and CVT16).
All this stuff adds complexity to the front end (one of the main targets for reducing power consumption on Atom), but at least the back-end stages don’t get significantly more complex.
My point was not that Atom processors do not suffer from extension proliferation, but that x86 extensions are not guaranteed to last forever. Especially considering that in computing history, any time computer hardware has started to get dangerously close to “good enough”, hardware guys have come up with a more constrained computer form factor that called for less capable CPUs and thus yet another new performance race.
Today, cellphones SoCs are getting so fast that Apple, Google and Microsoft have a hard time keeping OS bloat high enough to drive hardware sales. So I’m pretty sure that somewhere in R&D labs, the Next Big Thing is closing in pretty fast. And that its earliest iteration will have an incredibly primitive CPU by modern standards.
Unless, of course, everything goes cloud at this point, bringing back the mainframe age. In which case CPU extensions could still become irrelevant, but this time it would be because no one cares about the performance of individual CPUs when the main issue is spreading work on thousands of these.
My OS is x86_64 but I do not use paging and everything runs in ring-0, so a primitive CPU could be used.
I see in the Intel article it says that modern compilers and optimizers sometimes make mistakes. I have experienced that occasionally, but always wondered what I’d done wrong, and looking at x86 code from a compiler is never much fun.
One bug I see often is that release and debug builds can be so different that one never works while the other does, until I stumble on the C code that caused it.
http://www.templeos.org/Wb/Demo/Lectures/64BitAsmQuiz.html
That is not a 64-bit asm test; it is a test of Intel/AMD x86 trivia. Having looked at it and the Intel article is enough to make me not want to bother with this level of detail. Your quiz includes lots of arcane stuff that seems to have little to do with general-purpose 64-bit coding.
For me, a 64-bit asm should look like a flat register space with all registers acting the same (except R0) and all named Ri. As i gets bigger, the opcode may get more prefixes to allow ever more registers. At some point the registers might even spill into memory stack space, from i==64 up to some limit. This kind of opcode allows 3 operands, z <= x fn y, a pleasure to code, almost like C with one-op statements.
You have a valid point, except the implementation is heavily influenced by arcane crap. You have to know this to write a C compiler. Otherwise, ASM is a stupid thing to do.
You only have very limited access to 64-bit immediates. Lots of instructions take 32-bit immediates, but you can only load a full 64-bit immediate into a register.
One really important fact is that the upper 32 bits get set to zero on any 32-bit-sized instruction. This is your bread and butter.
You can only branch with +/- 2 GB (32-bit) relative addressing. I put all code in the lowest 2 Gig.
At least the new registers on amd64 are named r8 → r15, which is marginally sane. Shame about the first half, of course.
First half is for legacy reasons, just like segmented memory lasted that long. Check out ARM or 68K assembly, you’ll touch heaven…
Kochise
Back when I spent some time on x86 asm in 32-bit, I just used R0-R7 as labels and never used them in their archaic 16/8-bit modes, so it wasn’t that bad. If I did use x64 asm, I would do the same again; no interest in the cruft.
But 16 registers or more is even better, and the other ISAs had that decades ago, even more so when you can have 3 operands per opcode, which does the same work as 2 or more x86 opcodes.
You have no choice but to use 32-bit immediates most of the time, because 32-bit is mostly all there is.
The only 64-bit immediate loads a register, and only like this:
MOV R8,0x1234567890ABCDEF
All other addressing modes take 32-bit immediates.
There is
CALL REL32
but no
CALL REL64
If you want a long call:
MOV RAX,0x1234567890ABCDEF
CALL RAX
transputer guy,
My personal opinion is that it isn’t so bad to be able to work with multiple word sizes on x86. It can eliminate the bit shifting/masking needed on other processors.
The other cruft and inconsistencies are more unfortunate though, like instruction prefixes and variable-length opcodes. These basically mandate a CISC processor that recompiles x86 code into easier-to-execute microcode. AMD could have implemented a better 64-bit architecture if it weren’t for the market demand for “x86” processors. For better or worse, AMD64 has given x86 new life for at least another decade.
Intel’s 64-bit mistake was designing its processor for the enterprise and failing to produce a 64-bit processor for the consumer market. Small developers like me who were interested in developing for Itanium’s advanced features couldn’t afford one. Very little software would get rewritten to make use of its explicit parallelism. Ultimately, without native software, the architecture was destined to be judged by how well it ran x86, and in this respect it was an abysmal failure. x86 requires insanely complex pipelines and dependency logic to infer parallelism, which the Itanium didn’t implement.
So for now, we’re basically stuck with x86 for a while longer. 64-bit ARM processors will be very promising for future general-purpose computers, but as long as Microsoft insists on neutering both the hardware and software (aka Windows RT), it’s going to seriously hurt any transition in the desktop space.
No real argument there, it will be interesting to see how the ARM64 plays out both on servers and handhelds far away from Microsoft and Intel.
I still have to make some time for my Pi board. It’s been a long time since I spent much time on ARM, an ARM7TDMI made for Intel by GEC (UK).
Talk to God. Tongues ouija board, more or less.
http://www.templeos.org/files/TSGodSetUp.zip
64-bit ARM assembly (for AArch64) is much more interesting:
http://www.extremetech.com/computing/139455-cortex-a57-takes-arm-to…
http://www.arm.com/products/processors/cortex-a50/cortex-a57-proces…