What do you all think is technically wrong with x86, why can't it be fixed, and which CPU and/or ISA (instruction set architecture) should follow? Why?
Also, one question that I couldn't answer myself by searching the web: for years, x86 suffered from too few general-purpose registers (8), a number that was only recently raised to 16 with x86-64 (in long mode). What was the reason the number of registers in x86 was not increased earlier, and by more? Why limit it to just 16 when other CPUs often offer something around 128 registers or so? Too expensive?
Up until 10-15 years ago a considerable amount of game and application development was still carried out in assembly language. At that time, the x86 ISA seemed poor compared to that of other chips such as 68000 and ARM. However, unless writing device drivers or a compiler, most programmers don't see the ISA at all these days. It seems a shame to some that the weakest design achieved dominance. These days a CPU is a black box - you put code and data in one side and data comes out the other end. The performance, cost, and power consumption are the factors that differentiate chips for most people.
Sometimes people on this site comment that they wouldn't mind having a PPC Linux workstation for example. I'm always left scratching my head as to why.
Obviously the current chips would be more powerful, or more power efficient, or both, if they didn't have to run x86 code, since they have to convert it into RISC code on the fly. This constitutes a waste of resources.
In the case of a netbook CPU, things are a bit different as power efficiency is as important as speed and legacy compatibility. In other words, extra battery life might be better than the ability to natively run Windows applications.
I always presumed that registers must be implemented as some sort of really high speed RAM located within the CPU core. I suppose that adding more of them adds complexity and cost to the chip. As I say, a CPU should be considered as a black box. Perhaps the designers of x86-64 chips discovered that the performance gains of more than 16 registers were smaller than those of other measures such as longer pipelines, more L1 cache or faster clock rates.
First off, thanks for your answer!
When you say that the CPU is a black box, do you mean that this is a good thing in general? Let's pretend for a moment that the ISA of today's dominant chip was the best one imaginable: Which one would it be? And would it increase the chance that more dedicated assembly programmers would fine tune their code such that it could even offer better performance than what is possible today, just because the ISA fits the human brain better? Would maybe the CPU in that scenario not be considered a black box, for our own benefit?
This is not obvious to me, however. It has been claimed for years that x86 code is inefficient WRT performance, but how is that in line with the reality that the fastest CPUs (disregarding supercomputing) are in fact x86 CPUs? And that even Intel themselves, despite the huge amounts of money for research and development that went into the Itanium, have not been able to create a CPU that is substantially faster than their own x86 offerings?
As to power efficiency, I can't say much about it, and you may be right. :-)
Yes, probably. Unfortunately, not much information about it can be found on the web...
"When you say that the CPU is a black box, do you mean that this is a good thing in general?"
I think that people sometimes become emotionally attached to a hardware platform. I was too, but these days, all I care about is performance and cost in a desktop CPU.
I don't think that any changes to the current ISAs would cause application developers to go back to assembly language programming en masse because software has become too complex to program at such a low level. In other words, time is a resource. Why spend three times as long working on the code for a possible, small speed benefit? In some cases, compilers can perhaps beat human assembly language programmers. On any typical programming project, three times the manpower could be better spent than wading through assembly language.
So, as no one even sees the ISA now, I don't think it matters what the underlying technology is.
This is an interesting question. If I were to choose a dream CPU for my web server it would be the highly threaded Sun T2000 CPU, but I'd suffer immensely having that for a desktop CPU. This might change in 5-10 years when/if desktop software becomes highly parallel/threaded though.
MC68000 to some extent, MIPS, ARM, PowerPC etc are usually examples of "good" architectures, but how would they deal with the growth pain of 64-bit? Either new dedicated 64-bit designs that would do poorly performing software emulation of 32-bit, or they'd do a new protected mode for 64-bit à la a modern x86 processor. And in AMD64 mode, from the little I've read, the x86 is pretty nice.
The more cores you cram onto a die, the more memory bandwidth becomes an issue and CISC is more compact than RISC and therefore CISC architectures (or CISC instructions, RISC translation) might make sense more than a clean RISC only architecture.
"This is an interesting question. If I were to choose a dream CPU for my web server it would be the highly threaded Sun T2000 CPU, but I'd suffer immensely having that for a desktop CPU. This might change in 5-10 years when/if desktop software becomes highly parallel/threaded though.
MC68000 to some extent, MIPS, ARM, PowerPC etc are usually examples of "good" architectures, but how would they deal with the growth pain of 64-bit? Either new dedicated 64-bit designs that would do poorly performing software emulation of 32-bit, or they'd do a new protected mode for 64-bit à la a modern x86 processor. And in AMD64 mode, from the little I've read, the x86 is pretty nice."
I think nice is relative.
Anyway, MIPS and PowerPC (and PA-RISC) had relatively easy transitions to 64-bit. Their 64-bit instruction sets are supersets of the 32-bit ones, and 32-bit code runs without emulation or different protection modes. The only instructions that are different in 64-bit mode are rotates. Two's complement allows 32-bit code to use 64-bit registers just fine for most operations.
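A rough way to see the two's complement point is the small Python sketch below. It only models register contents as masked integers. The helper names (`sext32`, `add64`, `rotl` and so on) are made up for illustration and do not correspond to any real instruction or library. It checks that the low 32 bits of a 64-bit add or multiply on sign-extended operands match the plain 32-bit result, and that a rotate does not.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def sext32(x):
    """Sign-extend a 32-bit value into a 64-bit register image."""
    x &= MASK32
    if x & 0x80000000:
        return x | (MASK64 ^ MASK32)
    return x

def add32(a, b):
    return (a + b) & MASK32

def mul32(a, b):
    return (a * b) & MASK32

def add64(a, b):
    return (a + b) & MASK64

def mul64(a, b):
    return (a * b) & MASK64

def rotl(x, n, bits):
    """Rotate left by n (0 < n < bits) within a register of the given width."""
    mask = (1 << bits) - 1
    x &= mask
    return ((x << n) | (x >> (bits - n))) & mask

samples = [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF, 0x40000001]

# Add and multiply: the low 32 bits of the 64-bit result equal the 32-bit result.
low_bits_agree = all(
    add64(sext32(a), sext32(b)) & MASK32 == add32(a, b)
    and mul64(sext32(a), sext32(b)) & MASK32 == mul32(a, b)
    for a in samples
    for b in samples
)
print(low_bits_agree)  # True

# Rotate: bits wrap around at a different position, so the results differ.
print(hex(rotl(0x40000001, 2, 32)))           # 0x5
print(hex(rotl(0x40000001, 2, 64) & MASK32))  # 0x4
```

This is only a model of the arithmetic, not of any particular MIPS or PowerPC implementation.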
Oh, and the big RISC processors all transitioned in the early-mid 90's before they even had 64 bit OSes available.
"The more cores you cram onto a die, the more memory bandwidth becomes an issue and CISC is more compact than RISC and therefore CISC architectures (or CISC instructions, RISC translation) might make sense more than a clean RISC only architecture."
Instruction density is only half the story. Many instructions and much memory bandwidth on x86 are used simply for shuffling values into and out of registers, overhead that is generally not required on 32-register instruction sets. So, x86 may well require more instructions to execute for a given function, even if the amount of memory used is smaller. And with code sizes being dwarfed by data sizes these days, less and less bandwidth is being used by instructions.
And let's not forget the mighty Alpha. Given the resources Intel pour into chips and fabs, what could Alpha have achieved given the same resources? Compare the sizes of the original Pentium and the original Alpha: ~3.1 million transistors versus ~1.7m. Which CPU do you think you could cram more of onto a die?
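The "shuffling values into and out of registers" overhead can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it replays a made-up trace of variable uses and counts how often a value has to be (re)loaded from memory when only a fixed number of registers is available, using least-recently-used eviction. Real compilers use proper register allocators (graph colouring, linear scan), and the function name `count_loads` and the trace are hypothetical.

```python
from collections import OrderedDict

def count_loads(trace, num_regs):
    """Count memory loads for a trace of variable uses with num_regs registers.

    A variable already held in a register costs nothing; otherwise it is
    loaded, evicting the least recently used variable if all registers
    are occupied.
    """
    regs = OrderedDict()
    loads = 0
    for var in trace:
        if var in regs:
            regs.move_to_end(var)          # mark as most recently used
        else:
            loads += 1                     # value must come from memory
            if len(regs) >= num_regs:
                regs.popitem(last=False)   # evict least recently used
            regs[var] = True
    return loads

# Hypothetical inner loop touching 12 live values, executed 10 times.
trace = [f"v{i}" for i in range(12)] * 10

print(count_loads(trace, 8))   # 120: every use reloads the value
print(count_loads(trace, 16))  # 12: each value is loaded once
print(count_loads(trace, 32))  # 12
```

In this contrived trace the working set just exceeds 8 registers, so the 8-register case thrashes, while 16 and 32 registers behave identically. Stores are not counted here.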
"I think nice is relative."
Seen from the perspective of an assembly language programmer?
So, put (very) simply, you say that code density does not matter much anymore since data structures are so much larger than code sections today?
I didn't understand, however, what you meant by the "overhead that is generally not required on 32 register instruction sets." What 32-register instruction sets do you mean? Could you clarify that? Thanks!
Are you thus saying that we would have faster CPUs now, if Alpha had had the same amount of money put into its development as the x86?
MC68000 to some extent, MIPS, ARM, PowerPC etc are usually examples of "good" architectures, but how would they deal with the growth pain of 64-bit? (...)
Now, let's leave such problems aside and assume that one of the CPUs you mentioned were in fact today's dominant chip, without the need to be backwards compatible.
Do you think that just because these are "good architectures", today's programmers would write more efficient code, maybe even in assembly language? Given a "good architecture", this should be easier than on a bad one (such as x86), shouldn't it?
Do you think that if the Sun T2000 CPU that you mentioned was dominant, that programmers would already have created suitable highly threaded desktop software?
Or, in other words, to what extent does this unfortunate, outdated and right from the beginning "bad" x86 chip design hold us back today, and to what extent is it just the way things work in modern computing (as rhyder claims)?
>>how would they deal with the growth pain of 64bit?<<
Uh? You'd better research your facts before writing: MIPS and PPC have 64-bit variants, and got them without too much trouble I'd say.
As for instruction compactness, there are ways to have a very RISC-type CPU and get very good instruction compactness, such as the ARM Thumb-2 ISA, granted it's 32-bit only.
the problem with the x86 at the time of the design and introduction of itanium, was its limited IPC - even with superscalar decoders, IPC higher than 1 is rarely reached, because of interdependent and / or sequence disrupting and/or memory accessing instructions in the sequence - and according to chip-architect.com even recent Athlon64's / Core's don't change this situation
then, it was thought that the best way to achieve instruction parallelism was to implement it in the code and support it at the ISA level, and actually the itanium IS fast, with higher performance/clock than, for instance, x86, iff used with carefully optimized code
the problem with the itanium is that it imposes an entirely different paradigm on the ML programmer / compiler: the latter can (or must) rely on a great many registers, but he must handle instruction parallelism (then ordering, then dependency) on his own, and thus a great deal of complexity is moved from the out of order execution logic to the developer (or compiler)
and, in a world where ML is avoided by nearly all but embedded or operating systems developers, and those developers who do program in assembly language are mostly used to X86, a system that imposes a dramatic shift in addition to losing backwards compatibility is inherently relegated to a niche and suffers the development pace (if any) of a niche segment
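To give a feel for what "handling instruction parallelism on his own" means, here is a deliberately simplified Python sketch. It takes a list of hypothetical three-operand instructions, each written as (destination, sources), and greedily splits them in program order into groups whose members do not depend on each other. This is loosely analogous to the way IA-64 code marks groups of independent instructions, but it ignores the real architecture's detailed rules, predication, memory dependencies and bundle templates. The function name and register names are invented for illustration.

```python
def group_independent(instrs):
    """Split instructions, in program order, into groups with no internal
    read-after-write or write-after-write register dependencies.

    Each instruction is a (dest, srcs) pair of register names.
    """
    groups = []
    current = []
    written = set()   # registers written so far in the current group
    for dest, srcs in instrs:
        depends = dest in written or any(s in written for s in srcs)
        if depends:
            groups.append(current)       # close the group, start a new one
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

# Hypothetical straight-line code: r7 needs r1 and r4, r9 needs r7 and r8.
program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),
    ("r7", ("r1", "r4")),
    ("r8", ("r2", "r5")),
    ("r9", ("r7", "r8")),
]

groups = group_independent(program)
print([len(g) for g in groups])  # [2, 2, 1]
```

On an out of order x86, the hardware discovers this grouping at run time. In the model described above, the programmer or compiler has to find it ahead of time and encode it in the instruction stream.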
So you assume that carefully crafted IA-64 assembly code that runs on the fastest Itanium would outperform equally well optimized x86 code on the fastest x86 CPU?
That sounds interesting. Are you aware of any performance comparisons that would back up such claims?
(I'm not necessarily talking about "hard facts" such as scientific research here, I'd also be interested in something like personal experience and such - even anecdotes might be interesting.)
I believe that hand-optimized code for the IA-64 would be hard to find, not only for the reasons that you gave, but also since the IA-64 programming model seems to be so complex that even Intel recommends against human optimizing attempts:
"developers are not advised to use this as a guide to assembly language programming for the Itanium architecture."
From Intel® Itanium® Architecture Software Developer’s Manual, page 149 in 24531705.pdf, available at http://developer.intel.com/design/itanium/manuals/245317.htm
Are the compilers already good enough to provide a real performance advantage over x86?
All in all, thanks heaps for your interesting answer. I wasn't even aware of the claim that Itanium CPUs might be faster than x86 if only the code were optimized for IA-64.
From what I've been reading on the web, Itanium seemed like a CPU that didn't even have the potential to be faster than x86!
me too, since the PPC is a decent RISC but not such a special one - for instance, it lacks something i deem vital in order to avoid unnecessary branching (ie conditional execution), is not orthogonal, and like all RISCs, has what can be considered a drawback from a developer's point of view (more on this later)
it's not that it's so much better than x86 in this sense...
interestingly, today's processors actually have tens or hundreds of registers (they just don't expose that many as GPR's, and use them for register aliasing -an ability crucial to out of order code execution- instead) but it's not the main factor for low efficiency. or better, low silicon usage effectiveness - complexity mostly arises from the processor front end (instruction fetch/align/decode) and dynamic execution (schedule / reorder) logic - for both, sophistication ( then complexity ) is needed in order to cope with today's performance needs, and so is not entirely ISA-mandated
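The register aliasing mentioned above (more commonly called register renaming) can be sketched in a few lines of Python. This is a toy model under loud assumptions: an unlimited supply of physical registers named p0, p1, ..., no freeing of registers, and instructions reduced to (destination, sources) pairs. The function `rename` is hypothetical and not how any particular CPU implements it. The idea it shows is that every write gets a fresh physical register, so reuse of a scarce architectural register such as eax no longer forces unrelated instructions to wait for each other.

```python
from itertools import count

def rename(instrs):
    """Map architectural registers to fresh physical registers on every write."""
    fresh = count()
    mapping = {}   # architectural register -> current physical register

    def current(reg):
        # A register read before any write gets a physical register on demand.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for dest, srcs in instrs:
        new_srcs = tuple(current(s) for s in srcs)  # read the old mappings first
        mapping[dest] = f"p{next(fresh)}"           # then allocate for the write
        renamed.append((mapping[dest], new_srcs))
    return renamed

# eax is reused by the third instruction for an unrelated value.
program = [
    ("eax", ("ebx",)),
    ("ecx", ("eax",)),
    ("eax", ("edx",)),
    ("esi", ("eax",)),
]

for instr in rename(program):
    print(instr)
# ('p1', ('p0',))
# ('p2', ('p1',))
# ('p4', ('p3',))
# ('p5', ('p4',))
```

After renaming, the third and fourth instructions use p3, p4 and p5 only, so they no longer conflict with the first two and could execute in parallel with them.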
iirc the most recent PPC, the G5, used microcode and drew the same if not more, power than a contemporary Athlon64, to achieve more or less equivalent performance...
but the hype around netbooks is a recent thing - if we think we are seeding X86-irrelevance, we're only going to reap it in the long run
maybe we never will, since, netbooks being just smaller notebooks, people rightly expect them to run the same SW they run on normal notebooks (if not even notebooks they already own), starting with X86 versions of windows
nowadays it's not like that anymore but unfortunately it's at that time that the original x86 ISA was conceived
in principle, more general purpose registers give programmers room to think and optimize functions composed of register to register only operations, without external memory accesses other than loading arguments and saving the result
this is a code design approach that entirely fits with the philosophy of a RISC, load/store architecture
BUT, while 8 GPR's are too few for this kind of code design (which, btw, is less intuitive for many assembly programmers -at least all those i know, maybe too used to CISC architectures- because of the added indirection and the need to track which register holds which datum taken from which VM address, in contrast to simply treating VM addresses as variables), it has been found that few algorithms really require as many as 32 registers, while 16 should suffice for most algorithms
on the other hand, larger GPR banks require a smarter compiler to optimize register allocation and usage and in some cases actually take longer to save (thus impose overhead on the context switch operation)
then, i'd consider the 16 registers in the X86-64 isa as an example of "compromise design" - a decent compromise, but nonetheless one that could have been made MUCH earlier (in the pentium or pentium pro, if not 386, days)





