How do you boot a computer from punch cards when the computer has no operating system and no ROM? To make things worse, this computer requires special metadata called “word marks” that can’t be represented on a card. In this blog post, I describe the interesting hardware and software techniques used in the vintage IBM 1401 computer to load software from a deck of punch cards. (Among other things, half of each card contains loader code that runs as each card is read.) I go through some IBM 1401 machine code in detail, which illustrates the strangeness of the 1401’s architecture and instruction set compared to a modern machine.
I simply cannot imagine what wizardry these newfangled computers must’ve felt like to the people of the ’50s, when computers first started to truly cement themselves in the public consciousness. Even though cars have been around for twice as long, I find a world without them far, far easier to imagine and grasp than a world without computers.
As someone who collects older computers, I find it fascinating just how far computers have come. If you compare them to something like the motorcar, cars have barely changed at all since the 1930’s. Sure, modern ones have more bells and whistles, but a car from the 1930’s runs and drives nearly the same as one from the 2010’s.
However, plonk someone in front of a computer from even the 1980’s and I bet they’d be lost on how to use it. Go further back, to the late 60’s and early 70’s, and you’d struggle to switch it on, let alone get a command prompt. Go further back still, to the 1940’s and 50’s, and you’re on the cusp of the computer revolution, when the things were made from relays and tubes, totally unrecognisable to most people today.
There is something I’ve learned from studying mainframes and minicomputers, and the tech revolution from the 40’s to the 90’s, though. And that is that machines today have kind of stagnated. Even 20 years ago, there were at least half a dozen ISAs in common use, with a lot of big tech companies even designing their own architectures (DEC, HP, IBM, Motorola, just to name a few). Today, however, everyone has standardised on ARM and x86, and to be honest, this isn’t very good. Big developments in architectures have stagnated, performance improvements are marginal, and the need to upgrade computers regularly has largely become a thing of the past (a good thing!). But with essentially a monoculture in computing, we have the same issues as a monoculture in agriculture: pests and disease (malware, viruses) can affect wide swathes of the technology sector, rapidly spreading to nearly every piece of computing equipment deployed in the world.
I think I got a bit off-track here… Old computers are cool though, and everyone should own at least something from the 80’s, like a C64 or Sinclair Spectrum.
PCs have stagnated and become eternally bloated… but that doesn’t mean there is no innovation going on; for instance, GPUs are going through a lot of changes right now.
The stagnation over the past decade was due to AMD running out of gas and not being able to compete with Intel… the pedal is back to the metal there for a few years at least. Also, we now have EUV lithography… which has reinvigorated gains at the fabs.
Also… a PC from 5 years ago is vastly slower than a current one, and will be annoyingly slow to use in many situations.
Citation Needed
The modern analogy would be a Raspberry Pi 400, or a similar “hobby” computer where you are closer to the actual parts.
https://thepihut.com/blogs/raspberry-pi-roundup/raspberry-pi-400-teardown
When kids see a tablet, they only interact with the screen and don’t get to know what is behind that glass. However, there are some good kits that help you make your own tablet, laptop, or even robots:
https://www.amazon.com/Freenove-Quadruped-Compatible-Raspberry-Processing/dp/B071HDJMJ4
Why are people so obsessed with Instruction Sets?
The whole reason we have coalesced around 3 main ISAs is that, for the most part, instruction sets were a problem solved long ago. Very few people program in assembly code directly, so for most intents and purposes it really makes no difference that we have x86, ARM, and RISC-V, with some minor outliers here and there.
Furthermore, ISA and microarchitecture were decoupled long ago. And from that perspective we now have more diversity in processor designs than we have ever had before.
You now have vendors like Intel, AMD, Apple, Qualcomm, Huawei, Samsung, IBM, Amazon, etc. making their own processors, with volumes in one year that eclipse what any of the dinosaurs did during their entire lifetimes.
I couldn’t disagree more with what you wrote. If anything, this is one of the periods with the greatest microarchitectural variety. Who cares if the ISAs are fewer? That is a good thing; it means the portability problem has been seriously reduced.
Seriously, you now have machines in the palm of your hand with more computing power than a supercomputer from 20+ years ago… running all day on a battery!
Y’all seriously think that’s “stagnating”?
Yes, but as with everything there are tons of tradeoffs to consider. A complex ISA leads to a complex decoding pipeline, which leads to more decoding latency and a greater dependency on uop caching. All things being equal, it’s better to have a simpler pipeline. In practice, the primary reason we stick with x86, an architecture with roots in 8-bit processors, is backwards compatibility.
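To make the decoding point concrete, here’s a toy sketch (not how any real front end works, and the instruction lengths are made up): with a variable-length encoding you can’t know where instruction N+1 starts until you’ve at least partly decoded instruction N, whereas a fixed-width encoding lets you hand the stream to parallel decoders immediately.

```python
# Toy illustration only: instruction lengths are invented; real x86 instructions
# range from 1 to 15 bytes and real decoders use length-predecode tricks.

def boundaries_variable(stream, length_of):
    """Sequential scan: each boundary depends on decoding the previous instruction."""
    pc, starts = 0, []
    while pc < len(stream):
        starts.append(pc)
        pc += length_of(stream[pc])  # must inspect the opcode to learn the length
    return starts

def boundaries_fixed(stream, width=4):
    """Fixed-width ISA: every boundary is known up front, so decoders can run in parallel."""
    return list(range(0, len(stream), width))

# Pretend opcode 0x0F starts a 3-byte instruction and everything else is 2 bytes.
fake_length = lambda opcode: 3 if opcode == 0x0F else 2
print(boundaries_variable(bytes([0x0F, 0, 0, 0x90, 0x55, 0x0F, 1, 2]), fake_length))  # [0, 3, 5]
print(boundaries_fixed(bytes(16)))  # [0, 4, 8, 12]
```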
Well, part of the problem with having too few viable ISAs is the lack of competition. For better or worse, the Wintel monopoly kept x86 ahead of everything else, and for better or worse our patent system has blocked competing x86 implementations. We’re fortunate that Intel granted AMD a license, but in general this lack of competition hurts us. ARM only managed to grow its market share thanks to new markets the Wintel monopoly hadn’t reached (thank god).
Instruction decoding nowadays takes a single-digit percentage of the overall power/area budget. It has been a non-issue for almost 2 decades compared to other major limiters of performance. Seriously, people need to move on.
Complex ISAs pay the price in more complex decoding logic in their fetch engines, whereas simpler ISAs pay the price in higher instruction cache size and bandwidth requirements. There’s no free lunch; for example, Apple’s M1 needed the largest L1 cache of any processor to match x86 performance.
Yes, backwards compatibility is the main reason we have x86. Intel tried to replace it a few times and the market spoke: most customers buy computers to run software, not to program in assembly. It’s also because once Intel figured out how to execute x86 out of order, the ISA became a non-issue.
In the end the market will correct Intel’s misdeeds: they became stagnant and now everybody else is going to eat their lunch. In a decade, Intel could be a significantly weaker company, now that ARM vendors have matched x86 performance and have access to superior fab technologies. With Windows and macOS on ARM being performant, x86 could see a much reduced footprint in the next few years.
javiercero1,
Except that adding complexity isn’t just a con because it takes up more silicon area; it’s also a con because it adds prefetch latency. People like me aren’t criticizing x86 merely out of dislike for Intel (I’m using an x86 box right now), but because there are real tradeoffs that make it less optimal than it could be.
This doesn’t contradict anything I’ve said, though. There are tradeoffs everywhere, and we can’t just handwave them away simply because they’re handled in the microarchitecture. The uop cache is relatively small, and the core can still end up bottlenecked by slow prefetch.
You must mean Intel’s performance, since the M1 is still behind AMD’s. The most obvious advantage Apple has over Intel is the fab process node. It isn’t clear who would win if they were using the same fab technology. It would be interesting to see how much of a difference it makes; if you know of any papers about this, please do link them, as that could be informative 🙂
It’s less of an issue, but it doesn’t eliminate it altogether. It is still advantageous to have more efficient prefetch. Less complexity is often better and reduces the need for tradeoffs in other areas. x86 has accumulated tons of cruft over the years of repeatedly extending the architecture, and the result is just not as optimal as it should be in terms of code density/alignment/complexity/etc. IMHO this is one of the reasons both Intel and AMD struggle to compete with ARM CPUs in low-power applications; even x86 Atom processors designed for low power aren’t as efficient.
Indeed, but on the other hand Intel’s products are in stock while shortages ravage the rest of the industry. I didn’t think it was possible, but the shortages got even worse than last month, and street prices are rising to astronomical levels to reflect that. Retail cards have been out of stock everywhere, but up until last month you could still buy them preinstalled in a new computer. This month, however, that changed: the boutique computer vendors themselves don’t have any stock either, and they have to put customers on very long waiting lists, months out, with no guarantees… it’s really getting ugly.
https://www.hardwaretimes.com/nvidia-aib-partners-getting-less-than-20-units-of-the-rtx-3080/
Last year I was hoping 2021 would be better, but some analysts are predicting the shortages will persist through 2022. My disappointment aside, it does give Intel a chance to catch up and prevent AMD and others from taking too much market share.
Prefetch latencies in the fetch engine are irrelevant, since they are absorbed by the out-of-order execution and the sizing of the buffers. Plus, they should not be visible to the programmer anyway.
Again, a RISC machine will need far larger buffers and i-cache bandwidth for its prefetch than the x86. So there’s no ‘free’ lunch; there’s nothing magical about a simple instruction that makes it more “efficient”; you’re simply kicking the complexity can elsewhere.
I feel some of you are obsessed with problems that were solved decades ago. Both AMD and Intel have some of the best architecture teams, which figured out how to size the speculative structures in their fetch engines so that at this point their execution clusters are not bottlenecked by the x86 decoding process. The fact that x86 cores have been at the top of the performance curve for basically 2 decades now should have been enough of a hint that x86 decoding is most definitely not a bottleneck. Alas…
If you scale AMD’s Zen core to the same process node, you get very similar power/performance numbers to an Apple M1. The main reason Intel/AMD haven’t scaled down to the places where ARM went is basically a cultural one: they simply lacked the expertise to build SoCs, since they were focused on other areas. Just like it took a long time for ARM SoC designers to scale up to where AMD and Intel have been selling their systems.
I hate repeating myself: x86 decoding is not a limiter of performance or power. It’s a single-digit overhead; it’s basically noise. Y’all are obsessed with things that were solved long ago.
Ever since out-of-order execution has been a thing, the ISA has been basically irrelevant. At this point, what makes or breaks ARM/x86 in terms of performance and power is the microarchitecture and fab node technology, not the ISA. In fact, modern high-performance CISC and RISC designs are 95% the same at this point in terms of what the transistors are used for.
For example, when I worked at Intel, we designed our future x86 microarchitectures using a simulator that used the Alpha ISA, precisely because the ISA was irrelevant for the most part.
javiercero1,
Prefetch latencies are not irrelevant. When uops are not available in the cache, the pipeline can and will underperform.
I didn’t say there was a free lunch; the basic principle, though, is that everything is a tradeoff. You can’t look at a single piece and say “X” is irrelevant or “X” has been solved, because you have to look at the architectural trade-offs as a whole. There are real opportunities to improve the ISA by reducing complexity without drastically affecting code density. Instruction modifiers and code alignment are good places to start, and one could improve prefetch performance without increasing latency, power, or transistor counts.
The debate has never been “solved” in favor of CISC. x86 won thanks to the monopoly and tons of money being thrown at it. And yes, x86 chips have long sat at the top of performance curves, but a CPU’s position in the marketplace is in no way proof of the absence of a bottleneck.
Your saying that doesn’t make it true, though. The fact is that prefetch complexity does add both latency and power consumption. To an extent the uop cache can mask some of this, particularly for tight loops, but instruction sequences that don’t fit in the cache will result in pipeline stalls, which is a very good reason to minimize prefetch complexity. On top of that, the transistor budget that would have gone to the instruction-decoding engine is up for grabs and can be allocated to further improve performance in other areas, like the additional L1 cache in the M1.
The x86 decoder is less than 4% of a modern x86 core. It is a small price to pay, especially since it gives AMD and Intel tremendous added value in terms of backwards compatibility.
The RISC vs. CISC ISA debate was settled once aggressive out-of-order superscalar execution became a thing: it doesn’t make much of a difference. ISA and microarchitecture were decoupled long ago, and the execution engines of a modern high-performance ARM core and a modern Intel core are basically the same.
The fact that x86 has been at the top of the performance curve for 20+ years, and that it took ARM 3+ decades to catch up with it in terms of IPC, should have been a hint that x86 decoding is not a main limiter of performance, and that the Intel/AMD people figured out how to size their fetch engines.
BTW, you have it backwards: it’s not that reducing the complexity of the x86 decoder would free up resources to increase performance, but rather that the simpler ARM instructions require those extra resources to match the complex x86 decoder’s performance. The M1 needs a monstrous 128KB L1 to keep its highly speculative fetch engine happy, whereas the x86 part needs a smaller L1 to sustain the same issue rate for its execution engines. So either you have simpler instructions with increased i-fetch bandwidth, or you have complex instructions with increased decoder resources. At the end of the day, both fetch engines end up delivering the same issue rates to their execution engines with roughly similar transistor budgets. It really is irrelevant nowadays.
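A back-of-the-envelope way to see that tradeoff, with made-up numbers purely for illustration: a denser variable-length encoding fetches fewer bytes per instruction at the cost of more decode work, while a fixed 4-byte encoding flips that.

```python
# Hypothetical averages, not measurements: just to show where the bytes go.
instructions_per_cycle = 6        # target issue width for both front ends
avg_bytes_variable = 3.2          # assumed average size for a variable-length ISA
avg_bytes_fixed = 4.0             # fixed-width RISC-style instruction size

bw_variable = instructions_per_cycle * avg_bytes_variable  # I-cache bytes per cycle
bw_fixed = instructions_per_cycle * avg_bytes_fixed

print(f"variable-length fetch: {bw_variable:.1f} B/cycle, fixed-width fetch: {bw_fixed:.1f} B/cycle")
# The ~25% extra fetch bandwidth on the fixed-width side is the price paid for its
# much simpler decode; the variable-length side pays instead in decoder complexity.
```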
If you scale an AMD Zen3 core and an Apple Lightning core to the same node, you end up with just about the same power/performance/area numbers. So the ISA is not what defines an architecture nowadays. In fact, AMD was basically using the same core for their own ARM project that went nowhere.
Yes, the x86 ISA has some really daft things. But those are very rare corner-case instructions that execute once in a blue moon. And compared to the size of things like caches, large register files, reorder structures, and parallel units, the overhead of decoding those instructions is really a non-issue, considering it buys Intel and AMD backwards compatibility.
There have long been differing opinions on this very topic. Just because that’s your opinion doesn’t make it settled. Also, something using ~4% of the die doesn’t mean it’s only 4% of the power usage or 4% of the overhead. Which transistors are actually used, and their duty cycles, naturally changes with the specific workload.
If you’ve got AVX-heavy workloads, they can have completely different CPU bottlenecks than, say, a PHP script or a game. Naturally, workloads that fit in the uop cache are not going to incur much prefetch overhead, but those that don’t increasingly will.
It’s not just “my opinion”: I work in the computer architecture field, and my PhD was on design complexity, of all things. I’m simply, in a polite way, letting you know that CISC vs. RISC instruction decoding hasn’t been an issue for decades as far as the computer architecture community is concerned.
x86 decoding hasn’t been a major limiter of performance since the P5, and especially not since the P6. And again, the fact that x86 cores have been among the top-performing CPUs for 2+ decades should have been a hint that CISC decoding is not a limiter.
Are there some inefficiencies associated with x86 decoding? Absolutely. But that’s what computer architecture is: an art of sizing and tradeoffs. There are tons of inefficiencies with RISC as well, in other areas.
The x86 decoding takes <5% of the power/area budget. The speculative execution and reordering take orders of magnitude more area and power resources, and that’s the same for either RISC or CISC designs.
In fact, high-end RISC designs like POWER or even Apple’s ARM cores do something similar to the x86 fetch engine and break down the RISC instructions into smaller nano-ops to be scheduled in their execution engines.
I don’t know what your last comment is supposed to mean; are you confusing use case/workload with instruction decoding?
FWIW, that you’re still harping on something that takes less than 5% of the budget is a testament to how good the RISC guys’ marketing/FUD was back in the day. And I am willing to bet that you came of age in the computing field in the late-80s/early-90s time frame.
javiercero1,
You’ve only shared your opinion, not facts. There is longstanding disagreement about CISC versus RISC among researchers. x86 had the benefit of a monopoly and mountains of money being thrown at it, which historically has meant the playing field was never even. Still, given enough investment in alternatives, it’s quite conceivable a simpler RISC architecture will come out permanently ahead in terms of both performance and efficiency.
Well, that’s what I’ve been saying all along. It’s not enough just to make assertions about one part; you have to consider all the tradeoffs, and also how the CPU performs under various workloads.
You are still making the mistake of conflating power and area. They’re fundamentally different and not interchangeable! If you’ve got code that only uses a small subset of the transistors on the CPU die, those transistors can nevertheless consume the majority of the CPU’s power for a specific workload. It really isn’t enough to look at die area and assume that a block’s overhead is proportional to its size.
You are not up to date on what the computer architecture research community is concerned about if you really think CISC vs. RISC is still a lively debate among us. It makes for a good fireside chat for the old folk telling war stories, but that’s mostly it.
That x86 processors have been consistently at the top of the performance curve is not my “opinion.” It’s a simple, verifiable fact, and it should have given you a massive hint that CISC decoding may not be as great a limiter of performance as you think it is.
Furthermore, we know the power density of the design libraries, so there’s a reasonably good correlation/estimation between area and power. It’s common practice in the microarchitecture community to use area as an indicator of a structure’s complexity.
That you’re still concerned about something that is down to single digits, basically error/noise, at this point, when most of the limiters in a modern core lie in the other 90+% of the chip (mostly the speculative/out-of-order/cache/memory-controller stuff), is a testament to how good the RISC guys’ marketing was in the 80s.
You think the main reason that mythical super-simple RISC machine hasn’t taken over the world is some evil machination by the CISC overlords, which suggests that computer architecture is not your area of expertise.
The fact is that every RISC vendor ended up with super-complex architectures that were, for the most part, similar to their CISC competitors’. Whether the ISA is fixed-length and register-to-register doesn’t matter as much when things like pipelining/superscalar/SMT/out-of-order are at play.
Apple’s Lightning, IBM’s POWER9, and the SPARC M8 don’t get their performance because they use RISC instructions. They get it from huge register files, very wide execution engines, speculative execution, and reordering structures. And the same goes for Intel’s Core and AMD’s Zen.
There have been times when IBM has actually had RISC cores that were bigger than Intel’s or AMD’s, for example.
Again, ISA and microarchitecture have long been decoupled. Yes, x86 trades reduced instruction-bandwidth requirements for more complex decoding, and ARM does the opposite: it needs more fetch/speculation/buffering resources but uses a more straightforward decoding path. You’re concerned about spills in the x86 uop trace cache, but ignore the horrible latencies of the more common I-fetch misses on ARM.
In the end, after 5 decades, it turns out RISC vs. CISC doesn’t freaking matter when all else in the system is equal.
It really is time to retire those acronyms, because they keep confusing people, like you, who still think ISA complexity is what defines/drives an architecture. Both acronyms are almost irrelevant nowadays.
That is your opinion, but it doesn’t make anything I said wrong.
This line of reasoning is circumstantial at best and does not provide direct evidence that x86 does not suffer from ISA complexity. As already mentioned, x86 has benefited from a very uneven playing field. This may well change if x86 loses its monopoly status, which could finally happen this decade… we shall see!
Die area can be an indicator of complexity, sure, but the complexity/size of a subsystem is NOT equivalent to power consumption BECAUSE you’re overlooking things like duty cycle, which is very dependent on workload.
Say 90% of the chip has a 10% duty cycle and 10% of the chip has a 100% duty cycle under a given workload… well, guess what: that 10% of the chip can be responsible for about as much power as the rest of the chip combined.
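To put rough numbers on that, here’s a toy model (it assumes power is simply proportional to area times duty cycle, with uniform power density and no leakage, which is of course a simplification):

```python
# Toy model: block power ~ area_fraction * duty_cycle. Uniform power density
# assumed and leakage ignored; the numbers are the hypothetical ones above.
blocks = {
    "rest_of_chip": {"area": 0.90, "duty": 0.10},
    "hot_block":    {"area": 0.10, "duty": 1.00},
}

power = {name: b["area"] * b["duty"] for name, b in blocks.items()}
total = sum(power.values())

for name, p in power.items():
    print(f"{name}: {p / total:.0%} of dynamic power")
# rest_of_chip: ~47%, hot_block: ~53% -- the block with 10% of the area draws
# about as much power as the other 90% of the chip combined.
```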
That’s the point: you should not assume something isn’t a bottleneck, or doesn’t consume a disproportionate amount of power, based on the proportion of transistors it uses on the die. Things are much more complicated than that, and if you have the credentials you say you do, then you should understand that I am right.
Anyway, even if you naively insist that the 4% area figure is a stand-in for power, by your own number that 4% is still an advantage for RISC: more thermal headroom, etc.
It’s ok for us to have a difference of opinion, but continuing to ignore my points and focusing on arguments from authority has been disappointing. I don’t expect you’re going to change your tune though, so perhaps it is best we agree to disagree.
Sigh. Here we are again.
Are you seriously lecturing me with basic intro-to-EE concepts? LOL
The area of a structure is a very good estimator of its complexity with respect to the overall design. We don’t design for the pathological corner use cases; we average over a bunch of representative use cases (what you call a workload, I guess). It turns out that area, coupled with the power density of the design rules, is a good first-order approximation for understanding where the design trends lie.
That’s how structures are sized, by looking at what tradeoffs provide the best average for all the representative use cases.
It would be very naive to think both Intel and AMD, with decades of experience in this matter at this point, have somehow missed some fundamental inefficiency in their decoders, or that their architectures are fundamentally unbalanced and unable to absorb the latencies in that structure.
Very little literature in the top conferences has been devoted to instruction-decoding complexity in the past decade, because it stopped being a main limiter of performance long ago. There are some iterative refinements, but it’s not a major issue.
The x86 decoder is less than 5% of the overall budget for a modern x86 design. That doesn’t mean the ARM equivalent can deliver the same performance with 5% less area, because RISC instructions require their own complexity elsewhere in the fetch engine.
Most modern architectures are decoupled: they have a fetch engine that speculatively fetches instructions, keeps the caches warm, and breaks the instructions down into nano-ops to be fed asynchronously to a massive out-of-order execution engine.
It turns out that under that framework, both the RISC and CISC instruction-encoding approaches end up with similar design answers in terms of power and area for the same performance. If you scale a Zen3 core and the big core in Apple’s M1 to the same node tech, you get ridiculously similar power/performance/area numbers.
There’s a reason I said RISC and CISC are concepts that should be phased out, since they are meaningless at this point: they confuse some people, like you, into thinking there’s some “woo magic” that makes a RISC core somehow intrinsically different from, or better than, an x86 core, when that’s not the case.
Especially if you look at things like POWER being a rather extensive ISA, SPARC taking a decade and a half longer than x86 to implement out-of-order execution, or Apple needing a 128KB L1 to match the IPC of a modern x86.
I gave you a quantitative fact: x86 performance has been consistently at the top of the SPEC results, especially in terms of IPC. From a quantitative point of view, that should give you a hint that x86 decoding is not limiting their performance. And your reply is some idiotic, tangential, conspiratorial nonsense about how “evil” Intel is.
The main reason Intel cores have been so performant is that Intel employs some of the best computer architects in the field.
Other companies that have access to CPU design teams with similar competence, like IBM, Oracle, and Apple, end up achieving similar single-thread performance.
This is a very exact and quantitative field and community; it’s most definitely not the idiotic telenovela you are making it out to be.
Intel cores get most of their performance from the microarchitectural resources they are able to design and execute into their cores. When RISC competitors do the same and fab a core with similar microarchitectural resources, they get similar performance. Which, again, is an indication that microarchitecture, not ISA complexity, is what drives/limits performance nowadays.
I have no expectation of you taking a minute to comprehend what I wrote. I will go with Asimov on your “agree to disagree.” I’m writing this in the hope that someone else, someone who is interested in expanding their knowledge of this specific matter and is not threatened when somebody who works in the field simply distills and relays information, will get something out of it.
Cheers.
javiercero1,
With all due respect, we end up here because you’re ignoring the cracks in your assumptions, and rather than addressing them you’ve stuck your fingers in your ears and spewed arguments from authority; nothing can be learned from that.
So? This doesn’t say anything about the merits of x86. Intel has done an amazing job keeping it relevant. The microarchitecture is brilliant, but that alone doesn’t solve the complexity problem; it merely hides it behind a very limited cache. And not for nothing, but x86 doesn’t even have optimal code density. Its extensive use of prefixes and variable-sized instructions is not only responsible for prefetch latency, it is detrimental to code density too. You cannot just handwave the problems of an ISA away just because you’ve got a clever microarchitecture. To be fair to Intel and AMD, it’s not their fault; it’s just what we get when we keep recycling an architecture with 8-bit roots.
You haven’t said anything I don’t understand; we just disagree, and you’re refusing to address any criticism. Oh well, you’re entitled to your opinion just like everyone else. Cheers.
No dude, we keep going down these rabbit holes because I am simply amused by how unwilling you are to just listen and extend your knowledge base.
From a code-density standpoint, studies have shown that IA-32 is a bit denser than ARMv7, and x86-64 is slightly denser than ARMv8.
Yes, x86 has some serious warts in its ISA and programming model(s). But the really pathological long x86 instructions that trigger microcode are not that common. And with x86-64, the instruction set has been made much more regular.
And again I have to reiterate: x86 pays the price in slightly increased complexity when fetching and breaking down its instructions into nano-ops compared to the ARM design, whereas the ARM part needs somewhat larger, though simpler, instruction-fetch resources to schedule a similar volume of nano-ops.
As far as Intel and AMD are concerned, the price paid for that complexity is perfectly justified, because it gives their CPUs a better value proposition in terms of access to the installed application base.
Everybody who goes for maximum single-thread performance invariably finds out it’s the microarchitecture, not the ISA, that makes the most difference.
In an x86, the out-of-order machinery absorbs most of the warts of its baroque programming model, like the starved register architecture. Whereas in RISC machines, the out-of-order machinery takes care of some of the poor instruction-scheduling cases their compilers produce.
A modern out-of-order core is huge compared to an equivalent in-order implementation, and most of its complexity is not in the instruction decoding. Compared with the huge multiport register files, the wide execution units, the reorder and write-back buffers, the memory controller, the caches, etc., the complexity of supporting the odd x86 instructions in the fetch engine is tiny. And once the instructions are converted down to nano-ops, the ARM and x86 cores look 90+% the same.
The only ones still obsessed with ISAs are the Berkeley people, who have now spent 40 years on their quest for the holy grail of the optimal ISA. And yet they ended up producing monstrosities like SPARC, which was even harder to get executing out of order than x86.
javiercero1,
Wow, I honestly didn’t expect you to acknowledge this. Anyway, I’m glad you did, because we can finally agree that the ISA does make a difference; the question becomes how much, and what the tradeoffs are.
I’m well aware of the value the market places on backwards compatibility; it has kept x86 at the front of the pack for decades. It is extremely difficult to change that kind of momentum. ARM is finally making inroads, largely thanks to new markets where Intel wasn’t dominant and x86 wasn’t competitive.
It depends, though. The more software uses SIMD to really push the CPU’s execution units, the less impact instruction prefetch has as a proportion of the overall load. This is the ideal scenario for CISC. And to be sure, many workloads can benefit from these sorts of intensive vector operations, but they are also the same kinds of workloads that translate well to multi-core parallelism. With many single-threaded workloads, however, the sequential logic can be far more difficult to parallelize; in those cases keeping all the execution units busy becomes harder, and suddenly instruction prefetching can become a bigger bottleneck.
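Here’s a rough sketch of that proportion argument with invented numbers (it assumes a fixed front-end cost per instruction and that one SIMD instruction does the work of 8 scalar ones, so treat it as an illustration, not a model of any real core):

```python
# Toy model: front-end (fetch/decode) cost as a share of total cost.
# Per-instruction and per-element costs are invented for illustration.
elements = 1_000_000
frontend_cost_per_insn = 1.0      # arbitrary units
execute_cost_per_element = 1.0

def frontend_share(instruction_count):
    frontend = instruction_count * frontend_cost_per_insn
    execute = elements * execute_cost_per_element
    return frontend / (frontend + execute)

print(f"scalar loop: {frontend_share(elements):.0%} of cost in the front end")
print(f"8-wide SIMD: {frontend_share(elements // 8):.0%} of cost in the front end")
# scalar: 50%, SIMD: ~11% -- wide vector work shrinks the front end's share of the
# total, while branchy single-threaded code keeps it comparatively large.
```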
If it weren’t for the stubbornness of the software industry, I think architectures with explicit parallelism are, on paper, superior to those that achieve parallelism through speculative execution in long pipelines. And let’s not forget the security dangers of speculative execution. But I concede that changing the industry’s momentum is extremely difficult. On the bright side, though, GPGPUs offer us great opportunities for explicit parallelism. Traditional CPU architectures are unable to keep up, and the gains are too large to ignore. So my prediction is that computationally intensive vector processing will continue to migrate off the CPU to GPUs, and the CPU will ultimately be relegated to a supporting role handling sequential workloads.
Fascinating article.
This computer had an OS. The difference was that it was outside the machine and was run on (by) operators. To the programmer, sitting in a different building, handing handwritten code to a data-entry person, and getting a printout that evening, it was a fine OS for its time. Not a perfect OS, but then what OS is?