“At next week’s Intel developer forum, the firm is due to announce a next generation x86 processor core. The current speculation is that this new core is going to be based on one of the existing Pentium M cores. I think it’s going to be something completely different.”
It would be cool if they had something more interesting to show for the time we’ve spent with Prescott than just releasing a Pentium M with a fast FSB, 64-bit support, and dual cores…
But hey I’m just psyched to see Pentium 4 finally die.
Like a G5 maybe??? Would make it easier for Apple
That’s pretty exciting! I really hope it’ll be something different, like Nicholas said, not a Pentium-M derivative. If it’s indeed the latter, we may safely conclude that any innovation in the IT industry has finally expired and gone to meet its maker.
Almost everything about processor design that drives large-scale power consumption has to do with latency avoidance at all possible costs. When you throw that out the window and realize that massively pervasive threading can hide most latencies, then processors can become pretty simple again and can be had for pennies a pop (even a 486-like core at 1GHz).
Over the last 15 years Pentiums have gotten only about 30x faster (Tom’s Hardware P100 to P4) and have used a clock that is also 30x faster, but the DRAM memory wall has barely moved (maybe 100ns to 60ns RAS cycles). The logic/transistor cost, though, went up maybe 100-fold just to get clock-to-MIPS performance to scale. This is very bad engineering.
Imagine building a suspension bridge that went 10x the distance but used 100x more steel to do the job.
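To put rough numbers on that gap in CPU cycles rather than nanoseconds (assuming roughly a 100 MHz P100 and a 3 GHz P4):

P100: 100 ns per DRAM access x 100 MHz ≈ 10 CPU cycles per access
P4: 60 ns per DRAM access x 3 GHz ≈ 180 CPU cycles per access

So measured in cycles the memory wall got about 18x higher, even though the raw nanoseconds improved.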
By having 16 CPUs each running 4x slower (say 1000 MIPS each), the memory wall comes down to more manageable levels. One could go much further; several companies can already pack several hundred simple CPUs onto a single chip, but then memory bandwidth in/out of the chip becomes the killer. Even a high-end FPGA can pack >100 small CPUs, but only at 150 MIPS each, sigh.
The real way to move forward is to thread the memory system too, so that DRAM can issue new memory requests at rates far closer to CPU clock rates. Micron has RLDRAM which does new cycles every 2.5ns and has a latency of 8 clocks (20ns). Hide those clocks with 8-way threaded CPUs and the DRAM now looks only a few times slower than the CPU again. Now it takes a lot more than just that to make a big jump forward.
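Rough arithmetic on why the threading hides it (using those RLDRAM numbers, and assuming each thread keeps an independent request in flight):

RLDRAM issue rate: one new request every 2.5 ns
RLDRAM latency: 20 ns (8 clocks)
Threads per core: 8, so by the time a thread is scheduled again (8 x 2.5 ns = 20 ns later) its data has already arrived

In other words, with 8 requests in flight no thread ever sits stalled for the full 20 ns; the latency is still there, it is just covered by the other threads’ work.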
As long as single-threaded PCs use conventional, almost single-threaded 60ns DRAM, they are going to burn watts.
transputer guy
Thanks for the info, tg.
For the rest: remember this article is speculation. But it does describe a way to move away from x86’s hindrances while maintaining compatibility. I’m just not sure how anyone wants to implement ‘translating binaries and then saving them’ on today’s more security-paranoid/aware OSes. I’m not sure users appreciate their files being modified without their awareness, especially developers developing cross-platform.
As far as I can remember, x86 has only been a frontend since the original Pentium was released.
And I blame AMD for prolonging the life of the x86 ISA by going the “simple” route and spitting out their Opteron and derivative CPUs.
It has long been my belief that Intel originally didn’t want to extend x86 to 64-bit, but instead wanted to start over with a new ISA that is better optimized and more future-proof, when the time was right.
Intel have tried that once already and burnt a lot of HP’s money in the process with the Itanic.
So does the Apple deal give them a second shot at it?
I don’t know, but unless Apple have seen real working CPUs already, I don’t think Mr Jobs would follow HP and spend their billions on some wild gamble.
It would be refreshing to get rid of the x86 dinosaur once and for all. There are lots of far more elegant CPU architectures in existence that have not been able to compete with the x86 marketing machine.
The Inquirer article is worth a read. As someone who has had an interest in CPU architecture since Intel made the 4004, I think there are a lot of good points in their crystal ball gazing.
A VLIW with lots of L1 cache (DEC could do it with the Alpha 10 years ago, Intel, so why can’t you?) and really minimal extra gubbins would achieve their desired bang-per-watt figure.
OK, let’s look beyond this and into the future.
I see LOTS, and I mean LOTS, of accountants saying NO, never in a million years, when companies try to implement a server using one of these beasts. Why? Because of the arcane per-CPU licensing schemes used by many major software vendors. Even if this thing ran like shit off the proverbial shovel, the software licensing could kill it. I’m not going to mention those major software vendors who use such a licensing method, but IMHO only Oracle comes out of it with some credibility, with a licensing scheme that treats the partitions of an LPAR system as really different beasts and only charges for those LPARs which actually run Oracle. Are you listening, Mr IBM, etc. etc.?
We shall have to wait until the announcement for the reality. All this speculation could be redundant, BUT Intel does need to do something about removing the x86 noose from around their necks or AMD will kick the sh1t out of them for some time to come.
Intel does make many non x86 CPUs — just not for general purpose computing.
It’s clear the home/workstation market wants x86[-64] chips. If Intel were to shift away, then they’d be letting AMD “kick the sh1t out of them”.
Just one small error in a new chip could give AMD a clear victory.
Also, AMD can run a RISC-like ISA in the Opteron. They could swap out the x86 front-end portions and probably get some benefits.
The problem for Intel is that the new chips MUST be better, faster, and stable, or AMD will mop the floor with them.
Opteron is gaining a lot of steam. Maybe the new chip is really a 64-bit Pentium M with an on-board memory controller.
Anyone remember Sun? They have been talking about massively multicore systems for a long time.
Intel’s focus, and the speculation on how big a shift this will be, reminds me of the days when asynchronous processors were supposedly “the next big thing”. Processor clock timing takes a lot of the power and cuts down on how much the processor gains for a given increase in hertz. I think it was Toshiba that made a pager using asynchronous processors, and the articles back then spoke of a Pentium (1) that was modified and got roughly 3x the speed, IIRC. It is extremely difficult, since they have to actually care about every wire’s length and keep them a lot more consistent.
I doubt they will be going asynchronous, but a CPU like Cell, or like the one this article talks about, would make that transition easier than current processors do. IIRC, using asynchronous design with the mindset of throwing more transistors at the problem would be a nightmare when scaling to faster designs, but adding multiple cores shouldn’t increase the complexity too much.
Having read other stuff by him, I’m going to ignore his current article and just wait for Intel’s announcements next week instead, as well as wait for Hannibal @ ArsTechnica to talk about any news.
took the words right outta my mouth…
He’s right: rolling out a Pentium M derivative with hyper-threading and 64-bit registers isn’t going to do anything to counter the well-known fact that AMD is kicking their ass right now on the 64-bit and multicore fronts. However, if you change the entire paradigm, then comparing Intel’s performance to AMD’s gets cloudier. More importantly, it forces everyone to change how they approach performance, software design, and most importantly, pricing. The notion of cost per core completely goes away as well, because translation makes it irrelevant how many cores are being used.
The author ignores several important considerations, the most important being that the OS kernel has to be written for the new processor. If simply changing the ISA were a slam-dunk performance/watt win, it would have been done long ago.
Creating a new ISA is huge: it requires new kernels and new drivers, not to mention all the internal work. Itanium is supposedly a superior architecture, but it simply hasn’t taken off at all.
There may be large-scale structural changes in the processor, they might have even designed a new microISA for internal processor usage, but I’m betting that they aren’t crazy enough to have come out with a new ISA for desktop platforms.
Not that I wouldn’t welcome a switch from the antiquated x86 architecture, but Intel simply isn’t that radical with its bread and butter.
Itanium was a side project. It was Intel’s attempt to monopolize the CPU market by moving away from x86. Although Intel would have liked Itanium to succeed, the company’s future never depended on it.
Fast forward to today. x86 accounts for the vast majority of Intel’s fortune. AMD is eating Intel’s lunch and gaining market share. Intel has no room to fail. If it fails in the x86 market, it doesn’t have anything to fall back on.
Intel truly is betting the company on this new architecture. It has put all its resources into winning this war: Intel has put all its resources into taking what’s good in Itanium and moving it over to x86. Its CPU has to be better than AMD’s, and in a very big way. The next Intel CPU is going to be really, really big and revolutionary. It is also going to have lots of patents to keep AMD and others from following them. This is the 158 billion dollar Intel gamble. I wouldn’t bet on Intel failing.
Intel would be crazy to do anything like this because existing, mostly single-threaded, applications (and benchmarks) would perform very badly indeed.
That’s the very reason why Apple rejected the Cell and the Xenon.
And the whole concept still has to prove its viability anyway. Attempts so far were always hindered by the difficulty of parallelising applications.
Exactly which apps do most people run that so urgently need to run on single-threaded CPUs? Yeah right, those damn irrelevant benchmarks.
Almost every application I use as an engineer (IDEs, Winamp, video players, Firefox, OpenOffice, FPGA tools) either is multithreaded or could or should be. Now, in university-level CS they still teach algorithm theory as if the world really were single-threaded, with few exceptions. And the real problem is that the model of concurrency in most programming languages is completely hosed if it even mentions locks. For some of us out there concurrency isn’t the big bad wolf it’s been made out to be, but we don’t work at the level of locks either. Some languages are actually quite good at expressing concurrency and can exploit large numbers of CPUs or logic elements (for example occam, Verilog, VHDL).
It must have been Amdahl who cursed parallel computing by suggesting that after 7 CPUs performance starts to go down, and even the 7th CPU doesn’t really add much. It all depends on where you are coming from (in his case, big old IBM mainframes and 1960s IT software).
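For the record, Amdahl’s argument boils down to this formula, where p is the fraction of the work that can run in parallel and n is the number of CPUs:

speedup(n) = 1 / ((1 - p) + p/n)

If p is only 0.9, the speedup is capped at 10x no matter how many CPUs you add, and the 7th or 8th one buys very little; if p is 0.99, hundreds of CPUs still pay off. Which regime you are in depends entirely on the workload, which is exactly the “where you are coming from” point.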
Most users of x86 don’t run anything that needs to be exclusively single-threaded. There are ways of looking at what goes on in a typical Windows box and seeing lots of small CPUs working together, even replacing SSE/MMX with more simple CPUs (MIMD rather than SISD+SIMD). During quiet periods, which is actually 99% of the time, most of those CPUs can shut down and save power with no effort: compute on demand.
There really is no need for any full-blown application to be single-threaded, period, except toy programs, but there are some reasons why parallelizing some tasks by hand remains difficult.
One often hears, in big-O terms, about programs that take O(N) time and take even more total work when parallelized onto n CPUs. The anti-parallel guys forget, though, that these n CPUs are often an order of magnitude or more cheaper than the single-threaded monsters, and that the less efficient speedup gained using n CPUs is offset by the much lower cost of each CPU. Taken to the logical extreme, these CPUs can reduce to nothing more than FPGA LUTs in their hundreds of thousands, and the software becomes fully parallelized hardware.
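Purely illustrative numbers for that trade-off (not measurements, just to show the shape of the argument):

1 big out-of-order core: cost 10 units, throughput 1x, so 0.1x of throughput per unit cost
16 simple cores: cost 16 x 1 unit, 50% parallel efficiency, throughput ~8x, so 0.5x per unit cost

Even a sloppy half-efficient parallelization comes out roughly 5x ahead in throughput per dollar (or per watt), if the simple cores really are that much cheaper.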
transputer guy
Exactly which apps do most people run that so urgently need to run on single-threaded CPUs? Yeah right, those damn irrelevant benchmarks.
Almost every application I use as an engineer (IDEs, Winamp, video players, Firefox, OpenOffice, FPGA tools) either is multithreaded or could or should be.
You might well be right there, but the fact of the matter is that benchmarks do matter (if “only” for marketing), that most applications are single-threaded, and that programmers don’t like to use multi-threading unless absolutely necessary (e.g. for GUI responsiveness).
Therefore Intel would be committing commercial suicide if they went with the transputer concept while AMD is perfectly happy to provide the market with what it’s used to.
And as Rayiner already pointed out, the hardware needed for extracting instruction-level parallelism through out-of-order execution is actually fairly small compared to the real bugbear: the large caches that make up for excruciatingly slow main memory.
The article is entirely speculative and completely without substance.
If it was just a Pentium M variant I don’t think there’d be such a fuss about it… No, this change is bigger.
The far more likely scenario is that Intel is hyping up the processor to cover for what is really just an incremental upgrade. That fits Intel’s historical marketing profile.
Steve Jobs showed a graph with PowerPC projected at 15 computation units per watt and Intel’s projected at 70 units per watt. Intel must have figured out a way to reduce power consumption 4-fold.
That does not logically follow. Even taking Steve Jobs’ 4x number at face value, Intel doesn’t have to reduce power consumption 4-fold. The 4x decrease is relative to the G5, not relative to Intel’s current Pentium M. The G5 is a relatively power-hungry chip with relatively poor integer performance. The current P-M probably has on the order of 2-3x better performance/watt than the G5. It would not take something radically different (a process shrink would suffice) to hit your 4x.
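Rough arithmetic, taking the keynote numbers at face value and using the midpoint of that 2-3x estimate:

Jobs’ graph: G5 ≈ 15 units/W, future Intel ≈ 70 units/W, a ratio of about 4.7x
If today’s P-M already does ~2.5x the G5’s performance/watt, the remaining gap is 4.7 / 2.5 ≈ 1.9x

That 1.9x is roughly what a process shrink plus some tuning buys you, with no architectural revolution required.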
The forthcoming Cell processor’s SPEs at 3.2 GHz use just two to three watts and yet are said to be just as fast as any desktop processor.
Except they are not, not for the kind of code people run on PCs. The SPEs are SIMD FP monsters, but ever since PC graphics cards started handling transform and lighting on-chip, single-precision SIMD FP on the CPU has been relatively unimportant. That’s why nobody really cares when a new version of SSE comes out, and why Athlon 64s school P4s in gaming despite the latter’s very significant advantage in certain FP benchmarks.
but they could use some of the same techniques to bring the power consumption down.
There is nothing magic in the SPEs. The SPEs don’t use a lot of power because they don’t do much of anything besides SIMD FP. They have long pipelines, no cache, little parallelism, no out-of-order execution, no branch prediction, etc. Intel using these “techniques” in a next-gen CPU would be suicide. The thing would basically be a Pentium 4 taken to its logical conclusion — massive theoretical FP performance, but quite useless for use as a central processing unit.
Out of order execution seems to be pretty critical to x86 performance
Out of order execution is critical to integer performance. The poor performance of the SPEs and the PPE on integer code is proof of that. Lack of architectural registers has jack to do with it. There is a reason RISC CPUs like the Alpha and POWER are massively out of order! Indeed, if you take a look at the two variants of SPARC, Sun’s and Fujitsu’s, you’ll see that Sun’s is in-order and has shitty integer performance, and Fujitsu’s is out-of-order and has great integer performance.
The Itanium line, also VLIW, includes processors with a whopping 9MB of cache.
Because VLIW kills your code size and the Itanium is a $3000 chip with an enormous die area!
Intel has a lot of experience of VLIW processors from its Itanium project which has now been going on for more than a decade.
Most of the experience shows that the theoretical advantages of VLIW processors are mitigated by the fact that nobody can write a decent compiler for them! Intel is not stupid enough to bet its desktop processor business on Itanium technology. It’d be suicide.
indeed it has already been developing similar technology to run X86 binaries on Itanium for quite some time now.
One which works very poorly, for the simple reason that the Itanium cares a hell of a lot more about code scheduling than the Alpha did, and a binary translator doesn’t have enough high-level information to do proper optimization on the translated code!
Switching to VLIW means they can immediately cut out the hefty X86 decoders.
Except the x86 decoders aren’t that hefty, and the relative percentage of area spent on x86 decoders has been shrinking for years to the point where it’s not a big deal anymore. Moreover, the decoder/cache tradeoff is a stupid one. Look at the Athlon64 die: http://www.chip-architect.com/news/Opteron_780x585.jpg
It would be liberal to say that the decoding portion takes up 5% of the overall die. A conservative estimate of the size of the L2 cache would be 50% of the die. Even making the cache 10% larger wipes out the benefit of eliminating the decoding section entirely. When you realize that we’re talking about around a 15% cache size increase just to keep up with the increased size of (Itanium-like) VLIW code, we’ve just pessimized the design! Throw in an extra meg or two to hold translations, and your design just plain sucks…
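Working through those die-area numbers:

x86 decoders: ~5% of the die (the liberal estimate)
L2 cache: ~50% of the die (the conservative estimate)
Grow the cache by 10%: 10% of 50% = 5% of the die, which exactly eats the decoder savings
Grow it by ~15% to hold the fatter VLIW code: about 7.5% of the die, a net loss of ~2.5%

And that is before setting aside the extra meg or two to hold translations.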
The branch predictors may also go on a diet or even get removed completely as the Elbrus compiler can handle even complex branches.
With what, magic? The same magic compiler technology that was supposed to save Itanium? You know why the Opteron is kicking Intel’s ass? It doesn’t require magic compiler technology! CPUs that don’t choke on branches will become even more valuable in the future, as software moves to dynamic languages (even C# and Java are dynamic relative to C code).
Intel could use a design with a single large register file covering integer, floating point and even SSE, 128 x 64 bit registers sounds reasonable
Look at how much die space the Itanium spends to make its massive register-file accessible in one clock-cycle. There is a reason chip designers segregate integer and FP registers. Having a unified register file means having lots of ports on it, which is a bitch to route and can bottleneck your clockspeed.
The heavily multi-threaded code advantage is bullshit. It took the industry a decade just to recompile their fricking apps to run natively on 32-bit machines. It’ll be another decade minimum before heavily multi-threaded code is common, and that is wishful thinking on my part.
More on why this concept sucks in part 2 of my post (damn length limit…)
This would make a lot of sense. One of the reasons Itanium failed was the lack of an upgrade path from x86; x86 performance was miserable on Itanium. A Transmeta-like solution would be very smart.
I just thought of something else. If the chip uses software to translate instructions, this could be a way Apple locks commodity hardware out of OS X. They could remove an important instruction or two from their ISA, or even add one or two into the translation firmware.
It’ll be interesting to see what Intel does. The P4 really is a marvel of engineering (I’m an AMD guy too); it’s just that Intel didn’t see the brick wall they were heading into in terms of clock speed.
Sounds all nice; I’ll believe it when I see it.
rSl
Okay, now, let me outline why it’d be suicide for Intel to pursue a design anything like the one outlined in this article.
Basically, it’s safe to assume that Intel will follow the Pentium-M design philosophy, at least for its next generation of processors. First, let’s figure out exactly what Intel needs out of its next processor.
1) Integer performance. For the time being, GPUs cover most of the FPU needs of PCs. As long as the CPU is fast enough to encode/decode HD media formats (eg: MPEG4-HD, WMV-HD, etc), it’s fast enough. For people who really need lots of general purpose FPU performance, namely the scientific processing market, the Itanium has them covered, and has a decent and stable niche there. And of course, the all-important server market couldn’t care less about FP performance.
2) Low power usage. Laptops are outselling desktops now, so low power usage is a must.
3) Scalability to reasonable multi-core designs. For the near-future, multicore is going to mean being able to run a couple of apps at once without bogging the machine down. In the desktop market, it’ll be at least a few years before you start to see apps that can scale reasonably to 2-way and 4-way designs, and a decade or more before you see apps that can scale well to a dozen cores or more. In the server market, you already have apps that can scale to 64-way+ designs, but in that space, a 400mm^2 chip is entirely reasonable.
Now, the overriding requirement for Intel is this:
4) Low risk. Intel cannot afford to release another Itanium or another Prescott. It wouldn’t kill them, but would easily give AMD the 25% market share number that they are aiming for by 2009.
The above 4 requirements naturally lead to the following design directions, all of which suggest a design based on the Pentium-M:
1) A short pipeline. Anything silly like the SPE’s 18-stager is out. FPU code doesn’t care so much about pipeline length, but integer code does, and integer performance per watt is going to be the main criterion for the next-gen Pentium.
2) A large, low-latency cache. Integer code is cache-happy, but it likes low-latency cache. That means a cache of “several megs” is out, because of the latency hit. It also means that wasting half the cache for translated code is out, because it’d be much better to just have half the cache at a lower latency.
3) Superlative branch prediction. The Pentium-M’s branch prediction is great, and good branch prediction gets you a whole lot of integer performance for a relatively small amount of die space (relative to a meg of cache for holding VLIW translations!)
4) 32-bit x86 ISA. In the desktop market, Intel will probably push the 32-bit ISA quite hard. The next Intel CPU might be natively 64-bit, but it also might not. It’ll likely support x86-64, but it might be handled as multiple 32-bit operations, just like current P4’s handle x86-64 code as multiple 16-bit operations (the P4’s integer pipeline is strange…)
5) No dependence on external technology advancements. Intel cannot afford to again risk releasing a product that requires the market to catch up to it. It’ll not release a product that requires particularly good compilers (because most vendors won’t bother), or a product that requires developers to shift suddenly to massively-parallelized code. This is the primary reason why Intel’s next processor will be a multi-core Pentium-M. The design doesn’t care too much about scheduling, and can scale comfortably to 2 to 4 cores. That’s all Intel needs, and really all they can afford to gamble on.
Spot on. Rayiner for Inquirer Processor Editor!
The above 4 requirements naturally lead to the following design directions, all of which suggest a design based on the Pentium-M:
1) A short pipeline.
Oh well, at least one thing the Pentium 4 and particularly Prescott with its 31 stages have succeeded in: changing pipeline terminology.
On its debut, the Pentium Pro’s 12-stage pipeline was considered excessively long when other processor designs had at most eight stages. The PPC 604, for example, had six.
current P4’s handle x86-64 code as multiple 16-bit operations
That’s almost deliberate obstruction. But surely it does have a 64-bit unit for address arithmetic, doesn’t it?
Ha ha, I think I lied about the 16-bit thing. The P4 had 16-bit “fast” ALUs (it handled a 32-bit operation in two cycles). They ran at 2x the clockspeed, so they were like 2 regular 32-bit ALUs. But I just realized that no 64-bit P4 ever came out. The Prescott (P5?) has 2 32-bit ALUs. They can handle 32-bit operations in 1 clock cycle (but with some limitations that mitigate much of their advantage), but still handle 64-bit ops in multiple clock cycles. The “slow” ALU and the AGUs are all 64-bit.
Intel’s basic problem WRT 64-bit support is that they’ve got the double-speed ALUs. Just getting those ALUs to 32 bits was a pretty neat trick of engineering. They weren’t in a position to just make them 64 bits wide to support x86_64 code.
You make a lot of good points, and in many cases you’re right, but you’re looking at it from the point of view of an engineer, not a company whose sole purpose is $$$.
SPEs speed
I worded this badly; I meant they are fast at the sort of things SPEs are designed for, not everything. As for their integer performance, I don’t think they’ve released any benchmarks.
Based on Itanium.
Itanium is only one implementation of a VLIW architecture, Transmeta and Elbrus are others. The Elbrus designers didn’t like Itanium and claimed they could do a lot better than Merced, this was 10 years after their first VLIW design.
I think Itanium will be very useful for Intel to learn from, but I don’t think a new VLIW design will be Itanium-based. It’ll be closer to the Transmeta designs (which Elbrus thought were much better).
Single threaded performance.
Most PCs are bought by corporations or home owners to run Word, surf the Web and read email. Single threaded performance really isn’t that important to those people.
The entire point is that going this road will mean smaller, cooler cores; if Intel can put more cores on a chip, they’ll sell more. Yes, the enthusiasts will all think this sucks and go and buy AMD, but that hardly matters since they do that anyway and in any case are only a small part of the market.
Intel will be able to get more sales by marketing the number of cores and selling variations such as 16, 14, 12 and 10 cores; if AMD are only doing 4 or 8 cores, they’re in trouble. Benchmarks won’t help; there’ll be sets provided by both sides, the difference being that Intel has more marketing clout.
Cache
A low-latency cache is good of course, but Intel shows a clear trend of quickly going to ever larger cache sizes. A smaller, low-latency cache has to be balanced against a larger, higher-latency cache which avoids going to memory as often – and memory has a latency that is positively massive in comparison. If they manage 16 cores at 65nm I’d expect them to include a huge external L3 / L4.
Risk taking
Intel *can* afford to take risks; they have the money to do several new core designs simultaneously. It’s AMD who can’t take too many risks; if they get it wrong, they could be in trouble.
However, Transmeta have proven the technology works. It’s not radical new technology.
That said, AMD have experimented with this sort of design as well; one of the alternatives to the K8 was VLIW-based, and even the G5 contains VLIW-like technologies.
Interestingly, IBM and AMD are getting very cosy these days, and IBM have exactly the kind of technology I was talking about in their R&D labs.
Multithreading
Everyone is going multi-core now, multithreading may not be easy but it’s not exactly a new idea.
BTW – I’m not the first to have predicted this; someone on Ace’s Hardware pointed out a similar prediction from 2002 – for AMD.
I believe that AMD will typically be the innovator and Intel will follow. Generally the smaller company innovates, the bigger company takes advantage of it.
Anyway, I hope Intel messes this up; I would really like to see AMD hit the 25% market share mark. I would also like to see AMD stay second, because generally the company with the larger market share does not have the best product, since they do not need to.
I see AMD as the innovator, the one who keeps Intel on its toes.
In the beginning, software was tough to write, mostly because the hardware was tough to program. Then in the 90s hardware got really good in a hurry. Memory became plentiful, execution units became blazing fast, and compilers for higher-level object-oriented languages became efficient on these platforms.
Software became very easy to write. Anyone could learn to program, even business and creative writing majors. In most cases, you didn’t need to know anything about the underlying architecture or the limitations of the hardware.
Then transistors got really small, frequency went sky high, and everyone went out and got 350W power supplies. However, this was not helping the burgeoning pervasive computing and mobile computing markets.
So hardware companies, looking to satisfy these new markets while pulling chip yields and profit margins out of the danger zone, take a different approach. They say, ok, software developers, you’ve had it pretty easy for a while, but we’re going to send you a series of warnings that the era of multithreading is upon us. You’ll have 5-10 years to understand that there are limits to how fast you can do a single task, and unless you divide your code into multiple simpler tasks, your software will not be able to scale.
Software developers say, that’s fine, hardware designers, as long as your chips can figure out how to split our code up for us, we’ll be happy. One hardware designer came out with Hyperthreading, an attempt to appease the simple-minded software developers by automagically extracting thread-level parallelism from their code. The results were mixed, because most code was written with the expectation that stuff will be done serially, like normal people think in their heads. Sometimes this HT technology made code slower, and the software developers were not pleased, instead preferring another hardware designer who figured out how to make memory latency a little lower.
Hardware designers all over the world continued singing the praises of multithreading, talking about multicore processors, virtualization, and distributed computing. Software designers continued to say that they didn’t want multithreading. They wanted fast singlethreaded performance.
At some point in time, software guys are going to have to realize that there is no such thing as dramatically faster singlethreaded performance. Not within a reasonable power envelope and die area, at least. Multithread or get left behind.
One hardware designer came out with Hyperthreading, an attempt to appease the simple-minded software developers by automagically extracting thread-level parallelism from their code.
I think you’ve misunderstood hyper-threading; it’s simply a way to run two threads on a single core, invented in order to utilise the ridiculously long NetBurst pipeline a bit better.
To the software developer it looks much the same as a two processor machine.
Extracting thread-level parallelism is up to the programmer, very occasionally with a bit of help from a clever compiler.
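A minimal sketch of what “up to the programmer” means in practice, assuming POSIX threads; the array-summing split is a toy example of my own, not anything from Intel:

#include <pthread.h>
#include <stdio.h>

#define N 1000000

static int data[N];

/* Each worker gets a half-open range [lo, hi) to sum. */
typedef struct { int lo, hi; long sum; } work_t;

static void *partial_sum(void *arg) {
    work_t *w = arg;
    for (int i = w->lo; i < w->hi; i++)
        w->sum += data[i];
    return NULL;
}

int main(void) {
    pthread_t ta, tb;
    work_t a = { 0, N / 2, 0 }, b = { N / 2, N, 0 };

    for (int i = 0; i < N; i++)
        data[i] = 1;

    /* The programmer explicitly splits the work into two threads; the OS
       then schedules them onto the two logical (or real) processors. */
    pthread_create(&ta, NULL, partial_sum, &a);
    pthread_create(&tb, NULL, partial_sum, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    printf("sum = %ld\n", a.sum + b.sum);
    return 0;
}

On a Hyper-Threading P4 this looks exactly like it does on a real two-CPU box: two schedulable threads, created by hand, with nothing extracted automatically.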
The Jackito TDA uses 7 processors (parallel processing).
I wonder if Intel is heading in that direction too.
check this out….
http://www.jackito-pda.com/hardware/overview.php
-2501
BeOS (:
1. Expecting existing systems to use many smaller cores in hyperthreading mode and getting efficiencies out of it with standard programs requires that the OS is aware of those “processors” and can and will use them effectively. Sorry, there are too many systems that don’t use virtual processors, and too darned many applications that cannot reasonably be made SMP-aware, for such an architecture to make sense for a single process. Now, if you actually WANT to have a bunch of things running all at once that don’t interact, sure, you could do that; the cost would be that there needs to be a HUGE amount of cache to make any single running thread even remotely worthwhile. Sure…..
2. While Transmeta did make their CPUs translate code in software into another ISA, it would require a huge amount of cache to make it worthwhile on the fly: such a critter would not be reasonable as a multicore device, because of cache and huge translation latencies. It doesn’t matter how fast the interrupt handler code operates if it can’t be translated before servicing the interrupt is already too late, and where are you going to put the code for everything else? In more cache?
3. Once again, while you could (in theory) then store all the translated code out to main memory, that would either require doppelganger memory systems (one for original untranslated x86 code and data, and another memory system for translated code), which would be a horrible mess, or an OS that takes such a processor into account, perhaps by handling translated code through a special type of device driver. No, it doesn’t seem like a wise move for something that’s supposed to be backwards compatible with what’s on the shelf now.
4. Nicholas should actually spend some time writing actual code that’s heavily threaded, and prove that what he’s created is *correct* as well as efficient. While the future processor Sun has announced supports a huge number of threads, those aren’t likely to be threads that interfere with each other and still get decent performance. A huge number of algorithms simply can’t be made super-parallel, and those that can (in theory) may require so much locking that the overhead would make it impractical.
I think I’ll stop there: his article is pure entertainment, much like the Cell article was in its wild claims. I think he needs to stop taking Star Trek technology as science and start treating it as fiction until proven otherwise.
…that Intel’s new chip is going to incorporate Nicholas Blachford’s anti-gravity technology for XTreme Performance!
I mean seriously, why does anybody give this charlatan any credence at all? He’s a total amateur and not even a gifted amateur.
Where does all his chip design experience come from exactly?
He reminds me of people that post to comp.arch with their idea of putting 256 486s on a single die. They actually think that they are the only person to have ever thought of such a simple idea. In reality, lots of people have thought of it, but most have been intelligent and experienced enough to also know why it’s not realistic. You know, minor practical obstacles like the insane amount of memory bandwidth required, the thousands of pins that would be needed on the chip package and the incredible difficulty of writing such pervasively multi-threaded code.
But Blachford doesn’t have that kind of real world experience so he keeps coming out with these ridiculous fantasies. And the chronically naive lap it up because it fits in with their hopes and dreams.
I can just picture hundreds of OSNews readers looking at the article and thinking “Wow! A chip like that would be perfect for BeOS! BeOS could be the number 1 OS!”. No, sorry. You’re all fucking deluded.
Blachford – get a job and stop polluting the Internet with your tragic fantasies.
Well, really, had BeOS been developed all this time it really would be the perfect OS for this. The API forces programmers to use multiple threads in their applications. So every application that has ever been or will ever be in existence for BeOS supports multiple processors.
The API forces programmers to use multiple threads in their applications.
How did it do that? Did BeOS terminate your application when it didn’t spawn a new thread for a couple of seconds or something?
Seriously, all an API and OS designer can try to do is to make multi-threading as convenient and efficient as possible.
Well, really, had BeOS been developed all this time it really would be the perfect OS for this. The API forces programmers to use multiple threads in their applications. So every application that has ever been or will ever be in existence for BeOS supports multiple processors.
Having multiple threads for apps, windows, input devices, etc. is good for GUI latency, but it does not help much if you want to watch a movie, for example. In BeOS (almost) all algorithms doing real work are sequential, and this is in no way different from Windows and Linux. Parallelizing those algorithms is very difficult and error-prone when using C++. Also, hardware threading is too heavyweight for massive concurrency (tens of thousands of threads).
Forget BeOS; it does not at all make writing parallel algorithms simpler. Instead, use a programming language that is made for concurrency. Intel should also have a plan for how to make software utilize their cores.
Alpha had tried this.
Crusoe had tried this.
Itanium had tried this.
And the x86 ISA still pwns the world.
Really, the question I have to ask is, is writing your code to use large numbers of threads really that hard? I’m currently writing a video editing program in which I plan to use three separate threads just for playback: one thread each for OpenGL display, decompression and disk reading. It’s not that hard really; it’s just a matter of looking at the program the right way from the get-go.
Really, the question I have to ask is, is writing your code to use large numbers of threads really that hard?
Well, it’s usually quite a bit harder than the equivalent sequential version. Thread communication, the danger of deadlocks and more complex program state during debugging make it more difficult.
And of course there’s the vexed question of how and into how many threads you should actually partition your application.
Too few and you might not fully utilise any given machine; too many and communication overhead overtakes gains in throughput.
So unless you’re writing software for a particular machine with a fixed number of cores and known communication overhead, you end up having to guess.
I’m currently writing a video editing program in which I plan to use three separate threads just for playback: one thread each for OpenGL display, decompression and disk reading. It’s not that hard really
While that’s very commendable it won’t actually gain you much performance, because the decompression is the only CPU-intensive task there. OpenGL is handled by the GPU (unless you’re using software emulation of course) while hard-disk reading is handled by the DMA controller.
The one way you could gain performance is by somehow splitting up and multi-threading the decompression algorithm, but only if it’s actually the bottleneck in the system.
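For what it’s worth, here is a minimal sketch of that three-thread reader/decoder/display pipeline, assuming POSIX threads; the read_chunk/decode_chunk/display_frame functions are made-up stand-ins for the real file I/O, codec and OpenGL calls, so treat it as an outline rather than the poster’s actual design:

#include <pthread.h>
#include <stdlib.h>

#define QSIZE 8

/* Small bounded queue protected by a mutex and two condition variables. */
typedef struct {
    void *items[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

static void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

static void queue_put(queue_t *q, void *item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)                 /* wait if downstream is slow */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static void *queue_get(queue_t *q) {
    void *item;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                     /* wait if upstream is slow */
        pthread_cond_wait(&q->not_empty, &q->lock);
    item = q->items[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

/* Made-up stand-ins so the sketch runs; a real player would do file reads,
   codec work and OpenGL texture uploads here instead. */
static int chunks_left = 100;
static void *read_chunk(void)       { return chunks_left-- > 0 ? malloc(64 * 1024) : NULL; }
static void *decode_chunk(void *c)  { free(c); return malloc(1024); }
static void  display_frame(void *f) { free(f); }

static queue_t compressed_q, decoded_q;

/* Stage 1: disk reader. Blocks on I/O, feeds compressed chunks downstream. */
static void *reader_thread(void *arg) {
    void *chunk;
    (void)arg;
    do {
        chunk = read_chunk();
        queue_put(&compressed_q, chunk);      /* NULL marks end of stream */
    } while (chunk != NULL);
    return NULL;
}

/* Stage 2: decoder. The CPU-heavy stage, overlapped with reading and display. */
static void *decoder_thread(void *arg) {
    void *chunk, *frame;
    (void)arg;
    do {
        chunk = queue_get(&compressed_q);
        frame = chunk ? decode_chunk(chunk) : NULL;
        queue_put(&decoded_q, frame);
    } while (frame != NULL);
    return NULL;
}

int main(void) {
    pthread_t reader, decoder;
    void *frame;

    queue_init(&compressed_q);
    queue_init(&decoded_q);
    pthread_create(&reader, NULL, reader_thread, NULL);
    pthread_create(&decoder, NULL, decoder_thread, NULL);

    /* Stage 3: display, run here in the main thread. */
    while ((frame = queue_get(&decoded_q)) != NULL)
        display_frame(frame);

    pthread_join(reader, NULL);
    pthread_join(decoder, NULL);
    return 0;
}

Whether this buys much is exactly the point made above: only the decode stage burns CPU, so the win is mostly in keeping the decoder fed while I/O and display block.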
Too many here keep mentioning laptops as the reason for this change.
I wonder about some of you and your business acumen. The desktops and laptops are not the big money makers.
It’s about $$$ computer clusters $$$$ for big business, science and government.
There is a big SHIFT towards using the architecture of graphics cards for science. If Intel doesn’t get a toehold, someone else might, like Nvidia or ATI or Sony with Cell.
http://www.gpgpu.org/
Intel does a good job of reading the tea leaves.
I know the idea I’m reiterating here is most certainly going to turn out to be completely incorrect, but does anyone think it could be possible that Intel might be basing these new chips on a flavor of Itanium?
If you rip out the slow-as-molasses-in-January hardware x86 emulator and about half the cache, these things could possibly be made to eat up less power. Software x86 emulators, I’ve heard, are pretty damned fast on Itanium (go figure), and who’s to say that said chips couldn’t use a thin layer of software to emulate x86-64, allowing one to use ordinary compilers on top of it?
If nothing else, I’m guessing it’d at least be better than the crap AMD64 knock-offs they are selling now that are based on the craptacular Pentium 4s…
In my uninformed opinion, I’d also imagine that Intel won’t be using the Itanium core for these new chips, but in a way, I can see it making a strange sort of sense for them to do so.
They’ve poured billions of dollars into the development of a decent microprocessor architecture, and one of the two largest reasons why it’s a disaster is their own inability to market it effectively. I also remember reading something about software x86 emulators running x86 code quite a bit faster than the hardware one on the chips.
If they were to write such software that emulates AMD64 on top of it, then you wouldn’t need to deal with the weirdness of the EPIC architecture in your own programs, as only the compatibility layer would need to deal with the EPIC instruction set.
In addition to the fact that they’ve already got the things designed and in production, there would be some benefits to utilizing the Itanium. It’s a true 64-bit chip (doesn’t AMD64 only have 48-bit memory addressing?), and it has the oh-so-helpful NX bit, unlike the Pentiums that I’m aware of. And like it says in the article, being based on something that for all intents and purposes is a VLIW instruction set like Transmeta’s chip, it should be rather easy to emulate other instruction sets on top of it (Intel was talking about virtualization technologies, weren’t they? Well, whatever.).
Since I first heard of the Itanium, I’ve always liked a lot of the ideas it incorporates, and were they to leverage it in the design of their next-generation x64 chips, I can only see that as a win both for them and for the people who buy the things.
Again, all that said, I still don’t see it happening.
[disclaimer] I’m not an CPU engineer. I’m not even a great programmer. [/disclaimer].
You keep mentioning that multi-threaded apps would make use of multiple processors. On the other hand, most apps are not written this way.
I think that if you have multiple single-threaded apps running, they would make use of multiple processors as well. Kernel, X server (or other displaying server), window manager (or equivalent), music player and the app you are really “using” – the one which has focus (browser, shell, text editor).
I think that if you have multiple single-threaded apps running, they would make use of multiple processors as well. Kernel, X server (or other displaying server), window manager (or equivalent), music player and the app you are really “using” – the one which has focus (browser, shell, text editor).
Yes, but even taken together the programs you mention do not fully utilise even a single core.
It’s about CPU-intensive stuff, e.g. a dual core would allow you to play a (single-threaded) game at full speed while encoding an MPEG in the background.
Right. What I had in mind was, they could utilise a number of less-performing but much cheaper cores. So, a multiple-core computer with simpler cores than today’s processors would still be very well usable as a desktop machine.
What I had in mind was, they could utilise a number of less-performing but much cheaper cores.
Yep, that’s what Microsoft went for with the new X-box.
It would be really interesting to see some numbers on how much bigger a complex out-of-order core actually is compared to a simple in-order one and how much it buys you in terms of pipeline utilisation.
So, a multiple-core computer with simpler cores than today’s processors would still be very well usable as a desktop machine.
I wouldn’t dispute that.
But existing CPU intensive software would run significantly slower, so in a way it’s not backwards compatible, and that’s one thing the PC market really hates. Plus, the poor benchmarks would create a big marketing problem.
Apple can kind of afford the occasional incompatible change to their platform because they have a fairly captive market. Intel can’t do that while AMD is around, as the Itanium vs AMD64 saga has demonstrated.
The only hint is some comments from Intel apparently saying the processor will be “structurally different” but will have no problems running the same apps.
So the evidence for all this speculation is somewhere between extremely flimsy and non-existent.
“Structurally different” to what anyway?
To the Pentium 4? That could just be the shorter pipeline.
To the M? Well, they could adopt some of the better ideas in the Pentium 4, e.g. the trace cache.
To the Pentium D? In contrast to simply sticking two Pentium 4s on a single die, the new chips are going to be more truly dual-core, sharing L2 cache and memory controller.
” I think it’s going to be something completely different.”
No it’s not.
It’s going to be a dual-core P4-M, with SSE3.
Commodity wins my friend.