Intel today revealed it can convert single-threaded software to run multithreaded without any code modification. The new ‘speculative parallel threading’ process monitors software and examines whether its processes can be run in parallel. If they can execute successfully, the software can be recompiled to run as a multithreaded app. Intel says it has realised that programmers are going to need machine help to get software running multithreaded. “We can’t blame the programmers,” an Intel spokesman said. “The industry has been complaining for 30 years about how difficult parallel programming is.”
April 1st?
You read my mind!!!
The date on the original article says April 2nd…
Funny… the article’s dated April 2nd, there are no references or links, and the content is quite substantial. What are your bets?
I really hope this is true. I think it’s stupid that we’re all buying n-core computers but most of the software we’re running doesn’t do parallel processing…
It’s true. There is at least one startup in Massachusetts that has been trying the same approach. The problem is, it’s just not very effective. For example, I doubt it can parallelize a quicksort algorithm, a perfect candidate for parallel processing. If it were any good, Intel, Microsoft and AMD would not continue to pour hundreds of millions of dollars into research labs around the world to find a solution to the parallel programming problem.
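For what it’s worth, quicksort does parallelize nicely when split by hand, which is exactly why it’s such a natural test case; here is a minimal hand-rolled sketch (using the std::thread-style interface proposed for C++0x, purely illustrative):

```cpp
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hand-parallelized quicksort: after partitioning, the two halves touch
// disjoint ranges of the vector, so they can be sorted concurrently
// without any locking. Depth-limited so we don't spawn thousands of threads.
void pquicksort(std::vector<int>& v, int lo, int hi, int depth) {
    if (lo >= hi) return;
    int pivot = v[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {                        // Hoare-style partition
        while (v[i] < pivot) ++i;
        while (v[j] > pivot) --j;
        if (i <= j) std::swap(v[i++], v[j--]);
    }
    if (depth > 0) {
        std::thread left(pquicksort, std::ref(v), lo, j, depth - 1);
        pquicksort(v, i, hi, depth - 1);    // sort the right half ourselves
        left.join();
    } else {
        pquicksort(v, lo, j, 0);
        pquicksort(v, i, hi, 0);
    }
}
```

Call it as pquicksort(v, 0, (int)v.size() - 1, 2) and you get up to four concurrent sort calls. The point is that a human can see at a glance that the halves are independent; an automatic tool has to prove it.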
Truth is, Intel is scared. They are already committed to the multithreaded route and, at this late stage of the game, they can’t go back and change their multicore strategy without losing billions of dollars. Too bad. In my opinion, that’s exactly what’s going to happen. To find out why multithreading is not part of the future of parallel programming, read ‘Nightmare on Core Street’:
http://rebelscience.blogspot.com/2008/03/nightmare-on-core-street.h…
I read that article as well as the alternative it suggested. It’s actually really really good reading for a quiet evening in.
I’m still having difficulties getting my head around the alternative approach with signals, nodes, etc
I’d highly recommend anyone who’s interested in programming to read it.
For now, it’s not very effective.
10 to 20 years down the road, compilers producing multithreaded apps out of single-threaded code might just be another type of optimization.
10 to 20 years down the road, programmers producing single-threaded code might just be another type of anachronism.
Should it really have to though? Wouldn’t it be better to have a standard library of routines and have it designed to use threads so the developer doesn’t have to think about it?
Something I don’t understand is why the C++ standard library hasn’t been updated to include a threading API.
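For the record, Boost.Thread already gives you a portable API, and something very close to it is slated for C++0x; a minimal sketch of that style (treat the exact header and names as illustrative):

```cpp
#include <iostream>
#include <thread>   // proposed for C++0x; boost::thread offers much the same API today

// The whole point of a standard threading API: "run this on another
// thread and wait for it" becomes a portable one-liner.
void count_to(long long n, const char* label) {
    long long sum = 0;
    for (long long i = 1; i <= n; ++i) sum += i;
    std::cout << label << ": " << sum << "\n";
}

int main() {
    std::thread a(count_to, 100000000LL, "worker A");
    std::thread b(count_to, 100000000LL, "worker B");
    a.join();   // wait for both workers before exiting
    b.join();
    return 0;
}
```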
After following the large number of links with repeated information to get to that author’s point, I think it’s pretty mundane. They’ve tried this fine-grained message passing approach before in high performance computing applications (there was a company called Thinking Machines that did this) and found it to be difficult to program for a useful task. A bunch of DoD money was wasted on these projects.
There are cool algorithms which you can do in such fine-grained systems (for instance, GPGPU algorithms for data processing), but most day-to-day computer operations do not really require this work. Maybe if this sort of research pays off, we might have a good speech or vision recognition system, but it won’t make MSWord run faster.
It depends on how one uses their computer. As long as you have more than one simultaneous process running on your computer, then the OS should be able to balance those processes between the available cores/CPUs.
Even if every process running on your computer is single-threaded, you’ll still see some net benefit (even if the individual processes don’t run any faster than they would on a single core).
Indeed, but that hardly scales beyond 2-4 cores.
The thing is, the only stuff that needs ongoing attention from a CPU is automated processes that scan, convert, or otherwise work on massive piles of data.
And once you have a couple of those running at once, you run into problems supplying them with the raw data they work on.
Most CPU time today is spent waiting for user input in a spreadsheet, editor, or similar.
With most consumer-grade OSes, that applies to SMP in general (regardless of whether single-threaded or multi-threaded apps are being used). Add enough cores/CPUs and you will eventually get to a point of “diminishing returns” – the overhead from balancing work between the CPUs will outweigh the advantages of having additional CPUs/cores.
Exactly. I/O is usually more of a performance bottleneck than the CPU these days.
So, basically the single threaded Windows GUI will become a multi-threaded one.
Yeah, right… Can they also turn lead into gold already?
The Windows GUI is heavily threaded. Explorer.exe shows about two dozen threads right now.
And how does that scale when you open new Explorer windows? One or two new threads per new window?
I would not call *that* a heavily threaded *GUI*.
At least not in the classic BeOS way: one thread per window/menu bar/popup menu/alert/whatever 🙂
Sounds like a nightmare. 1 thread per widget really wouldn’t do any good.
I don’t see much difference between it and an IPC setup of some sort…
Have you used BeOS? With many apps open the CPU load for its GUI is quite low (i.e., basically nonexistent) and the responsiveness is far better than anything else I’ve seen.
They seem to be able to do one thread per widget without any downsides this user can see.
The main differences with BeOS are that:
– threading is enforced by the BeOS APIs, so a developer has to go pretty far out of his way to write native software that *isn’t* multi-threaded
– BeOS gives GUI-related threads higher priority than threads that are invisible to the user
It’s not really a matter of multi-threading being absent from Windows software.
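To make the BeOS point concrete, here is a from-memory sketch of a minimal native app (Be API details recalled, so treat them as approximate); note that there is no explicit thread code anywhere, yet every window gets its own message-loop thread the moment Show() is called:

```cpp
#include <Application.h>
#include <Rect.h>
#include <Window.h>

// Every BWindow runs its own message loop in its own thread; Show() starts it.
class HelloWindow : public BWindow {
public:
    HelloWindow()
        : BWindow(BRect(100, 100, 400, 300), "Hello", B_TITLED_WINDOW, 0) {}

    // Closing the only window asks the application to quit.
    virtual bool QuitRequested() {
        be_app->PostMessage(B_QUIT_REQUESTED);
        return true;
    }
};

class HelloApp : public BApplication {
public:
    HelloApp() : BApplication("application/x-vnd.example-hello") {}
    virtual void ReadyToRun() { (new HelloWindow())->Show(); }  // spawns the window thread
};

int main() {
    HelloApp app;
    app.Run();   // the BApplication message loop runs in the main thread
    return 0;
}
```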
In my experience, the use of only one thread is in some cases deliberate, and when the needs of the task demand more than one thread, the programmer is aware of it and uses multithreading.
So my fear is that this feature will screw up tasks that are meant to run in only one thread.
If you look at the article, what they seem to be implying is that they do code analysis on the program to extract any execution streams that can be run in parallel without harming the original outcome… ergo, it’s more or less transparent in theory, so it shouldn’t harm the cases where the code is “intended” to run single-threaded for synchronization reasons or whatnot.
Well, most large-scale parallel processing has and will probably continue to be run on specialized hardware designed specifically for such tasks. An obvious example of such specialized hardware is the GPU.
I am shaking my head a little at this so-called crisis. I can see where this issue might become a serious concern once general purpose CPUs with 10s or 100s of cores come out. However, mainstream CPUs that are most in use today have at most only 4 cores.
Sure, we want to ensure that the total computing workload on a system is spread evenly across these cores. One way to achieve this is by running applications that are designed to run in multiple threads based on the capabilities (i.e. number of cores) of the CPU they are running on. Some high-end workstation applications and games can do this. But we find more of this capability on servers, which are running DBs, Web servers, etc., many of which are designed from the ground up to spread their workload across multiple cores.
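As a small illustration, “sensing the capabilities” usually boils down to something like this sketch (using the hardware_concurrency query from the proposed C++0x thread library; Boost.Thread has an equivalent), purely illustrative:

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Spawn one worker per available hardware thread (core / SMT sibling).
int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;   // the query is allowed to return 0 when the count is unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        workers.push_back(std::thread([i] {
            std::cout << "worker " << i << " handling its slice of the job\n";
        }));
    }
    for (auto& w : workers) w.join();
    return 0;
}
```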
But getting back to the desktop, look at what most folks are running on them. Windows (XP or Vista), Linux, *BSD, etc. Each of these OSs has multiple background tasks and daemons loaded all the time, each of which need CPU time when they need to execute. A multicore-aware OS (like the above named) can allow these tasks to run truly simultaneously (depending on how many are scheduled for execution at any one time and the number of available cores) instead of having to time-slice each one on a single core. This allows for a smoother experience when doing things like simultaneously downloading/uploading a torrent, burning a DVD and playing a foreground game. OS specific tasks also can run in parallel. I am sure many of the readers of this thread recall the random, unexplained periods of unresponsiveness on WinXP using a single core system. These pretty much went away when upgrading to a multicore (dual or quad) system, didn’t they?
So anyways, I can see that there will be a problem fully utilizing a 50-core CPU on a typical desktop system. But isn’t 4 cores enough, at least for the next 5 years? I guess we will have to wait and see…
You may be right, but threaded languages have been around for a long time now and we’ve still not mastered them. I think this is more of a, “learn to walk before you run” scenario. We wouldn’t want to wait until we have 50 cores to worry about the problem.
It’s not primarily about desktop machines, it’s about servers. Sun is already selling relatively cheap 64 CPU-thread systems, and progress in CPU engineering has been moving in the “double the number of cores, not the clock speed” direction for several years already. In a few years servers will routinely have dozens of CPU cores (or virtual CPUs, as in hyperthreading), and that means the software needs to be ready.
Desktop machines will be following a couple of years behind on that curve, but the trend is coming there too.
Multi-threading isn’t a software problem as much as it is an engineering problem. Someone needs to understand how the system works to determine how to properly thread it.
In my view, the move from single-threaded to multi-threaded is as big as the move from C to C++ back in the day. And it’s really not that hard when it comes down to it.
Besides this probably being an April 1st joke, there cannot be a general algorithm that converts any single-threaded application to a multi-threaded one. The halting problem prohibits it. In general, you cannot know which data can be operated on in parallel unless you examine the data first. But examining the data is a single-threaded operation, and therefore you’re back to square one.
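To make that concrete, consider this hypothetical loop: whether it can run in parallel depends entirely on the runtime contents of idx, which no amount of static analysis can know in advance:

```cpp
#include <cstddef>
#include <vector>

// Whether these iterations can run in parallel depends entirely on the
// runtime contents of idx: if no two iterations share an idx value, they
// touch disjoint elements and are independent; if two do share one, they
// race on the same element and the loop must stay sequential (or be
// speculated and rolled back).
void scatter_add(std::vector<double>& a, const std::vector<int>& idx,
                 const std::vector<double>& src) {
    for (std::size_t i = 0; i < idx.size(); ++i)
        a[idx[i]] += src[i];
}
```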
Highly concurrent languages, especially functional languages like Clean or Erlang, were designed to address these issues almost 30 years ago! We can blame the programmers for sticking with C and C++…
Kochise
I’ve said for many years now that this could be done.
It isn’t even as hard as one would believe. CPU #0 is processing instructions on a given dataset, it is TRIVIAL ( in hardware – with a native multi-core design – thanks to long-lived OOE [Out-of-Order-Execution] ) to determine if the next instructions are dependent upon instructions currently in the pipeline.
All you need, basically ( this is overly simplified ), is to run down the execution path until you find the next set of instructions not dependent on solutions which have not yet been calculated, and throw those onto the next available core.
Another item of interest that I haven’t seen yet is what I call “sleeping units.” These “sleeping units” refer to the FPU / ALU or whatever else can do stand-alone computations within the CPU. When doing integer instructions the FPU could likely be sleeping, and the FPU should have no issues doing integer math really fast given a few tweaks.
In that case, if the integer pipeline is full and the FPU is available, USE IT! This gets more involved but is no more difficult than splitting a single execution path into multiple execution paths.
But, back to the article, the only real issue I see in converting, say, a quick-sort routine is the common local variables often used for storage. If the variable is declared within the loop you incur an overhead on a single CPU, but it could help in detecting parallelizable ( proper grammar, but not a word yet ) singular execution paths.
If the given loop were allowed to act on that variable freely while the first and second runs are being executed in parallel you have memory concurrency issues, which would require locking. This locking would generally limit performance tremendously.
Avoiding this is “easy:” All you really need to do is to use a “versioning” system in which results for a given version are stored in the CPU’s L2 ( shared or a common L3 ) and that the correct version is used for the correct execution path. This alone would increase performance many fold. And it would allow programs which never benefited from the large CPU caches to suddenly receive MASSIVE bumps in performance.
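A rough software analogy of that “versioning” idea, in case it helps: give each parallel chunk its own private version of the accumulating state and merge the versions once at the end, so the hot loop never takes a lock (a minimal sketch with C++ threads, purely illustrative; the hardware scheme above would keep these versions in cache rather than in software):

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// One private "version" of the accumulator per chunk: each worker writes
// only its own slot, so the hot loop takes no locks at all; the versions
// are merged exactly once, after every worker has joined.
double versioned_sum(const std::vector<double>& data, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    std::vector<double> version(nthreads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = (data.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.push_back(std::thread([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                version[t] += data[i];              // private slot: no locking
        }));
    }
    for (auto& w : workers) w.join();
    return std::accumulate(version.begin(), version.end(), 0.0);  // merge the versions
}
```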
After all, almost every program on earth has AT LEAST one loop ( you can be promised that this is the case for GUI apps on Windows ). Game performance would be staggering, not to mention the wonderful effect it would have on program logic!
Bad multi-threading design in applications would become a near non-issue if you could write a normal loop and it would automatically spread out across all available cores!!
I hope someone at AMD/Intel reads this and sees what I mean 🙂 It could be done in the next generation of CPUs, possibly even just the next stepping of AMD/Intel CPUs. It just ain’t that hard to grasp if done right.
( Of course, the CPUs would require a versioning logic center, and there would be some penalty / delay in regards to this, but the benefit will outweigh the costs, and improvements in the versioning system would help boost performance and marketability ).
An 8/16/32/64/128-bit identifier for version could be utilized and a fixed location in the L2/3 shared cache could be utilized for any specific version. The CPU # could be utilized as a multiplier for memory location.
Ergo, CPU #0 would store Version (V) 1 of Split-Thread Set 2 (STS) at: Offset + V*STS. From here, the length could be made fixed or variable; fixed would be faster in many cases, but less flexible and unusable in more complex cases ( i.e. cases where more than 3 local standard-typed variables ( bytes ) were used ).
Variable length adds the requirement of remembering the length, or of using some form of chaining ( such as adding an indexer per area [ or maybe a block header ] to determine which Ver and STS a memory block “belonged” to ).
Oh well, the prospects sure look good 🙂
–The loon ( actually saying something for once )