The GPU in your computer is about 10 to 100 times more powerful than the CPU, depending on workload. For real-time graphics rendering and machine learning, you are enjoying that power, and doing those workloads on a CPU is not viable. Why aren’t we exploiting that power for other workloads? What prevents a GPU from being a more general purpose computer?
↫ Raph Levien
Fascinating thoughts on parallel computation, including some mentions of earlier projects like Intel’s Larrabee or the Connection Machine with its 64k processors in the ’80s, as well as a defense of the PlayStation 3’s Cell architecture.
I agree with the author. This idea isn’t really new. We’ve had some discussions about an operating system where the GPU took over instead of the CPU. At the extreme, the CPU could be completely eliminated with drivers written to run on the GPU itself. Or we could cheat and keep auxiliary CPUs around as hardware accelerators for the GPU, which is a funny thought.
Either way, though, the primary OS functionality could be GPU-based. This could work and has a lot of promising potential, especially given the FLOPS and bandwidth advantages that GPUs have over any CPU. This is all theoretically doable; however, we should consider the cons as well. GPUs use more energy and are probably overkill for the majority of tasks that haven’t already been migrated to a GPU. For example, I think it would be cool to have productivity applications/web browsers/etc. that run entirely off the GPU, but is there really a benefit from the user’s perspective given that things are already mostly fast enough on a CPU?
My vote would be for the GPGPU-ization of SQL databases. There’s always demand for databases to go faster. A GPGPU DB running optimally could easily set new records, handling even complex queries with ease.
Also, as an aside, it is problematic that CUDA, the dominant GPGPU programming framework, is so tightly vendor-locked. OpenCL would be a better choice for openness and portability’s sake, but unfortunately it isn’t that competitive with CUDA. These frameworks are still dependent on CPUs for bootloading; it would be interesting to see just how much can be moved to the GPU within the context of existing APIs.
Another application would be GPGPU compilers, since developers spend lots of time waiting for builds to finish. And not for nothing, but even obvious applications like GIMP and Inkscape would benefit from GPGPU-ization. Many applications that would benefit are probably being held back by a large code base that would be difficult to “port” without rewriting the whole thing.
If it were so easy to port most code to run efficiently on massively parallel machines like GPUs, it would already have been done.
And there are those bits that require serial code, context swaps and all that. Like, for example, handling interrupts, dealing with slow peripherals, etc.
Wouldn’t it be nice to have a node or two in your compute cluster to deal with those tasks and keep the other processors simpler, so that you can add them by the hundreds? Guess what… those specialized processors are the conventional CPUs. In a perfect world they would just orchestrate everything and let the computation happen elsewhere. In the real world, not everything can be run the GPGPU way. Not efficiently, anyway.
BTW, that was quite clearly the approach with Cell. It got some impressive figures in Roadrunner back in the day, but it required quite a bit of tuning to get those numbers, even for something as well studied as matrix multiplication (the star of scientific computing).
You won’t ever get your “general purpose massively parallel computer” because it makes no sense. It just makes much more sense to have some cores for general-purpose work (CPUs) paired with parallel accelerators like GPUs, NPUs, or matrix multiplication units. Accelerators don’t need to support everything, so each core is simpler and you can fit more cores in the same area.
This is nothing new. It’s the trend. Apple chips are like that. AMD APUs as well. Some POWER designs coupled with NVIDIA GPUs are like that too. Even Cell had that vision in its design (heterogeneous processing with control units (PPU) and compute units (SPU)).
osvil,
Yes, I fully agree. It would be difficult to achieve, and not necessarily rewarding. My interest is academic 🙂
That’s not necessarily a problem. A clever parallel algorithm might be able to handle 8x M.2 drives or 8x 10GbE adapters in parallel to create a high-performance router/NAS/etc. I concede that existing hardware and standards may not lend themselves to this design. But at least in principle it could be an interesting possibility.
They were ahead of their time. I am going to get flak for saying this, but I think human coders are the limiting factor. We are notoriously fault-prone and we just don’t cope well with complexity. Even when it comes to well-understood software faults, we still keep making them and we always will. The only way to correct for human error is to use languages/tech that don’t let us make those errors. I think this applies to parallel algorithms too. Better languages and abstractions could help us here, but industry is notoriously stubborn, so I’m not predicting any dramatic changes.
CPU engineers have done a great job of accelerating sequential workloads, but it comes at great cost, with modern general-purpose CPUs incorporating a large footprint of transistors that don’t directly perform program calculations. Instead, much of the complexity in modern CPUs goes into compensating for sequential programming. GPUs are programmed differently, with the parallelism being very explicit, so the resulting cores can be much simpler and scale far better than CPUs. This is self-evident.
I agree it’s not new, it’s literally the second sentence in the comments 🙂
Arguably a bit harder than it sounds. It’s not like there’s some superspeed x86_64-compatible CPU inside your GPU. So, offloading vector operations (for example) or certain floating-point work… well, being honest, there are some accelerator libraries out there, just as there are accelerators for media transcoding. So… it “is” being done. And of course, AI (we have to say AI, of course).
Can more be done? Maybe. When you consider that the lion’s share of “work” involves very little large-scale CPU work, I think prioritizing GPU processing power is very industry (use case) specific. And… I’d argue it’s already there in a lot of those cases (??).
Larrabee and the Connection Machine were coprocessors that used a host as a (relatively, for the time) beefy scalar processor doing all the heavy OS and data “massaging” work.
This topic comes up very often. And it is a good reminder of why formal education in CS/CE is so important, for things like Amdahl’s law, the average distribution of I/O/branching/compute code (in non-compute-bound kernels) common to OS/scalar workloads, etc., etc.
Eric S. Raymond has written about this before, although the article seems to have been lost at some point in his myriad blog migrations. The gist of the article is that people try this every ~10 years and keep getting kicked in the teeth by Amdahl’s Law. The serial part is always the bottleneck, so we keep the chips with the fastest single-threaded performance as the CPU, even if they’re moderately (by GPU standards) parallel.
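For a sense of how hard that bottleneck bites, here’s a minimal sketch of the bound itself, using a hypothetical workload that is 95% parallelizable (the numbers are illustrative, not from any real benchmark):

```python
# Amdahl's law: if a fraction p of the work parallelizes perfectly across
# n processors and the rest (1 - p) stays serial, the overall speedup is
# 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p).
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Hypothetical workload that is 95% parallelizable: even with tens of
# thousands of GPU cores, the speedup saturates just under 20x.
for n in (4, 64, 1024, 65536):
    print(f"{n:6d} cores -> {amdahl_speedup(0.95, n):5.2f}x speedup")
```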
runciblebatleth,
Amdahl’s law is mathematically sound, but it doesn’t necessarily mean the assumptions of the person applying it are correct. For example, one developer could assume a single-threaded task can’t be further optimized and apply Amdahl’s law to declare the maximum theoretical performance. However, that doesn’t rule out another developer coming up with a more parallel algorithm with a superior ratio.
For example, say we have a ripple counter, which increments bits sequentially while propagating carries one at a time. With this algorithm, the Nth bit can’t possibly be ready until the signal propagates through all previous flip-flops.
https://www.elprocus.com/a-brief-about-ripple-counter-with-circuit-and-timing-diagrams/
If we just apply Amdahl’s law without considering other algorithms, we might be led to the false conclusion that it can’t be further optimized. However, a more parallel algorithm does exist in the form of a synchronous counter that has every flip-flop set on the clock pulse (because we precompute the carries in parallel).
https://www.elprocus.com/types-of-electronic-counters/
Even though this is an electronics example, I hope it’s clear this can happen with software as well. I would like to push the idea that a lot of tasks we assume are sequential might actually have more parallel solutions that can go faster. I don’t believe average software is close to getting the most out of parallelism. IMHO the bigger reason not to (at this point) isn’t so much that Amdahl’s law is holding us back, but rather that serial software is already “good enough,” especially in the face of the additional complexity parallel algorithms would require.
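For instance, here’s a purely illustrative sketch of that counter idea in code: the ripple version has to walk the carries one bit at a time, while the lookahead version can compute every carry independently, because the carry into bit i of an increment is just the AND of all the lower bits.

```python
from functools import reduce

def ripple_carries(bits):
    """Serial: each carry depends on the carry computed for the previous bit."""
    carries, carry = [], 1            # incrementing, so the initial carry-in is 1
    for b in bits:                    # bits[0] is the least significant bit
        carries.append(carry)
        carry &= b                    # a carry only propagates through 1-bits
    return carries

def lookahead_carries(bits):
    """Parallelizable: the carry into bit i is independent of the other carries."""
    return [reduce(lambda x, y: x & y, bits[:i], 1) for i in range(len(bits))]

bits = [1, 1, 0, 1]                   # the value 0b1011 = 11, LSB first
assert ripple_carries(bits) == lookahead_carries(bits) == [1, 1, 1, 0]
```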
Parallel computers are academically very interesting, but not necessarily more practical than sequential ones. New compiler technology could change the dynamic in the future.
Yeah. This is a very common experience, every few years or so enthusiasts have to be reminded that 9 pregnant women can’t make a single baby in 1 month…. 😉
The serial part of the algorithm is going to be a bottleneck, just as the I/O or whatever is the slowest part of the system…
Xanady Asem,
I think everyone here can accept Amdahl’s law, but it’s premature (how ’bout that pun) to assert that Amdahl’s law is actually the bottleneck that prohibits us from improving modern software. I think it’s fair to say your example assumes nature’s way of making babies can’t be sped up, but you haven’t taken any steps to disprove other methods. Hypothetically, a highly parallel cell-printing machine could be physically viable. On its own, Amdahl’s law doesn’t prove that anybody has done due diligence in ruling out other mechanisms that might produce the same result.
For example, consider two-digit multiplication from elementary school math…
Someone might assume these steps are all elementary and that it takes 4 single-digit muls to multiply two two-digit numbers. And it’s easy to generalize to numbers with more digits. However, there’s actually a more complex solution that uses fewer muls with the same mathematical result.
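Presumably the trick in question is something along the lines of Karatsuba’s identity; here’s a minimal sketch with hypothetical digits, where the middle term is recovered from one extra multiplication of digit sums instead of two separate products:

```python
def schoolbook_2digit(a1, a0, b1, b0):
    """Four single-digit multiplications: (10*a1 + a0) * (10*b1 + b0)."""
    return 100*(a1*b1) + 10*(a1*b0 + a0*b1) + a0*b0

def karatsuba_2digit(a1, a0, b1, b0):
    """Three multiplications; the cross term falls out by subtraction."""
    hi  = a1 * b1
    lo  = a0 * b0
    mid = (a1 + a0) * (b1 + b0) - hi - lo    # equals a1*b0 + a0*b1
    return 100*hi + 10*mid + lo

# e.g. 47 * 83, with digits a1=4, a0=7, b1=8, b0=3
assert schoolbook_2digit(4, 7, 8, 3) == karatsuba_2digit(4, 7, 8, 3) == 47 * 83
```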
We used 3 muls to get the same result. This is a trivial example, but the savings are higher with larger numbers. Likewise, if you need to multiply/divide the same numbers repeatedly, converting them to logarithms turns those operations into additions/subtractions. Tricks like this may not be so obvious in the context of very complex software. Most developers aren’t doing this level of optimization, and optimization comes with real costs, but it’s often at least mathematically possible.
The main takeaway is that yes, we can prove that no more parallelism can be squeezed out of a specific algorithm, but that alone doesn’t prove there isn’t another, more optimal algorithm. That’s non-trivial to prove.
For those lacking formal education in CS, this will be of help:
https://en.wikipedia.org/wiki/Amdahl%27s_law
Cheers.
Xanady Asem,
I have a formal education in CS and I know about Amdahl’s law. Instead of attacking the person, attack the argument.
The Wikipedia article doesn’t refute my points, and even Wikipedia’s examples are in terms of assumptions, i.e. “assume that we are given a serial task which is split into four consecutive parts…” and “Assume that a task has two independent parts,…”. Amdahl’s law is valid in terms of the assumptions as they apply to a specific task. However, what Amdahl’s law does NOT do is prove the assumptions are true of any given task; that’s left up to the person applying Amdahl’s law, and if he were here, he would agree.
Pointing out the formal definition of what Amdahl’s law is and does was clearly needed, in order to make sure the discussion was free of straw men. If you take that as an attack on the person, that’s on you.
Xanady Asem,
You’re not fooling anybody. You replied to me specifically.
I thank you for bringing up Amdahl’s law. It’s a great subject to bring up here and I would welcome an intellectual discussion about it. However, these personal attacks are throwing away the opportunity for a meaningful discussion. I may not get what I want, but: please show me that you can respond on point using friendly language without any underhanded personal attacks expressed, implied or otherwise.
k
Yes, the problem with modern GPGPUs is that you are restricted to “high-level” programming and you can’t assume memory layouts or other fine-grained details for the compiled code. GPU vendors like this because it allows them to throw out the ISA completely if they want (as long as “high-level” compatibility with OpenCL for example is maintained), not to mention they don’t have to agree on a common ISA. It is what it is.
Isn’t that exactly the principle behind Java? Abstract the hardware from the software, so the software is highly portable.
Or any high level language really, like C. With C you don’t have to consider memory layout other than stuff that is handled for you like alignment. The fact is you can write C for a 1970s PDP and it’ll compile and work on a modern computer.
The difference is you have to recompile the code. With OpenCL, one binary will run on hardware that is architecturally vastly different.
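Strictly speaking, an OpenCL kernel is usually shipped as source (or an intermediate form like SPIR-V) and compiled at runtime for whatever device happens to be present, which is what makes that portability possible. Here’s a minimal sketch with pyopencl, purely to illustrate the idea:

```python
# Illustrative sketch with pyopencl: the kernel ships as *source* and is
# compiled at runtime for whatever device OpenCL finds.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()              # picks an available OpenCL device
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void double_it(__global const float *in, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}
"""
prg = cl.Program(ctx, kernel_src).build()   # compiled here, at runtime

a = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg.double_it(queue, a.shape, None, a_buf, out_buf)
result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
print(result)                               # a doubled, on whatever device was chosen
```

The same script runs unchanged whether the runtime picks a discrete GPU, an integrated GPU, or a CPU driver, which is exactly the abstraction being argued about here.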
That’s not a “problem,” that literally is a feature we worked real hard to establish in the field.
Lack of abstraction was a principal limiter for the adoption of parallel systems. It made the programmer’s life a nightmare, when they had to deal with too many of the idiosyncrasies of the parallel programming model for a specific architecture.
Well there is already “Dawn”:
http://users.atw.hu/gerigeri/DawnOS/index.html
A SUBLEQ (one-instruction) virtual-CPU-based OS that offers “SMP support, up to basically unlimited CPU cores” and can also run on the GPU via OpenCL.
Full GUI and gfx mode.
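For anyone wondering what a one-instruction machine even looks like, here’s a minimal SUBLEQ interpreter sketch (purely illustrative; Dawn’s actual memory model and instruction encoding will differ):

```python
def subleq(mem, pc=0):
    """SUBLEQ a, b, c: mem[b] -= mem[a]; if the result is <= 0, jump to c.
    Every instruction is three cells; jumping to a negative address halts."""
    while pc >= 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3

# Tiny program: subtract mem[4] from mem[3]; the result is negative,
# so execution jumps to -1 and halts.
mem = [4, 3, -1, 7, 10]
subleq(mem)
print(mem[3])   # 7 - 10 = -3
```

In principle everything else (arithmetic, branching, I/O) gets built out of that single instruction, which is what makes the “basically unlimited cores” claim at least conceptually plausible.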
cybergorf,
+1, very interesting project. This speaks to me…
That said, they talk a big game, but I have my doubts in practice. There’s a huge gap between making an architecture work in a simulator/FPGA and a silicon reality for it. New architectures are almost never well situated to compete with the giants, regardless of their merits, because they just don’t have the resources, they don’t have the software, and they don’t have the customers. The top dogs are basically holding all the cards.
It would still be interesting to see this go somewhere though!