The GameCube GPU is a complex, tight-knit piece of hardware with impressive features for its time. It is so powerful and so flexible that it was used unmodified within the Wii architecture. For comparison, just imagine a SNES running with an NES’s graphics system. This is completely unheard of, before or since. The GameCube is a remarkable achievement of hardware engineering! Because of those impressive capabilities, emulating the GameCube’s GPU has been one of the most challenging tasks Dolphin has ever faced.
Fantastic in-depth look at specific parts of the GameCube/Wii GPU, written by the developers of the Dolphin emulator.
Don’t try to do integer math with floating point. It works for normal math, but as soon as anything like bitwise operations or integer overflow is expected (and it always is in pixel handling), it just won’t work.
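To illustrate the kind of thing that goes wrong, here is a toy C++ sketch of my own (not anything from Dolphin): fixed-width integer hardware wraps and keeps the low bits of an oversized result, while a float keeps only the top 24 significant bits and silently rounds the rest away, so trying to recover those low bits afterwards gives a different answer.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t a = 0xABCDEF;                   // arbitrary 24-bit inputs
    uint32_t b = 0x123457;

    // Integer hardware: the product wraps, keeping the LOW 24 bits.
    uint32_t wrapped = (a * b) & 0xFFFFFF;

    // Float: the ~48-bit product is rounded to 24 significant bits, so the
    // low bits we wanted are already gone before we can mask them off.
    float    product    = (float)a * (float)b;
    uint32_t from_float = (uint32_t)fmodf(product, 16777216.0f); // 2^24

    printf("integer wrap: %06X\n", (unsigned)wrapped);
    printf("via float:    %06X\n", (unsigned)from_float);  // disagrees
    return 0;
}
```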
I’m pretty sure they already knew that, but graphics hardware simply couldn’t do integer math, hence the kludges.
They knew this, but integer math on the GPU was not all that common until recently. Back in 2003, DirectX 9 and Shader Model 2 were brand new. It wasn’t until 2006 that integer math was added in DirectX 10, and between the failure of Windows Vista and the time it took commodity graphics cards to catch up, only now can you be fairly safe using integers in shaders. (Remember that Dolphin can run on computers with fairly low-end graphics hardware.)
They could have used SIMD and done it on CPU, or more recently have used OpenCL or CUDA.
Using SIMD on the CPU would make it a software renderer! They already have one of those.
Using compute shaders or OpenCL is probably the only way they will ever be able to get it completely right. The vertex shader output is passed on to the perspective division, and I assume that part was also done with integers on the GPU they are emulating. The only way to do this would be to use compute shaders for the entire thing.
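Just to sketch the idea (purely illustrative C++ with made-up bit widths, not the GameCube’s actual internal format and not anything from Dolphin): a perspective divide done entirely in fixed-point integer arithmetic would look something like this, and the appeal is that every GPU would then compute bit-identical results.

```cpp
#include <cstdint>

// Hypothetical 20.12 fixed-point clip-space position; the widths are made up
// for illustration and are not the GameCube's real internal format.
struct FixedVec4 {
    int64_t x, y, z, w;
};

// Perspective divide done entirely with integers (assumes w != 0 after
// clipping): scale the numerator up by the fraction size (4096 = 2^12)
// before dividing so the quotient keeps its 12 fraction bits.
inline FixedVec4 PerspectiveDivide(const FixedVec4& p)
{
    FixedVec4 out;
    out.x = (p.x * 4096) / p.w;
    out.y = (p.y * 4096) / p.w;
    out.z = (p.z * 4096) / p.w;
    out.w = 4096;              // 1.0 in 20.12 fixed point
    return out;
}
```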
But using compute shaders would restrict even further which GPUs are able to run the emulator. And speed might suffer because they lose the hardware zbuffer and other features.
Today, a fast-enough software renderer may be plausible, but only due to rapid improvement in CPU performance in the past couple years. And of course it’d still raise the requirements of Dolphin more than this merge does.
There has been practically no increase in CPU power in the last couple of years. AMD and Intel have been focusing on power consumption rather than processing power, which has made progress stall. The last big bump in computing was four years ago with Intel’s Core 2, and that, too, came after a period of slump.
…which still counts as speed increases for some of us.
I had a computer from about six years ago with an Athlon 64 X2 5000+ (the fastest 65 W TDP AMD chip at the time). When my motherboard died last January, the fastest AMD part I could get at 65 W TDP was an Athlon II X2 270, which runs my heavy tasks at about two-thirds of the CPU load of my old chip.
(Still dual-core, since the main things that bog down here are single-threaded, like emulation or Firefox’s main thread.)
Next time, I’m hoping the power-consumption work will have borne enough fruit for me to finally go quad-core at 65 W TDP without sacrificing speed. (After all, I DO compile things and do other stuff that can be parallelized… it’s just not as high a priority to speed that up further.)
Sadly, it doesn’t fix the utter lack of shadows and darkness in “Luigi’s Mansion”, but judging from the bug reports, they still haven’t figured out why the shadows are missing.
But it needs a lot of hardware resources; on my MacBook Air (i5) some games are unplayable (Metroid Prime, for example).
It runs perfectly on my new Mac Pro (the “basic” quad-core model) at 60 fps, though… but hey, it’s kind of crazy to use expensive hardware to emulate the Wii/GC when you can have the real consoles for less than 50 bucks, haha (yeah, they don’t do HD like Dolphin does, that’s true, but for me it’s all the same; I don’t care about graphics at all xD)
I think the Mac port, which uses OpenGL, is significantly slower than OpenGL on a hardware-accelerated Linux setup or DirectX on Windows.
From what I understand, and this could be totally incorrect, the Mac GL drivers are not great, and the speed the driver gets out of the card is nowhere near what the hardware can actually do.
Yeah, I suspect that too.
Mac OS X sucks at this kind of thing… games that run perfectly on Windows usually need twice the resources on Mac OS X. Bummer.
That depends on the game, I think; some use DirectX-to-OpenGL translation layers (EVE Online, for example), which makes things slower.
Engines that have native OpenGL backends and actually use the “core” (3.x or 4.x) OpenGL profile on OS X *usually* run fine. For example, I compared scores in the Heaven benchmark on my hackintosh (i7-950 with an Nvidia GTX 670), and with the same settings the scores are comparable.
On the other hand, some OpenGL applications under OS X, for reasons that are unclear to me, still use the “compatibility” profile (which means OpenGL 2.1 only), and that MAY be a reason why things are slower. X-Plane, for example, runs at about 1/2 to 2/3 the FPS of my Windows installation. Maya viewports are also way more “choppy”, but that could be due to some weird Qt bug (disabling the status line display speeds things up…).
OpenGL on OS X has only just barely caught up with the rest of the world (10.9 finally supports 4.x), but as long as Apple holds the keys to the OS X OpenGL drivers it’s always going to lag behind a bit.
The biggest difference between OpenGL and DirectX is the general model of control. With DirectX, the developer or application manages the resources and the drivers are thin. With OpenGL, the developer or application issues only general commands and the driver takes care of all the resource management and communication. So OpenGL performance is DIRECTLY bound to driver quality and implementation.
In the case of Apple and OS X, Apple develops the drivers in-house, and they have been much more successful with AMD/ATI cards than with Nvidia. Hence OpenGL performance with Nvidia cards is very lousy, while with AMD cards it is OK. No wonder Apple went AMD-only with the new Mac Pro.
So DirectX performance depends on the development skills behind the particular application, while OpenGL performance is bound to the driver developer. With OpenGL, a single driver must meet the requirements of every application out there. With DirectX, the developer can fine-tune and optimize endlessly to ensure the best performance.
And X-Plane uses OpenGL 4.x if available.
I’ll have to double-check, but I could’ve sworn that the last time I looked, the graphics settings menu said “2.1-something-something” at the bottom.
Yup, checked it. X-Plane 10.25 64-bit on OS X 10.9.2 shows the OpenGL renderer as 2.1 in the graphics options menu.
Found this explanation on the Aerosoft forums:
X-Plane tries to use modern features even though it claims to be an OpenGL 2.x program, since it isn’t allowed to run with higher versions under Mac OS X while it still uses OpenGL 1 calls. And that’s the problem. All modern cards use the unified shaders targeted by OpenGL 3 and 4, while OpenGL 1 and 2 used a totally different shader model with specialized vertex and pixel shaders. So the drivers try to implement the old shader model on modern hardware, while X-Plane tries to emulate modern OpenGL 3 and 4 techniques like instancing within the boundaries of the old shader model. For the drivers this is a nightmare and totally uncommon, since everyone in their right mind would simply use the modern calls directly. So at the moment it is a torture test for the graphics card drivers.
Hopefully they can finish eliminating the old OpenGL 1 calls in 10.30, so that Mac OS X no longer sets the legacy flag and lets them use OpenGL 3/4 openly. That might help with the driver problems.
Or of the particular engine, more often than not?
Yep, that was what made me go with Windows 7 back when I was waiting for both Windows 7 and Mac OS X Snow Leopard to be released.
The quality of the OpenGL drivers wasn’t worth the price difference.
The problem is that the Air is an extremely weak gaming machine;
most likely the main reason for your performance problems is the Intel GPU.
I like these articles a lot more than the phone ones; they speak to me… but that could just be me.
I actually don’t think emulating integers with floating point was such a bad idea. At first, it struck me as odd that they were having trouble with it. After all, the mantissa (the significand inside the floating-point type) is an integer. So long as you can emulate an integer’s overflow/borrow behavior, the mantissa in floating-point arithmetic should be 100% identical to the integer in integral arithmetic.
Edit: I apologize for the messed-up formatting; the math was far easier to read when it was correctly aligned. Alas, I don’t know how to force OSNews to line up digits. OK, the underscores help a bit.
|___ ____ ____ 1111 1111  A
|+__ ____ ____ 1000 0000  B
|=__ ____ ___1 0111 1111  A+B
By adding in high control bits, you could manage a shift operation, and with shifting you could achieve a rudimentary form of bit masking.
|___ ____ ___1 0111 1111  A+B
|+10 0000 0000 0000 0000  floating point shift right
|=10 0000 0001 xxxx xxxx  (least significant bits dropped by floating point unit)
|-10 0000 0000 0000 0000  floating point shift left
|=__ ____ ___1 0000 0000  overflow bit
|___ ____ ___1 0111 1111  A+B
|-__ ____ ___1 0000 0000  overflow bit
|=__ ____ ____ 0111 1111  A+B with normal integer overflow
So, unless I’ve missed something, that’s 4 floating point operations to emulate one integral addition. Multiplication might follow a similar pattern.
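For what it’s worth, the same add-a-big-constant idea can be sketched in a few lines of C++ for the 8-bit case (my own toy code, nothing from Dolphin). With IEEE single precision and its 24-bit significand, adding 2^31 forces the result to be rounded to a multiple of 256, which is what discards the low 8 bits; subtracting 127.5 first turns that round-to-nearest into a truncation for integer inputs:

```cpp
#include <cstdio>

// Emulate an unsigned 8-bit addition with wrap-around using only float
// operations.  Assumes strict IEEE single-precision evaluation.
static float add8_wrap(float a, float b)
{
    const float MAGIC = 2147483648.0f;            // 2^31
    float sum  = a + b;                           // 0..510, exact in a float
    float high = (sum - 127.5f + MAGIC) - MAGIC;  // sum truncated down to a multiple of 256
    return sum - high;                            // the wrapped 8-bit result; 'high' holds the carry
}

int main()
{
    printf("%g\n", add8_wrap(255.0f, 128.0f));    // 255 + 128 = 383 -> prints 127
    printf("%g\n", add8_wrap(200.0f, 200.0f));    // 200 + 200 = 400 -> prints 144
    return 0;
}
```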
However, I then read the part where they didn’t have enough mantissa bits in the 32-bit floating-point units to properly represent the 24-bit integers, much less the carry bit.
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
Well shoot, that sure puts a damper on things. Given this fact, I have to agree with them that this whole approach was bound to fail no matter what.
Forgive my ignorance, but why not support 64-bit architectures only? Desktops have been shipping with 64-bit CPUs since 2005-ish, and any machine powerful enough to emulate the GC/Wii in the first place is most likely running (or would at least benefit from) a 64-bit OS.
64-bit floating point (a.k.a. doubles) is available on both 32-bit and 64-bit CPUs, so no 64-bit CPU is required. However, they are trying to simulate integer math on the GPU, which traditionally supported only 32-bit floats.
Most newer GPUs also support 16-bit half-float load/store operations as well as full 64-bit doubles. The only catch is that they do 64-bit at half the speed, and it’s supposedly gimped further on consumer versions. It seems doubles are a big deal in certain scientific scenarios and those customers are willing to pay $$, so you need one of the pro cards to get them to perform well.
Kroc,
dpJudas is correct. 64-bit CPUs are capable of doing it (32-bit CPUs are as well); however, the whole motivation for GPUs was (and is) to significantly ramp up the number of computations that can be executed in parallel (way more than even SSE). DX9-based GPUs were apparently limited to single-precision floating-point numbers, which only have 23-bit mantissas. That is just two bits shy of holding a 24-bit number plus a carry. If it weren’t for this, GPUs from many years ago could probably still offer more aggregate performance than a multicore CPU today, owing to the sheer number of execution units.
So the dilemma wasn’t about the transition to AMD64 so much as the transition to DX10/11. For instance, even though all my computers are 64-bit, I don’t have any that are capable of DX10.
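A quick way to see the limit (again just my own two-liner, nothing to do with Dolphin’s code): the largest 24-bit value survives a round trip through a single-precision float, but a 24-bit value plus a carry into bit 24 needs 25 significant bits and does not:

```cpp
#include <cstdio>

int main()
{
    float ok   = 16777215.0f;  // 2^24 - 1: the largest 24-bit value, exactly representable
    float lost = 16777217.0f;  // 2^24 + 1: needs 25 significant bits
    printf("%.1f\n", ok);      // prints 16777215.0
    printf("%.1f\n", lost);    // prints 16777216.0 -- the extra bit was silently rounded away
    return 0;
}
```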
Another important issue with DX9 hardware, besides the nature of the ALU pipeline, was the limited programmability of the graphics pipeline and how little of it was exposed to the programmer.
DX10 hardware is commonplace by now and overcomes most of those limitations, while providing the more balanced I/O characteristics necessary for general-purpose computation.