“Apple first introduced PowerMac G4 computer systems using AltiVec — a high performance vector processing expansion to the PowerPC architecture — in the fall of 1999. Architecturally, AltiVec adds 128-bit-wide vector execution units to the PowerPC architecture. Early versions of the G4 processor had a single AltiVec unit, while more recent versions have up to four units (simple, complex, floating, and permute). These vector units operate independently from the traditional integer and floating-point units”. Read the interesting article at O’ReillyNet.
While the actual hardware seems pretty good, I think the software that takes advantage of it is sort of iffy. Well, it’s not really the software’s fault. The point of C is to be a portable assembler, where it will take advantage of the hardware you have to create reasonably fast binaries. The C compiler should be smart enough to be able to optimize the binaries for AltiVec.
I just confirmed this with people in #c (OPN), and for an application to take advantage of MMX or SSE on the x86, it simply needs to be recompiled with proper flags (on a decent compiler).
Asking developers to go and change their sources is a big deal if you ask me, especially seeing how little of the market they hold.
Well, maybe it’s not really a matter of fault, but here is what I believe:
Going from C code to fully vectorized code is called “auto-vectorization”. Unfortunately, Cray, the supercomputer maker, owns the patents for doing auto-vectorization at the scale the AltiVec instructions would need. I also know that some features of MMX and SSE cannot be used automatically by the compiler. And, by the way, some features of AltiVec can be used automatically by the compiler, but only the simple ones, such as moving 128 bits at a time.
Even if gcc could auto-vectorize well enough to fully utilize AltiVec, it would still be nice to be able to hand-code some of the instructions. Because, well, compilers aren’t the best code writers. If they were… we wouldn’t have assembly.
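To show what hand-coding means in practice, here is a rough sketch using the AltiVec C intrinsics from <altivec.h> (gcc with -maltivec); the function names are mine, and the arrays are assumed to be 16-byte aligned with a length divisible by 4:

    #include <altivec.h>

    /* Plain C loop: the kind of thing a compiler might handle on its own. */
    void add_scalar(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Hand-coded AltiVec loop: four floats per pass.
       Assumes 16-byte-aligned pointers and n divisible by 4. */
    void add_vector(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            vector float va = vec_ld(0, &a[i]);
            vector float vb = vec_ld(0, &b[i]);
            vec_st(vec_add(va, vb), 0, &dst[i]);
        }
    }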
“The point of C is to be a portable assembler, where it will take advantage of the hardware you have to create reasonably fast binaries. The C compiler should be smart enough to be able to optimize the binaries for AltiVec.”
I agree, the compiler should be able to generate a reasonable number of AltiVec instructions; going all out probably isn’t necessary. I just look at it like this: the generated code is only as good as the compiler’s authors, which I think is part of the reason GCC lags so far behind commercial compilers.
“I just confirmed this with people in #c (OPN), and for an application to take advantage of MMX or SSE on the x86, it simply needs to be recompiled with proper flags (on a decent compiler).”
I’d guess you asked the wrong person there, then. A simple recompile isn’t all it takes. Sure, the compiler (in this case GCC) will do some things for you, but there are hundreds of instructions in the MMX and SSE sets, and I can guarantee it won’t use all of them, just a handful if that.
“Asking developers to go and change their sources is a big deal if you ask me, especially seeing how little of the market they hold.”
Why? Intel is doing this exact same thing with their Pentium 4 chips. Granted, their market share is substantially higher, and they do make their own optimizing compiler (which does a damn good job, I might add), but they still had the guts. Now if Apple could just get some people working on a compiler… perhaps they could take a few hundred people from legal, or even the Aqua UI design team. (Cheap shot, I know.)
Brandon
Compilers are getting more and more advanced. Yes, by turning on a few compiler switches you can enable a few MMX, SSE, and AltiVec optimisations.
But not many, and usually only minor ones. Taking full advantage of vector processing instructions means using different algorithms, which is not something a compiler can do, and that is why the article was written.
This is why there are not many programs that take advantage of MMX, SSE, SSE2, or AltiVec. How many years have these extensions been around? Two, three, four? You see some hand-rolled 3D drivers and some image-processing add-ons, but not much more.
About six months ago Intel showed off a new compiler that can do a lot more SSE2 optimisation. Motorola is also supposed to be producing a better compiler back end for using AltiVec instructions. Both of these companies would love it if more people compiled their programs for vector processing, so they could sell more vector-capable processors.
“Now if Apple could just get some people working on a compiler.”
Apple has a full team working on gcc. They import the FSF sources once in a while and tweak them for AltiVec, and they tweak them for use with an IDE rather than makefiles. You’re able to see what’s being done at the compiler level at Apple; you just need to subscribe to the right mailing list, which, by the way, will give you each and every patch posted.
The team works mainly on Objective-C/C++, C, and C++, but once in a while they shoot down F77 bugs too.
Apple also has engineers working on gdb, but they only recently started trying to catch up with “current” versions of GDB, so on the debugger level they are lagging behind.
Motorola also signed a deal with Red Hat to bring better PPC support into GNUPro, which means gcc, gdb, and the rest will benefit from it. Thus Apple’s AltiVec patches might get incorporated faster. Motorola also owns Metrowerks, the “other” PPC compiler maker, whose compiler runs on OS X, though it does not support Objective-C.
Ludo
—
http://islande.hirlimann.net
I write extensions for 3D software, a place where vectorization could theoretically make a big difference. In the particular place I tried it, there really wasn’t much difference. By the time I was done vectorizing, there was only about a 10% speedup, and because everything had to be aligned and the data structures padded out to 16 bytes (x, y, and z as floats only need 12), CPUs without SSE lost about 10% from the original speed, probably because of increased cache misses from the larger data set. In the end, I reworked the plain vanilla C++ algorithm into something that wasn’t vectorizable (is that a word?) and it blew the doors off the SSE version. The C++ version was easier to maintain and was cross-platform. AltiVec, SSE, 3DNow!, etc. are all nice, but I’d rather see general floating-point performance get better; that way everything wins.
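For anyone who hasn’t hit the alignment issue, the padding looks roughly like this; the struct names are mine, just for illustration, but the sizes follow from standard float layout:

    #include <stdio.h>

    /* A plain 3D vertex: three 4-byte floats, 12 bytes total. */
    struct vec3      { float x, y, z; };

    /* The SIMD-friendly version: padded to 16 bytes so each vertex
       can be fetched with one aligned 128-bit load. */
    struct vec3_simd { float x, y, z, pad; };

    int main(void)
    {
        /* Roughly a 33% larger working set, which is where the extra
           cache misses described above come from. */
        printf("vec3: %lu bytes, padded: %lu bytes\n",
               (unsigned long) sizeof(struct vec3),
               (unsigned long) sizeof(struct vec3_simd));
        return 0;
    }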
SIMD in some form or another exists on all the more popular platforms. Maybe add a SIMD library to the STL that abstracts it for developers; a rough sketch of the idea is below. Each platform’s compiler would be responsible for generating the appropriate SIMD instructions from the calls, or for expanding them out to plain old-fashioned math on platforms that don’t support SIMD. That would probably speed adoption, maintain portability of code, and be a bonus for the end user. Not knowing anything about any of the SIMD instruction sets except SSE, this may be just wishful dreaming.
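Something along these lines is what I’m imagining; the add4 name and the dispatch via preprocessor defines are just a C sketch of the idea, not a real library, and the pointers are assumed to be 16-byte aligned:

    #if defined(__ALTIVEC__)
    #include <altivec.h>
    /* AltiVec path: one 128-bit load/add/store. */
    static void add4(float *dst, const float *a, const float *b)
    {
        vec_st(vec_add(vec_ld(0, a), vec_ld(0, b)), 0, dst);
    }
    #elif defined(__SSE__)
    #include <xmmintrin.h>
    /* SSE path: the same operation with Intel intrinsics. */
    static void add4(float *dst, const float *a, const float *b)
    {
        _mm_store_ps(dst, _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b)));
    }
    #else
    /* Fallback: plain old-fashioned math for everything else. */
    static void add4(float *dst, const float *a, const float *b)
    {
        int i;
        for (i = 0; i < 4; i++)
            dst[i] = a[i] + b[i];
    }
    #endif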
AltiVec is pretty much the same thing as SSE/SSE2, granted that AltiVec has its own set of registers because it’s a separate execution unit. However, Intel has more popularity in the industry. Now, supposedly Macs are popular in the education and research fields, at least that’s Apple’s aim. If universities or research facilities were to write their own software, or hand optimize open source code, good for them. AltiVec will give them a boost in performance, since we already know that the PowerPC CPU alone is not as powerful as we thought it was (I’m referring to the C’t benchmarks).
I’m looking at AltiVec more as a perk for using a Macintosh. I don’t expect software companies to pick it up any time soon and I wouldn’t be disappointed if it never gains popularity. I’m a programmer, and I don’t mind occasionally hand writing some assembly code to make my programs faster.
What attracts me to the G4 is that the vector unit is a separate execution unit. When I was inlining assembly code in my C/C++ programs to take advantage of 3DNow!, I was hit by a speed penalty for mixing 3DNow! and FPU code. I haven’t tested it yet, but from what I hear there won’t be a penalty for mixing AltiVec instructions with FPU instructions.
In all seriousness, if MacOS programmers don’t start using AltiVec where it can be used, Apple will get a bad name for bragging about speed and performance that doesn’t exist in their platform. “Put your money where your mouth is”: we’ve all heard that expression.
In his examples he reduced n to n/4. Is this correct? Shouldn’t he instead increase the step counter from 1 to 4?
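For what it’s worth, raising the step counter to 4 and cutting the trip count to n/4 come to the same thing when each pass handles four elements; a plain C sketch (names made up) of the counting:

    /* Either way, each pass handles four elements, so the body runs
       n/4 times.  Assumes n is a multiple of 4. */
    void add4_at_a_time(float *c, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {   /* step counter raised to 4 ... */
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
        /* ... which is the same as "for (j = 0; j < n/4; j++)" with
           indices 4*j .. 4*j+3.  Both describe n/4 passes. */
    }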
Thoems
The AltiVec features of the G4 are awesome. Even if the software tools are currently somewhat lacking, they at least provide access to the chip’s features.
The Itanium has similarly awesome features, although, if I recall correctly, it has only a few functions. With 128 floating-point registers and 128 64-bit integer registers (plus more registers, for a total of over 300), the Itanium is a math-processing monster. We’ll have to wait until really high-speed chips are out for an accurate comparison, as the first chips are more proofs of concept than production-quality chips meant to destroy the competition.
http://www.intel.com/products/server/processors/server/itanium/inde…
However, the fastest chip for this type of math seems to be the PACT XPU 128 eXtreme Processing Platform chip. It has been rated at 80 times the performance of the latest Pentium. The chip is made by Pact Corp. http://www.pactcorp.com/
PACT, the designer of the fast 32-bit signal processor architecture, XPU 128, is making these fully parallel and reconfigurable IP (intellectual property) cores immediately available as an “algorithmic co-processor” for leading CPU and DSP vendors. Using advanced parallel and reconfigurable technologies from the company’s eXtreme Processor Platform (XPP), the IP core can map any form of algorithm into multiple individual ALUs (Arithmetic Logic Units) arranged in an array inside the core.
http://www.hoise.com/primeur/01/articles/weekly/AE-PR-08-01-64.html
The first XPP device achieves sustainable peak performance in excess of 50 Giga Operations per second (GigaOps), making it the world’s most powerful 32-bit processor.
http://www.hoise.com/primeur/00/articles/weekly/AE-PR-11-00-45.html
By the way, the XPU128 isn’t named for the bit width of its I/O, as some graphics chips are; the XPU128 uses 32-bit data paths. Instead, the core of the XPU128 is 128 processing array elements, arranged in a dual 8×8 configuration.
http://www.byte.com/documents/s=481/byt20001016s0001/index4.htm
It would be totally awesome if someone added one or more of these XPU 128 chips to existing motherboards in conjunction with standard processors.
It looks like there are many approaches to radically improving performance, from add-on processor instructions such as MMX, SSE, SSE2, and AltiVec, to entirely new architectures like the Itanium with its massive number of registers, and on to the current ultimate, the Pact XPU 128 processor with its 128 ALUs and reconfigurable capabilities!
All these approaches require better compilers that can take advantage of them. Automatic tools are obviously more convenient than manual tools, but there is still a lot to be said for a human mind figuring out the optimal implementation for each of these technologies.
Let’s all encourage motherboard manufacturers to add the Pact XPU128 chip to their motherboards!
“The C compiler should be smart enough to be able to optimize the binaries for AltiVec.”
It’s not that simple. AltiVec, like MMX, SSE, and so on, is a vector unit implementation: it processes multiple data elements at the same time for a performance boost. Compilers can’t just slap in a few different instructions to speed things up.
“I just confirmed this with people in #c (OPN), and for an application to take advantage of MMX or SSE on the x86, it simply needs to be recompiled with proper flags (on a decent compiler).”
The process is called autovectorization. The compiler needs to look at a loop and recognise that multiple sets of data are being worked on and then combine these into vector operations. Intel’s compiler does this for some simple loops, however my company has forged a living out of making proper autovectorizing compilers on x86 and other platforms such as the PS2.
This page gives you an example of some loops which can be vectorized (it’s designed as a competitive analysis between ourselves and Intel, but just ignore that); http://www.codeplay.com/vectorc/feat-vec2.html
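To give the flavour without clicking through: the first loop below is the classic vectorizable case, while the second has a loop-carried dependency and can’t simply be combined into vector operations. These examples are mine, not taken from that page:

    /* Vectorizable: every iteration is independent, so a compiler can
       recognise it and fold four iterations into one vector operation. */
    void saxpy(float *y, const float *x, float a, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Not straightforwardly vectorizable: each element needs the one
       computed just before it (a loop-carried dependency). */
    void prefix_sum(float *y, int n)
    {
        int i;
        for (i = 1; i < n; i++)
            y[i] = y[i] + y[i - 1];
    }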
AltiVec itself is probably the best vector unit in domestic CPUs; it’s very nice indeed, although it doesn’t have double precision like SSE2 does, IIRC. It’s an obvious target for our compilers, but the market isn’t quite big enough, and trying to get anything out of Apple was like pulling teeth.
Since this is the first incarnation of AltiVec, we’ll probably see a better implementation in the G5 processor, whenever Motorola decides to roll that out. Did the first version of SSE have double-precision floating-point capability, or was that something added in SSE2? Either way, Motorola is a little behind in the race.
Are we debating the technical merits of AltiVec, or Apple/Motorola strategy? I think we’ve established that AltiVec is great and powerful; the lack of industry support is its only real shortcoming. Compiler support would be great, but its absence shouldn’t stop us from developing applications that take advantage of AltiVec, I mean REALLY take advantage of it. If PowerPC hardware were more popular, I’m sure developers would hand-code PPC assembly into their apps.