In continuing with my articles exploring my SPARC-based Sun Ultra 5, I’m going to cover the topic of compiler optimizations on the SPARC platform. While many are familiar with GCC compiler optimizations for the x86 platform, there are naturally differences for GCC on SPARC, and some platform-specific issues to keep in mind.
These tips are easy to apply when compiling software where performance is important. And as it turns out, the SPARC platform has characteristics that benefit from optimization, often more dramatically than x86.
Since compilers work based on the hardware architecture, these tips apply to GCC on any operating system that runs on SPARC, including all of the operating systems reviewed in this series of articles (FreeBSD, Linux, Solaris, NetBSD, and OpenBSD). They cover GCC 2.95 through the current GCC (3.3.2 as of this writing).
This article is written from the perspective of a sys admin, and not a developer. System administrators are usually concerned with performance, and these are tips to help when compiling source code.
Basics On The SPARC Platform
For the SPARC platform, there are 3 basic classes of processors: V7, V8, and V9. The SPARC V7 is the lowest common denominator for the SPARC platform; anything compiled with the SPARC V7 instruction set will run on any SPARC-based system, just like i386 is the lowest common denominator for the x86 platform.
V7-based systems include Sun’s sun4 and sun4c machines, such as the SPARCStation 1, 2, and IPX (sun4c) and the Sun 4/300 (sun4).
The V8 architecture includes sun4m and sun4d systems. The V8 architecture adds some instructions that really help out with performance, including integer divide and multiply. These benefits will become apparent in later tests.
Sun4m-based systems include the SPARCStation 5, 10, 20, and Classic, and sun4d-based systems include the SPARCServer 1000 and SPARCCenter 2000.
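To see what those extra instructions buy you, consider a plain integer multiply. The following sketch (the file name and function are purely illustrative, and the exact assembly depends on your GCC version) compares what GCC generates for the two targets; on V7 the multiply becomes a call to a software routine, while on V8 it can use the hardware instruction:

echo 'int mul(int a, int b) { return a * b; }' > mul.c
gcc -S -mcpu=v7 -o mul_v7.s mul.c   # V7: expect a call to the .mul software multiply routine
gcc -S -mcpu=v8 -o mul_v8.s mul.c   # V8: expect the smul hardware instruction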
The V9 architecture covers 64-bit processors (as opposed to the 32-bit V7/V8 processors) that are fully backwards compatible with the previous architectures. The V9 processors include the UltraSPARC, UltraSPARC II, UltraSPARC III, and the new UltraSPARC IV. V9 is known as sun4u, which is what my Sun Ultra 5 is classified as. Sun currently only makes systems based on SPARC V9/sun4u.
Architecture | SPARC version
---|---
sun4/sun4c | V7
sun4d/sun4m | V8
sun4u | V9 (64-bit capable)
Processor-specific Optimizations
On SPARC systems, GCC produces V7 binaries by default, just as GCC produces binaries based on the i386 instruction set by default on the x86 platform.
One way to potentially increase performance dramatically is to set the -mcpu option for your specific processor. Here is a portion of the entry from the GCC docs regarding this option:
-mcpu=cpu_type
Set the instruction set, register set, and instruction
scheduling parameters for machine type cpu_type.
Since the processor for my Sun Ultra 5 is a V9-based UltraSPARC IIi, I’ll use -mcpu=ultrasparc. Since the only V9 systems are UltraSPARCs, there’s no real reason to use -mcpu=v9; -mcpu=ultrasparc will work for all UltraSPARC processors and is (theoretically) the highest optimization.
It should be noted that specifying -mcpu=ultrasparc or even -mcpu=v9 for the V9/64-bit class of processors will not create 64-bit code. The code will still be tuned for the UltraSPARC processors, but the binaries will remain 32-bit. Creating 64-bit code requires the -m64 flag (-m32, for 32-bit code, is implied by default).
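As a quick sanity check (a sketch; hello.c is just a placeholder program), the file utility will confirm that -mcpu alone does not change the word size:

gcc -O3 -mcpu=ultrasparc -o hello32 hello.c        # tuned for UltraSPARC, -m32 implied
file hello32                                       # should report a 32-bit SPARC ELF executable
gcc -O3 -mcpu=ultrasparc -m64 -o hello64 hello.c   # 64-bit code requires -m64 explicitly
file hello64                                       # should report a 64-bit SPARC V9 ELF executable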
To show how dramatically -mcpu can affect performance on the SPARC architecture, I ran some comparison tests with OpenSSL 0.9.7c compiled with three -mcpu optimizations: v7, v8, and ultrasparc.
These tests were run under Solaris 9 (12/03), compiled with GCC 3.3.2 at -O3. Each test was run three times, and the results averaged; the individual results varied very little.
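For reference, the benchmark used throughout is OpenSSL’s built-in speed test. A typical build-and-run sequence looks roughly like this (a sketch; the Configure target name varies by platform and OpenSSL version, and extra flags passed to Configure are normally appended to the compiler flags):

./Configure solaris-sparcv9-gcc -O3 -mcpu=ultrasparc   # substitute the -mcpu value under test
make
apps/openssl speed rsa dsa                             # run the bundled RSA/DSA benchmark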
For the computationally-intensive OpenSSL, the -mcpu=ultrasparc optimization doubled the performance when compared with V7.
Of course not all applications will benefit to this extent; presumably there are applications that would benefit very little. But for those CPU-intensive operations, this optimization can make a big difference.
The difference is much more dramatic than what we would see with similar optimizations on the x86 platform.
To show the contrast of intra-platform optimizations, I ran the same test on a 1 GHz Pentium III x86 system. I compiled OpenSSL with -march=i386 and with -march=i686 (the highest effective optimization for my Pentium III system).
The x86 test system is running Linux 2.4, and OpenSSL 0.9.7c was again compiled with GCC 3.3.2 at -O3. Each run was done 3 times with the results averaged; again, there was very little delta between the individual runs.
Since I’m running a Pentium III, I could have used -march=pentium3. I actually did, and found no difference in results between -march=i686 and -march=pentium3. Also, OpenSSL on Linux x86 is often distributed in both i386 and i686 iterations.
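The equivalent x86 OpenSSL builds would look something like this (again a sketch; linux-elf is the usual Configure target for Linux x86 in the 0.9.7 series):

./Configure linux-elf -O3 -march=i386 && make   # lowest common denominator
./Configure linux-elf -O3 -march=i686 && make   # tuned for the Pentium III class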
Remember, we’re not comparing the performance of a 1 GHz Pentium III processor with a 333 MHz UltraSPARC IIi processor; rather, we’re comparing the difference between the lowest common denominator and the highest (effective) optimization on x86 and on SPARC.
As you can see, the i686 flag does indeed give a performance boost as expected, but it’s not nearly as dramatic as the difference between V7 and V9 (or even V8) on SPARC. This highlights the importance of optimizations for SPARC.
Contrasting With x86
You may have noticed that I used -march for x86, yet -mcpu for SPARC. For x86 GCC users this may seem confusing, since -mcpu under x86 only tunes for a specific CPU, but doesn’t take advantage of any additional instructions or functionality.
For SPARC, there is no -march flag; instead, -mcpu is used to specify platform-specific optimizations. The -mtune flag works the way -mcpu has typically been used on the x86 platform, tuning code for a particular processor without taking advantage of additional instructions. (It should be noted that the -mcpu flag has actually been deprecated on x86 GCC in favor of -mtune.)
So while -mtune is the same on both x86 and SPARC (it creates backward-compatible tuned binaries), -mcpu creates CPU-specific (and not backward-compatible) binaries on SPARC, and -march does the same on x86.
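To make the mapping concrete, the following compile lines (purely illustrative; note that on older x86 GCC releases the tuning flag is still spelled -mcpu rather than -mtune) are roughly equivalent in intent on the two platforms:

gcc -O3 -mcpu=ultrasparc -c foo.c    # SPARC: UltraSPARC instructions, not backward compatible with V7
gcc -O3 -mtune=ultrasparc -c foo.c   # SPARC: tuned for UltraSPARC, still runs on older SPARCs
gcc -O3 -march=i686 -c foo.c         # x86: i686 instructions, not backward compatible with i386
gcc -O3 -mtune=i686 -c foo.c         # x86: tuned for i686, still runs on an i386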
For great resources on GCC for x86, check out GCC Facts and Myths by Joao Seabra and the GCC x86 optimization docs from GCC.
The -On Flag
Another optimization option for GCC (universal to all platforms) is the -On flag, which controls many more specific optimization flags.
Further reading on these optimizations can be found on the GCC document site.
To see what effect the -On flag has with GCC, I compiled OpenSSL 0.9.7c with -mcpu=ultrasparc and -On, where n ranges from 0 through 3 for GCC. (There’s also -Os, which does maximum optimization except for anything that might tend to dramatically increase size, but I didn’t test that.)
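A simple way to script such a comparison (a minimal sketch; bench.c stands in for whatever you are measuring) is to build the same source at each level:

for level in 0 1 2 3; do
    gcc -O$level -mcpu=ultrasparc -o bench_O$level bench.c
done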
As before, the tests were run 3 times for each variant, and the results averaged. There was very little delta between the runs. OpenSSL 0.9.7c was used on Solaris 9 (12/03), compiled with GCC 3.3.2.
The results were quite surprising, as I had thought going in that there would be a greater delta between the various levels of optimization. As the results show, there wasn’t much difference until dropping all the way down to -O0.
This was only a single application, and the effectiveness of these optimizations will vary of course depending on your application, so keep that in mind.
SSH is Slow, but Why?
On many of the operating systems I evaluated, such as NetBSD 1.6.1, I noticed that logging in via SSH was inordinately slow; it could take 3 or more seconds to get a password prompt. I knew it wasn’t the hardware, as logging into Solaris via SSH would return a password prompt in less than a second. So what was the culprit? A little searching pointed toward OpenSSL performance.
When I ran the OpenSSL speed test on NetBSD, I got extremely poor performance:
OpenSSL 0.9.6g 9 Aug 2002
built on: NetBSD 1.6.1
options:bn(32,32) md2(int) rc4(ptr,int) des(ptr,risc1,16,int) blowfish(idx)
compiler: gcc version 2.95.3 20010315 (release) (NetBSD nb3)
sign verify sign/s verify/s
rsa 512 bits 0.0248s 0.0022s 40.2 449.9
rsa 1024 bits 0.1279s 0.0076s 7.8 131.7
rsa 2048 bits 0.9217s 0.0276s 1.1 36.2
rsa 4096 bits 6.4647s 0.0928s 0.2 10.8
sign verify sign/s verify/s
dsa 512 bits 0.0224s 0.0281s 44.7 35.5
dsa 1024 bits 0.0750s 0.0927s 13.3 10.8
This was even slower than my OpenSSL 0.9.7c tests with the V7 instruction set on Solaris 9. Performing a “/usr/bin/true” through SSH on NetBSD showed the lengthy delay:
> time ssh 192.168.0.19 “/usr/bin/true”
0:02.79
Almost 3 seconds! And it wasn’t just NetBSD, either; a few others suffered the same problem.

To fix this, I compiled OpenSSL 0.9.6l from NetBSD’s pkgsrc and compiled OpenSSH 3.7.1p2. I made sure to include “-mcpu=ultrasparc”, and ran the “/usr/bin/true” test again.

> time ssh [email protected] “/usr/bin/true”
0:01.35

I was able to cut the time almost in half with that optimization.
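If you build through pkgsrc, one way to apply the flag everywhere (assuming your pkgsrc setup honors CFLAGS set in /etc/mk.conf, which it generally does) is:

# /etc/mk.conf
CFLAGS+= -O3 -mcpu=ultrasparc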
I ran the same test on Solaris 9, using OpenSSL 0.9.7c libs compiled for -mcpu=v7 and for -mcpu=ultrasparc with OpenSSH 3.7.1p2. For -mcpu=v7, the login took almost 2 seconds.

> time ssh [email protected] “/bin/true”
0:01.56

With -mcpu=ultrasparc, however, it took less than a second.

> time ssh [email protected] “/bin/true”
0:00.95

Where To Add The Optimizations
There are a few ways to add optimizations at compile time. For many applications, you can go into the Makefile and look for the CFLAG entry, such as this one for OpenSSL 0.9.6l on NetBSD 1.6.1:

CFLAG= -fPIC -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -O2 -Wall

Here is where I would add -mcpu=ultrasparc, probably at the end:

CFLAG= -fPIC -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -O2 -Wall -mcpu=ultrasparc

For applications like MySQL, there are several subdirectories with their own Makefiles, all generated/configured when the Configure script is run. Editing just the top-level Makefile probably will not affect the subdirectories, so there needs to be another way.

Often, these applications will accept the environment variables CFLAGS (for the C compiler) and CXXFLAGS (for the C++ compiler flags).

export CFLAGS="-O3 -mcpu=ultrasparc"

Running that before you run the configure script will add those flags. You can see the various compiler-related environment variables it accepts in this excerpt from the Configure --help output of MySQL 4.0.17:
  CC          C compiler command
  CFLAGS      C compiler flags
  LDFLAGS     linker flags, e.g. -L[lib dir] if you have libraries
              in a nonstandard directory [lib dir]
  CPPFLAGS    C/C++ preprocessor flags, e.g. -I[include dir] if you
              have headers in a nonstandard directory [include dir]
  CXX         C++ compiler command
  CXXFLAGS    C++ compiler flags
  CPP         C preprocessor
This is common for the more complex open source applications.
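Putting it together, a configure run with the flags passed through the environment might look like this (the package and prefix are just examples):

CFLAGS="-O3 -mcpu=ultrasparc" CXXFLAGS="-O3 -mcpu=ultrasparc" \
    ./configure --prefix=/usr/local/mysql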
To Optimize or Not To Optimize
Optimization really depends on what you’re compiling. If you’re creating a “hello world” application, or compiling ls from GNU’s fileutils, you probably don’t need to squeeze out every ounce of possible performance. Characteristics such as mathematical operations versus I/O are all factors in the potential benefit.

Still, the performance optimizations discussed can have a huge impact on SPARC systems, much more dramatic than comparable optimizations on x86 systems.
As such, adding -mcpu options at compilation is a good idea for systems that support V8 or higher. Even if you’ve got a mix of systems, it can very well be worth your time to keep multiple sets of binaries, one for each platform you run.
1. Why aren’t you writing for one of the hardware sites? You do nice work.
2. Can you do a series like this for x86?
Thanks for an extremely interesting article, even though I have never touched a sparc in my life. You prove that optimizations do matter, in spite of those who say they make no difference.
This series of articles is as much about software as it is about hardware; in fact, software has the upper hand IMHO.
Hey, excellent article. I use Gentoo Linux on my ultrasparc and the best thing about it is everything is compiled with a common set of CFLAGS — so any optimization settings I make are carried through all installed apps.
If you really want to get the most from your UltraSPARC you should really check out gentoo linux.
Just wanted to say thanks for the well written article. It’s insightful, and while I’d love to see the effects of optimization on various pieces of software instead of just OpenSSH, it’s something I can do in my own time.
Yep, I echo those sentiments. It’s definitely a good article.
A couple of speculations I have about the performance of SPARC. My guess is that because it’s largely a RISC chip, instruction scheduling is much more important, which is why you get a larger jump in performance when optimizing for the correct CPU architecture. That probably explains why the performance delta on x86 isn’t so great.
Heh… couple? I guess that’s just one speculation 🙂
Perhaps I’ve misjudged gcc on SPARC by not properly performance tuning the flags I’ve been passing when I’ve used gcc. It’d be quite interesting to see the benchmarks in comparison to the C/C++ compilers included in Sun’s Forte Compiler Collection.
Aaron Bennett (IP: —.edu)
If you really want to get the most from your UltraSPARC you should really check out gentoo linux.
If you really want to get the most from your UltraSPARC system you should probably be running Solaris…
” You prove that optimizations do matter, in spite of those who say they make no difference.”
Who in their right mind goes around saying that compiler optimizations make no difference, especially dealing with a RISC design?
Compiler optimization makes all the difference with respect to RISC machines; sure, hardware is usually a few steps ahead when it comes to dynamic scheduling, but most RISC chips depend heavily on good scheduling, a dependence taken to an extreme by VLIW machines.
Hello,
For those out there optimizing for SPARC who are writing code: don’t forget to limit the depth of subroutine calls. SPARCs use a sliding register structure, and if you get too deep in subroutine calls it will kill performance. That’s one reason why unrolling loops (and inlining) may help your code.
Donaldson
I have to agree with what Aaron said about Gentoo. I’ve tried out many different OSs/distros with my Ultra 10. I’ve always been a huge Debian fan, so I used it happily on the machine for two years or so.
Recently, I had to install new hard disks and start again from scratch, so I decided to see how Gentoo worked, just for curiosity. Although it took a LONG time to build from scratch, I can really see the speed difference, especially with gcc flags like those discussed in this article. I would highly recommend Gentoo for anyone running an Ultra.
Again… great article…
i can’t wait for the next…
“I use Gentoo Linux on my ultrasparc and the best thing about it is everything is compiled with a common set of CFLAGS”
I agree that this is a nice feature, but I question how advantageous this really is in practice. I suspect that one would get most of the speed benefits by simply carrying out targeted optimization of those key apps that really stand to benefit from it (such as OpenSSL, which was the example chosen in this review). It’s perfectly possible to do this in binary distributions; Debian has apt-build (http://packages.debian.org/unstable/devel/apt-build), for example. Anyway, a benchmark putting this to the test would be quite interesting.
Additionally, there’s the flip side of the coin to consider. For the apps that *don’t* particularly stand to benefit, some so-called system-wide “optimizations” may actually have a negative effect. For instance, the majority of Gentoo users blindly set “-O3” as the default CFLAGS for their x86 systems (http://www.mail-archive.com/[email protected]/msg02236.html) even though in many cases “-O2” would probably yield better performance.
Sparcs use a sliding register structure and if you get too deep in subroutine calls it will kill performance.
These are called register windows. The performance degradation occurs from taking a Spill Trap, which is when you fill your set of register windows and have to save one off elsewhere. Judicious register use can sometimes avoid this, but then again, on x86, the standard procedure is to push everything onto the stack since register usage is tight, and there aren’t really any commonly used alternate constructs.
If you really want to get the most from your UltraSPARC you should really check out gentoo linux.
I don’t know if this is necessarily true or not. I’m sure gentoo works well on SPARC, but one of the advantages of having such hardware is that it’s really easy to get Solaris to run on it w/o much hassle. I’d try both as they each have different features, strengths, etc. But realistically, this article is about optimizing application performance on SPARC with GCC. Tony has done a great job of presenting the topic cogently, and frankly it’s not much use to have the conversation degenerate into a “my operating system is bigger than yours” contest. I’d be curious to know if he’s planning a similar article for optimizations with Sun’s compilers.
Too bad that you didn’t test the -Os flag. When the binary is size-optimised, the likelihood of L1 and L2 cache hits is higher, which can make it faster too.
I concur – this is a well thought out and executed article. I would love to see more compiler/optimization articles. Especially by this author. Thanks.
I too would be curious to see how -Os would do compared to the others, especially how a 64-bit binary compiled with -Os would do compared to a normal 32-bit one, since 64-bit binaries tend to be larger. Would a 64-bit -Os binary still be larger than, say, a 32-bit -O3? Beyond that, I recently purchased my first Ultra machine, specifically to run Solaris. It’s an Ultra 60, and I plan on upgrading the hell out of it. 🙂 I also downloaded some Aurora isos just to give it a whirl and see if Linux on these older UltraSparcs really is faster, but even if it is I want to run Solaris.
I would especially love to see more; I can’t wait for the next one…
…’cause the Free Software Companion CD of Sun Solaris 9 has two gcc versions: gcc-2.95.x and 3.2…
Are there differences between both gcc versions? (At least in Solaris 9/Sparc?)
I tried those optimizations on my Quad/SS20 and my Dual/SS10 and it’s definitely worth the try.
Warning: if you have SMP systems (like mine), some software doesn’t use the SMP capability of SPARC engines, like some ray tracers and other number-crunching apps (povray and so on).
BSDero
For these tests, I used GCC 3.3.2 (as outlined in the article). There’s very little performance difference between 3.3.2 and 2.95 (from my tests in the GCC versus Sun compiler article, 2.95 was actually very slightly faster, by about 1%).
I prefer using 3.3.2, simply because it’s the most recent.
After reading this article, I went and tried compiling Ethereal ( http://www.ethereal.com ) with “-mcpu=ultrasparc” to test how it performed; the tests I did were CPU intensive but they involved little to no math (multiplication or division).
The results were very different from what was shown in this article: in particular, it didn’t change much. My post to the Ethereal-dev mailing list (with my results) is archived here:
http://www.ethereal.com/lists/ethereal-dev/200403/msg00021.html
My guess is that SSL is doing a lot of work that became hardware instructions in the newer chips.
It looks like for two of the tests, there was a significant difference, although not as dramatic as the OpenSSL tests. About 30% for one test, and 17% for the other. Whether that’s worth it to you or to the project itself of course is a matter of opinion.
Also remember, there is I/O involved in these tests, and I/O operations do not benefit from compiler optimizations. When I ran tests with gzip for the various compilers in an earlier article, every compiler showed about the same results for a gunzip operation. The reason is likely that the bottleneck was the disk, as it couldn’t read the data fast enough to show any difference.