Linked by Thom Holwerda on Sun 16th Mar 2008 21:51 UTC, submitted by Oliver
Benchmarks "In May 2007 I ran some benchmarks of Dragonfly 1.8 to evaluate progress of its SMP implementation, which was the original focus of the project when it launched in 2003 and is still widely believed to be an area in which they had made concrete progress. This was part of a larger cross-OS multiprocessor performance evaluation comparing improvements in FreeBSD to Linux, NetBSD and other operating systems. The 2007 results showed essentially no performance increase from multiple processors on Dragonfly 1.8, in contrast to the performance of FreeBSD 7.0, which scaled to 8 CPUs on the benchmark. Recently Dragonfly 1.12 was released, and the question was raised on the dragonfly-users mailing list of how well the OS performs after a further year of development. I performed several benchmarks to study this question."
Thread beginning with comment 305375
To view parent comment, click here.
To read all comments associated with this story, please click here.
RE[4]: Makes sense.
by nick on Mon 17th Mar 2008 13:17 UTC in reply to "RE[3]: Makes sense."

I did it for you. ;)

Writing a fine grained kernel is hard, especially when you need to support some functionality that's not so friendly for large MP systems.

Maybe Dillon's approach is right. Instead of trying to coordinate actions between processors in a lock free or finely grained manner while trying to maintain tons of shared state, why not treat your large MP machine as many smaller ones that are coordinated at a higher level?

This is the McVoy cache-coherent cluster approach.

It has always seemed like handwaving to me. The problem with this is: "what higher level?"

What does it buy you to program a multiprocessor system as a set of communicating cluster nodes, that you can't do as a monolithic kernel? As far as I can see, it only serves to place restrictions on the ways you can communicate and interact between CPUs.

So why do proponents of this approach think they can just assert that a system of communicating nodes -- that *must* still be synchronized and share data at some points -- can scale as well as a monolithic kernel? I don't think that is a given at all.

The Google cluster is a great example of this. They don't really need a scalable OS at all: a simple kernel that efficiently manages I/O and gets out of the way of the one executing task on a particular node is all that's needed.

Well, they use Linux, and actually they are pushing some pretty complex functionality into the kernel (to do resource control, for example).

Everything else is coordinated from their aggregation servers and their distributed namespace/locking system.

And I don't know what its requirements are, but you can bet it's nothing like a regular UNIX filesystem / POSIX API.

Also, I don't see why you think Google is a great example of this. Google does not have a large MP machine. It has a big, non-cache-coherent cluster. So there is only one way to program it -- like a cluster.

Reply Parent Score: 2

RE[5]: Makes sense.
by PlatformAgnostic on Wed 19th Mar 2008 06:32 in reply to "RE[4]: Makes sense."


You have a point that the single-machine cluster approach is going to require a different method of programming than the typical monolithic OS image that you seem to favor. But I think that McVoy has a point about the locking cliff. NT has gained scalability over the releases by breaking up hot locks and moving toward per-processor data structures and lock-free algorithms. But if you look at a CPU count versus throughput graph, all of these improvements lower the throughput slightly on small hardware in order to gain advantages on large hardware.
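The per-processor-data trick mentioned above can be sketched in a few lines. This is a hypothetical illustration (the class and names are mine, not anything from NT): each thread increments a private slot, so the hot path takes no shared lock, and readers pay the aggregation cost instead.

```python
import threading

class PerThreadCounter:
    """Illustrative per-thread counter: cheap writes, slightly more
    expensive (and slightly stale) reads -- the trade the comment
    describes for per-processor data structures."""

    def __init__(self):
        self._local = threading.local()
        self._slots = []
        self._lock = threading.Lock()  # taken only on a thread's first increment

    def increment(self):
        slot = getattr(self._local, "slot", None)
        if slot is None:
            # First increment from this thread: register a private slot.
            slot = [0]
            self._local.slot = slot
            with self._lock:
                self._slots.append(slot)
        slot[0] += 1  # hot path: no shared lock, no contention

    def read(self):
        # Readers sum all slots; the total can lag in-flight increments.
        with self._lock:
            return sum(s[0] for s in self._slots)
```

The design choice is exactly the throughput-graph trade-off described: single-threaded reads get slower (summing slots instead of loading one word) in exchange for writes that scale with the number of processors.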

Right now that's the correct choice since Intel has been selling multi-core processors for a while now. But what applications need truly gigantic machines, like the 1024 node one that McVoy speaks of? The only ones I can think of right now are scientific, visual, and engineering applications where, for cache and NUMA efficiency's sake, it's more of an advantage to divide problems into smaller chunks.

Cache coherency scales poorly in this situation too, because it implies that every time a processor dirties a cache line, every other processor must get a chance to know about it (or you use a hierarchy of cache directories that manage the same bookkeeping, at the cost of additional latency on every operation).

Having a single system address space becomes problematic as well because every time you change a mapping from valid to invalid, every processor must be informed in case it has cached that mapping in its TLB.

The problem with single-image scaling, from my perspective, is that it's an n^2 proposition. When you scale from 1 to n processors, in a given period of time you have n times as many synchronizing events happening, and at each synchronization point you have n times as many processors to inform. The second factor of n is done in parallel, but that parallelism still translates into a factor-of-n latency due to distances and fan-out.
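The n^2 claim can be made concrete with a back-of-envelope model (this is my illustrative arithmetic, not a benchmark): if each of n processors generates synchronizing events at some fixed rate, and each event must be made visible to the other n - 1 processors, total coherence traffic grows quadratically.

```python
def coherence_messages(n_procs, events_per_proc):
    """Total invalidation/notification messages per interval under a
    naive broadcast scheme: n procs, each generating events that must
    reach the other n-1 procs.  Grows as O(n^2)."""
    return n_procs * events_per_proc * (n_procs - 1)

# Doubling the machine quadruples the traffic:
for n in (1, 2, 8, 64, 1024):
    print(n, coherence_messages(n, events_per_proc=1000))
```

A uniprocessor generates zero cross-processor traffic; at 1024 processors the same per-processor event rate produces over a billion messages per interval, which is the "locking cliff" intuition in numeric form.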

If you want to use a 1024 proc machine for something useful (as opposed to having it sitting around spinning on locks or waiting for cache misses), I think I've argued that you just have to bite the bullet and accept that you need to do batches of discontiguous work on small parts of your problem and aggregate the result.

Reply Parent Score: 2