Linked by Thom Holwerda on Fri 25th Sep 2009 23:12 UTC, submitted by Still Lynn
Most of us are probably aware of Singularity, a research operating system out of Microsoft Research which explored a number of new ideas and which is available as open source software. Singularity isn't the only research OS out of Microsoft; they recently released the first snapshot of a new operating system, called Barrelfish. It introduces the concept of the multikernel, which treats a multicore system as a network of independent cores, using ideas from distributed systems.
Thread beginning with comment 386361
Comment by kaiwai
by kaiwai on Sat 26th Sep 2009 01:55 UTC
kaiwai Member since:
2005-07-06

Neat ;) Hopefully that'll mean that in the future one can create a truly robust operating system where, if one kernel crashes on a massive multicore monster, the isolation will mean the rest keep running without a hitch. I do, however, wonder what the overhead is, how complex it would be to use it for a general purpose operating system, and whether such a concept could be retrofitted to an existing operating system.

Reply Score: 5

RE: Comment by kaiwai
by sbergman27 on Sat 26th Sep 2009 04:52 in reply to "Comment by kaiwai"
sbergman27 Member since:
2005-07-24

I do, however, wonder what the overhead is,

IIRC, "Message Passing" is Latin for "Slower than the January Molasses".

All that hardware bandwidth. All that potential for fast, low-latency IPC mechanisms. And it gets wasted, killed by latency, passing messages back and forth.

I always knew that the fantastically powerful computers of the future, running the software of the future, would perform significantly more poorly than what we have today. And this concept may just be a glimpse of how that future is to unfold.

Reply Parent Score: 5

RE[2]: Comment by kaiwai
by kad77 on Sat 26th Sep 2009 05:07 in reply to "RE: Comment by kaiwai"
kad77 Member since:
2007-03-20

You don't think the hardware will adapt?

I doubt it; it seems logical that new processors will be designed with pipelines that facilitate nanosecond IPC.

Microsoft just provided very costly R&D to the IT community free of charge, and is signaling to their partners that theoretical technology is now practical to some extent ....

... and in essence communicating that they should plan accordingly!

Reply Parent Score: 2

RE[2]: Comment by kaiwai
by tobyv on Sat 26th Sep 2009 05:10 in reply to "RE: Comment by kaiwai"
tobyv Member since:
2008-08-25

I do, however, wonder what the overhead is,
IIRC, "Message Passing" is Latin for "Slower than the January Molasses".


FYI, their paper does argue that message passing on a multicore architecture is significantly faster than shared memory access on the same machine.

But then, in section 3.2, they explain that they have made the "OS structure hardware-neutral".

So in other words: Let's use message passing since it is fast on our AMD development machine, but if it is too slow on the next gen hardware, we will switch to something else.

Not exactly solving the problem, IMHO.

Edited 2009-09-26 05:11 UTC

Reply Parent Score: 1

RE[2]: Comment by kaiwai
by Aussie_Bear on Sat 26th Sep 2009 07:32 in reply to "RE: Comment by kaiwai"
Aussie_Bear Member since:
2006-01-12

An Intel Engineer once said it best:
=> "What Intel Giveth, Microsoft Taketh Away"

Reply Parent Score: 1

RE[2]: Comment by kaiwai
by happe on Sun 27th Sep 2009 04:19 in reply to "RE: Comment by kaiwai"
happe Member since:
2009-06-09

"I do, however, wonder what the overhead is,

IIRC, "Message Passing" is Latin for "Slower than the January Molasses".

All that hardware bandwidth. All that potential for fast, low-latency IPC mechanisms. And it gets wasted, killed by latency, passing messages back and forth.

I always knew that the fantastically powerful computers of the future, running the software of the future, would perform significantly more poorly than what we have today. And this concept may just be a glimpse of how that future is to unfold.
"

All communication, basically, involves messages; it all depends on the sender and the receiver. Memory can be viewed as a service that handles read and write requests (messages).

In multi-core systems, inter-core communication must go through memory, except for atomic operation coordination, which obviously has to be core-to-core. This results in multiple messages going back and forth for a simple exchange of information:

1. Sender: write data to memory (write msg)
2. Sender: inform receiver of new data (read/write, core-to-core msgs).
3. Receiver: read data from memory (read msg)
4. Receiver: inform sender of reception (read/write, core-to-core msgs).

I have left out all the nasty synchronization details in #2 and #4, but it usually involves atomic updates of a memory address, which can cause core-to-core sync messages, depending on cache state. Also, cache coherency in general might cause lots of messages.
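
To make those four steps concrete, here is a minimal sketch in C (purely illustrative: two pthreads stand in for two cores, the variable names are made up, and none of this is Barrelfish code):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;            /* steps 1 and 3: the data goes through memory */
    static atomic_int data_ready;  /* step 2: sender -> receiver notification     */
    static atomic_int data_acked;  /* step 4: receiver -> sender notification     */

    static void *sender(void *arg)
    {
        (void)arg;
        payload = 42;                                                 /* 1. write data  */
        atomic_store_explicit(&data_ready, 1, memory_order_release);  /* 2. notify      */
        while (!atomic_load_explicit(&data_acked, memory_order_acquire))
            ;                                                         /* wait for 4.    */
        return NULL;
    }

    static void *receiver(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&data_ready, memory_order_acquire))
            ;                                                         /* wait for 2.    */
        printf("received %d\n", payload);                             /* 3. read data   */
        atomic_store_explicit(&data_acked, 1, memory_order_release);  /* 4. acknowledge */
        return NULL;
    }

    int main(void)
    {
        pthread_t s, r;
        pthread_create(&r, NULL, receiver, NULL);
        pthread_create(&s, NULL, sender, NULL);
        pthread_join(s, NULL);
        pthread_join(r, NULL);
        return 0;
    }

Every one of those loads and stores turns into coherence traffic shuttling the relevant cache lines between the two cores' caches.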

It is easy to imagine that this could be done faster and in fewer steps if low-level core-to-core communication were provided. All the hardware is already point-to-point.

My point is that it is not the message passing in µ-kernels that causes the overhead; in fact, it is the extra protection (a long story).

Also, shared memory as a programming platform doesn't scale if you code programs the obvious way; you have to know what's going on underneath. It's like cache optimization: you have to know the cache levels (Lx) and line sizes before you can do a good job. And the non-uniformity in NUMA doesn't make things any easier.
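
The classic example is false sharing: two counters that happen to sit on the same cache line will bounce that line between cores on every update, even though the threads never touch each other's data. A small sketch (the 64-byte line size and the attribute syntax are assumptions about typical x86 hardware and GCC/Clang, nothing from the paper):

    #include <pthread.h>
    #include <stdio.h>

    #define LINE_SIZE 64                       /* assumed cache line size */
    #define ITERS (100L * 1000 * 1000)

    struct padded_counter {
        volatile long value;
        char pad[LINE_SIZE - sizeof(long)];    /* keep each hot word on its own line */
    } __attribute__((aligned(LINE_SIZE)));

    static struct padded_counter counters[2];

    static void *worker(void *arg)
    {
        struct padded_counter *c = arg;        /* each thread owns one counter */
        for (long i = 0; i < ITERS; i++)
            c->value++;                        /* no line bouncing between cores */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }

Remove the padding and the very same program can run several times slower, purely because of the coherence traffic underneath.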

I think we have to make memory a high-level abstraction and give OS and middleware programmers more control over what is communicated where.

Reply Parent Score: 2

RE[2]: Comment by kaiwai
by Brendan on Mon 28th Sep 2009 06:44 in reply to "RE: Comment by kaiwai"
Brendan Member since:
2005-11-16

Hi,

"I do, however, wonder what the overhead is,

IIRC, "Message Passing" is Latin for "Slower than the January Molasses".

All that hardware bandwidth. All that potential for fast, low-latency IPC mechanisms. And it gets wasted, killed by latency, passing messages back and forth.
"

If you compare 16 separate single-core computers running 16 separate OSs communicating via networking, to 16 separate CPUs (in a single computer) running 16 separate OSs communicating via IPC, then I think you'll find that IPC is extremely fast compared to any form of networking.

If you compare 16 CPUs (in a single computer) running 16 separate OSs using IPC, to 16 CPUs (in a single computer) running one OS, then will the overhead of IPC be more or less than the overhead of mutexes, semaphores, "cache-line ping-pong", scheduler efficiency, and other scalability problems? In this case, my guess is that IPC has less overhead (especially when there are lots of CPUs) and is easier to get right (e.g. without subtle race conditions, etc.); but the approach itself is going to have some major new scalability problems of its own (e.g. writing distributed applications that are capable of keeping all those OSs/CPUs busy will be a challenge).
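
As a toy illustration of that trade-off (assumed structure and numbers, nothing from the paper): the "one OS" style funnels every update through a shared lock, while the "one OS per core" style keeps state private and merges the results once at the end, which is effectively a single message instead of constant contention.

    #include <pthread.h>
    #include <stdio.h>

    #define NCORES 4
    #define ITERS  10000000L

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_total;              /* "single OS" style: one shared counter */
    static long per_core_total[NCORES];    /* "OS per core" style: private state    */

    static void *shared_worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);     /* every core contends for the same lock
                                              and ping-pongs the same cache line   */
            shared_total++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *percore_worker(void *arg)
    {
        long *slot = arg;                  /* private to this "core"                */
        for (long i = 0; i < ITERS; i++)
            (*slot)++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCORES];

        for (int i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, percore_worker, &per_core_total[i]);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);

        long total = 0;
        for (int i = 0; i < NCORES; i++)   /* the one "message": merge the replies  */
            total += per_core_total[i];
        printf("share-nothing total: %ld\n", total);

        for (int i = 0; i < NCORES; i++)
            pthread_create(&t[i], NULL, shared_worker, NULL);
        for (int i = 0; i < NCORES; i++)
            pthread_join(t[i], NULL);
        printf("shared-lock total:   %ld\n", shared_total);
        return 0;
    }

The share-nothing half is embarrassingly parallel; the shared-lock half serialises every core behind one mutex, which is exactly the scalability wall the multikernel design is trying to avoid.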

-Brendan

Reply Parent Score: 2

RE: Comment by kaiwai
by kjmph on Sat 26th Sep 2009 22:42 in reply to "Comment by kaiwai"
kjmph Member since:
2009-07-17

Yes, good points. I would presume that this model would only take down applications that were currently running on the failed core. However, you would have to deal with messages in flight to the running core, so there would be unknown state to clean up. I bet you could easily cycle/reset the core into a known state. So, greater up-time in the long run.

As far as overhead is concerned, they say that native IPC was 420 cycles while the comparable message passing implementation cost 757 cycles; that's 151ns vs 270ns on the 2.8GHz chips they were testing on. However, by breaking with the current synchronous approach and using a user-level RPC mechanism, they dropped the message passing cost to 450 cycles on die and 532 cycles one hop away, with two hops costing only tens of cycles more, which is really starting to become negligible. So it does cost, but where they excelled was multi-core shared memory updates.

But, to get back to your comments, that really is not general purpose computing as of today, as most applications on my Linux box are single threaded. Of the few apps that aren't single threaded, such as ffmpeg and Id's Doom3 engine, they most likely aren't synchronizing shared memory updates; rather, I think they isolate memory access to certain threads and pass commands around via a dispatcher thread. So this is a pretty specific type of application that excels on Barrelfish. I think they are targeting Google's MapReduce and Microsoft's Dryad.

Finally, it's important to note that the hardware is moving to a message-passing style of architecture as well: AMD has HyperTransport and Intel now has the QuickPath Interconnect. In Barrelfish, the implementation of message passing on AMD CPUs is based on cache lines being moved over HT; in other words, hardware-accelerated message passing. They isolated the transport mechanism from the message passing API, so I believe they could swap in different accelerated transport implementations depending on the architecture it's currently running on.
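
For what it's worth, that split is roughly the following shape (an illustrative sketch only, not Barrelfish's actual interfaces): higher-level code talks to an abstract channel, and the concrete transport behind it (cache lines over HT or QPI today, something else tomorrow) is selected per machine.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    struct channel;

    struct transport_ops {                    /* what every transport must provide */
        int (*send)(struct channel *c, const void *msg, size_t len);
        int (*recv)(struct channel *c, void *buf, size_t len);
    };

    struct channel {
        const struct transport_ops *ops;      /* chosen once per machine/core pair */
        unsigned char buf[64];                /* toy backing store                 */
        size_t used;
    };

    /* A trivial in-memory "transport" standing in for the real thing. */
    static int mem_send(struct channel *c, const void *msg, size_t len)
    {
        if (len > sizeof(c->buf)) return -1;
        memcpy(c->buf, msg, len);
        c->used = len;
        return 0;
    }

    static int mem_recv(struct channel *c, void *buf, size_t len)
    {
        if (len < c->used) return -1;
        memcpy(buf, c->buf, c->used);
        return (int)c->used;
    }

    static const struct transport_ops mem_ops = { mem_send, mem_recv };

    int main(void)
    {
        struct channel ch = { .ops = &mem_ops };
        char reply[64];

        /* Callers only ever go through ch.ops, so a faster transport on
           next-generation hardware is a new ops table, not a rewrite.   */
        ch.ops->send(&ch, "ping", 5);
        int n = ch.ops->recv(&ch, reply, sizeof(reply));
        printf("%d bytes: %s\n", n, reply);
        return 0;
    }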

Reply Parent Score: 2

RE[2]: Comment by kaiwai
by kaiwai on Sun 27th Sep 2009 04:45 in reply to "RE: Comment by kaiwai"
kaiwai Member since:
2005-07-06

Yes, good points. I would presume that this model would only take down applications that were currently running on the failed core. However, you would have to deal with messages in flight to the running core, so there would be unknown state to clean up. I bet you could easily cycle/reset the core into a known state. So, greater up-time in the long run.


I'd assume that with multiple kernels there would be a thin virtualisation layer sitting on top which makes the 'cluster' of kernels appear as a single image, with the same sort of result as when a single machine in a cluster goes offline: at most a momentary stall while the processing is either retried or a failure notice is issued to the end user, the preferable scenario being the former rather than the latter.

As far as overhead is concerned, they say that native IPC was 420 cycles while the comparable message passing implementation cost 757 cycles; that's 151ns vs 270ns on the 2.8GHz chips they were testing on. However, by breaking with the current synchronous approach and using a user-level RPC mechanism, they dropped the message passing cost to 450 cycles on die and 532 cycles one hop away, with two hops costing only tens of cycles more, which is really starting to become negligible. So it does cost, but where they excelled was multi-core shared memory updates.

But, to get back to your comments, that really is not general purpose computing as of today, as most applications on my Linux box are single threaded. Of the few apps that aren't single threaded, such as ffmpeg and Id's Doom3 engine, they most likely aren't synchronizing shared memory updates; rather, I think they isolate memory access to certain threads and pass commands around via a dispatcher thread. So this is a pretty specific type of application that excels on Barrelfish. I think they are targeting Google's MapReduce and Microsoft's Dryad.

Finally, it's important to note that the hardware is moving to a message-passing style of architecture as well: AMD has HyperTransport and Intel now has the QuickPath Interconnect. In Barrelfish, the implementation of message passing on AMD CPUs is based on cache lines being moved over HT; in other words, hardware-accelerated message passing. They isolated the transport mechanism from the message passing API, so I believe they could swap in different accelerated transport implementations depending on the architecture it's currently running on.


Well, it is the same argument I remember having about microkernels and their apparent slowness. Although everyone desires the absolute maximum performance, I would sooner sacrifice some speed and accept slightly more overhead if the net result is a more stable and secure operating system. Internet Explorer 8, for example, has higher overhead because of process isolation and separation, but that is a very small price to pay if the net result is a more stable and secure piece of software.

It therefore concerns me when I see on osnews.com the number of people who decry an increase in system requirements off the back of improved security and stability (with a slight performance penalty). Hopefully, if such an idea were to get off the ground, there wouldn't be a similar backlash, because the last thing I want to see is yet another technology that arrives half-baked simply to keep the ricers happy that their system is 'teh max speed'. They did it with Windows when they moved the whole graphics layer into the kernel; I hope the same compromise isn't made when it comes to delivering this idea to the real world.

Reply Parent Score: 2