The concern for efficient and easy-to-use inter-process communication is prevalent among microkernel-based operating systems. Genode has always taken an unorthodox stance on this subject by disregarding the time-tested standard solution of using an IDL compiler in favour of sticking to raw C++ mechanisms. The new version 11.05 of the OS framework takes another leap by introducing a brand new API for implementing procedure calls across process boundaries, facilitating type safety and ease of use, yet still not relying on external tools. Furthermore, the platform support for the Fiasco.OC kernel has been extended to the complete feature set of the framework. The most significant new features are L4Linux (on Fiasco.OC), an experimental integration of GDB, ARM RealView PBX device drivers, and device I/O support for the MicroBlaze platform.
Inter-process communication (IPC) is one of the most challenging aspects of creating a multi-server operating system. On the one hand, the design space for IPC kernel mechanisms is huge, and there are never-ending debates in the OS community about which approach is best. For example, the Mach kernel buffered messages in the kernel. In contrast, L4 regarded synchronous IPC as the key to achieving high performance. Others follow the line of Nemesis and K42, preferring a combination of signalling and shared memory. On the other hand, most kernel architects lack a holistic view of the usage patterns demanded by user-level components.
Genode tries to bridge the gap between the variety of kernel interfaces and real-world user-level software by providing an API that should be implementable on a wide range of different kernels, yet cover all use cases required to build a fully-featured operating system. When the Genode project was started in 2006, it was clear that user-level software demands three different types of communication: synchronous remote procedure calls (RPC), shared memory, and signalling. However, the actual patterns of how these mechanisms are used when the system scales in complexity were not fully understood. Now, five years later, the picture has clarified, prompting the project to reconsider the APIs introduced initially. The result is a brand new RPC framework crafted for type safety and ease of use while sticking to plain C++. For those interested in the subject, the complete line of thought is described here.
Besides the work on the foundations of the framework, Genode’s new platforms introduced with the previous release received a great deal of attention. The Fiasco.OC microkernel is now fully supported for the x86-32, x86-64, and ARM architectures. Thereby, Genode’s complete software stack, including the dynamic linker, lwIP, and Qt4, has become available on this kernel. Furthermore, the prime feature of Fiasco.OC, namely the paravirtualized L4Linux kernel, has been ported over to Genode. L4Linux is a variant of the Linux kernel specifically developed to run as a user-level component on top of Fiasco.OC.
Genode’s custom kernel for the MicroBlaze architecture has been complemented with the functionality needed for implementing user-level device drivers. For those who are eager to explore Genode on Xilinx FPGAs, the new version provides a ready-to-use reference SoC for the Spartan3A Starter Kit platform, accompanied by new documentation.
As the Genode developer community grows, the need for a solid debugging solution arises. The new release features the first step towards the use of the GNU debugger with the framework. In addition to the progress on the actual framework, the current release provides new ways to make Genode more approachable to new developers by means of ready-to-use scripts for building, configuring, and test-driving a number of Genode features including Qt4, lwIP, GDB, and L4Linux.
Read on about those and more topics in the full version of the release notes of Genode 11.05.
Very nice Genode team.
Indeed. This project is very interesting, and keeps moving forward.
I really need to set aside some time to play with it.
I don’t see why messages have to go through the kernel. To me, the best approach for inter-process communication on the same machine is to have two processes share memory: when a process A wants to send a message to another process B, process A simply allocates a buffer from the shared memory and then informs process B about the message via a semaphore. Process B then reads the message, copies it into private memory, and checks it.
In this way, there is no need for context switching; the kernel need not be invoked at all.
(1) A single memory shared by everything is a bottleneck in multiprocessor systems. Caches don’t solve this problem, they only hide it behind the cache coherency protocol.
(2) “Going through the kernel” is only slow if you make it slow.
Sharing always has bottlenecks (fundamentally, from the speed of light). Sharing memory with caching-aware semantics is the fastest communication a standard processor can have; even pure message passing like the basic QNX primitives still uses the same shared-memory mechanism.
Like on x86? Depending on the processor and the kernel/user design, a pure enter/exit of kernel mode can take some thousands of clock cycles (including stalls due to cache/TLB evictions). Add the overhead of the operation itself. (I am aware that pure null operations are considerably faster; however, real code has real overheads.)
This means that user-level communications with shared memory can in many cases do spin-locks with lower overheads than using any kernel primitives. Spin+fallback to kernel synchronization is very effective.
You can always have shared memory per process couple (sender-receiver).
In modern 80×86 CPUs, it’s very slow. It’s also always slower than if you don’t go through the kernel.
Isn’t the semaphore also a kernel-provided mechanism, which forces both processes to synchronize via kernel entries? Also, shared-memory communication does not come for free: one has to establish the shared memory with each communication partner, allocate buffers in the shared memory, maybe allocate control packets, acknowledge completed messages for buffer reuse, and maybe unblock a sender that wants to marshal the next message. From our experience with Genode, this pays off for bulk-data transfer but not for most RPCs with just a few register words of payload.
Modern microkernels like Fiasco.OC support fast inter-process communication through a user-level accessible part of the thread control block – the UTCB – with a size of roughly 64 register words. Processes marshal their message payload into the UTCB, the kernel copies from the sender’s to the receiver’s UTCB, and, finally, the receiver demarshals the data out of the UTCB as needed. Performance-wise, this should match the approach you described: the time for the copy operation is bounded by the UTCB size, and there is no shared-memory establishment overhead, as only the kernel accesses both UTCBs.
It doesn’t have to be a kernel object.
Not a problem. The virtual memory subsystem can take care of that.
Can be done via mutexes (not kernel objects) in shared memory.
The buffers are simply freed from the shared memory.
Atomically increment the semaphore. If the other CPU spins on the semaphore (i.e. it’s a spin lock), then the other process will be unblocked.
The 80×86 CPU doesn’t have 64 registers available to user code. Furthermore, you still do two kernel switches. I think the shared-memory approach is faster, at least on 80×86.
This is done of course whenever two tasks have to communicate frequently.
But first you need to negotiate this shared memory and enforce access control between the tasks. So IPC via the kernel is always the first step. And if you don’t expect to communicate often, establishing shared memory is only overhead.
IPC performance is indeed a major performance criterion for microkernel systems, but modern kernels can do this quite fast.