Linked by Hadrien Grasland on Sun 29th May 2011 09:42 UTC
It's funny how trying to have a consistent system design makes you constantly jump from one area of the designed OS to another. I initially just tried to implement interrupt handling, and now I'm cleaning up the design of an RPC-based daemon model, which will be used to implement interrupt handlers, along with most other system services. Anyway, now that I've got to something I'm personally satisfied with, I wanted to ask everyone who's interested to check that design and tell me if anything in it sounds like a bad idea to them in the short or long run. That's because this is a core part of this OS' design, and I'm really not interested in core design mistakes emerging in a few years if I can fix them now. Many thanks in advance.
RE[7]: RPC considered harmful
by Brendan on Thu 2nd Jun 2011 10:02 UTC in reply to "RE[6]: RPC considered harmful"

PART I

Hi,

"While I can see some similarities between this and asynchronous messaging, there's also significant differences; including the overhead of creating (and eventually destroying) threads, which (in my experience) is the third most expensive operation microkernels do (after creating and destroying processes).

Ah, Brendan, Brendan, how do you always manage to be so kind and helpful with people who play with OSdeving? Do you teach it in real life or something?

Anyway, have you pushed your investigation far enough to know which step of the thread creation process is expensive? Maybe it's something whose impact can be reduced...
"

Thread creation overhead depends on a lot of things; like where the user-space stack is (and if it's pre-allocated by the caller), how kernel stacks are managed (one kernel stack per thread?), how CPU affinity and CPU load balancing works, how much state is saved/restored on thread switches and must be initialised to default values during thread creation (general registers, FPU/MMX/SSE, debug registers, performance monitoring registers, etc), how thread local storage is managed, etc.

For an extremely simple OS (single-CPU only, no support for FPU/MMX/SSE, no "per thread" debugging, no "per thread" performance monitoring, no thread local storage, no "this thread has used n cycles" time accountancy) that uses one kernel stack (e.g. an unpreemptable kernel); if the initial state of a thread's registers is "undefined", and the thread's stack is pre-allocated; then it could be very fast. Not sure anyone would want an OS like that though (maybe for embedded systems).
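For illustration, the whole creation path in that degenerate case could be little more than this (a hypothetical sketch with invented names, not code from any real kernel):

    #include <stdint.h>

    #define MAX_THREADS 64

    /* Hypothetical minimal thread control block: no FPU state, no TLS,
     * no accounting - just enough to resume the thread later. */
    struct thread {
        uintptr_t sp;           /* pre-allocated stack, top address */
        void    (*entry)(void); /* entry point                      */
        int       used;
    };

    static struct thread threads[MAX_THREADS];

    /* With registers left "undefined" and the stack supplied by the
     * caller, creation is just finding a free slot and a few stores. */
    int thread_create(void (*entry)(void), void *stack_top)
    {
        for (int i = 0; i < MAX_THREADS; i++) {
            if (!threads[i].used) {
                threads[i].sp    = (uintptr_t)stack_top;
                threads[i].entry = entry;
                threads[i].used  = 1;
                return i;       /* thread id */
            }
        }
        return -1;              /* no free slot */
    }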

Also, if other operations that a kernel does are extremely slow then thread creation could be "faster than extremely slow" in comparison.

There's something else here too though. For most OSs, typically only a thread within a process can create a thread for that process; which means that at the start of thread creation the CPU/kernel is using the correct process' address space, so it's easier to set up the new thread's stack and thread local storage. For your IPC this isn't the case (the sending process' address space would be in use at the time you begin creating a thread for the receiving process), so you might need to switch address spaces during thread creation (and blow away TLB entries, etc) if you don't do it in a "lazy" way (postpone parts of thread creation until the thread first starts running).
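A rough sketch of that "lazy" split (hypothetical kernel-internal names; the point is simply that everything touching the receiver's address space is deferred until the scheduler has switched to it anyway):

    #include <stdbool.h>

    struct process;                     /* receiver's address space, etc. */

    struct thread {
        struct process *proc;           /* owning process                 */
        void (*entry)(void *);          /* where the thread will start    */
        void  *ipc_data;                /* payload that triggered it      */
        bool   stack_and_tls_ready;
    };

    /* Runs in the *sender's* context: touches nothing that lives in
     * the receiver's address space, so no TLB flush is needed here. */
    void thread_create_lazy(struct thread *t, struct process *receiver,
                            void (*entry)(void *), void *ipc_data)
    {
        t->proc                = receiver;
        t->entry               = entry;
        t->ipc_data            = ipc_data;
        t->stack_and_tls_ready = false;
    }

    /* Runs when the scheduler first picks the thread - the receiver's
     * address space is active by then, so the expensive setup costs no
     * extra address space switch. */
    void thread_first_run(struct thread *t)
    {
        if (!t->stack_and_tls_ready) {
            /* ... map/initialise user stack and TLS in t->proc ... */
            t->stack_and_tls_ready = true;
        }
        t->entry(t->ipc_data);
    }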


"On top of that, (because you can't rely on the queues to serialise access to data structures) programmers would have to rely on something else for reentrancy control; like traditional locking, which is error-prone (lots of programmers find it "hard" and/or screw it up) and adds extra overhead (e.g. mutexes with implied task switches when under lock contention).

This has been pointed out by Alfman, and solved by introducing an asynchronous operating mode where pending threads are queued and run one after the other. Sorry for not mentioning it in the post where I try to describe my model; by the time I noticed the omission it was already too late to edit.
"

Hehe. Let's optimise the implementation of this!

You could speed it up by having a per-process "thread cache". Rather than actually destroying a thread, you could pretend to destroy it and put it into a "disused thread pool" instead, and then recycle these existing/disused threads when a new thread needs to be created. To maximise the efficiency of your "disused thread pool" (so you don't have more "disused threads" than necessary), you could create (or pretend to create) the new thread when IPC is being delivered to the receiver and not when IPC is actually sent. To do that you'd need a queue of "pending IPC". That way, for asynchronous operating mode you'd only have a maximum of one thread (per process), where you pretend to destroy it, then recycle it to create a "new" thread, and get the data needed for the "new" thread from the queue of "pending IPC".
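In rough C, the recycling scheme could look like this (invented names, sketch only):

    #include <stddef.h>

    struct ipc_msg { struct ipc_msg *next; /* payload... */ };
    struct thread  { struct thread  *next; /* saved state... */ };

    /* Per-process bookkeeping for the scheme described above. */
    struct process {
        struct thread  *pool;          /* "pretend destroyed" threads    */
        struct ipc_msg *pending;       /* FIFO of undelivered IPC (head) */
        struct ipc_msg *pending_tail;
    };

    /* "Destroying" a thread just parks it in the pool. */
    void thread_recycle(struct process *p, struct thread *t)
    {
        t->next = p->pool;
        p->pool = t;
    }

    /* "Creating" a thread for the next pending message reuses a parked
     * thread when possible; only if the pool is empty does a real
     * (expensive) creation happen. */
    struct thread *thread_for_next_ipc(struct process *p,
                                       struct thread *(*real_create)(void))
    {
        if (p->pending == NULL)
            return NULL;                   /* nothing to deliver */
        struct ipc_msg *m = p->pending;
        p->pending = m->next;
        if (p->pending == NULL)
            p->pending_tail = NULL;

        struct thread *t = p->pool;
        if (t != NULL)
            p->pool = t->next;             /* fast path: recycle */
        else
            t = real_create();             /* slow path          */
        /* ... reset t's registers/stack and hand it message m ... */
        (void)m;
        return t;
    }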

Now that it's been optimised, it looks very similar to my "FIFO queue of messages". Instead of calling "getmessage()" and blocking until a message is received, you'd be calling "terminate_thread()" and being put into a "disused thread pool" until IPC is received. The only main difference (other than terminology) is that you'd still be implicitly creating one initial thread.

[Continued in Part II - silly 8000 character limit...]


RE[8]: RPC considered harmful
by Brendan on Thu 2nd Jun 2011 10:03 in reply to "RE[7]: RPC considered harmful"

PART II

"I also wouldn't underestimate the effect that IPC overhead will have on the system as a whole (especially for "micro-kernel-like" kernels).

I know, I know, but then we reach one of those chicken-and-egg problems which are always torturing me: how do I know that my IPC design is "light enough" without doing measurements on a working system for real-world use cases? And how do I perform these measurements on something which I'm currently designing and which is not implemented yet?
"

You can estimate; but how fast would be "too fast"?

In my opinion, there's no such thing as too fast. It's mostly a question of whether any extra overheads are worth any extra benefits.


For example, if IRQs are delivered to device drivers via IPC, then on a server under load (with high-speed Ethernet, for example) you can expect thousands of IRQs per second (and expect to be creating and destroying thousands of threads per second). Once you add normal processes communicating with each other, this could easily go up to "millions per second" under load. If IPC costs twice as much as it does on other OSs, then the resulting system as a whole can be 50% slower than comparable systems (e.g. other micro-kernels) because of the IPC alone.

First objection which spontaneously comes to my mind is that this OS is not designed to run on servers, but rather on desktop and smaller single-user computers.

Maybe desktop use cases also include the need to endure thousands of IRQs per second, but I was under the impression that desktop computers are ridiculously powerful compared to what one asks of their OSs, and that their responsiveness issues rather come from things like poor task scheduling ("running the divx encoding process more often than the window manager") or excessive dependency on disk I/O.
"

The difference between a server OS and a desktop OS is mostly artificial product differentiation made in higher levels of the OS (e.g. if it comes with HTTP/FTP servers and no GUI, or if it comes with a GUI and no HTTP/FTP servers; licensing, advertising, availability of support contracts, etc). It makes no difference at the lowest levels; until/unless you start looking at fault tolerance features (redundancy, hot-plug RAM/CPUs, etc).


"The alternative would be if the kernel has inbuilt support for multiple different forms of IPC; which can lead to a "Tower of Babel" situation where it's awkward for different processes (using different types of IPC) to communicate with each other.

Actually, I tend to lean towards this solution, even though I know of the Babel risk and have regularly thought about it, because each IPC mechanism has its strengths and weaknesses. As an example, piping and messaging systems are better when processing a stream of data, while remote calls are better suited to giving a process some tasks to do.
"

Well, not quite (I'm not sure you fully understand the differences between messaging and pipes).

Pipes would work well for streams of bytes, but messaging wouldn't be ideal (there'd be extra/unnecessary overhead involved with breaking a stream into smaller pieces). Most things aren't streams of bytes though - they're streams of "some sort of data structure". Maybe a stream of "video frames", a stream of "keypresses", a stream of "commands/instructions", a stream of "compressed blocks of audio", etc. In these cases there's natural boundaries between the "some sort of data structures" - messages would be ideal (one message per "some sort of data structure") and pipes wouldn't be ideal (there'd be extra/unnecessary overhead involved with finding the natural boundaries).

Also, for messaging each message typically has a "message type" associated with it. This means that the same receiver can handle many different things at the same time (e.g. it could be receiving a stream of "video frames", a stream of "keypresses" and a stream of "compressed blocks of audio" at the same time, and be able to distinguish between them using the message types). Pipes don't work like that - you'd need a different pipe for each stream. This means that rather than waiting to receive messages of any type, you end up needing something special like "select()" to monitor many different pipes.
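To make the contrast concrete, here's what a hypothetical receive loop could look like (get_message() and the message layout are assumptions for illustration, not any OS's actual API):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical message layout: the kernel delivers whole messages,
     * one per "natural unit", and the type field lets a single receive
     * loop demultiplex several logical streams - no select() needed. */
    enum msg_type { MSG_VIDEO_FRAME, MSG_KEYPRESS, MSG_AUDIO_BLOCK };

    struct message {
        enum msg_type type;
        size_t        length;     /* payload size in bytes            */
        uint8_t       payload[];  /* exactly one frame/keypress/block */
    };

    /* Assumed primitive: block until any message arrives. */
    extern struct message *get_message(void);

    void receive_loop(void)
    {
        for (;;) {
            struct message *m = get_message();
            switch (m->type) {
            case MSG_VIDEO_FRAME: /* ... handle a whole frame ... */ break;
            case MSG_KEYPRESS:    /* ... handle one keypress ...  */ break;
            case MSG_AUDIO_BLOCK: /* ... handle one block ...     */ break;
            }
        }
    }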

- Brendan


Neolander

"You can estimate; but how fast would be "too fast"?

In my opinion, there's no such thing as too fast. It's mostly a question of whether any extra overheads are worth any extra benefits."

For me, "too fast" would be when gaining extra speed means making another desirable characteristic of the OS exceedingly poor. Speed has its trade-offs, and to settle those trade-offs it's good to have goals. Well, I think we agree there anyway.

"The difference between a server OS and a desktop OS is mostly artificial product differentiation made in higher levels of the OS (e.g. if it comes with HTTP/FTP servers and no GUI, or if it comes with a GUI and no HTTP/FTP servers; licensing, advertising, availability of support contracts, etc). It makes no difference at the lowest levels; until/unless you start looking at fault tolerance features (redundancy, hot-plug RAM/CPUs, etc)."

That's the way it's done today, but it hasn't always been done like that. Classic Windows and Mac OS, as an example, were designed for desktop use, and would have been terrible as server OSs for a number of reasons.

With TOSP, I design solely for "desktop" (more precisely, personal) computers, because I believe that reducing the number of target use cases will in turn simplify the design in some areas and reduce the number of trade-offs, resulting in a design that's lighter and better suited to the job. It's the classic "generic vs specialized" debate, really.

"Well, not quite (I'm not sure you fully understand the differences between messaging and pipes)."

For me, the difference is in the atomic transmitted unit that is processed by the recipient.

In a pipe, that atomic unit is a fixed-size chunk of binary data, typically a byte in the UNIX world. In a message, the atomic unit is a variable-sized message, which is not actually processed by the recipient until the message's terminator has been received.

Am I right?

"Pipes would work well for streams of bytes, but messaging wouldn't be ideal (there'd be extra/unnecessary overhead involved with breaking a stream into smaller pieces). Most things aren't streams of bytes though - they're streams of "some sort of data structure". Maybe a stream of "video frames", a stream of "keypresses", a stream of "commands/instructions", a stream of "compressed blocks of audio", etc. In these cases there's natural boundaries between the "some sort of data structures" - messages would be ideal (one message per "some sort of data structure") and pipes wouldn't be ideal (there'd be extra/unnecessary overhead involved with finding the natural boundaries)."

But what about a kind of pipe which would take something larger than a byte as its basic transmission unit?

Like, you know, if a keypress is defined by a 32-bit scancode, a 32-bit scancode pipe?

You could avoid the overhead of a messaging protocol for fixed-size structures this way.
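A minimal sketch of such a "record pipe", say (all names invented, blocking omitted):

    #include <stddef.h>
    #include <stdint.h>

    #define PIPE_CAPACITY 256

    /* A pipe whose atomic unit is a whole 32-bit scancode instead of a
     * byte: readers can never observe a torn record, and no framing
     * protocol is needed on top. */
    struct scancode_pipe {
        uint32_t buf[PIPE_CAPACITY];
        size_t   head, tail;       /* ring buffer indices */
    };

    int pipe_write(struct scancode_pipe *p, uint32_t scancode)
    {
        size_t next = (p->head + 1) % PIPE_CAPACITY;
        if (next == p->tail)
            return -1;             /* full; a real pipe would block */
        p->buf[p->head] = scancode;
        p->head = next;
        return 0;
    }

    int pipe_read(struct scancode_pipe *p, uint32_t *out)
    {
        if (p->tail == p->head)
            return -1;             /* empty; a real pipe would block */
        *out = p->buf[p->tail];
        p->tail = (p->tail + 1) % PIPE_CAPACITY;
        return 0;
    }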

"Also, for messaging each message typically has a "message type" associated with it. This means that the same receiver can handle many different things at the same time (e.g. it could be receiving a stream of "video frames", a stream of "keypresses" and a stream of "compressed blocks of audio" at the same time, and be able to distinguish between them using the message types). Pipes don't work like that - you'd need a different pipe for each stream. This means that rather than waiting to receive messages of any type, you end up needing something special like "select()" to monitor many different pipes."

What if a program could monitor several streams at the same time by having a different callback triggered when a message comes in on each of the pipes?
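Something like this, say (pipe_on_data() is an invented primitive, not an existing call):

    #include <stddef.h>

    /* Hypothetical per-pipe callback registration: instead of blocking
     * in something like select(), the process asks the OS to run a
     * handler whenever data arrives on a given pipe. */
    typedef void (*pipe_handler)(int pipe_fd, void *user_data);

    extern int pipe_on_data(int pipe_fd, pipe_handler fn, void *user_data);

    extern int keyboard_fd, audio_fd;   /* two already-open pipes */

    static void on_keypress(int fd, void *unused) { (void)fd; (void)unused; /* ... */ }
    static void on_audio(int fd, void *unused)    { (void)fd; (void)unused; /* ... */ }

    void watch_streams(void)
    {
        pipe_on_data(keyboard_fd, on_keypress, NULL);  /* one callback... */
        pipe_on_data(audio_fd,    on_audio,    NULL);  /* ...per stream   */
    }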

(Again, if the OSnews comment system has decided to archive this discussion by the time you answer, feel free to continue replying on my blog.)



RE[9]: RPC considered harmful
by Alfman on Thu 2nd Jun 2011 18:44 in reply to "RE[8]: RPC considered harmful"

Brendan,

"You could speed it up by having a per-process "thread cache". Rather than actually destroying a thread, you could pretend to destroy it and put it into a "disused thread pool" instead, and then recycle these existing/disused threads when a new thread needs to be created."


Yes, many MT devs use this design; I use it in my async package because dynamic thread creation via pthreads is so costly. But I do wonder whether it is inherently slow, or whether pthreads/Linux are just inefficient at it.

In theory, one could just allocate a stack, a register swap area, TLS and an accounting structure (possibly in one call to malloc). Then we add this structure to a linked list of threads and fire off the thread function.
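Something along these lines, perhaps (layout and sizes purely illustrative):

    #include <stdint.h>
    #include <stdlib.h>

    #define STACK_SIZE (64 * 1024)
    #define TLS_SIZE   1024

    /* Everything a thread needs, carved out of a single malloc():
     * list link, register swap area, accounting, TLS and stack. */
    struct thread {
        struct thread *next;            /* per-CPU thread list link */
        uintptr_t      saved_regs[16];  /* register swap area       */
        uint64_t       cpu_time;        /* accounting               */
        uint8_t        tls[TLS_SIZE];
        uint8_t        stack[STACK_SIZE];
    };

    /* One allocation, one list insertion - no cross-CPU coordination
     * because cpu_local_list belongs to this CPU only. */
    struct thread *thread_alloc(struct thread **cpu_local_list,
                                void (*thread_func)(void *), void *arg)
    {
        struct thread *t = malloc(sizeof *t);
        if (t == NULL)
            return NULL;
        t->cpu_time = 0;
        t->next = *cpu_local_list;
        *cpu_local_list = t;
        /* ... point saved_regs at thread_func/arg and the new stack,
         *     then let the scheduler fire it off ... */
        (void)thread_func; (void)arg;
        return t;
    }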

It wouldn't even be necessary to coordinate the above with other CPUs if each CPU had its own process thread "list".

It wouldn't even be necessary to coordinate with any syscalls (if userspace could be trusted with managing the thread list within its own process; as long as the process only endangers itself, this might not be an issue).

The entire thread lifecycle could take place on a single CPU with no inter-CPU synchronization at all.

Now obviously, there would need to be a separate mechanism to migrate threads between CPUs. But this migration process might be able to take the synchronization hits once per cycle, instead of once per thread.

During an interrupt (pre-emptive threads), an explicit yield, or a blocking OS call, the OS would swap the CPU state and start another thread in the queue.

That summarizes the lightest thread implementation I can conceive of, and if it worked that way, I would think the performance difference between a thread pool and ordinary thread creation might be a wash (particularly with an efficient or pooled malloc).



"Most things aren't streams of bytes though - they're streams of 'some sort of data structure'."

The lack of message boundaries has always been a major design flaw in my opinion.

There's no reason (other than legacy design) that pipes shouldn't allow us to send discrete packets. It should be possible for high level application code to specify boundaries (even if very large packets still span multiple read requests).

This deficiency has led to the same design patterns being reimplemented over and over again inside network applications needing to reassemble messages from the stream.
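For example, the usual length-prefix workaround over a plain POSIX stream looks something like this (assuming both ends share a byte order; error handling trimmed):

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    /* The framing dance that stream-based network code keeps
     * re-implementing: prefix each message with its length, then on
     * the receiving side loop until a whole message has arrived. */
    int send_framed(int fd, const void *msg, uint32_t len)
    {
        if (write(fd, &len, sizeof len) != (ssize_t)sizeof len)
            return -1;
        return write(fd, msg, len) == (ssize_t)len ? 0 : -1;
    }

    ssize_t recv_framed(int fd, void *buf, size_t buf_size)
    {
        uint32_t len;
        size_t   got = 0;
        while (got < sizeof len) {          /* reassemble the header */
            ssize_t n = read(fd, (char *)&len + got, sizeof len - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        if (len > buf_size)
            return -1;                      /* message too large */
        got = 0;
        while (got < len) {                 /* reassemble the body */
            ssize_t n = read(fd, (char *)buf + got, len - got);
            if (n <= 0) return -1;
            got += (size_t)n;
        }
        return (ssize_t)len;
    }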

"Also, for messaging each message typically has a "message type" associated with it. This means that the same receiver can handle many different things at the same time"

Yes, but if the OS has explicit support for RPC, wouldn't the need for discrete messages and message types be largely eliminated?


Neolander

"There's something else here too though. For most OSs, typically only a thread within a process can create a thread for that process; which means that at the start of thread creation the CPU/kernel is using the correct process' address space, so it's easier to set up the new thread's stack and thread local storage. For your IPC this isn't the case (the sending process' address space would be in use at the time you begin creating a thread for the receiving process), so you might need to switch address spaces during thread creation (and blow away TLB entries, etc) if you don't do it in a "lazy" way (postpone parts of thread creation until the thread first starts running)."

Not necessarily, it depends on where the IPC primitives are managed. If RPC is done through system calls, then you can create the thread while you're in kernel mode and have no extra address-space switching overhead.

"Hehe. Let's optimise the implementation of this!

You could speed it up by having a per-process "thread cache". Rather than actually destroying a thread, you could pretend to destroy it and put it into a "disused thread pool" instead, and then recycle these existing/disused threads when a new thread needs to be created. To maximise the efficiency of your "disused thread pool" (so you don't have more "disused threads" than necessary), you could create (or pretend to create) the new thread when IPC is being delivered to the receiver and not when IPC is actually sent. To do that you'd need a queue of "pending IPC". That way, for asynchronous operating mode you'd only have a maximum of one thread (per process), where you pretend to destroy it, then recycle it to create a "new" thread, and get the data needed for the "new" thread from the queue of "pending IPC".

Now that it's been optimised, it looks very similar to my "FIFO queue of messages". Instead of calling "getmessage()" and blocking until a message is received, you'd be calling "terminate_thread()" and being put into a "disused thread pool" until IPC is received. The only main difference (other than terminology) is that you'd still be implicitly creating one initial thread."

Yeah, I had thought about something like this for thread stacks (keeping a cache of orphaned stacks to remove the need to allocate them). But you're right that it can totally be done for full threads, with even better performance.
