Linked by Hadrien Grasland on Sun 29th May 2011 09:42 UTC
It's funny how trying to have a consistent system design makes you constantly jump from one area of the OS to another. I initially just tried to implement interrupt handling, and now I'm cleaning up the design of an RPC-based daemon model, which will be used to implement interrupt handlers along with most other system services. Anyway, now that I've got something I'm personally satisfied with, I want to ask everyone who's interested to check the design and tell me if anything in it sounds like a bad idea in the short or long run. This is a core part of the OS's design, and I'm really not interested in core design mistakes emerging in a few years if I can fix them now. Many thanks in advance.
RE[7]: RPC considered harmful
by Brendan on Thu 2nd Jun 2011 10:02 UTC in reply to "RE[6]: RPC considered harmful"

PART I

Hi,

"While I can see some similarities between this and asynchronous messaging, there's also significant differences; including the overhead of creating (and eventually destroying) threads, which (in my experience) is the third most expensive operation microkernels do (after creating and destroying processes).

Ah, Brendan, Brendan, how do you always manage to be so kind and helpful with people who play with OSdeving ? Do you teach it in real life or something ?

Anyway, have you pushed your investigation so far that you know which step of the thread creation process is expensive ? Maybe it's something whose impact can be reduced...
"

Thread creation overhead depends on a lot of things; like where the user-space stack is (and if it's pre-allocated by the caller), how kernel stacks are managed (one kernel stack per thread?), how CPU affinity and CPU load balancing works, how much state is saved/restored on thread switches and must be initialised to default values during thread creation (general registers, FPU/MMX/SSE, debug registers, performance monitoring registers, etc), how thread local storage is managed, etc.

For an extremely simple OS (single-CPU only, no support for FPU/MMX/SSE, no "per thread" debugging, no "per thread" performance monitoring, no thread local storage, no "this thread has used n cycles" time accounting) that uses one kernel stack (e.g. an unpreemptable kernel), if the initial state of a thread's registers is "undefined" and the thread's stack is pre-allocated, then thread creation could be very fast. I'm not sure anyone would want an OS like that though (except maybe for embedded systems).
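
To make that concrete, here's a rough sketch (in C, with made-up field names - this isn't from any real kernel) of the "per thread" state involved. Every field a kernel doesn't support is initialisation cost it doesn't pay during thread creation:

    #include <stdint.h>

    /* Illustrative only: per-thread state a kernel may have to set up
     * when creating a thread. Field names are invented. */
    typedef struct thread {
        uintptr_t user_stack;        /* cheap if pre-allocated by the caller */
        uintptr_t kernel_stack;      /* not needed with one global kernel stack */
        uint64_t  gp_regs[16];       /* can be left "undefined" in the fast case */
        uint8_t   fpu_sse[512];      /* FXSAVE area; skip without FPU/MMX/SSE */
        uint64_t  debug_regs[4];     /* skip without "per thread" debugging */
        uint64_t  perf_regs[4];      /* skip without "per thread" perf monitoring */
        uintptr_t tls_base;          /* skip without thread local storage */
        uint64_t  cycles_used;       /* skip without per-thread time accounting */
        uint32_t  cpu_affinity;      /* trivial on a single-CPU system */
    } thread_t;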

Also, if other operations that a kernel does are extremely slow then thread creation could be "faster than extremely slow" in comparison.

There's something else here too. For most OSs, typically only a thread within a process can create a thread for that process; which means that at the start of thread creation the CPU/kernel is already using the correct process' address space, so it's easier to set up the new thread's stack and thread local storage. For your IPC this isn't the case (the sending process' address space would be in use at the time you begin creating a thread for the receiving process), so you might need to switch address spaces during thread creation (and blow away TLB entries, etc.) if you don't do it in a "lazy" way (postpone the parts of thread creation that touch the receiver's memory until the thread first starts running).
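
Here's a sketch of what I mean by the "lazy" way (all names invented for illustration). The send path, still running in the sender's address space, only queues a half-constructed thread; the parts that touch the receiver's memory wait until the thread is first dispatched, when the receiver's address space is already loaded, so no extra address space switch (and TLB flush) is needed:

    /* Hypothetical sketch of lazy thread creation for IPC. */
    struct process;                       /* opaque here */

    typedef enum { LAZY_INIT, READY } thread_state_t;

    typedef struct thread {
        thread_state_t state;
        void          *pending_msg;
    } thread_t;

    /* Assumed kernel helpers; prototypes only. */
    extern thread_t *alloc_thread_struct(void);
    extern void      enqueue_runnable(struct process *receiver, thread_t *t);
    extern void      setup_user_stack(thread_t *t);   /* touches user memory */
    extern void      setup_tls(thread_t *t);          /* touches user memory */

    /* Runs in the sender's address space: no receiver memory touched. */
    void ipc_spawn_receiver_thread(struct process *receiver, void *msg)
    {
        thread_t *t = alloc_thread_struct();
        t->state = LAZY_INIT;
        t->pending_msg = msg;
        enqueue_runnable(receiver, t);    /* no address space switch here */
    }

    /* Runs when the scheduler first dispatches the thread, with the
     * receiver's address space already loaded. */
    void on_first_dispatch(thread_t *t)
    {
        if (t->state == LAZY_INIT) {
            setup_user_stack(t);
            setup_tls(t);
            t->state = READY;
        }
    }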


"On top of that, (because you can't rely on the queues to serialise access to data structures) programmers would have to rely on something else for reentrancy control; like traditional locking, which is error-prone (lots of programmers find it "hard" and/or screw it up) and adds extra overhead (e.g. mutexes with implied task switches when under lock contention).

This has been pointed out by Alfman, solved by introducing an asynchronous operating mode where pending threads are queued and run one after the other. Sorry for not mentioning it in the post where I try to describe my model, when I noticed the omission it was already too late to edit.
"

Hehe. Let's optimise the implementation of this!

You could speed it up by having a per-process "thread cache". Rather than actually destroying a thread, you could pretend to destroy it and put it into a "disused thread pool" instead, and then recycle these existing/disused threads when a new thread needs to be created. To maximise the efficiency of your "disused thread pool" (so you don't have more "disused threads" than necessary), you could create (or pretend to create) the new thread when IPC is being delivered to the receiver and not when IPC is actually sent. To do that you'd need a queue of "pending IPC". That way, for asynchronous operating mode you'd only have a maximum of one thread (per process), where you pretend to destroy it, then recycle it to create a "new" thread, and get the data needed for the "new" thread from the queue of "pending IPC".
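
Something like this, as a toy model (one pool and one "pending IPC" queue per process; all names invented):

    #include <stddef.h>

    typedef struct msg { struct msg *next; /* payload omitted */ } msg_t;
    typedef struct thr { struct thr *next; msg_t *current; } thr_t;

    typedef struct proc {
        thr_t *disused;        /* "disused thread pool" (LIFO is fine) */
        msg_t *pending_head;   /* FIFO of IPC not yet given a thread */
        msg_t *pending_tail;
    } proc_t;

    /* "Pretend to destroy" a finished thread: recycle it immediately
     * if IPC is waiting, otherwise park it in the disused pool. */
    void thread_done(proc_t *p, thr_t *t)
    {
        msg_t *m = p->pending_head;
        if (m) {
            p->pending_head = m->next;
            if (!p->pending_head) p->pending_tail = NULL;
            t->current = m;    /* same thread, "new" thread's IPC */
        } else {
            t->next = p->disused;
            p->disused = t;
            t->current = NULL;
        }
    }

    /* IPC delivery: reuse a parked thread if one exists, else queue. */
    thr_t *ipc_deliver(proc_t *p, msg_t *m)
    {
        thr_t *t = p->disused;
        if (t) {
            p->disused = t->next;
            t->current = m;
            return t;          /* caller makes it runnable */
        }
        m->next = NULL;        /* no spare thread: queue the message */
        if (p->pending_tail) p->pending_tail->next = m;
        else                 p->pending_head = m;
        p->pending_tail = m;
        return NULL;
    }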

Now that it's been optimised, it looks very similar to my "FIFO queue of messages". Instead of calling "getmessage()" and blocking until a message is received, you'd be calling "terminate_thread()" and being put into the "disused thread pool" until IPC is received. The only real difference (other than terminology) is that you'd still be implicitly creating one initial thread.
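
Seen from the receiver's side, the symmetry looks like this (both APIs are hypothetical, just to show the shape):

    typedef struct { int type; /* payload omitted */ } msg_t;

    extern msg_t get_message(void);      /* blocks until IPC arrives */
    extern void  handle(msg_t *m);
    extern void  terminate_thread(void); /* really: park in the disused pool */

    /* Explicit FIFO-of-messages model: one long-lived thread, one loop. */
    void message_loop(void)
    {
        for (;;) {
            msg_t m = get_message();     /* "blocked" == parked, waiting */
            handle(&m);
        }
    }

    /* "Thread per IPC" model after the pooling optimisation: the kernel
     * runs the handler in a recycled thread; terminate_thread() plays the
     * same role as blocking inside get_message(). */
    void ipc_entry(msg_t *m)
    {
        handle(m);
        terminate_thread();
    }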

[Continued in Part II - silly 8000 character limit...]
