Linked by JRepin on Mon 29th Apr 2013 09:24 UTC
Linux: After ten weeks of development, Linus Torvalds has announced the release of Linux kernel 3.9. The latest version of the kernel has a new device mapper target which allows a user to set up an SSD as a cache for hard disks to boost disk performance under load. There's also kernel support for multiple processes waiting for requests on the same port, a feature which allows server work to be distributed better across multiple CPU cores. KVM virtualisation is now available on ARM processors, and RAID 5 and 6 support has been added to Btrfs's existing RAID 0 and 1 handling. Linux 3.9 also has a number of new and improved drivers, which means the kernel now supports the graphics cores in AMD's next generation of APUs and also works with the high-speed 802.11ac Wi-Fi chips that will likely appear in Intel's next mobile platform. Read more about the new features in What's new in Linux 3.9.
Thread beginning with comment 560228
RE[9]: Load of works there
by Brendan on Wed 1st May 2013 14:09 UTC in reply to "RE[8]: Load of works there"
Brendan
Member since:
2005-11-16

Hi,

"It's not the context switches between user space and kernel that hurt micro-kernels; it's context switches between processes (e.g. drivers, etc)."

"In today's operating systems, don't userspace context switches need to go through kernel space context switches?"


Yes; but the user space to kernel space switch itself only costs about 50 cycles on old CPUs (less on newer CPUs using the SYSCALL instruction). Also, you're often already in the kernel (e.g. due to an IRQ, exception or unrelated syscall) when you find out that you need to switch processes, so there's no user space to kernel space switch before the task switch at all.

Note that this applies to both micro-kernels and monolithic kernels - they both have the same user space to kernel space context switch costs.

"But it's not really the context switches between processes that hurt micro-kernels; it's the way that synchronous IPC requires so many of these context switches. E.g. sender blocks (causing task switch to receiver) then receiver replies (causing task switch back)."


"Still, if the context switch were 'free', I think it'd help take microkernels out of the shadows. IPC doesn't have to be expensive, but we'd have to use it differently from the synchronous call-and-block pattern (like you said). I was always a fan of asynchronous batch messaging like that used by mainframes. We think of them as dinosaurs, but they did an inspirational job of breaking problems down into elements that could scale up very easily. Modern software design doesn't do justice to the efficiency that those earlier, more constrained machines demanded."


Agreed. The other thing I'd mention is that asynchronous messaging can work extremely well on multi-core; as the sender and receiver can be running on different CPUs at the same time and communicate without any task switches at all.
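To make that concrete, here's a rough sketch of the sort of single-producer/single-consumer ring you'd put in a shared memory region between the two processes. This isn't from any real kernel - the message type and ring size are made up for the example - but it shows why the fast path needs no syscall and no task switch at all:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define RING_SIZE 256                       /* power of two; illustrative */

    typedef struct { int type; long payload; } msg_t;   /* hypothetical message */

    typedef struct {
        _Atomic unsigned head;                  /* only the sender writes this */
        _Atomic unsigned tail;                  /* only the receiver writes this */
        msg_t slots[RING_SIZE];
    } ring_t;

    /* Sender side: returns false if the ring is full (receiver has fallen behind). */
    static bool ring_send(ring_t *r, msg_t m)
    {
        unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SIZE)
            return false;                       /* full; caller decides what to do */
        r->slots[head % RING_SIZE] = m;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;                            /* no syscall, no task switch */
    }

    /* Receiver side: returns false if there's nothing to read yet. */
    static bool ring_recv(ring_t *r, msg_t *out)
    {
        unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return false;                       /* empty */
        *out = r->slots[tail % RING_SIZE];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

The receiver only has to fall back to a blocking wait (futex, IPI, whatever the kernel provides) when the ring runs empty, so under load both sides just keep running on their own cores.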

"Of course this is a lot of work - it's no surprise that a lot of micro-kernels (Minix, L4, Hurd) failed to try."

"I have been working on my own async library, and although it works, the nagging problem is that without an OS designed for truly asynchronous system calls it ends up being emulated on top of a synchronous kernel like Linux, where the benefits never really show. It's difficult to sell a new paradigm (even one with merit) when it runs poorly on existing operating systems that were optimized for the old paradigm."


Ironically, for modern kernels (e.g. both Linux and Windows) everything that matters (IO) is asynchronous inside the kernel.

- Brendan

Reply Parent Score: 2

RE[10]: Load of works there
by Alfman on Wed 1st May 2013 16:17 in reply to "RE[9]: Load of works there"
Alfman Member since:
2011-01-28

Brendan,

"Yes; but the user space to kernel space switching is only about 50 cycles on old CPUs (less for a newer CPU using the SYSCALL instruction)"


On older processors it used to be a couple of hundred cycles, as in the link I supplied. I'm not sure how much they've brought it down since then. Do you have a source for the 50-cycle figure?


"and often you're in the kernel (e.g. due to IRQ, exception or unrelated syscall) when you find out that you need to switch processes and there is no user space to kernel space switch before a task switch."

That's only true for pre-emption, though. Blocking system calls always incur explicit context switches to and from userspace.


"Note that this applies to both micro-kernels and monolithic kernels - they both have the same user space to kernel space context switch costs."

While technically true, a monolithic kernel doesn't need to context switch between modules the way a microkernel does. That's the reason microkernels are said to be slower. The microkernel's context switches can be reduced by using non-blocking messaging APIs - which is what I thought you were already suggesting earlier, no?


"Agreed. The other thing I'd mention is that asynchronous messaging can work extremely well on multi-core; as the sender and receiver can be running on different CPUs at the same time and communicate without any task switches at all."

On the other hand, whenever I benchmark things like this I find that the cache-coherency overhead is a significant bottleneck on SMP systems, to the point that a single processor can often do better with IO-bound processes. SMP is best suited to CPU-bound work where the ratio of CPU processing to inter-core IO is relatively high. Nothing is ever simple, huh?
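That overhead is easy to demonstrate with a toy ping-pong test - two threads bouncing one cache line back and forth and timing the round trip. A quick sketch, not a serious benchmark; compile with -pthread and compare the number when the threads land on cores that do and don't share a cache:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000

    static _Atomic int turn;                    /* the cache line both cores fight over */

    static void *pong(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ROUNDS; i++) {
            while (atomic_load(&turn) != 1) ;   /* spin until it's our turn */
            atomic_store(&turn, 0);             /* hand the line back */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct timespec a, b;

        pthread_create(&t, NULL, pong, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < ROUNDS; i++) {
            atomic_store(&turn, 1);             /* ping */
            while (atomic_load(&turn) != 0) ;   /* wait for the pong */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("%.0f ns per round trip\n", ns / ROUNDS);
        return 0;
    }

Most of what that prints is the interconnect, not the "work".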


"Ironically; for modern kernels (e.g. both Linux and Windows) everything that matters (IO) is asynchronous inside the kernel."

With Linux, file IO uses blocking threads in the kernel; all the FS drivers use threads. These are less scalable than async designs, since every request ties up a kernel stack until it returns. The bigger problem with threads is that they're extremely difficult to cancel asynchronously. One cannot simply "kill" a thread just anywhere - there could be side effects like locked mutexes, incomplete transactions and corrupt data structures - so consequently most FS IO requests are not cancelable on Linux. In most cases this isn't observable because most file IO operations return quickly enough, but there are very annoying cases from time to time (most commonly with network shares) where we cannot cancel the blocked IO or even kill the process. We are helpless; all we can do is wait for the FS timeouts to elapse.
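You can see the limitation even through the POSIX AIO interface (which glibc implements with worker threads doing ordinary blocking reads underneath). A minimal sketch - the path is made up, imagine it sits on a dead network mount; link with -lrt:

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        struct aiocb cb;

        int fd = open("/mnt/share/somefile", O_RDONLY);    /* hypothetical network mount */
        if (fd < 0) { perror("open"); return 1; }

        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        switch (aio_cancel(fd, &cb)) {          /* try to take the request back */
        case AIO_CANCELED:    puts("cancelled before it hit the filesystem");      break;
        case AIO_NOTCANCELED: puts("already blocked in the FS - all we can do is wait"); break;
        case AIO_ALLDONE:     puts("it finished before we could cancel");          break;
        default:              perror("aio_cancel");                                break;
        }

        close(fd);
        return 0;
    }

Once the worker thread is parked inside the filesystem, AIO_NOTCANCELED is the usual answer, which is exactly the "wait for the timeout" situation above.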

It's difficult to justify the amount of work that'd be needed to fix these abnormal cases. I'd rather push for a real async model, but that's not likely to happen given the immense scope such a patch would entail.

Reply Parent Score: 2

RE[11]: Load of works there
by Brendan on Thu 2nd May 2013 02:07 in reply to "RE[10]: Load of works there"
Brendan Member since:
2005-11-16

Hi,

"On older processors it used to be a couple of hundred cycles, as in the link I supplied. I'm not sure how much they've brought it down since then. Do you have a source for the 50-cycle figure?"


I think we're talking about different things. The cost of a "bare" software interrupt or call gate is around 50 cycles; but the benchmarks in your link are probably measuring the bare syscall plus an assembly language "stub", plus a call to a C function (with its prologue/epilogue), plus another call (possibly via a table of function pointers) to a minimal "do nothing" function.
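If you want to see what that whole path adds up to on a given machine, something like this (x86, GCC/Clang, using the raw syscall() wrapper so libc can't cache the result) times the full user to kernel to user round trip - entry stub, dispatch, sys_getpid and the return - rather than the bare privilege switch on its own:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <x86intrin.h>              /* __rdtsc() */

    #define ITERS 1000000

    int main(void)
    {
        unsigned long long start, end;

        syscall(SYS_getpid);            /* warm up caches and the TLB */

        start = __rdtsc();
        for (int i = 0; i < ITERS; i++)
            syscall(SYS_getpid);        /* full round trip, stubs and all */
        end = __rdtsc();

        printf("%llu cycles per syscall round trip\n", (end - start) / ITERS);
        return 0;
    }

Whatever that prints will always be well above the cost of the bare SYSCALL/SYSRET pair, because it's measuring everything else in the path as well.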

"Note that this applies to both micro-kernels and monolithic kernels - they both have the same user space to kernel space context switch costs."

"While technically true, a monolithic kernel doesn't need to context switch between modules the way a microkernel does. That's the reason microkernels are said to be slower. The microkernel's context switches can be reduced by using non-blocking messaging APIs - which is what I thought you were already suggesting earlier, no?"


For the overhead of privilege level switches, and for the overhead of switching between tasks/processes, there's no real difference between micro-kernel and monolithic.

Micro-kernels are said to be slower because privilege level switches and switching between tasks/processes tend to happen more often; not because the overhead is higher.
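As a back-of-the-envelope illustration (the exact counts depend on the design, so treat these numbers as assumptions), compare what one read() that has to reach the disk driver costs in each case:

    #include <stdio.h>

    /* Toy accounting for a single read() request.
     * Monolithic: app -> kernel (VFS, FS and driver all run in-kernel) -> app.
     * Microkernel: app -> kernel -> FS server -> kernel -> disk driver,
     * and then the same chain again in reverse to deliver the reply. */
    int main(void)
    {
        int mono_priv_switches  = 2;        /* enter kernel, return to user */
        int mono_task_switches  = 0;        /* everything runs in the caller's context */

        int micro_priv_switches = 2 * 3;    /* app, FS server and driver each enter/exit */
        int micro_task_switches = 4;        /* app->FS, FS->driver, driver->FS, FS->app */

        printf("monolithic:  %d privilege switches, %d task switches\n",
               mono_priv_switches, mono_task_switches);
        printf("microkernel: %d privilege switches, %d task switches\n",
               micro_priv_switches, micro_task_switches);
        return 0;
    }

The per-switch cost is the same in both cases; it's the extra task switches (and the TLB/cache fallout that comes with them) that synchronous IPC keeps paying for.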

"Agreed. The other thing I'd mention is that asynchronous messaging can work extremely well on multi-core; as the sender and receiver can be running on different CPUs at the same time and communicate without any task switches at all."

"On the other hand, whenever I benchmark things like this I find that the cache-coherency overhead is a significant bottleneck on SMP systems, to the point that a single processor can often do better with IO-bound processes. SMP is best suited to CPU-bound work where the ratio of CPU processing to inter-core IO is relatively high. Nothing is ever simple, huh?"


That'd be true regardless of how threads communicate. The only way to reduce the cache-coherency overhead is to build a more intelligent scheduler (e.g. make threads that communicate a lot run on CPUs that share the same L2 or L3 cache).
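Until schedulers do that well, you can fake it from user space by pinning the two communicating threads onto cores that share a cache. A Linux-specific sketch - whether CPUs 0 and 1 actually share an L2/L3 is just an assumption here; check /sys/devices/system/cpu/cpu0/cache/*/shared_cpu_list on the real machine:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to one CPU. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    static void *receiver(void *arg)
    {
        (void)arg;
        pin_to_cpu(1);                  /* assumed to share a cache with CPU 0 */
        /* ... poll the shared ring / message queue here ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pin_to_cpu(0);                  /* sender stays on CPU 0 */
        if (pthread_create(&t, NULL, receiver, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        /* ... produce messages here ... */
        pthread_join(t, NULL);
        return 0;
    }

It's a blunt instrument compared to a cache-aware scheduler, but pinning like this is the usual way to find out how much of the "SMP overhead" is really just thread placement.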

"Ironically; for modern kernels (e.g. both Linux and Windows) everything that matters (IO) is asynchronous inside the kernel."

"With Linux, file IO uses blocking threads in the kernel; all the FS drivers use threads. These are less scalable than async designs, since every request ties up a kernel stack until it returns. The bigger problem with threads is that they're extremely difficult to cancel asynchronously. One cannot simply 'kill' a thread just anywhere - there could be side effects like locked mutexes, incomplete transactions and corrupt data structures - so consequently most FS IO requests are not cancelable on Linux. In most cases this isn't observable because most file IO operations return quickly enough, but there are very annoying cases from time to time (most commonly with network shares) where we cannot cancel the blocked IO or even kill the process. We are helpless; all we can do is wait for the FS timeouts to elapse."

"It's difficult to justify the amount of work that'd be needed to fix these abnormal cases. I'd rather push for a real async model, but that's not likely to happen given the immense scope such a patch would entail."


Sadly, it's easier to keep adding extensions on top of extensions (and end up with an ugly mess that works) than it is to start again with a new, clean design, even when the new design would reduce the total amount of work and improve the quality of the end result in the long run. Most people are too short-sighted for that - they only look at the next few years rather than the next few decades.

- Brendan

Reply Parent Score: 2