“The Linux kernel continues to evolve, incorporating new technologies and gaining in reliability, scalability, and performance. One of the most important features of the 2.6 kernel is a scheduler implemented by Ingo Molnar. This scheduler is dynamic, supports load-balancing, and operates in constant time — O(1). This article explores these attributes of the Linux 2.6 scheduler, and more.”
While it's a recent document, it reads like those "what is new in 2.6" documents, and it's missing a couple of features. E.g., these days we have "scheduler domains" (http://lwn.net/Articles/80911/), which allow more advanced, optimized scheduling decisions (e.g. optimizing the scheduler for multi-core CPUs, or for power-saving policies), and 2.6.18 will get an implementation of "smpnice" (http://lwn.net/Articles/186438/).
…has a significant lag when opening new applications.
Not that I should really be complaining about an OS that's completely free, but even with the new scheduler I feel like the ghost of XP still lingers in my non-MS box. As far as I can remember, this O(1) scheduler's been out for many years, and I do remember the 2.4 kernel was horribly slow – which was why many of my friends abandoned the idea of using Linux on a desktop. The new 2.6 kernel shows significant improvements over previous releases (of many flavors); however, the lag is still a noticeable factor. With GHz-level machines, there shouldn't be such a terrible performance drag when doing simple desktop activities – it completely kills the user's experience. Win XP can barely run on slower machines, and while Linux can perform well on them there is still a faint whiff of wasted (or mismanaged) cycles all over the place. This may very well have to do with the fact that *nix was designed for uptime and reliability and not real-time desktop performance, so I expect some significant (and maybe radical) changes in the next kernel release.
It's still about as responsive as XP, which isn't much of a compliment – but at least it's about quo…
I believe the problems you described are not related to the Linux scheduler itself, but to the X Window System, which was designed to feature network transparency: the machine where application programs run can differ from the user's local machine.
This obviously introduces some added complexity and overhead, and in its current state is not as responsive as, let’s say, Windows, whose “Window System” is integrated within the operating system itself.
I'd expect major improvements in this area in the next few years (months?), though. Now that Xorg has been modularized and eye candy and desktop acceleration (XGL/AIGLX) are going mainstream, Xorg is due for a major revamp. After all, its development pace hasn't exactly been spectacular since it was first created back in the '80s.
You are correct in stating that X features network transparency. You are misinformed, though: it does not introduce much overhead. X on the local machine communicates over a TCP socket, which is blazing fast.
I think the user was letting loose a general gripe about “foo window toolkit is slow so that means X is slow.”
Xorg has already "gotten" a major revamp, and it was called X.org 7.0, which is fully modularized. This hard work to modularize X will allow developers to work on individual modules and make releases independent of the rest of X. This will speed up X development significantly. X.org 7.1 integrated AIGLX, so exciting new eye candy will now be available to many people.
You are misinformed, though: it does not introduce much overhead. X on the local machine communicates over a TCP socket, which is blazing fast.
Actually, on systems that support it (which is virtually everything these days, except maybe something like BSD/OS), X uses shared memory instead of TCP sockets. Not really a nitpick, as local TCP communication is still pretty slow compared to, well, almost anything else (a local socket, e.g. /var/run/foo, or shared memory).
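For what it's worth, here's a minimal sketch (plain socket calls, not real Xlib code) of the distinction being made: a local client normally talks to the conventional /tmp/.X11-unix/X0 Unix domain socket, and only falls back to TCP (port 6000 plus the display number) for remote displays; bulk image data can additionally go over shared memory via the MIT-SHM extension.

/* Minimal sketch: connecting to display :0 over the local Unix domain
   socket versus over TCP.  Error handling is trimmed for brevity. */
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int connect_local_x(void) {
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, "/tmp/.X11-unix/X0", sizeof(addr.sun_path) - 1);
    return connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 ? fd : -1;
}

int connect_tcp_x(const char *host) {
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(6000);              /* display :0 over TCP */
    inet_pton(AF_INET, host, &addr.sin_addr);
    return connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 ? fd : -1;
}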
Toolkits can indeed be an issue. Any gtk+ app (the latest Inkscape build comes to mind) feels sluggish on my work XP computer, whereas fluxbox is blazingly fast on my home Arch box. Granted, I should probably check for an update to gtk for windows. But then again, I guess as long as it’s not as easy as ‘pacman -Syu’, I won’t bother…
But I believe the parent post also had issues with application start time. I don’t think the GHz-count matters much when dealing with multi-MB web browsers, but a faster hard drive (think raptor, or SCSI if you’re really serious about that) helps a lot.
Inkscape, as a kde app, uses qt, not gtk.
The use of fluxbox has nothing to do with toolkits; if you use nautilus or konqueror under fluxbox, you are using gtk or qt.

The apps use the toolkit no matter what window manager you use.
GatoLoko, what are you talking about? Inkscape uses gtkmm. In case you didn’t know, gtkmm is c++ bindings to gtk, not qt. Inkscape is a gnome app if you want to go by which toolkit it uses. If you don’t believe me, look at the gtkmm code yourself:
http://inkscape.cvs.sourceforge.net/inkscape/inkscape_project/
Just because it's GTK doesn't mean that it's a GNOME app.
Oops, I was thinking about a different program, wrong name in my head. I'm sorry.
Close, but no cigar. The speed issues aren’t the result of complexity due to network transparency. They are due to two things:
1) Lack of proper alpha-blit acceleration for anti-aliased text;
2) The use of a separate process for window management.
(1) is a driver-quality issue, and is not a problem on drivers like NVIDIA’s that do at least try to accelerate special-cases of RENDER. (2) is where the complexity factors in. The communication between the WM and the X server is both highly elaborate, and prone to performance problems. For example, when the user resizes a window, you get a series of context switches between the WM, the X server, and the client app.
There is also the fact that Windows just plays a lot of shady tricks with scheduling to artificially boost response time. It gives the foreground window special privileges just for being in front. These increase the feeling of responsiveness without actually making anything faster, but also make the system much less predictable under load.
There is also the fact that Windows just plays a lot of shady tricks with scheduling to artificially boost response time. It gives the foreground window special privileges just for being in front. These increase the feeling of responsiveness without actually making anything faster, but also make the system much less predictable under load.
I wouldn’t call it shady tricks.
Windows is a desktop OS, and as such, optimizing for the active application is a valid method. If you don't like it, just disable this feature.
It is a shady trick, from a programming point of view. Just priority boosting the foreground process isn’t a very precise optimization. It might make things smoother when things are going right, but it’s the reason Windows grinds to a halt when the foreground process does something stupid. Today at work, Visual Studio went off the deep end while debugging an application, and the system went unresponsive for a minute or two before I could get to the task manager to shut it down. Linux or OS X never do that.
Linux or OS X never do that.
Except for when a high priority thread goes into an infinite loop by mistake…
It’s also the reason why moving to a hyperthreaded or SMP machine makes such a huge difference on windows boxes, in terms of UI responsiveness. The improvement felt on a 2.6 linux or OSX box is far less impressive…
Linux or OS X never do that.
I wish. When Gnome in SLED locked up because 2 different apps were accessing a dir with 3k files over USB1, and Gnome became totally unresponsive for a minute or 2, I thought: man, Linux’ thread scheduler sucks ass.
But apparently that’s X’ fault.
Nope, that’s the Gnome programmers’ fault.
Running through a directory with 3000 files is fast if you’re only getting their names, but can become slow if you are also reading their ACLs, translating their UGIDs into names, and trying to identify their mime-types on top of that.
GUI apps *cannot* run this sort of task in UI callbacks, because that causes the UI to wait until the task is completed, thus becoming unresponsive. This is something that should go into another thread, while the main thread is free to continue responding to input events (button clicks and the like).
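As a minimal illustration (plain pthreads, not tied to any particular toolkit; notify_ui() here is a made-up placeholder for whatever "post a result back to the main loop" mechanism the toolkit provides), the expensive per-file work goes into a worker thread while the UI thread keeps servicing events:

/* Sketch: walk a directory and stat() every entry in a worker thread,
   so the GUI thread never blocks on the slow per-file work. */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>

static void notify_ui(int nfiles, long long total_bytes) {
    /* placeholder: a real app would post this back to the GUI main loop */
    printf("scanned %d files, %lld bytes\n", nfiles, total_bytes);
}

static void *scan_dir(void *arg) {
    const char *path = arg;
    DIR *d = opendir(path);
    struct dirent *de;
    struct stat st;
    char buf[4096];
    int nfiles = 0;
    long long total = 0;

    if (!d)
        return NULL;
    while ((de = readdir(d)) != NULL) {
        snprintf(buf, sizeof(buf), "%s/%s", path, de->d_name);
        if (stat(buf, &st) == 0) {   /* the slow part: one syscall per file */
            nfiles++;
            total += st.st_size;
        }
    }
    closedir(d);
    notify_ui(nfiles, total);
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t worker;

    /* The UI thread only kicks off the scan; a real app would keep running
       the toolkit's event loop instead of joining immediately like this. */
    pthread_create(&worker, NULL, scan_dir, (void *)(argc > 1 ? argv[1] : "."));
    pthread_join(worker, NULL);
    return 0;
}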
Did anything else besides Gnome lock up? Did X lock up? Could you switch to a text console or log into the machine through SSH? If so, it isn't the kernel's fault.
…has a significant lag when opening new applications.
The scheduler really wouldn't have much to do with this; what you're complaining about is likely IO: the initial load-up of libs not in memory.

DEs and their apps have a fair amount of linkage; lib load-up (the first time, for 'em at least) is a nice hit waiting for the disk.
That said… without specifics (app, reproducing, etc), hard to comment (although again, scheduler likely has no real relevance to your beef).
As people already stated, the fault is not with the Linux kernel. The things you describe might be because of X, disabled DMA on disks, poor drivers and so on and so forth. Especially enabling DMA gives a nice overall boost to everything if it isn’t enabled. Though, I don’t know any recent distro which doesn’t enable DMA by default..
… whether there is a realtime scheduler available for Linux?
Yes and no. Linus probably won’t ever accept one into the official kernel, or at least that’s how things stand now. There are 3rd party projects/patches that do offer realtime linux. The best are probably proprietary at the moment, but there are some.
The problem is that going real-time creates HUGE amounts of additional overhead in kernel maintenance to make sure that everything is real-time safe… and this would all be work for just a tiny minority of the user base.
A real-time patch is available from Ingo, who wrote the O(1) scheduler.
http://people.redhat.com/mingo/
It is being merged piecemeal into the Linux kernel over a period of time. The patch set is already under heavy use, though, and rapid improvement is being made.
Dr Con Kolivas' [1] staircase scheduler [2] deserves a bit more attention. It is an alternative to the built-in scheduler, with a different and much cleaner design (and less code), focused on better (desktop) interactivity without sacrificing throughput. Development started in 2004, and it has been stable for some time. All benchmarks show it better than or equal to the current scheduler in 2.6.
It has been in Andrew Morton’s patchset for some time, but the kernel maintainers seem to rather patch their current scheduler with interactivity instead of using the inherently interactive staircase design.
[1] http://members.optusnet.com.au/ckolivas/kernel/
[2] http://kerneltrap.org/node/2744
The staircase scheduler is not in Andrew Morton's patchset and (AFAIK) never has been.

What do you mean by inherently interactive? The current scheduler looks to me like it tries to be inherently interactive as well.

It just doesn't seem to be that big a deal any more. There are lots of cool schedulers out there, but it is quite likely that they would end up causing more people problems than the current one, which has gone through a huge amount of testing, if nothing else.
It sure has been in -mm (more than a year ago), but went away as it didn't seem it would get into the kernel.

Inherently interactive means the interactivity is not a hack (the changing priorities) but is in the design; read about it – it's much cleaner.

The current scheduler is quite big and bloated, and needs regular new hacks to cope with other usage scenarios. Staircase needed way less tuning to accommodate a wider range of use, and thus is much smaller, while still leading to perceivably better interactivity and equal or better throughput.

I think if it wasn't for 'political' instead of technical reasons, staircase would've been in the Linux kernel for quite some time.
It sure has been in -mm (more than a year ago), but went away as it didn't seem it would get into the kernel.

OK, but it isn't now, which was my point.

Inherently interactive means the interactivity is not a hack (the changing priorities) but is in the design; read about it – it's much cleaner.

I've read not only about it, but the code itself. I think the current scheduler is pretty ugly, but it is still in the design (from the very beginning) to be interactive.
The current scheduler is quite big and bloated, and needs regular new hacks to cope with other usage scenarios. Staircase needed way less tuning to accommodate a wider range of use, and thus is much smaller, while still leading to perceivably better interactivity and equal or better throughput.

Firstly, I don't think you can conclude anything about interactivity based on the sample sizes and types of tests in place.

Secondly, the current scheduler doesn't need regular new hacks. It has probably changed no more frequently over the past year than the staircase scheduler.

I think if it wasn't for 'political' instead of technical reasons, staircase would've been in the Linux kernel for quite some time.

There doesn't seem to be a technical reason to include it. The kernel development model doesn't specify that you include something unless it is proven bad. You must include something only if it has been proven good.
>The staircase scheduler is not in Andrew Morton's patchset and (AFAIK) never has been.

>OK, but it isn't now, which was my point.

Well, you said it never has been, so…

>There doesn't seem to be a technical reason to include it. The kernel development model doesn't specify that you include something unless it is proven bad. You must include something only if it has been proven good.

Well, there aren't many interactivity tests, but staircase is better on those. And throughput, as I said, isn't worse. So the tests say it's better or even.

Second, it does this with a smaller codebase and, as far as I can tell, a cleaner design.

Two good technical reasons to get it in, right?
>>OK, but it isn't now, which was my point.

>Well, you said it never has been, so…

I said it is not, and as far as I knew it hasn't been. Which was correct at the time of writing.

Sorry, now I see I misread your post as saying that it is *presently* in Andrew Morton's tree, which was what I was trying to correct.
Anyway it doesn’t matter.
>Well, there aren't many interactivity tests, but staircase is better on those. And throughput, as I said, isn't worse. So the tests say it's better or even. Second, it does this with a smaller codebase and, as far as I can tell, a cleaner design.

Interactivity benchmarks aren't really representative as far as I'm concerned. Before 2.6 was released, the scheduler was apparently awesome according to all metrics and the types of workloads that kernel developers' workstations had. When it actually started getting wide use, obscure problems surfaced.

Same with throughput. It is hard to say that some change doesn't hurt throughput if all those people using 10GbE networking, or 500-CPU Altixes, or 10,000-disk enterprise databases haven't tested it.
>Two good technical reasons to get it in, right?

I'm not saying it is worse, nor that it isn't better. But "let's put this huge change in, which doesn't specifically address a bug, but hopefully isn't any worse" isn't enough.

It is hard to judge a "clean design". Aside from the mechanism (in which the two are pretty similar), the whole thing is a big set of heuristics. One thing may sound better than another, but that doesn't translate to real world performance.
Now, I wouldn't say staircase should go in immediately, but I think (as I said) it deserves a bit more attention. It might not be a revolutionary enhancement, but it definitely seems better in the area of interactivity, and the smaller code size might definitely make it easier to maintain and hack on. As for the point that problems may surface, well, it has been in testing for some time, so I think that won't be a very big problem.

But indeed, it hasn't been tested much on very big and heavy stuff… well, a volunteer would help.

And again, I think it just deserves some attention.
If these new features in the Linux kernel really are "such a huge improvement", that also implies how crappy the earlier implementation was. Just think how many of these modules have been repeatedly rewritten in the last few years and boasted about as the next big thing each time.

Even though Linux keeps getting new features, none of them is the ultimate solution to the problem (such as ZFS in Solaris, scheduler activations in NetBSD, etc.)
So why bother? No reason to get excited. It’ll probably be replaced in the next kernel version.
Your post is completely wrong, on every single count.
First, the O(1) scheduler is not new. The basic design was created in 2002. It’ll be around in the next kernel version.
Second, the fact that it's a huge improvement on the 2.4 scheduler does not imply the previous one was crappy. The previous one was just fine — for systems with a moderate number of server tasks. Which is fine, because that was most of Linux's market in the late 1990s. Where the big improvement of the new scheduler comes in is on loads with thousands of threads, on highly parallel machines, or with interactive requirements.
Third, you obviously have no idea how software development works. There is no "ultimate solution" to any non-trivial problem, just various solutions with various properties. Remember Solaris's much-vaunted M:N scheduler based on scheduler activations? Remember how it was quietly replaced by a 1:1 model similar to Linux's? How about Solaris's FireEngine TCP stack? Sun bills it as a huge improvement; does that mean that the TCP stack in previous versions of Solaris is crappy?
Gradual improvement is what happens with software development. Different solutions have different performance, and experience bears out which solutions are best-suited for the loads your software encounters. Very often, earlier in the life of a piece of software, a simpler solution is implemented in order to speed time-to-market. As the software matures, pieces are replaced as experience shows such changes to be necessary.
First, the O(1) scheduler is not new. The basic design was created in 2002. It’ll be around in the next kernel version.
Actually it’s been around since the 60s.
Second, the fact that it's a huge improvement on the 2.4 scheduler does not imply the previous one was crappy.
In this instance it does. The O(1) scheduler is a perfect example of the way Linux developers come at problems without prior knowledge. The so-called O(n) scheduler was a poor choice for any process scheduler, which is why you don't find it in commercial systems more recent than, oh, 1965.

This is a good example of why, after more than a decade of development with more than six thousand people involved, the Linux kernel is barely equal to systems that are twenty years older, but with a lot fewer development hours: too little learning from the past, too much 'fix it when it breaks.'
Actually it’s been around since the 60s.
Which scheduler I was referring to should be obvious based on the title of the article..
In this instance it does. The O(1) scheduler is a perfect example of the way Linux developers come at problems without prior knowledge.
Of course they don't approach it with prior knowledge. This is the first system they've built. It's not like they've got a bunch of kernels under their belt. On the other hand, code complexity is also an issue. For most tasks Linux was used for in the 1990s, an O(1) scheduler wouldn't be appreciably faster, and would've increased complexity a lot.
The so-called O(n) scheduler was a poor choice for any process scheduler, which is why you don't find it in commercial systems more recent than, oh, 1965.
And all those systems running with the 4.4BSD scheduler don’t count as “commercial”?
This is a good example of why, after more than a decade of development with more than six thousand people involved, the Linux kernel is barely equal to systems that are twenty years older, but with a lot fewer development hours:
That’s not really a fair characterization. Yes, there are thousands of people involved with Linux, but the core team doesn’t contain nearly that many people. Also, many of those people aren’t full-time paid employees.
Oh, and Linux is about an order of magnitude bigger in scope than any of the older systems you’re talking about. All those people working on Linux are working on different pieces. There’s people doing drivers for all sorts of systems, people working on optimizing it for desktops, servers, supercomputers, PDAs, and embedded applications.
Solaris, IRIX, AIX: all are pretty much only good for servers. Load up Nexenta on an x86 box sometime. Its interactive scheduling performance is at the level of Linux 2.4…
too little learning from the past, too much ‘fix it when it breaks.’
Another way to see it is “don’t fix what isn’t broken”. Or “don’t fix it until you need to.” Or the classical, “premature optimization is the root of all evil.”
“Actually it’s been around since the 60s.”
Which scheduler I was referring to should be obvious based on the title of the article..
Yes. It's Ingo's "O(1)" scheduler, which is identical to one I first saw described in DEC documentation for RSX-11 back in '73.
Of course they don't approach it with prior knowledge. This is the first system they've built. It's not like they've got a bunch of kernels under their belt.
What’s wrong with reading and understanding the literature before you start reinventing the wheel?
And all those systems running with the 4.4BSD scheduler don’t count as “commercial”?
4.4BSD did not use an O(n) scheduler. It used an O(C) scheduler, where C was the number of priority queues. Since C is constant at compile time, it was actually an O(1) scheduler in Ingo's sense.
That’s not really a fair characterization. Yes, there are thousands of people involved with Linux, but the core team doesn’t contain nearly that many people. Also, many of those people aren’t full-time paid employees.
There is no "core team." There are more than 1000 people paid full time to work on the Linux kernel. Check the professional attendance at OLS.
Or the classical, “premature optimization is the root of all evil.”
Sigh. Selecting a constant time algorithm over an O(n) algorithm when the constant time algorithm is more than 20 years old and no more complex to implement is not “premature optimization”.
And waiting 10 years to get around to fixing the O(n) algorithm is no less than ‘fix it when it breaks.’ And anyone who has seriously followed LKML for the last 6 years knows that ‘wait until it breaks’ is the first rule of Linux kernel developers, the second, apparently, being ‘ignore the past’.
(By the way, I found IRIX worked very well on my SGI workstation back when I had one. You do realize that SGI was a workstation company before they got into servers, right?)
You fail to realise that Linux was not a commercial system when the O(n) scheduler was designed. And actually, for many workloads the 2.4 scheduler is faster than the 2.6 scheduler due to less cacheline footprint. And it was much simpler to understand.

As you say, O(1) scheduling algorithms aren't rocket science. They are pretty much a 2nd year university topic, and definitely some if not most kernel devs would know how to do them. So when they found it was needed, it was implemented.

There are many, many places in the kernel where one could use a theoretically better algorithm that isn't appropriate, either due to complexity or lower order factors. For example, pretty much anywhere that uses an O(n) hash or an O(log(n)) rb tree could be using an O(1) radix tree.

The secret to being a good programmer isn't blindly implementing everything with the lowest-complexity algorithms, but knowing what to apply and where.
This "Linux sucks because it used to have an O(n) scheduler" attitude is typical academentia mentality.

Academics tend to forget constants, and assume that an "O(1)" algorithm is better than an O(n) algorithm, case closed.

Well, in the so-called "real world", some O(n) algorithms can actually beat O(1) algorithms up until very large values of "n", just because their constants are orders of magnitude smaller.

What's better for keeping an array of 500 integers sorted and with low search times? A balanced tree, or a simple, brute-force insertion sort?
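To put a concrete (and hedged) example behind that rhetorical question: a flat sorted array with a binary search and a memmove() on insert is "worse" on paper than a balanced tree, but for a few hundred elements it lives in a handful of contiguous cache lines, which is exactly the kind of constant factor that matters.

/* Sketch: keep ~500 ints sorted in a flat array.  Insertion is O(n)
   because of the memmove(), lookup is O(log n), but both touch a small
   contiguous block of memory -- which is why this can outrun a
   pointer-chasing balanced tree until n gets much larger. */
#include <string.h>

#define CAP 512

typedef struct { int v[CAP]; int n; } sorted_array;

/* index of the first element >= x */
static int lower_bound(const sorted_array *a, int x) {
    int lo = 0, hi = a->n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a->v[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

static int insert_sorted(sorted_array *a, int x) {
    int pos;
    if (a->n == CAP)
        return -1;                      /* full */
    pos = lower_bound(a, x);
    memmove(&a->v[pos + 1], &a->v[pos], (a->n - pos) * sizeof(int));
    a->v[pos] = x;
    a->n++;
    return pos;
}

static int contains(const sorted_array *a, int x) {
    int pos = lower_bound(a, x);
    return pos < a->n && a->v[pos] == x;
}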
You do realize that you’re arguing against the approach that the Linux folk took, right?
By the way, neither btrees nor insertion sort are O(1)…
“You do realize that you’re arguing against the approach that the Linux folk took, right?”
Yes, and no. Their approach was deemed the right one at the time (both simple and good enough), and suited Linux well until it didn’t, and that’s all that matters.
Good engineering is all about choosing the most appropriate solution for your needs given the resources available, and knowing when that solution becomes inappropriate and working to fix it.
The O(n) scheduler was good enough, but Linux evolved in ways that forced a change, and they (whoever they were) changed it.
You people should just stop whining. Linux started as a hobby project, yes, people working on it were inexperienced, maybe. But it is eating the proprietary unixes’ lunch, even though they were supposedly “better designed” from scratch. Well, it sucks to be them.
“By the way, neither btrees nor insertion sort are O(1)…”
Where did I claim they were? I was merely implying that balanced trees have a better algorithmic complexity (in the average and worst cases) than insertion sort, but up until a certain value of “n” balanced trees have enough overhead to actually make them perform worse. Again, the “good” solution is the one that provides better use of available resources, being good enough for the task at hand.
Good engineering is all about choosing the most appropriate solution for your needs given the resources available, and knowing when that solution becomes inappropriate and working to fix it.
Yes. This, of course, is why using the O(n) scheduler was not good engineering. O(n) schedulers have been known for decades not to be an appropriate solution for multiprocess scheduling.
“By the way, neither btrees nor insertion sort are O(1)…”
Where did I claim they were?
Where did I say you did? You appended a comment about btrees and insertion sorts immediately after a comment about O(1) versus O(n). I was simply curious as to whether you were changing the subject or selecting an inappropriate example.
The B-tree/insertion sort example isn’t really relevant here, as we’re discussing two scheduler selection algorithms. The O(n) algorithm has exactly the same expressive complexity as the O(1) algorithm at selection time (which is when you care). It’s the same loop applied to a different data structure. However for n > about 2 on most systems, the O(1) algorithm is going to, on average, be faster.
(The actual metric is that the O(n) scheduler is slower than O(1) anytime n is greater than the number of priority queues. I’ve yet to see even an early Linux system where that wasn’t true.)
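To make the comparison concrete, here's a rough sketch of the two selection loops being talked about; the task type and the weighting are placeholders, not the actual Linux code:

/* Sketch of the two selection strategies.  pick_next_on() scans every
   runnable task (a 2.4-style "goodness" scan); pick_next_o1() indexes a
   fixed array of priority queues guarded by a bitmap of non-empty levels,
   so its cost is bounded by the number of priority levels, not by the
   number of runnable tasks. */
#include <stddef.h>

#define NPRIO 140                     /* priority levels, fixed at build time */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct task {
    int prio;                         /* 0 = highest priority */
    int counter;                      /* remaining timeslice */
    struct task *next;
};

/* O(n): walk the whole list of runnable tasks and keep the best one. */
struct task *pick_next_on(struct task *runnable) {
    struct task *t, *best = NULL;
    int best_weight = -1;

    for (t = runnable; t; t = t->next) {
        int weight = t->counter - t->prio;   /* stand-in for a goodness() rule */
        if (weight > best_weight) {
            best_weight = weight;
            best = t;
        }
    }
    return best;
}

/* O(1): one queue per priority plus a bitmap of non-empty levels. */
struct prio_rq {
    unsigned long bitmap[(NPRIO + BITS_PER_WORD - 1) / BITS_PER_WORD];
    struct task *queue[NPRIO];        /* head of each priority's list */
};

struct task *pick_next_o1(struct prio_rq *rq) {
    size_t i, nwords = sizeof(rq->bitmap) / sizeof(rq->bitmap[0]);

    for (i = 0; i < nwords; i++)      /* bounded by NPRIO, not by task count */
        if (rq->bitmap[i]) {
            int bit = __builtin_ctzl(rq->bitmap[i]);
            return rq->queue[i * BITS_PER_WORD + bit];
        }
    return NULL;                      /* nothing runnable */
}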
Was the O(n) scheduler good enough for a hobby OS? absolutely. Was it ‘good engineering’? Not even in an undergraduate CS OS class.
>However for n > about 2 on most systems, the O(1) algorithm is going to, on average, be faster.

Wrong. O(1) can be O(1000000000). And O(n) can be O(0.00000000001*n).

>The actual metric is that the O(n) scheduler is slower than O(1) anytime n is greater than the number of priority queues. I've yet to see even an early Linux system where that wasn't true.

No, it isn't. You obviously don't know anything about constants and lower order factors.

However, to clear up another one of your misunderstandings, the n in the 2.4 scheduler is not the total number of processes in the system. It is the number of processes on the runqueue. And this is very often 1 when the scheduler is invoked. Even on a small server, on many workloads it is unusual to go above 10 or 20.

One of the constant factors in the 2.6 scheduler is the number of priority levels. I think this is around 140.

>Was the O(n) scheduler good enough for a hobby OS? absolutely. Was it 'good engineering'? Not even in an undergraduate CS OS class.

Actually, it was fine engineering at the time. It absolutely satisfied the requirements of anyone using Linux at the time. It probably got up the noses of some ivory tower academics, though.

You don't seem to A. have a good grasp of the actual solutions themselves other than theoretical complexity measurements (but even those you misunderstood); or B. understand the requirements they need to satisfy. So how are you in a position to judge whether it was 'good engineering', or 'good enough'?
>However for n > about 2 on most systems, the O(1) algorithm is going to, on average, be faster.

>Wrong. O(1) can be O(1000000000). And O(n) can be O(0.00000000001*n).
"Can be", but isn't. We're comparing two actual algorithms here. In the straightforward implementation of the two algorithms, on average, the O(1) algorithm finds a process on the first run queue it looks at more than half of the time, whereas, on average, the O(n) algorithm has to look at n/2 processes to determine which to run next. If n > 2, then the O(1) algorithm is faster.
>However, to clear up another one of your misunderstandings, the n in the 2.4 scheduler is not the total number of processes in the system. It is the number of processes on the runqueue. And this is very often 1 when the scheduler is invoked.

Well, I'm sorry you're confused by my lack of clarity, but I never said that n was the total # of processes in the system.

Actually, in a reasonable implementation, 1 is the lower bound on the # of processes in the run queue, as the scheduler shouldn't be invoked on an empty run queue.

Er, wait, that's probably an optimization that hasn't hit Linux yet.

>You don't seem to A. have a good grasp of the actual solutions themselves other than theoretical complexity measurements (but even those you misunderstood); or B. understand the requirements they need to satisfy.
Damn, but that's a funny statement. The irony of it nearly killed my irony meter.
Since I’ve managed to confuse you completely, let me restate my case all in one place:
Once upon a time, Linus Torvalds started a hobby OS project. One of the early decisions he made was to use a poor process scheduling algorithm. Over a decade later, Ingo Molnar replaced that poor process scheduling algorithm with a somewhat better one.

At the time Linus wrote the original process scheduler, the reasons for choosing a classic round robin scheduler for process scheduling were well understood, and, given what Linus asserted he wanted his OS to do, made that sort of scheduler more appropriate than the one he selected.

No amount of revisionist history can make Linus' choice at that time a good choice. It was an inexperienced choice made by someone unfamiliar with either practice or literature.
It is a long stretch of the imagination to claim that that original algorithm was suitable for the purposes Linux was being used for all the way up until the “O(1)” algorithm replaced it.
So how are you in a position to judge whether it was 'good engineering', or 'good enough'?
Hmm… 30 years of OS design, first ported Unix in ’78, (last ported it in ’06,) researcher in OS resource scheduling in the ’80s, former manager of commercial Unix development for a fortune 500 company, OS researcher at a fortune 50 company for five years in the 90s, twice built linux embedded platforms where I replaced the process scheduler because it wasn’t suitable for the application, contributor to LKML and the kernel for years….
nope, nothing there to give me that ability.
Don’t confuse sloppy exposition on an obscure web forum with lack of expertise. I know I don’t confuse your rude personal insults with a lack of social skills.
>"Can be", but isn't. We're comparing two actual algorithms here. In the straightforward implementation of the two algorithms, on average, the O(1) algorithm finds a process on the first run queue it looks at more than half of the time, whereas, on average, the O(n) algorithm has to look at n/2 processes to determine which to run next. If n > 2, then the O(1) algorithm is faster.

Then they're performing different tasks. If you want them to perform the same task, you need some sort of priority queue, otherwise the O(n) guy doesn't need to be O(n) either.

And linux uses an array of 140 queues for its priority queue (rather than the traditional heap). So therefore it is a real engineering factor you ignore.

And don't keep going on about "simple round robin", you obviously just pulled that out to try to win the argument, and anyway it is a totally unsuitable algorithm for any traditional UNIX-like process management.
[snip ranting]
>Don't confuse sloppy exposition on an obscure web forum with lack of expertise. I know I don't confuse your rude personal insults with a lack of social skills.

You can confuse them all you like, and if you think *I'm* being rude you can elect to ignore me. Just like I can make a judgement of your expertise and decide that it isn't worth debating with you either.
Have a great day though.
And linux uses an array of 140 queues for its priority queue (rather than the traditional heap). So therefore it is a real engineering factor you ignore.
There you go with the random strawmen. What exactly does what Linux does in 2.6 have to do with a discussion of what Linus should have done in 2.0?
Of course, as usual, you are incorrect. Linux does not use an array of 140 queues. You may want to read sched.c and see how it actually works.
And don't keep going on about "simple round robin", you obviously just pulled that out to try to win the argument, and anyway it is a totally unsuitable algorithm for any traditional UNIX-like process management.
Um, we'd have to be having an argument for me to want to win one. I've made a case about the design decision Linus made back in '91 and its engineering quality. You've been raising strawmen about the relationship between that scheduler and the changes Ingo Molnar made in '02, when you haven't been throwing around gratuitous insults.
[snip ranting]
I like you. You’re funny.
if you think *I’m* being rude you can elect to ignore me.
You are. But it’s more fun to see if I can set you straight than to ignore you. I get a perverse kick out of people telling me I don’t understand systems I’ve worked on for over 25 years. Especially when they are unable to offer any evidence to support their claim.
Let me know if you ever want to have an actual debate on the relative merit of scheduling algorithms. I’ll be happy to school you on the subject.
>There you go with the random strawmen. What exactly does what Linux does in 2.6 have to do with a discussion of what Linus should have done in 2.0?

Because *that* is how you do an O(1) scheduler for a UNIX-like system. Not a stupid round robin scheduler like you were trying to suggest for 2.0!

Can you understand that? Round robin == crap. Ergo, my previous arguments about the O(n) scheduler being a decent tradeoff hold. Your only counter to them was that a round robin scheduler would have been better.
>Of course, as usual, you are incorrect. Linux does not use an array of 140 queues. You may want to read sched.c and see how it actually works.

OK, mr pedant. It uses dual arrays of 140 queues. Per CPU. Happy?
>Um, we'd have to be having an argument for me to want to win one. I've made a case about the design decision Linus made back in '91 and its engineering quality.

And you still haven't justified why it was a bad one. "Use a round robin scheduler" is utterly naive and wrong. *That* would have been a bad engineering choice.
>Let me know if you ever want to have an actual debate on the relative merit of scheduling algorithms. I'll be happy to school you on the subject.

OK.

1. Tell me why a round robin scheduler would have been better than the O(n) scheduler in terms of general purpose performance (not context switching performance).

2. Tell me how you can design a priority-queue-based scheduler with at least 40 priorities (i.e. a simple UNIX model) that will be faster than an O(n) model in "most interactive applications".
Because *that* is how you do an O(1) scheduler for a UNIX-like system. Not a stupid round robin scheduler like you were trying to suggest for 2.0!
The classic round robin scheduler I am discussing (the one I mentioned being documented in RSX documentation, very early in this discussion) is an N priority queue round-robin-within-queue scheduler.
It is the one in which no compute-bound penalty is charged, but threads which block rather than having a quantum expire are switched to the 'highest priority' queue, and those which have a quantum expire are switched to their 'original priority' queue. It does not feature 'real time' priorities. It does not feature scheduler pools, because it wasn't designed for an SMP, in which separate processor pools make sense. The only optimization it uses is a head pointer that remembers which is the highest priority pool that is not empty.

It is as simple to understand and implement as the "O(n) scheduler" implemented by Linus in 1991, and there is no workload with more than a few processes on the run queue for which it is slower, or has a larger cache footprint (either i or d), than the O(n) scheduler.
OK, mr pedant. It uses dual arrays of 140 queues. Per CPU. Happy?
I’m always happy. You’re still wrong, but I’m always happy. You really should go read kern/sched.c more carefully.
"Use a round robin scheduler" is utterly naive and wrong. *That* would have been a bad engineering choice.

On that we agree. Unfortunately, you keep using my shorthand against me, as if you can't deal with ellipses. When I first mentioned the scheduler that I would have used rather than Linus' O(n) scheduler, I mentioned where it was documented. Throughout this discussion, I've called it various things, since we are just having a conversation on a web forum, and I thought you were following my context. But, for the entire time, I have meant the scheduler I described in the first paragraph of this post. It is neither a pure/simple round robin scheduler, nor is it the 2.6 scheduler. It is a better engineering choice for the sort of system that Linus proposed in '91 than the scheduler he opted for.
1. Tell me why a round robin scheduler would have been better than the O(n) scheduler in terms of general purpose performance (not context switching performance).

See above. The scheduler I meant when I said "classic round robin" is better than "Linus O(n)" because it schedules faster and otherwise has similar properties.
2. Tell me how you can design a priority-queue-based scheduler with at least 40 priorities (i.e. a simple UNIX model) that will be faster than an O(n) model in "most interactive applications".
Using the queue head optimization described above means that a priority queue based scheduler always knows without searching which process to run. It runs the one the head pointer is pointing to. If you put the idle process in a queue, you don’t even have to check for an empty runqueue.
The cost of the optimization is a single test whenever a process is inserted onto a run queue. The win is that no searching is ever done at schedule time.
The interactivity is achieved by placing processes that are preempted on the tail of the queue for their priority when they were selected, and placing processes that become runnable after blocking on the tail of the highest priority queue when they become runnable.
Once again, the cost is a single test, this time at queue insertion time. (It’s really two tests, but they happen in different places in the state machine, so only one is ever required: the preempter places a process at the tail of its priority queue, an act which unblocks a thread places the thread at the tail of the highest priority queue.)
Scheduling time becomes fixed O(1) with a very small value for K, and independent of both n and C. (C, you will recall, is the number of priority queues.) Similarly, process insertion is O(1) with small K and independent of n and C.
There is one weakness to this design compared to basic Unix process scheduling on a uniprocessor, and I’ll leave it to you to identify it and explain what the basic Unix process scheduler does differently.
If you set C to 2, by the way, the problem mentioned above disappears, and it is better than the O(N) scheduler in all respects for the workload Linus expected in 1991.
>The classic round robin scheduler I am discussing (the one I mentioned being documented in RSX documentation, very early in this discussion) is an N priority queue round-robin-within-queue scheduler.

Sorry, that isn't the classic round robin scheduler.

>>OK, mr pedant. It uses dual arrays of 140 queues. Per CPU. Happy?

>I'm always happy. You're still wrong, but I'm always happy. You really should go read kern/sched.c more carefully.

No, you're wrong. What do you think it uses?
>Using the queue head optimization described above means that a priority queue based scheduler always knows without searching which process to run. It runs the one the head pointer is pointing to. If you put the idle process in a queue, you don't even have to check for an empty runqueue.

>The cost of the optimization is a single test whenever a process is inserted onto a run queue. The win is that no searching is ever done at schedule time.

>The interactivity is achieved by placing processes that are preempted on the tail of the queue for their priority when they were selected, and placing processes that become runnable after blocking on the tail of the highest priority queue when they become runnable.

>Once again, the cost is a single test, this time at queue insertion time. (It's really two tests, but they happen in different places in the state machine, so only one is ever required: the preempter places a process at the tail of its priority queue, an act which unblocks a thread places the thread at the tail of the highest priority queue.)

>Scheduling time becomes fixed O(1) with a very small value for K, and independent of both n and C. (C, you will recall, is the number of priority queues.) Similarly, process insertion is O(1) with small K and independent of n and C.

>There is one weakness to this design compared to basic Unix process scheduling on a uniprocessor, and I'll leave it to you to identify it and explain what the basic Unix process scheduler does differently.
Wait, before you get carried away: you haven't shown me a working solution yet. This solution has a far larger cache footprint and will be slower for a small number of tasks.

The fact that there are a large number of queues means there will be a large cache footprint. I can't quite understand exactly what you're describing… but what happens when you have run the full quantum of your "queue head" and want to find the next process to run? You need to search the priority queues, right?

Maybe you have a list of "next to run" processes, which means you don't have to search. Well then what happens when you insert a new process into a priority with no other processes in it? You have to search.

And suppose you have a single process running. OK, so no searching, right? Maybe so, but with a dynamic priority it is going to be wandering around the priority queues. Cache footprint.

You still haven't shown why it is better than the O(n) scheduler in all ways, because it quite likely would be slower due to larger cache footprint at n=small.
This solution has a far larger cache footprint and will be slower for a small number of tasks.
Um, given that you point out below that you don’t understand the solution, how, precisely, can you claim this about it?
The fact that there are a large number of queues means there will be a large cache footprint.
Actually, no. The memory footprint is slightly larger than that for Linus’ scheduler, but is made up for by fewer branch delays, and by fewer memory fetches on average.
I can’t quite understand exactly what you’re describing… but what happens when you have run the full quantum of your “queue head” and want to find the next process to run?
You run the process that’s pointed to by the “highest priority non-empty queue” pointer you update every time you add a process to a run queue.
You need to search the priority queues, right?
Nope. You always know which is the highest priority queue with a runnable process in it, because you always set that whenever you add a process to a run queue, or delete the last active process from a run queue.
There is no searching done at schedule time. There is a single pointer dereference to select the process pointed to by the ‘highest priority non-empty queue” pointer.
Maybe you have a list of "next to run" processes, which means you don't have to search. Well then what happens when you insert a new process into a priority with no other processes in it?
If it's a higher priority queue than the current highest priority non-empty queue, you set the 'highest priority non-empty queue' pointer to point to it. If it's not, you do nothing.
And suppose you have a single process running. OK, so no searching, right? Maybe so, but with a dynamic priority it is going to be wandering around the priority queues. Cache footprint.
This algorithm has no dynamic priorities. (That’s what “no compute bound penalty” is about.) However, it is possible to use dynamic priorities in a way that doesn’t cause the only runable process in the system to wander around. Now that would be an optimization I wouldn’t have expected from Linus in ’91.
You still haven't shown why it is better than the O(n) scheduler in all ways, because it quite likely would be slower due to larger cache footprint at n=small.
It's not going to have a larger cache footprint. It's going to have a single additional data structure, in the form of a one-dimensional array, that is accessed on very rare occasions, and is less likely to have a large cache footprint than the linked list structure used by Linus' original algorithm.
Don’t confuse memory footprint with cache footprint.
Here's some guaranteed-not-to-compile pseudocode that might help. It's for the uniprocessor case, and I left out synchronization, the implementation of 'run' (which doesn't modify any of the queues), and the details of the structures…
#define MAX_PRI /* SOME VALUE */
#define PRI_IO MAX_PRI

static runqueue_head_t *rq[MAX_PRI];
static runqueue_head_t *current = rq[0];
static uint maxpri = 0;

/* The idle process is always runnable and always on the priority 0 queue,
   i.e. rq[0] is never empty. */

/* Quantum expiration handler */
void expire(void) {
    runqueue_entry_t *e = current->head;

    /* If there's more than one entry on the runqueue */
    if (current->head != current->tail) {
        remove_from_head(current);
        add_to_tail(e, rq[e->pri]);
        if (e->pri > maxpri) maxpri = e->pri;
    }
    run(current->head);
}

/* Any activity that unblocks a blocked task calls this.
   e isn't on any runqueue at this point. */
void unblock(runqueue_entry_t *e) {
    add_to_tail(e, rq[PRI_IO]);
    maxpri = PRI_IO;
    run(current->head);
}

/* Any activity that blocks a running task calls this
   (killing the task counts as 'blocking'). */
void block() {
    remove_from_head(current);
    if (!current->head) {
        /* If you figure out what is missing here,
           you understand the algorithm. */
    }
    run(current->head);
}

void add_to_tail(runqueue_entry_t *e, runqueue_head_t *rqp) {
    if (rqp->head)
        rqp->tail->next = e;
    else
        rqp->head = e;
    rqp->tail = e;
    e->next = 0;
}

void remove_from_head(runqueue_head_t *rqp) {
    if (rqp->head)
        rqp->head = rqp->head->next;
}
>Um, given that you point out below that you don't understand the solution, how, precisely, can you claim this about it?

Because I can understand enough to know that the only way it can work is with a larger cache footprint.

>Actually, no. The memory footprint is slightly larger than that for Linus' scheduler, but is made up for by fewer branch delays, and by fewer memory fetches on average.

No, that doesn't matter. What matters is the range of cachelines that are fetched. It doesn't matter whether all of them or just one of them are fetched per ctxsw; what matters is the total footprint.

>You run the process that's pointed to by the "highest priority non-empty queue" pointer you update every time you add a process to a run queue.

I didn't ask what happens when you wake up some other process; I asked what happens when your current process runs out of timeslice. Assume no other process has woken up in the meantime, and > 0 other processes exist on the runqueue.
>Nope. You always know which is the highest priority queue with a runnable process in it, because you always set that whenever you add a process to a run queue, or delete the last active process from a run queue.

You haven't even thought this through, have you? Say you have 40 priority queues. A process from queue 0 is running right now, and there is 1 process on queue 10, 2 processes on queue 20, and 1 on queue 30.

When queue 0 finishes running, suppose it has a "next pointer" to the process on queue 1. Now what happens when the process on queue 1 finishes its timeslice? How does your scheduler "know" where the next process is?
>It's not going to have a larger cache footprint. It's going to have a single additional data structure, in the form of a one-dimensional array, that is accessed on very rare occasions, and is less likely to have a large cache footprint than the linked list structure used by Linus' original algorithm.

>Don't confuse memory footprint with cache footprint.

You don't need to access all the memory in a single context switch operation. Just the fact that you touch that memory at some stage means it contributes to cache footprint.

I guess you haven't moved out of the 80s. Nowadays, instructions are basically free in comparison to cache misses.
[snip sad C code]
No, the concept you try to express with this C code is wrong. You fail to maintain the current->head list when you remove the process from the queue in expire(). Imagine current is the only process in this priority level: then current->head will not be valid. You'll have to search for the next priority level that has an active process.

Lastly, you still haven't told me what the Linux 2.6 scheduler uses. It uses dual arrays of 140 queues, per CPU, which you say is wrong. Tell me what it *really* uses.
blargh. That’s what I get for typing up an example at 2 in the morning. Sorry; there’s a line missing from unblock. It should read:
/* Any activity that unblocks a blocked task calls this.
   e isn't on any runqueue at this point. */
void unblock(runqueue_entry_t *e) {
    add_to_tail(e, rq[PRI_IO]);
    maxpri = PRI_IO;
    current = rq[maxpri];
    run(current->head);
}
I suspect you'd have understood the algorithm without that omission. That said, the search you want to put in expire doesn't go into expire. It's what the comment
/* If you figure out what is missing here,
you understand the algorithm. */
was about in block.
Expire never leaves current empty, because the currently running process is always runnable at the expiration of its quantum.
Here’s block, with the comment replaced:
void block() {
    remove_from_head(current);
    if (!current->head) {
        /* since the idle task is in rq[0], this always
           finds a valid rq */
        int i;
        for (i = MAX_PRI; i >= 0; i--)
            if (rq[i]->head) break;
        maxpri = i;
        current = rq[i];
    }
    run(current->head);
}
The reason this algorithm doesn't have the cache footprint issues you think it does is that the only time the search is done is in block (which is the only routine that can result in an empty rq) and then only when the last task is removed from a run queue. Unless you've picked a value for MAX_PRI that's far too large for the system, rq fits in a single cache line and you have to take at least that hit anyway. For the uses Linus wanted Linux for in '91, I'd have set MAX_PRI = 2.
By the way, the pedantic reason that you’re wrong about the data structure used in the 2.6 scheduler is that you’ve neglected the optimization that I would add here if MAX_PRI had to be large: the priority bitmap. (and yes, I do realize that rq should have been declared as rq[MAX_PRI+1]).
It is left as an exercise for the reader what two lines of code would have to be added to maintain such a bitmap.
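(And in case working it out isn't your idea of fun, here is a rough sketch of the bookkeeping, with invented names rather than the code above: one bit per queue, set when a task lands on an empty queue and cleared when the last task leaves, so "find the highest non-empty queue" becomes a find-first-set on a single word instead of a scan. The 2.6 scheduler keeps a comparable bitmap over its 140 queues.)

#define NQUEUES 32                     /* assume few enough queues for one word */

static unsigned int nonempty;          /* bit i set => rq[i] has at least one task */

static void mark_nonempty(int pri) {   /* call when a task is queued to an empty rq */
    nonempty |= 1u << pri;
}

static void mark_empty(int pri) {      /* call when the last task leaves rq[pri] */
    nonempty &= ~(1u << pri);
}

static int highest_nonempty(void) {    /* -1 if nothing is runnable */
    if (!nonempty)
        return -1;
    /* a find-first-set instruction does the search in constant time */
    return 31 - __builtin_clz(nonempty);
}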
>Expire never leaves current empty, because the currently running process is always runnable at the expiration of its quantum.

Well, you're right that I didn't exactly understand it, because it had important omissions. That said, I knew there would have to be a search somewhere, because I've tried implementing almost exactly the same thing.

>The reason this algorithm doesn't have the cache footprint issues you think it does is that the only time the search is done is in block (which is the only routine that can result in an empty rq) and then only when the last task is removed from a run queue.

Only when the last task is removed *from the priority queue*, not the entire runqueue.

This actually happens very frequently. In workloads where tasks are simply expiring their timeslices, ctxsw rates are typically very low (depending on your quantum size), generally < 1000/s.

The critical case is when there is a lot of block/wake activity going on, and in this case you will frequently need to search.

It barely alleviates the cache issues, and definitely has a much larger footprint than a simple O(n) sched for small task numbers.
>Unless you've picked a value for MAX_PRI that's far too large for the system, rq fits in a single cache line and you have to take at least that hit anyway. For the uses Linus wanted Linux for in '91, I'd have set MAX_PRI = 2.

That doesn't provide a very good range of dynamic priorities. So it would still have had many of the problems that a round robin scheduler would have.

>By the way, the pedantic reason that you're wrong about the data structure used in the 2.6 scheduler is that you've neglected the optimization that I would add here if MAX_PRI had to be large: the priority bitmap. (and yes, I do realize that rq should have been declared as rq[MAX_PRI+1]).

No I didn't. I said that it uses arrays of 140 queues. This is a completely correct statement.

The bitmap does a bit to alleviate cache footprint costs (especially because the upper 100 queues are often empty), but the undeniable fact remains that it has a larger cache footprint.

You don't think I keep repeating this without having a clue what I'm talking about, do you? I have actually profiled the scheduler for things like this (and not just in microbenchmarks).
Well, you're right that I didn't exactly understand it, because it had important omissions. That said, I knew there would have to be a search somewhere, because I've tried implementing almost exactly the same thing.
Two omissions, one accidental and one intentional, but yeah, oops on my part. Leaving the search out was supposed to be a clever way to steer the discussion towards lazy evaluation and why moving the search to the block routine was a good thing. Oh well, best laid plans and all that.
By the way, there is a variant on this scheduler that requires no searching at all, but I’ve never been convinced that the bookkeeping necessary is justified, so I’ve never used it in a real system. If you have C queues, it requires a 2^C table of ints in the most straightforward implementation. . .
I sometimes ask how to implement it as an interview question and then ask how to optimize it. It’s a classic example of the old saw about solving computer science problems with more indirection.
>only when the last task is removed from a run queue.

Only when the last task is removed *from the priority queue*, not the entire runqueue.
“a run queue” meant an rq, which, as you say is a queue for a particular priority.
The critical case is when there is a lot of block/wake activity going on, and in this case you will frequently need to search.
In this case, the search is often short, since you are probably ping-ponging between two queues. Also, an interesting artifact of having a lot of blocking going on is that the blocking itself tends to make a mess of the caches, so what effect the scheduler has will be greatly alleviated.
It barely alleviates the cache issues, and definitely has a much larger footprint than a simple O(n) sched for small task numbers.
I’m still not convinced about the ‘much’ larger cache footprint, as I’ve not seen that to be true in practice.
That doesn’t provide a very good range of dynamic
priorities. So it would still have had many of the
problems that a round robin scheduler would have.
None of which would have mattered for that particular workload. Certainly not being able to easily tune the # of priority queues is a limitation of both the design I laid out and the current 2.6 default scheduler.
But given the cache line size for D caches, even back in ’91, a 16 entry queue wouldn’t have done any more harm to cache footprint than a 2 entry queue, and would have handled more than enough priorities for the workload.
You don’t think I keep repeating this without having
a clue what I’m talking about, do you? I have actually
profiled the scheduler for things like this
If I thought you didn’t have a clue, I wouldn’t have bothered writing up examples; because I would have assumed you wouldn’t understand them.
>”a run queue” meant an rq, which, as you say is a queue
>for a particular priority.
Yep. And that’s what your code did. “Run queue” seems to
be used in practice to describe the aggregate set of
runnable tasks, whether or not it’s a single queue.
>In this case, the search is often short, since you are
>probably ping-ponging between two queues. Also, an
>interesting artifact of having a lot of blocking going
>on is that the blocking itself tends to make a mess of
>the caches, so what effect the scheduler has will be
>greatly alleviated.
Oh, I wouldn’t say that. In many really high performance
applications it often will hit. You might have, say, a
web server which has a single thread per CPU, and that
multiplexes requests in and out of a single network
interface. When this network interface fills up (on
the TX side) or empties out (RX), the thread will
block and wake up.
In this application, you are not ping-ponging between
two threads (unless you count the idle thread as a
thread, but that doesn’t use much cache).
And in that case, you will be searching all the way
down to the idle task (if you do the naive
implementation you describe, but I guess that is easy
to avoid).
Even when you do have several processes running and
blocking, that doesn’t necessarily trash the cache.
A single process might have only several tens or
hundreds of K of working set.
>None of which would have mattered for that particular
>workload.
What particular workload? Mixed CPU and IO processes
running together on the same system? A general purpose
UNIX dynamic priority scheduler? I think you’d need a
few more than 2 queues…
>Certainly not being able to easily tune the
># of priority queues is a limitation of both the
>design I laid out and the current 2.6 default
>scheduler.
Define easily. At runtime? (because it is very easy to
change the 2.6 scheduler at compile time)
>But given the cache line size for D caches, even back
>in ’91, a 16 entry queue wouldn’t have done any more
>harm to cache footprint than a 2 entry queue, and
>would have handled more than enough priorities for the
>workload.
16 entry queue? Doubly linked list, so each entry takes
8 bytes on a 32-bit machine, for 4 cachelines on a
typical CPU of ’91. And then if you have a few other
bits of state in there (# tasks running, idle task
ptr, next task ptr), that brings you up to 5 cachelines
where the O(n) scheduler could do it with 1. That isn’t insignificant, considering Linus used to be known to
count icache line size of the context switch function
and reject patches that increased it by 1.
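Spelled out as toy structs, that arithmetic looks like this (assuming 32-bit pointers and 32-byte cache lines, as above; these are nobody’s real declarations):

    struct list_head { struct list_head *next, *prev; };   /* 8 bytes */

    struct multi_queue_rq {                   /* the 16-queue design          */
        struct list_head queue[16];           /* 16 * 8 = 128 bytes = 4 lines */
        int              nr_running;
        void            *idle, *curr;         /* a few more words => 5th line */
    };

    struct single_queue_rq {                  /* the O(n) design              */
        struct list_head runnable;            /* one list head...             */
        int              nr_running;
        void            *idle, *curr;         /* ...still within 1 line       */
    };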
OK, people were still doing size/speed tradeoffs in a
big way back in 91, but a mere couple of years later,
the memory wall was there.
Anyway, I don’t think it mattered back in ’91 to be an
O(1) scheduler anyway. Even in 2000, *very* few Linux
users cared, and as I said, even today the biggest
thing that the O(1) rewrite brought was per-CPU
runqueues and locking. Considering it lasted that long,
(an eternity in CS), I don’t think it could have been
a bad engineering choice.
Oh, I wouldn’t say that. In many really high performance
applications it often will hit. You might have, say, a
web server which has a single thread per CPU, and that
multiplexes requests in and out of a single network
interface.
Sure, but we’re talking about the scheduler for a ’91-era PC sitting on someone’s desk, not a high-performance server. Besides, you’re thinking of the worst case and a poorly designed multithreaded server. A well-designed multithreaded server would empty the queue without blocking again.
Define easily. At runtime? (because it is very easy to
change the 2.6 scheduler at compile time)
Yes, ‘at run time.’ Since both schedulers are using fixed-size arrays and relying on compile-time constants, they require a recompile to tune queue counts. This can be fixed in both cases, but may not be worth it for either.
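A sketch of what fixing that would look like (the names and the use of calloc are just for illustration; a kernel would size this from a boot parameter and use its own allocator):

    #include <stdlib.h>

    struct task;

    struct sched_queues {
        int           nqueues;                /* read from a run-time parameter */
        struct task **head;
        struct task **tail;
    };

    static int init_sched_queues(struct sched_queues *s, int nqueues)
    {
        /* Size the queue-head arrays from a run-time value instead of a
         * compile-time constant; everything else stays the same. */
        s->nqueues = nqueues;
        s->head = calloc(nqueues, sizeof(*s->head));
        s->tail = calloc(nqueues, sizeof(*s->tail));
        return (s->head && s->tail) ? 0 : -1;
    }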
16 entry queue? Doubly linked list, so each entry takes
8 bytes on a 32-bit machine, for 4 cachelines on a
typical CPU of ’91.
That’s a naive implementation, and it’s only the queue-head array that needs to be searched, not the queue itself.
That isn’t insignificant, considering Linus used to be known to count icache line size of the context switch function and reject patches that increased it by 1.
Eh, he’s been known to do a few things that he’s later changed his mind on.
Even in 2000, *very* few Linux users cared,
in 2000 there were very few Linux users. But given that the telco and embedded guys were already using their own schedulers in 2000, I suspect you may be wrong about that.
>Sure, but we’re talking about the scheduler for a ’91
>era PC sitting on someone’s desk, not a high performance
>server. Besides you’re thinking of the worst case and a
>poorly designed multithreaded server. A well designed
>multithreaded server would empty the queue without
>blocking again.
No. I said if their TX queue is full (so, they’ve
filled up the outbound queue of the netif), or the RX
queue is empty (they’ve emptied the queue on the
netif).
What would you propose they do in those cases other
than block?
>That’s a naive implementation. and it’s only the queue
>head array that needs to be searched, not the queue.
Oh really? And what about inserting a new task into
that priority level? You have to insert it in the tail.
(Or you could do a semi LIFO scheduler, with
interesting fairness results).
>Eh, he’s been known to do a few things that he’s later
>changed his mind on.
Well yeah. He relented with the scheduler.
>in 2000 there were very few Linux users. But given
>that the telco and embedded guys were already using
>their own schedulers in 2000, I suspect you may be
>wrong about that.
Very few? Compared to which other commercial and/or
hobby operating systems?
I’d say even in 2000, it would have been in #2 spot, or
not far behind in the desktop+small server area. Don’t
forget 2000 (or was it ’99) was the year of RedHat’s
IPO, and Netcraft says apache installations were about
10 million active domains (which obviously aren’t all
Linux, but still).
Probably a bigger subset of users would be embedded
installations. Many of whom still use 2.4 and would
use the default scheduler.
Anyway, some of the groups interested in improving the
Linux scheduler (before the O(1) scheduler was adopted)
included IBM, SGI, and RedHat. The funny thing about
this is: because they were having so much trouble
trying to get Linus on side with a scheduler rewrite,
they ended up opting to keep the O(n) algorithm and
focus on per-CPU data structures, which is the main
thing they were having problems with!
(Eventually the O(1) scheduler did get in, ~2002).
The O(n)-ness really wasn’t a huge problem for most
people. The only benchmarks I’ve ever seen where it
made a significant difference are those that explicitly
create a large number of tasks and bring up the ctxsw
rate. Not so many significant improvements in real
workloads.
Determinism? Not really, because Linux was and pretty
well still is unsuitable for deterministic realtime,
especially if you can’t constrain your resources (like
# of tasks).
Academics hate it, though, naturally.
What would you propose they do in those cases other
than block?
Doh. I read it backwards… oops.
Oh really? And what about inserting a new task into
that priority level? You have to insert it in the tail.
(Or you could do a semi LIFO scheduler, with
interesting fairness results).
You use the same singly linked list layout that I used in the sample code, but if you care about cache footprint, you make head and tail arrays instead of members of a struct. Then you only have to search head to find a non-empty queue, and you use tail for tail insertion.
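My rendering of that layout (the referenced sample code is not reproduced here, so the names are made up):

    #define NQUEUES 16

    struct task { struct task *next; };

    static struct task *head[NQUEUES];        /* searched at pick time       */
    static struct task *tail[NQUEUES];        /* touched only when enqueuing */

    static void enqueue(int q, struct task *t)
    {
        t->next = NULL;
        if (head[q] == NULL) {
            head[q] = tail[q] = t;
        } else {
            tail[q]->next = t;                /* tail insertion keeps FIFO order */
            tail[q] = t;
        }
    }

    static struct task *pick_next(void)
    {
        for (int q = 0; q < NQUEUES; q++) {   /* only head[] is scanned */
            if (head[q]) {
                struct task *t = head[q];
                head[q] = t->next;
                if (head[q] == NULL)
                    tail[q] = NULL;
                return t;
            }
        }
        return NULL;
    }

Since head[] is a dense array of NQUEUES pointers, the search touches only a couple of cache lines, which is the footprint point being made.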
Probably a bigger subset of users would be embedded
installations. Many of whom still use 2.4 and would
use the default scheduler.
We were using 2.4, but many of the embedded people I was aware of had ripped out the scheduler. I know I did, and I’m sure Nokia did.
The funny thing about this is: because they were having so much trouble trying to get Linus on side with a scheduler rewrite,
Yeah, I know. That’s why we ripped it out and used our own. I’ve had some, um, “interesting” discussions with Linus on scheduling on various mailing lists, mostly LKML, over the past six years.
Determinism? Not really, because Linux was and pretty
well still is unsuitable for deterministic realtime,
especially if you can’t constrain your resources
Absolutely. But very few OSes are suitable for deterministic realtime.
I wish it were currently better for soft realtime, though. I hate it when mp3s glitch because I change the window focus.
OK… well thanks for continuing the conversation.
I think many of the things we disagree on are
subjective and/or open to different interpretation of
their semantics, eg. what is a good engineering
choice? How much cache footprint do you trade for
instructions (this will vary between architectures
and workloads)? etc.
I jumped to the conclusion that you were baselessly
bashing Linux, because that isn’t uncommon on public
forums. So I apologise for my rudeness earlier.
I don’t suppose anybody else is reading, but our
conversation has been interesting to me. Thanks!
I pretty much agree with your assessment.
I’ve enjoyed our conversation as well.
Until next time!
>>You don’t think I keep repeating this without having
>>a clue what I’m talking about, do you? I have
>>actually profiled the scheduler for things like this
>
>If I thought you didn’t have a clue, I wouldn’t have
>bothered writing up examples; because I would have
>assumed you wouldn’t understand them.
>
Sorry. My comment was not called for anyway.
“Yes. This, of course, is why using the O(n) scheduler was not good engineering. O(n) schedulers have been known for decades not to be an appropriate solution for multiprocess scheduling.”
Yet it survived for many years, while Linux grew from an irrelevant hobby project into something that people actually use in production systems. And it seems to have gone unnoticed by both inexperienced and experienced developers working on the kernel. And it didn’t seem to stop people from writing books on OS design using Linux as an example (alongside the NT kernel and BSD).
Maybe, just maybe, the O(n) scheduler was actually a good enough solution, better than the other “known for decades” solutions over the workloads that Linux was handling in the *real world*.
“However for n > about 2 on most systems, the O(1) algorithm is going to, on average, be faster.”
Please elaborate. What’s the constant cost of both algorithms (as they are implemented in Linux), and what’s the average expected case on, again, the *real world*. And what is the probability of the worst case happening on a regular system (as opposed to systems where the worst case actually becomes the expected case)?
You don’t go overkill unless you need to; otherwise you can end up with a really clever solution which looks good on paper and has been “proven for decades” (although not on Linux, and not on the same workloads as Linux handled at the time) but sucks, and gives people a false sense of good design that prevents them from just assuming it sucks, ripping it out, and replacing it.
Yet it survived for many years, while Linux grew from an irrelevant hobby project into something that people actually use in production systems.
That doesn’t make it an example of good engineering. Remember, if popularity was the criteria, Windows would clearly be the best engineered OS of all time.
Please elaborate. What’s the constant cost of both algorithms (as they are implemented in Linux), and what’s the average expected case on, again, the *real world*. And what is the probability of the worst case happening on a regular system (as opposed to systems where the worst case actually becomes the expected case)?
See my answer to Nick elsewhere in this thread where I discuss the details.
The short answer is that the algorithm I had in mind is always faster than Linus’ original algorithm, always provides better latency, and uses slightly more space, because it has C queue heads rather than one.
You fail to realise that Linux was not a commercial
system when the O(n) scheduler was designed.
Um, no. I know exactly what it was when the O(n) “scheduler” was designed: a hobby project undertaken by a young man with a great deal of attitude and very little expertise.
And actually, for many workloads the 2.4 scheduler is faster than the 2.6 scheduler due to less cacheline
footprint. And it was much simpler to understand.
Sorry, but that’s simply not true. The 2.4 scheduler isn’t any easier to understand or implement than the classic round robin scheduler and it wouldn’t have a smaller cacheline footprint in either the i-cache or the d-cache. It would also tend to very quickly become slower than the classic algorithm because it would take more branches and execute more instructions.
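To make the comparison concrete, here is a caricature of the two selection loops being argued about. Neither is the real 2.4 code, and weight() is just a stand-in for whatever dynamic-priority (“goodness”) calculation such a scheduler uses.

    struct task { struct task *next; int counter, nice; };

    static struct task *runnable;             /* single list of runnable tasks */

    /* Stand-in for a dynamic-priority ("goodness") calculation. */
    static int weight(const struct task *t)
    {
        return t->counter + (20 - t->nice);
    }

    /* O(n)-style selection: walk every runnable task and keep the best. */
    static struct task *pick_on(void)
    {
        struct task *best = NULL;
        int best_w = -1;
        for (struct task *t = runnable; t; t = t->next) {
            int w = weight(t);
            if (w > best_w) { best_w = w; best = t; }
        }
        return best;
    }

    /* Classic round robin: the runnable list is a FIFO, so selection is
     * just "take the head".  (Re-queueing at the tail when the timeslice
     * expires is omitted here.)  Constant time, but with no dynamic
     * priority at all, which is the other half of this argument. */
    static struct task *pick_rr(void)
    {
        struct task *t = runnable;
        if (t)
            runnable = t->next;
        return t;
    }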
The secret to being a good programmer isn’t by blindly implementing everything with the lowest complexity algorithms, but knowing what to apply and where.
Yes. Unfortunately, the tendency in Linux is to implement the lowest complexity algorithm first and then find out what to apply later.
>Um, no. i know exactly what it was when the O(n)
>”scheduler” was designed: a hobby project undertaken by
>a young man with a great deal of attitude and very
>little expertise.
And yet you are rabidly arguing that it was a bad
choice based on the fact that you don’t find it in
commercial systems.
You wrote:
The so called O(n) scheduler was a poor choice for any
process scheduler, which is why you don’t find it in
commercial systems more recent than, oh, 1965.
>>And actually, for many workloads the 2.4 scheduler is
>>faster than the 2.6 scheduler due to less
>>cacheline footprint. And it was much simpler to
>>understand.
>
>Sorry, but that’s simply not true.
Umm, yes, it simply is.
>The 2.4 scheduler >isn’t any easier to understand or
>implement than the classic round robin scheduler and
That’s a great strawman. The 2.6 scheduler is far from
a classic round robin scheduler.
>it wouldn’t have a smaller cacheline footprint in
>either the i-cache or the d-cache.
Here’s a clue for you though: the 2.4 scheduler is
faster at low runqueue counts, and has a smaller cache
footprint than the 2.6 scheduler.
>It would also tend to very quickly become
>slower than the classic algorithm because it would
>take more branches and execute more instructions.
“quickly” according to a theoretical analysis, maybe.
In practice few workloads have huge numbers of runnable
tasks on the runqueues. Which is obviously why it was
not replaced earlier.
Even now, the biggest benefit from the O(1) scheduler
*in practice* is the per-CPU locking and runqueues. Not
its constant-time complexity.
>Yes. Unfortunately, the tendency in Linux is to
>implement the lowest complexity algorithm first and
>then find out what to apply later.
I don’t think you have any idea what the tendency in
Linux is.
Take the virtual->physical map infrastructure that
was added in 2.6. This was initially an O(1) operation
to find 1 physical mapping. This was eventually
replaced by a better design that takes O(log(m)), where
m is the total number of mappings in an address space.
The tendency in Linux is to judge the kinds of
workloads a given algorithm will need to handle, and
try to design something fast but simple. If someone
reports a failure case for their workload, it is
reevaluated.
And yet you are rabidly arguing that it was a bad
choice based on the fact that you don’t find it in
commercial systems.
strawman. I mentioned once that it’s not found in commercial systems. I’m arguing it was a bad choice because classic round robin made better sense, even in the original implementation.
>The 2.4 scheduler
>isn’t any easier to understand or
>implement than the classic round robin scheduler and
That’s a great strawman. The 2.6 scheduler is far from
a classic round robin scheduler.
It’s not a strawman. I was arguing that the appropriate engineering choice, in 1992, for a first process scheduler for an OS would have been classic round robin, not the 2.6 scheduler.
“quickly” according to a theoretical analysis, maybe.
In practice few workloads have huge numbers of runnable
tasks on the runqueues. Which is obviously why it was
not replaced earlier.
In practice. I’ve shown the analysis elsewhere in this thread. But again, I have been comparing the O(n) scheduler to classic round robin, simply to point out that the O(n) scheduler wasn’t the appropriate first choice, even given the criteria you claim that “Linux” follows.
I don’t think you have any idea what the tendency in
Linux is.
You certainly have not offered any evidence to support this assertion.
The tendency in Linux is to judge the kinds of
workloads a given algorithm will need to handle, and
try to design something fast but simple.
If by “judge” you mean ‘guess with no prior experience and no familiarity with the literature’, then, yes, “judge” is the word to use here.
The original O(n) scheduler, when compared to classic round robin, is a good example of this. Classic round robin is as easy to implement and understand as the original O(n) scheduler and is typically faster than the original O(n) scheduler on any system with more processes than scheduler classes (i.e., nearly any interactive system).
>strawman. I mentioned once that it’s not found in
>commercial systems. I’m arguing it was a bad choice
>because classic round robin made better sense, even in
>the original implementation.
That is not a strawman.
In your first post, you said “The so called O(n)
scheduler was a poor choice for any process scheduler”
And the justification for that appeared to be that it
isn’t in commercial systems. This is what I was
referring to. You didn’t bring the round robin
scheduler into it until later, but more on that in a
minute…
>It’s not a strawman.
Of course it is. Do you know what a strawman is?
This is what you were replying to:
—
You wrote:
> I wrote:
>>And actually, for many workloads the 2.4 scheduler is
>>faster than the 2.6 scheduler due to less
>>cacheline footprint. And it was much simpler to
>>understand.
>
>Sorry, but that’s simply not true.
—
I said that the *2.4* scheduler is easier to understand
than the *2.6* scheduler. You claimed this is false.
You then justified your claim by saying that the *2.4*
scheduler isn’t easier to understand than a simple
*round robin* scheduler.
That is a strawman.
>I was arguing that the appropriate engineering choice,
>in 1992, for a first process scheduler for an OS would
>have been classic round robin, not the 2.6 scheduler.
And you would have been wrong, because Linux uses
dynamic priorities, and it is critical for IO and
interactivity for CPU heavy processes to be preempted
by IO bound processes.
If the scheduler design had been up to you, we’d have
had a simple, fast, O(1) round robin scheduler that
would have been absolutely horrible. Glad it wasn’t
you deciding what the “right engineering choices” are.
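The dynamic-priority point, reduced to a toy example (the formula and constants below are invented for illustration; this is not the 2.6 interactivity estimator):

    #define MAX_BONUS 10

    struct task {
        int static_prio;                      /* set from the nice level         */
        int sleep_avg;                        /* 0..100: rough share of recent   */
                                              /* time spent sleeping (I/O bound) */
    };

    /* Tasks that sleep a lot get a bonus, so when they wake they land in
     * a higher-priority queue than the CPU hogs and preempt them.  A
     * lower return value means "runs sooner". */
    static int effective_prio(const struct task *t)
    {
        int bonus = (t->sleep_avg * MAX_BONUS) / 100;
        return t->static_prio - bonus;
    }

A pure round robin queue has no place for this kind of feedback, which is why it interacts badly with mixed CPU/IO workloads.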
>If by “judge” you mean ‘guess with no prior experience
>and no familiarity with the literature’, then, yes,
>”judge” is the word to use here.
It is the word I use to describe your arguments, yes.
>The original O(n) scheduler, when compared to classic
>round robin, is a good example of this.
No it isn’t.
>Classic round robin is as easy to implement and
>understand as the original O(n) scheduler and is
>typically faster than the original O(n) scheduler on
Faster? Faster to perform the low level selection
maybe. Its performance will be much worse in a real
system, however.
For someone with obviously little knowledge of a UNIX-like
process and scheduling environment, you really
have strong opinions on these things.
>any system with more processes than scheduler classes
>(i.e., nearly any interactive system)
Wrong.
As I said in the other post, the ‘n’ is referring to
the number of processes on the runqueue.
>Classic round robin is as easy to implement and
>understand as the original O(n) scheduler and is
>typically faster than the original O(n) scheduler on
Faster? Faster to perform the low level selection
maybe. Its performance will be much worse in a real
system, however.
Um, no. For the reasons I’ve already outlined.
For someone with obviously little knowledge of a UNIX-like process and scheduling environment, you really have strong opinions on these things.
You’re funny when you’re throwing around insults. Care to actually argue the details, though?
>any system with more processes than scheduler classes
>(i.e., nearly any interactive system)
Wrong.
As I said in the other post, the ‘n’ is referring to
the number of processes on the runqueue.
Amazingly enough, I knew that when I made my point. Which typing “wrong” is not a refutation of.
You don’t really have anything here but insults and bald assertions.
Let me know if you want to actually discuss process scheduling.
>Um, no. For the reasons I’ve already outlined.
No, you did not outline any reasons. A round robin
scheduler will perform badly with IO bound tasks because
it won’t let them get in front of the CPU bound tasks.
Sure, the *mechanism* of selecting a task will be very
fast, but it will suck.
>Amazingly enough, I knew that when I made my point.
>Which typing “wrong” is not a refutation of.
Umm, I’ll repeat what you wrote:
“any system with more processes than scheduler classes
(i.e., nearly any interactive system)”
You were wrong in 1 of 2 ways. Either you thought
the scheduler complexity rose with the total number
of processes in the system (and that’s really what
it sounds like), or you thought that “nearly any
interactive system” has a significant number of tasks
on the runqueue at any time.
Here is a sequence of the ‘R’ field of `vmstat 1` on my
system while browsing the web, listening to music,
editing and compiling:
0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 2 0 0 0 0 3 0 0 1 0 0 5
0 0 0 1 2 2 2 2 3 1 1 …
You let me know when nearly all your interactive
systems reach 140 or so, and I’ll reconsider.
>It’s not a strawman. I was arguing that the appropriate
>engineering choice, in 1992, for a first process
>scheduler for an OS would have been classic round robin,
>not the 2.6 scheduler.
Oh and if you still don’t believe this is a strawman,
consider that you never once mentioned “round robin”
until then.
Are you really forgetful, or just dishonest?
Oh and if you still don’t believe this is a strawman,
consider that you never once mentioned “round robin”
until then.
Yeah. There I go assuming that the readership of OSNews didn’t need to have the simple stuff spelled out again. Sorry. I’ll remember that you need lots of detail and small words in the future.
From how quickly you’ve gone to copping an attitude, I would guess that the first time I modified a Unix process scheduler to run on an SMP (that would have been 1984, if you’re keeping track), you would have been, er, would you have been at all? Because you’re sure behaving like someone born in 1990 or later.
Scheduler activations? http://tinyurl.com/kscuf
If you’re going to schedule many threads on a large-scale SMP, you are eventually going to break down and implement M:N thread scheduling with a library-level (user land) and a kernel-level scheduler.
The reasons for this have been well known since the mid-80s, and were first documented by various people from Cray Research, based on work done on UNICOS on the Y-MP.
This is not the same as scheduler activations, but it provides a significant win in moderate-to-fine-grain parallelism on that class of machine.
Tell that to Sun Microsystems. This is one area, much like that of micro vs monolithic kernel, where what works in theory is very different from works in practice.
Tell that to Sun Microsystems. This is one area, much like that of micro vs monolithic kernel, where what works in theory is very different from works in practice.
Who said anything about theory? The M:N scheduler work done at Cray was applied to real world workloads and showed significant performance improvements.
You will also note that I stipulated that M:N is appropriate for certain workloads. I don’t claim, nor did the Cray people, that it was an appropriate scheduling approach for all problem sets.
The published papers on the Cray M:N scheduler do a good job of explaining why it works well and when to use it.
Mine must be broken, because I don’t have the lag everyone is talking about.
It seems the 2.6 kernel has many similarities with the Windows NT kernel (including XP). Both kernels group task priorities into “dynamic” and “realtime” classes (where “realtime” by no means provides any real guarantees in terms of WCET), both give I/O-intensive tasks priority boosts to provide better UI responsiveness, both are preemptible so that higher-priority tasks can interrupt a running task, and so on.
There’s no doubt the 2.6 kernel is an improvement over the old 2.4 kernel, but I can’t say either of them provides significantly better solutions to common scheduling problems than many other OS schedulers. Not that they have to, of course.
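For reference, the rough shape of the two priority spaces being compared (the numbers here are from memory, so treat them as approximate rather than authoritative):

    /* Linux 2.6:  140 levels in total; 0-99 are the "realtime" policies
     *             (SCHED_FIFO/SCHED_RR), 100-139 the dynamic range,
     *             mapped from nice -20..+19.
     * Windows NT: 32 levels in total; 16-31 form the real-time class,
     *             1-15 the dynamic/variable class, with temporary boosts
     *             (e.g. on I/O completion) applied within the latter.
     * In neither case does "realtime" imply bounded worst-case latency,
     * which is the WCET caveat above. */
    enum { LINUX_RT_PRIOS = 100, LINUX_TOTAL_PRIOS = 140 };
    enum { NT_TOTAL_PRIOS = 32, NT_REALTIME_BASE = 16 };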
For some reason the NT kernel is “unacceptable” to people on LKML, so I think you’ve just blasphemed against Linus’ name.
I was googling through their archives, but couldn’t find any reasons why NT is labelled “unacceptable.”
The scheduling model you describe was first implemented in VMS, back in ’78 or thereabouts. It’s not surprising that NT would use it, given the number of former VMS people working on NT.
Of course, it’s a well known, and well understood process scheduling algorithm that really only needed processor affinity added to deal with SMP scheduling.