The Linux kernel process scheduler, as you know it, has been ripped out and replaced with an entirely new one called the Completely Fair Scheduler (CFS). How fair it will be remains to be seen. Here’s what its creator Ingo Molnar says: “80% of CFS’s design can be summed up in a single sentence: CFS basically models an ‘ideal, precise multi-tasking CPU’ on real hardware.” Learn more about the new scheduler from the CFS design document.
which means jiffies are dead by 2.6.24 or so (I hope)
One might even say that they’ll be gone… in a jiffy. :)
I, for one, welcome our new tickless, interrupt grouping, completely fair overlords.
Sorry… :)
Hm, but I don’t see much of a connection between jiffies and the scheduler. Jiffies are for timekeeping, and the scheduler is for scheduling, right?
Would you care to elaborate?
I think he meant a completely tickless kernel (right now the tickless feature only works for idling, the goal would be to run tickless 100% of the time)
IIRC, the scheduler used to calculate times (such as thread quanta and accumulated time) in units of jiffies, while now time is calculated in physical units (ns).
Kinda hard to schedule anything if you can’t keep track of time, right? CFS doesn’t use jiffies or any other HZ-dependent timescale. But it does keep track of time in terms of nanoseconds.
The theory behind CFS is that each thread runnable on a given CPU is due an equal share of its time. As the CPU runs a particular thread, the scheduler deducts from its time. The thread that is due the most time is the one that deserves to run on the CPU.
That’s the 80%. The first caveat is scheduler granularity, the only tunable in CFS. Although CFS does away entirely with the notion of timeslices, the granularity represents the amount of time that a thread may run before a more deserving thread is eligible to preempt it. Of course, a thread can yield the CPU before this period expires. Decreasing this value favors interactivity and increasing it favors throughput.
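The fair-share model plus the granularity caveat can be sketched in a few lines of Python. This is a toy simulation of the idea described above, not the kernel's code; all names (`vruntime`, `pick_next`, the tick size) are invented for illustration.

```python
class Thread:
    def __init__(self, name):
        self.name = name
        self.vruntime = 0  # CPU time received so far, in nanoseconds

def pick_next(runnable):
    # The most deserving thread is the one that has received the least time.
    return min(runnable, key=lambda t: t.vruntime)

def simulate(runnable, total_ns, granularity_ns, tick_ns=1_000_000):
    """Charge CPU time to threads; a more deserving thread may preempt
    the current one only after granularity_ns has elapsed."""
    current = pick_next(runnable)
    ran = 0  # time the current thread has run since being picked
    for _ in range(total_ns // tick_ns):
        current.vruntime += tick_ns
        ran += tick_ns
        if ran >= granularity_ns:  # current is now eligible to be preempted
            nxt = pick_next(runnable)
            if nxt is not current:
                current, ran = nxt, 0
    return runnable
```

With three threads, a 4 ms granularity, and 300 ms of simulated time, each thread ends up with roughly 100 ms; shrinking the granularity makes the rotation finer (better interactivity) at the cost of more switches, matching the tunable's described trade-off.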
Another caveat is priorities. Threads may have various priority levels, which entitle them to more or less than an equal share of CPU time. As we know, some threads are more equal than others. Priorities in CFS are static, user-defined levels. There are no runtime heuristics in CFS. Threads wake up with the same level of unfairness as when they went to sleep.
Then there’s scheduler groups, which can contain multiple threads and (potentially) scheduler groups. At any level of the resulting tree, scheduler groups get the same fair share as a single thread, no matter how many threads they contain. Threads within a scheduler group get their fair share of the group’s fair share. This can be used to implement fairness among users or processes running various numbers of threads, for example.
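The group arithmetic above is simple to sketch: a group counts as a single entity at its level, and its members split the group's share. A minimal recursive version, assuming equal weights throughout (the real implementation supports per-entity weights):

```python
def shares(node, portion=1.0):
    """Map each thread to its fraction of the CPU.
    A node is either a string (a thread) or a list (a scheduler group)."""
    if isinstance(node, str):
        return {node: portion}
    result = {}
    for child in node:  # each child gets an equal slice of the group's share
        result.update(shares(child, portion / len(node)))
    return result
```

For example, `shares(['video', ['make1', 'make2', 'make3']])` gives the lone thread 50% while the three threads inside the group get about 16.7% each; the group as a whole still receives a single fair share, which is exactly the per-user fairness use case mentioned above.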
The CFS commit introduces the notion of scheduler modules, which allows multiple schedulers to run in series. The current implementation runs the realtime scheduler first to potentially select a runnable realtime thread and, failing that, runs the CFS scheduler to select a normal thread. The list of modules is currently static, but in the future the order of the modules might be managed dynamically. Scheduler modules implement a clean and simple API.
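The module ordering is easy to picture: each scheduler class is asked, in order, for a runnable task, and the first non-empty answer wins. A hedged sketch of that dispatch loop follows; the class names and methods are invented, not the kernel's actual API.

```python
class RealtimeClass:
    """Strict FIFO of realtime tasks: first in, first served."""
    def __init__(self):
        self.queue = []
    def pick_next(self):
        return self.queue.pop(0) if self.queue else None

class FairClass:
    """Fair class: always hands out the least-served task."""
    def __init__(self):
        self.tasks = []  # (vruntime, name) pairs
    def pick_next(self):
        if not self.tasks:
            return None
        self.tasks.sort()           # smallest vruntime first
        return self.tasks.pop(0)[1]

def schedule(classes):
    # Consult each scheduler class in priority order; first hit wins.
    for cls in classes:
        task = cls.pick_next()
        if task is not None:
            return task
    return 'idle'
```

A runnable realtime task, if present, always wins; only when the realtime queue is empty does the fair class get to choose, mirroring the rt-before-CFS ordering described above.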
CFS maintains the per-CPU scheduler design central to the Linux 2.6 architecture, but does away with the runqueue arrays and the amortized O(1) algorithms. The “runqueues” in CFS are red-black trees, a kind of balanced search tree. Scheduling involves selecting the left-most thread in the tree, which is an O(log N) operation. It remains to be seen whether O(1) is actually faster than CFS in terms of wall-clock time, and it is suggested that the scheduling improvement on heavily loaded systems will be worth any additional overhead.
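The kernel keys its red-black tree by received CPU time and takes the leftmost node. A binary min-heap (Python's `heapq`) gives the same O(log N) insert and pick-the-smallest behavior and is enough to sketch the shape of the run queue; treat this as an analogy, not the actual data structure.

```python
import heapq

class RunQueue:
    def __init__(self):
        self._heap = []  # (vruntime, name), smallest vruntime on top

    def enqueue(self, vruntime, name):
        heapq.heappush(self._heap, (vruntime, name))  # O(log N)

    def pick_next(self):
        # The "leftmost" task: the one that has received the least CPU time.
        return heapq.heappop(self._heap)[1]           # O(log N)
```

Picking the next task is thus a cheap ordered extraction rather than a scan, which is why the per-decision cost grows only logarithmically with the number of runnable threads.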
Ingo Molnar developed CFS in response to the problems with Con Kolivas’ popular Staircase Deadline (SD) scheduler. After some heated debate, and Ingo’s attribution of key CFS ideas as inspired by SD, Con essentially conceded that CFS would become the next Linux process scheduler. Ingo definitely has a knack for taking good ideas to fruition; he did the same thing when he assembled the ideas of Robert Love and others into what became the O(1) scheduler. Ingo also created a prototype for CFS that included a “scheduler economy,” which allows threads to literally trade CPU time on an open market.
CFSv19 has many tunables in /proc/sys/kernel besides ns_granularity… Initially, yes, it only had one, like SD, but as I’ve previously posted, Ingo has been adding “smoothness” calculations and other “hand tuning,” and also exposing more tunables. His “economy”-based ideas (actually the ideas of a college student) would only add more heuristics, ultimately leading to built-in unfairness. I don’t think the economy-based ideas made it into CFS, though.
A Completely Fair Scheduler shouldn’t have heuristics if it is to be truly fair. Initially, both SD and CFS suffered from built-in heuristics. SD has fixed those problems; CFS is still being tuned.
Perhaps these extra tunables are temporary until optimal values are found, but it would seem that SD is better in this regard (inherently more fair by design).
Initially RSDL did have some problems, but they were fixed on the way from RSDL to SD, partly thanks to Ingo and other developers giving Con some ideas.
BTW, in CFSv19 most operations are O(1). Ingo’s design document is out of date.
Can anyone correct me, does this mean it doesn’t count bogoflops anymore?
I truly hope it works out.
Linux is now at the forefront of OS development (and I mean the kernel, drivers, etc.).
It’s very nice to see an OS that is being actively developed and enhanced.
The fact that more and more emphasis is being put on performance is just great!
I’ve been using SD since it first made its appearance as RSDL. Con Kolivas showed that a completely fair scheduler really provides the best experience for all workloads… no more “hand tuning” for specific cases like the old mainline scheduler.
To date, SD is still better at some workloads (it is absolutely fair compared to CFS), especially OpenGL/beryl/compiz/gaming/music workloads.
SD has a higher context-switch rate (due to array switching) but “smoother” performance (yes, that is a metric Ingo uses too). CFS doesn’t have array-switching artifacts, since it uses an rbtree (no arrays), which leads to fewer context switches. In theory CFS is better, but in practice SD is my favorite.
I’ve tested each version of SD and CFS since their inception, and SD has consistently been ahead in terms of “smoothness” and FPS while doing compiz/totem/mplayer/xmms/muine torture testing.
Unfortunately, CK got some cold treatment from Ingo and other kernel developers (including Linus). Ingo essentially took CK’s idea, made a worse copy (not as fair), and has been bug-fixing it since the beginning. Each version adds more “hand tuning” to make up for its unfair nature, such as the many “granularity” kernel interfaces in /proc/sys/kernel. SD has only one interface in /proc/sys/kernel: rr_interval.
Meanwhile, SD has been stable at 1.00 for a long time now. As far as I’m concerned, SD is the reference and CFS is trying to match it. SD could probably be further improved with many micro-optimizations, if only CK had the will.
Ingo also took CK’s IDLEPRIO and IDLE_BATCH and implemented them.
I’m not saying anything is wrong with using GPL’d code and trying to improve on it, but CK was shut down early on, and now Ingo is free to implement his own version after the fact.
CK has now released his final-ever release, 2.6.22-ck1. I believe he is tired of this BS. So you see, I’m forced to use CFS because it has Linus’ blessing and is now on the fast track for mainline. Sigh.
CFSv19 finally matches SD for most workloads, but I can still manage to starve X and compiz during normal usage. Overall, however, CFS is a major improvement over the old mainline scheduler, which was inherently unfair.
“””
Unfortunately CK got some cold treatment by Ingo and other Kernel developers (including Linus).
“””
Not to take sides. But do you mean this cold treatment?
http://lwn.net/Articles/226963/
That argument concerned RSDL v.31, when renicing X to -10 helped Xorg a lot by gaining more CPU time for itself (to improve the user experience under load).
Now realize that Ingo also supports renicing X to -10. He even made a patch for the kernel to do it automatically (I’m currently using that patch with CFSv19).
It’s precisely that kind of doublespeak that got CK upset. Me too. Since then, renicing to -10 hasn’t been required, but it is still recommended (more so for CFS than for SD in their current versions).
I did actual benchmarks for each version of SD vs. CFS under the same conditions, with all priorities at the default of 0… and SD won.
Linus was simply complaining because more people were willing to give CFS a try and post benchmark results (mostly highlighting problems with fairness compared to SD), while SD benchmark results were much less publicized, probably because numbers weren’t needed; it simply felt “smoother.” CK did try calling for more people to post benchmarks, but by then SD had basically been written off from getting into mainline.
“””
Linus was simply complaining because more people were willing to give CFS a try and post benchmark results (mostly highlighting problems with fairness compared to SD), while SD benchmark results were much less publicized, probably because numbers weren’t needed; it simply felt “smoother.” CK did try calling for more people to post benchmarks, but by then SD had basically been written off from getting into mainline.
“””
That’s… unconvincing.
I’ve not followed all of this closely. But the major sticking point I have noted is the scenario where one process, which looks like a CPU-hogging server, does very latency-sensitive work on behalf of a highly interactive client. The premier example being X.
Are you saying that after all this we are no closer to solving this problem? Renicing X to -10 has been the “solution” to this particular problem for as long as I have been aware of it, which means back in the ’90s.
Excuse me for a moment while I throw myself onto my sword. :)
What I’m saying is:
In the past, a renice of X to -10 was recommended by CK for RSDL, to overcome X not updating often enough (low FPS).
At the time, Ingo and Linus were saying that we shouldn’t have to renice to -10. I agree.
Since then, SD has been fixed, and X at priority 0 is perfectly fine. Renicing to -10 doesn’t really help all that much anymore.
Since then, CFS has come out to fix the old RSDL problem, but it still benefits from renicing X to -10 more than SD does. Sure, it can be run at priority 0 too, but with less “smoothness,” i.e. lower FPS.
The link you brought up was true for RSDL, but that is no longer the case. I would say the opposite is true now: CFS should be fixed (and it is being actively developed), while SD has been put on the back burner because of a lack of interest (due to the negative reaction to CK and RSDL).
You do not understand the circumstances of this. Those people were complaining about the fairness: he wanted X to provide the responsiveness it requires, which takes something like 50% of the CPU, while everything else got its fair share of CPU, which simply leaves less for X.
Totally fair, but not what he wanted, and not how the default scheduler behaved.
For years I heard that Linux has nothing missing from Unix and that you can replace any Unix box with a Linux one. And now this.
Dudes, Solaris got a fair-share scheduler 5 years ago.
Dudes, Solaris got a fair-share scheduler 5 years ago.
Yep, we know Solaris has a better scheduler, and also better IO, ZFS, and other components.
However, Linux and Solaris are different beasts. You cannot replace one with the other in every situation.
For example, I would not run Linux on a Niagara machine that’s serving heavily threaded J2EE applications. (On the other hand, I would not develop those applications on Solaris either.)
Yet what you’ve heard is “mostly” true. You can use Linux instead of Unix in almost all the cases (and a little bit more than that – e.g: PDAs)
>You can use Linux instead of Unix in almost all the cases (and a little bit more than that – e.g: PDAs)
UNIX is just a trademark. If you’re referring to UNIX derivatives, there are the *BSDs (Free/Open/Net), Mac OS X (based on BSD and Mach), and some others.
Linux sucks under massive load; this is the rule. And this is the big advantage of the *BSDs and the former BSD derivative SunOS/Solaris. With CFS things are different: peak performance isn’t what matters anymore (it’s nonsense anyway), but proper performance under high load is. So in the end, more quality for Linux and less nonsense in terms of performance hype.
http://jeffr-tech.livejournal.com/10103.html
“Basically, it has the same performance as the FreeBSD 4BSD scheduler now. Which is to say the peak is terrible but it has virtually no dropoff and performs better under load than the default 2.6.21 scheduler.”
“””
So in the end more quality for Linux and less nonsense in terms of performance hype.
http://jeffr-tech.livejournal.com/10103.html
“””
The image referenced in that link is here:
http://people.freebsd.org/~jeff/sysbench.png
Very nice, I was about to post the same link.
I’m happy to see FreeBSD progressing as it is my fav of the two systems. I still think that Linux has a lot of good going for it though.
I think you will find that when it really comes to scalability and load, Linux does a lot better than the BSDs.
As you can see, scalability on FreeBSD is clearly fixed in FreeBSD 7. Majorly so. Scalability for Linux using default sources is not so good once you move beyond 8 cores.
Load? Run ULE under X and you get the smooth, responsive feel that CFS is only now offering. I have no problem whatsoever finding Linux handling load poorly under not-so-uncommon circumstances.
You forgot two much more important aspects: stability and security. FreeBSD is loads more stable (except for the preview releases 5.0 and 5.1, which were not quite there yet and were therefore designated as such).
Linux is a very nice system (kernel), and I get paid to make sure our Linux farm runs as it should. That does not for a single second make me stop wishing for a FreeBSD 7 migration.
“””
Since years I heard that Linux has nothing missing from Unix and that you can replace all Unix box with Linux one.
“””
http://en.wikipedia.org/wiki/Straw_man
While the schedulers are intended for two different operating systems (Linux vs. FreeBSD), is there any (academic) comparison between the ideas behind the two schedulers? What are the similarities/differences?
I already posted this on KernelTrap (http://kerneltrap.org/node/11712#comment-253505), but I have to post it again because the BSD nerds are driving me nuts…
*QUOTE ON*
That’s exactly why I hate the FreeBSD community. It took them several years to get something decent, and now they present *ONE* benchmark and claim that Linux is playing catch-up. That must have been great for your ego. Funny that this benchmark http://people.freebsd.org/~kris/scaling/nickel.png didn’t test Postgres on Linux, which doesn’t have the scalability issue. So it looks like they wanted to embarrass Linux with the one benchmark where it didn’t perform that well.
And please keep in mind that:
1.) FreeBSD 7 isn’t released yet.
2.) Linux is usable right now.
3.) sched_smp is not committed and still has issues (read the current mailing list).
4.) sched_ule still has a dropoff between 4 and 10 threads.
5.) the old sched_4bsd will be the default scheduler for FreeBSD 7.
6.) the author of this benchmark didn’t test the latest Linux kernel (which includes an important fix by Nick Piggin) whereas he used bleeding-edge FreeBSD…
7.) several people in #kernelnewbies couldn’t reproduce the problem (after Nick’s patch which is in 2.6.22 and an update to glibc 2.6)
8.) MySQL makes some funny system calls which fail on Linux and produce nothing but overhead. (Remove #ifdef HAVE_PTHREAD_SETSCHEDPARAM from mysys/my_pthread.c [MySQL source code] and recompile; it gives a decent boost.)
*QUOTE OFF*
CFS will most likely be tuned further, when more people have tested it on their high-end machines.
A bit more on Staircase Deadline vs CFS for those who want to understand why there has been a big problem around the way CFS has been fast-tracked into Linux:
http://freegamer.blogspot.com/2007/07/dedicated-to-con-kolivas.html
Not a technical article; it’s more about the history that preceded CFS.
Sorry, but about this: “People will debate back and forth the technical merits of CFS (by Ingo Molnar) and SD (by Con Kolivas), but the reality is that both solutions are good, but only SD is very well tested and refined. CFS is still new and raw.” I think overall CFS got a lot more testing. You have to keep in mind that the people who tested SD were primarily the people who already used the -ck patchset, which means the more desktop-oriented people. If you followed the LKML carefully during the whole debate, it was clear that SD was better for desktop usage (at first), but over time Ingo addressed nearly all (if not all) reported problems. Meanwhile, there were many for whom CFS was clearly better.
Anyway.. there is nothing we can do about it right now.
CFS is in the kernel. The best we can do is to carefully test it and report all problems to Ingo. He is a really nice guy, who is eager to fix all issues and make CFS the best scheduler possible. Just sitting around and feeling grumpy that SD didn’t make it doesn’t help at all.
I didn’t really care that much before. But all this poor sportsmanship from the CK side makes me feel fortunate that CFS has gone in instead.
I’ve been told that I just don’t understand the context of the link which I posted earlier in this thread:
http://lwn.net/Articles/226963/
I understand the context perfectly well. But it seems that some advocates of SD still do not understand it, and are continuing to provide examples to support Linus’ valid and forward looking perspective.
Con’s dramatic ride off into the sunset (last -ck patchset ever!) adds further confirmation of the wisdom of merging the patch most likely to be maintained into the future.
Edit: An addendum:
From Con, on the CK mailing list:
2. All interest I have in kernel development, even out of the mainline
spotlight, has been… abolished (I had nastier words but decided not to use
them.)
Ok, I’ll bite.
Con was always receptive to bug reports; he’s been doing this for years. Yes, the -ck patchset had fanboys, but not without reason. Linus may be a C god, but he is not as all-knowing as he comes across. I mean, the very tone of the email ( http://lwn.net/Articles/226963/ ) suggests Con has just thrown SD at the LKML and expected somebody else to take it on, but that couldn’t be further from the truth.
The reality is quite different and people should stop jumping on either bandwagon without reading the full history first.
Con never claimed SD was perfect, but it was a damn sight better than mainline. He was rubbished for his theories for years, yet now his very approach, fair scheduling, is the main basis of CFS.
“””
I mean the very tone of the email … suggests Con has just thrown SD at the LKML and expected somebody else to take it on, but that couldn’t be further from the truth.
“””
Really? Con is gone. Took his marbles and went home. No longer interested in kernel development either in or out of the main tree. I’m not sure how “that couldn’t be further from the truth” fits here.
I said:
“””
Con is gone. Took his marbles and went home. No longer interested in kernel development either in or out of the main tree.
“””
And shortly thereafter I dropped by http://www.kerneltrap.org to see what was on the front page.
Sometimes I love being wrong. :)
-Steve
and you probably should research WHY it is he left the scene…
“””
and you probably should research WHY it is he left the scene…
“””
While I did say that I had not followed the situation closely, it is hard not to be aware of the general outline, as much news coverage as the whole prolonged episode generated.
Are you saying that there is some bombshell shocker of a reason which I missed?
CFS indeed sounds like a very nice improvement, but something else I would like to see improved performance-wise is the method for swapping memory to disk. Almost every time some program actively starts eating swap, all my Linux systems and X become completely unresponsive. This may not be a big issue for day-to-day desktop use, but when running scientific applications in parallel that can suddenly grab 500+ MB of RAM, swap performance and system responsiveness really matter, if only to be able to kill one of them. I dunno, am I the only one who has these problems?
This thread is getting old, I know. But to the people who keep claiming that CFS got merged due to some “good ole boy” system in Linux kernel development circles, I would like to highlight this message, sent yesterday, which I just noticed on KernelTrap.
As a manager, Linus has absolutely no motive for showing anyone favoritism based upon anything but solid trust. No reason to merge things that are not in the best interest of the kernel.
And no one’s code is exempted from that standard. Note, in particular, the “good riddance” bit. Linus is pretty clear that if you can’t take the heat, you should stay out of the kitchen. No matter whether you are Con Kolivas or Ingo Molnar.
—
From: Linus Torvalds [email blocked]
To: Ingo Molnar [email blocked]
Subject: Re: x86 status was Re: -mm merge plans for 2.6.23
Date: Wed, 11 Jul 2007 14:42:53 -0700 (PDT)
On Wed, 11 Jul 2007, Ingo Molnar wrote:
>
> What you just did here is a slap in the face to a lot of contributors
> who worked hard on this code
Ingo, I’m sorry to say so, but your answer just convinced me that you’re
wrong, and we MUST NOT take that code.
That was *exactly* the same thing you talked about when
I refused to take
the original timer changes into 2.6.20. You were talking about how lots of
people had worked really hard, and how it was really tested.
And it damn well was NOT really tested, and 2.6.21 ended up being a
horribly painful experience (one of the more painful kernel releases in
recent times), and we ended up having to fix a *lot* of stuff.
And you admitted you were wrong at the time.
Now you do the *exact* same thing.
Here’s a big clue: it doesn’t matter one _whit_ how much face-slapping you
get, or how much effort some programmers have put into the code. It’s
untested. And no, we are *not* going to do another “rip everything out,
and replace it with new code” again.
Over my dead body.
We’re going to do this thing gradually, or not at all.
And if somebody feels slighted by the face-slap, and thinks he has already
done enough, and isn’t interested in doing it gradually, then good
riddance. The “not at all” seems like a good idea, and maybe we can
re-visit this in a year or two.
I’m not going to have another 2.6.21 on my hands.
Linus
C’mon, this is exactly what I expected from the FreeBSD community. Linux does *not* have problems scaling past 8 cores; this was a MySQL-only thing. Postgres, for example, scales fine. And do you think Oracle would be pushing Linux so hard if it didn’t scale? Give me a break… Please read my message earlier in this thread. And ULE in FreeBSD 6.x is *CRAP*; even the author admits that. Please check the facts!