“I took a little while to learn more about SD and CFS to see what the linux guys were up to. I have a couple of interesting comments, including some discussion of increased algorithm complexity in the CFS scheduler; it’s no longer O(1). Please keep in mind that these comments are entirely from reading code and documentation. I make no claim of knowing which scheduler actually makes the best scheduling decisions.” Jeff Roberson (jeffr), the developer of the new ULE 3.0 scheduler for FreeBSD 7, wrote down some insights about the Linux SD and CFS schedulers.
I don’t understand half the stuff these kernel hackers talk about, but I have been running -CURRENT for a while now, and since the ULE scheduler updates hit in mid-July, it has been awesome. The responsiveness, especially during a buildworld or buildkernel, is unbelievable. Great job, jeffr.
Not according to its author, AFAIK…
Edit: Oh, he meant insertion/removal of tasks in the rbtree. That’s true, and here’s the explanation: http://lkml.org/lkml/2007/4/13/252
> I’m not in love with the current or other schedulers, so I’m
> indifferent to this change. However, I was reviewing your release
> notes and the patch and found myself wondering what the logarithmic
> complexity of this new scheduler is .. I assumed it would also be
> constant time, but the __enqueue_task_fair doesn’t appear to be
> constant time (rbtree insert complexity).. […]
I’ve been worried about that myself and I’ve done extensive measurements
before choosing this implementation. The rbtree turned out to be a quite
compact data structure: we get it quite cheaply as part of the task
structure’s cache misses, which have to be touched anyway. For 1000 tasks
it’s a loop of ~10 – that’s still very fast and bounded in practice.
Here’s a test I did under CFS. Let’s take some ridiculous load: 1000
infinite-loop tasks running at SCHED_BATCH on a single CPU (all inserted
into the same rbtree), and let’s run lat_ctx:
neptune:~/l> uptime
22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51
neptune:~/l> ./lat_ctx -s 0 2
“size=0k ovr=1.61
2 1.41
Let’s stop the 1000 tasks and only have ~2 tasks in the runqueue:
neptune:~/l> ./lat_ctx -s 0 2
“size=0k ovr=1.70
2 1.16
So the overhead is 0.25 usecs. Considering the load (1000 tasks trash
the cache like crazy already), this is more than acceptable.
Ingo
To be fair, he only indicates that insertion/removal of threads to/from the run queue is O(log n) now, which would be correct if it is in fact using a red-black tree (I haven’t looked at CFS’s code to verify that). This doesn’t actually indicate whether the process of selecting the next thread to run is also O(log n), and I don’t think Jeff ever implied that it was. For instance, if selecting the next thread to run is something like picking the leftmost node of the tree, scheduling itself would still be O(1).
You are right that picking the leftmost node can be O(1) (for example, by caching a pointer to it), but as soon as you have to insert/remove elements too (which he has to), the overall complexity will be O(log n).
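To make the distinction concrete, here is a minimal sketch in C of the idea – not CFS’s actual code (CFS uses a real red-black tree and, as I understand it, caches the leftmost node in much this spirit); all names are illustrative. Enqueueing costs O(tree depth), while picking the next task is O(1) thanks to the cached pointer:

#include <stdio.h>

struct task {
	unsigned long vruntime;          /* sort key: smaller runs sooner */
	struct task *left, *right;
};

struct runqueue {
	struct task *root;
	struct task *leftmost;           /* cached minimum, for O(1) pick-next */
};

/* O(depth): walk down to a free slot; track whether we only went left,
 * in which case the new task is the new overall minimum. */
static void enqueue(struct runqueue *rq, struct task *t)
{
	struct task **link = &rq->root;
	int is_leftmost = 1;

	while (*link) {
		if (t->vruntime < (*link)->vruntime) {
			link = &(*link)->left;
		} else {
			link = &(*link)->right;
			is_leftmost = 0;
		}
	}
	*link = t;
	if (is_leftmost)
		rq->leftmost = t;
}

/* O(1): the task with the smallest vruntime is already cached. */
static struct task *pick_next(struct runqueue *rq)
{
	return rq->leftmost;
}

int main(void)
{
	struct runqueue rq = { NULL, NULL };
	struct task a = { 30, NULL, NULL }, b = { 10, NULL, NULL }, c = { 20, NULL, NULL };

	enqueue(&rq, &a);
	enqueue(&rq, &b);
	enqueue(&rq, &c);
	printf("next task vruntime: %lu\n", pick_next(&rq)->vruntime); /* prints 10 */
	return 0;
}

(A plain binary search tree keeps the sketch short; a red-black tree adds the rebalancing that bounds the depth at O(log n) in the worst case, and removal would likewise have to update the cached pointer.)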
O(1), O(log n)… who really cares? As long as the complexity is not exponential, it’s OK.
The adding/removing of threads is an exceptional event from the kernel’s point of view. Most of the time, the kernel deals with existing threads: sleep, wake-up, wait for I/O, run, switch, …
If the rate of added/removed threads is nearly the same as the rate of context switches, there’s something wrong with your system. 🙂
Another response by Ingo:
(Disclaimer: I’m the main author of CFS.)
I’d like to point out that CFS is O(1) too.
With current PID limits the worst-case depth of the rbtree is ~15 [and O(15) == O(1), so execution time has a clear upper bound]. Even with a theoretical system that can have 4 million tasks running at once (!), the rbtree depth would have a maximum of ~20-21.
The “O(1) scheduler” that CFS replaces is O(140) [== O(1)] in theory. (In practice the “number of steps” it takes to schedule is much lower than that, on most platforms.)
So the new scheduler is O(1) too (with a worst-case “number of steps” of 15, if you happen to have 32 thousand tasks running at once (!)), and the main difference is not in the O(1)-ness but in the behavior of the scheduler.
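The depths Ingo quotes are easy to sanity-check with a few lines of C, assuming a reasonably well-balanced tree of depth about log2(n) (a red-black tree guarantees a depth within roughly a factor of two of that):

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* task counts from the discussion: 1000 tasks, the 32k PID limit,
	 * and a hypothetical 4 million runnable tasks */
	unsigned long n[] = { 1000, 32768, 4000000 };

	for (int i = 0; i < (int)(sizeof n / sizeof n[0]); i++)
		printf("n = %7lu -> balanced depth ~ %.1f\n", n[i], log2((double)n[i]));
	return 0;
}

Compiled with -lm, this prints roughly 10.0, 15.0 and 21.9 – the same ballpark as the ~10, ~15 and ~20-21 figures quoted above.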
This kind of stuff really ticks me off.
O(log(n)) does not equal O(1) for small values of n!!!
That would be like saying: okay, it’s really O(n^2), but since n has an upper limit of 32,000 the worst case is O(1,024,000,000), which is still O(1).
EVERY algorithm has limits… that doesn’t make them constant-time algorithms!
I agree, but Ingo has a point. The reason we describe algorithms as O(1) or O(n) is to give an idea of what their real-world performance will be like. In this instance, because n will always be very small, the O(log(n)) scheduler can actually outperform the O(1) scheduler, which is not what you would expect, since when talking about algorithms we usually assume n is approaching infinity. (Roughly speaking: if the old scheduler’s worst-case constant is around 140 steps, as Ingo notes above, while the rbtree walk costs about log2(n) steps, the logarithm never catches up for any realistic n.)
So while it is technically an O(log(n)) algorithm, in practice it outperforms the O(1) algorithm. Therefore, even describing it as O(log(n)), while correct, is misleading if you are only trying to get a handle on how it actually performs.
“The reason we describe algorithms as O(1) or O(n) is to give an idea of what their real-world performance will be like.”
To be honest… no. Characterizing algorithms by an O value is meant to give an idea of how resource consumption (time needed, resources required, load created, etc.) grows with the amount of input (items inserted, input data given, etc.).
“In this instance, because n will always be very small, the O(log(n)) scheduler can actually outperform the O(1) scheduler, which is not what you would expect, since when talking about algorithms we usually assume n is approaching infinity.
So while it is technically an O(log(n)) algorithm, in practice it outperforms the O(1) algorithm. Therefore, even describing it as O(log(n)), while correct, is misleading if you are only trying to get a handle on how it actually performs.”
This is correct, as long as you can assume certain values or value ranges for n. As someone else replied, even O(n^2) or O(x^n) may look “good” – if n is small enough. 🙂
Remember: the O notation does not take specific values of n into account.
Yes, I’m well aware of everything you said. I guess what I should have said is that the O value for this algorithm is practically meaningless, since for the possible values of n there is little difference.
Theoretical analysts and people writing general-purpose algorithms are certainly going to be interested in O values, but when I said we use it to get an idea of real-world performance, what I meant was that developers using an algorithm really only care about how it applies to their particular uses.
O notation ignores constant factors and constant addends, so you could always be below the crossover point of a less efficient algorithm. There are also probabilistic effects. Did you know that qsort is technically O(n^2) under worst-case circumstances? Nevertheless we use it over merge sort, which is O(n log n) in all cases, because quicksort uses less memory and is faster on average thanks to its constant factors.
Also, in some C runtimes (I’ve seen this in the MSVCRT), the sorts for the small subproblems of qsort are implemented as selection sorts, because those have better cache and pipeline behavior for small data sets.
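As an illustration of that hybrid approach, here is a sketch in C – not MSVCRT’s actual implementation; I use insertion sort for the small partitions (a common choice), and the cutoff of 8 is an illustrative ballpark rather than a tuned value:

#include <stdio.h>

#define CUTOFF 8	/* partitions at or below this size go to insertion sort */

static void insertion_sort(int *a, int lo, int hi)
{
	for (int i = lo + 1; i <= hi; i++) {
		int key = a[i], j = i - 1;
		while (j >= lo && a[j] > key) {
			a[j + 1] = a[j];
			j--;
		}
		a[j + 1] = key;
	}
}

static void swap_ints(int *x, int *y) { int t = *x; *x = *y; *y = t; }

static void quicksort(int *a, int lo, int hi)
{
	while (hi - lo + 1 > CUTOFF) {
		/* middle-element pivot, Hoare-style partition */
		int pivot = a[lo + (hi - lo) / 2];
		int i = lo, j = hi;

		while (i <= j) {
			while (a[i] < pivot) i++;
			while (a[j] > pivot) j--;
			if (i <= j)
				swap_ints(&a[i++], &a[j--]);
		}
		/* recurse into the smaller side, iterate on the larger one,
		 * keeping the stack depth O(log n) */
		if (j - lo < hi - i) {
			quicksort(a, lo, j);
			lo = i;
		} else {
			quicksort(a, i, hi);
			hi = j;
		}
	}
	insertion_sort(a, lo, hi);	/* finish the small remainder */
}

int main(void)
{
	int a[] = { 9, 3, 7, 1, 8, 2, 6, 4, 5, 0 };
	int n = (int)(sizeof a / sizeof a[0]);

	quicksort(a, 0, n - 1);
	for (int i = 0; i < n; i++)
		printf("%d ", a[i]);
	printf("\n");
	return 0;
}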
Big O values are always the worst case. There are a couple of other Greek symbols I’ve forgotten that are used for the best-case and average-case values. I remember the average case being quite a bit more difficult to calculate.
The entire notation system isn’t meant to be precise; it is meant to give a rough estimate of how well an algorithm scales to large inputs. Thus, an O(50000*n) algorithm still reduces to O(n) (linear), because for a sufficiently large value of n it will clearly separate itself from constant, logarithmic, quadratic, or exponential algorithms. Of course, it might make sense to use an O(n^2) algorithm if you knew n would never get very large in your application, or there could be other linear algorithms that are far superior.
Big O values are not always specified for the worst case. You can say that insertion sort is O(n) in the best case, for example. But, given the circumstances, Big O notation gives an upper bound on the complexity of the algorithm.
The other symbols are Θ (theta), Ω (big omega), o (little o) and ω (little omega).
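For reference, the standard informal definitions for functions f and g of n (textbook material, nothing scheduler-specific):
f(n) = O(g(n)): f(n) <= c*g(n) for some constant c > 0 and all sufficiently large n (upper bound).
f(n) = Ω(g(n)): f(n) >= c*g(n) for some constant c > 0 and all sufficiently large n (lower bound).
f(n) = Θ(g(n)): both of the above hold (tight bound).
f(n) = o(g(n)): f(n)/g(n) → 0 as n → ∞ (f grows strictly slower than g).
f(n) = ω(g(n)): f(n)/g(n) → ∞ as n → ∞ (f grows strictly faster than g).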
Are you sure you didn’t mean heap sort (selection sort is O(n^2))? There’s an algorithm called introspective sort that starts out as quicksort but then switches to heap sort; its worst case is O(n log n), and it is comparable to quicksort on all inputs.
Actually, O(log(n)) might outperform O(1) even for large n, or even for every physically possible n. By definition, O() discards constants and lower-order terms as insignificant. But in the real world, if the setup time for the O(1) scheduler is one day and that of the O(n^2) scheduler is zero, it doesn’t matter that O(n^2) has horrible scaling; it’s still better than the O(1) scheduler for virtually any server or desktop out there.
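A toy illustration of that point in C, with made-up numbers (one day of setup for the “O(1)” algorithm versus roughly a nanosecond per unit of n^2 work, and no setup, for the quadratic one):

#include <stdio.h>

int main(void)
{
	const double day_us = 86400.0 * 1e6;	/* one day, in microseconds */

	for (unsigned long n = 1000; n <= 1000000; n *= 10) {
		double cost_o1 = day_us;				/* constant, but enormous */
		double cost_n2 = 1e-3 * (double)n * (double)n;	/* ~1 ns per step */
		printf("n = %7lu   O(1): %12.0f us   O(n^2): %12.0f us\n",
		       n, cost_o1, cost_n2);
	}
	return 0;
}

With these (invented) constants the quadratic algorithm wins at every size shown; the crossover would only come near n ≈ 9 million, far beyond any realistic runqueue.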
As far as the ULE 3.0 versus CFS issue is concerned, I have no opinion, since I have no knowledge (I haven’t even tested them out). But I personally welcome the critique. If it’s invalid, it’ll be easily refuted, but if it’s valid, it’ll highlight a place that needs fixing.
I wonder if FreeBSD supports the Core 2 Duo, X2, or Core 2 Quad.
I wonder if FreeBSD supports the Core 2 Duo, X2, or Core 2 Quad.
Yes, including dual quad-core systems. Why would it not? It has supported SMP for a very long time, and the recent Intel and AMD chipsets are supported well.
Lol, this is the funniest question I’ve seen since I started reading OSNews :o)
I heard FreeBSD 7 uses GCC 4-something. I wonder if that will help with all the broken applications (like yammi). I am on DesktopBSD, which uses 6.2, and it’s no better or worse than Linux on average for desktop responsiveness (but is better in other ways).
That sounds like a ports issue to me, not really a toolchain one. Have you filed a PR against yammi? I have not tried yammi, but I have not had any issues with ports whatsoever, as long as I read /usr/ports/UPDATING.
Like I posted above, I am using 7.0-CURRENT and have not encountered any major issues in ports. I can even get my game face on with UT, Quake3, Tremulous, etc. I am really, really looking forward to the official release of FreeBSD 7.0. FreeBSD rocks.
I should have explained better. I like Amarok but want to try ALL the possibilities Linux has, like Exaile and Songbird, Totem for movies, and VLC. So far most have been good, but I occasionally get a compile error during the build. This might be due to the original software being developed against GCC 4 and not GCC 3, since GCC 4 is the standard on Linux now.
I’m not sure if reading /usr/ports/UPDATING will help if the port you install is new.
Agreed, FreeBSD rocks, and I will try to use it after my last “update” to Fedora Linux left me with a broken gdb and an unusable GNOME.
“…my last ‘update’ to Fedora Linux left me with a broken gdb and an unusable GNOME.”
There are more Linux distros, you know :p
Though FreeBSD is a very fine OS.
“I’m not sure if reading /usr/ports/UPDATING will help if the port you install is new.”
Have you tried installing a precompiled package (pkg_add -r)? In most cases you don’t need to compile your applications; the precompiled packages are usually quite up to date. Compiling from source is, in most cases, only needed on slower machines where you want to squeeze out some speed through optimization, or if you need special options configured at compile time (e.g. mplayer).
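For example, fetching and installing a precompiled package over the network is a one-liner (the package name is just an example):
# pkg_add -r vlc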
GCC 4 does not change performance for ordinary applications, since most applications are not CPU-bound.
If you are having problems with a particular application, please file a bug report in the usual way.
If the constants are small and the input is bounded (here: 32000), you can set theory aside and choose the best algorithm for these inputs. Small constants matter more than a fancy, complicated algorithm with lower asymptotic complexity.
Ergo: never mind whether it is O(1) or O(log n); for input this small, Ingo is right. In practical work, simulations (testing) would be the way to choose the best scheduler, rather than theory.