This story begins, as they so often do, when I noticed that my machine was behaving poorly. My Windows 10 work machine has 24 cores (48 hyper-threads) and they were 50% idle. It has 64 GB of RAM and that was less than half used. It has a fast SSD that was mostly idle. And yet, as I moved the mouse around it kept hitching – sometimes locking up for seconds at a time.
So I did what I always do – I grabbed an ETW trace and analyzed it. The result was the discovery of a serious process-destruction performance bug in Windows 10.
Great story.


… you could argue that that OS simply isn’t fit for purpose – the idea that any modern OS is brought to its knees after a few days of use… well, it’s just laughable
Edited 2017-07-10 19:18 UTC
Or you could say that Chrome’s poor thread safety, requiring separate processes, and lack of proper memory management in a ‘burn the process and let the OS figure it out’ has drawbacks.
One could say that but it would be stupid.
Unless I misread the article, this was Chrome’s BUILD process creating lots of processes (compilers etc.)
It is not Chrome; it is the Chrome build system, which by default launches a process per virtual core and doesn’t care about load. Though it wouldn’t be a problem if W10 didn’t have this performance issue on process teardown.
While the author talks about Chrome builds, I highly doubt it’s limited to the software he’s specifically using since the issue is in the Windows OS portions of code, not application code.
Windows has always sucked at this. They’ve always preferred using threads.
But as we know (some only learned this many years later) using processes has a big advantage:
https://en.wikipedia.org/wiki/Privilege_separation
So making that fast is actually a really good idea.
So why not do it right and use a microkernel?
Because microkernels are slow.
Not always – if used in conjunction with Software Isolated Processes – see MS Research’s Singularity project here: https://courses.cs.washington.edu/courses/cse551/15sp/papers/singula…
On p2. is a performance comparison table (necessarily out-of-date now, but I see no reason similar principles shouldn’t hold).
Funny that you should mention that paper now.
I only learned of SIP within the last month or so when it turned up in a paper arguing ways in which Rust could be leveraged to reduce the need to trade off between security and performance.
Aaah, that persistent urban legend. Now we have a monolithic kernel that proves to hog on process destruction, but microkernels are slow.
Is there any up-to-date benchmark comparing recent incarnations of both kernel types?
Shit happens. Most people don’t use 24 core machines, much less create and destroy large numbers of processes on them. If you have 24 cores, you’re probably as likely running it on a backend server, with no GUI components and no mouse, with long lived highly threaded application server or VMs.
No OS is immune from horrible performance corner cases. Gtk+ on Linux (not sure about other Unixy systems) has some god awful bug that chews up CPU when otherwise idle, and despite it being reported numerous times on many bug systems, it’s still clearly not fixed.
I’m the first to bash Microsoft for general OS suckiness, and the mind boggles quite why the mouse needs the same lock as process exit, but in general I’ve found Windows 10 to be more than usable.
I would say anyone with a 24-core machine running Windows 10 is probably creating and destroying processes quite often, as these machines will be used for thread-intensive workloads, such as video editing, 3d rendering, or even building software such as Chrome.
Edited 2017-07-10 19:51 UTC
Thread intensive is different to process intensive. Build processes running compilers is different to even something like a java compiler, which will kick off a process and may compile many source files in a single process with many threads.
I’d say the difference is pretty much marginal. It’s not like you’ll do a 3d render then reboot your machine, is it? Machines like that could have uptime in weeks
“I’d say the difference is pretty much marginal”
No, the issue here seems to be caused by a lock at process destruction. Obviously a thread would not have this issue.
Threads are destroyed when processes are destroyed. OK, not quite the same thing, but I’m comparing apples to pears, not oranges
I often run make -j100 on my Linux machine. The compile jobs are distributed, but it still creates even more processes than the -j48 that they had in this example. Windows just had a nasty performance regression in process teardown here.
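For anyone who wants to poke at process churn on their own box, here is a rough sketch of that kind of measurement, assuming Python 3 is installed. It starts a batch of trivial child processes at once, much as a -jN build would, and times how long the batch takes to start, exit, and be reaped. The name `churn` and the batch sizes are made up for illustration, and this is nowhere near the ETW analysis from the article; it only gives a ballpark per-process cost.

```python
import subprocess
import sys
import time


def churn(batch_size: int) -> float:
    """Start batch_size trivial child processes at once, wait for all of
    them to exit, and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    children = [
        subprocess.Popen([sys.executable, "-c", "pass"])
        for _ in range(batch_size)
    ]
    for child in children:
        child.wait()  # block until this child has exited
        if child.returncode != 0:
            raise RuntimeError(f"child exited with {child.returncode}")
    # Elapsed time covers both process startup and process exit.
    return time.perf_counter() - start


if __name__ == "__main__":
    for n in (4, 16):
        elapsed = churn(n)
        print(f"{n} processes: {elapsed:.3f}s total, "
              f"{elapsed / n * 1000:.1f} ms each")
```

Running the same script on two operating systems on the same hardware would give a crude comparison; note that the interpreter's own startup time is included in every number.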
Windows’s process spawn/destroy has been stupidly expensive for decades. That’s what led to “shove everything into as few processes as possible” as the default model for complex Windows applications, which in turn degrades the value of memory protection and makes security that much harder to practice. The Unix tradition of lots of tiny processes and cheap IPC is simply a better fit for SMP, since 500 single threaded processes will happily scale out to dozens of CPU cores without their developers needing to make them “multicore aware.”
What? There’s no Unix tradition like that you speak of. And Unix (eee.. Posix) IPC being cheap? Only if we talk of the poorly designed interface – there’s a reason high-performance IPC is usually done with custom, non-standard libraries.
Windows processes are more heavy-weight than e.g. Linux processes. That’s not the reason threads are used on Posix systems. That’s not the reason threads scale better than processes on any (reasonable) operating system. The protection provided in a process will cost more than operating system supported threads and it will cost even more than application specific threading.
—
It would benefit us all if you read up on operating system design before posting bullshit. I’m f***ing fed up with opinionated people posting things that are obviously wrong for anyone knowing a tiny bit about the area. Worse than me being irritated is that people may believe what is essentially lies.
While there may not be a specific ‘tradition’ that mandates that, it follows directly from the unix philosophy. The whole mentality of building a program to do one task and do it as well as realistically possible naturally results in programs being small (in terms of memory usage) and numerous.
Also, go take a look at all those ‘custom non-standard libraries’, almost all of them use shared memory, sockets, or pipes, which are POSIX IPC, or use the network stack, which will usually perform worse than POSIX IPC. There are some examples like binder that use custom kernel extensions, but those are pretty rare because they require full control of the platform from the bottom up to work.
Kind of like the Erlang philosophy about everything shareable nothing more than a message, Ahferroin7. Security wise.
I suspect he was talking about DBus, which uses unix sockets.
Shared memory is shared memory – it can be used for IPC but not by itself as it has no synchronization primitives. That can be easily added by using a user-level polling design (wastes cycles and power). So one would probably use one of the kernel supported signaling mechanisms and try to avoid synchronization as much as possible as the overheads are high.
And yes that’s how most custom IPC code does it. Avoiding the costly Unix IPC primitives as much as possible. How does that fit into the tale I replied to that claimed Unix IPC is cheap?
L4 IPC is cheap. So cheap that an L4 based design builds, in userspace on top of that basic L4 IPC, other IPC primitives that are kernel based in other systems. Often being faster than the optimized kernel space design.
QNX IPC is cheap-ish. It is higher cost than L4 as it chooses to use a copying design. That has the advantage of providing reliable transport for free while most other designs either have to trust the other party not to change out the data or have to do an extra copy themselves. By doing a copy at the IPC level the data received is known not to change, which simplifies reliability and security support.
Unix IPC isn’t cheap. Not overly expensive either but then I never claimed it was.
The argument about the kernel has never been true in the way most people refer to the UNIX philosophy, except in odd cases like MINIX 3 and GNU Hurd. Except for such cases, UNIX systems have pretty much always used monolithic (like SVR4) or hybrid (like Linux, BSD, Solaris, and most others) kernels. That said, they are almost all very modularly designed, which is itself a manifestation of the UNIX philosophy, just at a different level than most people think of it.
As far as graphics, X never followed the UNIX philosophy, but most of what exists in it now is because of unnecessarily complicated display systems it had to deal with originally. I’ll agree that most GUI toolkits are over engineered, but just like the OS kernels, most of them do follow the same general philosophy internally, even if you can’t link to only part of them, because it makes development easier when you have a properly modular program.
You have to understand too that the concept of what constitutes ‘one thing’ has shifted. For example, tasks which require data to be parsed or transformed for input or output generally include that as part of the ‘one thing’ that constitutes the task itself, as it allows for far greater efficiency. Listing just a handful of the stuff running on my laptop and what one thing it is doing:
* Unbound: DNS caching
* rsyslog: multiplexes logging data
* monit: process supervision
* NetworkManager: network connection management
* udev: hardware state change notification
* CUPS: print spooling
* postfix: mail transfer
* netdata: performance monitoring
There are dozens of other examples I can make, but even when using systemd, the UNIX philosophy is still alive and well on Linux, just at a larger granularity of task separation than most people think.
tidux,
While it’s true you can deserialize code paths across many cores using many processes, it’s also true that trivially forking child processes scales very poorly due to the overhead.
Take a look at Apache’s MPM workers; in practice the overhead of spawning processes is a performance killer. This performance overhead leads to the need to prefork processes before they’re needed, caching processes, and reusing them instead of giving each client a clean process. It works, but it also adds complexity that proponents of the model often overlook when considering the multi-process approach.
Also, it still doesn’t scale that well and the memory overhead can be especially onerous. On the servers I manage with apache, I’ve had to significantly cut down the number of processes apache is allowed to spawn to prevent the linux OOM killer and swap thrashing.
You can always buy a beefier server, but switching to an asynchronous daemon can be just as effective by simultaneously handling hundreds of clients per core without any of the overhead of hundreds of processes.
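To make the asynchronous-daemon point concrete, here is a minimal sketch using Python's asyncio, which is only a stand-in for what nginx or lighttpd do in C. One process with one thread serves every client, and each client costs a coroutine rather than a process. The names `handle_client`, `one_client` and `demo`, the echo protocol, and the client count are assumptions made up for the example.

```python
import asyncio


async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    """Echo lines back to one client. Each client is a coroutine, not a
    process or a thread."""
    while True:
        line = await reader.readline()
        if not line:  # empty read: the client closed the connection
            break
        writer.write(line)
        await writer.drain()
    writer.close()
    await writer.wait_closed()


async def one_client(port: int, i: int) -> bytes:
    """Connect, send one line, and return the echoed reply."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"hello {i}\n".encode())
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    return reply


async def demo(num_clients: int = 100) -> list[bytes]:
    # Port 0 asks the OS for any free port; the backlog is sized so that
    # all clients can connect at once.
    server = await asyncio.start_server(
        handle_client, "127.0.0.1", 0, backlog=num_clients
    )
    port = server.sockets[0].getsockname()[1]
    async with server:
        return await asyncio.gather(
            *(one_client(port, i) for i in range(num_clients))
        )


if __name__ == "__main__":
    replies = asyncio.run(demo())
    print(len(replies), "clients served by one process, one thread")
```

The trade-off is the one raised elsewhere in this thread: anything that blocks inside a handler stalls every client, so the model only pays off when all the I/O is non-blocking.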
That’s Apache being shit, not a kernel limitation. Nginx and PostgreSQL do multiprocess far better, and of course Make and C/C++ compilers can scale up to dozens of threads trivially for building software on Linux/Unix.
Tidux,
I would say it is apache devs using every trick in the book to make the process per client model perform as best as it can. I cited apache because it’s probably the most popular example, but I would have cited postgres as well.
Although I didn’t mention it by name, nginx is what I meant as an asynchronous daemon that scales without using more processes. Lighttpd too.
Deserializing [problems at]: “Neuron”-ware will bring relief to this bump We are having at the CPU count.
Sure that’s GTK+ and not the stupid tracker app that runs in the background indexing stuff? Nuked it with kill -9 `ps ax | grep tracker | cut -f 1 -d ?` CPU dropped from 3.5ghz to 400-600mhz usage.
I bet it’s the GNOME Tracker process. KDE’s indexing has been far worse starting in 4.0. Another thing they don’t seem to like is large remote filesystems over <100Mbps links (e.g. sshfs over wifi).
One could do that but it would be stupid. Eh… did I just repeat myself?
This is a bug. Bugs aren’t unusual. It seems some optimization in the kernel wasn’t too well tested for all hardware configurations, resulting in serious performance problems.
It should be patched fast just as it would be if FreeBSD or Linux was found to have a similar type of bug (and at least Linux has in the past).
This is not a bug. Need a word for a thing this SIZE.
I was under the impression that it’s a well known fact that Windows’ CreateProcess and ExitProcess are notoriously bad compared to Linux’ -very- light weight fork() and exit().
In my experience CreateProcess is anywhere between 50-100 times slower than Linux’ fork/exec chain.
I do appreciate the effort spent by the OP in trying to locate one (of many) reason for the slowdown…
– Gilboa
Edited 2017-07-11 06:28 UTC
You know we can see who’s posting above your post? I know, I know – just a pet peeve.
Never claimed otherwise.
I would imagine that the MP code within the NT kernel is riddled with serialized code paths as it was never considered a major selling feature.
– Gilboa (<— Oops I did it again)
Edited 2017-07-11 14:17 UTC
As a humble Java programmer, can I ask 2 questions:
– What is the need for such high number of threads? Couldn’t a threadpool provide better utilization?
– Could fibers be used instead of threads in this case?
The point of the article is: Windows has always been slow for this use case. Other operating systems don’t have this problem (as bad).
Even older versions of the operating system don’t have it this bad.
Edited 2017-07-11 10:01 UTC
I understand that, but because there are many readers on this side with great technical insight, I am interested if someone knows why thread creation is used instead of fibres or a threadpool when threads are expensive either way.
I suspect part of the reason is probably because most compilers and build systems have been built that way for ages. Even when using Java commandline build tools people are used to spawning javac multiple times, if I’m not mistaken.
Because most programmers do what they were taught and are afraid of anything new. They learnt threads and critical sections, and they’ll use it and nothing else, gosh darn it.
kwan_e,
You are right, habits die hard. We’ll keep doing things the same way despite problems (be it programming languages, networking protocols, etc). Often times the energy to change course exceeds our willingness to adapt.
Even if you and I committed to doing things a better way, we’re still held back by everyone else’s work too. One thing I have a lot of experience in is AIO, and unfortunately Linux has poor and incomplete support for AIO from the kernel to libraries and there’s a terrible cost for going against the grain.
For example, you can implement a network daemon with nonblocking sockets, but the name resolver on linux is blocking, so if you want to use hostnames you end up with an AIO daemon that blocks or your AIO server has to use threads anyways in order to not block – it just makes things much more complicated.
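A minimal sketch of that workaround, in Python for brevity: the blocking resolver call is pushed onto a worker thread so the event loop keeps running while the lookup is in flight. The names `resolve`, `heartbeat` and `main` are made up for the example, and the numeric address is used only so the sketch runs without a DNS server.

```python
import asyncio
import socket


async def resolve(host: str, port: int) -> list[str]:
    """Run the blocking resolver on a worker thread so the event loop
    keeps servicing other tasks while the lookup is in flight."""
    loop = asyncio.get_running_loop()
    infos = await loop.run_in_executor(
        None,  # None selects the loop's default thread pool
        lambda: socket.getaddrinfo(host, port, type=socket.SOCK_STREAM),
    )
    # Each entry is (family, type, proto, canonname, sockaddr).
    return sorted({info[4][0] for info in infos})


async def heartbeat(ticks: list[int]) -> None:
    """Stand-in for the daemon's other work; it should keep ticking."""
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def main() -> tuple[list[str], list[int]]:
    ticks: list[int] = []
    addresses, _ = await asyncio.gather(
        resolve("127.0.0.1", 80), heartbeat(ticks)
    )
    return addresses, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This is exactly the "AIO server has to use threads anyways" situation: the thread is hidden behind an awaitable, but it is still there, and a slow DNS server can still tie up every worker in the pool.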
Even asynchronous daemons like nginx are negatively affected. It used to be that if you specified proxy hosts by name instead of IP, nginx would block (I learned that the hard way when a production web server stopped functioning when an upstream DNS server failed). Nginx has since implemented their own internal asynchronous resolver, but having to reinvent the wheel sucks for a lot of reasons. How does it find the DNS server? Does it support ipv4/ipv6? Does it support both recursive and non-recursive servers? What about DNS-SEC? netbios? What about the hosts file? Fixing the limitations of the native libraries oneself leads to a lot of reinventing
Edited 2017-07-11 15:17 UTC
I’ve been using Boost ASIO heavily this past year and it solves most of that for me. I don’t even touch threads anymore, other than to service the io_service. I just post tasks to the io_service or strands and use the async functions exclusively. I guess underneath the covers, Boost ASIO may use internal threads with blocking to achieve asynchronous resolving (although its locking on epoll is supposedly lightweight), but none of that detail leaks out.
I don’t even use mutexes or condition variables anymore. Just pass functors between strands if I ever need to communicate between tasks.
kwan_e,
I still need those because I still use threads, but in my own software designs I try to stay true to using them for CPU bound tasks rather than IO bound ones.
In order to compile code, the compiler will generate a thread for every snippet of code. A large amount of threads will be created and destroyed due to the massive amounts of code snippets in a project the size of Chrome. A threadpool probably wouldn’t help much due to the quantity of threads being created and destroyed.
Fibers are a Java-exclusive feature, and are really just threads restricted to user mode.
I understand you’re a Java programmer. If any of the terms used (such as “compiler”, or “code”) confuse you, I can try to explain it in simpler English if you like.
Edited 2017-07-11 10:02 UTC
1. Don’t need to be a douche.
2. Fibre is not part of Java, that is part of Windows API.
3. The point of a threadpool is not to create threads.
4. You don’t have kernel locks in user mode.
5. If terms used (such as “douche”) confuse you, I can try to explain it in simpler English if you like.
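To illustrate point 3, here is a small sketch in Python (rather than Java, and purely as an illustration): a thousand tasks are pushed through a pool of four workers, and the code counts how many distinct threads actually ran them. The function name `run_tasks` and the numbers are made up for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_tasks(num_tasks: int, num_workers: int) -> tuple[list[int], int]:
    """Run num_tasks small jobs on a fixed pool and report the results
    plus how many distinct threads actually did the work."""
    seen_threads: set[int] = set()
    lock = threading.Lock()

    def job(n: int) -> int:
        with lock:  # protect the shared set of thread ids
            seen_threads.add(threading.get_ident())
        return n * n

    # The pool creates at most num_workers threads and reuses them for
    # every task; no thread is created or destroyed per task.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(job, range(num_tasks)))
    return results, len(seen_threads)


if __name__ == "__main__":
    results, used = run_tasks(1000, 4)
    print(f"{len(results)} tasks ran on {used} thread(s)")
```

Note that this only helps with thread churn inside one process. It does nothing for a build that launches a separate compiler process per source file, which is the case the article is about.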
I’m sorry, but words “JAVA programmer” universally have such a horrible reputation that it’s only understandable for people to automatically assume there’s a very good chance a person is a douche. I understand there are exceptions, though.
Edited 2017-07-11 12:23 UTC
Not thread: process. The distinction is important.
That is a good point.
It’s not threads, but processes. Java is an odd case for a compiled language in that the compiler is inherently multi-threaded and does a good job of parallelizing things itself (Haskell is an example of another language that does this (at least, GHC does), although it’s not as good at parallelization). Chrome, however, is written in C and C++, where you traditionally spawn a compiler for each source file (because they’re not multi-threaded, and for other reasons that I won’t go into). The issue arises in the fact that Windows 10 for some reason is serializing the teardown of the process contexts when each process dies (I’d be more than willing to bet the serialization involves memory management somehow), while older versions, and pretty much every other OS in existence, have no such locking constraint (or at least, they aren’t locking such a large section of code that it causes issues).
“…The issue arises in the fact that Windows 10 for some reason is serializing the teardown of the process contexts when each process dies…”
Audit reasons? See no reasoning, on “continuous” audit.
Even if you enable auditing on Windows 7, it doesn’t behave that way. And even if it did, that’s no excuse, since there should be a separate thread/process/fiber/whatever multiplexing the messages so that locking isn’t needed to synchronize the code sending the auditing events.
Just based on what I’m seeing, I bet the routine took the lock multiple places for short periods of time, and some greenhorn at MS thought it would be more efficient to just take the lock once for the entire stretch of code instead of only protecting the stuff that needed protected.
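To be clear, that is a guess about code nobody here has seen. Still, the general effect is easy to show with a toy in Python: the same work is done once with the lock held around the whole stretch and once with the lock held only around the shared update. The function `run`, its parameter, and the 50 ms of "private work" are all made up for the illustration and have nothing to do with the actual NT kernel code.

```python
import threading
import time


def run(workers: int, hold_lock_for_everything: bool) -> tuple[int, float]:
    """Each worker does some slow private work, then bumps a shared
    counter. Returns (final counter, elapsed seconds)."""
    lock = threading.Lock()
    counter = 0

    def slow_private_work() -> None:
        time.sleep(0.05)  # stands in for work that touches no shared state

    def worker() -> None:
        nonlocal counter
        if hold_lock_for_everything:
            with lock:  # coarse: one acquisition around the whole stretch
                slow_private_work()
                counter += 1
        else:
            slow_private_work()
            with lock:  # fine: lock only the shared update
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter, time.perf_counter() - start


if __name__ == "__main__":
    for coarse in (True, False):
        count, elapsed = run(8, coarse)
        print(f"coarse={coarse}: counter={count}, {elapsed:.2f}s")
```

With the coarse lock the workers run one after another, so elapsed time grows with the number of workers; with the fine lock the private work overlaps. Both versions produce the same counter, which is why a change like this can pass functional testing and still wreck performance on a machine with many cores.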
Going with the “greenhorn” hypothesis.
I’m not “a programmer, working for Google, focusing on optimization and reliability” but let me play the devil’s advocate and ask:
“What’s the point of comparing Windows 10 on a 24 core computer and Windows 7 on an Intel Core 2?”
I mean, how is the comparison meaningful when he’s not comparing different version of Windows on the same hardware? How can he put the blame on Windows 10 and not that specific hardware configuration? And why quote Amdahl’s law when only two cores are available in the “good configuration”? To look smart and impress people?
These are honest, non rhetorical and definitely non trolling questions, by the way.
RT.
On old hardware, with less CPU power running an old version of Windows, you would expect the problem to be much worse.
But it was not!
This shows, that it is mainly a problem/bug of newer versions of Windows (after Win7).
So yes: in this particular case, the comparison made sense.
No it doesn’t. He didn’t try Windows 10 on the older hardware so we just don’t know what would happen with that particular workload. In other words, both of you are making assumptions and my questions still stand.
RT.
He probably should have tried Windows 10 and Windows 7 on the same hardware to get an accurate measurement of how the two differ. Especially since the problem correlates to the number of threads which is related to the number of cores. However, it is very likely that it would still show Windows 7 to be better in that specific use case. Windows 10 redesigned the end of process to be faster by in part not giving as much time to the rest of the system. This saves battery life.
With a system such as described, where many threads are created and instead of being closed as they finish are all closed at once in the end, the cumulative effect of all the thread closures being bunched together is an unresponsive system for a duration. Every code design decision carries with it the risk of unusual code performing worse.
irrelevant.
Or you are the one making the illogical assumption, Win10 works miraculously faster on much older and less powerful hardware.
Relevant. You are also just making an assumption. No matter how logical or “obvious”, it’s still an assumption that has NOT been tested in real life.
If Windows 10 is slower (likely) or faster on the much older hardware is not relevant. What is relevant is if the same underlying issue is present.
Does this still happen on 16 threads?
I can see the scaling issue happening more until fixed which kind of reminds me of multi core cpus and older games on windows getting odd errors when they just hit the market, under windows xp especially.
I also find it a bit odd since on XP 64 bit, I did run dual itanium 2 cpus (16 threads) and I didn’t run into many issues, but then again, that was a lab build.
I am working on a project that uses two different compilers to compile my app for two different platforms: msbuild and clang.
Both target platform compile settings are set to use the maximum number of cores available on my host machine (i7 3770- 4 core/8 hw-threads).
When msbuild is used, the windows desktop is perfectly usable.
When clang is used, the mouse and other apps become barely responsive.
It sounds like maybe this is the same problem that the article mentions.
I had to cut the number of cores used by clang to a third just so I can have a usable machine while I compile.
I’m looking forward to MS pushing a fix for this!
But the article doesn’t even touch on trying a different rig. Is this glitch exclusive to his 24-core machine?