Compiler bug? Linker bug? Windows kernel bug.

Thom Holwerda 2018-02-26 Windows 12 Comments

Flaky failures are the worst. In this particular investigation, which spanned twenty months, we suspected hardware failure, compiler bugs, linker bugs, and other possibilities. Jumping too quickly to blaming hardware or build tools is a classic mistake, but in this case the mistake was that we weren’t thinking big enough. Yes, there was a linker bug, but we were also lucky enough to have hit a Windows kernel bug which is triggered by linkers!


12 Comments

2018-02-26 8:02 pm

malxau
For those reading the article to find the bug, it’s not yet diagnosed. The FlushFileBuffers workaround is heavyweight (I’m surprised he didn’t mention how slow it is), but it also alludes to the problem.

I don’t know the bug either, but can describe from a kernel point of view why these cases are complex.

A linker typically writes to files via a writable memory mapped view. These pages are written directly into the system cache and are considered dirty. The pages are free to be written out by the system at any time; the file system is responsible for ensuring that if a page is read from disk and the page was written, the data must be returned; but if the page has not been written, zeroes are returned. That’s conceptually simple, but the implementation is complex due to various factors. If you want to know more about this, I wrote about it at https://www.osr.com/nt-insider/2015-issue2/maintaining-valid-data-le… .
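As a rough illustration of the write pattern described above (not the actual linker code), the following Python sketch writes a file through a writable memory-mapped view and then optionally forces the dirty pages out. The function name `write_via_mapping` and its parameters are made up for this example. CPython documents `os.fsync` as calling `_commit()` on Windows, which is assumed here to play the same role as the FlushFileBuffers workaround.

```python
import mmap
import os


def write_via_mapping(path: str, payload: bytes, flush: bool = True) -> None:
    """Write payload to path through a writable memory-mapped view."""
    if not payload:
        raise ValueError("payload must be non-empty; an empty file cannot be mapped")
    # Size the file first, because the mapping cannot grow it.
    with open(path, "wb") as f:
        f.truncate(len(payload))
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(payload)) as view:
            view[:] = payload      # pages are now dirty in the system cache
            if flush:
                view.flush()       # ask the OS to write the dirty view back
        if flush:
            os.fsync(f.fileno())   # then flush the file's buffers to disk
```

With `flush=False` the data still lands in the file eventually, but the OS decides when. That window between "written into the cache" and "written to disk" is where the comment above says things get complicated.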

The next complication with linkers is executable file layout. Typical Windows PE files are composed of a series of sections with different page permissions, but these are aligned at 512 bytes within the file. When an executable page is read from disk, it is stored at 512 byte alignment, but the corresponding data that the linker generated was at 4Kb alignment. So it’s not strictly true that because a (single) data page was written, an unaligned executable page can be read: both data pages that compose an executable page must be written before the executable page is read. This is default behavior, but it is optional; executables can be written with 4Kb alignment, they’ll just be somewhat larger. Some compilers have done this by default (e.g. see http://www.malsmith.net/blog/good-compiler-bad-defaults/ ).
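The alignment mismatch is easy to see with a little arithmetic. The sketch below uses a hypothetical helper, `file_pages_for_image_page`, and assumes 4096-byte pages and that a section's data is contiguous in the file. It lists which 4 KB pages of the file supply the bytes for one 4 KB page of a section.

```python
from typing import List

PAGE_SIZE = 4096  # assumed size of both cache pages and memory pages


def file_pages_for_image_page(section_file_offset: int, page_index: int,
                              page_size: int = PAGE_SIZE) -> List[int]:
    """Return the file page numbers that back one page of a section.

    section_file_offset is where the section's raw data starts in the file;
    page_index selects a page within the section, counting from zero.
    """
    start = section_file_offset + page_index * page_size
    end = start + page_size - 1  # last byte of the section page
    return list(range(start // page_size, end // page_size + 1))


# A section starting at file offset 0x600 (512-byte aligned) straddles two
# file pages, so both must have been written before it can be read.
print(file_pages_for_image_page(0x600, 0))   # [0, 1]
# With 4 KB file alignment each section page maps to exactly one file page.
print(file_pages_for_image_page(0x1000, 0))  # [1]
```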

If I were running the Chrome build process, I’d be tempted to compile this particular binary with 4Kb alignment and see if it fixes the problem. Another thing to check is that the file is not NTFS compressed (which requires writing 64Kb of uncompressed data and totally changes this logic.)
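To check which alignment a given binary actually uses, the PE optional header can be read directly. This is a minimal sketch, with `read_alignments` as an illustrative name; it relies on the documented PE layout, where `SectionAlignment` and `FileAlignment` sit at offsets 32 and 36 of the optional header for both PE32 and PE32+.

```python
import struct
from typing import Tuple


def read_alignments(image: bytes) -> Tuple[int, int]:
    """Return (SectionAlignment, FileAlignment) from a PE image's headers."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and 20-byte COFF header.
    optional_header = pe_offset + 4 + 20
    section_alignment, file_alignment = struct.unpack_from(
        "<II", image, optional_header + 32)
    return section_alignment, file_alignment
```

Reading the first few kilobytes of an EXE or DLL and passing them to this function should be enough; a `FileAlignment` of 512 indicates the default layout discussed above, and 4096 indicates the page-aligned variant.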

What’s frustrating about reading this article is that the bug could be in one of a handful of kernel components, but the investigation hasn’t gone far enough to even pinpoint which. (I hope I wasn’t the one who wrote it!)

Edited 2018-02-26 20:08 UTC
2018-02-26 8:44 pm

Poseidon
I’m a bit surprised that there haven’t been bigger errors; Microsoft has been dragging their kernel to modernity really quickly. I hope they put even more emphasis on QA and get this fixed soon.

2018-02-26 9:19 pm

damp
I’m just surprised and a little envious; I want his skills.

2018-02-26 10:04 pm

tidux
> Building Chrome very quickly causes CcmExec.exe to leak process handles. Each build can leak up to 1,600 process handles and about 100 MB. That becomes a problem when you do 300+ builds in a weekend – bye bye to ~32 GB of RAM, consumed by zombies. I now run a loop that periodically kills CcmExec.exe to mitigate this, and Microsoft is working on a fix.

What the actual fuck? This would be considered unacceptable on Haiku, let alone Linux or OpenBSD.

2018-02-27 10:29 am

avgalen
> Building Chrome very quickly causes CcmExec.exe to leak process handles.

> What the actual f–k? This would be considered unacceptable on Haiku, let alone Linux or OpenBSD.

This is unacceptable on Windows as well, but it is not something an OS/Kernel should be bothered with. Building Chrome is a usermode process. If that process causes other programs to leak resources, that is another usermode process. Usermode processes are configured by default to consume as many resources as are available. In the end that means that one usermode process can consume almost all resources, which is what you would want in all normal scenarios (no leaks).

As long as the OS can still control those usermode processes the OS is working perfectly.

(ccmexec.exe isn’t even present on systems by default, it is a tool that enterprises use to monitor their systems for updates)

> The underlying bug is that if a program writes a PE file (EXE or DLL) using memory mapped file I/O and if that program is then immediately executed (or loaded with LoadLibrary or LoadLibraryEx), and if the system is under very heavy disk I/O load, then a necessary file-buffer flush may fail. This is very rare and can realistically only happen on build machines, and even then only on monster 24-core machines like I use.

Well, why wasn’t there a unittest for this exact scenario? /s

Edited 2018-02-27 10:42 UTC

2018-02-27 5:12 pm

tidux
> Well, why wasn’t there a unittest for this exact scenario? /s

You don’t need a unit test, just a system design that doesn’t constantly thrash disk like a retard. This architecturally can not happen on Linux.

2018-02-27 6:33 pm

Zan Lynx
Pretty confident about Linux there, aren’t you?

Haven’t you read about or experienced Linux’s enjoyable bugs with O_DIRECT, and mixing memory mapped with read() / write() IO? I think I recall some bugs with Linux AIO io_submit() too.

Sure, those were fixed. But at one point in time there were inconsistent views of IO, just like what this Windows bug sounds like.

Ooh, while Googling about I found another one about transparent huge pages and O_DIRECT causing screw-ups in Linux.

I like Linux, but don’t put it on a pedestal.
2018-02-27 7:42 pm

avgalen
> You don’t need a unit test

Maybe you didn’t know it, but /s indicates sarcasm. Of course there wasn’t a unit test for it, because the circumstances are way too extreme for a unit test.

> just a system design that doesn’t constantly thrash disk like a retard.

It isn’t a system design that thrashes the disk like a retard. The guy is compiling Chrome, which normally thrashes the entire system (under Linux as well). The mentioned bug has the specific condition “if the system is under very heavy disk I/O load”.

> This architecturally can not happen on Linux.

Of course it can. There is nothing in the architecture of Linux that prevents 1 usermode process from taking up almost all the system’s resources, effectively blocking a 2nd usermode process from performing well. Just like under Windows, this is the normal behavior, and as long as the OS is still capable of controlling both usermode processes they will both continue to run and do their work. Now there are certainly differences in how cpu/mem/io/caches are allocated, but those differences cannot guarantee that both programs will get enough resources.

(here is a nice, although dated, architectural comparison with some scheduler characteristics: https://www.ukessays.com/essays/information-systems/compare-cpu-sche…)

2018-02-28 5:35 am

kwan_e
> There is nothing in the architecture of Linux that prevents 1 usermode process from taking up almost all the systems resources,

https://en.wikipedia.org/wiki/LXC
2018-02-28 10:40 am

badpixel
/etc/security/limits.conf

I used it more than 10 years ago to avoid system lockups when some process leaked memory extensively.

The newer approach is cgroups.

And LXC is another beast, meant for lightweight virtualization.
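For a sense of what such a per-process limit does, here is a small Unix-only sketch using Python's standard `resource` module. It sets the same kind of address-space limit (`RLIMIT_AS`) that limits.conf can configure per user, but applies it to a single child process instead. The helper name `run_with_memory_cap` and the sizes used are illustrative assumptions, and this is not a substitute for cgroups.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(code: str, limit_bytes: int) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child whose address space is capped."""

    def cap_address_space() -> None:
        # Runs in the child just before exec. Only the soft limit is lowered,
        # and never above an existing hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        new_soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            new_soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=cap_address_space,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # A child that tries to allocate 4 GiB under a 1 GiB cap should fail
    # on its own instead of dragging the rest of the system down.
    result = run_with_memory_cap("x = bytearray(4 * 1024**3)", one_gib)
    print(result.returncode, result.stderr.strip().splitlines()[-1:])
```

The runaway process is the one that fails, and everything else keeps its memory. The trade-off is the one raised later in the thread: someone has to pick the limit for a specific workload.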
2018-02-28 4:35 pm

avgalen
Interesting links! It doesn’t seem like any of that would prevent the mentioned issue by default though.

The main problem is always that you want all resources to be available to every program…but you don’t want 1 program to take up resources that another program needs.

So you end up configuring a generic system for a specific workload

or you end up with a system where 1 program can limit the performance of the 2nd program

or you end up with a system where a program runs non-optimally

2018-02-26 10:34 pm

moronikos
I loved the description. They didn’t find the bug, but by reordering some stuff, it went away for a year.

In my shop, we have a saying, “What goes away on its own, comes back on its own.” That is definitely my experience. But sometimes, after digging and digging and digging, you still haven’t found the problem. And if you can find a way to make it at least appear to go away, sometimes that is the best you can do for a while.