Linked by Thom Holwerda on Mon 26th Feb 2018 18:13 UTC
Windows

Flaky failures are the worst. In this particular investigation, which spanned twenty months, we suspected hardware failure, compiler bugs, linker bugs, and other possibilities. Jumping too quickly to blaming hardware or build tools is a classic mistake, but in this case the mistake was that we weren’t thinking big enough. Yes, there was a linker bug, but we were also lucky enough to have hit a Windows kernel bug which is triggered by linkers!

Thread beginning with comment 654140
He hasn't found the bug
by malxau on Mon 26th Feb 2018 20:02 UTC

For those reading the article to find the bug, it's not yet diagnosed. The FlushFileBuffers workaround is heavyweight (I'm surprised he didn't mention how slow it is), but it also hints at where the problem lies.

I don't know the bug either, but can describe from a kernel point of view why these cases are complex.

A linker typically writes to files via a writable memory mapped view. Pages modified through the view live directly in the system cache and are considered dirty, and the system is free to write them out at any time. The file system is responsible for ensuring that when a page is read from disk, any data that was already written to it is returned, and that a page which has never been written reads back as zeroes. That's conceptually simple, but the implementation is complex due to various factors. If you want to know more about this, I wrote about it at https://www.osr.com/nt-insider/2015-issue2/maintaining-valid-data-le... .
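A minimal Python sketch of that write pattern, not the linker's actual code: it extends a file, dirties only some pages through a writable mapping, flushes, and reads the pages back through ordinary file I/O. The helper names, the 4096-byte page size, and the fill byte are assumptions for illustration. As far as I know, on Windows `mmap.flush()` ends up in FlushViewOfFile and `os.fsync()` in FlushFileBuffers, which is the heavyweight call discussed above.

```python
import mmap
import os

PAGE = 4096  # assumed page size, for illustration only


def write_sparse_pages(path, total_pages, pages_to_write, fill=b"\xAB"):
    """Extend a file, then dirty only the selected pages through a mapped view."""
    with open(path, "wb") as f:
        f.truncate(total_pages * PAGE)  # extend the file without writing data
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), total_pages * PAGE) as view:
            for page in pages_to_write:
                start = page * PAGE
                view[start:start + PAGE] = fill * PAGE  # page is now dirty
            view.flush()      # push dirty pages from the view to the file
        os.fsync(f.fileno())  # ask the OS to commit the file to stable storage


def read_page(path, page):
    """Read one page back through ordinary (non-mapped) file I/O."""
    with open(path, "rb") as f:
        f.seek(page * PAGE)
        return f.read(PAGE)
```

After `write_sparse_pages(path, 4, [0, 2])`, pages 0 and 2 should read back as the fill byte and pages 1 and 3 as zeroes. That is the contract the file system has to uphold no matter when the dirty pages actually reach the disk.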

The next complication with linkers is executable file layout. Typical Windows PE files are composed of a series of sections with different page permissions, but these are aligned at 512 bytes within the file. When an executable page is read from disk, its contents start at a 512-byte-aligned file offset, but the corresponding data that the linker generated was written at 4KB alignment. So it's not strictly true that because a (single) data page was written, an unaligned executable page can be read: both data pages that compose an executable page must be written before the executable page is read. This is default behavior, but it is optional. Executables can be written with 4KB alignment, they'll just be somewhat larger. Some compilers have done this by default (e.g. see http://www.malsmith.net/blog/good-compiler-bad-defaults/ ).
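The arithmetic behind "both data pages" can be sketched in a few lines of Python. The function name and the example offsets are hypothetical; the only inputs taken from the text are the 512-byte file alignment and the 4KB page size.

```python
PAGE_SIZE = 4096      # page size used when the linker writes through its view
FILE_ALIGNMENT = 512  # default PE file alignment described above


def data_pages_for_image_page(section_file_offset, page_index_in_section):
    """Return the indexes of the 4KB file pages backing one 4KB page of a section."""
    if section_file_offset % FILE_ALIGNMENT:
        raise ValueError("section offset is not a multiple of the file alignment")
    start = section_file_offset + page_index_in_section * PAGE_SIZE
    end = start + PAGE_SIZE - 1  # last byte of the executable page
    return list(range(start // PAGE_SIZE, end // PAGE_SIZE + 1))


# A section starting at file offset 0x400 straddles two file pages.
print(data_pages_for_image_page(0x400, 0))   # [0, 1]
# A section starting on a 4KB boundary needs only one.
print(data_pages_for_image_page(0x1000, 0))  # [1]
```

With 512-byte alignment, any section whose file offset is not a multiple of 4096 has every one of its pages split across two file pages, so each of those reads depends on two writes having landed.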

If I were running the Chrome build process, I'd be tempted to compile this particular binary with 4KB alignment and see if it fixes the problem. Another thing to check is that the file is not NTFS compressed (which requires writing 64KB of uncompressed data and totally changes this logic.)
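One way to confirm which alignment a given binary actually ended up with is to read the SectionAlignment and FileAlignment fields from its PE optional header. The Python below is a rough sketch with hypothetical helper names and only minimal validation; the field offsets follow the published PE format. With the MSVC linker, the file alignment is set by the /FILEALIGN option.

```python
import struct


def pe_alignments(image):
    """Return (SectionAlignment, FileAlignment) from the bytes of a PE file."""
    if image[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    optional_header = pe_offset + 4 + 20  # skip the signature and COFF header
    # Both fields sit at the same offsets in PE32 and PE32+ optional headers.
    return struct.unpack_from("<II", image, optional_header + 32)


def page_aligned_on_disk(image, page_size=4096):
    """True if sections start on page boundaries within the file."""
    _, file_alignment = pe_alignments(image)
    return file_alignment >= page_size and file_alignment % page_size == 0
```

Typical use would be `page_aligned_on_disk(open(path, "rb").read(4096))`, assuming the headers fit in the first 4KB of the file, which is usual but not guaranteed.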

What's frustrating about reading this article is that the bug could be in one of a handful of kernel components, but the investigation hasn't gone far enough to even pinpoint which. (I hope I wasn't the one who wrote it!)

Edited 2018-02-26 20:08 UTC

Score: 7