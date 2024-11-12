Speaking of Steam, the Linux version of Valve’s gaming platform has just received a pretty substantial set of fixes for crashes, and Timothee “TTimo” Besset, who works for Valve on Linux support, has published a blog post with more details about what kind of crashes they’ve been fixing.
The Steam client update on November 5th mentions “Fixed some miscellaneous common crashes.” in the Linux notes, which I wanted to give a bit of background on. There’s more than one fix that made it in under the somewhat generic header, but the one change that made the most significant impact to Steam client stability on Linux has been a revamping of how we are approaching the setenv and getenv functions.
One of my colleagues rightly dubbed setenv “the worst Linux API”. It’s such a simple, common API, available on all platforms that it was a little difficult to convince ourselves just how bad it is. I highly encourage anyone who writes software that will run on Linux at some point to read through “RachelByTheBay”’s very engaging post on the subject.
↫ Timothee “TTimo” Besset
This does indeed seem to be a Linux-specific problem, and because of the variability in Linux systems – different distributions, extensive user customisation, and so on – the debugging information was more difficult to parse than on Windows and macOS. After a lot of work grouping the debug information to try and make sense of it all, it turned out that the two functions in question were causing crashes in threads other than the ones calling them.
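To see why a crash can land in a thread that never called setenv, it helps to look at what a lookup has to do. The sketch below is a simplified, hypothetical model. naive_getenv is not a real libc function, and glibc's actual getenv differs in details. It walks the process-wide environ array without any lock. If another thread's setenv grows that array, the array may move to a new allocation and the old one may be freed mid-walk, leaving the reader dereferencing freed memory.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations from the system headers */
#include <stddef.h>
#include <string.h>

/* The process-wide environment array, declared explicitly as POSIX programs traditionally do. */
extern char **environ;

/*
 * Hypothetical, simplified model of a getenv()-style lookup.
 * Nothing here takes a lock: if another thread's setenv() reallocates
 * environ while this loop runs, ep can point into freed memory and the
 * crash happens in *this* thread, not in the one that called setenv().
 */
const char *naive_getenv(const char *name)
{
    size_t len = strlen(name);
    for (char **ep = environ; ep != NULL && *ep != NULL; ++ep) {
        /* Match "NAME=" exactly, so NAME does not match NAME_LONGER=... */
        if (strncmp(*ep, name, len) == 0 && (*ep)[len] == '=')
            return *ep + len + 1;
    }
    return NULL;
}
```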
They had to resort to several solutions. These ranged from reducing the reliance on setenv by refactoring code to use execvpe (which passes an explicit environment to the child process), to reducing the reliance on getenv through caching, to introducing “an ‘environment manager’ that pre-allocates large enough value buffers at startup for fixed environment variable names, before any threading has started”. This last one in particular had a major impact on reducing the number of crashes with Steam on Linux.
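The environment manager idea can be sketched roughly as follows. This is a guess at the general shape, not Valve's code. env_slot, env_slot_init, env_slot_set and ENV_SLOT_CAPACITY are hypothetical names. The sketch also relies on putenv's POSIX semantics, where the environment keeps a pointer to the caller's own buffer rather than a copy. Each variable gets a fixed-size "NAME=value" buffer registered with putenv before any threads start. Later updates overwrite that buffer in place instead of calling setenv, so the environ array itself is never reallocated.

```c
#define _XOPEN_SOURCE 700  /* exposes putenv (an XSI function) and setenv */
#include <stdlib.h>
#include <string.h>

/* Hypothetical capacity; a real manager would size each slot for its variable. */
#define ENV_SLOT_CAPACITY 256

/* One pre-allocated "NAME=value" buffer that stays in environ for the life of the process. */
struct env_slot {
    size_t name_len;
    char buf[ENV_SLOT_CAPACITY];
};

/*
 * Call once per variable at startup, before any thread is created.
 * putenv() stores the pointer to buf itself in the environment, so later
 * updates can rewrite the value in place. After this call the variable is
 * always defined (possibly as an empty string).
 */
int env_slot_init(struct env_slot *slot, const char *name)
{
    size_t name_len = strlen(name);
    const char *current = getenv(name);
    size_t value_len = current ? strlen(current) : 0;

    /* Need room for "NAME=", the current value, and a final '\0'. */
    if (name_len + 1 + value_len + 1 > sizeof slot->buf)
        return -1;

    memset(slot->buf, 0, sizeof slot->buf);
    memcpy(slot->buf, name, name_len);
    slot->buf[name_len] = '=';
    if (current)
        memcpy(slot->buf + name_len + 1, current, value_len);
    slot->name_len = name_len;
    return putenv(slot->buf);
}

/*
 * Replace the value without calling setenv(), so environ is never
 * reallocated. The last byte of buf is never written and stays '\0', so a
 * concurrent getenv() can at worst see a half-updated string (still a data
 * race in C terms) but cannot run off the end of the buffer.
 */
int env_slot_set(struct env_slot *slot, const char *value)
{
    char *dst = slot->buf + slot->name_len + 1;
    size_t room = sizeof slot->buf - slot->name_len - 2; /* value bytes, excluding the final '\0' */
    size_t len = strlen(value);

    if (len > room)
        return -1;
    memcpy(dst, value, len);
    memset(dst + len, 0, room - len);
    return 0;
}
```

The trade-off is that readers may still observe a half-written value. The sketch targets the reallocation of environ, not torn reads.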
Besset does note that these functions are still called far too often, but that at this point it’s out of their control, because that usage comes from the operating system’s libraries, like X11, XCB, D-Bus, and so on. Besset also mentions that it would be much better if this issue were addressed in glibc. In the comments, a user by the name of Adhemerval reports that the glibc team is indeed working on this.
I tend not to use environment variables much in my own software, but I appreciate that many libraries do, and this is a valid concern there.
They also allude to the behavior of multithreading combined with forking, which again is a Unix-specific issue. Those of us coming from Windows usually look for some kind of spawn syscall, but Linux doesn’t have one. Instead, the spawning functions defined by POSIX are implemented on Linux as wrappers on top of fork or the clone syscall. Furthermore, while fork works fine for academic examples, forking can be quite inefficient in large production processes (think of something like a database or web browser). Not only are there performance concerns, but multithreaded programs can also suffer undesirable side effects. For example, only the calling thread is duplicated in the child, so a lock held by another thread at the moment of the fork stays locked there.
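As an illustration, posix_spawn lets the C library decide how to create the child. Because it takes an explicit envp array, the parent never has to call setenv to prepare the child's environment, which is the same idea as the execvpe refactoring mentioned in the article. spawn_with_env is a hypothetical helper. It expects argv[0] to be an absolute path, because posix_spawn, unlike posix_spawnp, does not search PATH.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Spawn argv[0] (an absolute path) with exactly the environment in envp,
 * then wait for it. The parent's own environment is never modified.
 * Returns the child's exit status, or -1 on failure.
 */
int spawn_with_env(char *const argv[], char *const envp[])
{
    pid_t pid;
    int status;

    /* NULL file actions and attributes: default inheritance and signal behaviour. */
    int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, envp);
    if (rc != 0)
        return -1;  /* posix_spawn returns an error number instead of setting errno */

    /* Retry if the wait is interrupted by a signal. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with argv {"/bin/sh", "-c", "test \"$GREETING\" = hello", NULL} and envp {"GREETING=hello", NULL} should return 0, since the child sees only the variables listed in envp.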
Another problem I only encounter with Linux software is file handles being inadvertently passed to children. Linux does not offer a direct way to pass specific file handles to child processes; all handles get passed by default. This makes it a bad API IMHO. File handles can be trivially leaked without anyone noticing, which might have security implications. Personally, I try to remember to set close-on-exec (O_CLOEXEC) consistently on every single file handle and socket that I open. But it’s easy to miss in code using default flags, and most software developers rely on third-party libraries that might not set close-on-exec or expose handles at all. There is no good solution to this, and as a result I’ve seen code using hacks like “for(int i=0; i<1024; i++) close(i);” before calling exec, just in case something is holding on to file handles without our knowledge and we don’t want to pass it along.
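Here is a minimal sketch of the close-on-exec habit described above. It assumes Linux, since SOCK_CLOEXEC is a Linux extension to socket available since 2.6.27, and the helper names are hypothetical. Setting the flag at creation time with O_CLOEXEC or SOCK_CLOEXEC is atomic. Setting FD_CLOEXEC later with fcntl leaves a window in which another thread's fork and exec can still inherit the descriptor.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a file with close-on-exec set atomically at creation time. */
int open_cloexec(const char *path, int flags)
{
    return open(path, flags | O_CLOEXEC);
}

/* Same idea for sockets: SOCK_CLOEXEC is OR-ed into the type argument (Linux). */
int socket_cloexec(int domain, int type, int protocol)
{
    return socket(domain, type | SOCK_CLOEXEC, protocol);
}

/*
 * For a descriptor you did not create (e.g. one handed over by a library),
 * set FD_CLOEXEC after the fact. Not atomic: a concurrent fork()+exec()
 * in another thread may still inherit it before this runs.
 */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

Note that dup() clears the flag on the new descriptor, so duplicates need it set again, or can be created with fcntl and F_DUPFD_CLOEXEC instead.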
Just don’t fork() around with getenv/setenv!
Nico57,
Most programs can just call getenv during setup, before the main loop starts any threads. They will probably never experience corruption. But I agree with the author that there shouldn’t be a fault, and the use of these functions should be well defined even for multithreaded software.
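The approach suggested above, reading everything once during setup, might look like this. struct startup_env, load_startup_env and the choice of HOME and LANG are hypothetical. The point is that worker threads read a private copy and never call getenv themselves.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations from the system headers */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical settings a program might read from the environment. */
struct startup_env {
    char home[256];
    char lang[64];
    int has_lang;
};

/* Copy one variable into a fixed buffer; returns 1 if it was set, 0 otherwise. */
static int copy_env(const char *name, char *dst, size_t size)
{
    const char *value = getenv(name);
    if (value == NULL) {
        dst[0] = '\0';
        return 0;
    }
    snprintf(dst, size, "%s", value);  /* truncates rather than overflows */
    return 1;
}

/*
 * Call from main() before creating any threads. Afterwards threads read
 * this struct instead of calling getenv(), so a later setenv() elsewhere
 * cannot invalidate what they are reading.
 */
void load_startup_env(struct startup_env *env)
{
    copy_env("HOME", env->home, sizeof env->home);
    env->has_lang = copy_env("LANG", env->lang, sizeof env->lang);
}
```

The copy is a snapshot. A later setenv("HOME", ...) is not reflected in the struct, which is usually what you want for configuration read at startup.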
Today I went back to Linux again after another exodus to OS/2 and BeOS. This time it’s a full-out build: 8x 7900 XTX and quad EPYC CPUs (not the fastest in single-core, but the fastest so far in multithreaded), 128 cores x4, with amounts of RAM that would not even have been possible to have as storage before.
Yeah, I still have to use my ThinkPad to work on the system, so it is not for “me” per se, but now I am sadly forced to finally move off ArcaOS… since this is the end. The end of “stuff working” on old systems, and especially 32-bit ones, unless something happens soon.
To be fair, Linux already supports more Windows games than Windows does. But this is great.
I’ve fixed this problem in Firefox some time back by using function interposition:
https://bugzilla.mozilla.org/show_bug.cgi?id=1752703
https://hg.mozilla.org/mozilla-central/rev/1760c9f902bf
https://hg.mozilla.org/mozilla-central/rev/79dc5e93cef4
Since we can’t change all our dependencies (they are too many) I’ve slotted some thread-safe environment manipulation functions between the code calling them and the libc implementations. This way the calling code is transparently forced to acquire a lock before manipulating the environment and requires no changes (and no adjustments for forked off processes). This is achieved by linking our own library (mozglue) before libc on Linux, and by hooking dlsym() on Android where libc might have been loaded before we get a chance to load our code. I’ve eliminated all crashes related to environment manipulation with this trick, with the only exception of cases where glibc calls those functions internally, but that’s a genuine glibc bug that needs to be fixed upstream.
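The locking half of this trick can be sketched in plain C. Unlike the Firefox patches, this sketch does not interpose the libc symbols, which is what makes third-party callers take the lock without source changes. locked_setenv and locked_getenv_copy are hypothetical wrappers that only protect callers who use them. Copying the value out before releasing the lock avoids handing the caller a pointer into environ.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* One process-wide lock guarding every read and write of the environment. */
static pthread_mutex_t env_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: setenv() under the lock. */
int locked_setenv(const char *name, const char *value, int overwrite)
{
    pthread_mutex_lock(&env_lock);
    int rc = setenv(name, value, overwrite);
    pthread_mutex_unlock(&env_lock);
    return rc;
}

/*
 * Hypothetical wrapper: look up name under the lock and copy the value
 * into buf before unlocking. Returns 0 on success, -1 if the variable is
 * unset, 1 if the value was truncated to fit (or size is 0).
 */
int locked_getenv_copy(const char *name, char *buf, size_t size)
{
    int rc;
    pthread_mutex_lock(&env_lock);
    const char *value = getenv(name);
    if (value == NULL) {
        rc = -1;
    } else if (size == 0) {
        rc = 1;
    } else {
        size_t len = strlen(value);
        size_t n = len < size ? len : size - 1;
        memcpy(buf, value, n);
        buf[n] = '\0';
        rc = (n == len) ? 0 : 1;
    }
    pthread_mutex_unlock(&env_lock);
    return rc;
}
```

As noted above, calls that glibc makes internally bypass any such wrapper.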