Geico, an American insurance company, is building a live-patching solution for the Linux kernel, called TuxTape.
TuxTape is an in-development kernel livepatching ecosystem that aims to aid in the production and distribution of kpatch patches to vendor-independent kernels. This is done by scraping the Linux CNA mailing list, prioritizing CVEs by severity, and determining applicability of the patches to the configured kernel(s). Applicability of patches is determined by profiling kernel builds to record which files are included in the build process and ignoring CVEs that do not affect files included in kernel builds deployed on the managed fleet.
↫ Presentation by Grayson Guarino and Chris Townsend
It seems to me that something like live-patching the Linux kernel should be a standardised framework that’s part of the Linux kernel itself, and not several random implementations by third parties, one of which is an insurance company. There has been a base core of live-patching functionality in the Linux kernel since 4.0, released in 2015, but it’s extremely limited and requires most of the functionality to be implemented separately, through things like Red Hat’s kpatch and Oracle’s Ksplice.
Geico is going to release TuxTape as open source, and is encouraging others to adopt and use it. There are various other solutions out there offering similar functionality, so you’re not short on choice, and I’m sure there are advantages and disadvantages to each. I would still prefer it if functionality like this were a standard feature of the kernel, not something tied to a specific vendor or implementation.
What’s the point? Why would an insurance company create its own kernel live patching solution?
In fact, what Linux distribution are they using? Are they not using RHEL, SLES, or Ubuntu, like 99% of the large companies in the world?
Because they don’t want to pay for those vendors’ implementations, or need features those vendors don’t provide.
When you are large enough, it can often be cheaper to maintain your own in-house implementation than to use an external vendor, especially when all of the source code is open, so you only have to maintain customizations rather than the whole stack.
Big players like Google, AWS, even MS do this already.
Agreed. Although those “in-house” distros are usually adaptations of something that already exists. Both Amazon Linux and Azure Linux are clearly based on Fedora, for example. Then again, very small teams create distros from nothing. Take Chimera Linux, for example. Basically a one-man project until recently and entirely novel (not derived from another distro).
If you have the expertise and bandwidth in-house to create a live kernel patching mechanism, maintaining a distro is likely not too heavy a lift.
I am curious why there are so many implementations though. Does anybody with knowledge on the subject know what is preventing one from gaining momentum and becoming the de-facto solution?
No, maintaining your own Linux will not be cheaper. I work for one very well-known, prominent Linux vendor, and precisely those large companies are our customers. It might appear cheaper, but over time it’s more expensive, and those companies learn that painfully.
zxspectrum,
Outsourcing is trendy in the MBA crowd, they get to outsource tasks they don’t care about while cutting IT headcount. This can make sense for the company and it might cost less to outsource, but it isn’t strictly always true. If your company is paying millions of dollars of service fees, then it might well be cheaper to get in house staff to do the work. FOSS licenses are quite permissive and there may not be enough benefit in paying. There are lots of variables so while it’s not a nice clean answer: it depends.
Some of my work is being a subcontractor for companies who have service agreements with their own clients. I’m often surprised at just how much overhead exists in these service agreements. A significant amount of the service costs can go to middlemen who don’t materially contribute. Often it would be a lot cheaper to cut out the middlemen.
There’s absolutely no way it’s going to be cheaper. Linux has too many moving parts. You would need to hire a ton of people to service that DIY Linux, and one person leaving could create havoc in your team because all the know-how in that area leaves with them. The illusion of DIY Linux, DIY Kubernetes, etc. is powerful, but over time organizations learn it was a huge mistake.
zxspectrum,
We’re going to have a difference of opinion then. I maintain my own Linux distro for my servers and it doesn’t take up all my time, so it’s not like I have zero authority on this subject. What you say is impossible, others have found a way to do it.
I’ll admit it can take work. The nature of this work depends a lot on a company’s specific needs for sure. Like I said: whether it’s cost effective depends on lots of factors. License fees can add up quickly if you have lots of servers while the marginal costs to hire a developer approach zero at scale.
I see that you’re following the RHEL sales pitch, but I know for a fact that you’re wrong about what people with appropriate skills can do with FOSS.
Live kernel patching involves some hacks that make it less appealing for standardization. A live-patched kernel can’t be arbitrarily changed, and you don’t end up with the same kernel you’d get from a normal update; it’s more like a custom version of the original kernel still running. Patches have to be carefully crafted and tested. Structures can only be expanded if they had enough empty space appended to them originally. Kernel threads need more interlocks to enforce an orderly live migration. Third parties usually only support a subset of kernel versions, whereas mainline kernel devs put out a lot more kernels, and the work to support livepatching between them would represent a higher workload without a correspondingly higher revenue stream for the Linux kernel. Even if it were available, most users would be better off with a standard kernel.
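To make the structure constraint concrete, here’s a rough sketch (a hypothetical struct, not taken from any real kernel code) of why pre-reserved padding matters:

```c
/* Hypothetical example, not from any real kernel code: a structure can only
 * grow under a livepatch if the original build left spare room at the end. */
struct widget_state {
	unsigned long	flags;
	int		refcount;
	/* Reserved space added at original build time "just in case". A later
	 * livepatch could repurpose these bytes for a new field without
	 * changing the structure's size or the offsets of existing members. */
	unsigned long	__reserved[2];
};

/* Without the __reserved tail, adding a field would change
 * sizeof(struct widget_state) and invalidate every instance already
 * allocated in the running kernel, which a livepatch has no way to fix up. */
```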
I propose a different solution: rather than a mainline live patching solution, how about a feature to standardize live process migration? As an admin it’s something I’ve wanted to be able to do with userspace processes. And as a bonus (assuming the ABI for process migration is stable) it would effectively let you upgrade the kernel by migrating all the processes to a new server running a new kernel. If you’re running a physical server, you could perform the migration twice: once to a temporary server while the physical server gets updated, and then back to the permanent server. If it’s a virtual server you can just skip the temporary server and de-provision the original server after migration. This feature would have benefits over live patching a kernel. Consider that the server would be available for all kinds of maintenance, including hardware fixes/cleaning/upgrades and even physical relocation, without any software downtime.
Qemu has similar functionality, but it would be nice to have support for this at the kernel level. This works best when you have virtualized storage and networking – things that Linux can do, but it would be nice to have standardized processes/commands to put all of this together at a high level. For example, if you have a SAN for network storage, this migration is quite trivial. If you don’t have network storage, then you have to migrate the storage as well, which is still doable but becomes cumbersome for large local disks.
Something I wanted to try back when I was doing hobby OS development was to write a kernel with a coherent hibernation serialization format, one that wasn’t tied to a kernel version or to in-memory kernel data structures. That way, you could live upgrade the kernel by “hibernating” to RAM, and then “restoring” from RAM with the new kernel. If your kernel is small enough, that should be achievable with just a very brief pause in the system. The Linux kernel might be a bit too monolithic for that approach though; it’s more suited to microkernels.
markr,
Yes, that’s another idea that could work. You are thinking of serializing the kernel. You could upgrade every process used by the microkernel separately.
The way I envisioned it is that it’s not the kernel that would be serialized, but the userspace processes. The process’s state would be enumerated and saved in some consistent form (memory, threads, file/socket handles, signal handlers, timers, etc.) and restored with minimal impact on the kernel. I think Linux could handle this quite well.
For most resources Linux already has the necessary kernel mechanisms to reload a previous state. For example, applying memory maps and reopening files at a specific position already work. Tap/tun interfaces can already be redirected on the fly, even UDP sockets. All of this might conceivably even be done from a privileged userspace process that can access other processes’ resources (i.e. via procfs). Off the top of my head, the only resources that can’t be migrated (without kernel support) are TCP sockets, because opening and closing TCP sockets create side effects that can’t be squelched in Linux, and we don’t want the migration to create side effects. But that’s a feature that could be added to support live migration.
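To give a feel for how much of this is already visible from plain procfs, here’s a minimal sketch (nothing migration-specific, just reading /proc) that enumerates another process’s memory maps and open file descriptors:

```c
/* Rough sketch: enumerate another process's memory maps and open file
 * descriptors from procfs. A real migration tool would record far more
 * (registers, signal handlers, timers, socket options, ...). */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	const char *pid = argv[1];
	char path[PATH_MAX];

	/* Memory maps: address ranges, permissions, backing files. */
	snprintf(path, sizeof(path), "/proc/%s/maps", pid);
	FILE *maps = fopen(path, "r");
	if (maps) {
		char line[512];
		while (fgets(line, sizeof(line), maps))
			fputs(line, stdout);
		fclose(maps);
	}

	/* Open file descriptors: each entry is a symlink to the resource. */
	snprintf(path, sizeof(path), "/proc/%s/fd", pid);
	DIR *fds = opendir(path);
	if (fds) {
		struct dirent *de;
		while ((de = readdir(fds)) != NULL) {
			if (de->d_name[0] == '.')
				continue;
			char link[PATH_MAX], target[PATH_MAX];
			snprintf(link, sizeof(link), "/proc/%s/fd/%s", pid, de->d_name);
			ssize_t n = readlink(link, target, sizeof(target) - 1);
			if (n > 0) {
				target[n] = '\0';
				printf("fd %s -> %s\n", de->d_name, target);
			}
		}
		closedir(fds);
	}
	return 0;
}
```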
My use cases are all server based, but if local GUI applications are in scope for migration it makes things trickier especially if a GPU is involved. I’m not so sure that GPU drivers have the ability to fully save/restore state. In theory they might; in practice I’m not so sure. If not, then it might not be possible to migrate those processes (in such cases graceful shutdown should be an option). But this could be overthinking it anyway; the motivation for zero downtime is usually remote users, isn’t it? It seems easier to schedule downtime for local users.
Alfman,
Yes, this might be a better idea, as “live patching” is anathema to modern hardened kernel security.
(I remember the hoopla around Windows x86_64 implementing that, but having to backtrack after Symantec and others led to the European Union forcing them to keep the kernel hackable. And then last year we had that massive malfunction scandal from a third party doing exactly that.)
Anyway…
Back to hibernation. Linus does not even allow stability for in-kernel ABIs; I don’t think he would be okay with guaranteeing stability of in-memory structures. So, even though noble, this might also be a futile approach.
sukru,
It would only need to facilitate pausing/saving/loading/resuming process state. I am suggesting that a stable ABI for migration/serialization doesn’t require a stable kernel ABI. It’s the state from the application’s point of view that needs to be consistently saved & restored, not the state from the kernel’s point of view! The userspace process doesn’t care about the kernel structures representing a socket, open file, process, page maps, etc. These things obviously need to be serialized, but their representation can (and does) change from one kernel to another without affecting process execution. So long as two kernels can save and restore the application state as userspace sees it, then migration can succeed regardless of how significant the changes in the kernel are. Userspace applications (and their serialized representations) can work without caring about how the structures are represented in the kernel.
Kernel state that is invisible to userspace doesn’t need to be serialized. As long as they share serialization ABI formats/protocols, even kernels built decades apart should support process migration between them. Obviously we need to ensure the CPU being migrated to supports the CPU flags the process expects, but this is a reasonable requirement.
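To sketch what I mean by a kernel-independent format, here’s a purely hypothetical record for a single open file; every field is something userspace can observe, so any kernel that speaks the format can rebuild the resource with its own internal structures:

```c
/* Purely hypothetical serialization record for one open file, expressed only
 * in terms userspace can observe (fcntl, lseek, /proc/<pid>/fd/...), never in
 * terms of any particular kernel's internal structs. */
#include <stdint.h>

struct migr_open_file {
	uint32_t fd;         /* descriptor number as the process sees it       */
	uint32_t open_flags; /* O_RDONLY, O_APPEND, ... as reported by F_GETFL */
	uint64_t offset;     /* current position: lseek(fd, 0, SEEK_CUR)       */
	uint16_t path_len;   /* length of the path that follows                */
	char     path[];     /* as resolved via /proc/<pid>/fd/<fd>            */
};
```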
I think this would be a pretty cool feature to facilitate some interesting use cases like a home server running on very low power hardware (think atom) until demand ramps up and can migrate to faster hardware on demand.
Alfman,
That is an interesting idea, but I would still be pessimistic about it.
To be fair, it exists today. The Xbox Series consoles (both X and S) have a feature called “Quick Resume”. It is basically “Alt-Tab” and “Hibernate” combined, on steroids. You can quickly switch between 6-8 games at will as they are suspended to the fast SSD. And they will also survive reboots, or even OS upgrades.
(It is not perfect though. For example online games will disconnect. And sometimes they will fail to resume).
Why can’t Linux do this?
It would require much tighter control of not only the application images, but also their state, including their kernel structures, things like sockets (yes, they would still need them even if local).
Basically:
1 – Image of all programs that are running
2 – Their “sensitive” memory maps, those having to do with kernel structures
3 – All external surfaces, including pipes, network sockets, open files, hardware state, etc.
4 – Being able to bring up that state, but still allowing to swap in the processes. (i.e.: make systemd load a graphical desktop, but get the old X11 or Wayland server back in, or even the old systemd)
It is not impossible, but the last step is the tricky one, as the system and applications do not have a strict boundary under Linux.
sukru,
I don’t think I’ve explained myself well because what I’m thinking would not be all that difficult to do.
All the resources that a process needs to access (files, sockets, etc) already have the syscall facilities needed to put the kernel in a state to access those resources. These syscalls needed to exist to access those resources in the first place. So it’s not strictly necessary to create brand new facilities that do the same thing for migration. Most of the heavy lifting to get/set the kernel state is already implemented!
In cases where getting the kernel back into the right state causes side effects (like connecting), we might need new getters/setters, but most of the important kernel state is already accounted for. For example, Linux sockets use get/setsockopt, and the beauty of this approach is that you don’t have to reengineer anything; the interfaces are already highly stable and portable between kernels spanning decades.
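Just to illustrate the pattern, a rough sketch covering only a couple of options (a real tool would walk a much longer list, per socket type):

```c
/* Rough sketch: capture a couple of socket options on the old system and
 * reapply them on the new one. Only SO_RCVBUF and TCP_NODELAY are shown. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

struct sock_opts {
	int rcvbuf;
	int nodelay;
};

static int save_opts(int fd, struct sock_opts *o)
{
	socklen_t len = sizeof(o->rcvbuf);
	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &o->rcvbuf, &len) < 0)
		return -1;
	len = sizeof(o->nodelay);
	return getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &o->nodelay, &len);
}

static int restore_opts(int fd, const struct sock_opts *o)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &o->rcvbuf, sizeof(o->rcvbuf)) < 0)
		return -1;
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &o->nodelay, sizeof(o->nodelay));
}

int main(void)
{
	struct sock_opts opts;
	int a = socket(AF_INET, SOCK_STREAM, 0);  /* stands in for the "old" socket */
	int b = socket(AF_INET, SOCK_STREAM, 0);  /* stands in for the "new" socket */

	if (a < 0 || b < 0 || save_opts(a, &opts) < 0 || restore_opts(b, &opts) < 0) {
		perror("sockopt");
		return 1;
	}
	printf("copied SO_RCVBUF=%d TCP_NODELAY=%d from one socket to the other\n",
	       opts.rcvbuf, opts.nodelay);
	return 0;
}
```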
I’d agree that the kernel keeps additional heuristics/stats/etc., like CPU time (available in procfs). This is normally “invisible” to the application and is therefore not likely to affect execution after migration. Even this information could be migrated too, if procfs were appropriately writable. The migration process could be granted an appropriate “migration” capability for this purpose.
All this information is already available at “/proc/?/maps” and the syscalls to get into this state already exist as “mmap” and friends. I think you might even be able to do this part without any new kernel features.
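Something along these lines, say, for recreating one anonymous mapping from a saved maps line (a minimal sketch with a made-up address; MAP_FIXED_NOREPLACE needs Linux 4.17+):

```c
/* Sketch: recreate one anonymous, private, read-write mapping at the address
 * recorded in a saved /proc/<pid>/maps line. Restoring the contents would
 * then just be a matter of copying the saved bytes back in. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* A made-up example of a line as it appears in /proc/<pid>/maps. */
static const char *saved_line = "7f3a9c021000-7f3a9c042000 rw-p 00000000 00:00 0";

int main(void)
{
	unsigned long start, end;
	char perms[5];

	if (sscanf(saved_line, "%lx-%lx %4s", &start, &end, perms) != 3)
		return 1;

	/* MAP_FIXED_NOREPLACE (Linux 4.17+) fails with EEXIST instead of
	 * silently clobbering whatever already lives at that address. */
	void *p = mmap((void *)start, end - start,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
		       -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("recreated mapping %lx-%lx at %p\n", start, end, p);
	return 0;
}
```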
Yes, but I think I’ve covered these already. In Linux, normal processes don’t manipulate hardware directly, and because of this, hardware state doesn’t need to be included. This makes things a whole lot simpler.
The reason I was trying to justify limiting the scope to network services is because local GPU applications are more complicated and are not as important in terms of 24/7 uptime. “My use cases are all server based, but if local GUI applications are in scope for migration it makes things trickier especially if a GPU is involved….”
I think process boundaries are good enough. A migration tool might even be smart enough to keep pipes connected after a migration by using a proxy process.
BEFORE MIGRATION
AFTER MIGRATION
The use of socat is just illustrative, a dedicated migration tool could keep better track of pipes. Using this technique you might migrate processes back and forth between computers all day long! Shared memory would be harder to support across a network (obviously), but even that could be supported if you migrated processes that share memory together. Obviously for the migration to go smoothly the open files should be identical, but you could also proxy files via NFS.
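In code terms the relaying itself is trivial. Here’s a self-contained sketch of the idea, with a socketpair standing in for the network connection to the other machine (the setup around it is obviously hypothetical):

```c
/* Very rough sketch of what the relaying boils down to: shuttle bytes between
 * a local pipe end and a socket that leads to the migrated process. Here a
 * socketpair() stands in for the network leg so the sketch is self-contained. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
	int pipefd[2], sockfd[2];

	if (pipe(pipefd) < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sockfd) < 0)
		return 1;

	/* Pretend the local application wrote into the pipe as usual. */
	const char *msg = "hello across the migration boundary\n";
	if (write(pipefd[1], msg, strlen(msg)) < 0)
		return 1;

	/* The proxy process: read from the pipe, forward into the "network". */
	char buf[4096];
	ssize_t n = read(pipefd[0], buf, sizeof(buf));
	if (n > 0 && write(sockfd[0], buf, (size_t)n) < 0)
		return 1;

	/* On the far side, the migrated process reads it as if from its pipe. */
	n = read(sockfd[1], buf, sizeof(buf));
	if (n > 0)
		fwrite(buf, 1, (size_t)n, stdout);
	return 0;
}
```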
I think there’s quite a lot that would be doable in terms of process migration without having to introduce too much complexity in the linux kernel to achieve it.
Alfman,
Yes, I see your point for non-desktop usages. But then it is infinitely easier to just have each service in a virtual machine or Docker container, and use their native migration services to move them between hardware.
This way the host can be upgraded without interruption at the higher levels.
(Docker does not have live migration, at least as far as I know, but since most services are stateless, it is easy to just “reboot” instances).
Now… if we really wanted suspend/resume, once again, the issue is that the Linux system is tightly coupled with the Linux userspace.
Yes, we can snapshot by looking at /proc, /sys, or similar.
However, say your service depends on data mounted on ZFS connected through iSCSI or similar (over Ethernet or FC). How do you bring that up?
You need to replay the init sequences.
Which means you can’t just do a dry bring-up of an “image”.
sukru,
I see your point, I mentioned qemu before and vmware has it too. But whole VM migration doesn’t solve the problem that inspired this discussion, which is being able to update the kernel without interrupting the applications running under it.
For my home use case I was thinking of moving many idle daemons to low power hardware (like a router that runs 24×7) and then be able to migrate them on demand so that high power servers can be powered off/sleep when loads are low. You’re right that VM migration would work here but this approach may be too memory hungry for the low power devices I was thinking about, especially given that the VM would need to be provisioned for expected high loads.
Yeah it’s too bad, I looked…
https://stackoverflow.com/questions/52788712/how-can-i-perform-live-migration-of-a-docker-container
https://forums.docker.com/t/container-live-migration/2285
Sure, if you can get away with a restart, that’s a simple solution. Live migration becomes more interesting for network daemons that are actively serving users, where rebooting would abruptly cancel their sessions. Think asterisk hanging up phone calls, streaming websocket applications that disconnect, mariadb connections that would be lost, surveillance software lapses, etc.
Granted, most of us expect technology downtime, it’s just interesting to think about what linux could do with migration.
It’s funny you should mention this, because NAS/SAN file systems are the preferred way to configure VMs that you intend to migrate, as they help reduce the overhead of migrating local file systems.
The NAS/SAN support comes for free and helps ensure that the pre/post migration states are consistent.
The migration process just needs to reopen the files. As long as they were flushed on the old system then the data will still be consistent.
You can already serialize and restore running applications on Linux:
https://criu.org/Main_Page
draethus,
Thanks for linking that, I’ll have to take a closer look.
Personally I’m more interested in live migration than process suspending and restoring on a single system, but obviously these are similar.
It’s interesting to see the project referencing many of the same challenges discussed in the thread…
https://criu.org/Advanced_usage
There’s a lot of insightful information. While it doesn’t appear that this project’s aim was to support migrations, it’s certainly a good starting point and demonstrates useful techniques.
Thank you for the link, I learned something new today (and remember the old days, … Virtuozzo wow)
IBM achieves live patching for AIX (IBM’s Unix variant, started in 1986) via a surrogate partition. It’s similar to your idea above and gets around the need for pre-allocated space for updates.
The procedure is something like:
– The surrogate partition is created and updated with the new kernel and patches.
– Workloads from the original partition are transferred to the surrogate partition using a checkpoint and restart mechanism. The checkpoint process is part of the Metacluster Checkpoint and Restart (MCR) kernel extension which can stay running while most kernel processes are suspended. It’s closed source.
– The workloads resume on the surrogate partition in a chrooted environment.
The checkpoint and restart mechanism cannot itself be live updated. Neither can it modify core kernel functionality or update the virtual machine itself. Seems like a limited-scope live update procedure, but those updates are more permanent than the live patching you describe. Just desk research here for fun. Not an expert.
Iapx432,
Thank you for that information. AIX is very foreign to me, but the ideas aren’t platform specific and it makes sense that they’ve been done before.
The reason I like the idea of process migration is because the ability to upgrade the kernel is a side effect of the feature.
I remember some project managing to get two different kernels running on the same hardware concurrently, but I can’t remember the project involved. Obviously the hardware resources have to be carefully divided up between kernels. I don’t think this would be a very “safe” method, but it might offer another solution that could technically work and be used for in-place kernel upgrades.
That might have been QNX. I seem to remember someone mentioning that before.
You need a hypervisor to start/stop the two different kernels, right?
You are talking about further development of CRIU.
But what I understood is that kernel live patching is often applied on systems that have peripherals attached or long-running networking sessions running through them, like VoIP systems.
Lennie,
I am also talking about migrating systems with running daemons. Peripherals may be a bit harder to support because hardware has state information that may need to be specially handled.
Just an example to consider:
A webcam may be capturing at the time of “patching”. For the application to keep working uninterrupted we have one of two choices:
A) Don’t touch the hardware’s state. The new kernel must be prepared to handle the webcam in its current state by migrating information from the old driver. This may be technically doable, but it makes drivers more complicated (especially with an unstable ABI).
B) Reset the hardware normally, but have the userspace migration layer instruct the driver to put the webcam back into an acceptable state so the software doesn’t “see” the interruption.
Another example is a scanner. #A seems to be the only workable choice. Getting a scanner back into the same state mid-scan seems impossible.
Another example could be a mouse or keyboard peripheral. In this case the state information is so basic that #B seems like the obvious choice.
Regardless of the approach, the fact that low-level knowledge of the hardware/drivers needs to be considered within the migration process is problematic for migrating peripherals. Live patching gets around this by keeping the same drivers running… their goal is to mitigate code vulnerabilities without changing kernel structures; however, it ends up creating a new kernel that’s different from what you get from running a normal update.