One feature I couldn’t live without anymore is snapshots. As system administrators, we often find ourselves in situations where we’ve made a mistake, need to revert to a previous state, or need access to a log that has been rotated and disappeared. Since I started using ZFS, all of this has become incredibly simple, and I feel much more at ease when making any modifications.
However, since I don’t always remember to create a manual snapshot before starting to work, I use an automatic snapshot system. For this type of snapshot, I use the excellent
zfs-autobackup tool – which I also use for backups. The goal is to have a single, flexible, and configurable tool without having to learn different syntaxes.
↫ Stefano Marinelli
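For context, zfs-autobackup is driven by a ZFS user property plus a thinning schedule; a minimal snapshot-only sketch might look something like this (the group name and retention spec are placeholders – check the project’s documentation for the exact flags):

    # tag the datasets you want snapshotted; children inherit the property
    zfs set autobackup:offsite=true rpool/ROOT
    zfs set autobackup:offsite=false rpool/ROOT/tmp    # opt a child back out

    # with no target dataset given, zfs-autobackup runs in snapshot-only mode;
    # this schedule keeps 10 recent snapshots plus daily/weekly/monthly thinning
    zfs-autobackup --keep-source=10,1d1w,1w1m,1m1y offsite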
I’m always a little sad about the fact that the kind of advanced features modern file systems like ZFS, btrfs, and others offer are so inaccessible to mere desktop users like myself. While I understand they’re primarily designed for server use, they’re still making their way to desktops – my Fedora installations all default to btrfs – and I’d love to be able to make use of their advanced features straight from within KDE (or GNOME or whatever it is you use).
Of course, that’s neither here nor there for the article at hand, which will be quite useful for people administering FreeBSD and/or Linux systems, and who would like to get the most out of ZFS by automating some of its functionality.
I’ve seen a few distros over the years that have automated btrfs snapshots before every software upgrade. It’s copy-on-write, so the disk space used wasn’t horrendous, but it was a bit of a pain to manage them all.
openSUSE Tumbleweed was the last one I used.
I’ve also heard Linux Mint’s Timeshift (https://github.com/linuxmint/timeshift) can be used on Fedora to do what you’re looking for. Haven’t tried it myself; may look into that. I’m on Ubuntu now, sadly. Fedora choked on some of my hardware, and Ubuntu for whatever reason seems more stable on it. Maybe I’ll transition to Mint one of these days.
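For reference, the pre-upgrade snapshots on Tumbleweed come from snapper hooked into zypper; the same pre/post pairing can be driven by hand, roughly like this (the config name “root” is the usual default, but verify the flags against your snapper version):

    # take a "pre" snapshot and remember its number
    PRE=$(snapper -c root create --type pre --print-number --description "before upgrade")

    # ...run the upgrade or whatever risky change...

    # close the pair with a "post" snapshot, then inspect or revert the delta
    POST=$(snapper -c root create --type post --pre-number "$PRE" --print-number --description "after upgrade")
    snapper -c root status "$PRE..$POST"
    snapper -c root undochange "$PRE..$POST"    # only if you want to roll the change back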
Do you know if they fixed btrfs raid so that a failed disk doesn’t result in an emergency boot shell? I put btrfs through its paces and was happy with every other aspect. In the forums, cancelling bootup and not running “degraded” was touted as a feature, but this single deficiency keeps it unfit for production purposes. I like the features that btrfs has over traditional raid and mdadm-based solutions, but unfortunately this lack of robustness fails to keep production systems online.
If you don’t use btrfs for raid, then it’s a non-issue. It’s possible to run btrfs over mdadm, but then you’re losing btrfs’s own raid features. Placing raid outside of btrfs means you lose the ability to differentiate between raid copies using btrfs file integrity checks. Ideally, btrfs raid would give more confidence in file integrity than relying on the disk to report errors. And mdadm doesn’t have the ability to mix and match different-sized raid devices and to redistribute the array dynamically. I really like these improvements over traditional static raid setups! However, they have to fix the showstopping boot problem 🙁
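To illustrate what’s gained by keeping the raid inside btrfs – per-extent checksums let a scrub identify the bad copy and rewrite it from the good mirror, which mdadm underneath cannot do – a rough sketch (device names and mount point are placeholders):

    # btrfs-native raid1: data and metadata both mirrored, every extent checksummed
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
    mount /dev/sdb /mnt/pool

    # a scrub verifies checksums and repairs any bad copy from the good mirror
    btrfs scrub start -B /mnt/pool
    btrfs device stats /mnt/pool    # per-device read/write/corruption counters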
No, I’m not really up to date on all the btrfs developments. I don’t use it in a raid configuration. There are some other btrfs complaints that caused me to question my decision, but I haven’t found anything else as good yet. But I should initiate another search. bcachefs also sounds good, but it’s too early for it.
Btrfs should not go into degraded mode if you still have enough disks to support the requested number of copies, i.e. “raid1” has to have at least 2 good disks. So for a “raid1” that can survive one drive loss without degrading, you just need 3+ disks. This is actually a better option than mdadm’s behaviour; silently running degraded only postpones the inevitable crisis and makes it worse, since now the remaining single healthy copy is at risk. Btrfs simply requests corrective action immediately, as soon as self-healing is no longer possible.
Anyhow, neither RAID nor snapshots are backups, since they are not independent from the live data; fortunately we have tarsnap, restic, borg or kopia, all based on a git-like hash store which allows de-duplication by content, and they play nicely with snapshots to ensure they work on an atomic view of the data state.
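As a sketch of that interplay (dataset and repository paths are made up, and restic is just one of the tools named above): snapshot first, then back up the snapshot’s read-only view so the backup sees a single point in time.

    SNAP="restic-$(date +%Y-%m-%d)"
    zfs snapshot "tank/data@$SNAP"
    # every ZFS snapshot is reachable under the dataset's .zfs/snapshot directory
    restic -r /srv/restic-repo backup "/tank/data/.zfs/snapshot/$SNAP"
    zfs destroy "tank/data@$SNAP"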
mbq,
I’d have to test what btrfs does with a 3-disk mirror, but it’s not likely a configuration I’d use. Adding more disks for additional redundancy is fine, but these are choices that need to be made by the admin: 2-disk mirror, 3-disk mirror, 4-disk mirror, raid 10, raid 6, raid 60, standby disks, or whatever. There are lots of possible raid configurations and it’s nice to have these options. But with respect to fault tolerance, when a single disk failure causes a system to fail bootup, sometimes at a remote location, btrfs’s current raid 1 implementation objectively fails to provide the desired fault tolerance.
Yes I agree, raid doesn’t replace backups or other forms of redundancy like failover servers.
Just to be clear, I wouldn’t mind btrfs dropping to an emergency boot prompt being an option. But all other raid providers I’m aware of handle single disk failure without going offline like btrfs does. Mdadm/dmraid/zfs/perc/etc, all can handle single disk failure without going offline, BTRFS stands alone in leaving the system unbootable. All I am saying is that this shortcoming should be addressed for people who want to run a raid 1 configuration to improve fault tolerance against a single drive failure.
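For anyone hitting this, the manual recovery path – exactly the part that shouldn’t require hands-on intervention – looks roughly like this (device names and the devid are placeholders):

    # a missing member blocks the normal mount, so boot with rootflags=degraded
    # on the kernel command line, or mount by hand:
    mount -o degraded /dev/sdb /mnt/pool

    # replace the dead member (here devid 2) with the new disk and re-mirror
    btrfs replace start -B 2 /dev/sdd /mnt/pool
    btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/pool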
I’ve also heard to avoid using btrfs raid5/6 because it may cause corruption under power loss. IIRC the documentation warned about this, but I’m not sure whether btrfs raid 5/6 has improved since then. I tend to use raid 6 and raid 10 for servers and raid 1 on personal computers.
You can use timeshift with any distro, as long as you have the required btrfs subvolumes. (and to be correct: timeshift can use rsync too)
I use it with Debian Sid.
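For the btrfs mode, Timeshift expects the Ubuntu-style subvolume layout (a subvolume named @, optionally @home); after that it’s roughly the following (flags from memory, so double-check against --help):

    btrfs subvolume list /              # should show @ (and optionally @home)

    timeshift --create --comments "before kernel update" --tags D
    timeshift --list
    timeshift --restore                 # interactive: pick a snapshot, then reboot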
You know that ZFS is a first-class filesystem for FreeBSD, right? FreeBSD’s ports collection contains zfstools, which is a tiered snapshot system controlled by ZFS properties and compatible with many ZFS replication systems. The only pandering I do with it is to wrap its script (and the script of the replication) with lock(1).
The default config is good, but I often extend the 15-minute snapshots to keep 8 rather than 4. More context.
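For anyone curious, zfstools is driven by the com.sun:auto-snapshot property plus a cron table; roughly like this, from memory of the port’s sample crontab, so verify against the installed documentation (the frequent line is the one bumped from 4 to 8 here):

    zfs set com.sun:auto-snapshot=true tank    # children inherit; opt out with =false

    # /etc/crontab entries: interval name followed by how many snapshots to keep
    15,30,45 * * * * root /usr/local/sbin/zfs-auto-snapshot frequent  8
    0        * * * * root /usr/local/sbin/zfs-auto-snapshot hourly   24
    7        0 * * * root /usr/local/sbin/zfs-auto-snapshot daily     7
    14       0 * * 7 root /usr/local/sbin/zfs-auto-snapshot weekly    4
    28       0 1 * * root /usr/local/sbin/zfs-auto-snapshot monthly  12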
I implemented hourly ZFS snapshots at my company for all our engineering data, exposed via VSS over SMB… fewer headaches for me to deal with, since all the engineers can grab whatever snapshot they need from the last month, and I only get involved if *old old* data needs recovery.
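That kind of setup is usually Samba’s shadow_copy2 VFS module pointed at the .zfs/snapshot directory, so Windows clients see the snapshots under “Previous Versions”; roughly like this (share name, path, and especially the format string are placeholders that have to match your actual snapshot naming scheme):

    [engineering]
        path = /tank/engineering
        read only = no
        vfs objects = shadow_copy2
        shadow:snapdir = .zfs/snapshot
        shadow:sort = desc
        shadow:format = zfs-auto-snap_hourly-%Y-%m-%d-%H%M
        shadow:localtime = yes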
I’ve migrated my current setup of luks2+lvm+ext4 and Unified Kernel Images to zfsbootmenu+zfs with encrypted datasets and I can’t complain.
I have more raw speed than I had with LUKS, and OpenZFS offers systemd timers for snapshots that read ZFS attributes on each dataset and create daily snapshots for me; I couldn’t be happier (it takes snapshots of the root and home datasets, and skips the /var/log and /var/cache stuff).
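I don’t know offhand which package ships those timers, but the property-driven part is easy to reproduce with a few lines of shell hung off a systemd timer or cron job; a rough sketch, with a made-up local:autosnap property and placeholder dataset names:

    # tag datasets in or out; ZFS user properties (anything with a colon) are inherited
    zfs set local:autosnap=true  rpool/ROOT rpool/home
    zfs set local:autosnap=false rpool/var/log rpool/var/cache

    # what the daily job runs: snapshot every dataset tagged "true"
    DATE=$(date +%Y-%m-%d)
    zfs list -H -o name,local:autosnap -t filesystem | \
      awk '$2 == "true" { print $1 }' | \
      while read -r ds; do zfs snapshot "${ds}@daily-${DATE}"; done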
The Nvidia module, during those weird 550-version kernel updates, panicked and left my laptop half-dead, and I had to resort to a live USB to recover it. Now, with snapshots, everything becomes easy if I ever need to boot into a pristine d-1 state. Having zfsbootmenu as the only EFI binary here helped a lot, and I can keep the kernel and initrd inside the encrypted dataset.
Btrfs required a plain /boot, so it was a no for me. I’ve lost some unimportant data (Steam games, basically) on my btrfs drive, so I could not trust it as my root filesystem.
nwildner,
I have read about corruption issues in the past, but not recently. I have a long term plan to switch in the back of my mind so I’m genuinely interested if you can provide more details here.