I Fat-Fingered My Mail Server Into Oblivion

May 1, 2026

Last week I wrote about setting up a three-node Talos Kubernetes cluster on my Proxmox host. What I left out was what happened the day after I got it running.

Today I tried checking my mail. I don't do that too often -- my mail server only receives my mailing list mail, and I prefer to read it with `mutt`. It's only reachable from my home network, so it's not something I do multiple times per day. Today I tried, and I couldn't connect to the mail server. I couldn't even find it. I checked the hypervisor, and the VM was just gone.

No backups

I run nightly backups on all my Proxmox VMs to a NAS. All of them, it turned out, except the mail VM and the spam VM. When I set them up I never added them to the backup job. The backup job covers VM IDs explicitly -- it does not pick up new VMs automatically -- and I just never went back and added them.
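If you have the same gap, a one-off `vzdump` run is a cheap stopgap while you fix the scheduled job. A sketch -- the VM IDs and storage name here are placeholders, not my actual setup:

```shell
# One-off backup of the two unprotected VMs (IDs are hypothetical).
# --mode snapshot backs them up while they keep running.
vzdump 110 111 --storage nas-backups --mode snapshot --compress zstd

# The scheduled backup jobs live in the datacenter config; listing the
# file shows exactly which VM IDs each job covers:
cat /etc/pve/jobs.cfg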

So there were no backups. Not one, for either machine, going back to when I originally built them.

This is where most stories like this get bad. Mine did not, and the reason is Ansible.

Everything was in Ansible

A few weeks ago I wrote up the entire mail stack as an Ansible playbook. Postfix on the mail VM, Dovecot, a small Docker-based policy server, the spam VM with rspamd and redis and a local BIND resolver, the FreeBSD jails on the MX host. Everything. All the lookup tables, all the config templates, all the secrets in a vault file. The playbook can build the full stack from nothing.

I did this partly because the setup is complicated enough that I knew I would not remember how it worked in six months, and partly because I wanted to be able to explain it to someone else. I did not do it because I expected to need it two weeks later.

But I needed it two weeks later.

The recovery

The mail stack runs on Debian. My Proxmox host had no Debian image on it -- I do most things on NixOS and FreeBSD lately -- so the first step was downloading the Debian 12 cloud image directly to the hypervisor.
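If you want the same shortcut, the image can be pulled straight onto the hypervisor. The URL follows Debian's cloud-image layout; check cloud.debian.org for the current filename:

```shell
# Debian 12 ("bookworm") generic cloud image, qcow2, cloud-init ready
wget https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2
```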

From there, creating the VMs with cloud-init took a few commands: static IPs, SSH keys from the hypervisor's root authorized_keys, and that was it. Boot them up, wait for them to respond to ping, then:
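For the record, the `qm` incantation looks roughly like this -- the VM ID, storage name, and addresses are placeholders for whatever your environment uses:

```shell
# Create the VM shell and import the cloud image as its disk
qm create 110 --name mail --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
qm importdisk 110 debian-12-genericcloud-amd64.qcow2 local-lvm
qm set 110 --scsi0 local-lvm:vm-110-disk-0 --ide2 local-lvm:cloudinit --boot order=scsi0

# Cloud-init settings: static IP plus the hypervisor's root SSH keys
qm set 110 --ipconfig0 ip=192.168.1.25/24,gw=192.168.1.1 \
           --sshkeys /root/.ssh/authorized_keys
qm start 110
```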

```
ansible-playbook site.yml -e @secrets.yml --limit mail,spam
```

Forty-five tasks on the mail VM, twenty-eight on the spam VM. Zero failures. Postfix, Dovecot, rspamd, redis, BIND, Docker, NFS mount, all the lookup tables, sieve scripts, the policy server container -- all of it, built from scratch, in about four minutes of playbook time plus however long the package installs took.
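One habit worth mentioning: the same invocation takes `--check --diff`, which reports what would change without touching anything -- a cheap way to rehearse a playbook against live hosts before you ever need it in anger. Assuming the same `site.yml` and `secrets.yml` layout as above:

```shell
# Dry run: show pending changes without applying any of them
ansible-playbook site.yml -e @secrets.yml --limit mail,spam --check --diff
```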

There were a few things to clean up manually afterward. The VMs needed to rejoin the VPN and came back with new addresses, so I had to update the configs on the MX host that reference those addresses. The TLS cert needed to be pushed from the machine that manages it. Stale host keys in my known_hosts file. The usual friction of a machine that used to exist and now exists again with different keys.

But the mail stack itself? Ansible handled it completely.

The feeling

I have tested backups before. Restored a VM to check that the backup was good, poked around, deleted the test restore. It feels low-stakes because the real machine is still running. You are not actually depending on it.

This was different. The machines were gone. My email was down. The Ansible playbook was not a backup I was testing in a controlled way -- it was the only thing between me and a full manual rebuild from documentation and memory. And it just worked.

That is a feeling I have not experienced before with infrastructure. Not relief exactly, more like genuine surprise at having built something that held up when it actually mattered. The playbook had never been run in anger before. It had never had to actually recreate anything. It did, and it was fine.

Lessons

Fat-fingering things is going to happen. I have been doing this long enough to know that expecting otherwise is wishful thinking. What matters is what happens after.

Two things I am fixing now. First, mail and spam are in the backup job. New VMs go in the backup job the same day they are created, full stop. Second, I am documenting which VMs are in the backup job and checking it any time I add a new one, because apparently trusting myself to remember is not a process that works.

The broader point is the one I already believed but now believe with more conviction: write the Ansible playbook before you need it. Not as a backup -- as documentation with the side effect of being executable. The discipline of expressing a setup in Ansible forces you to actually understand it, and when something goes wrong you find out whether you understood it correctly.

Turns out I did. That was a good feeling.
