Teaching My Proxmox Host to Restart Itself

May 5, 2026

My main Proxmox host -- a Ryzen-based machine -- has been locking up. Hard. Not kernel panics, not OOM kills, not any kind of graceful failure. Just complete unresponsiveness: no ping, no SSH, nothing on the console. The only recovery is a power button hold.

This has been happening for a few months. Intermittent enough to be annoying, frequent enough that I can't ignore it. The machine hosts everything I run -- VMs, containers, the Kubernetes cluster I just wrote about -- so when it goes down, everything goes down with it.

First Suspect: NFS

The machine mounts an NFS share from my NAS. NFS has a long history of being the thing that locks up Linux when the server hiccups, and I had made some changes to the mount options around the time the freezes started. The symptom of a hung NFS mount -- process stuck in D state, kernel waiting forever for a response that isn't coming -- can sometimes pull an entire machine under if enough processes pile up behind it.
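
A quick way to see whether that's happening -- while the machine is still responsive enough to run anything -- is to look for processes stuck in uninterruptible sleep and the kernel symbol they're blocked in. Nothing here is specific to my setup; it's stock ps output:

# list processes in D state (uninterruptible sleep) and what they're
# waiting on -- NFS victims tend to show up blocked in
# rpc_wait_bit_killable or similar
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'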

So I changed the mount options. The original entry in /etc/fstab was a basic hard mount:

nas:/volume1/data  /mnt/nas  nfs  defaults  0  0

I switched to a soft mount with an explicit timeout:

nas:/volume1/data  /mnt/nas  nfs  soft,timeo=30,retrans=3,rsize=65536,wsize=65536  0  0

With soft, failed NFS operations return an error to the calling process instead of retrying forever. timeo counts in tenths of a second, so timeo=30 gives each request a three-second window, and retrans=3 allows three retries before the operation errors out. The theory was that a hung NFS server could no longer wedge the kernel indefinitely. Remounted it, watched carefully.
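
One check worth doing after a remount like this, since the kernel can negotiate values that differ from what fstab asked for: read back the options actually in effect. Both commands below are standard tools (nfsstat ships with nfs-common on Debian-based systems):

# show the live mount options as the kernel sees them
findmnt -o TARGET,OPTIONS /mnt/nas

# or, with more NFS-specific detail
nfsstat -m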

It still locked up.

Second Attempt: Firmware

AMD ships stability fixes for Ryzen platforms through AGESA firmware updates -- which bundle CPU microcode -- and motherboard vendors roll them into BIOS releases. Hard freeze bugs -- especially ones with no kernel panic, no log, nothing -- have historically been addressed this way. It's not glamorous, but flashing a BIOS is a lot less disruptive than replacing hardware.

First I checked what was running:

dmidecode -s bios-version
# 2.A40
dmidecode -s bios-release-date
# 01/20/2026

The manufacturer had a newer release. I downloaded the latest, put it on a FAT32 USB drive, and flashed it from the BIOS's built-in update utility. The process took about three minutes and the machine came back clean.

It still locked up.

At that point I was running out of easy answers. The usual suspects for hard freezes on Ryzen that survive a BIOS update are C-states, the power management driver, and RAM. C-states are fixable with a kernel parameter:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1 idle=nomwait"

# then regenerate the grub config and reboot
update-grub
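
After rebooting, it's worth confirming the parameters actually took. /proc/cmdline shows what the kernel booted with, and the cpuidle sysfs tree shows which idle driver and states are in play -- with max_cstate=1 the deeper states should be gone or unused:

# confirm the parameters made it onto the kernel command line
cat /proc/cmdline

# which cpuidle driver is active, and which states it still offers
cat /sys/devices/system/cpu/cpuidle/current_driver
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name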

That's still on the list. So is setting up netconsole to capture whatever the kernel is doing right before things go dark -- it streams kernel messages over UDP to another machine in real time, so you get a record even if the local disk never gets written to:

# on the machine that's freezing
# syntax: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/[dst-mac]
modprobe netconsole netconsole=6666@192.168.1.HOST/eth0,514@192.168.1.RECEIVER/

# on the receiving machine, listen on UDP 514
nc -u -l 514
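
If netconsole earns a permanent spot in the debugging setup, the same parameters can be persisted across reboots with a modprobe.d options file plus modules-load.d -- the same mechanism I use for the watchdog module below:

# module options live in modprobe.d...
echo 'options netconsole netconsole=6666@192.168.1.HOST/eth0,514@192.168.1.RECEIVER/' > /etc/modprobe.d/netconsole.conf

# ...and modules-load.d loads it at boot
echo 'netconsole' > /etc/modules-load.d/netconsole.conf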

And then there's RAM. I have not run memtest86+ yet, and I am in no hurry to, for reasons I'll get to.

The Bandaid

The right thing to do here is keep going down the list. Disable C-states. Run memtest. Set up netconsole. But I'm leaving for Seattle next week and I don't want to come back to a homelab that has been down for five days because the host froze on day two.

What I did instead was set up a hardware watchdog.

sp5100_tco

AMD platforms include a hardware timer in the chipset called the TCO watchdog, exposed on Linux as the sp5100_tco module. The key property of a hardware watchdog is that it runs independently of the operating system -- it's a timer on the chip that keeps counting regardless of what the CPU is doing. Software kicks the timer periodically to say "still alive." If the kicks stop -- because the kernel is hung, because the CPU is stuck in a bad C-state, because something has gone deeply wrong -- the timer expires and the hardware resets the machine.
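
The kicking interface is deliberately simple: any write to the device counts as a kick. A watchdog daemon is conceptually just the loop below -- purely illustrative, don't actually run it alongside systemd's keepalive, because two writers on one watchdog device is a good way to get surprise reboots:

# conceptual sketch only -- this is what any watchdog daemon boils down to
while true; do
    echo 1 > /dev/watchdog0    # any write resets the countdown
    sleep 30                   # must stay comfortably under the timeout
done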

First, check whether the module is already loaded:

lsmod | grep sp5100
# sp5100_tco             16384  0

If it's not there, load it and verify the watchdog device appeared:

modprobe sp5100_tco
ls -l /dev/watchdog0
# crw------- 1 root root 10, 130 May  5 14:22 /dev/watchdog0

Persist it across reboots:

echo 'sp5100_tco' > /etc/modules-load.d/sp5100_tco.conf

Now tell systemd to kick the watchdog. These go in /etc/systemd/system.conf:

[Manager]
RuntimeWatchdogSec=60
RebootWatchdogSec=10min

RuntimeWatchdogSec=60 tells systemd to program the hardware timeout to 60 seconds and then send a keepalive every half that interval. If systemd dies or the kernel freezes long enough that no ping lands inside the 60-second window, the hardware resets the machine. RebootWatchdogSec=10min gives a 10-minute window for a clean reboot to complete before the hardware forces one.

Apply it without rebooting:

systemctl daemon-reexec
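
The journal gives a quick sanity check that systemd actually picked up the device -- after the re-exec there should be lines mentioning the hardware watchdog and the configured timeout (exact wording varies across systemd versions):

journalctl -b | grep -i watchdog
# Using hardware watchdog 'SP5100 TCO timer', version 0, device /dev/watchdog0
# Watchdog running with a timeout of 1min.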

Verify it's actually running:

wdctl /dev/watchdog0

The output should look something like:

Device:        /dev/watchdog0
Identity:      SP5100 TCO timer [version 0]
Timeout:       60 seconds
Pre-timeout:    0 seconds
Timeleft:      56 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Supports setting timeout       0           0

The important line is KEEPALIVEPING=1 -- that confirms systemd is sending the keepalive and the hardware timer is running. The Timeleft counter counts down and resets each time the ping comes in. If it ever hits zero, the machine reboots.
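
For the brave, there's a way to test the whole chain end-to-end: sysrq can crash the kernel on demand, and with no kdump or panic=reboot configured, the machine just sits in the panic until the keepalives stop arriving and the hardware fires. Only worth doing when nothing important is running:

# deliberately crash the kernel -- keepalives stop, and the watchdog
# should reset the machine within the 60-second window
echo c > /proc/sysrq-trigger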

What's Left

I've been consolidating my task tracking into a distributed todo.txt setup -- a plain text file, synced across machines, with priority tags and project markers. It's not sophisticated, but it's always available and it survives anything. The remaining investigation tasks for this system look like this:

(A) Run memtest86+ on Ryzen system
(B) Try processor.max_cstate=1 idle=nomwait kernel params
(B) Try amd_pstate=passive kernel param
(C) Set up netconsole to capture kernel logs before freeze
(C) Attach PiKVM for remote power control
(C) Set up external watchdog script via PiKVM reset_hard

The A-priority item is memtest86+. It has been A-priority for a while. I keep not running it.

The Thing I'm Not Looking Forward To

The reason I keep not running memtest is that I'm a little afraid of what it's going to tell me.

Hard freezes with no kernel panic, no log entry, nothing -- that's a profile that fits bad RAM. The CPU doesn't panic because the CPU is fine. The kernel doesn't log anything because the memory it would write to is what's lying to it. The machine just stops making sense and locks up. It's also the profile for bad C-states and bad AGESA, which is why I chased those first. But I've updated the BIOS and the freezes continue, and at some point you run out of cheaper explanations.

I will run memtest. I just have not done it yet, partly because the machine needs to be taken offline for hours to do it right, and partly because if it fails I have to buy new RAM. DDR5 prices are not what they were when I built this system. Whatever they are now, I'd rather not find out until I have to.

This is not rational. I know it's not rational. Bad RAM doesn't get better because you haven't confirmed it's bad yet.

The Feeling

This is a bandaid. I know it's a bandaid. The machine is still locking up for reasons I haven't fully nailed down, and the most likely remaining cause is one I've been avoiding. A watchdog doesn't fix any of that -- it just means the machine comes back faster. The next freeze will still happen. VMs will still go down hard instead of cleanly. Whatever was happening when the freeze hit will be interrupted and unclean.

But there's something to be said for a homelab that fails and recovers versus one that fails and stays down. I'm not on call for my own infrastructure. I'm not going to be staring at dashboards from Seattle hoping nothing goes wrong. The watchdog means that if there's a freeze, things come back on their own within a minute, and by the time I'm home there's nothing to rescue.

The memtest can wait until I'm back. Probably.

