It happens, hopefully not often, but it cannot be ruled out entirely: virtual machines crash from time to time, and that is no pleasure for sysadmins. A crash cannot always be prevented, but a stopped VM can at least be started again automatically.

The following script starts every existing VM that has the “Start at boot” option set to yes but is currently stopped. How often the check runs can be changed in watchdog_vm.timer. VMs can also be excluded if needed; here I did that with the tags feature. If a VM has the “Start at boot” box ticked but carries one of the example tags used in the script (test or fuss), it is skipped. You can easily adapt the .sh file to your needs, as this is only meant as inspiration.

watchdog.sh

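#!/bin/bash
# Start every VM that has "Start at boot" enabled but is currently stopped.
# VMs carrying one of the excluded tags are skipped.

# Matches "test" or "fuss" only as a whole tag, not as part of a longer tag.
exclude_re='(^|[;, ])(test|fuss)($|[;, ])'

# Skip the header line of `qm list` and keep the VMID column.
for vmid in $(qm list | awk 'NR>1 {print $1}')
do
    config=$(qm config "$vmid")
    onboot=$(awk '/^onboot:/{print $2}' <<< "$config")
    tags=$(awk '/^tags:/{$1=""; print $0}' <<< "$config")
    if [ "$onboot" == "1" ] && [[ ! "$tags" =~ $exclude_re ]]; then
        status=$(qm status "$vmid" | awk '{print $2}')
        if [ "$status" == "stopped" ]; then
            # Only report success if qm start actually succeeded.
            if qm start "$vmid"; then
                echo "Started VM $vmid"
            else
                echo "Failed to start VM $vmid"
            fi
        elif [ "$status" == "running" ]; then
            echo "VM $vmid already running"
        else
            echo "Unexpected status '$status' for VM $vmid, not starting it"
        fi
    fi
done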
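
To try the selection logic without touching real VMs, qm can be replaced by a stub function. The following is only a minimal sketch and not part of the setup itself: the VM IDs, names and tags are made-up sample data, is_excluded and vms_to_start are hypothetical helper names, and it assumes that tags are stored as a semicolon-separated list and should match as whole tags.

```shell
#!/bin/bash
# Dry run of the watchdog selection logic with a stubbed `qm`.
# All VM IDs, names and tags below are made-up sample data.

# Stub that imitates the output of `qm list`, `qm config` and `qm status`.
qm() {
    case "$1 $2" in
        "list ")
            printf '%s\n' \
                "      VMID NAME     STATUS     MEM(MB)    BOOTDISK(GB) PID" \
                "       100 web      stopped    2048       32.00        0" \
                "       101 lab      stopped    2048       32.00        0" \
                "       102 db       running    4096       64.00        1234" ;;
        "config 100") printf 'onboot: 1\n' ;;
        "config 101") printf 'onboot: 1\ntags: prod;test\n' ;;
        "config 102") printf 'onboot: 1\ntags: latest\n' ;;
        "status 100"|"status 101") echo "status: stopped" ;;
        "status 102") echo "status: running" ;;
    esac
}

# Return 0 if the semicolon-separated tag list contains an excluded tag.
# Assumption: whole-tag matching is what is wanted ("latest" is not "test").
is_excluded() {
    local tags="$1" tag
    local IFS=';'
    for tag in $tags; do
        case "$tag" in
            test|fuss) return 0 ;;
        esac
    done
    return 1
}

# Print the IDs of the VMs the watchdog would start.
vms_to_start() {
    local vmid config onboot tags status
    for vmid in $(qm list | awk 'NR>1 {print $1}'); do
        config=$(qm config "$vmid")
        onboot=$(printf '%s\n' "$config" | awk '/^onboot:/{print $2}')
        tags=$(printf '%s\n' "$config" | awk '/^tags:/{print $2}')
        status=$(qm status "$vmid" | awk '{print $2}')
        if [ "$onboot" = "1" ] && ! is_excluded "$tags" && [ "$status" = "stopped" ]; then
            echo "$vmid"
        fi
    done
}

vms_to_start
```

With this sample data only VM 100 is printed: 101 carries the test tag, and 102 is already running (its latest tag does not count as test).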

watchdog_vm.service

[Unit]
Description=Check the status of every VM with the 'onboot' parameter and start it if it is not running
After=network.target

[Service]
Type=oneshot
ExecStart=/root/watchdog.sh

[Install]
WantedBy=multi-user.target

watchdog_vm.timer

[Unit]
Description=Timer for watchdog_vm.service

[Timer]
OnBootSec=15min
OnUnitActiveSec=2min
Unit=watchdog_vm.service

[Install]
WantedBy=timers.target

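Save the script as /root/watchdog.sh and make it executable (chmod +x /root/watchdog.sh), place both unit files in /etc/systemd/system/, then reload systemd and enable and start the timer: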

systemctl daemon-reload
systemctl enable watchdog_vm.timer
systemctl start watchdog_vm.timer