Flumble wrote:I can let the monitor purposefully stop with an error code to trigger OnFailure, but that's as ugly as having a separate script to send emails.
What's ugly about that? Your watchdog reports failure via exit code, and systemd does the notification as configured. That seems like it's working as intended.
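To make that concrete, here is a minimal sketch of such a watchdog in Python. It only checks whether a TCP connection can be opened, which is an assumption about what "failure" means here. The hostname is made up, and 25565 is simply Minecraft's default port.

```python
import socket

# Assumed values for illustration only.
HOST = "minecraft.example.org"  # hypothetical hostname
PORT = 25565                    # Minecraft's default port


def server_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def main() -> int:
    """Return 0 if the server answers, 1 otherwise."""
    return 0 if server_reachable(HOST, PORT) else 1
```

End the script with `raise SystemExit(main())` and run it from a oneshot service on a timer. A non-zero exit code puts the unit into the failed state, and systemd then starts whatever unit you named in OnFailure.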
It would be cleaner to have the OnFailure on the machine where the minecraft server is running, but apparently that's not an option?
Also, what exactly is causing the minecraft server to fail? Does the server process crash? Does it enter a deadlock? Does the whole machine crash and fail to reboot? Are there network connectivity issues? Something else entirely? Depending on the problem, I'd suggest different fixes.
My eye is caught by the influxdb TICK stack now, mostly because it comes with docker images and a simple logging plugin that runs a script periodically and takes care of the output by itself ...and you can configure it to send emails when something goes wrong.
That seems like overkill. If you seek reliability, then complexity is your enemy. Unless you're willing to ignore updates, security fixes and general maintenance, dumping a docker image somewhere is usually more trouble than it's worth. It's certainly more brittle than a sendmail script. I may be a pessimist, but I wouldn't be surprised if that monitoring solution has more downtime than the minecraft server you meant to monitor.
On my home network, I'm using collectd. It's still overkill for monitoring a single service, but it has plugins for many things I want to monitor (network bandwidth on my router, temperature of all CPUs, free disk space and SMART errors etc), supports custom plugins (e.g. I have a few raspberry pis with DHT22 temperature/humidity sensors) and it has threshold-based alerts, including things like "only notify if the server couldn't be reached 3 times in a row".
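If you'd rather keep a plain script, the "3 times in a row" rule is easy to reproduce yourself. The following is a rough Python sketch of the idea, not how collectd implements it. The class name `FailureStreak` and its `record` method are made up for illustration.

```python
class FailureStreak:
    """Track consecutive failed checks (hypothetical helper, not a collectd API)."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one check result.

        Returns True only when the failure streak first reaches the
        threshold, so a long outage produces a single notification.
        """
        if ok:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        return self.failures == self.threshold


# Example run: two failures, a success, then four failures in a row.
streak = FailureStreak(threshold=3)
checks = (False, False, True, False, False, False, False)
notify = [streak.record(ok) for ok in checks]
```

In this run only the sixth check, the third failure in a row, would send a notification. The first two failures are forgiven by the success in between, and the seventh check stays quiet because the alert already fired.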
Distributed logging is supported, e.g. every computer runs a collectd instance, which reports the data to a central server where it gets written to some kind of database. Collectd is flexible about its storage backends. IIRC influxdb works, but I'm using rrdtool for its simplicity and built-in graphing features.
Configuration of the whole thing takes a while, but installation is trivial (most distributions have a package), updating is trivial (the configuration format is very stable; only once did I have to manually adjust something) and the overhead is low enough that I can just start an instance on each computer I need to monitor, including the low-powered raspberry pis.