I do a lot of things with passive checks, especially things I want to keep an eye on without actually watching them all the time.
For example, consider the following service template:
define service {
    use                     bidaily-service
    name                    bidaily-service-passive
    active_checks_enabled   0       ; service is passive only
    passive_checks_enabled  1       ; accept passive results (seems redundant, but here for clarity)
    check_freshness         1       ; ...but check freshness to catch when the service isn't reporting in
    freshness_threshold     129600  ; == 36 hours: echo 36 60 \* 60 \* p|dc -- to catch 2 failures
    check_command           panic-run-in-circles-shouting ; command to run when the freshness check fails
    register                0
}
… and an instance of that template:
define service {
    use                     bidaily-service-passive ; passive: triggered by /etc/cron.daily/mirror-idisk
    host_name               localhost
    service_description     iDisk Sync
}
In this case, I also have a script /etc/cron.daily/mirror-idisk that backs up my Apple iDisk (I love backups) and finishes with:
date "+[%s] PROCESS_SERVICE_CHECK_RESULT;localhost;iDisk Sync;0;iDisk Sync OK %Y-%m-%d_%H%M" >> /var/nagios/rw/nagios.cmd
As you can see, the script does its work and then drops a successful return code (0, meaning OK) into Nagios' external command file; Nagios simply shows it with a happy green marker on the status page.
What happens if the script’s action fails? It submits a bad result instead, and Nagios reports that.
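The success/failure split above can be sketched as a few shell helpers. This is a minimal sketch, not the actual contents of /etc/cron.daily/mirror-idisk: the function names `format_result`, `submit_result` and `report_status` and the `NAGIOS_CMD` variable are hypothetical, and it assumes a POSIX shell whose `date` supports `+%s`. It writes the standard `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output` line, where the service name must match `service_description` exactly and the code is 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

```shell
#!/bin/sh
# Hypothetical helpers for reporting a passive check result to Nagios.
# The command-file path is the one from the post; override NAGIOS_CMD to test.
NAGIOS_CMD="${NAGIOS_CMD:-/var/nagios/rw/nagios.cmd}"

# format_result HOST SERVICE CODE OUTPUT [EPOCH]
# Prints one external-command line. SERVICE must match the
# service_description exactly. CODE: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN.
# EPOCH defaults to the current time.
format_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "${5:-$(date +%s)}" "$1" "$2" "$3" "$4"
}

# submit_result HOST SERVICE CODE OUTPUT
# Appends the formatted line to the Nagios command file.
submit_result() {
    format_result "$1" "$2" "$3" "$4" >> "$NAGIOS_CMD"
}

# report_status EXIT_STATUS
# Maps the backup's exit status to OK (0) or CRITICAL (2) for "iDisk Sync".
report_status() {
    if [ "$1" -eq 0 ]; then
        submit_result localhost "iDisk Sync" 0 "iDisk Sync OK $(date +%Y-%m-%d_%H%M)"
    else
        submit_result localhost "iDisk Sync" 2 "iDisk Sync FAILED (exit $1)"
    fi
}
```

At the end of the cron script you would run the backup step and then call `report_status $?`, so a nonzero exit status becomes a CRITICAL result rather than a silently missing report.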
What happens if the script itself has a bug, chokes, and dies? Nagios sees no result for 36 hours and runs the “panic-run-in-circles-shouting” command. In my case, that’s another command that puts a failure into a queue, but that’s a bit tangential.
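To make the 36-hour arithmetic concrete, here is a simplified model of the freshness decision in shell. It is an illustration only, not how Nagios implements it (Nagios also adds a small extra margin, the `additional_freshness_latency` setting in nagios.cfg), and the names `FRESHNESS_THRESHOLD` and `is_stale` are hypothetical.

```shell
#!/bin/sh
# Simplified model of a Nagios freshness check (illustration, not Nagios code).
# 36 hours in seconds; same value as the dc one-liner in the template (129600).
FRESHNESS_THRESHOLD=$((36 * 60 * 60))

# is_stale LAST_RESULT_EPOCH NOW_EPOCH [THRESHOLD]
# Exits 0 (true) when more than THRESHOLD seconds have passed since the
# last passive result, i.e. when Nagios would run the check_command.
is_stale() {
    [ $(( $2 - $1 )) -gt "${3:-$FRESHNESS_THRESHOLD}" ]
}

last=1700000000
# 24 hours later: the next daily run is only just due, so still fresh.
if is_stale "$last" $((last + 86400)); then echo stale; else echo fresh; fi
# Just over 36 hours later: stale, so the panic command would fire.
if is_stale "$last" $((last + 129601)); then echo stale; else echo fresh; fi
```

With a daily cron job, the last result is normally about 24 hours old when the next one arrives, comfortably inside the threshold; if a run dies, the panic command fires once the gap exceeds 36 hours, roughly 12 hours after the missed run.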
This is quite effective, especially since my Nagios is tied to my Jabber and can escalate to a Twitter feed that reaches me by SMS. I know that errors will reach me, so I never have to check the status screen.