May 09

Recently we hit a strange situation where certain critical users could not log into an FTP server. Of course, Icinga is helping me check this going forward:

First, define a service check:

define service{
use bidaily-service
host_name ftp.example.com
service_description FTP Login ftp.example.com-scott
check_command check_ncftpls!'ftp://scott:tiger@ftp.example.com/'
notifications_enabled 0
}

Next, catch the odd case where the script itself is missing (in the past, the payload of the Nagios packages has added and dropped parts that I need):

define servicedependency{
dependent_host_name ftp.example.com
dependent_service_description FTP Login ftp.example.com-scott
host_name localhost
service_description Runnable check_ncftpls
execution_failure_criteria w,c,u
notification_failure_criteria w,c,u
}
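
The "Runnable check_ncftpls" service that this dependency points at is not shown here. As a minimal sketch of what such a check could run, assuming a hypothetical helper named check_runnable (the name and the messages are mine, not part of any Nagios package):

```shell
#!/bin/bash
# Hypothetical helper: report whether a plugin script is present and executable.
# Prints a Nagios-style status line and returns 0 (OK) or 2 (CRITICAL).
check_runnable() {
    local plugin="$1"
    if [ -x "${plugin}" ]; then
        echo "OK ${plugin} is runnable"
        return 0
    fi
    echo "FAIL ${plugin} is missing or not executable"
    return 2
}
```

A wrapper would call it with the plugin's full path and exit with its return code, so that a missing script shows up as a failure on localhost rather than as a bogus FTP failure.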

Finally, the script itself:

#!/bin/bash

NCFTPLS=$(which ncftpls) || { echo "FAIL ncftpls not found"; exit 2; }
test -x "${NCFTPLS}" || { echo "FAIL ${NCFTPLS} not runnable"; exit 2; }

# run the listing quietly; only the exit status matters
"${NCFTPLS}" "$@" >/dev/null 2>&1 && { echo "OK"; exit 0; }

echo "${NCFTPLS} failed"
exit 2

Now, I could’ve/should’ve used the hostname in the check itself, but I was more interested in just getting it there. I will probably clean it up someday, make it more reusable, but there it is.

Note that I did not establish a dependency on the ncftpls-bearing package in my RPM hierarchy, simply because it’s perfectly fine for the “runnable” check to fail: the script will then never hit the FTP server until it is safely runnable. Sure, it’s listed as a failure, but it’s a choice against a huge dependency that typically brings in 100 packages of inconsistent Perl and such (hey, “just hit CPAN”, they’ll do that in datacenters, sure).

May 07

I’m updating my LDAP patch for Nagios based on the most recent release; I’m also doing it as a git repo so that it’s reusable in a more independent way.

First, there are a few non-LDAP-specific changes needed:
1) commit 06d6ca4e7dfc44b1f93dcd836625ec20a1bbc3f1 — use true/false rather than only 0/1 for booleans
2) commit b37f9f5cbc8cc93796ec68d7f7359634eca56ed3 — propagates EPOCH and BROKER build flags through specfile

Next, there are LDAP-specific changes:
1) commit 561f2521aac88244694dcd0ea264acaa3c6796a2 — read in the LDAP-based config as described in http://wiki.nagios.org/index.php/LDAP-Configured_Nagios

This is all available in git://git.chickenandporn.com/nagios.git

I haven’t ported over my test-harness, so it’s fairly unknown code right now. I’m using it, but shifting back to Icinga.

Jul 11

I do a lot of things using passive checks: they suit things I want to keep an eye on without actually watching all the time.

For example, consider the following:

define service {
        use                             bidaily-service
        name                            bidaily-service-passive
        active_checks_enabled           0        ; service is passive only
        passive_checks_enabled          1        ; enable passive which seems redundant but for clarity

        check_freshness                 1        ; ...but check freshness to catch when the service isn't reporting in
        freshness_threshold             129600   ; == 36 hours: echo 36 60 \* 60 \* p|dc    -- to catch 2 failures

        check_command                   panic-run-in-circles-shouting        ; command to be run when freshness fails
        register                        0
        }
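
The 36-hour threshold in the comment above is computed with dc; the same arithmetic can be sketched in plain shell (the variable names are only illustrative):

```shell
# 36 hours expressed in seconds, using shell arithmetic instead of dc.
hours=36
freshness_threshold=$(( hours * 60 * 60 ))
echo "${freshness_threshold}"   # prints 129600
```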

… and an instance of that template:

define service{
        use                             bidaily-service-passive        ; passive: triggered by /etc/cron.daily/mirror-idisk
        host_name                       localhost
        service_description             iDisk Sync
        }

In this case, I also have a script /etc/cron.daily/mirror-idisk that backs up my Apple iDisk (I love backups) and finishes with:

date "+[%s] PROCESS_SERVICE_CHECK_RESULT;localhost;iDisk Sync;0;iDisk Sync OK %Y-%m-%d_%H%M" >> /var/nagios/rw/nagios.cmd

As you can see, this script does its work, and drops a successful return code into Nagios; Nagios simply shows it with a happy green marker on the status page.

What happens if the script’s action fails? It gives a bad result, and Nagios reports that.
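
As a rough sketch of how a cron script might report either outcome, assuming a hypothetical helper called submit_result and a stand-in do_mirror for the real backup step (neither name comes from my actual script):

```shell
#!/bin/bash
# Hypothetical helper: append one passive service check result in the
# external-command syntax Nagios expects:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
submit_result() {
    local host="$1" service="$2" code="$3" output="$4"
    local cmdfile="${5:-/var/nagios/rw/nagios.cmd}"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "${host}" "${service}" "${code}" "${output}" >> "${cmdfile}"
}

# Usage sketch: do_mirror stands in for the real backup step (assumed name).
# if do_mirror; then
#     submit_result localhost "iDisk Sync" 0 "iDisk Sync OK"
# else
#     submit_result localhost "iDisk Sync" 2 "iDisk Sync FAILED"
# fi
```

The service description passed in has to match the one in the service definition exactly, or Nagios will not attach the result to the service.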

What happens if the script has an error and chokes and dies? Nagios sees no result for 36 hours, and executes the “panic-run-in-circles-shouting” command. In my case, that’s another command that puts a failure into a queue, but that’s a bit tangential.

This is quite effective, especially since my Nagios is tied to my Jabber and can escalate to a Twitter feed that reaches me by SMS. I know that errors will reach me, so I never have to check the status screen.

Sep 12

What does it mean to survive autoheader? Portability, easier maintainability.

BTW: The patch is right here: nagios-autoheader.patch

Autoheader is a tool that creates the header template (typically config.h.in) that goes with an autoconf-generated configure; autotools (i.e. autoreconf) tend to assume that if you’re using AC_CONFIG_HEADERS(), you have a header file generated from your configure.in or configure.ac.

FWIW, Nagios’ use of AC_CONFIG_HEADER(file1 file2 file3) is actually converted to AC_CONFIG_HEADERS(file1), but using the singular form confuses autoheader a bit.

Consider that maintaining twenty files is more difficult than maintaining one; maintaining two files is only slightly more difficult, but still is an entry-point for human error.

Driving on the left (former British colonies, for example) is harder after driving on the right (everywhere else). For the same reason, doing things in a way that differs from the mainstream makes it harder for others, who are used to the mainstream, to adapt. The corollary: moving toward the mainstream way allows more developers to maintain your work.

The other benefit of joining the mainstream is that you gain from how the mainstream has “moved on” and developed added benefits and utilities. For example, automake reduces the maintenance of makefiles, and gives you “make dist”. Nagios has a bunch of unusual scripts to maintain versions of things, but Autotools do that by defining them in configure.in and substituting at ./configure time. This ties into the “maintaining fewer files” point above, as well as doing things in the conventional manner.
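
To make the "define once, substitute at ./configure time" idea concrete, here is an illustration in plain shell of roughly what that substitution amounts to. It is not how configure is actually implemented; the template line and the macro name are assumptions for the example:

```shell
# Illustration only: mimic how ./configure fills in an @VARIABLE@ placeholder
# from a *.in template with a value defined once in configure.in.
PACKAGE_VERSION="3.2.0"   # in Autotools this value would come from configure.in
template='#define PROGRAM_VERSION "@PACKAGE_VERSION@"'
rendered=$(printf '%s\n' "${template}" | sed "s/@PACKAGE_VERSION@/${PACKAGE_VERSION}/g")
echo "${rendered}"
```

The point is that the version lives in exactly one place, and every generated file picks it up from there.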

Note that it makes no difference whether a project has done it a certain way since USL times — a new user sees it for the first time only when they first see it, with no regard for how long it’s been like that. This is to say that if it looks broken when the user first starts to work with it, it doesn’t matter how long it’s been broken, or if that method wasn’t considered “broken” when filing a stack of cards for batch-processing.

The small change I’ve done today allows Nagios to approach the current conventional method, and opens the path for further enhancements in a step-by-step progression of little changes at a time.

Aug 14

Despite a slooooow connection to a buildserver (and no, I don’t want to spend another 5 hours to set up a VM; I just wanted to get it done), I finally updated my Nagios/LDAP work to a “cvs update” of this weekend, which includes v3.2.0. I also edited the deliverable specfile so that the schemas for LDAP are included.

These schema files are the ones I use in actual testing.

Changes: version bumps:

  • bump nagios to 3.2.0

Changes: added items:

  • added dhcp.schema
  • added dnszone.schema
  • added nagios.schema

The build may still have a slew of warnings; I left some cruft in the code while I was looking for buy-in. I was initially shot down: apparently the core config inside Nagios is somewhat hallowed ground, and it might be wrong to edit that code. Instead, I should try to do it in a plugin, but in a plugin I would need to completely redefine the existing configuration code, and maintain it in parallel, or lose the existing configuration.

I want to emphasize: this adds capability; it does not replace anything. Without the ldap_server config, Nagios acts like normal. Undefining the build option means Nagios cleanly stops understanding LDAP. Maybe if it’s written here as well, someone will read it.

The build is available here: (20090814 refers to the CVS update date)

Raw sources: