Friday, January 15, 2010

Fun with Nagios, part 2: Check Yer Backups

Being the second in a series of occasional posts on neat things you can do with your Nagios monitor that you might not have thought about.

I've been bitten by missing backups too often over the years, to the extent that I don't really feel secure unless I've got copies of each important system in at least three places. That leads to a lot of backup files, so wouldn't it be handy to have a process that checks whether your backup files are up to date and sends you a loud warning if not?

Enter the trusty Nagios once again, in the guise of the check_file_age plugin. This plugin lets you check not only whether a file has been updated recently, but also that it has a minimum size; this way you can make sure that your backup put a new file in the right place and wrote something useful to the file.

Here's an example nrpe.cfg entry that runs on our backup server and checks the backup file from our JIRA server:

command[check_backup_jira]=/usr/local/nagios/libexec/check_file_age -c 86400 -w 86400 -C 500000 -W 500000 -f /NAS/Backups/eng-infrastructure-server/jira/`date +"%Y-%b-%d"`*.zip

The JIRA backup files are named for the date, so this command makes sure that a file named like "2010-Jan-15-*.zip" exists, is no older than 24 hours (86400 seconds), and is at least 500 KB (500000 bytes) in size.
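For anyone curious what that test boils down to, here's a rough shell sketch of the same age and size check. It is not the plugin's actual code: the function name check_backup is made up for illustration, it assumes GNU stat, and it only reports OK or CRITICAL (using the Nagios exit codes 0 and 2) rather than handling separate warning thresholds.

```shell
# Hypothetical stand-in for the age and size test, not the real plugin.
# Usage: check_backup FILE MAX_AGE_SECONDS MIN_SIZE_BYTES
check_backup() {
    file=$1
    max_age=$2
    min_size=$3

    if [ ! -f "$file" ]; then
        echo "CRITICAL: $file not found"
        return 2
    fi

    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: modification time, epoch seconds
    size=$(stat -c %s "$file")    # GNU stat: size in bytes
    age=$((now - mtime))

    if [ "$age" -gt "$max_age" ] || [ "$size" -lt "$min_size" ]; then
        echo "CRITICAL: $file is $age seconds old and $size bytes"
        return 2
    fi

    echo "OK: $file is $age seconds old and $size bytes"
    return 0
}
```

Calling it with a file, 86400 and 500000 applies the same thresholds as the nrpe.cfg entry above. A return value of 2 is what Nagios treats as CRITICAL.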

Et voila - now I have a sanity check that the backup process has run, has put the backup file in the right place, and that the file contains something.

The next step I'm working on is to check that each backup will actually restore to something useful - after all, as this Joel on Software article attests, backups are all well and good, but not much use if you can't get back what you need from them.

That's a subject for a future post, but I'm thinking that it should be straightforward, at least for our web apps like JIRA and Confluence. I should be able to restore the backup to a standby version of the web app running on a different machine and have a Nagios check_http or WebInject test verify that the content has been updated. I'd also get the benefit of being able to switch over to the standby with minimal downtime; looking forward to trying this out!

Entropy at Work

Like anyone who studied physics at school or college I knew the concept of entropy, but I only found out this week that Linux systems have their own version of the phenomenon.

I was setting up an RPM repository and wanted to create a GPG key for signing the repository files. Near the end of the key setup, I got this message:

Not enough random bytes available. Please do some other work to give
the OS a chance to collect more entropy! (Need 276 more bytes)

Entropy, eh? Well, according to the physics I remember, the entropy of the universe is continually increasing; however, it turns out that a Linux system doesn't qualify as its own universe and its entropy can go up and down, as evidenced by:

watch cat /proc/sys/kernel/random/entropy_avail

I could see the entropy value merrily going up and down between near zero and a couple of thousand, seemingly at random. After watching in fascination for a while, I developed the theory that the universe must comprise all of the Linux machines anywhere, merrily swapping their entropy bytes in some vast, complex dance which mere mortals cannot contemplate.
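If you'd rather keep a record of those swings than stare at the screen, something like the following would do. It's only a sketch: the function name sample_entropy and its arguments are made up for illustration, and the optional second argument exists purely so the function can be pointed at a different file.

```shell
# Print a timestamped reading of the entropy estimate, one per second.
# Usage: sample_entropy [COUNT] [FILE]
# COUNT defaults to 5; FILE defaults to the /proc path shown above.
sample_entropy() {
    count=${1:-5}
    file=${2:-/proc/sys/kernel/random/entropy_avail}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s %s\n' "$(date +%H:%M:%S)" "$(cat "$file")"
        i=$((i + 1))
        # Pause between readings, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep 1
        fi
    done
}
```

Running "sample_entropy 60 >> entropy.log" in a second terminal would capture a minute of readings.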

Turns out that the reality is a bit more down to earth; Linux builds up its entropy pool from the timing of hardware events such as keyboard and mouse input, disk activity, and other interrupts.

If you're in front of your system you can just pound on the keyboard to generate enough entropy bytes to make your GPG key, but I happened to be working on an Amazon EC2 instance that I was logged into remotely. I managed to get enough randomness in the end by downloading a few large files from external web sites, and running "ls -R /" a couple of times to run through the entire root file system. Even though the entropy_avail value kept going up and down while I was doing this, it turned out that the GPG key generation just kept taking what was available as new bytes, and after 10 minutes or so had enough bytes to finish:

++++++++++.++++++++++>+++++...............>+++++............<.+++++....>.+
++++<.+++++>.+++++..................+++++^^^^^^^^^

gpg: key 2BC5527E marked as ultimately trusted
public and secret key created and signed.
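For next time, the "ls -R /" trick could be wrapped up so it runs in the background while the key is being generated. This is just a sketch of what I did by hand: the function name stir_filesystem is made up for illustration, and how much entropy the walk yields will vary with the system and how much of the tree is already cached.

```shell
# Walk a directory tree repeatedly to create file system activity.
# Usage: stir_filesystem [PASSES] [DIR]
# PASSES defaults to 2; DIR defaults to the root file system.
stir_filesystem() {
    passes=${1:-2}
    dir=${2:-/}
    n=0
    while [ "$n" -lt "$passes" ]; do
        # Discard the listing; errors from unreadable directories are ignored.
        ls -R "$dir" > /dev/null 2>&1 || true
        n=$((n + 1))
    done
    echo "completed $n passes over $dir"
}
```

Starting it with "stir_filesystem 5 / &" before kicking off the key generation keeps the disk busy without tying up the terminal.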

Hopefully this post will help you to avoid one of those "WTF" moments; thanks to Carson Reynolds for providing most of the source material.