Being the second in a series of occasional posts on neat things you can do with your Nagios monitor that you might not have thought about.
I've been bitten by missing backups too often over the years, to the extent that I don't really feel secure unless I've got copies of each important system in at least three places. That leads to a lot of backup files, so wouldn't it be handy to have a process that checks whether your backup files are up to date and sends you a loud warning if they aren't?
Enter the trusty Nagios once again, in the guise of the check_file_age plugin. This plugin lets you check not only whether a file has been updated recently, but also that it has a minimum size; this way you can make sure that your backup put a new file in the right place and wrote something useful to the file.
Here's an example nrpe.cfg entry that runs on our backup server and checks the backup file from our JIRA server:
command[check_backup_jira]=/usr/local/nagios/libexec/check_file_age -c 86400 -w 86400 -C 500000 -W 500000 -f /NAS/Backups/eng-infrastructure-server/jira/`date +"%Y-%b-%d"`*.zip
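The JIRA backup files are named for the date, so this command makes sure that a file matching today's date, such as "2010-Jan-15*.zip", exists, that it is no older than 24 hours (86400 seconds), and that it is at least 500 KB (500000 bytes) in size. The warning and critical thresholds are set to the same values, so any failure is reported straight away as critical rather than as a warning first.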
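To make the logic of that check concrete, here is a minimal Python sketch of the same idea. It is not how check_file_age is implemented; the function names `classify` and `check_backup` are made up for illustration, and taking the first match when several files share the date prefix is a simplification of my own. The thresholds are the ones from the nrpe.cfg entry above.

```python
import glob
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def classify(age_seconds, size_bytes, warn_age=86400, crit_age=86400,
             warn_size=500000, crit_size=500000):
    """Map a file's age and size to a Nagios status (hypothetical helper)."""
    # Too old or too small for the critical thresholds: CRITICAL.
    if age_seconds > crit_age or size_bytes < crit_size:
        return CRITICAL
    # Otherwise compare against the (possibly looser) warning thresholds.
    if age_seconds > warn_age or size_bytes < warn_size:
        return WARNING
    return OK


def check_backup(directory, now=None):
    """Look for today's date-stamped zip in `directory` and classify it."""
    now = time.time() if now is None else now
    # Same prefix the shell builds with `date +"%Y-%b-%d"`, e.g. 2010-Jan-15
    prefix = time.strftime("%Y-%b-%d", time.localtime(now))
    matches = sorted(glob.glob(os.path.join(directory, prefix + "*.zip")))
    if not matches:
        return CRITICAL, "no file matching %s*.zip" % prefix
    # Simplification: only the first match is examined.
    info = os.stat(matches[0])
    age = now - info.st_mtime
    status = classify(age, info.st_size)
    return status, "%s is %d seconds old and %d bytes" % (
        matches[0], age, info.st_size)
```

Calling `check_backup("/NAS/Backups/eng-infrastructure-server/jira")` returns a status code and a message. A real plugin would print the message and exit with the status code so that Nagios can pick it up.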
Et voilà: now I have a sanity check that the backup process has run, has put the backup file in the right place, and that the file contains something.
The next step I'm working on is to check that each backup will actually restore to something useful - after all, as this Joel on Software article attests, backups are all well and good, but not much use if you can't get back what you need from them.
That's a subject for a future post, but I'm thinking that it should be straightforward, at least for our web apps like JIRA and Confluence. I should be able to restore the backup to a standby version of the web app running on a different machine and have a Nagios check_http or WebInject test verify that the content has been updated. I'd also get the benefit of being able to switch over to the standby with minimal downtime; looking forward to trying this out!