Monday, December 28, 2009

Fun with Nagios, part 1: timeliness of web content

I mentioned in an earlier post that I'm a big fan of the Nagios open source monitoring system. With very little work (on Linux, a couple of package installs and a bit of service configuration should do the trick) you can have a system up and running that will check all your servers for availability, disk space, memory and CPU usage and alert you when any resources exceed their limits.

It's pretty easy to add additional Nagios checks that will monitor all kinds of other useful things. I'll be making a few posts this week on that subject; today's deals with checking that your web site content is up to date.

One of my company's products is a web service that provides a speech-to-text transcription of news videos. The service pulls in videos from RSS feeds provided by the news source, runs the speech-to-text analysis and posts the results in another RSS feed (one per news source). As well as making sure that the output RSS feed is available 24x7, I wanted to check that new articles are being added each day. There are about 75 different news sources and thus output RSS feeds to check, so automation was pretty much mandatory.

This turned out to be a breeze, using Nagios' trusty check_http command and making use of the fact that you can pass the results of a command execution as a parameter to check_http. It's standard for each article in an RSS feed to have an element containing the publishing date; in my case this looks like:

<pubDate>Mon, 28 Dec 2009 23:40:04 GMT</pubDate>

To check that this feed has at least one article published today, here's the custom command from the RSS feed server's Nagios configuration file:

check_command check_http! -u /feeds/1010 -s "`date +\"%d %b %Y\"`"

This uses the standard -s parameter to check for a string in the HTTP response from the URL http://feedserver.mydomain.com/feeds/1010, but sets the string to be checked to the output of the "date" command in the format used by the output RSS feed, e.g. "28 Dec 2009" in my case.
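You can check what string will actually be substituted by running the same date command on its own; this is the literal text check_http will search the response for:

```shell
#!/bin/sh
# Print today's date in the same "28 Dec 2009" style the feed uses;
# the %d/%b/%Y format string matches the one in the check_command above.
TODAY=$(date +"%d %b %Y")
echo "$TODAY"
```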

It was easy to add checks for the rest of the output feeds (which all have the same date format) by creating additional checks and just varying the feed URL.
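With 75 feeds to cover, you can avoid hand-editing by generating the service definitions in a loop. Here's a sketch; the feed IDs and service template are illustrative, not taken from the real configuration:

```shell
#!/bin/sh
# Emit one Nagios service definition per feed ID; redirect the output
# into a services .cfg file. IDs below are made up for illustration.
gen_service() {
    cat <<EOF
define service{
    use                 generic-service
    host_name           feedserver
    check_period        news-content
    service_description Content updated: feed $1
    check_command       check_http! -u /feeds/$1 -s "\`date +\"%d %b %Y\"\`"
}
EOF
}

for FEED in 1010 1020 1030; do
    gen_service "$FEED"
done
```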

The only other thing to watch here is that Nagios runs all its checks 24x7 by default, but the first new article of the day might not be published until some time well after midnight. To avoid getting "out of date content" alerts as soon as the date changes, you can set up a custom time period in the Nagios server's timeperiods.cfg file, e.g.

define timeperiod{
    timeperiod_name news-content
    alias           Times when content from today should be present
    sunday          8:00-24:00
    monday          3:00-24:00
    tuesday         3:00-24:00
    wednesday       3:00-24:00
    thursday        3:00-24:00
    friday          3:00-24:00
    saturday        8:00-24:00
}

This says that at least one article from today should be present at all times after 3am on weekdays and after 8am on weekends - we learned by experience that weekends are slower news days. Use the custom time period in the configuration entry for your date check:

define service{
    use                 generic-service
    host_name           feedserver
    check_period        news-content
    service_description Content updated: feed 1010
    check_command       check_http! -u /feeds/1010 -s "`date +\"%d %b %Y\"`"
}

and you're good to go.

More on useful Nagios checks later in the week; I'd love to hear your own favourite ones in the comments.

Tuesday, December 8, 2009

Exchanging Answers

I'm a regular user of the programming Q&A site stackoverflow.com and its companion for sysadmins, serverfault.com. Both sites have saved me a lot of time digging around the wilds of the web for answers, so I've tried to give back by answering questions whenever I can.

There are a few reasons why I like these sites: the flat organization of the questions and answers makes it really easy to narrow down the things you're looking for; the email and RSS notifications of new questions and answers are highly configurable; and - an unexpectedly good incentive for me - you can build up reputation points for asking good questions and giving good answers (as voted on by the other users of the sites).

There's now a similar Q&A site for software testers at http://testing.stackexchange.com - if Server Fault is Stack Overflow with a beard, ponytail and sandals, Testing.Stackexchange might be SO with half-moon glasses and a clipboard. I think this could be a great forum for concentrating testing information; there's a lot of test-related material on Stack Overflow already, but it's often buried among more development-oriented content. I'll be hanging out there regularly from now on.

Friday, November 13, 2009

The USB Breathalyzer

Here's some Friday fun - ever wanted to stop your sysadmins from logging in drunk and doing an "rm -rf /" on the production servers, or prevent your developers from checking in code that was written under the influence? You need a USB breathalyzer!

This thread on serverfault.com has some other interesting ideas for preventing GUI (Geeking Under the Influence).

Monday, November 9, 2009

Advance your career over lunch

One of the online forums I frequent had a question today about what web resources engineers can use to advance their careers. Now don't get me wrong, learning from the web is all well and good. However, you'll learn far more in a given time from interacting with people than with a web page, simply because you can have a two-way conversation. Just about every time I've asked a friend or coworker for technical help, I've come away with information that I'd never thought to ask for directly.

Here are some ways you can work on your career advancement and learn at the same time:
  • Take every opportunity to learn from people with experience that you'd like to have - while being respectful of their time, of course. Even the busiest people are generally flattered to be asked and happy to talk about what they're working on; sharing coffee, lunch or beer with them doesn't hurt either!
  • Talk to everyone in your organization who you either work with directly, or whose job interests you. For example, you might want to move on from testing into development, operations, project management or many other related fields.
  • Attend conferences to meet up with your peers. Getting the budget to attend can be hard these days, but there are plenty of free opportunities around; have a look for local events at http://www.meetup.com/, for example. If you feel up to it, presenting at a conference or meetup is a great way to expand your network and gain valuable presenting experience.
Last but not least, the more people you engage with, the more you'll come across as someone who already has a particular set of skills but is also smart and eager to learn. That certainly won't hurt your career development!

Neat trick for getting relative dates in a Linux shell script

I wanted to get the SVN updates done in the last 24 hours for a nightly script that updates and restarts one of our applications. The command would look something like:

svn log -v -r {#Calculate yesterday's date somehow }:{#Calculate today's date in yyyy-mm-dd format}

Getting today's date is easy enough using the "+" parameter with a format string, but yesterday's date seemed more tricky; checking the man page for the date command didn't look very promising. It looked like I'd have to do the famous programming exercise of keeping an array of the number of days in each month, checking whether yesterday was still this month or last month, etc. etc.

However, using info date instead of man date gave a lot more information, including this very useful option:

`-d DATESTR'
`--date=DATESTR'
     Display the date and time specified in DATESTR instead of the
     current date and time.  DATESTR can be in almost any common
     format.

DATESTR can be a word or phrase like "tomorrow", "yesterday", "last week" and so on. So my problem was solved in a single line:

svn log -v -r {`date -d "yesterday" +"%Y-%m-%d"`}:{`date +"%Y-%m-%d"`}
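The -d option accepts plenty of other relative phrases too; here's a quick sampler (GNU date, as found on Linux):

```shell
#!/bin/sh
# GNU date's -d/--date option parses free-form relative dates.
YESTERDAY=$(date -d "yesterday" +"%Y-%m-%d")
LAST_WEEK=$(date -d "last week" +"%Y-%m-%d")
NEXT_FRIDAY=$(date -d "next friday" +"%Y-%m-%d")
echo "yesterday: $YESTERDAY  last week: $LAST_WEEK  next friday: $NEXT_FRIDAY"
```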

Moral of the story - always look at the info page as well as the man page!

Wednesday, November 4, 2009

Forced education #1 - DNS and DHCP with dnsmasq

One of the, ahem, pleasures of working at a startup is that you often have to learn something new in a hurry when something goes wrong. This happened to me the other day when people on our office network suddenly stopped being able to get to anywhere else on the internet or our internal network.

After a bit of troubleshooting, I found out that we still had inbound and outbound connectivity, but DNS had intermittently stopped working - so, since normal people don't carry a long list of IP addresses in their heads, their access to anywhere via its host name was gone.

We had all our office network clients pick up their DNS server via DHCP from our LinkSys AV-082 router. I tried switching the DNS servers defined in the router to OpenDNS in case our ISP's DNS servers were having a problem, but the problem persisted.

Some internet digging and talking with a couple of folks who know a lot more about networking than me led me to the conclusion that the router's DNS and DHCP handling was acting up. Since it still seemed to be working fine as a regular router, I decided the best thing would be to turn off the router's DHCP server and run DNS and DHCP from one of the office servers instead. I set up the people in the office to point directly to OpenDNS to get them going while I was messing around, and got to work.

There's a nice, lightweight Linux package called dnsmasq that will handle both DNS and DHCP. I found it super easy to set up on one of our Fedora servers, thanks to Keith Fieldhouse's article on linux.com; the hardest part was typing in all the MAC addresses for servers that I wanted to have a fixed IP address, since the router's web UI wouldn't let me copy and paste the DHCP MAC to IP address mappings.

The only other tweak I had to make was to configure the network adapter on the dnsmasq machine to have a fixed IP address, rather than trying to get it from DHCP.
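For reference, a minimal dnsmasq.conf covering the pieces above might look like this - the interface name, address range and MAC address are made-up examples, though the OpenDNS server addresses are real:

```
# /etc/dnsmasq.conf (illustrative values)
interface=eth0
# Hand out dynamic leases from this range, with a 12-hour lease time
dhcp-range=192.168.1.100,192.168.1.200,12h
# Pin a fixed address to a known MAC (one line per server)
dhcp-host=00:11:22:33:44:55,feedserver,192.168.1.10
# Forward unresolved queries to upstream resolvers (OpenDNS here)
server=208.67.222.222
server=208.67.220.220
```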

After a couple of hours of setup and careful testing, everything was up and running. An extra benefit of having a local DNS server is that it caches all its results, so that lookups to a host that it already knows about are nice and fast. In fact, I've now set up a local dnsmasq server just for DNS at home, since anything that makes web browsing faster has to be a good thing!

Thursday, October 22, 2009

Using your IT monitor for automated testing

An automated IT monitoring system is a must-have for any team these days; few things give me more pleasure (at work) than seeing a web page full of green boxes telling me that all our systems are running smoothly. I like Nagios because it's robust and easy to install, configure and extend, but it's showing its age a bit; you might want to look at Zenoss or Zabbix if you're starting from scratch. Nagios' inability to show trend graphs out of the box can be an annoyance - more on that in a future post.

Monitors are great for alerting you when a system goes down or runs out of resources, but I hadn't used them for QA until recently, when I realized that they're also ideal for smoke testing new product builds. If you're working with web applications, this is a piece of cake - all you need to do is:
  • set up a cron or batch job to install your new build on a schedule;
  • have your monitor tool check one or more URLs in your application.
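The first step above can be as simple as a crontab entry; the script path here is a stand-in for whatever your own install script is:

```
# m h dom mon dow  command
# Reinstall the latest build at 2am every night; log output for debugging.
0 2 * * * /opt/myapp/install_latest_build.sh >> /var/log/nightly_install.log 2>&1
```

The second step is then just a check_http service against your test URL, like the one shown below.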
Since monitoring tools can alert you by email or SMS, you get timely feedback when something goes wrong - maybe last night's changes broke the app so that it doesn't even come up any more, or perhaps there's an unexpected change to the content returned by your test URL. If you do your reinstall overnight, the alerts can be waiting in your inbox when you start work the next day (Nagios also has a nice Firefox plugin that turns red when there's an alert, in addition to setting off a loud siren that's hard to ignore!)

The monitor runs 24x7, so you also get the built-in bonus of being alerted when somebody inadvertently shuts down your test application; you'd also get warned if the app crashed after running out of resources, although you'd still have to go in and debug the cause in that case.

Here's how a typical test would look in Nagios:

check_command check_http!-p 8080 -u /test/url/ -s "Text Expected"

This uses Nagios' built-in check_http command to go to the URL /test/url on port 8080 and look for a specific string in the HTML output (the host details are specified elsewhere in the Nagios setup).

check_http has lots of other useful options such as checking for a regular expression, sending POST data, controlling the response to a redirected page, checking the age of the returned document or checking the validity of the site's SSL certificate.
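A few examples of those options, in the same check_command style as above - the hosts, URLs and values are illustrative, and it's worth confirming the exact flags against check_http --help on your own install:

```
check_command check_http!-p 8080 -u /test/url/ -r "Order [0-9]+ complete"  ; regex match
check_command check_http!-u /login -P "user=test&pass=secret"              ; send POST data
check_command check_http!-u /feeds/1010 -M 86400                           ; alert if the document is over a day old
check_command check_http!-H www.mydomain.com -C 14                         ; warn if the SSL cert expires within 14 days
```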

check_http does have a couple of limitations:
  • it will only handle basic web server authentication; you can get around this for web apps that use authentication libraries by using the Nagios plugin for WebInject, which will handle web pages with input fields. WebInject also lets you set up multi-step tests that navigate to more than one page; you can write some fairly sophisticated tests this way without having to resort to a full-blown web testing tool.
  • because check_http does a simple HTTP GET, it won't return any page content that's rendered in the browser; this limits what you can check for if your pages include a lot of JavaScript. If you need to test pages with a lot of JavaScript, you'll need to use a more comprehensive Web testing tool like Selenium that can emulate a browser. In this case I'd recommend integrating the tests with your continuous build system rather than the IT monitor, although tools like Nagios have a simple plug-in architecture that lets you integrate additional tools fairly easily.
Take a look at this approach if you need to get some basic testing in place and are strapped for time, people and / or money - all the tools mentioned above are open source.

Wednesday, October 21, 2009

How I got here

I've had quite a varied career in high tech, starting off as a developer and doing spells in QA, customer management and project management (some of the transitions involved bouncing from the UK to the US twice, but that's another blog post).

For the last few years I've settled into technical operations, which I'd essentially define as all the things that keep an engineering team running smoothly. The job can cover a lot of ground: testing, customer support, project management, IT and, of course, the people management work involved in keeping all those things running. Recently I've been working with small startups, so I get to do a lot of hands-on work as well as steering the ship.

Tech ops is a big problem space with a lot of competing best practices (and I kind of like it that way), but I've found that there are a lot of commonalities, too; more on that in future posts.

I'm looking forward to sharing some of the knowledge I've picked up along the way and, hopefully, learning more from you all!