Appboy provides a platform for customer relationship management for mobile app developers. Developers integrate our SDK into their apps, and this SDK communicates with an internal API.
A few months ago, I was working on improving the performance of the API by offloading some processing to Resque workers on another server. The code worked fine, was reviewed, passed all the tests in our continuous integration environment, and was automatically deployed to our staging servers. Excited, I picked up an iPhone to try it out and logged into the database to confirm that the processing had happened. It had not.
For the first step in investigating what was up, I went to our Resque dashboard. From there, it was pretty obvious what the problem was.
No workers? What?! Monit makes sure that all of our Resque workers are running and starts them if they’re not. I SSH’d into the machine for some watch 'ps aux | grep resque' action. The workers were definitely starting, all right. The problem was that they kept dying fairly soon after, only to be restarted by Monit. Kind of like a daemon’s Groundhog Day.
Tracking down the problem was straightforward: I simply started a Resque worker myself and looked at standard error. The issue was pretty stupid, something like a typo in a config file that the workers loaded. I fixed it quickly and resumed my testing.
But something else had me worried: I hadn’t been alerted to the fact that something was wrong. We take a lot of precautions to lower the probability that we deploy and break something. First, we aim for very high code coverage and deploy only after tests pass. Second, our deployment script runs automated smoke tests against our servers and rolls back code if they fail. Third, we use New Relic and Monit for system-level and process monitoring. Additionally, we have uptime monitors at Cloudkick pinging our servers frequently. But somehow I deployed and still broke something.
Looking for a solution
I spent some time looking into our existing monitoring tools to try to create a test such as “ensure that at least one Resque worker is running at all times.” At my old job, we used SiteScope for monitoring, so I was very familiar with writing custom script monitors that could do what I needed. I just needed the right tool in which to do it. I had some pretty simple requirements to start:
- The monitor should be able to run an arbitrary script.
- Ideally, I could configure the same script to run in multiple environments so I could test my staging servers with the exact same monitors that test my production servers.
- Monitoring shouldn’t be tied to a specific server since a functional monitor can test multiple parts of the system at once.
- There should be a way to communicate that a failure is occurring.
- Communications should be customizable per monitor and per environment. This way I could get paged for critical problems, not minor ones or ones in our staging environment.
When I thought about it more, I came up with some other requirements as well:
- The system should be able to take action based on failures. I want monitoring processes that can also keep my business going: automatically restart systems, fail over to secondaries, capture logs when bad things happen, automatically file issues on GitHub for errors, etc.
- I should have the ability to turn the monitors off somehow (possibly by descheduling from cron, although disabling individual monitors would be a nice-to-have).
- The ability to change notification thresholds to avoid spam.
I did find some services where I could write script monitors, but each had some drawbacks. Cloudkick lets you create custom plugins, but that starts at $99/month for only a few servers. Scout is considerably cheaper, but still focused on server-level monitoring. So I did what most people in the Ruby community do when they have a problem: I wrote a gem for it. Mine is called Listerine. Listerine is a simple functional monitoring framework that enables you to quickly create customizable functional monitors with full alerting capability.
Using Listerine, you can write monitors that do things like:
- Make sure that you have at least X available Resque workers and spin up a new EC2 instance if you need it.
- Ensure that your database was actually backed up to S3 last night and file a GitHub issue if it was not.
- Interact with an API and check return values or database changes.
- Ensure that you always have at least 1000 users in your database for fear of Bobby Tables.
Listerine allows you to define simple script monitors that contain an assertion. When the assertion is true, the monitor has succeeded. When the assertion evaluates to false, the monitor is marked as failed, sends a notification, and can run optional code on failure. Unhandled exceptions from the assertion are caught and treated as failures, with the exception text and backtrace included in the notification.
Here’s an example:
Awesome! Now we have a monitor. If you were to run this file, the output would look like:
* Database online PASS
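To make the pass/fail behavior described above concrete, here is a minimal plain-Ruby sketch of an assertion-based monitor. It is not Listerine's API: the class name `SketchMonitor`, the `notifier` and `on_failure` parameters, and the `FAIL` label are all assumptions made for illustration, and the database check is a stand-in.

```ruby
# Hypothetical plain-Ruby sketch; none of these names come from Listerine.
class SketchMonitor
  Result = Struct.new(:name, :passed, :detail)

  def initialize(name, on_failure: nil, notifier: ->(msg) { warn msg }, &assertion)
    @name = name
    @assertion = assertion
    @on_failure = on_failure
    @notifier = notifier
  end

  def run
    passed = false
    detail = nil
    begin
      # Any truthy return value counts as a pass.
      passed = @assertion.call ? true : false
    rescue StandardError => e
      # An unhandled exception counts as a failure; keep its text and backtrace.
      detail = (["#{e.class}: #{e.message}"] + Array(e.backtrace)).join("\n")
    end

    unless passed
      @notifier.call(["Monitor '#{@name}' failed", detail].compact.join("\n"))
      @on_failure&.call # optional code to run on failure
    end

    puts "* #{@name} #{passed ? 'PASS' : 'FAIL'}"
    Result.new(@name, passed, detail)
  end
end

# Stand-in for a real connectivity check (assumption).
database_online = -> { true }

SketchMonitor.new("Database online") { database_online.call }.run
```

Passing the notifier in as a lambda keeps the sketch testable: a real setup would send email or a page, while a test can simply collect messages in an array.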
As I mentioned above, ideally I wanted to run the same monitors against staging that I did in production. Back when I used SiteScope, it was always a big pain to make multiple monitors with different configuration values. With Listerine, you can just specify a list of environments and change configuration based on the current environment.
Here’s a simple version of our monitor that makes sure that there is always at least 1 Resque worker.
But what about the requirement to change the communication per environment? I don’t want to get woken up at night when there’s an error on staging. That’s where criticality levels come into play. Set a criticality level on the monitor using is, and then define a recipient for that level. Here’s the same monitor but with different recipients:
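The routing idea itself can be sketched in plain Ruby. This is not Listerine's DSL: the constant names, the `:critical` and `:low` levels, the email addresses, and the injected worker count are all assumptions chosen for illustration.

```ruby
# Hypothetical plain-Ruby sketch; the names below are illustrative, not Listerine's DSL.

# Per-environment settings for one monitor (values are made up).
MIN_WORKERS = { production: 1, staging: 1 }.freeze

# Who hears about a failure, keyed by environment and criticality level.
RECIPIENTS = {
  production: { critical: "oncall-pager@example.com", low: "dev-team@example.com" },
  staging:    { critical: "dev-team@example.com",     low: "dev-team@example.com" }
}.freeze

def recipient_for(environment, criticality)
  RECIPIENTS.fetch(environment).fetch(criticality)
end

# worker_count is injected so the same check runs against any environment.
def check_workers(environment, criticality, worker_count)
  return :pass if worker_count >= MIN_WORKERS.fetch(environment)

  { notify: recipient_for(environment, criticality) }
end

p check_workers(:production, :critical, 0)
p check_workers(:staging, :critical, 0)
p check_workers(:production, :critical, 4)
```

In this sketch, a critical failure in production goes to the pager address, while the same failure on staging only reaches the team mailbox, so nobody gets woken up for a staging problem.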
Listerine comes with a Sinatra-based front end server where you can check the latest status of your monitors and enable/disable them.
There’s other cool stuff that Listerine can do, such as customizing the notification thresholds (e.g., to only get notified after 3 failures and then only keep notifying every 5 failures), or taking an action automatically when failures occur X times consecutively. This is helpful because it lets you configure Listerine to only wake you up when things are really wrong. Check out https://github.com/appboy/listerine for more details.
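As a rough illustration of how such thresholds might work, here is a small plain-Ruby sketch. It is not Listerine's implementation: the `FailureTracker` class and its parameters are hypothetical, and it assumes one particular reading of the rule, namely that the first notification fires on the 3rd consecutive failure and then again every 5 failures after that (the 8th, the 13th, and so on).

```ruby
# Hypothetical sketch of failure-count thresholds; not Listerine's implementation.
class FailureTracker
  attr_reader :consecutive_failures

  def initialize(notify_after: 3, then_every: 5, act_after: nil)
    @notify_after = notify_after
    @then_every = then_every
    @act_after = act_after
    @consecutive_failures = 0
  end

  # Record one monitor run; returns the list of things to do right now.
  def record(passed)
    if passed
      @consecutive_failures = 0 # any success resets the streak
      return []
    end

    @consecutive_failures += 1
    todo = []
    todo << :notify if notify_now?
    todo << :act if @act_after && @consecutive_failures == @act_after
    todo
  end

  private

  # True on the notify_after-th failure and every then_every failures afterwards.
  def notify_now?
    past = @consecutive_failures - @notify_after
    past >= 0 && (past % @then_every).zero?
  end
end

tracker = FailureTracker.new(notify_after: 3, then_every: 5, act_after: 4)
results = (1..8).map { tracker.record(false) }
results.each_with_index { |todo, i| puts "failure #{i + 1}: #{todo.inspect}" }
```

With these settings, eight consecutive failures produce a notification on the 3rd and 8th failure and trigger the automatic action once, on the 4th.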
How Listerine has helped us
We benefited from Listerine immediately after putting it to work for us. The first big win? This line was in our post-deploy hook:
monit -g appboy_resque_workers restart all
I put it in there after following the instructions in Engine Yard’s documentation for configuring and deploying Resque on Engine Yard.
It’s obvious, but that line restarts all our workers at once across all our systems. We failed the “ensure that at least 1 Resque worker is running at all times” monitor immediately during the next deploy. Now we do staggered restarts to ensure we have enough capacity to not let jobs sit in the queue for too long.
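A staggered restart can be sketched along these lines. This is only an illustration of the idea, not our actual deploy hook: the worker names, batch size, and pause are assumptions, and the example prints the commands as a dry run instead of executing them.

```ruby
# Hypothetical sketch: restart workers in small batches rather than all at once.
# Worker names, batch size, and pause length are assumptions for illustration.
def staggered_restart(workers, batch_size:, pause_seconds:, restart:, sleeper: ->(s) { sleep(s) })
  batches = workers.each_slice(batch_size).to_a
  batches.each_with_index do |batch, index|
    batch.each { |worker| restart.call(worker) }
    # Give restarted workers time to come back before touching the next batch.
    sleeper.call(pause_seconds) unless index == batches.size - 1
  end
end

workers = %w[resque_worker_1 resque_worker_2 resque_worker_3 resque_worker_4]

# Dry run: print the commands instead of executing them.
staggered_restart(
  workers,
  batch_size: 2,
  pause_seconds: 0,
  restart: ->(name) { puts "monit restart #{name}" }
)
```

Because only one batch is down at any moment, the remaining workers keep draining the queue while the others restart.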