Tuesday 2011-09-20

A couple of days ago the load on haller.ws had a distribution that looked like this:

[figure: load distribution skewed high during the runaway script]

That's because one of my scripts went into an infinite loop and happily sat there, munching on my CPU.

On a normal day, the load distribution looks like this:

[figure: load distribution on a normal day]

I don't want to sit around and monitor my machines; I'd rather have my machines monitor themselves and alert me when things get out of whack. Which means defining what "in whack" means. Say we put a threshold of 2 on the load for haller.ws: any time a report job took too long and overlapped with rss2email, I'd get a false positive.

Cranking the threshold up to 4 to avoid those false positives would run the opposite risk: a low-priority process could sit there munching on the CPU while yielding to everything else, keeping the load under the threshold and going undetected. That makes for a false negative.
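For the record, here's the naive threshold check I'm arguing against, as a quick R sketch against /proc/loadavg (the threshold of 2 is the made-up number from above):

# naive check: exit 1 whenever the 1-minute load tops a fixed threshold
load1 <- as.numeric(strsplit(readLines("/proc/loadavg"), " ")[[1]][1])
quit(status = if (load1 > 2) 1 else 0, save = "no")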

However, testing the statistical distribution of the load works: the expected distribution (the normal day above) fits an exponential distribution, and the broken distribution (the runaway-script day) doesn't.
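A toy demonstration of the idea (made-up numbers): a healthy load hovering near idle, versus the same samples shifted up by a stuck process.

library(MASS)
set.seed(1)
healthy <- rexp(1000, rate = 9)  # mostly idle, occasional spikes
broken  <- healthy + 1           # a pegged core adds ~1 to every sample
logLik(fitdistr(healthy, "exponential"))  # comfortably above zero
logLik(fitdistr(broken, "exponential"))   # far below zero

The shifted samples can still be fit by an exponential, but the log-likelihood collapses; that collapse is the signal.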

To automate this, I grabbed the load data using collectd, configured to drop the load data to CSV instead of RRD.
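The relevant bits of collectd.conf look roughly like this (a sketch; DataDir matches the path the script below reads from, but your install may differ):

LoadPlugin load
LoadPlugin csv

<Plugin csv>
	DataDir "/var/lib/collectd/csv"
	StoreRates false
</Plugin>

Then I wrote a quick script for R that exits 0 when the distribution fits the exponential, and exits 1 otherwise: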

library(MASS)
# yesterday's load samples, as written by collectd's csv plugin
load <- read.csv(
	sprintf("/var/lib/collectd/csv/%s/load/load-%s",
		system("hostname -f", intern=TRUE),
		system("date -d yesterday +%Y-%m-%d", intern=TRUE)))
# keep the nonzero short-term (1-minute) samples
load <- load$shortterm[load$shortterm > 0]
# on a healthy day the exponential fit has a positive log-likelihood
ll <- logLik(fitdistr(load, "exponential"))
exit_code <- 1
if (ll > 0) { exit_code <- 0 }
quit(status=exit_code, save="no")
and run it with:
> Rscript collectd-load.r || echo "load out of whack"
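To close the loop, a crontab entry can nag me when the check fails (the script path and mail setup are my assumptions, adjust to taste):

0 6 * * * Rscript /path/to/collectd-load.r || echo "load out of whack" | mail -s "load alert" root

The script reads yesterday's CSV, so running it early in the morning covers the previous day.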

We can tighten this down further by fitting the rate of the expected distribution (here it's 9) and testing new data against that rate, instead of just checking the sign of the log-likelihood. And instead of evaluating on a daily basis, we could evaluate more often; we'd just need to figure out how large a rolling window of data to keep.
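One way to test against a known rate (a sketch: rate0 = 9 is the rate from a known-good day, the chi-squared cutoff is a standard likelihood-ratio test rather than anything from the script above, and load is the vector from that script):

library(MASS)
rate0 <- 9                                               # rate fitted on a known-good day
ll0 <- sum(dexp(load, rate = rate0, log = TRUE))         # log-likelihood at the expected rate
ll1 <- as.numeric(logLik(fitdistr(load, "exponential"))) # log-likelihood at the best-fit rate
# likelihood-ratio test: alarm when the data no longer looks like Exp(rate0)
p <- pchisq(2 * (ll1 - ll0), df = 1, lower.tail = FALSE)
quit(status = if (p < 0.01) 1 else 0, save = "no")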