Waldo Jaquith

Limiting *nix program runtime.

I posted a question to the Neon Guild mailing list a few days ago about a vexing little problem. For Richmond Sunlight I’m taking regular snapshots of the websites of every member of the General Assembly, as an archive for folks to browse months or years from now, to see what sort of promises legislators made, how they presented themselves, etc. That’s done via an automated script that grabs a couple of sites every day without me interacting with it at all.

The trouble is that wget has no option to limit its total running time, so it’ll run forever and ever when it encounters a website with a badly-written CMS that generates recursive links like http://example.com/about/print/print/print/print. The solution is to create a wrapper for wget that puts it on a timer: if it runs longer than a preset limit, it gets cut off. My friend Jeff Uphoff envisioned this solution and then whipped up a quick program in C to accomplish it. I’m sharing it here for googlers looking for a solution to the same problem. Up notes that it’s in the public domain.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int
main (int argc, char **argv)
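{
 int limit;
 if (argc < 3) {
   fprintf (stderr, "Usage: %s maxtime command [arguments]\n", *argv);
   return 1;
 }
 limit = atoi (argv[1]);	/* First arg is time limit (in seconds) */
 if (limit <= 0) {		/* alarm (0) would set no limit at all */
   fprintf (stderr, "%s: maxtime must be a positive integer\n", *argv);
   return 1;
 }
 alarm (limit);		/* SIGALRM ends the command by default */
 argv += 2;			/* Next arg is command to exec */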
				/* All remaining args are passed to
				   executed command as its args */
 if (execvp (*argv, argv) != 0) {
   perror ("execvp");
   return errno;
 }
 return 0;			/* Never reached */
}

If you decide to name it alarmlimit.c, compile it like so:

gcc -Wall -ansi alarmlimit.c -o alarmlimit

If you wanted to run top for no more than ten seconds, you'd do this:

alarmlimit 10 top
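If a script needs to know whether the command finished by itself or was cut off, the same idea can be wrapped in a fork. What follows is only a rough sketch and not part of Jeff's program: the function name run_with_limit and its return codes are made up for illustration. It relies on the fact that a pending alarm survives execvp, and it assumes the command leaves SIGALRM alone. A program that installs its own handler or resets the timer could still outlive the limit.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of the original alarmlimit.c.
   Runs cmd[0] with the NULL-terminated argument list cmd for at most
   secs seconds.  Returns the command's exit status (0-255), -1 if it
   was killed by the alarm, or -2 on any other failure. */
int
run_with_limit (unsigned int secs, char **cmd)
{
  int status;
  pid_t pid = fork ();

  if (pid < 0)
    return -2;			/* fork failed */
  if (pid == 0) {
    alarm (secs);		/* Pending alarm survives the exec */
    execvp (cmd[0], cmd);
    _exit (127);		/* Only reached if exec failed */
  }
  if (waitpid (pid, &status, 0) < 0)
    return -2;
  if (WIFSIGNALED (status))	/* Killed by a signal: was it ours? */
    return WTERMSIG (status) == SIGALRM ? -1 : -2;
  return WIFEXITED (status) ? WEXITSTATUS (status) : -2;
}
```

Calling run_with_limit (10, cmd) from a main () with cmd set to { "top", NULL } would mirror the example above, and a return value of -1 would mean the ten seconds ran out.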

Note that you could also try to solve this with ulimit -t, but ulimit -t limits CPU time, not elapsed (wall-clock) time. A program that's not especially processor intensive (like wget, which spends most of its time waiting on the network) could run for much, much longer than the amount of time that you specify.
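To see the difference between CPU time and elapsed time, here is a small sketch (the function name is invented for this example). It measures how much CPU time a process uses while it does nothing but sleep, which is roughly what a program waiting on a slow web server looks like to ulimit -t.

```c
#include <time.h>
#include <unistd.h>

/* Hypothetical helper for illustration: sleeps for secs seconds of
   wall-clock time and returns the CPU seconds used meanwhile. */
double
cpu_seconds_while_sleeping (unsigned int secs)
{
  clock_t before = clock ();	/* CPU time used so far */

  sleep (secs);			/* Wall clock advances, CPU stays idle */
  return (double) (clock () - before) / CLOCKS_PER_SEC;
}
```

The value returned should be a tiny fraction of a second however long the sleep is, so a CPU-time limit would take a very long time to trip on a process like this.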


2 Comments

Won’t the -l parameter to wget do what you want, with the added benefit that it won’t cut off the retrieval if there is some other cause for the delay?

The wget man page seems to indicate that the default recursion limit (the -l flag) is 5. Or are you overriding that?

Posted by Sean on 1 October 2009 @ 11pm

That won’t do what I want, unfortunately. I’m happy to recurse as deeply as a (reasonable) site will go in order to get the whole thing: 10 levels, 15 levels, the sky’s the limit. The goal here isn’t to limit how much of a site I can archive, it’s to prevent over-archiving of sites, which is to say the infinite duplication of pages due to poorly-designed URL structures. Given, say, four URL-passed options (print, single page, e-mail this page, and export as PDF) available for every page on a website, wget might only recurse five levels deep, but using up every combination of those options across five levels for every page on a website could take months. So it really needs to be a time-based thing, I think.

Posted by Waldo Jaquith on 2 October 2009 @ 9am