Thursday, 19 Oct 2006
The Hidden Cost of Server Parameters
A decision that comes up fairly often when writing server software is
whether to hard-code a value or make it configurable. It's easy to
conclude that more configuration is better, because when you hard-code
something, you'll have to recompile and deploy a new version of the
server when you want to change it. But resorting to configuration
parameters too frequently has a cost that adds up after a while.
For example, suppose we are writing a servlet that does a redirect
when the moon is full:
if (isFullMoon()) {
    response.sendRedirect("http://www.example.com/fullmoon.html");
}
The unit test for this code probably looks something like this:
public void testRedirectWhenMoonIsFull() {
    setUpFullMoon();
    sendRequest();
    checkWasRedirected("http://www.example.com/fullmoon.html");
}
So far so good. But the problem with redirects is that they work for
a while and then break. Maybe we're concerned that the server hosting
the destination page might go away someday. Wouldn't it be better to
have a --enableFullMoonRedirect flag? That way a
sysadmin can fix the problem without needing help from the developers.
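In the servlet, the guard might look something like this (just a
sketch; isFullMoonRedirectEnabled() is a hypothetical accessor for
whatever mechanism parses --enableFullMoonRedirect):

// Sketch: the redirect now depends on the flag as well as the moon.
// isFullMoonRedirectEnabled() is hypothetical; it stands in for
// whatever configuration mechanism parses --enableFullMoonRedirect.
if (isFullMoonRedirectEnabled() && isFullMoon()) {
    response.sendRedirect("http://www.example.com/fullmoon.html");
}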
So, we double the number of unit tests:
public void testRedirectWhenMoonIsFullAndEnabled() {
    setEnableFullMoonRedirect(true);
    setUpFullMoon();
    sendRequest();
    checkWasRedirected("http://www.example.com/fullmoon.html");
}

public void testDontRedirectWhenMoonIsFullAndNotEnabled() {
    setEnableFullMoonRedirect(false);
    setUpFullMoon();
    sendRequest();
    checkWasNotRedirected();
}
The reason we need two tests is to make sure that the flag actually
works. You might think this is too trivial to test, but I've seen
this bug in actual production code. Someone added a flag, someone
else refactored, and the flag stopped working. There was no test and
the flag was never actually used, so nobody noticed. As a result, the
flag had negative shareholder value, due to the confusion when we
thought it worked but it didn't.
Writing two tests isn't so bad when the flag is actually necessary.
But what if we have more flags? Every time we add another flag, the
number of possible configurations at least doubles. We can't test
every combination, but we should test a reasonable subset of them (all
off, all on, each one individually, and maybe a common configuration
or two).
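To make that concrete, here's a sketch of a subset test for three
flags. The second and third flags (setEnableEclipseBanner,
setEnableTideWarning) and the checkResponseMatches helper are made up
for the example, in the style of the tests above:

// Sketch: with three boolean flags there are eight configurations;
// this tests five of them (all off, all on, each one individually).
public void testFlagCombinationSubset() {
    boolean[][] subset = {
        {false, false, false},  // all off
        {true,  true,  true},   // all on
        {true,  false, false},  // each flag individually
        {false, true,  false},
        {false, false, true},
    };
    for (boolean[] config : subset) {
        setEnableFullMoonRedirect(config[0]);
        setEnableEclipseBanner(config[1]);  // hypothetical flag
        setEnableTideWarning(config[2]);    // hypothetical flag
        sendRequest();
        checkResponseMatches(config);       // hypothetical assertion
    }
}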
And oh yeah, what we really need to test is the production
configuration. For a redirect to happen correctly, the code, the
production configuration, and the destination server all need to
cooperate.
It might take some work, but we can test that too: copy the production
config file, make the minimal amount of changes to get it to work in a
development environment, run the server, and then see if it still does
the redirect.
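As a sketch, that test might be shaped like this. The Config type and
the config-handling helpers (copyOfProductionConfig,
overrideForDevEnvironment, startLocalServer) are hypothetical; the
point is the shape of the test, not the specific API:

public void testRedirectWithProductionConfig() throws Exception {
    // Hypothetical helpers: load a copy of the real production
    // config, then change just enough (ports, hostnames) to run it
    // in a development environment.
    Config config = copyOfProductionConfig();
    overrideForDevEnvironment(config);
    startLocalServer(config);

    setUpFullMoon();
    sendRequest();
    checkWasRedirected("http://www.example.com/fullmoon.html");
}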
But notice how much harder this is to test than if there were no flag
at all.
So, what are the alternatives?
I'd rather treat rare config changes like this as a "fire drill" for
the team's emergency response procedures. How long does it take the
team to diagnose, test, and deploy a one-line bugfix? How can we make
that happen faster?
Of course, we don't want to do fire drills too often, or we'd never
get any work done. To avoid that, the next step is to automate the
response. If redirecting when the destination server is down would be
a disaster, we can write code that polls the destination occasionally
and stops redirecting if the poll fails. You don't have to poll very
often to respond faster than a human would.
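Here's a sketch of what that automation might look like. The class
name and the one-minute interval are arbitrary choices, and the
servlet would check isRedirectEnabled() before calling sendRedirect():

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: poll the destination once a minute with a HEAD request, and
// turn the redirect off whenever the destination stops answering.
class RedirectHealthCheck {
    private final AtomicBoolean redirectEnabled = new AtomicBoolean(true);

    void start() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                poll();
            }
        }, 0, 60, TimeUnit.SECONDS);
    }

    private void poll() {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(
                    "http://www.example.com/fullmoon.html").openConnection();
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            redirectEnabled.set(conn.getResponseCode() == 200);
        } catch (Exception e) {
            redirectEnabled.set(false); // destination unreachable
        }
    }

    // The servlet checks this before calling sendRedirect().
    boolean isRedirectEnabled() {
        return redirectEnabled.get();
    }
}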