Sunday, July 13, 2008

redundancy and unexpected interdependency

A couple of weeks ago, the office network went down. Not surprisingly, we couldn't print or get to any electronically filed documents. Maybe a bit more surprisingly, the phones also went out, because we've moved to fancy new Cisco network-based phones, and they took voice mail (which is now e-mail-based) with them. The fax machines went too, because incoming faxes get scanned and e-mailed to us. The net result was that we were almost dead in the water. Fortunately, the Blackberries stayed up, so we could use cell phones and could still send e-mail through the Blackberry server.

These sorts of unexpected interdependencies are cropping up more and more often as we link one network to another, and as traditional network technologies shift around. I've already written about how my (now former) cell phone's alarm clock stopped working when the phone couldn't reach the cell network. There was a radio discussion today about an issue with digital TV: many people put battery-powered analog TV sets in survival kits so they can get news if there's a natural disaster, but (1) the set-top converter boxes that we'll need once digital broadcasting becomes standard require wall power to run, which would make those battery-powered analog TVs useless, and (2) many radio stations no longer have a news department and act more as satellite feeds for a central office, so your transistor radio may not be able to pull in anything useful, either. Oh yeah, and many folks are moving away from conventional land-lines at home and going to VoIP phones or cell phones exclusively.

The classic way to deal with system failure is through redundancy: if the TV network goes out, you still have radio, which can get you much of what you need, or maybe you still have the telephone. And so forth. The problem is that redundancy only works where you can stop a failure in one system from propagating to other systems. That's one reason most software doesn't have a redundant design: it's very hard to predict how far the effects of a failure will propagate through software, so it's generally not cost-effective to make it redundant. As our networking capabilities become more sophisticated, it's becoming harder to figure out what effect a failure of one network will have on the other networked systems we use.
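To make the propagation idea concrete, here is a minimal sketch in Python. It treats the office outage as a dependency graph and walks it to find everything that goes down when one piece fails. The service names and the dependency map are my own rough guesses at how the office is wired, not an actual inventory, and the sketch assumes the worst case: a service fails as soon as any one thing it depends on fails, with no redundancy at all.

```python
from collections import deque

# Hypothetical dependency map, loosely modeled on the office outage.
# Each service lists the things it needs in order to work.
DEPENDS_ON = {
    "printing": ["office_network"],
    "file_server": ["office_network"],
    "desk_phones": ["office_network"],
    "email": ["office_network"],
    "voice_mail": ["desk_phones", "email"],
    "fax": ["email"],
    "blackberry_email": ["cell_network"],
    "cell_phones": ["cell_network"],
}


def affected_by(failed, depends_on):
    """Return the set of services that stop working when `failed` goes down.

    Assumes a service fails if ANY of its dependencies fails.
    """
    # Invert the map: for each service, which services rely on it directly?
    dependents = {}
    for service, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(service)

    # Breadth-first walk outward from the failed service.
    down = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


if __name__ == "__main__":
    print(sorted(affected_by("office_network", DEPENDS_ON)))
    print(sorted(affected_by("cell_network", DEPENDS_ON)))
```

The hard part in real life isn't the graph walk, it's knowing what belongs in the map. Nobody in our office would have drawn the arrow from the fax machine to the e-mail server until the day it mattered.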

The right way to handle this, from an engineering perspective, is to take these sorts of interdependencies into account when designing the systems, and then to test them periodically to make sure we got it right. That's not going to happen, at least right now, for social and economic reasons. The second-best approach would be to invest time and money in developing the design and evaluation tools needed either to constrain the effects of a failure, so they don't propagate to other systems, or at least to predict the likely failure modes and how far they'll propagate. We might eventually do that, but it'll take time. What's most likely to happen is the same thing that happened in structural engineering a hundred years ago: we may well see a series of failures that nobody forecast, and they'll likely continue until we implement one of the first two approaches above.
