Tuesday, October 18, 2011

With all these backup systems, there's still a single point of failure

TechCrunch reported this about RIM's explanation for the worldwide email delivery issues in October 2011:

RIM held a quick press conference call today to address the ongoing outages which started in Europe but have spread to the rest of the world, including the US. The message was straightforward: a “core switch failure” in their European unit (though they did not give the exact location) that failed to turn over to one of the backup systems. The total failure resulted in a backlog of messages that they are chewing through at this moment.

So RIM had backup systems, but they depended on a "core switch" - apparently a single one - to do its job. This story reminded me of an experience I had a few years ago.

My company had contracted with a large hosting provider for data center services. The provider touted its highly secure building, bulletproof systems, redundant power and network, and backup diesel generators. We signed up and felt fully protected.

A few months later, a massive snowstorm hit the town where the data center was located. The electricity went out soon after the storm started. A few hours later, our servers went down. What happened?

We found out the next day, after we came back online, that a technician had noticed fuel leaking from a diesel generator as the staff prepared to start the generators up. This was a fire hazard, of course, so the generators remained powered off until the mess could be cleaned up and the fire department could confirm that they were safe.

So this facility with redundant everything had, after all, a single point of failure: a fuel leak was enough to keep the backup generators offline. Therein lies a lesson for business continuity folks everywhere.

Here's RIM Co-CEO Mike Lazaridis talking about the issue:


