Thursday, September 5, 2013

Improving large, distributed information systems by inducing failures

I've been thinking about this powerhouse paper from the Association for Computing Machinery's acmqueue site for the last week. It's called "The Antifragile Organization: Embracing Failure to Improve Resilience and Maximize Availability" by Ariel Tseitlin.

Taking its starting point from Nassim Taleb's arguments about antifragility (see references here and here), Tseitlin discusses ways of making large distributed information services antifragile - i.e., able to capitalize on disruption and improve in its wake. His focus is on testing and simulation: how do you exercise highly complex systems to ensure they will not collapse under stressors?

As Tseitlin observes, traditional scripted testing is utterly unsuited to this task - in systems of any size, it's impossible to build (or even imagine) the total number of test cases required to prove a system's robustness. Moreover, even the largest test system is a fraction of the size and complexity of the production environment.

Taking a radically different approach, companies like Amazon and Netflix are increasingly causing intentional disruption within their production systems to assess whether their resilience mechanisms are working as expected - and whether new vulnerabilities have emerged. Tseitlin describes several ways that this is done:

Once you have accepted the idea of inducing failure regularly, there are a few choices on how to proceed. One option is GameDays, a set of scheduled exercises where failure is manually introduced or simulated to mirror real-world failure, with the goal of both identifying the results and practicing the response—a fire drill of sorts. Used by the likes of Amazon and Google, GameDays are a great way to induce failure on a regular basis, validate assumptions about system behavior, and improve organizational response.

But what if you want a solution that is more scalable and automated—one that doesn't run once per quarter but rather once per week or even per day? You don't want failure to be a fire drill. You want it to be a nonevent—something that happens all the time in the background so that when a real failure occurs, it will simply blend in without any impact.

One way of achieving this is to engineer failure to occur in the live environment. This is how the idea for "monkeys" (autonomous agents really, but monkeys inspire the imagination) came to Netflix to wreak havoc and induce failure. Later the monkeys were grouped together and labeled the Simian Army.

Netflix's "monkeys" include, among others, a Chaos Monkey, which randomly terminates virtual instances in the production environment; and a Latency Monkey, which inserts delays into various components of the network.

These run regularly, and the Netflix team measures whether the system adapts to the disruption as expected. The random, low-level nature of these tests avoids the limits of human-scripted test cases, and because they run in the production environment, they are not constrained by the size of a test environment.
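
To make the idea concrete, here is a minimal sketch of what a Chaos-Monkey-style agent might look like. This is not Netflix's actual Simian Army code; the helpers list_production_instances and terminate are hypothetical stand-ins for whatever API your cloud provider or orchestration layer exposes, and the probability and interval are arbitrary illustrative values.

```python
import random
import time

# Hypothetical infrastructure hooks -- stand-ins for your environment's
# real API (these names are not from the paper or from Netflix's code).
def list_production_instances():
    """Return identifiers of the instances currently serving traffic."""
    raise NotImplementedError

def terminate(instance_id):
    """Forcibly terminate a single instance."""
    raise NotImplementedError

def chaos_monkey(probability=0.05, interval_seconds=3600):
    """Periodically pick one running instance at random and kill it.

    The termination itself is not the point; what matters is what it
    reveals. In a resilient system, traffic reroutes and a replacement
    instance comes up with no user-visible impact.
    """
    while True:
        instances = list_production_instances()
        if instances and random.random() < probability:
            victim = random.choice(instances)
            terminate(victim)
            # In practice you would log the event and feed it into the
            # same monitoring used to measure real outages.
        time.sleep(interval_seconds)
```

Because the agent's behavior is random rather than scripted, it keeps probing combinations of failures that no test plan would have enumerated.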

Is it risky? Not if the system has been engineered not to collapse under stressors. Top-quality distributed systems are built to isolate failures and degrade gracefully rather than suffer catastrophic downtime. That said, the approach does require a system that has been in production long enough to develop stability - Twitter in its first two years would not have been a good candidate for this, but Twitter at present would be.
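
A rough sketch of the kind of failure isolation that makes this tolerable: bound a flaky dependency's latency and fall back to a degraded response instead of letting the failure cascade. The function names, timeout, and fallback value below are purely illustrative assumptions, not anything described in the paper.

```python
import concurrent.futures

# Illustrative dependency call -- a stand-in for any downstream service
# that a Latency Monkey might slow down or a Chaos Monkey might kill.
def fetch_recommendations(user_id):
    raise NotImplementedError

FALLBACK = []  # safe degraded response, e.g. an empty or precomputed list

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def recommendations_with_fallback(user_id, timeout_seconds=0.5):
    """Isolate a failing dependency and degrade gracefully.

    A slow or failing downstream call returns the fallback instead of
    stalling the page that depends on it, so the disruption stays local.
    """
    future = _pool.submit(fetch_recommendations, user_id)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:  # timeout, or an error raised by the call itself
        return FALLBACK
```

When the monkeys strike, a system built this way serves slightly worse results for a while rather than going down.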

Tseitlin concludes the paper by discussing how these resilience exercises build toward true antifragility, through practices such as blameless post-exercise postmortems and requiring that developers also be operators, the better to anticipate code that might cause operational issues down the line.
