We live in a world that’s obsessed with the cloud and with hosting almost everything it possibly can there. It’s new, it’s easy to access, almost anyone can do it and it’s cheap. Unfortunately, with everyone jumping on the bandwagon, the cracks are starting to show: many businesses aren’t building their web applications the right way, or even considering the new ways their applications could fail.

A while ago, Netflix released a tool for Amazon EC2 called ‘Chaos Monkey’, which is not only one of the best tools ever thought up but also every administrator’s worst nightmare. You see, Chaos Monkey is a tool that randomly kills instances and other services in order to test how well a system copes with failure.

In a 2010 post titled “5 lessons we’ve learned from using AWS”, Netflix details the hard lessons it learnt in the cloud. Whilst this is a very old post, it’s still extremely relevant as more businesses wade their way into the cloud. This part is particularly unsettling for the engineers among us:

“One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.”

If you’re in infrastructure and that doesn’t give you the chills, then you must not work for a company with critical services. The notion seems ridiculous, even crazy, at first, but after you consider it for a while, you might realize that this is one of the best ways you could ever plan for the inevitable. Almost nobody turns off instances at random “just to check” whether the system is really redundant.

Coding Horror put it best in their post about Chaos Monkey: “Raise your hand if where you work, someone deployed a daemon or service that randomly kills servers and processes in your server farm. Now raise your other hand if that person is still employed by your company.”

Too many companies deploy systems without a well thought out redundancy strategy. Be it on a VMware platform or in the cloud, there’s almost always a single point of failure that’s been overlooked. Unfortunately, when it comes to the real world, the Chaos Monkey could strike at any time, in any form, and there’s no choosing how.

If you went to the management at your company and told them that you wanted to make the company network and services more redundant, they’d likely give you permission right away. But if you told them you wanted to make the services more redundant by randomly switching off core infrastructure, you’d likely be shown the door with haste.

The Chaos Monkey isn’t just a tool – it’s a reality that companies need to face. Once an organization realizes that shutting down servers at random shouldn’t interfere with anything, the Chaos Monkey suddenly becomes reasonable: almost an assurance that the network design is sane.

Failure is inevitable, especially in infrastructure you can’t control at a low level. It also happens when you least expect it and at the worst possible time. Amazon’s EC2, Azure and Rackspace’s cloud have all had their fair share of issues (as does any IT system), but designing for failure puts you ahead of the pack. Most infrastructure people are probably now thinking about the time they were paged at 3AM because an instance froze up.

Netflix said, when it open-sourced the Chaos Monkey tool last year, that it had run the tool for a year and it had randomly terminated a staggering 65,000 instances. Amazingly, the company says that usually nobody notices, but they are still learning from the surprises that Chaos Monkey can bring.

Testing DR systems and failover functionality has long been a manual process. If a user wanted to simulate a failure, they’d need to manually terminate a machine or push a button in a DR system. Even for large companies, that’s not done very often (and isn’t a small task).

Ensuring that your design is rock solid, even if it means doing something a little crazy, is essential. When you’re still up and your competitors (who use the same service) are down, you won’t seem so crazy anymore. Redundancy tests can and should be occurring all the time.

IT is dynamic, and systems change so often that testing them in the real world can be hard to achieve. A fix you put in yesterday may have completely broken the network’s redundancy, or taking a machine offline for a few hours could completely cripple the network.

Somehow, even though Amazon’s cloud (and many others) has been around for so long, many companies still seem to struggle with hosting in it. Every time there’s an issue with EC2, a slew of websites go down. Reddit and Heroku still can’t tolerate a massive Amazon failure. I can’t speak for their infrastructure, but there’s always more work to be done towards fail-proofing.

Chaos Monkey does seem on the surface to be using a sledgehammer to solve smaller problems. It is, somewhat, but it’s also key to unearthing issues that would eventually become major outages. Problems can be solved before they’re critical. By knowing that something could break at any given time Netflix is able to plan and have the mindset of ensuring that no single point of failure could cripple the service.

That mindset is important for the cloud. Anything can happen, especially considering how young the technology is. If you’re hosting anything that’s external facing, no matter how big or small, you should build it to be as redundant as you can possibly afford. It’ll save you in the long run.

If you’re ready to unleash your own flavor of Chaos, Netflix has open-sourced its Chaos Monkey on GitHub. Their version allows the Chaos Monkey to unleash its doom on a schedule, so that engineers are available to respond in case something terrible actually happens. Perhaps before you implement it, though, you should try it in a sandbox.
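
To make the idea of “random kills, but only on a schedule” a little more concrete, here is a minimal Python sketch. It is not Netflix’s code and it doesn’t touch any cloud API. The function names, the weekday 09:00 to 15:00 window and the instance ids are all made up for illustration. It only shows the shape of the logic: pick a random victim while engineers are around, then check that enough instances are left.

```python
import random
from datetime import datetime, time

# Assumed window (hypothetical): weekdays, 09:00-15:00, when engineers are around.
WINDOW_START = time(9, 0)
WINDOW_END = time(15, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between WINDOW_START and WINDOW_END."""
    return now.weekday() < 5 and WINDOW_START <= now.time() < WINDOW_END


def pick_victim(instances, now, rng=random):
    """Return one instance id to terminate, or None outside the window."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))


def service_survives(instances, victim, min_healthy=1):
    """Hypothetical health check: enough instances remain after the kill."""
    remaining = [i for i in instances if i != victim]
    return len(remaining) >= min_healthy


if __name__ == "__main__":
    group = ["web-1", "web-2", "web-3"]  # made-up instance ids
    victim = pick_victim(group, datetime.now())
    if victim is None:
        print("Outside the window, nothing killed.")
    else:
        print(f"Would kill {victim}; survives: {service_survives(group, victim, 2)}")
```

In a sandbox you’d swap the print for a real termination call and the health check for a real probe, but the principle is the same: if losing any one instance makes `service_survives` false, you’ve found your single point of failure before the real world did.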
