So AWS Went Down – 12 Steps You Can Follow to Minimise Future Risk

We witnessed an unexpected disaster this week: Amazon Web Services, a major hosting provider, suffered downtime. This meant that many businesses relying on the AWS infrastructure were in for an unwelcome break.

For some, this was only a minor hiccup that barely harmed their business at all. For others, however, each minute of the unplanned downtime meant serious losses. Imagine launching a massive marketing campaign only minutes before AWS goes down. It could mean thousands, maybe even millions of dollars in losses.

If yours is one of those businesses that suffered, or could suffer, very real damage because of your hosting provider’s issues, this article is for you. Protect your business by taking the 12 safety steps below.

From the Business Perspective

1. Create a graph to see how much of your business is dependent on AWS
It doesn’t have to be a pretty graph, but it should be readable and show real data. The point is to be aware of the potential risks and to make a rational decision about investing in extra infrastructure. Preparing for your hosting provider’s downtime is a costly endeavour and you only want to do it if you really need to. Also, once you do have indisputable evidence that your business is at risk, the investment won’t seem quite as large.
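If you'd rather start from data than from a drawing, the graph can be as simple as a dictionary. The sketch below is only an illustration: the subsystem names and the "aws-" naming convention are made up, and you would replace them with your own inventory. It walks the dependencies and lists every system that would be affected, directly or indirectly, by an AWS outage.

```python
from collections import deque

# Hypothetical inventory: each system mapped to what it depends on directly.
DEPENDENCIES = {
    "storefront": ["checkout", "product-images"],
    "checkout": ["orders-db", "payment-gateway"],
    "product-images": ["aws-s3"],
    "orders-db": ["aws-rds"],
    "payment-gateway": [],
    "newsletter": ["mail-provider"],
    "mail-provider": [],
    "aws-s3": [],
    "aws-rds": [],
}


def depends_on_aws(system, graph):
    """Return True if `system` reaches a node named 'aws-*', directly or not."""
    seen, queue = {system}, deque([system])
    while queue:
        node = queue.popleft()
        if node.startswith("aws-"):
            return True
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False


def aws_exposure(graph):
    """List your own systems that an AWS outage would affect."""
    return sorted(
        name for name in graph
        if not name.startswith("aws-") and depends_on_aws(name, graph)
    )


print(aws_exposure(DEPENDENCIES))
```

In this made-up inventory the newsletter and the payment gateway survive an outage, while the storefront does not, even though it never touches AWS directly.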

2. Test whether the dependency threatens your whole business
You probably don’t need every one of your servers to keep your business going, but make sure you know the difference between your production and development environments, and what the consequences would be if any program or subsystem failed.

3. Create a fall-back plan for each subsystem
Building a disaster recovery plan for your infrastructure is necessary (and we do hope you have one), but you probably shouldn’t depend on it to get your app or store back up in a short amount of time. You should have a quick fix for each of your crucial subsystems. Remember to consider the geographical location and find the most cost-effective solution. Sometimes, it might be enough if you make an offline backup yourself and store it in your drawer. Other times, it won’t. You need to know the difference – for each of your subsystems.

4. Hire a team to monitor your infrastructure – or take advantage of automatic tools
Either option is fine, as long as someone or something keeps a watchful eye on your online services. Sure, you could check it once every five minutes to see whether your website is still running, but how long are you going to go without sleep?

The advantage of hiring a team is that you’d be working with human beings capable of their own judgement and of adapting to unexpected situations. They could also, say, call an engineer in the middle of the night and make sure that engineer starts fixing the crisis. It’s not a cheap solution, however, so going with automation might be the better option. You’ll still need someone tech-savvy to set it up and maintain it, but you won’t have to worry about the human factor so much.

Whichever option you pick, monitoring is pretty much mandatory if you lose money for every minute of downtime. Sometimes, even paying more than you stand to lose will be the right choice, as you avoid damage to the unmeasurable asset that is your brand.

5. Don’t be afraid of investing in a fall-back plan
Redundancy in your infrastructure might seem, well, redundant, but it might actually be the thing that saves your business. It’s better to invest and survive a potential crisis than to hold onto the money in the short term and sink your company.

6. Avoid total vendor lock-in
Is your whole infrastructure hosted by Amazon? If yes, you might have made a mistake. Sure, AWS is reliable, but it isn’t perfect. Their servers will suffer downtime, just as they have recently. Cloud infrastructure might seem like it has infinite uptime, but that’s not the case. Look at your provider’s SLA. AWS, for example, guarantees 99.95% availability, which works out to roughly 22 minutes of potential downtime per month. Also, though it’s not something they will want to do, Amazon might have to break their SLA if, for example, they get hit by an earthquake.
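The downtime an availability percentage allows is simple arithmetic, and it's worth running for your own provider's SLA. Here is a minimal sketch; the 30-day month is an assumption, and a longer month gives a slightly larger figure.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime an SLA permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.95, 99.99):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} minutes per 30 days")
```

For 99.95% this gives 21.6 minutes over 30 days. Notice how quickly the allowance grows as the percentage drops: 99% permits more than seven hours.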

You can prepare for this if you keep your systems on servers belonging to various providers, in various locations. Remember that server rooms, though they tend to have all kinds of security and safety measures, won’t withstand everything that is thrown at them.

It might be a good idea to have your own backup servers and hold onto copies of your data. Sometimes, an extra machine under your desk will do the trick, and sometimes you’ll need an actual server rack from a professional provider. The goal is to not be dependent on a single vendor.

From the Technical Perspective

7. Your systems really should have redundancy
By redundancy I mean that there should be additional and unused resources in case you suddenly need them. Modern technology helps us prepare for all kinds of horrible scenarios. Your server explodes? Not to worry, you’ve had a backup all along!

You (or your development and maintenance team) will have to choose the best strategy for your situation. Whether you go with an active/active or an active/passive high availability cluster shouldn’t depend on how much the solution will cost you, but on which option is ultimately better for your business.

8. Keeping backups is always a good idea
Extra copies of your data and code are a precaution I’m sure you’ve already implemented. It would be seriously risky to keep no backups at all. You can have them on-premises or at another location, online (e.g. with Amazon’s Storage Gateway) or offline. Either way, you’ll be safe in case of data loss or temporary inability to access your data.

Your backups don’t need to be super recent, but make sure you know how long they stay relevant. Does your system change a lot every week (e.g. with new orders if you run an ecommerce business)? Consider making copies of your data every night. Remember, though, that backups for some systems will never be 100% up-to-date. Weigh the cost of making them often against the cost of losing a portion of your data. You’re looking for a reasonable middle ground.
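To make that trade-off concrete, you can put rough numbers on it. The sketch below uses a deliberately simple model, and every figure in it is a placeholder: a fixed cost per backup, a monthly probability of losing your data, and the assumption that a failure strikes, on average, halfway between two backups.

```python
HOURS_PER_MONTH = 30 * 24


def monthly_cost(interval_hours, cost_per_backup, value_per_hour,
                 monthly_loss_probability):
    """Backup spend plus expected value of lost data for one month."""
    backups_per_month = HOURS_PER_MONTH / interval_hours
    backup_spend = backups_per_month * cost_per_backup
    # Assumption: on average a failure hits halfway between two backups.
    expected_loss = monthly_loss_probability * (interval_hours / 2) * value_per_hour
    return backup_spend + expected_loss


def cheapest_interval(candidates, **assumptions):
    """Pick the backup interval (in hours) with the lowest monthly cost."""
    return min(candidates, key=lambda hours: monthly_cost(hours, **assumptions))


# Placeholder figures, not real prices.
best = cheapest_interval(
    [1, 6, 24, 168],
    cost_per_backup=2.0,
    value_per_hour=50.0,
    monthly_loss_probability=0.05,
)
print(f"Cheapest interval: every {best} hours")
```

With these made-up inputs the nightly backup wins. Change the value of an hour of data and the answer shifts, which is exactly why you should run it with your own numbers.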

9. Consider increasing your system’s resilience through autoscaling and load balancing
This will save you a lot of stress when one small part of your infrastructure fails or when you suddenly need more resources, as these solutions are designed to keep your systems running smoothly in exactly those situations. Don’t expect them to do anything when a whole region of AWS servers goes down, though.

Autoscaling is a measure that prevents a sudden surge in activity (e.g. a spike in the number of visits to your website) from degrading how your system runs. This is done by automatically assigning more resources (processing power, RAM, storage space, etc.) to your infrastructure. Amazon is among the hosting providers that offer this service.
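To give a feel for the kind of decision an autoscaler makes, here is a minimal sketch of a proportional scaling rule. It is not Amazon's implementation; the 60% CPU target and the instance limits are assumptions chosen for illustration.

```python
import math


def desired_instances(current, cpu_percent, target_percent=60.0,
                      minimum=2, maximum=10):
    """Scale the instance count so average CPU moves toward the target."""
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Never drop below the safety floor or exceed the budget ceiling.
    return max(minimum, min(maximum, wanted))


print(desired_instances(current=4, cpu_percent=90))  # traffic spike
print(desired_instances(current=4, cpu_percent=15))  # quiet night
```

The floor keeps some redundancy around even when traffic is low, and the ceiling keeps a runaway spike from turning into a runaway bill.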

Load balancing means distributing workload between different computing units within your infrastructure. A single computer can only handle so many threads at a time, so buying a more powerful machine isn’t the right answer to every problem. With load balancing, you can make sure that users’ requests are being handled within a reasonable timeframe and that you’re using your resources in the optimal way. Additionally, as you don’t technically need these extra computing units for your system to work per se (you need them for it to work well), they increase the redundancy of your infrastructure.
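The simplest balancing strategy is round robin: hand each request to the next server in line and skip any server known to be down. The sketch below is an illustration only, with made-up backend names. Real load balancers offer more strategies, but the idea is the same.

```python
class RoundRobinBalancer:
    """Hands out backends in turn, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cursor = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])  # hypothetical names
balancer.mark_down("web-2")
print([balancer.next_backend() for _ in range(4)])
```

With web-2 marked as down, requests simply alternate between the two remaining servers. That is the redundancy benefit in action.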

10. Make sure your disaster recovery plan is solid
A solid plan ought to be tried and tested rather than based on pure assumption. Netflix do it right: they use Chaos Monkey to test their infrastructure. Even if it turns out that your infrastructure fails and preventing this isn’t a viable solution (e.g. because of the cost), you will at least know how the whole mess will go down, which will help you do damage control and bring your infrastructure back online quickly.

11. Your team needs to do regular system healthchecks and implement custom system alerts
These should be an integral part of your monitoring. Healthchecks are requests sent to a server every 1-5 minutes to check whether it’s live. If, for example, three consecutive requests fail to go through, the server likely requires reanimation. You want to know what’s wrong as soon as it goes wrong and where the issue lies – this way your team will be able to start solving it immediately.
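A healthcheck of this kind takes only a few lines. Below is a minimal sketch using the Python standard library; the one-minute interval, the five-second timeout and the alert function are placeholders, and the threshold of three failures follows the example above.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5):
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


class HealthTracker:
    """Counts consecutive failures and signals when the threshold is hit."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, ok):
        """Record one result; return True exactly when an alert should fire."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures == self.failure_threshold


def monitor(url, interval_seconds=60, probe=http_probe, alert=print):
    """Check `url` forever, alerting after three failed checks in a row."""
    tracker = HealthTracker()
    while True:
        if tracker.record(probe(url)):
            alert(f"{url} failed {tracker.failure_threshold} checks in a row")
        time.sleep(interval_seconds)
```

Calling monitor("https://example.com/health") would run the loop; the URL is a placeholder. In practice you would swap print for something that actually wakes an engineer up.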

Another important element of your monitoring should be custom alerts for particular elements of your system, especially the ones with a high workload or ones that can bring your whole infrastructure down. Here’s a book that might help you develop the right approach. As these custom alerts are an element of monitoring automation, they shouldn’t be a terribly costly feature.

12. Your infrastructure should be highly decoupled
When one of your subsystems or programs goes down, you don’t want it to drag down everything else with it. Separation is crucial. It probably won’t keep your systems running at 100% capacity, but 90% is good enough when the alternative is a complete system meltdown. Let your team invest their time into building a good, scalable system infrastructure for your business. This is something you will never regret.
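At the code level, decoupling often comes down to deciding what each part of your system does when a neighbour fails. Here is a minimal sketch of one such technique, graceful degradation through a fallback. The product page and the recommendation service are hypothetical examples.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any failure is answered by `fallback` instead."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return call


def fetch_recommendations(product_id):
    # Stand-in for a call to a separate subsystem that is currently down.
    raise ConnectionError("recommendation service unreachable")


def no_recommendations(product_id):
    return []


def render_product_page(product_id, recommend):
    return {"product_id": product_id, "recommendations": recommend(product_id)}


safe_recommend = with_fallback(fetch_recommendations, no_recommendations)
page = render_product_page(42, safe_recommend)
print(page)
```

The page loses its recommendations but still renders, so customers can keep buying while the failed subsystem is being fixed.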

My System Is as Safe as It Can Be. Now What?

Congratulations! You’re among the very few who have decided that they don’t want to leave their livelihood and life’s work to pure chance. Now you can relax and keep your supply of popcorn ready for when it’s time to watch the others panic. Or read this AWS Lightsail review.

Seriously speaking, though, things can still go wrong. As safe as possible will never be 100% safe, so stay up-to-date with the developments in system administration and security. Keep an eye on your hosting provider – if they seem to be having trouble, it might be best to move your systems elsewhere. Above all, stay vigilant, and don’t hesitate to ask others for advice. I, for one, would be happy to answer your questions.

(Rafał Wiliński and Olga Trąd contributed to this post)

This post is part of our contributor series. It is written and published independently of TNW.
