Lessons from the Amazon S3 Outage: Self-Protection is a Must

By Mehdi Daoudi, CEO of Catchpoint

On February 28, Amazon Web Services’ S3 storage service experienced a lengthy outage, bringing down or impairing thousands of websites, applications, and IoT-connected devices that rely on it. This was the second time in a few months that a major piece of internet infrastructure collapsed, creating a domino effect across many sites and services worldwide. The November outage of Dyn’s DNS service likewise demonstrated the contamination effect that major multi-tenant internet services can inflict when they experience problems.

With these outages becoming more commonplace, we are left with a cold, hard truth: 100 percent uptime for major internet services is unrealistic, no matter how reputable a provider may be. We can’t blame Amazon; outages are inevitable, and no company, however big or powerful, is completely immune. The responsibility lies with us, and in particular with our tendency to concentrate too much reliance on a single external provider.

Cloud services like Amazon’s AWS or Microsoft’s Azure are often shrouded in a perception of unmatched strength and infallibility. While most cloud services are reputable and robust, the cloud is still just a collection of servers, switches, and someone’s code. Many cloud providers are building out infrastructure quickly to support more workloads and customers, which makes them more prone to issues as the number of performance-impacting variables increases. After all, this isn’t the first time AWS has failed.

While the convenience, cost savings, and flexibility the cloud provides are clear, this shouldn’t equate to blind reliance. Businesses must ultimately claim full responsibility for the performance (speed, reliability) of their own cloud-based applications and services. What can be done?

  • Cloud service providers must be monitored around the clock, as much as or even more than your on-premises infrastructure. You must constantly gauge their response levels, from vantage points as geographically close as possible to the datacenters serving your end users. This provides the most realistic view of your provider’s infrastructure health.
  • Monitor your own end users’ performance levels, also from the closest possible geographic vantage point. This enables businesses using the cloud to identify when a cloud service provider may be experiencing a problem, so that it can be flagged to the provider preemptively. Analyzing end-user performance data alongside provider performance data also delivers a “complete picture” that enables a business to determine when a performance problem may lie on its own side (for example, an internal infrastructure element). Additionally, it can reveal cases where the cloud service provider is performing well, but the geographic distance of a particular end-user segment warrants adding or shifting resources.
  • Communication with end users is also crucial when catastrophe strikes your cloud service provider. End users don’t care who or what caused your service to become unavailable; they’ll simply blame your business. So, it’s up to you to tell them. While that may not give you a completely free pass, it’s at least an explanation. Amazon took the proper steps in communication by being upfront and transparent about the issue across multiple platforms, offering some reprieve for its business users during a time of chaos.
  • Perhaps most importantly, have a contingency plan in place. A failsafe plan, such as distributing workloads across multiple cloud providers and availability zones, is what determines how much damage an outage like this does to a business. It’s not Amazon’s responsibility to create a redundancy plan for its business customers; it’s each customer’s job to make sure their business is covered.
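The first two points above boil down to probing your provider from the outside and tracking how often those probes succeed. As a minimal sketch (the endpoint URL is a hypothetical placeholder, not a real bucket; a production monitor would probe from multiple geographic locations and feed the results into an alerting system):

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=5.0):
    """Issue one HTTP GET against the endpoint; return (ok, latency_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        ok = False  # timeout, DNS failure, connection refused, etc.
    return ok, time.monotonic() - start

def availability(results):
    """Fraction of successful probes -- the raw number behind an uptime figure."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if ok) / len(results)

# Hypothetical health-check object -- substitute one of your own:
# ENDPOINT = "https://example-bucket.s3.amazonaws.com/healthcheck.txt"
# results = [probe(ENDPOINT) for _ in range(5)]
# print(f"availability: {availability(results):.0%}")
```

Comparing these externally measured numbers against the same checks run from inside your own infrastructure is what separates “the provider is down” from “our side is down.”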
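The redundancy point can be illustrated with a client-side failover sketch: keep the same asset mirrored with more than one provider or region, and fall through to the next copy when one is unreachable. The endpoint URLs below are assumptions for illustration, not real mirrors:

```python
import urllib.request
import urllib.error

# Hypothetical mirrors of the same asset on different providers/regions.
ENDPOINTS = [
    "https://assets-us-east.example.com/app.json",
    "https://assets-eu-west.example.com/app.json",
]

def fetch_with_failover(urls, timeout=3.0):
    """Try each mirror in order; return the body from the first that responds."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # this mirror is down; try the next one
    raise RuntimeError(f"all mirrors failed: {last_error}")
```

This is deliberately simplistic (real deployments would use DNS- or CDN-level failover and replicate writes, not just reads), but it captures the principle: an outage at one provider degrades you to a slower path instead of taking you offline.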

We live in an age of increasing dependency on external internet services. The most important takeaway from these outages is the need to adjust accordingly to ensure our critical applications and brand reputations are covered, through ongoing monitoring, redundancy plans and honest, proactive end-user communications. This may require time and effort. However, it will make all the difference in the world the next time an unpredictable internet event comes to pass. It’s only a matter of time before the next one strikes.



This post is part of our contributor series. It is written and published independently of TNW.
