As you probably already know, some of your favorite Internet services were unavailable for the better part of a day because of a massive Amazon cloud outage, caused by a single point of failure during a storm in Virginia. Along with Instagram and Pinterest, Netflix was one of the companies affected.
The outage was a harsh reminder of how much we have come to rely on cloud computing, both as consumers and as businesses. Netflix took to its tech blog today to discuss some of its findings from a weekend its customers won’t let it forget for quite some time.
Greg Orzell and Ariel Tseitlin of Netflix called Friday’s outage “one of the most significant outages in over a year.” While it only affected those in the United States, it lasted three hours, which is a lifetime for any service, let alone one that serves up streaming movies.
While some of this post might be overly technical for most folks, I will share a few snippets that demonstrate how dedicated Netflix is to pushing cloud computing forward:
Netflix made the decision to move from the data center to the cloud several years ago. While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved. When we dig into the root-causes of our biggest outages, we find that we can typically put in resiliency patterns to mitigate service disruption.
Consumers don’t understand these things; all they know is that they couldn’t watch movies last Friday. This fact isn’t scaring Netflix into changing its strategy, though:
The state of the cloud will continue to mature and improve over time. We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.
It’s still mind-boggling to me that a single point of failure, caused during a storm, can bring services like these to their knees. A lot of people pay for Netflix, which makes its situation completely different from that of Instagram or Pinterest. Whenever you are charged for a service, you have a right to know what happened and why it happened, and to be told that it won’t happen again.
Can Netflix make sure it never happens again? Nothing is perfect, of course, but hopefully the company has learned some valuable lessons on how to attain redundancy for situations like this.
In addition to building its “Cloud Operations and Reliability Engineering team”, Netflix says it’s not jumping off the cloud anytime soon:
We take our availability very seriously and strive to provide an uninterrupted service to all our members. We’re still bullish on the cloud and continue to work hard to insulate our members from service disruptions in our infrastructure.
I just want to watch movies; I don’t care if they host the service on the moon.