RAMP is a sold-out conference taking place tomorrow and Friday. It’s designed to help developers learn from those who have been in the trenches building highly scalable backend systems.
It takes place in Budapest over the next two days and is organized by the teams behind some of Hungary’s most successful startups: Prezi, LogMeIn and Ustream. You’ll be able to watch it right here on The Next Web, so look out for our post containing the livestream shortly before 9am CEST.
As a warmup for the main event, we spoke to Jeremy Edberg, a Reliability Architect whose job is to keep Netflix online.
Edberg, previously Reddit’s Chief Architect and Lead Technologist (and its first paid employee), is an expert at scaling services from millions to billions of pageviews. As you’ll see below, we found out what, exactly, a Reliability Architect does, and how Netflix stays online despite accounting for 30% of all Internet traffic. Edberg also offered some excellent advice on how all of this applies to startups.
HW: To get things started, can you share what it means to be a “Reliability Architect?”
JE: The Reliability Architect encompasses a couple of different, equally important roles. One part is to be the call leader for outages and make sure that customer pain is being relieved as soon as possible. Another part of the role is to evangelize best practices around the company. A third part of the job is writing tools that help maintain reliability, like a tool to intelligently route alerts to the right people over the right medium. Our team is also responsible for the Chaos Gorilla, as well as a couple of other members of the Netflix Simian Army.
HW: If you had to simplify your role, would it be accurate to say you’re the person responsible for keeping Netflix online?
JE: I would say that everyone at Netflix has a responsibility to keep Netflix online, but that it is my primary concern.
HW: What are some of the traffic/load challenges related to doing that? Any stats you can share?
JE: Traffic and load are always an important challenge, but we’ve put a lot of effort into running the service programmatically so that scaling happens as automatically as possible. All of our frontend systems use auto-scaling on AWS, so as more people watch, the systems scale up automatically and are designed to keep delivering a great experience. We use autoscaling to maintain reliability, and as a side effect it also saves a bunch of money. As the load increases, so does the number of machines; we scale our fleet by 50% daily.
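The target-tracking behavior Edberg describes can be sketched in a few lines. This is a hypothetical Python illustration, not Netflix’s actual policy; AWS autoscaling implements a far more sophisticated version, and the target rate and minimum floor below are assumptions:

```python
import math

def desired_capacity(requests_per_sec, target_rps_per_instance=1000, min_instances=2):
    # Size the fleet so each instance stays near its target load.
    # As traffic grows the fleet grows with it; as traffic falls,
    # it shrinks, but never below a safety floor.
    needed = math.ceil(requests_per_sec / target_rps_per_instance)
    return max(min_instances, needed)
```

A scheduler evaluating this every few minutes would grow and shrink the fleet with the daily traffic curve, which is where the cost savings come from.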
Of course there is also the well-known stat that Netflix is responsible for 30% of all Internet traffic. Our secret is that most of that is not served from AWS. We have our own network of hardware all over the world to take care of that (that’s our Openconnect CDN).
The most interesting part of scaling the service is scaling the engineering organization behind it. As the company grows, our most important concern after providing an amazing customer experience is maintaining our culture of Freedom and Responsibility; allowing our engineers to do what they think is right and not getting in their way. Of course this presents a unique challenge for our team of reliability engineers. It’s always a fine balance between innovation velocity and reliability, but I think we’ve done a pretty good job.
HW: You mentioned Chaos Gorilla and Netflix Simian Army. Can you explain to TNW readers what these initiatives do?
JE: I would like to start by pointing out a paper one of my colleagues recently released. That paper goes into great detail about the Simian Army, but briefly, the simians are our way of maintaining anti-fragility. That is, any large, complex system will always be in some state of degradation. Our theory is that we should induce that degradation ourselves in a controlled manner, so that we can make sure we are resilient to that type of failure.
The Chaos Gorilla is one of these programs, which shuts down every instance in one of our Amazon zones (much like pulling the plug on one of your datacenters to see what would happen). All of our systems are designed to run with only two of their three zones, and the Gorilla helps us test that and make sure that we continue to have the necessary level of redundancy.
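The zone-evacuation drill Edberg describes can be sketched as follows. This is a hypothetical Python illustration of the idea, not Netflix’s tooling; the zone names and fleet shape are assumptions:

```python
import random

ZONES = ("us-east-1a", "us-east-1b", "us-east-1c")

def simulate_zone_failure(fleet, zones=ZONES):
    """Drop every instance in one randomly chosen zone and return
    (victim_zone, survivors). A resiliency check can then assert that
    the remaining two zones still carry the full workload."""
    victim = random.choice(zones)
    survivors = [inst for inst in fleet if inst["zone"] != victim]
    return victim, survivors
```

The point of running this in production, rather than on a whiteboard, is that it verifies the two-of-three-zones design assumption continuously instead of once.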
HW: What are some best practices you’ve pushed for lately? Any best practices that could apply to startups?
JE: Pretty much all of them apply to startups as much as to the largest companies. For context, it’s important to point out that Netflix is built as a service-oriented architecture (SOA), which means that each service talks to the others via a REST API. So one best practice is to be generous in what you accept and stingy in what you give. That is, if something sends bad data, try to figure out what’s supposed to happen, and at the same time make sure the data you send back is as well-formed as possible. This applies to a startup too, even one that isn’t using SOA, as long as it accepts user data. It too should try to accept data in as many forms as possible and return the most well-formed data to its users.
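Edberg’s “generous in what you accept, stingy in what you give” principle (often called the robustness principle, or Postel’s law) might look like this for date handling. The function and accepted formats below are hypothetical, a minimal sketch of the idea:

```python
from datetime import datetime

# Be generous: accept several common date spellings.
ACCEPTED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y", "%Y%m%d"]

def normalize_date(raw):
    """Try each accepted format; always emit strict ISO 8601 (or None).
    Callers never see the messy input shapes, only well-formed output."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

The same pattern works at any service boundary: tolerate variation on the way in, emit exactly one canonical shape on the way out.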
Another best practice is to use caching, and to use it properly. One of the hardest problems in computer science is cache invalidation. If you design your software to handle stale data in such a way that you can treat everything as immutable, then you don’t have to worry about invalidation anymore (and if you can’t do that, then you should definitely put a lot of thought into invalidation!). Additionally, if you put a cache in front of your datastore, it can increase reliability, because if your datastore goes down or becomes slow, you still have the hottest data in the cache. All of these caching-related ideas can apply to startups too.
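The immutable-data trick Edberg mentions can be sketched with versioned cache keys. The dictionary stands in for a real cache like memcached or Redis, and the key scheme is a hypothetical illustration:

```python
cache = {}  # stand-in for memcached / Redis

def versioned_key(name, version):
    # The version is part of the key, so an update writes a NEW key
    # instead of mutating an old one. Stale entries are never read,
    # so nothing ever needs to be invalidated; old keys just age out.
    return f"{name}:v{version}"

def put(name, version, value):
    cache[versioned_key(name, version)] = value

def get(name, version):
    return cache.get(versioned_key(name, version))
```

Readers who know the current version always fetch fresh data, and the invalidation problem disappears because entries are write-once.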
HW: When you say “tools that help maintain reliability,” can you share an example? How many of these alert systems exist today?
JE: One of the tools our team develops and maintains is the alert gateway. All of the alerts that are generated flow through the alert gateway, which has intelligence about where to route the alerts and at what priority. A feature that we’re adding is the ability to correlate different alerts into a new alert. So for example, you could have three minor alerts from three different systems, but if all of them happen within one minute, it would generate a new major alert.
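A sliding-window correlator like the one Edberg describes can be sketched as follows. The class name, thresholds, and severity labels are assumptions for illustration, not the actual alert gateway:

```python
from collections import deque

class AlertCorrelator:
    """If `threshold` minor alerts arrive within `window` seconds,
    escalate them into a single major alert."""

    def __init__(self, window=60, threshold=3):
        self.window = window
        self.threshold = threshold
        self.recent = deque()  # timestamps of recent minor alerts

    def ingest(self, timestamp, severity):
        if severity != "minor":
            return severity  # pass non-minor alerts through unchanged
        self.recent.append(timestamp)
        # Drop minor alerts that fell outside the correlation window.
        while self.recent and timestamp - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) >= self.threshold:
            self.recent.clear()  # reset after escalating
            return "major"
        return "minor"
```

A real gateway would also track which systems the alerts came from, but the window-and-threshold core is the same idea.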
Another tool we maintain is one that keeps track of changes to the production environment. Through both polling and pushed information, we try to keep a complete record of every change in production. That way, when there is an outage, we can look through the history and try to find out what changed that could be causing it. We can also alarm on certain events happening within a certain timeframe.
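The change-record idea reduces to a log that can be queried around an outage time. This is a hypothetical sketch of the concept, not Netflix’s tool:

```python
class ChangeLog:
    """Record production changes, then ask: what changed just before an outage?"""

    def __init__(self):
        self.events = []  # (timestamp, description) pairs

    def record(self, timestamp, description):
        self.events.append((timestamp, description))

    def changes_near(self, outage_time, lookback=3600):
        # Everything that changed in the window leading up to the outage,
        # oldest first, so responders can walk the timeline.
        return sorted(e for e in self.events
                      if outage_time - lookback <= e[0] <= outage_time)
```

In practice the entries would come from deploy systems, config pushes, and cloud APIs, but the outage-time query is the part that makes the record useful.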
HW: Would you mind sharing a preview of what you’ll be talking about at RAMP?
JE: At the RAMP conference in Budapest I’ll be telling the story of how Reddit scaled up to billions of pageviews, what we did right, and some of the mistakes we made. I’ll talk about how we used a bad hash key which made it almost impossible for us to grow our cache, and also how we decided very early on to use a key/value store before such things were really mainstream.
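Edberg doesn’t detail the bad hash key in this interview, but a common version of the mistake is sharding on a low-cardinality component of the key, so adding cache nodes cannot spread the load. A hypothetical illustration:

```python
import zlib

def shard_of(hash_input, num_shards=8):
    # Naive modulo sharding over a deterministic hash.
    return zlib.crc32(hash_input.encode()) % num_shards

# Bad scheme: shard on the key's type prefix only. Every comment hashes
# to the same shard, so growing the cluster doesn't relieve the hot node.
bad_shards = {shard_of("comment") for i in range(1000)}

# Better: shard on the full key so load spreads across all nodes.
good_shards = {shard_of(f"comment:{i}") for i in range(1000)}
```

With the bad scheme, one shard absorbs all the traffic no matter how many machines exist; hashing the full key lets the cache actually grow with the cluster.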
Image credit: Justin Sullivan / Getty Images