This article was published on June 22, 2012

Twitter explains its “turbulence” from today in-depth, and it wasn’t due to an attack

As we reported earlier, Twitter had some serious problems today, going down for over an hour and then bouncing up and down over the next few hours. The company initially offered little explanation beyond calling it a “cascading bug”, but it has more to say now that the issues are behind it.

It appears that once Twitter thought it had everything under control, the wheels came off again. I noted earlier that the Status page said everything was fixed, but that notice was quickly replaced with an “in progress” message.

Here’s what Twitter’s VP of Engineering, Mazen Rawashdeh, had to say about the issues, along with Twitter’s “sincere apologies”:

Not how we wanted today to go. At approximately 9:00am PDT, we discovered that Twitter was inaccessible for all web users, and mobile clients were not showing new Tweets. We immediately began to investigate the issue and found that there was a cascading bug in one of our infrastructure components. This wasn’t due to a hack or our new office or Euro 2012 or GIF avatars, as some have speculated today. A “cascading bug” is a bug with an effect that isn’t confined to a particular software element, but rather its effect “cascades” into other elements as well. One of the characteristics of such a bug is that it can have a significant impact on all users, worldwide, which was the case today. As soon as we discovered it, we took corrective actions, which included rolling back to a previous stable version of Twitter.

We began recovery at around 10:10am PDT, dropped again around 10:40am PDT, and then began full recovery at 11:08am PDT. We are currently conducting a comprehensive review to ensure that we can avoid this chain of events in the future.

For the past six months, we’ve enjoyed our highest marks for site reliability and stability ever: at least 99.96% and often 99.99%. In simpler terms, this means that in an average 24-hour period, twitter.com has been stable and available to everyone for roughly 23 hours, 59 minutes and 40-ish seconds. Not today though.

We know how critical Twitter has become for you — for many of us. Every day, we bring people closer to their heroes, causes, political movements, and much more. One user, Arghya Roychowdhury, put it this way:

It’s imperative that we remain available around the world, and today we stumbled. For that we offer our most sincere apologies and hope you’ll be able to breathe easier now.

As the company states, the service has sported a 99.96%+ uptime in the past few months, but when it comes to a utility of this importance, it’s not how often it goes down, it’s how long it’s down. This is a real-time product, and when a real-time product goes down, it’s a real big deal.
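To put those percentages in perspective, here is a rough back-of-the-envelope sketch of the arithmetic. The function names are made up for illustration, and the 98-minute outage figure is only an assumption pieced together from the approximate times in Twitter's statement (roughly 9:00am to 10:10am, then 10:40am to 11:08am PDT), not an official number.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
MINUTES_PER_DAY = 24 * 60        # 1,440 minutes


def daily_downtime_seconds(uptime_percent):
    """Average seconds of downtime per 24-hour period implied by an uptime percentage."""
    return SECONDS_PER_DAY * (100.0 - uptime_percent) / 100.0


def format_uptime(uptime_percent):
    """Express an uptime percentage as hours, minutes and seconds of availability per day."""
    up_seconds = round(SECONDS_PER_DAY - daily_downtime_seconds(uptime_percent))
    hours, remainder = divmod(up_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


def day_availability_percent(downtime_minutes):
    """Availability for a single day, given that day's total minutes of downtime."""
    return 100.0 * (MINUTES_PER_DAY - downtime_minutes) / MINUTES_PER_DAY


if __name__ == "__main__":
    for pct in (99.96, 99.99):
        print(f"{pct}% uptime is about {format_uptime(pct)} available per day, "
              f"or {daily_downtime_seconds(pct):.2f}s of downtime")

    # Assumption: about 70 + 28 = 98 minutes of downtime, estimated from
    # the approximate times quoted in Twitter's statement.
    assumed_outage_minutes = 98
    print(f"A {assumed_outage_minutes}-minute outage puts the day at "
          f"{day_availability_percent(assumed_outage_minutes):.2f}% availability")
```

By this math, 99.96% works out to about 35 seconds of downtime in an average day and 99.99% to under 9 seconds, so Twitter's “40-ish seconds” figure sits between the two. Under the 98-minute assumption, today's availability comes out to roughly 93%.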

Having said that, this level of transparency from someone with a lot of responsibility within the company is appreciated, and it shouldn’t be lost on those of us who use Twitter.

Image: Tashmahal via Flickr