Microsoft has published its promised follow-up to last week’s short note on why its Azure cloud computing service went down for over two hours in Western Europe. Previously, the company placed the blame on a “misconfigured network device” that disrupted traffic inside Azure.
In its new entry, Microsoft broke down what occurred in great detail [Formatting: TNW]:
Windows Azure’s network infrastructure uses a safety valve mechanism to protect against potential cascading networking failures by limiting the scope of connections that can be accepted by our datacenter network hardware devices. Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand.
However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.
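The failure mode Microsoft describes can be illustrated with a small sketch: a per-device connection limit acts as the safety valve, and every rejected connection generates management traffic. This is purely an illustration of the dynamic, not Microsoft’s implementation; all names here are hypothetical, and the numbers are arbitrary.

```python
# Hypothetical sketch of a "safety valve" connection limit on a network
# device: once demand exceeds the configured limit, every excess attempt
# turns into management traffic instead of an accepted connection.

class DeviceSafetyValve:
    def __init__(self, connection_limit):
        # This limit must be raised whenever cluster capacity is added;
        # per Microsoft's post-mortem, that step was missed.
        self.connection_limit = connection_limit
        self.active = 0
        self.management_messages = 0

    def try_accept(self):
        if self.active >= self.connection_limit:
            # A rejection produces a management message; under sustained
            # excess demand, these messages pile up rapidly.
            self.management_messages += 1
            return False
        self.active += 1
        return True

valve = DeviceSafetyValve(connection_limit=100)

# New capacity drives demand past the stale limit.
for _ in range(150):
    valve.try_accept()

print(valve.active)               # 100 -- the valve caps accepted connections
print(valve.management_messages)  # 50 -- excess demand becomes management traffic
```

In Azure’s case, it was this surge of management traffic that tripped bugs in the devices’ firmware and pushed them to 100% CPU, which is why adjusting the limit settings resolved the immediate problem.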
To solve the problem, Microsoft adjusted the limit settings. The company has also promised to fix all the bugs that surfaced during the incident.
It’s the same question you hear constantly: can you trust the cloud with your mission-critical hosting, storage, and computing needs? Given how professionally bouts of downtime like this tend to be handled these days, I almost want to reply with a question of my own: do you want your in-house team trying to solve the problem, or the teams that Amazon, Rackspace, Microsoft, and others keep on hand to quash the problems that will, inevitably, occur?
More to the point, that question may not even be the right one to ask. Think of it this way: why is it such a big damn deal when a major cloud provider has a hiccough? Because of how massive their user bases are, and how critical their services have become.
Top Image Credit: Robert Scoble