Updates at the foot of the post.
Yesterday something very unexpected happened; all of the BBC’s web empire went down, leaving its News, Sports and popular iPlayer websites completely inaccessible.
With unconfirmed reports of an attack by the online collective Anonymous, BBC staff and reporters immediately took to Twitter to broadcast what had occurred on the website, live-tweeting any updates they received from BBC engineers. It quickly emerged that there were network issues, although at the time the specific causes were unknown.
Within an hour, the website returned and access seemed to be back to normal for most users.
This morning Richard Cooper, Controller Digital Distribution & Operations, took to the corporation’s Internet Blog to explain what happened during yesterday’s outage. Cooper highlighted two networking issues that left the BBC’s online services inaccessible for what must have been the longest hour in some BBC employees’ lives.
Our systems are designed to be sufficiently resilient (multiple systems, and multiple data centres) to make an outage like this extremely unlikely. However, I’m afraid that last night we suffered multiple failures, with the result that the whole site went down. Enough of the systems were restored to bring BBC Online pretty well back to normal by 23:45, and we were fully resilient again by 04:00 this morning.
For the more technically minded, this was a failure in the systems that perform two functions. The first is the aggregation of network traffic from the BBC’s hosting centres to the internet. The second is the announcement of ‘routes’ onto the internet that allows BBC Online to be ‘found.’ With both of these having failed, we really were down!
We’ll be taking a very hard look at what we need to do to make sure that this doesn’t happen again.
A routing issue, such as the one experienced by the BBC, affects how user requests are delivered to the BBC website. Firstly, traffic was not being aggregated and passed from the corporation’s hosting centres to the rest of the internet. Secondly, the systems that announce the BBC’s routes to the internet failed, meaning other networks had no path by which to reach BBC Online.
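For readers curious about what the “announcement of routes” in Cooper’s statement means in practice, the short Python sketch below may help. It is only an illustration under stated assumptions: the prefixes come from the reserved documentation address ranges rather than the BBC’s real network, and `find_route` is a hypothetical helper, not part of any router software. It shows that once a network’s announcement disappears from a routing table, there is no longer any route to look up for its addresses.

```python
import ipaddress

def find_route(routing_table, destination):
    """Return the most specific announced prefix covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in routing_table if addr in net]
    # Longest-prefix match: prefer the most specific covering network.
    return max(matches, key=lambda net: net.prefixlen, default=None)

# Hypothetical prefixes from the documentation ranges (not real BBC prefixes).
site_prefix = ipaddress.ip_network("203.0.113.0/24")
other_prefix = ipaddress.ip_network("198.51.100.0/24")

routing_table = {site_prefix, other_prefix}
before = find_route(routing_table, "203.0.113.10")

# Simulate the announcement being withdrawn from the table.
routing_table.discard(site_prefix)
after = find_route(routing_table, "203.0.113.10")

print(before)
print(after)
```

In this toy model the servers behind the withdrawn prefix could be running perfectly well, yet the lookup returns nothing, which is roughly why a site can vanish from the internet even when its web servers are healthy.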
Service resumed within the hour, but full resilience was not restored for over five hours as engineers worked to fix the remaining issues and return the website to its previous state.
Update: Shortly after publishing the story, we were made aware of an article by The Guardian which reported that the BBC and Siemens (its IT contractor that handles its infrastructure) are engaged in a row after the corporation published an internal email between the two companies on its news website.
Siemens executives are said to be furious at the BBC for publishing an internal email which The Guardian has quoted:
“The offending email sent by Siemens to BBC staff on Wednesday morning said: ‘Cause of issue: Faulty Switch … Services Impacted: Everything.’
Siemens network engineers remotely powered down equipment at a second Internet connection at Telehouse Docklands. This got things back up and running again.
They then isolated the core router in Telehouse Docklands, and restored power to it. Once power was restored and the router was running in a satisfactory way, they reconnected to the internet and BBC networks in a controlled manner. Further investigations are ongoing to identify the root cause of this fault.”
The offending article has since been amended, quoting the blog post above. Shame it won’t appease the executives at Siemens.