Gmail users experienced a lengthy outage yesterday, lasting up to 11 hours for some. Google apologized today for the issues (including slow-arriving messages and unavailable attachments), which it calculates affected almost a third of the emails that route through its servers.
Google notes that most messages were unaffected: 71 percent had no delay. The remaining 29 percent saw an average delivery delay of 2.6 seconds. That doesn’t seem like much, but Google did admit that about 1.5 percent of messages, likely many millions if not billions of emails, were delayed by more than two hours.
Here’s what happened, according to Google:
The message delivery delays were triggered by a dual network failure. This is a very rare event in which two separate, redundant network paths both stop working at the same time. The two network failures were unrelated, but in combination they reduced Gmail’s capacity to deliver messages to users, and beginning at 5:54 a.m. PST messages started piling up.
Google’s automated monitoring alerted the Gmail engineering team within minutes, and they began investigating immediately. Together with the networking team, the Gmail team restored some of the network capacity that was lost and worked to repurpose additional capacity, clearing much of accumulated message backlog by 1:00 p.m. PST and the remainder by shortly before 4:00 p.m. PST.
The only good part of the whole fiasco was that Gmail otherwise seemed to work (although some users beg to differ). Google notes that users could still log in, read messages (those that were actually delivered), send email, and so on.
Nevertheless, the company says it will take steps over the next few weeks to keep these issues from happening again: increasing network and backup capacity for Gmail, and making message delivery more resilient “even in the event of a rare dual network failure.” Most importantly, Google says it is updating its internal practices so it can “more quickly and effectively respond to network issues.”
That was the real problem yesterday: the issues went on for many hours, with no end in sight. Then again, at least the outage didn’t last for three days.
Top Image Credit: Johannes Eisele/Getty Images