Friday, December 15, 2006

Google Mail and Talk outage

Even Google is not the king of five-nines reliability...

Around 10:15 am (Central European Time), Gmail started spitting out "Server Error We're sorry, but Gmail is temporarily unavailable. We're currently working to fix the problem -- please try logging in to your account in a few minutes." POP access was also timing out from a standard email client. At the same time, logging on to GTalk returned an error in place of the contact list:

<iq type="error" id="aacca" to="jlseguineau@gmail.com/psi100C6E2E">
  <query xmlns="jabber:iq:roster">
    <error type="wait" code="500">
      <internal-server-error xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
    </error>
  </query>
</iq>
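For readers who want to spot this kind of failure programmatically, here is a minimal Python sketch that pulls the error type and condition out of such a stanza. The function names and the placeholder address are my own, not part of any XMPP library, and the lookup ignores namespaces on purpose so it works whether the error element sits inside the query (as in the dump above) or directly under the iq.

```python
import xml.etree.ElementTree as ET

# Namespace that XMPP uses for stanza error conditions.
STANZAS_NS = "urn:ietf:params:xml:ns:xmpp-stanzas"

# Same shape as the stanza in the post; the address is a placeholder.
SAMPLE = """
<iq type="error" id="aacca" to="user@example.com/psi">
  <query xmlns="jabber:iq:roster">
    <error type="wait" code="500">
      <internal-server-error xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
    </error>
  </query>
</iq>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def describe_iq_error(stanza_xml):
    """Return (error_type, condition) for an iq error stanza, else None.

    Hypothetical helper: it looks for the first element named 'error'
    anywhere under the iq, whatever namespace it inherited.
    """
    iq = ET.fromstring(stanza_xml.strip())
    if local_name(iq.tag) != "iq" or iq.get("type") != "error":
        return None
    for element in iq.iter():
        if local_name(element.tag) == "error":
            condition = next(
                (local_name(child.tag) for child in element
                 if child.tag.startswith("{" + STANZAS_NS + "}")),
                None,
            )
            return element.get("type"), condition
    return None


if __name__ == "__main__":
    error_type, condition = describe_iq_error(SAMPLE)
    # In XMPP, type "wait" marks a temporary error worth retrying later.
    print(error_type, condition, "retry later" if error_type == "wait" else "")
```

The "wait" error type is the interesting part here: it tells the client the failure is temporary, which matches what happened next.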

Around 10:30 contact list access was re-established, and normal operations resumed on GTalk. But Gmail was still inaccessible. Only around 10:42 did it come back to a pallid life, and even then some functions of the interface were still throwing operational errors. UI functionality returned to normal around 10:45, and outbound SMTP was re-established around 10:47, but inbound SMTP had still not been restored. In addition, the content of my "Sent Mail" folder had disappeared... At 10:52 inbound SMTP was restored and I was able to receive some of my earlier test messages. Not all of them, though: the rest were being held on their originating servers after the outage, waiting to be retried later. At 10:57 the content of my "Sent Mail" folder reappeared. And at around 11:02 the service was back to its normal working state.

In the end, the recovery was well orchestrated, and Google has certainly put some thought into its procedure. Every external access path degraded gracefully, be it POP/IMAP, web UI or XMPP. Similarly, the service came back to life gradually, without a spike, with different functions being restored independently. This incident brings to light some of the architectural choices they made, and shows how their messaging infrastructure is integrated.

In particular, contact lists are definitely shared between mail and instant messaging, which is why GTalk showed a partial service outage while the contact list service was down. It also means that Google could very easily come up with a presence-enabled address book where end users could aggregate their contacts.

In the end, although I experienced a mere 45 minutes or so of inconvenience, the service was restored to its full capacity through what looks like an entirely automated process. A lesson to be learned by some would-be Web 2.0 entrepreneurs…
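For reference, the "five nines" remark at the top can be put in numbers. The sketch below is plain arithmetic in Python, with function names of my own choosing. It assumes a 365-day year and takes the outage as the 47 minutes between the approximate 10:15 and 11:02 timestamps above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Downtime allowed over the period at a given availability (0..1)."""
    return (1.0 - availability) * period_minutes


def achieved_availability(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability over the period if this outage were the only downtime."""
    return 1.0 - outage_minutes / period_minutes


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)  # "five nines"
    outage = 47  # minutes, from the approximate timestamps in this post
    print(f"Five-nines budget: {budget:.2f} minutes per year")
    print(f"This outage alone: {outage / budget:.1f}x that budget")
    print(f"Best case for the year: {achieved_availability(outage):.5%}")
```

Five nines leaves a little over five minutes of downtime per year, so this single incident spent roughly nine years' worth of budget in one morning.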