The Wisdom of the Cloud: What Can We Learn From Yesterday's Outage?

“Oy vey!” a Big Think blogger wrote to me yesterday. “Is BigThink.com down?” Oy vey, indeed it was. And we certainly weren’t alone.

Amazon in recent years has moved beyond its core retail business to rent data storage and server time through its Amazon Web Services division. Companies like Big Think, Foursquare and Instagram have taken advantage of Amazon’s low-cost offering. Yesterday we had a hiccup.

Here is how Big Think’s developer, Jason Mayfield, explains it:

Amazon Web Services (AWS) provides the underlying infrastructure for hosting Big Think’s various web properties. AWS operates its services in separate, resource-isolated “availability zones,” with one or more availability zones being part of one of several “regions.”

Yesterday, beginning in the early afternoon and lasting into the evening, the availability zone hosting bigthink.com experienced what Amazon is calling “degraded performance” in several key resources necessary for the site.  As a result, bigthink.com was unavailable for a couple of hours.

When we became aware of the outage, we first attempted to shift the bigthink.com resources to another, unaffected availability zone.  Our efforts there failed, as the mechanisms for doing such a move were also affected by the outage.

Our next recovery step was to bring up a temporary server with another infrastructure provider, which we were able to do, and by mid-afternoon that temporary server began responding to requests from a portion of our users.  Because of the nature of domain names and Internet routing, it took as much as an hour for all user requests to be routed to that server, and some users still experienced some errors while attempting to visit the site.

Around 5:30 pm EDT, Amazon’s infrastructure began to stabilize.  We continued to monitor until our original servers became available to respond to requests, at which time we began migrating our traffic back to the original servers.  By early evening, the original Amazon-based infrastructure was fully operational and no further traffic was going to the temporary server, which we then took offline.

We are developing a plan both to protect against such outages in the future and to implement better disaster recovery procedures for those instances where our best efforts at protection fail.
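To illustrate the routing delay Jason mentions, here is a minimal Python sketch of how cached domain name records can spread a server switch out over time. It is not a model of Big Think's actual setup: the one-hour cache lifetime (TTL), the number of resolvers, and the function name `fraction_on_new_server` are all assumptions made for illustration.

```python
import random

# Assumed cache lifetime. The account above only says the switch took
# "as much as an hour"; the real TTL is not stated.
TTL_SECONDS = 3600


def fraction_on_new_server(elapsed_seconds, resolvers=10_000, ttl=TTL_SECONDS, seed=0):
    """Estimate the share of resolvers that have picked up the new address.

    Each simulated resolver cached the old record at some random moment
    before the change, so its copy expires at a random moment within one
    TTL after the change. Only then does it ask again and learn the new
    address.
    """
    rng = random.Random(seed)
    switched = 0
    for _ in range(resolvers):
        expires_at = rng.uniform(0, ttl)  # seconds after the DNS change
        if elapsed_seconds >= expires_at:
            switched += 1
    return switched / resolvers


for minutes in (0, 15, 30, 45, 60):
    share = fraction_on_new_server(minutes * 60)
    print(f"{minutes:>2} min after the change: {share:.0%} of resolvers see the new server")
```

Under these assumptions the share of visitors reaching the temporary server grows gradually rather than all at once, which is consistent with some users still seeing errors for a while after it came online.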

What’s the Big Idea?

Yesterday’s outage was a clear reminder of how dependent Big Think and other sites are on cloud services, of the perils of that dependence, and of our real need for what Jason described as “better disaster recovery procedures.” So let it be a learning moment. To put it another way, as Kelly Clay wrote in Forbes:

“…companies must be able to build in resiliency so if and when AWS goes down again (and it will), these websites won’t be affected to the degree that sites like Reddit are currently experiencing.”

So are public clouds like AWS unreliable? While there have been past mass outages — most notably last June — experts argue that services like Amazon are indeed reliable, but need to be managed properly. Netflix, for instance, even built a tool called Chaos Monkey that wreaks havoc on infrastructure in order to help engineers learn from and resolve problems.
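The idea behind a tool like Chaos Monkey can be sketched in a few lines of Python. This is a toy illustration only, not Netflix's tool or its API: the `Server` and `Pool` classes and the `unleash_monkey` and `survives_one_failure` functions are hypothetical names invented here. The sketch deliberately knocks out one server at random and checks whether requests still get answered.

```python
import random


class Server:
    """A stand-in for one machine (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """Round-robin pool that moves on to the next server when one fails."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0

    def handle(self, request):
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            try:
                return server.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy servers left")


def unleash_monkey(pool, rng):
    """Take one healthy server offline at random, on purpose."""
    victim = rng.choice([s for s in pool.servers if s.healthy])
    victim.healthy = False
    return victim


def survives_one_failure(server_count, seed=0):
    """Return True if the pool still answers requests after losing one server."""
    rng = random.Random(seed)
    pool = Pool(Server(f"server-{i}") for i in range(server_count))
    unleash_monkey(pool, rng)
    try:
        for i in range(10):
            pool.handle(f"request-{i}")
    except RuntimeError:
        return False
    return True


print("3 servers:", survives_one_failure(3))
print("1 server:", survives_one_failure(1))
```

The design point is that the failure is induced deliberately and routinely, so a setup with no spare capacity is exposed in a test rather than during a real outage.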

Big Think has interviewed many experts on cloud computing, and they have addressed both its game-changing potential and our anxieties about security. At its most ambitious, the cloud may be described as a global brain that enables collective intelligence. And yet, not everything belongs in the cloud, as the programmer David Heinemeier Hansson tells Big Think in the video below.

For instance, why do we need to bring the desktop to the Web?

On the consumer level, Hansson wonders why people need to edit photos and videos online. If there is no great benefit to that, then using the cloud to do this work is neither necessary nor, perhaps, even desirable, Hansson argues. We can use the cloud and also “use the awesome local, graphical power and computing power of a modern computer to do those other heavy things,” Hansson says.

Watch here:

Image courtesy of Shutterstock

Follow Daniel Honan on Twitter @Daniel Honan