Recovering from Amazon cloud outage

by Drew Engelson — 21 Apr 2011

Amazon Web Services (AWS) has been integral to the success of nearly all recent project launches at PBS. All of our core applications are deployed on AWS EC2 servers and RDS database instances. While we have experienced the occasional component failure, these have been infrequent. When failures have occurred, we have typically been able to use the agility of the cloud to work through them quickly and easily.

AWS East is down

April 21st was quite another story! Early that morning, AWS began experiencing connectivity issues affecting Elastic Block Store (EBS) volumes in the Northern Virginia region (us-east-1), and hence any EC2 or RDS instances that depend on EBS. Oy vey!

The outage was heavily covered in the press.

What did this mean for PBS? Well, it took out our main portal site (PBS.org) and many core services (Merlin API, COVE video API, TV Schedules API, mobile apps for iPad and iPhone, and more) for a while. Ouch! While we try to use multiple availability zones where possible, all of our infrastructure sits in Amazon's East Coast data centers.

Recovering from AWS outage

Since the outage affected only EBS-based EC2 and RDS services in the East region, a path to work around it seemed obvious:

Avoid EBS (at least for now)
Go West, young man!

This is exactly what we did this morning. We relaunched some applications from backups on temporarily EBS-less servers, and we migrated some RDS database servers to the West Coast region (us-west-1). Once this was accomplished, our public-facing systems were back online.
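To make the decision rule above concrete, here is a minimal sketch in Python. It is only an illustration of the triage logic, not the tooling we actually used: the `Service` record, the `plan_recovery` function, and the service names are all hypothetical, and nothing here calls AWS.

```python
from dataclasses import dataclass

# Assumptions for illustration: the outage hits EBS-backed resources in
# us-east-1, and us-west-1 is the region we fall back to.
AFFECTED_REGION = "us-east-1"
FALLBACK_REGION = "us-west-1"


@dataclass(frozen=True)
class Service:
    name: str       # illustrative label, not a real PBS identifier
    kind: str       # "ec2" or "rds"
    region: str     # AWS region the service currently runs in
    uses_ebs: bool  # whether the service depends on EBS volumes


def plan_recovery(service: Service) -> str:
    """Return a one-line recovery action for a single service."""
    if service.region != AFFECTED_REGION or not service.uses_ebs:
        return f"{service.name}: unaffected, leave in place"
    if service.kind == "rds":
        # Go West: bring the database up in the fallback region.
        return f"{service.name}: restore from backup in {FALLBACK_REGION}"
    # Avoid EBS: relaunch the application on a server without EBS volumes.
    return f"{service.name}: relaunch from backup on an EBS-less server"


if __name__ == "__main__":
    inventory = [
        Service("portal", "ec2", "us-east-1", True),
        Service("schedules-db", "rds", "us-east-1", True),
        Service("static-assets", "ec2", "us-east-1", False),
    ]
    for svc in inventory:
        print(plan_recovery(svc))
```

Keeping the rule as a pure function over an inventory means it can be reviewed and tested without touching any live infrastructure; the actual relaunch and migration steps would still be done with whatever AWS tooling you use.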

Looking forward

Now what? How can we depend on AWS in the future? Should we migrate our services elsewhere? There is a universal truth that applies here: sh*t happens.

In practice, being in the cloud has so improved our overall stability and our ability to manage infrastructure with minimal resources that I can no longer imagine life without it.

So the real question is, "How can we reduce our exposure to this in the future?"

Any ideas?
