The Great AWS Outage of 04/21/11 is over (for some).

In Cloud Computing on April 23, 2011 at 7:18 am

As of this writing, AWS has recovered enough EBS volumes from the previous day's backups (in S3, I suppose). This could only mean that the "networking event" they were referring to was actually some disastrous component or hardware failure. AWS users who were using EBS as their primary data store (since it's supposed to be replicated within an availability zone, which translates to being backed up in a single data center) were particularly affected. When we began using AWS a couple of years ago, we also got burned by EBS (Elastic B*** S***). EBS I/O latency is high and its reliability is low; after a couple of ‘events’ (to borrow AWS’ favorite term) we realized EBS is not good for high I/O applications (e.g. a database).

Hence our current (and soon to be expanded) strategy:

1) Use older AMIs, which default to activating the instance’s disk (the one that is physically part of the machine).

2) Use the instance’s disk space as the store for high I/O applications (e.g. a database), replicate it to another instance, and back it up (in our case everything, including the ‘hosts’ file) several times a day to an attached EBS volume. EBS volumes “are automatically replicated on the backend (in a single Availability Zone)”. However, as experience has shown us (annually, it now seems), this is not enough, so…

3) Use S3 snapshots to back up the EBS volume in item 2. This way you have a third backup. Snapshots are “automatically replicated across multiple Availability Zones”, which means you’ll have copies of your backups in different data centers within one region.
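
The backup half of item 2 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not our actual script: the function name `backup_to_volume`, the `keep` and `stamp` parameters, and the archive naming are all made up for illustration, and it assumes the EBS volume is already attached and mounted as an ordinary directory. Replication to another instance and the S3 snapshot in item 3 are separate steps and are not shown.

```python
import tarfile
import time
from pathlib import Path


def backup_to_volume(source_dir, backup_mount, keep=4, stamp=None):
    """Archive source_dir into backup_mount, keeping only the newest `keep` archives.

    backup_mount is assumed to be the mount point of an attached EBS volume;
    to this code it is just a directory.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    source = Path(source_dir)
    dest = Path(backup_mount)
    dest.mkdir(parents=True, exist_ok=True)

    # Timestamped name so several backups a day do not overwrite each other.
    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"backup-{stamp}.tar.gz"

    # Write one compressed archive of the whole directory tree.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=source.name)

    # Timestamps sort lexicographically, so the oldest archives come first.
    archives = sorted(dest.glob("backup-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()

    return archive
```

Run from cron several times a day, a call such as `backup_to_volume("/mnt/data", "/ebs/backups")` (both paths hypothetical) would leave the last few archives on the EBS volume, ready to be captured by the next snapshot.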

On a typical day, the AWS support forums are full of benign questions and a lot of pleading. Since the majority of users do not pay for Premium Support (the one that comes with a phone number to call), many end up posting requests in the forums in the hope that an AWS engineer will pick up their request and do something about it. Us lot can be pathetic, really. Well, not today: there seems to be an Amazonian revolution in the offing, with some calling for a class action and others throwing in the towel. Here’s a favorite:

Farewell EC2, Farewell
Posted by: pixspree

Posted on: Apr 22, 2011 1:40 PM

Farewell EC2,

I am honestly going to miss you. You were my first and only time in the cloud. You were so experienced, and I was so… fresh. We had some great times. I remember that one time when we stayed up all night learning how to create AMI’s. We had so much fun launching and terminating, except for that one time when the accidental termination policy wasn’t set. Man that really sucked. I know you didnt mean it, just like now I know you didn’t mean to walk out on me for almost 2 days, but you gotta understand. I have needs. I need to get back in the cloud and I need support. You know the kind where it’s included in the overpriced retail cost. The kind where I can call up a support agent intead of begging Luke to stop my instance in the forums. I know Luke probably has a bigger keyboard then me, but that’s not the point. The point is that I thought we had a connection. Like a real port 21 connection. I guess I should have gotten the picture when you first refused my connection two nights ago in the wee hours of the morning. I thought it was just because I was drunk but it turns out I was wrong. I guess you just aren’t interested in me anymore… And that’s ok. It’s my time to move on. I hope you don’t forget me, because I won’t forget you and your cute little EBS volumes that I loved to attach and detatch.

Farewell EC2, We had a good go. I’m switching DNS.

P.S. I want my Jovi records back.

It would be interesting to know what really happened this time, and what AWS will do to soothe our collective ‘apprehensions’ and regain our lost (or at least degraded) confidence. They can start by immediately dropping prices across the board, to stop the migration out of AWS or at least the halt of expansion on AWS services (I already know we’ll do the latter). We don’t really know (nor care) what they will be doing to make sure this so-called network event does not happen again, because we already know it will.

