
Archive for April, 2011

AWS says sorry…but what about the rebate?

In aws, Cloud Computing on April 28, 2011 at 10:38 pm

Seven days and still no post mortem as promised, and it turns out the so-called “network event” was actually a hardware failure that resulted in some 0.07% of the EBS volumes being lost. Now they are talking about “the hardware” in the singular; what happened to the high-reliability claims? All snake oil?

Here’s their letter of apology (via Business Insider):
————————–
Hello,

A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.

What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility…

If you have no need for this snapshot, please delete it to avoid incurring storage charges.

We apologize for this volume loss and any impact to your business.

Sincerely,
Amazon Web Services, EBS Support

This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210
—————————

You have to hand it to AWS for not forgetting to remind the customer about the S3 storage fees.

The Great AWS Outage of 2011 Officially Over

In aws, Cloud Computing on April 25, 2011 at 11:35 am

Well, it’s official. After 78 hours Amazon has declared the emergency over:

7:35 PM PDT As we posted last night, EBS is now operating normally for all APIs and recovered EBS volumes. The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes.

If you believe you are still having issues related to this event and we have not contacted you tonight, please contact us here. In the “Service” field, please select Amazon Elastic Compute Cloud. In the description field, please list the instance and volume IDs and describe the issue you’re experiencing.

We are digging deeply into the root causes of this event and will post a detailed post mortem.

post mortem.. oh the irony.
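
If you do end up filing that support request, pulling together the instance and volume IDs they ask for is easy to script. Here’s a rough sketch using the boto Python library (assuming boto is installed and your AWS credentials are configured; nothing AWS-official here):

    # Rough sketch: list instance and volume IDs in US-EAST-1 so they can be
    # pasted into the support form. Assumes boto is installed and credentials
    # are configured (e.g. in ~/.boto or via environment variables).
    import boto.ec2

    conn = boto.ec2.connect_to_region('us-east-1')

    # Instances come back grouped into reservations.
    for reservation in conn.get_all_instances():
        for instance in reservation.instances:
            print("%s %s" % (instance.id, instance.state))

    # EBS volumes, their status, and what (if anything) they are attached to.
    for volume in conn.get_all_volumes():
        attached_to = volume.attach_data.instance_id
        print("%s %s %s" % (volume.id, volume.status, attached_to))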

Questioning the AWS infrastructure

In aws, Cloud Computing on April 25, 2011 at 10:39 am

With the dearth of information from AWS, and going only on past experience, we can only hazard a guess at what the AWS infrastructure looks like. However, the Great AWS Outage (not completely recovered from as of this writing) gives us some ideas.. this is a snapshot of the outage so far, showing that the majority of the API services were affected:

[Image: Great AWS Outage of 2011]

Pending the official post-mortem.. here are a couple of possibilities:

  1. All these services run on the same public EBS layer.  When that failed, they all failed.  This is the most likely explanation, but how does it account for the Elastic Beanstalk API failing as well (and that does not seem to be region-centric)?  There could also be a connection with the EBS failure on the 19th.. which resulted in a much bigger problem two days later. From the status page (EC2 N. Virginia):
    [RESOLVED]Increased error rates for Instance Import APIs in US-EAST-1
    4:38 AM PDT Between 02:55 am and 04:20 am PDT, the Instance Import APIs in the US-EAST-1 Region experienced increased error rates. The issue has been resolved and the service is operating normally.
  2. The US-EAST API infrastructure failed.  Call it the Battle: Los Angeles scenario.. where the weakest link of the invading aliens just happened to be their Command and Control (C&C). (The movie sucks, by the way, and this bunch of aliens obviously hadn’t learned the lessons of their cousins from ‘V’ and ‘Independence Day’.) This would explain Beanstalk failing as well, and it would mean the API infrastructure is not replicated to US-WEST or anywhere else.

Too bad.. we were seriously considering migrating our databases to RDS and were just waiting for the beta bugs to be weeded out. What a relief that we hadn’t yet.

AWS Outage Exhaustion.. 4 days and counting?!!!

In aws, Cloud Computing on April 25, 2011 at 7:53 am

OK, this will be my last post about the Great AWS Outage, which, judging from the posts in the support forums, still seems to be affecting a substantial number of users; not quite the rosy picture AWS painted on their status board:

Apr 24, 2:06 PM PDT We continue to make steady progress on recovering the remaining affected EBS volumes. We are now working on reaching out directly to the small set of customers with one of the remaining volumes yet to be restored.

We’re getting better support and uptime with our lone Linode box.  I suspect we’ll be moving some servers there after this disaster.

The Day the (AWS) Cloud Died

In Cloud Computing on April 24, 2011 at 4:53 pm

Well, it looks like the Great AWS Outage is now on its third day.. Despite the change in status from red to yellow on their status site, the US-EAST region is still not 100% up and running. For many of us this is one big pinprick in our cloud balloon, specifically the one that says ‘AWS’ on it.

The news is now being picked up everywhere and I don’t think they will be able to just sweep this under the rug.

The confidence of the majority of AWS users has been shaken to the core! I don’t mean to sound melodramatic, but I would have been screaming for blood if we (in the West) had been hit as well. As for this post’s title..

http://blogs.forbes.com/ciocentral/2011/04/22/the-day-the-cloud-died/

The Great AWS Outage of 04/21/11 is over (for some).

In Cloud Computing on April 23, 2011 at 7:18 am

As of this writing AWS has recovered enough EBS volumes from their backups (S3, I suppose) from the previous day. This can only mean that the ‘networking event’ they were referring to was actually some disastrous component/hardware failure. AWS users who were using EBS as their primary data store (since it’s supposed to be replicated within an availability zone, which translates to being backed up within a single data center) were particularly affected. When we began using AWS a couple of years ago, we also got burned by EBS (Elastic B*** S***): EBS I/O latency is high and reliability is low, and after a couple of ‘events‘ (to borrow AWS’ favorite term) we realized EBS is not good for high-I/O applications (e.g. databases).

Hence our current (and soon to be expanded) strategy:

1) Use older AMIs, which default to activating the instance’s local disk (the one that is physically part of the machine).

2) Use the instance’s disk for high-I/O application storage (e.g. the database), replicate to another instance, and back up everything (in our case including the ‘hosts’ file) several times a day to an attached EBS volume. EBS volumes “are automatically replicated on the backend (in a single Availability Zone)“. However, as experience has shown us (annually, it now seems), this is not enough, so…

3) Use S3 snapshots to back up the EBS volume in item 2. This way you have a third backup. Snapshots are “automatically replicated across multiple Availability Zones”, which means you’ll have a copy of your backups in different data centers within one region (see the sketch below).
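
For what it’s worth, item 3 is easy to automate. Below is a rough sketch using the boto Python library, run (say) from cron after the backup in item 2 completes; boto and working AWS credentials are assumed, and the volume ID is a made-up placeholder:

    # Rough sketch: snapshot the attached EBS backup volume so a copy lands in
    # S3-backed snapshot storage, which is replicated across Availability Zones.
    # 'vol-xxxxxxxx' is a hypothetical placeholder for the volume from item 2.
    import datetime
    import boto.ec2

    BACKUP_VOLUME_ID = 'vol-xxxxxxxx'  # placeholder, substitute your own

    conn = boto.ec2.connect_to_region('us-west-1')

    stamp = datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')
    snapshot = conn.create_snapshot(BACKUP_VOLUME_ID,
                                    'backup snapshot taken %s' % stamp)
    print("created %s from %s" % (snapshot.id, BACKUP_VOLUME_ID))

    # Housekeeping: drop this volume's snapshots that are older than two weeks.
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=14)
    for snap in conn.get_all_snapshots(owner='self'):
        if snap.volume_id != BACKUP_VOLUME_ID or snap.id == snapshot.id:
            continue
        started = datetime.datetime.strptime(snap.start_time[:19],
                                             '%Y-%m-%dT%H:%M:%S')
        if started < cutoff:
            snap.delete()

Snapshot creation is asynchronous, so this returns quickly; for a database volume you would still want to quiesce writes (or snapshot the replica from item 2) to get a consistent copy.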

On a typical day, the AWS support forums are full of benign questions and a lot of pleading (since the majority of users do not pay for Premium Support, the tier that comes with a phone number to call, many end up posting requests in the forums in the hope that an AWS engineer will pick up their request and do something about it; we can be a pathetic lot, really). Well, not today: there seems to be an Amazonian revolution in the offing, with some calling for a class action and others throwing in the towel. Here’s a favorite:

Farewell EC2, Farewell
Posted by: pixspree

Posted on: Apr 22, 2011 1:40 PM

Farewell EC2,

I am honestly going to miss you. You were my first and only time in the cloud. You were so experienced, and I was so… fresh. We had some great times. I remember that one time when we stayed up all night learning how to create AMI’s. We had so much fun launching and terminating, except for that one time when the accidental termination policy wasn’t set. Man that really sucked. I know you didnt mean it, just like now I know you didn’t mean to walk out on me for almost 2 days, but you gotta understand. I have needs. I need to get back in the cloud and I need support. You know the kind where it’s included in the overpriced retail cost. The kind where I can call up a support agent intead of begging Luke to stop my instance in the forums. I know Luke probably has a bigger keyboard then me, but that’s not the point. The point is that I thought we had a connection. Like a real port 21 connection. I guess I should have gotten the picture when you first refused my connection two nights ago in the wee hours of the morning. I thought it was just because I was drunk but it turns out I was wrong. I guess you just aren’t interested in me anymore… And that’s ok. It’s my time to move on. I hope you don’t forget me, because I won’t forget you and your cute little EBS volumes that I loved to attach and detatch.

Farewell EC2, We had a good go. I’m switching DNS.

P.S. I want my Jovi records back.

It would be interesting to know what really happened this time, and what AWS will do to soothe our collective ‘apprehensions’ and regain our lost (or at least degraded) confidence. They can start by immediately dropping prices across the board to stop the migration out of AWS, or at least the halting of expansion onto AWS services (I already know we’ll be doing the latter). We don’t really know (nor care) what they will be doing to make sure this so-called network event does not happen again, because we already know.. it will.

Major AWS Outage in US-EAST: 24 hours and counting…

In Cloud Computing, Networking on April 22, 2011 at 5:22 pm

Since about 1 AM PDT, AWS US-East’s EBS service has been down. It’s been 24 hours now and many people are getting mighty antsy about this disaster. According to their status site, a “networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes.“ The irony is that the automated re-mirroring activity is what is bringing down the entire EBS infrastructure, and EC2 (those instances that depend on EBS anyway, which most probably do), in that availability zone.

As for my servers, they seem to be ok in US-WEST.
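
For the curious, a quick way to eyeball this is to walk every EC2 region and print instance states; here’s a rough sketch with the boto Python library (assuming it’s installed and credentials are configured), nothing official:

    # Rough sketch: walk all EC2 regions and print each instance's state, so
    # anything that looks off outside US-EAST-1 stands out at a glance.
    # Assumes boto is installed and AWS credentials are configured.
    import boto.ec2

    for region in boto.ec2.regions():
        conn = region.connect()
        for reservation in conn.get_all_instances():
            for instance in reservation.instances:
                print("%s %s %s" % (region.name, instance.id, instance.state))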