Given the dearth of information from AWS, and going by past experience, we can only hazard a guess at what the AWS infrastructure looks like. However, the Great AWS Outage (not completely recovered as of this writing) gives us some ideas.. here's a snapshot of the outage so far, showing that the majority of the API services were affected.
Pending the official post-mortem.. here are a couple of possibilities:
- All these services run on the same public EBS layer; when that failed, they all failed. This is the most likely explanation, but how does it account for the Elastic Beanstalk API failing as well (and that failure does not seem to be region-centric)? There could also be a connection with the EBS failure on the 19th, which resulted in a much bigger problem two days later. From the status page (EC2 N. Virginia):
[RESOLVED] Increased error rates for Instance Import APIs in US-EAST-1 (4:38 AM PDT): Between 02:55 AM and 04:20 AM PDT, the Instance Import APIs in the US-EAST-1 Region experienced increased error rates. The issue has been resolved and the service is operating normally.
- The US-EAST API infrastructure failed. Call it the Battle: Los Angeles scenario.. where the weakest link of the invading aliens just happened to be their Command and Control (C&C). (The movie sucks, btw, and this bunch of aliens obviously hadn't learned the lessons of their cousins from 'V' and 'Independence Day'.) This would explain Beanstalk failing as well, and it would mean the API infrastructure is not replicated to US-WEST or anywhere else.
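If the second scenario is right, the obvious lesson is not to depend on a single region's control plane. A minimal sketch of the idea, a region-failover picker, assuming you feed it health data from your own probes (the region names and statuses here are hypothetical, and this is not a real AWS API call):

```python
# Hypothetical region-failover picker: prefer regions in order and
# return the first one whose API control plane your probes report healthy.
# "health" would be populated by your own monitoring, not this sketch.

PREFERRED = ["us-east-1", "us-west-1", "eu-west-1"]

def pick_region(health, preferred=PREFERRED):
    """Return the first region in preference order reported as 'ok'."""
    for region in preferred:
        if health.get(region) == "ok":
            return region
    raise RuntimeError("no healthy region available")

if __name__ == "__main__":
    # us-east-1 control plane degraded (the scenario above) -> fail over
    health = {"us-east-1": "degraded", "us-west-1": "ok"}
    print(pick_region(health))  # -> us-west-1
```

The point of the sketch is only that the preference list spans regions; if every API endpoint lives in US-EAST, there is nothing to fail over to.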
Too bad.. we were seriously considering migrating our databases to RDS and were just waiting for the beta bugs to be weeded out. What a relief.