After a week, Amazon finally released their much-awaited post mortem. The result: mea culpa.
In a nutshell they are saying that an engineer made a routing error during some maintenance at around 1am (reminds me of the FAA's air traffic controller problems), misrouting the traffic EBS uses for replication. When the EBS nodes could not replicate, they assumed their replica peers were down and triggered an alternative backup mechanism, which added to the load, drove it higher still and caused other EBS nodes to start backing up in panic mode. That's the explanation.
Sounds logical enough, but I don't buy it. Here's why:
1) Any network operator worth their salt would have at least two routes for redundancy, using some kind of IGP for automatic failover and load balancing. I can't believe that, with all that talk of high reliability, they only have one link for replication and use one so-called "control plane", whatever that means.
2) I don't see why a 'routing issue' can lose an EBS volume (i.e. render it unrecoverable), unless somehow a high load can cause a hard drive to fizzle out or write bad blocks. Very unlikely.
3) The fact that it took them more than three days to restore volumes shows that it is hardware related, meaning the primary devices, which were inexplicably rendered dead by a barrage of packets, had to be restored from the backups. I wonder if the 0.07% of unrecoverable volumes were created just before or during the outage.
4) Then there's the deafening silence of all the AWS evangelists, Jeff Bezos and all the other paid AWS bloggers. It's not that everybody is singing the same tune: nobody's singing at all!
In conclusion, I don't think this is a network-event-triggered disaster at all but an inherent infrastructure/hardware design failure. For the sake of the many people who believe in AWS, I hope AWS learned their lessons really well and will make the necessary "fixes".