InfoQ had an interesting article on the recent outage of Amazon's storage service S3. The article indirectly gives a beginner's lecture on fault tolerance metrics and offers some insight into how much SLAs are really worth in practice. We learn once again that even with a new term for distributed computing (“cloud”), the problems remain the same:
http://www.infoq.com/news/2008/02/s3-outage-trust-slas
[Update]
I am starting to get a little sneak preview of what old people's wisdom feels like. See here:
- Sidekick Gone
- Google Gone
- Bitbucket Gone (although this was not directly Amazon's fault)
There is nothing new about the failures themselves; the issue is rather the (misplaced) trust in the vendors' capabilities for dependability. Dependable systems cost real money, and a lot of it. The old rule still applies: you get what you pay for.