Sony’s PSN has now been down for more than two weeks. In A Letter from Howard Stringer, he apologized and called it a man-made disaster comparable to the recent tsunami in Japan. This reminds me of Skype’s failure early this year (see my old blog post). This time it is more serious: a sophisticated hacker attack. We were lucky that it was not a mission-critical service such as health care or financial markets. The serious problem here is that most cloud services still lack a service mindset. Most development and deployment of cloud services still follows the old mindset of developing a good feature set and building a scalable system with reasonable security. Very few, if any, have complete, tested plans for handling serious disasters and recovering the system quickly. In an ideal service world, service strategy and development should happen together. A service is not simply a matter of handing a box over to the customer. The cost of running a service is more than just the operating cost of servers and data centers. You must have plans and resources in place to handle a disaster and recover your service quickly. Simply assembling an engineering team after a disaster hits will not work.

We can learn and apply lessons from traditional service-based business models. We have to make sure that a dedicated disaster recovery team is always available. This team could combine engineering and operations staff. We should treat the cost of the disaster recovery team as a cost of running the service, not as project overhead. The team should not only be expert in running daily operations but also have in-depth knowledge of the design and implementation. They should run regular simulated exercises for service disasters, as any other disaster response team does in the real world. They should be ready to act on preemptive alerts. Customers of the service should have a well-defined channel for quickly reporting issues, and the disaster recovery team should be able to respond through it. The service architecture should make it possible to identify and isolate faulty pieces and to restore a minimum level of service as quickly as possible.
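One common way to isolate a faulty piece while keeping a minimum level of service running is a circuit breaker. The sketch below is only an illustration under my own assumptions: the class name `CircuitBreaker`, its parameters, and the `flaky_store` and `read_only_catalog` functions are all hypothetical and do not describe any particular service.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch: stop calling a failing component and serve a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before isolating
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable so drills can fake time
        self.failures = 0
        self.opened_at = None             # None means the component is in use

    def is_open(self):
        """Return True while the faulty component is isolated."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: let one trial call through, and
            # re-isolate immediately if that trial fails.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary, fallback):
        """Run primary() if healthy, otherwise return the degraded fallback()."""
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # isolate the faulty piece
            return fallback()
        self.failures = 0  # a success clears the failure count
        return result


# Hypothetical components used only for this illustration.
def flaky_store():
    raise RuntimeError("store backend unavailable")


def read_only_catalog():
    return "read-only catalog"


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
results = [breaker.call(flaky_store, read_only_catalog) for _ in range(3)]
```

In this sketch the first two calls try the failing backend and fall back, and the third skips the backend entirely because the breaker has isolated it. The injectable `clock` is there so that a recovery drill can simulate the cool-down without waiting in real time.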

In the end, I consider deploying a service to be like running any other traditional service business. For example, someone who builds a new hotel plans not only for good service staff and the resources to operate it, but also invests in handling emergency situations. Cutting corners on disaster handling and recovery is simply a recipe for expensive failure in the future.