A False Sense Of Security and Complacency = Revenue Loss
Its Monday morning and past Saturday evening issues in one of your datacenters triggered a failover to your second data center for service restoration. In other words, all customer traffic has been routed to a single datacenter. The failover was executed flawlessly and the team went back to bed waiting for Monday morning to permanently fix the issue so traffic could once again run out of both datacenters. On Monday morning, we are expecting a flash sale and will make close to $8000 a minute at peak. All is well and there is nothing to worry about. Right?
Hopefully you cringed at the above scenario. What if the data center you are running out of suffers from a failure? Or what if the only data center and its components that is now live for all of your traffic simply wasn’t sized correctly for acceptable performance during a traffic spike?
If it hasn’t happened yet, it will. If that were the case, your business would stand to lose significant revenue. We see it over and over again with many clients and have also experienced it in practice. Multiple datacenters can serve as a false sense of security and teams can become complacent. Remember, assume everything will fail as a monolith. If you are only running out of a single data center and the other is unable to take traffic, you now have a SPOF and as a whole the DC is a monolith. As a tech ops leader you have to drive the right sense of urgency and lead your team to have the right mindset. Restoring service with a failover is perfectly acceptable. However, the team cannot stop there. They must quickly diagnose the problem and return the site to normal service, which means you are once again running out of two datacenters. Don’t let the false sense of security slip into your ops teams. If you spot it, call it out and explain why.
To help combat complacency from setting in, we recommend considering the following:
- Run a Morning Ops meeting with your business and review issues from the past 24 hours. Determine which issues need to undergo a postmortem. See one of our earlier blogs for more information: http://akfpartners.com/techblog/2010/08/29/morning-operations-meeting/
- Communicate to your team and your business on the failure and what is being done about it.
- Run a postmortem determine multiple causes and actions and owners to address the causes: http://akfpartners.com/techblog/2009/09/03/a-lightweight-post-mortem-process/
- Always restore your systems to normal service as quickly as possible. If you have split your architecture along the Y or Z axis and one of the swim lanes fails or an entire datacenter fails, you need to bring it back up as quickly as possible. See one of our past blogs for more details on splitting your architecture: http://akfpartners.com/techblog/2008/05/30/fault-isolative-architectures-or-“swimlaning”/