Disaster Recovery is the bellybutton of the IT world. Everyone has it (or at least in theory) – However, does anyone know if it can be used for anything useful?
The Disaster When There Is No Active Recovery
Have you been in this scenario: Each year during budget time you SWAG $50-100K to “test disaster recovery.” During the many cycles of budget negotiation, it is dismissed at some point and forgotten until the next budget season? In the meantime, somewhere along the way you are audited or something goes “bang” and DR suddenly becomes a hot topic … for about a week and then it is business as usual.
In many of our technical due diligence reviews we find companies are still stuck in the hypothetical world of table tops and disaster recovery exercises. During our longer term engagements and workshops, we repeatedly hear of DR gone wild - and wrong. The DR exercise is scheduled, then pushed off. Scheduled, pushed off.
Finally the day comes like the first day of school and everyone is aligned to test the DR plan. Unfortunately, too many things have changed since the DR instance was designed and the exercise is a failure. Since business is generally good, the can is kicked down the road to some future utopian alignment of the stars when it can be rescheduled and everyone will have fixed everything on the backlog discovered during the exercise.
Disaster Recovery that Wasn't
Throughout my career I have been invited to participate and support the physical security for many of these exercises. One was in San Jose over a decade ago. A $1.5M large 53 foot trailer with generator and satellite (but no bathroom or cooking facilities …) was procured and sat on its own concrete pad fenced in and tied into our security system. The day came after months of tabletop planning to exercise the DR plan and nothing worked. The laptops - which were already three years old when reallocated to the trailer several years earlier - couldn’t run the latest service pack and as a result could not VPN into the network. The satellite uplink could barely support two or three laptops with email only. All the money invested was a waste and the trailer was eventually donated to a non-profit and the cement pad turned into additional parking.
Several years later while I was responsible for operations in AsiaPac, IT approached us again to do a tabletop exercise and then a simulated exercise in india. My team worked with local emergency services to simulate a fire and spice up the annual building evacuation exercise and IT was to also exercise their DR plan. The evacuation exercise had a lot of theatrics and was considered a great success with explosions, fire crews rushing in and spraying the building. At the same time, a group of developers boarded a bus with their laptops and after over an hour and a half in traffic showed up to the DR site IT had been paying for only to find a real fire had occurred a few months previous and the vendor had failed to inform anyone. Instead of a “plug and play” workspace, they found a concrete bunker dark and still smelling of smoke with a few folding tables and chairs with extremely slow and limited connectivity. They got back on the bus and found a local coffee shop where they were a lot more successful logging on to the network.
These two examples are of physical failures, but I use them to illustrate the limitations imposed by theoretical DR.
Tom Kevan, a founding partner, recalls how once they had set up three instances at eBay for web traffic, gone were the days of overnight outages because they were able to take down the third instance, perform the upgrades and test, and bring it back online. All the while customers never knew he was running "disaster recovery" because it was part of daily operations for web traffic.
Living in 24/7 Disaster Recovery - Active/Active Architecture
Strong, vibrant companies live in a DR world. This means they design with the “rule of three” - always have three active, complete, and independent stacks in production. This allows for continuous maintenance and upgrades on one system while always having at least two active/active systems in production. During peak times, all three are up and taking traffic. The goal is for DR to become your everyday operating cadence, not an unusual event.
As you formalize your architecutural principles, remove the words “disaster recovery” and plan to be forever resilient as table stakes for being in business.
The information on the following slides are what we cover in our three day workshops and often share in our final report with our clients when architectural principles are discovered to be lacking during a technical due diligence.