Business continuity is a critical but often neglected part of managing a technology product. Perhaps this is because it is not a very exciting topic compared to the next customer problem to solve, the next market to address or the next hot technology trend to incorporate into the product. The cost of that neglect becomes clear when an outage requires significant measures, such as failover to a disaster recovery site, and the failover just plain doesn't work, with major impact to the business and customers.

Beyond creating a plan that looks nice on paper, we regularly see more fundamental pitfalls that make Business Continuity challenging: underlying architecture, organizational structure, lack of business engagement, and poor governance and process across the teams involved.

The good news is that there are straightforward approaches in each of these areas that optimize your product overall as well as its ability to recover in a disaster. Let's explore each of these pillars and how to optimize them from a DR perspective.

ARCHITECTING FOR DR

One of the most common architectural “attitudes” we see with the rise of public cloud platforms is the sentiment that “Amazon has it [BCP] covered” (or name your public cloud provider). This leads to poor resiliency choices such as:

  • Only running the platform in one geographical region. Early in a business's growth, one region may be cost/risk appropriate. However, as a business scales, it's important to have a future architectural path to run active-active in multiple regions. This significantly simplifies disaster recovery (DR) because DR is, in effect, always running. In many businesses it also provides the side benefit of improved everyday performance, since customers can be directed to their closest region (see the routing sketch after this list).

  • Inherent in running active-active across regions is, of course, running active-active at all. Many of our clients do not run active-active across availability zones OR regions, citing increased cost and, to a lesser degree, increased complexity (and cost and complexity are, of course, related).

  • Complexity – while there is an additional layer of logic to manage traffic across sites, public cloud tooling for this keeps improving, and the argument is countered by the simplicity of managing an issue with a single site as well as the improved performance for customers of many geographically scaled businesses.
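
To make the routing point concrete, the sketch below shows one common way to keep customers directed to their closest healthy region, using latency-based DNS records. This is a minimal sketch assuming AWS Route 53 and boto3, not a full multi-region design; the zone ID, record name, IPs and health check IDs are illustrative placeholders.

```python
# A minimal sketch (assuming AWS Route 53 and boto3) of latency-based DNS
# routing across two active regions. Zone ID, record name, IPs and health
# check IDs are illustrative placeholders.
import boto3

route53 = boto3.client("route53")

# One entry per active region; each region serves live traffic all the time.
REGIONS = {
    "us-east-1": {"ip": "203.0.113.10", "health_check_id": "hc-east-example"},
    "eu-west-1": {"ip": "203.0.113.20", "health_check_id": "hc-west-example"},
}


def upsert_latency_records(zone_id: str, record_name: str) -> None:
    """Create one latency-routed A record per active region.

    Route 53 answers each customer from the lowest-latency healthy region,
    so customers get better everyday performance and, if a region fails its
    health check, its record is withdrawn and traffic shifts automatically.
    """
    changes = []
    for region, cfg in REGIONS.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": f"{record_name}-{region}",
                "Region": region,                 # enables latency-based routing
                "TTL": 60,                        # short TTL speeds up failover
                "ResourceRecords": [{"Value": cfg["ip"]}],
                "HealthCheckId": cfg["health_check_id"],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": changes},
    )


if __name__ == "__main__":
    upsert_latency_records("Z0000000EXAMPLE", "app.example.com")
```

With records like these in place, a region that fails its health check is simply dropped from DNS answers, so "failover" becomes largely automatic rather than a rarely exercised manual runbook.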

A second architectural tenet that benefits DR management is the ability to quickly enable/disable functionality. The concept here is simply that not all aspects of a product's functionality are created equal in terms of business/customer value. For example, the ability to search for products, add to a cart and successfully check out is more critical than capabilities such as making recommendations or assigning loyalty points to a purchase. We commonly see the following pitfalls in this area:

  • No standard architectural pattern for how features are enabled/disabled
  • No framework for deciding what capabilities should be able to be enabled/disabled (ideally most capabilities)
  • No oversight that these principles and patterns are actually being used as new capabilities are added
  • No testing during the product development lifecycle and/or DR testing to ensure both that the on/off wiring works AND that non-critical capabilities are not in the critical path of truly critical capabilities. For example, the team may *think* that awarding loyalty points for a purchase is not in the critical path to checkout, only to find in the wild that this is not true. Ongoing regression testing as well as DR testing MUST include these scenarios to ensure non-critical functionality has not crept into the critical path (see the sketch after this list).
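
As one way to make the pattern concrete, here is a minimal sketch of a kill-switch wrapper that keeps non-critical features out of the critical path. The flag store, feature names and checkout stub are illustrative assumptions; in practice the flags would live in a config service or database so they can be flipped during an incident without a deploy.

```python
# A minimal sketch of a kill-switch wrapper for non-critical features. The
# flag store, feature names and checkout stub are illustrative; in practice
# flags would live in a config service so they can be flipped during an
# incident without a deploy.
import logging
from typing import Any, Callable

FEATURE_FLAGS = {
    "recommendations": True,
    "loyalty_points": True,
}


def non_critical(flag: str, default: Any = None) -> Callable:
    """Run the wrapped feature only when its flag is on, and swallow failures
    so a non-critical feature can never block the critical path."""
    def decorator(func: Callable) -> Callable:
        def wrapper(*args, **kwargs):
            if not FEATURE_FLAGS.get(flag, False):
                return default
            try:
                return func(*args, **kwargs)
            except Exception:
                logging.exception("Non-critical feature '%s' failed; continuing", flag)
                return default
        return wrapper
    return decorator


@non_critical("loyalty_points", default=0)
def award_loyalty_points(order_total: float) -> int:
    # Placeholder for a call to the loyalty service.
    return int(order_total)


def checkout(order_total: float) -> dict:
    # Critical path: payment must succeed regardless of loyalty/recommendations.
    payment_id = "pay-123"  # placeholder for the real payment call
    points = award_loyalty_points(order_total)
    return {"payment_id": payment_id, "loyalty_points": points}
```

Regression and DR tests can then assert that checkout succeeds with every non-critical flag turned off, and with the loyalty call deliberately failing, which is exactly the "crept into the critical path" check described above.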

ORGANIZATION

First, stating the obvious, organization and architecture are inextricably linked as we know from Conway’s law. That said, we regularly see organizational anti-patterns that make successful DR almost impossible. The most common organizational challenge is the functional organizational structure or "partially functional org structure."

Most Product Engineering teams we work with are organized in some form of cross functional agile team structure BUT we still see many organizations where specialized functions such as database management, DevOps and/or Operations and Infrastructure are centralized and separate and therefore disconnected from the engineering scrum (or similar agile) teams. This siloed structure causes a number of ongoing issues, as well as DR specific challenges:

  • The teams that ‘support’ the agile engineering teams don't understand how the product actually works
  • Conversely, the agile teams don't understand how the infrastructure, database, deployment architecture (name your function!), etc., work. Stating the obvious, this makes it impossible to efficiently troubleshoot in just about any scenario, whether a weekly product release or a DR event. In a DR scenario, though, the stakes are higher.
  • This disconnection across functionally organized teams also breeds affective conflict, which adds fuel to the fire in a DR scenario where tensions are already high. The result is slower recovery times as well as finger pointing both during and after recovery.

FULL BUSINESS PARTNERSHIP IN MANAGING BUSINESS CONTINUITY

Unfortunately, as previously mentioned, DR is not as exciting to key product stakeholders as the next new feature or hot technology. Both Engineering and the Business/Product teams are guilty of this; it is human nature! Very few people get excited about DR, and generally the Business stakeholders who carry the highest accountability for business outcomes (although outcomes should always be shared with Engineering) are the least interested in DR until disaster strikes and recovery is slow and highly impactful. This is not an easy problem to solve. Until disaster strikes and significantly impacts customers, Business and Product stakeholders often:

  • See DR as unnecessary added cost, both in terms of infrastructure and engineering effort.
  • Do not want Engineering spending time on anything except pumping out new features.

Both of these related challenges can make it hard to architect and test DR on an ongoing basis, since there is a lack of support for the effort and cost involved – and again, neither party finds DR exciting.

Although this is not easily solved, the following strategies can help:

  • Education – Engineering should take every opportunity to explain the importance of DR capabilities in Business (customer-impacting) terms, including relevant examples of past DR impacts. Even if the company has not experienced a prolonged outage, there are always plenty of examples from similar businesses, public cloud providers, etc.
  • Aligned incentives – BOTH Product and Engineering must have shared objectives for successful DR capabilities and ongoing testing. As we often highlight, Engineering and Product should share almost ALL objectives, and DR is no different.
  • Executive support – this is KEY to DR success. Ideally even the CEO acknowledges the importance of DR capabilities and participates in scenario testing periodically. This strategy may go the furthest of all in supporting DR efforts.

GOVERNANCE, OVERSIGHT AND DRIVING DR ACCOUNTABILITY

Even with the best of intentions, great architecture, fully cross-functional teams and Business support and involvement, DR can still come off the rails without adequate governance and foundational process to keep everything needed (people, process and technology) current and top of mind.

Without an appropriate amount of central oversight, DR becomes everyone's responsibility and therefore no one's responsibility, and DR capabilities, policies, documentation and testing become a fragmented, inconsistent “dog's breakfast” or free-for-all across teams. What does this look like in the wild?

  • Poor, non-existent and/or out-of-date documentation, making it hard to conduct a successful DR test
  • Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) that are all over the map AND, worse yet, often don't reflect the true criticality (or lack thereof) of various functionality. This results in confusion as to what is most important to recover in a DR scenario and in higher ongoing infrastructure costs
  • Lack of testing, period! Without a moderate level of oversight, teams inevitably postpone DR testing again and again, just plain don't do it, and/or don't repeat tests until they are successful.

Central oversight of DR can and should be ‘lite’ but it needs to exist in scaled organizations. It can be managed out of a corporate risk management function but should also have some central oversight within the Product Engineering organization. This often lives within Security and/or a small Project Management/PMO and/or “Office of the CTO.”

The central responsibilities ideally include:

  • Facilitating the determination of the criticality tiers into which functionality/services are classified (see the sketch below)
  • Facilitating the consistent assignment of functionality into the standard tiers
  • Facilitating the definition of RTOs and RPOs for each criticality tier (working across the Business and Engineering)
  • Setting a cadence for updating both application-relevant DR data, which usually lives in a configuration management database (CMDB) or equivalent, and step-by-step DR plans
  • Monitoring the updating, quality and accuracy of the CMDB data and DR recovery plans/steps
  • Facilitating the setting of the DR testing cadence and strategy
  • Supporting the scheduling of DR testing, which often involves coordinating and prioritizing with other change management activities

Note the recurring use of the word “facilitating” above. This is very deliberate, as the central portion of the DR function is NOT the “doers,” nor the sponsors of DR. The doers are heavily Engineering, with some involvement from other teams, e.g. Product, Business, PR/comms. The sponsors should be the Business owners of the product, and they should decide the important aspects: the functionality tiers, the RTO/RPO thresholds that map to each tier, and which tier each piece of functionality is mapped into.
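
The tiering work above is easier to keep consistent when the tiers are captured as data rather than prose. The sketch below shows one possible shape; the tier names, RTO/RPO thresholds and service assignments are illustrative assumptions, not recommendations, and in practice this data would live in (or alongside) the CMDB.

```python
# A minimal sketch of criticality tiers as data. Tier names, thresholds and
# service assignments are illustrative, not recommendations; in practice this
# would live in (or alongside) the CMDB and be reviewed on a set cadence.
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of data loss


TIERS = {
    1: Tier("critical",   rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    2: Tier("important",  rto=timedelta(hours=4),    rpo=timedelta(hours=1)),
    3: Tier("deferrable", rto=timedelta(hours=24),   rpo=timedelta(hours=24)),
}

# Each service is assigned exactly one standard tier; this is what gets reviewed.
SERVICE_TIERS = {
    "checkout": 1,
    "search": 1,
    "recommendations": 3,
    "loyalty_points": 3,
}


def recovery_order() -> list:
    """Return services in the order they should be recovered in a DR test."""
    return sorted(SERVICE_TIERS, key=lambda svc: SERVICE_TIERS[svc])


if __name__ == "__main__":
    for svc in recovery_order():
        tier = TIERS[SERVICE_TIERS[svc]]
        print(f"{svc}: tier {SERVICE_TIERS[svc]} ({tier.name}), "
              f"RTO {tier.rto}, RPO {tier.rpo}")
```

Deriving the recovery order and the expected RTO/RPO per service from this single source also gives DR tests something objective to measure against.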

Although these four pillars of Business Continuity/DR greatness do take work to fully address, the effort is well worth it to minimize or completely remove the impact of a disaster in any one region of operation. And like so many things, if these muscles are developed and exercised on an ongoing basis, the ongoing cost and frustration drop significantly, because the work becomes part of daily, weekly, monthly and quarterly activities that everyone is familiar with, rather than something to relearn or restart once a year.

Lastly, it CANNOT be overstated how much architecting for three-site active/active/active operation on an ongoing basis simplifies the ability to manage any disaster scenario (in most cases a disaster will impact only one of the three regions).

We work hands-on with teams regularly to improve their Business Continuity and Disaster Recovery posture according to the principles outlined above. Give us a call and we can help!