Hosting Lessons from Harvey and Irma
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction storms caused to Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advanced notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers it hosts. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extend power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. Like diesel generators, the data centers can operate normally for hours or days without municipal water. A data center should be outside the flood plain, able to operate without utility power or municipal water for hours or days, is structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data center within 20 miles of each other to support synchronous data base replication. A primary site in one borough of New York City, and the DR site in a different borough. Replication options and data base management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
Reach out to AKF
Monitoring for Early Fault Detection
July 6, 2017 | Posted By: AKF
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather they tell you something appears to be abnormal and should be investigated. The early warning aspect can help reduce detection time and thus shorten overall MTTR.
At eBay, we had near real time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the network operations center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nose dive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine. Our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
The World Cup is the most popular football (soccer) event in the world, arguably the most popular sporting event worldwide. Broadcast matches draw huge audiences in the UK and the broadcast is typically aired without commercials until half time. There was a documentary on the UK electrical utility system preparing for a broadcast. As soon as half time commenced, a large proportion of the viewing audience visited the loo and hit the lever on their electric tea kettles. Thankfully, the documentary was about the electric utility and not sewage! The step function increase in load would cause significant problems for the utility, straining its ability to maintain voltage and frequency. The utility had prepared for this situation by staging “peakers” – diesel generators that can be brought online to help serve the increased load. Utility grid stability is akin to a Goldilocks Zone – too much is bad, too little is bad, just right is best. The operations center for the utility did not want to bring the generators on too early or too late. They needed real time information on their customers. The solution was to have a TV tuned to the World Cup broadcast in the operations center, enabling the engineers to stage on generators immediately prior to half time and stage them off as the load increase subsided. Being paid to watch the World Cup was certainly an unintended benefit!
How could your company react in a manner like the UK power utility? A sponsored event or viral campaign could overload your systems. Consider using elastic compute in the cloud for your peak demand – the equivalent to the diesel generators use for the World Cup. Scale up for the spikes in demand, then shut it down afterwards. Own the base, rent the peak. Use business metric monitors to detect workload shifts.
April 3, 2017 | Posted By: AKF
A topic that often results in great debate is “how to measure engineers?” I’m a pretty data driven guy so I’m a fan of metrics as long as they are 1) measured correctly 2) used properly and 3) not taken in isolation. I’ll touch on these issues with metrics later in the post, let’s first discuss a few possible metrics that you might consider using. Three of my favorite are: velocity, efficiency, and cost.
- Velocity – This is a measurement that comes from the Agile development methodology. Velocity is the aggregate of story
points (or any other unit of estimate that you use e.g. ideal days) that engineers on a team complete in a sprint. As we will
discuss later, there is no standard good or bad for this metric and it is not intended to be used to compare one engineer to
another. This metric should be used to help the engineer get better at estimating, that’s it. No pushing for more story points
or comparing one team to another, just use it as feedback to the engineers and team so they can get more predictable in
- Efficiency – The amount of time a software developer spends doing development related activities (e.g. coding, designing,
discussing with the product manager, etc) divided by their total time available (assume 8 – 10 hours per day) provides the
Engineering Efficiency. This is a metric designed to see how much time software developers are actually spending on
developing software. This metric often surprises people. Achieving 60% or more is exceptional. We often see dev groups
below 40% efficiency. This metric is useful for identifying where else engineers are spending their time. Are there too many
company meetings not directly related to getting products out the door? Are you doing too many HR training sessions, etc?
This metric is really for the management team to then identify what is eating up the nondevelopment
time and get rid of it.
- Cost – Tech cost as a percentage of revenue is a good cost based metric to see how much you are spending on technology.
This is very useful as it can be compared to other tech (SaaS or other webbased companies) and you can watch this metric change over time. Most startups begin with their total tech cost (engineers, hosting, etc) at well over 50% of revenue but this should quickly reduce as revenue grows and the business scales. Yes, scaling a business involves growing it cost effectively. Established companies with revenues in the tens of millions range usually have this percentage below 10%. Very large companies in the hundreds of millions in revenue often drive this down to 57%.
Now that we know about some of the most common metrics, how should they be used? The most common way managers and executives want to use metrics is to compare engineers to each other or compare a team over time. This works for the Efficiency and the Cost metrics, which by the way are primarily measurements of management effectiveness. Managers make most of the cost decisions including staffing, vendor contracts, etc. so they should be on the hook to improve these metrics. In terms of product out the door as measured by story points completed each sprint a.k.a. Velocity, as mentioned above, is to be used to improve estimates, not try to speed up developers. Using this metric incorrectly will just result in bloated estimates, not faster development.
An interesting comparison of developers comes from a 1967 article by Grant and Sackman in which they stated a ratio of 28:1 for the time required by the slowest versus the fastest programmer to complete a task. This has been a widely cited ratio but a paper from 2000 revised this number to 4:1 at the most and more likely 2:1. While a 2x difference in speed is still impressive it doesn’t optimize for the overall quality of the product. An engineer who is very fast and with high quality but doesn’t interact with the product managers isn’t necessarily the overall most effective. My point is that there are many other factors to be considered than just story points per release when comparing engineers.
PDLC or SDLC
April 3, 2017 | Posted By: AKF
As a frequent technology writer I often find myself referring to the method or process that teams use to produce software. The two terms that are usually given for this are software development life cycle (SDLC) and product development life cycle (PDLC). The question that I have is are these really interchangeable? I don’t think so and here’s why.
Wikipedia, our collective intelligence, doesn’t have an entry for PDLC, but explains that the product life cycle has to do with the life of a product in the market and involves many professional disciplines. According to this definition the stages include market introduction, growth, mature, and saturation. This really isn’t the PDLC that I’m interested in. Under new product development (NDP) we find a defintion more akin to PDLC that includes the complete process of bringing a new product to market and includes the following steps: idea generation, idea screening, concept development, business analysis, beta testing, technical implementation, commercialization, and pricing.
Under SDLC, Wikipedia doesn’t let us down and explains it as a structure imposed on the development of software products. In the article are references to multiple different models including the classic waterfall as well as agile, RAD, and Scrum and others.
In my mind the PDLC is the overarching process of product development that includes the business units. The SDLC is the specific steps within the PDLC that are completed by the technical organization (product managers included). An image on HBSC’s site that doesn’t seem to have any accompanying explanation depicts this very well graphically.
Another way to explain the way I think of them is to me all professional software projects are products but not all product development includes software development. See the Venn diagram below. The upfront (bus analysis, competitive analysis, etc) and back end work (infrastructure, support, depreciation, etc) are part of the PDLC and are essential to get the software project created in the SDLC out the door successfully. There are non-software related products that still require a PDLC to develop.
Do you use them interchangeably? What do you think the differences are?
Build v. Buy
April 3, 2017 | Posted By: AKF
In many of our engagements, we find ourselves helping our clients understand when it’s appropriate to build and when they should buy.
If you perform a simple web search for “build v. buy” you will find hundreds of articles, process flows and decision trees on when to build and when to buy. Many of these are costcentric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. Does this “thing” (product / architectural component / function) create strategic differentiation in our business?
Here we are talking about whether you are creating switching costs, lowering barriers to exit, increasing barriers to entry, etc that would give you a competitive advantage relative to your competition. See Porter’s Five Forces for more information about this topic. If the answer to this question is “No – it does not create competitive differentiation” then 99% of the time you should just stop there and attempt to find a packaged product, open source solution, or outsourcing vendor to build what you need. If the answer is “Yes”, proceed to question 2.
2. Are we the best company to create this “thing”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else. For instance, if you are a social networking site, you *probably* don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”. And please, don’t fool yourselves – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. Are there few or no competing products to this “thing” that you want to create?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision. If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity. Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time. As a result, a “build” decision today will look bad tomorrow as features converge and pricing declines. If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. Can we build this “thing” cost effectively?
Is it cheaper to build than buy when considering the total lifecycle (implementation through endoflife)
of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc. If your business REALLY grows and is extremely successful, do you really want to be continuing to support internally developed load balancers, databases, etc. through the life of your product? Don’t fool yourself into answering this affirmatively just because you want to work on something neat. Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing”, but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes”.
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a
concern) anytime you answer “No” to any question above.