Your Site is as Important as the Product You Sell - Recent Example from Saddleback Leather
February 7, 2018 | Posted By: Pete Ferguson
If you have a premium product, at a premium price, it’s unlikely you would sell it out of a rundown, poorly lighted store that smells vaguely like stale meat. Yet somehow many of us forget to apply that same reasoning when it comes to selling our products online. The availability - and look and feel of your presence online - is your store front.
I’ve long been a fan of Saddleback Leather. However, their motto: “They’ll fight over it when you’re dead” fell short in January. You see, it’s hard for your family to fight over the thing that you can’t even purchase… Saddleback Leather had a completely foreseeable, and absolutely preventable outage. From Dave Munson, the CEO:
“I’ve always dreamt of one day having a really fast and easy website for you to enjoy. So, we decided to leave our slow and clunky old website and start building one on a new and different platform. The contract expired Dec. 30th, 2017, but the new site wasn’t fully ready yet. We flipped the switch anyways and all Gehenna broke loose. The super fast, fun and easy website… wasn’t fast, fun or easy and we wasted a ton of time and irritated the heck out of our favorite people. People couldn’t check out, set up accounts or even add stuff to their carts. So, we paid a ton of money to get our old slow and clunky back again until we get this new site just right. “
To make up for it, last week I received an apology letter sent by “El Presidente” Munson with an 11% off coupon. 11 % because Munson has recently celebrated 11 years of marriage to his wife, Suzzette. As a side note, it’s a perfect example of how to apologize to your customers when you screw up. This guy made a mistake, is paying for it by paying for his old site while continuing to develop the new, and is giving customers discounts with a coupon aptly titled: “IAMSORRY.”
Ironically, as a fan and customer, I don’t recall the old site being slow or terrible. On the contrary, when I visited early in January, their “new and improved” site felt clunky and disjointed. The wrong images were coming up for products and many items reported being “not available.”
In the world of environmental health and safety, “all accidents are preventable” is the holy grail of compliance. We believe that with the right forethought and planning, the same is true with virtually all products and storefronts online.
At AKF we are fond of saying “an accident is a terrible thing to waste.” While the exact details of what went wrong are not disclosed, the motives were:
- They took a concept that presumably worked great in beta testing live without testing under full load.
- Munson made the decision to push out something that wasn’t yet great to save money by exiting a contract by the end of the year.
For similar content on our Growth Blog, click here
The result is lost sales from when the site was down, lost customers who may have been trying the website for their first time and won’t be back, an 11% haircut of sales for the next week, and a fan base - many of whom have been very vocal on FaceBook - that is verbally expressing their disdain to see the company they have counted on for unquestioned quality in the past didn’t settle for quality first this time.
The days of customers quickly forgiving their favorite retailers for not being equally as great online are waning. Make sure you have a solid strategy and the right expertise in your corner when it comes to greatly affecting your customer’s ability to purchase or better interact with your product.
Experiencing growing pains? AKF is here to help! We are an industry expert in technology scalability and due diligence. Put our 200+ years of combined experience to work for you today!
Get this article and others like it by signing up for our newsletter.
Subscribe to the AKF Newsletter
There Are Always Plenty of Incidents from Which To Learn
January 13, 2018 | Posted By: Dave Swenson
Sorry, False Alarm…
On January 13, 2018, what felt like an episode of Netflix’s “Black Mirror” unfolded in real life. Just after 8 in the morning, residents and visitors of Hawaii were woken up to the following startling push notification:
Thankfully, the notification was a false alarm, finally retracted with a second notification nearly 40 interminable minutes later.
The amazing, poignant and sobering stories that occurred from those 40 minutes, included people:
- determining which children to spend their last minutes with,
- abandoning their cars on streets,
- sheltering in a lava tube,
- believing and acting as we all would if we believed the end was here.
Unfortunately, this wasn’t a Black Mirror episode and paralyzed an entire state’s population. Thankfully, the alarm was a false one.
A Muted President
As President Trump took office, he introduced a new means for a President to reach his constituents—Twitter, averaging 6 to 7 tweets per day during his first year. On November 2, 2017, many bots that were created to closely monitor the tweets of @realDonaldTrump started reporting that the account no longer existed. Clicking to his account took the user to the above error page.
For a deafening 11 minutes, the nation was unable to listen to its leader, at least via Twitter.
The Hawaiian false alarm was sent by the state’s Emergency Management Agency. Their explanation of the incident was that during a shift change, an employee clicked “the wrong button” while running a missile crisis test, then subsequently clicked through a confirmation prompt (“Are you sure you want to tell 1.5 million people this?”).
Twitter employees had reportedly tried for years to get management attention on ensuring accounts weren’t deleted without proper vetting. The company typically used contractors in the Philippines and Singapore to handle such account administration; Trump’s account was deleted by a German contract worker on his last day at Twitter. Acting on yet-another-Trump-complaint, believing such an important account couldn’t be suspended, the worker’s last action for Twitter was to click the suspend button, and then walked out of the building causing the Twitterverse to read far more into the account’s disappearance than they should have.
In both of these situations, the immediate focus was on the personnel involved in the incident. “Who pushed the button?” is typically always one of the initial questions. Assumptions that a new employee, or rogue worker were behind the incident are common, and both motive and intelligence of all involved are under inspection.
We at AKF Partners constantly preach “An incident is a terrible thing to waste”. Events such as these warp the known reality into “How the shit can that happen??”, causing enough alarm to warrant special attention and focus, if not panic. Yet, all too often we see teams searching frantically to find any cause, blame the most obvious, immediate factor, declare victory, and move on.
“Who pushed the button?” is only one of many questions.
Toyota’s Taichi Ohno, the father of Lean Manufacturing, recognized his team’s habit of accepting the most apparent cause, ignoring (wasting) other elements revealed by an incident, potentially allowing it to be eventually repeated. Ohno (the person, not the exclamation typically uttered during an incident) emphasized the importance of asking “5 Why’s” in order to move beyond the most obvious explanation (and accompanying blame), to peel the onion diving deeper into contributory causes.
Questions beyond the reflexive “What happened?” and “Who did it?” relevant to the false alarm and erroneous account deletion incidents include:
- Why did the system act differently than the individual expected (is there more training required, is the user interface a confusing one)?
- Why did it take so long to correct (is there no playbook for detecting / reversing such a message or key account activity)?
- Why does the system allow such an impactful event to be performed unilaterally, by a single person (what safeguards should exist requiring more than one set of hands?)
- Why does this particular person have such authorization to perform this action (should a non-employee have the ability to delete such a verified, popular and influential account)?
- Why was the possibility of this incident not anticipated and prevented (why were Twitter employee requests for better safeguards ignored for years, why wasn’t the ease of making such a mistake recognized and what other similar mistake opportunities are there)?
Both of these incidents have had an impact far beyond those directly affected (Hawaiian inhabitants or Trump Twitter followers), and have shed light on the need to recognize the world has changed and policies and practices of old might not be enough for today. The ballistic missile false alarm revealed that more controls need to be placed on all mass communication, but also that Hawaii (or anywhere/anyone else) is extremely unprepared for the unthinkable. The use of Twitter as a channel for the President now raises questions over the validity of it as a Presidential record, asks who should control such a channel, and raises concerns on what security is around the President’s account?
Ask 5 Whys, look beyond the immediate impact to find collateral learnings, and take notice of all that an incident can reveal.
AKF Partners have been brought in by over 400 companies to avoid such incidents, and when they do occur, to learn from them. Let us help you.
Subscribe to the AKF Newsletter
Hosting Lessons from Harvey and Irma
September 19, 2017 | Posted By: Greg Fennewald
Everyone was saddened to see the horrific destruction storms caused to Houston and Florida, including deaths and extensive property damage. It seems reasonable that the impact of these hurricanes was lessened by advanced notice and preparation – stockpiling supplies, evacuating the highest risk areas, and staging response resources to assist with recovery and rebuilding.
Data centers operate every day with a similar preparation mindset: diesel generators to provide power should the utility fail, batteries to keep servers running during a transition, potentially stored water or a well to replace municipal water service for cooling systems, and food and water for personnel unable to leave the location.
What happens when a “prepared” location such as a data center encounters a hurricane with strong winds, heavy rain, and extensive flooding? In some cases, the data center survives without impact, although there certainly will be outages and failures. Examples of data centers surviving Harvey in good shape can be seen here, while accounts of the service impacts caused by Hurricane Sandy can be seen here.
Data Center Points of Failure
Let’s examine what may enable a data center to survive without functional impact. Extensive risk investigation goes into site selection for data centers. Data centers are expensive to build with costs measured in the tens or even hundreds of millions of dollars. The potential business impact of a failure can be costly with liquidated damage clauses in hosting contracts. These factors lead to data centers being located outside of flood plains, away from hazardous material routes, and stoutly constructed to endure storm winds likely in the region.
Losing utility power is regarded as a “when” not an “if” in the data center industry (be that an outage or a planned maintenance activity), and diesel generators are a common solution, often with 24 hours or more of fuel on hand and multiple replenishment contracts. Data centers can survive for days/weeks without utility power, and in some cases for months. How could flooding impact power? The service entrance for a data center, where the utility power is routed, is often buried underground. Utility power is likely to be lost during flooding, either from damage due to flooding or intentional actions to prevent damage by shutting down the local grid. A data center would operate on generator if the data center itself is not flooded, although fuel replenishment is not likely. If there are two feet of water in the main electrical room(s), the data center is going dark.
Many large data centers rely on evaporating water to cool the servers it hosts. Evaporative cooling is generally more energy efficient than other options, but introduces an additional risk to operations – water supply. In many locations, municipal water pressure is lost during an extend power outage. Data centers can mitigate this risk by using water storage tanks or water wells onsite. Like diesel generators, the data centers can operate normally for hours or days without municipal water. A data center should be outside the flood plain, able to operate without utility power or municipal water for hours or days, is structurally strong enough to handle the winds of a major storm – is there any other risk to mitigate? Network connectivity and bandwidth.
Most data centers need to communicate with other data centers to fulfill their OLAP or OLTP purpose. Without connectivity, services are not available. Data should be fine, but it is becoming increasingly stale. Transactions and traffic are done. Like utility power, network connections are usually buried. With distance and geographic limitations involved, network pathways may get flooded as may the facilities that aggregate and transmit the data. Telecom facilities generally have generators and other availability measures, but can be forced into less advantageous locations and may have a shorter runtime standard than a data center.
Data centers that are serious about availability generally have carrier diversity and physical pathway diversity to mitigate carrier outages and “backhoe fades”. This may help in the event of widespread flooding as well. The reality is a data center without connectivity is generally useless. All the risk mitigation going into structural design, power and cooling redundancy, and fire protection is moot if connectivity fails.
Preparing for the Inevitable
The best way to mitigate these risks is to not rely on a single data center location. One is none and two is one. Owned, colo, managed hosting, or cloud – be able to survive the loss of a single location. The RTO and RPO of the business will guide the choice of active – active, hot – cold, or data backup with an elastic compute response plan. Hurricanes can cause regional impact, such as Irma disrupting most of Florida. In years past, many companies decided to have two data center within 20 miles of each other to support synchronous data base replication. A primary site in one borough of New York City, and the DR site in a different borough. Replication options and data base management techniques have advanced sufficiently to allow far greater dispersion today. Avoid a regionally impacting event by choosing data centers in diverse regions.
Operating from 3 locations can be cheaper than 2, and can also improve customer satisfaction with reduced response times produced by serving customers from the nearest location. See Rule 12 in Scalability Rules. The ability to operate from multiple locations also enables a choice to adjust the redundancy of those locations. A combination of Tier II and III locations may be a more economical choice than a pair of Tier IV locations.
Developing a hosting plan can be complicated and frustrating, particularly since the core competency of your business is likely not data centers. AKF Partners can help – not only with hosting strategy, but also the product architecture and operational processes needed to weld infrastructure, architecture, and process into a seamless vehicle that delivers services to your clients with availability the market demands.
Hurricanes aren’t the only disasters that can take down your data center. Solar flares, runaway SUVs, civil disruption, tornadoes and localized power outages have all caused data centers to fail. Natural disasters of all types trail equipment failures and human error as causes of service impacting events (source: 365DataCenters). According to FEMA, 40% of businesses that close due to a disaster don’t reopen, and of those that do only 29% are in business two years after the disaster (source: FEMA). Don’t be a statistic. AKF Partners can help you with the product architecture and data center planning necessary to survive nearly any disaster.
Reach out to AKF
Subscribe to the AKF Newsletter
Monitoring for Early Fault Detection
July 6, 2017 | Posted By: AKF
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather they tell you something appears to be abnormal and should be investigated. The early warning aspect can help reduce detection time and thus shorten overall MTTR.
At eBay, we had near real time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the network operations center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nose dive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine. Our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
The World Cup is the most popular football (soccer) event in the world, arguably the most popular sporting event worldwide. Broadcast matches draw huge audiences in the UK and the broadcast is typically aired without commercials until half time. There was a documentary on the UK electrical utility system preparing for a broadcast. As soon as half time commenced, a large proportion of the viewing audience visited the loo and hit the lever on their electric tea kettles. Thankfully, the documentary was about the electric utility and not sewage! The step function increase in load would cause significant problems for the utility, straining its ability to maintain voltage and frequency. The utility had prepared for this situation by staging “peakers” – diesel generators that can be brought online to help serve the increased load. Utility grid stability is akin to a Goldilocks Zone – too much is bad, too little is bad, just right is best. The operations center for the utility did not want to bring the generators on too early or too late. They needed real time information on their customers. The solution was to have a TV tuned to the World Cup broadcast in the operations center, enabling the engineers to stage on generators immediately prior to half time and stage them off as the load increase subsided. Being paid to watch the World Cup was certainly an unintended benefit!
How could your company react in a manner like the UK power utility? A sponsored event or viral campaign could overload your systems. Consider using elastic compute in the cloud for your peak demand – the equivalent to the diesel generators use for the World Cup. Scale up for the spikes in demand, then shut it down afterwards. Own the base, rent the peak. Use business metric monitors to detect workload shifts.
Subscribe to the AKF Newsletter
April 3, 2017 | Posted By: AKF
A topic that often results in great debate is “how to measure engineers?” I’m a pretty data driven guy so I’m a fan of metrics as long as they are 1) measured correctly 2) used properly and 3) not taken in isolation. I’ll touch on these issues with metrics later in the post, let’s first discuss a few possible metrics that you might consider using. Three of my favorite are: velocity, efficiency, and cost.
- Velocity – This is a measurement that comes from the Agile development methodology. Velocity is the aggregate of story
points (or any other unit of estimate that you use e.g. ideal days) that engineers on a team complete in a sprint. As we will
discuss later, there is no standard good or bad for this metric and it is not intended to be used to compare one engineer to
another. This metric should be used to help the engineer get better at estimating, that’s it. No pushing for more story points
or comparing one team to another, just use it as feedback to the engineers and team so they can get more predictable in
- Efficiency – The amount of time a software developer spends doing development related activities (e.g. coding, designing,
discussing with the product manager, etc) divided by their total time available (assume 8 – 10 hours per day) provides the
Engineering Efficiency. This is a metric designed to see how much time software developers are actually spending on
developing software. This metric often surprises people. Achieving 60% or more is exceptional. We often see dev groups
below 40% efficiency. This metric is useful for identifying where else engineers are spending their time. Are there too many
company meetings not directly related to getting products out the door? Are you doing too many HR training sessions, etc?
This metric is really for the management team to then identify what is eating up the nondevelopment
time and get rid of it.
- Cost – Tech cost as a percentage of revenue is a good cost based metric to see how much you are spending on technology.
This is very useful as it can be compared to other tech (SaaS or other webbased companies) and you can watch this metric change over time. Most startups begin with their total tech cost (engineers, hosting, etc) at well over 50% of revenue but this should quickly reduce as revenue grows and the business scales. Yes, scaling a business involves growing it cost effectively. Established companies with revenues in the tens of millions range usually have this percentage below 10%. Very large companies in the hundreds of millions in revenue often drive this down to 57%.
Now that we know about some of the most common metrics, how should they be used? The most common way managers and executives want to use metrics is to compare engineers to each other or compare a team over time. This works for the Efficiency and the Cost metrics, which by the way are primarily measurements of management effectiveness. Managers make most of the cost decisions including staffing, vendor contracts, etc. so they should be on the hook to improve these metrics. In terms of product out the door as measured by story points completed each sprint a.k.a. Velocity, as mentioned above, is to be used to improve estimates, not try to speed up developers. Using this metric incorrectly will just result in bloated estimates, not faster development.
An interesting comparison of developers comes from a 1967 article by Grant and Sackman in which they stated a ratio of 28:1 for the time required by the slowest versus the fastest programmer to complete a task. This has been a widely cited ratio but a paper from 2000 revised this number to 4:1 at the most and more likely 2:1. While a 2x difference in speed is still impressive it doesn’t optimize for the overall quality of the product. An engineer who is very fast and with high quality but doesn’t interact with the product managers isn’t necessarily the overall most effective. My point is that there are many other factors to be considered than just story points per release when comparing engineers.
Subscribe to the AKF Newsletter
PDLC or SDLC
April 3, 2017 | Posted By: AKF
As a frequent technology writer I often find myself referring to the method or process that teams use to produce software. The two terms that are usually given for this are software development life cycle (SDLC) and product development life cycle (PDLC). The question that I have is are these really interchangeable? I don’t think so and here’s why.
Wikipedia, our collective intelligence, doesn’t have an entry for PDLC, but explains that the product life cycle has to do with the life of a product in the market and involves many professional disciplines. According to this definition the stages include market introduction, growth, mature, and saturation. This really isn’t the PDLC that I’m interested in. Under new product development (NDP) we find a defintion more akin to PDLC that includes the complete process of bringing a new product to market and includes the following steps: idea generation, idea screening, concept development, business analysis, beta testing, technical implementation, commercialization, and pricing.
Under SDLC, Wikipedia doesn’t let us down and explains it as a structure imposed on the development of software products. In the article are references to multiple different models including the classic waterfall as well as agile, RAD, and Scrum and others.
In my mind the PDLC is the overarching process of product development that includes the business units. The SDLC is the specific steps within the PDLC that are completed by the technical organization (product managers included). An image on HBSC’s site that doesn’t seem to have any accompanying explanation depicts this very well graphically.
Another way to explain the way I think of them is to me all professional software projects are products but not all product development includes software development. See the Venn diagram below. The upfront (bus analysis, competitive analysis, etc) and back end work (infrastructure, support, depreciation, etc) are part of the PDLC and are essential to get the software project created in the SDLC out the door successfully. There are non-software related products that still require a PDLC to develop.
Do you use them interchangeably? What do you think the differences are?
Subscribe to the AKF Newsletter
Build v. Buy
April 3, 2017 | Posted By: AKF
In many of our engagements, we find ourselves helping our clients understand when it’s appropriate to build and when they should buy.
If you perform a simple web search for “build v. buy” you will find hundreds of articles, process flows and decision trees on when to build and when to buy. Many of these are costcentric decisions including discounted cash flows for maintenance of internal development and others are focused on strategy. Some of the articles blend the two.
Here is a simple set of questions that we often ask our customers to help them with the build v. buy decision:
1. Does this “thing” (product / architectural component / function) create strategic differentiation in our business?
Here we are talking about whether you are creating switching costs, lowering barriers to exit, increasing barriers to entry, etc that would give you a competitive advantage relative to your competition. See Porter’s Five Forces for more information about this topic. If the answer to this question is “No – it does not create competitive differentiation” then 99% of the time you should just stop there and attempt to find a packaged product, open source solution, or outsourcing vendor to build what you need. If the answer is “Yes”, proceed to question 2.
2. Are we the best company to create this “thing”?
This question helps inform whether you can effectively build it and achieve the value you need. This is a “core v. context” question; it asks both whether your business model supports building the item in question and also if you have the appropriate skills to build it better than anyone else. For instance, if you are a social networking site, you *probably* don’t have any business building relational databases for your own use. Go to question number (3) if you can answer “Yes” to this question and stop here and find an outside solution if the answer is “No”. And please, don’t fool yourselves – if you answer “Yes” because you believe you have the smartest people in the world (and you may), do you really need to dilute their efforts by focusing on more than just the things that will guarantee your success?
3. Are there few or no competing products to this “thing” that you want to create?
We know the question is awkwardly worded – but the intent is to be able to exit these four questions by answering “yes” everywhere in order to get to a “build” decision. If there are many providers of the “thing” to be created, it is a potential indication that the space might become a commodity. Commodity products differ little in feature sets over time and ultimately compete on price which in turn also lowers over time. As a result, a “build” decision today will look bad tomorrow as features converge and pricing declines. If you answer “Yes” (i.e. “Yes, there are few or no competing products”), proceed to question (4).
4. Can we build this “thing” cost effectively?
Is it cheaper to build than buy when considering the total lifecycle (implementation through endoflife)
of the “thing” in question? Many companies use cost as a justification, but all too often they miss the key points of how much it costs to maintain a proprietary “thing”, “widget”, “function”, etc. If your business REALLY grows and is extremely successful, do you really want to be continuing to support internally developed load balancers, databases, etc. through the life of your product? Don’t fool yourself into answering this affirmatively just because you want to work on something neat. Your job is to create shareholder value – not work on “neat things” – unless your “neat thing” creates shareholder value.
There are many more complex questions that can be asked and may justify the building rather than purchasing of your “thing”, but we feel these four questions are sufficient for most cases.
A “build” decision is indicated when the answers to all 4 questions are “Yes”.
We suggest seriously considering buying or outsourcing (with appropriate contractual protection when intellectual property is a
concern) anytime you answer “No” to any question above.
Subscribe to the AKF Newsletter