August 21, 2018 | Posted By: Larry Steinberg
Open Source Software (OSS) is an efficient means to building out solutions rapidly with high quality. You utilize crowdsourced design, development and validation to conveniently speed your engineering. OSS also fosters a sense of sharing and building together - across company boundaries or in one’s free time.
So just pull a library down off the web, build your project, and your company is ready to go. Or is that the best approach? What could be in this library you’re including in your solution which might not be wanted? This code will be running in critical environments - like your SaaS servers, internal systems, or on customer systems. Convenience comes at a price and there are some well known situations of hacks embedded in popular open source libraries.
What is the best approach to getting the benefits of OSS and maintaining the integrity of your solution?
Good practices are a necessity to ensure a high level of security. Just like when you utilize OSS and then test functionality, scale, and load - you should be validating against vulnerabilities. Pre-production vulnerability and penetration testing is a good start. Also, utilize good internal process and reviews. Keep the process simple to maintain speed but establish internal accountability and vigilance on code that’s entering your environment. You are practicing good coding techniques already with reviews and/or peer coding - build an equivalency with OSS.
Always utilize known good repositories and validate the project sponsors. Perform diligence on the committers just like you would for your own employees. You likely perform some type of background check on your employees before making an offer - whether going to a third party or simply looking them up on linkedin and asking around. OSS committers have the same risk to your company - why not do the same for them? Understandably, you probably wouldn’t do this for a third party purchased solution, but your contract or expectation is that the company is already doing this and abiding by minimum security standards. That may not be true for your OSS solutions, and as such your responsibility for validation is at least slightly higher. There are plenty of projects coming from reputable sources that you can rely on.
Ensure that your path to production is only coming from artifacts which have been built on internal sources which were either developed or reviewed by your team. Also, be intentional about OSS library upgrades, this should planned and part of the process.
OSS is highly leveraged in today’s software solutions and provides many benefits. Be diligent in your approach to ensure you only see the upside of open source.
Need additional help? Contact Us!
August 9, 2018 | Posted By: Greg Fennewald
AKF Partners has worked with over 400 companies in our history and we’ve seen a wide variety of both good and bad things. The rise of server virtualization, the spread of NoSQL in the persistence tier, and the growing prevalence of cloud hosting are some of the technology developments in recent years. In the information security arena, there are several practices that are a good indicator of overall security program efficacy. Do them well and your security program is probably in good shape. Do them poorly – or not at all – and your security program might be headed for trouble.
1. Annual Security Training and Testing
Everyone loathes mandated training topics, especially those that require a defined amount of time be spent on the training (many of which are legislative requirements). There’s no reliable method to make security training fun or enjoyable, so let’s hold our noses and focus on why it is important:
- Testing establishes accountability – people do not want to fail and there should be consequences for failure.
- Security threats change over time – annually recurring training provides a vehicle for updating awareness on current threats. Look through the OWASP Top 10 for several years to see how threats change.
- Recurring training and testing are becoming table stakes – any audit is going to start with asking about your training and awareness program.
2. Security Incident Response Plan
An IRP is not amongst the first few security policies a company needs, but when it is needed, it is needed urgently.
- A security incident is virtually a certainty over a sufficiently large time horizon.
- Similar to parachutes and fire extinguishers, planning and practice dramatically improve results.
- Evolving data privacy regulations, GDPR for instance, are likely to heighten incident disclosure requirements – a solid IRP will address disclosure.
3. Open Source Software Inventory
Open source software inventory? How is that related to security? Many consider OSS inventory as a compliance requirement – ensuring the company complies with the licensing requirements of the open source components used, particularly important if the business redistributes the software. OSS inventory also has security applicability.
- Provides ability to identify risks when open source component vulnerabilities and exploits are disclosed – what’s in your stack and is the latest exploit a risk to your business?
- Most effective when coupled with a policy on how new open source components can be safely utilized.
- Lends itself well to automation and tooling with security resource oversight.
- Efficient, serving two purposes – open source license compliance and security vulnerabilities.
What do these three security practices have in common? People – not technology. Firewall rules and the latest intrusion detection tools are not on this list. Many security breaches occur as the result of a human error leading to a compromised account or improper system access. Training and testing your people on the basics, having a plan on how to respond should an incident occur, and being able to know if an open source disclosure affects your risk profile are three human-focused practices that help establish a security-minded culture. Without the proper culture, tools and automation are less likely to succeed.
Looking for additional help? Contact us!
August 2, 2018 | Posted By: Marty Abbott
The movement to SaaS specifically, and more broadly “Anything” (X) as a Service (XaaS) is driven by demand side (buyer) forces. In early cases within any industry, the buyer seeks competitive advantage over competitors. The move to SaaS allows the buyer to focus on core competencies, increasing investments in the areas that create true differentiation. Why spend payroll on an IT staff to support ERP solutions, mail solutions, CRM solutions, etc when that same payroll could otherwise be spent on engineers to build product differentiating features or enlarge a sales staff to generate more revenue?
As time moves on and as the technology adoption lifecycle advances, the remaining buyers for any product feel they have no choice; the talent and capabilities to run a compelling solution for the company simply do not exist. As such, the late majority and laggard adopters are almost “forced” into renting a service over purchasing software.
Whether for competitive reasons, as in the case of early adopters through the early majority, or for lack of alternatives as in the case of late majority and laggards, the movement to SaaS and XaaS represents a shift in risk as compared to the existing purchased product options. This shift in risk is very much like the shift that happens between purchasing and leasing a home.
Renting a home or an apartment is almost always more expensive than owning the same dwelling. The reason for this should be clear: the person owning the property expects to make a profit beyond the costs of carrying a mortgage and performing upkeep on the property over the life of the owner’s investment. There are special “inversion” cases where renting is less expensive, such as in a low rental demand market, but these cases tend to reset the market ownership prices (house prices fall) as rents no longer cover mortgages or ownership does not make sense.
Conversely, ownership is almost always less expensive than renting or leasing. But owners take on more risk: the risk of maintenance activities; the risk of market prices; the risk and costs associated with remodeling to make the property attractive, etc.
The matrix below helps put the shift described above into context.
A customer who “owns” an on-premise solution also “owns” a great deal of risk for all of the components necessary to achieve their desired outcomes: equipment, security, power, licenses, the “-ilities” (like availability), disaster recovery, release management, and monitoring of the solution. The primary components of this risk include fluctuation in asset valuation, useful life of the asset, and most importantly – the risk that they do not have the right skills to maximize the value creation predicated on these components.
A customer who “rents” a SaaS solution transfers most of these risks to a provider who specializes in the solution and therefore should be better able to manage the risk and optimize outcomes. In exchange, the customer typically pays a risk premium relative to ownership. However, given that the provider can likely operate the solution more cost effectively, especially if it is a multi-tenant solution, the risk premium may be small. Indeed, in extreme cases where the company can eliminate headcount, say after eliminating all on-premise solutions, the lessee may experience an overall reduction in cost.
But what about the provider of the service? After all, the “old world” of simply slinging code and allowing customers to take all the risk was mighty appealing; the provider enjoyed low costs of goods sold (and high gross margins) and revenue streams associated with both licensing and customization. The provider expects to achieve higher revenue from the risk premium charged for services. The provider also expects overall margins through greater efficiencies in running solutions with significant demand concentration at scale. The risk premium more than compensates the provider for the increased cost of goods sold relative to the on-premise business. Overall, the provider assumes risk for greater value creation. Both the customer and the provider win.
Architecture and product financial decisions are key to achieving the margins above.
Gross margins are directly correlated with the level of tenancy of any XaaS provider (Y axis). As such, while we want to avoid “all tenancy” for availability reasons, we desire a high level of tenancy to maximize equipment utilization and increase gross margins. Other drivers of gross margins include the level of demand upon shared components and the level of automation on all components – the latter driving down cost of labor.
The X axis of the chart above shows the operating expense associated with various business models. Multi-tenant XaaS offerings collapse the number of “releases supported in the wild” – reducing the operating expense (and increasing gross margins) associated with managing a code base.
Another way of viewing this is to look at the relative costs of software maintenance and administration costs for various business models.
Plotted in the “low COGS” (X axis), “low Maintenance quadrant of the figure above is “True XaaS”. Few versions of a release reduce our cost to maintain a code base, and high equipment utilization and automation reduces our cost to provision a service.
In the upper right and unattractive quadrant is the ASP (Application Service Provider) model, where we have less control over equipment utilization (it is typically provisioned for individual customers) and less control over defining the number of releases.
Hosting a solution on-premise to the customer may reduce our maintenance fees, if we are successful in reducing releases, but significantly increases our costs. This is different than the on-premise model (upper left) in which the customer bears the cost of equipment but for which we have a high number of releases to maintain. The XaaS solution is clearly beneficial overall to maximize margins.
AKF Partners helps companies transition on-premise, licensed software products to SaaS and XaaS solutions. Let us help you on your journey.
July 24, 2018 | Posted By: Marty Abbott
Over a decade of helping on-premise and licensed software companies through the transition to “Something as a Service” – whether that be Software (SaaS), Platform (PaaS), Infrastructure (IaaS), or Analytics (AaaS) – has given us a rather unique perspective on the various phases through which these companies transition. Very often we find ourselves in the position of a counselor, helping them recognize their current phase and making recommendations to deal with the cultural and operational roadblocks inherent to that phase.
While rather macabre, the phases somewhat resemble those of grieving after the loss of a loved one. The similarities make some sense here, as very often we work with companies who have had a very successful on-premise and/or licensed software business; they dominated their respective markets and “genetically” evolved to be the “alphas” in their respective areas. The relationship is strong, and their past successes have been very compelling. Why would we expect anything other than grieving?
But to continue to evolve, succeed, and survive in light of secular forces these companies must let their loved one (the past business) go and move on with a new and different life. To be successful, this life will require new behaviors that are often diametrically opposed to those that made the company successful in their past life.
It’s important to note that, unlike the grieving process with a loved one, these phases need not all be completed. The most successful companies, through pure willpower of the management team, power through each of these quickly and even bypass some of them to accelerate to the healing phase. The most important thing to note here is that you absolutely can move quickly – but it requires decisive action on the part of the executive team.
Phase 1: Denial This phase is characterized by the licensed/on-premise software provider completely denying a need to move to an “X” (something) as a Service (XaaS, e.g. SaaS, PaaS) model.
Commonly heard quotes inside the company:
- “Our customers will never move to a SaaS model.”
- “Our customers are concerned about security. IaaS, SaaS and PaaS simply aren’t an option.”
- “Our industry won’t move to a Cloud model – they are too concerned about ownership of their data.”
- “To serve this market, a solution needs to be flexible and customizable. Proprietary customer processes are a competitive advantage in this space – and our solution needs to map exactly to them.”
Reinforcing this misconceived belief is an executive team, a sales force, and a professional services team trapped in a prison of cognitive biases. Hypothesis myopia and asymmetric attention (both forms of confirmation bias) lead to psychological myopia. In our experience, companies with a predisposed bias will latch on to anything any customer says that supports the notion that XaaS just won’t work. These companies discard any evidence, such as pesky little startups picking up small deals, as aberrant data points.
The inherent lack of paranoia blinds the company to the smaller companies starting to nibble away at the portions of the market that the successful company’s products do not serve well. Think Seibel in the early days of Salesforce. The company’s product is too expensive and too complex to adequately serve the smaller companies beneath them. The cost of sales is simply too high, and the sales cycle too long to address the needs of the companies adopting XaaS solutions. In this phase, the company isn’t yet subject to the Innovator’s Dilemma as the blinders will not let them see it.
Ignorance is bliss…for awhile…
How to Avoid or Limit This Phase
Successful executive teams identify denial early and simply shut it down. They establish a clear vision and timeline to move to the delivery of a product as a service. As with any successful change initiative, the executive team creates, as a minimum:
- The compelling reason and need for change. This visionary element describes the financial and operational benefits in clear terms that everyone can understand. It is the “pulling” force necessary to motivate people through difficult times.
- A sense of fear for not making the change. This fear becomes the “stick” to the compelling “carrot” above. Often given the secular forces, this fear is quite simply the slow demise of the company.
- A “villain” or competitor. As is the case in nearly any athletic competition, where people perform better when they have a competitor of equivalent caliber (vs say running against a clock), so do companies perform better when competing against someone else.
- A “no excuses” cultural element. Everyone on the team is either committed to the result, or they get removed from the team. There is no room for passive-aggressive behavior, or behaviors inconsistent with the desired outcome. People clinging to the past simply prolong or doom the change initiative. Fast success, and even survival, requires that everyone be committed.
Phase 2: Reluctant but Only Partial Acceptance
This phase typically starts when a new executive, or sometimes a new and prominent product manager, is brought into the company. This person understands at least some of – and potentially all of – the demand side forces “pulling” XaaS across the curve of adoption, and notices the competition from below. Many times, the Innovator’s Dilemma keeps the company from attempting to go after the lower level competitors.
Commonly heard quotes inside the company:
- “Those guys (referring to the upstarts) don’t understand the complexities of our primary market – the large enterprise.”
- “There’s a place for those products in the SMB and SME space – but ‘real’ companies want the security of being on-premise.”
- “Sure, there are some companies entertaining SaaS, but it represents a tiny portion of our existing market.”
- “We are not diluting our margins by going down market.”
The company embarks upon taking all their on-premise solutions and hosting them, nearly exactly as implemented on-premise, as a “service”.
Many of the company’s existing customers aren’t yet ready to migrate to XaaS, but discussions are happening inside customer companies to move several solutions off-premise including email, CRM and ERP. These customers see the possibility of moving everything – they are just uncertain as to when.
How to Avoid or Limit This Phase
The answer for how to speed through or skip this phase is not significantly different than that of the prior phase. Vision, fear of death, a compelling adversary, and a “no excuses” culture are all necessary components.
Secular forces drive customers to seek a shift of risk. This shift is analogous to why one would rent instead of owning a home. The customer no longer wants the hassle of maintenance, nor are they truly qualified to perform that maintenance. They are willing to accept some potential increase in costs as capex shifts to opex, to relieve themselves of the burden of specializing in an area for which they are ill-equipped.
If not performed during the Denial phase, now is the time to remove executives who display behaviors inconsistent with the new desired SaaS Principles
Phase 3: Pretending to Have an Answer
The pretending phase starts with the company implementing essentially an on-premise solution as a “SaaS” solution. With small modifications, it is exactly what was shipped to customers before but presented online and with a recurring revenue subscription model. We often like to call this the “ASP” or “Application Service Provider” model. While the revenue model of the company shifts for a small portion of its revenue to recurring services fees, the solution itself has not changed much. In fact, for older client-server products Citrix or the like is often deployed to allow the solution to be hosted.
The product soon displays clear faults including lower than desired availability, higher than necessary cost of operations, higher than expected operating expenses, and lower margins than competitors overall. Often the company will successfully hide these “SaaS” results as a majority of their income from operations still come from on-premise solutions.
The company will often use nebulous terms like “Cloud” when describing the service offering to steer clear of direct comparisons with other “born SaaS” or “true SaaS” solutions. Sometimes, in extreme cases, the company will lie to itself about what they’ve accomplished, and it will almost always direct the conversation to topics seen as differentiating in the on-premise world rather than address SaaS Principles.
Commonly heard quotes inside the company:
- “Our ‘Cloud’ solution is world class – it’s the Mercedes of all solutions in the space with more features and functionality than any other solution.”
- “The smaller guys don’t have a chance. Look at how long it will take them to reach feature parity. The major players in the industry simply won’t wait for that.”
- “We are the Burger King of SaaS providers – you can still have it your way. And we know you need it your way.”
Meanwhile, customers are starting to look at true SaaS solutions. They tire of availability problems, response time issues, the customization necessary to get the solution to work in a suitable fashion and the lack of configurability. The lead time to implementation is still too long.
Sales people continue to sell the product the same way; promising whatever a customer wants to get a sale. Engineers still develop the same way, using the same principles that made the company successful on-premise and completely ignorant of the principles necessary to be successful in the SaaS world.
How to Avoid or Limit This Phase
It’s not completely a bad thing to launch, as a first step, a “hosted” version of a company’s licensed product. But the company must understand internally that it is only an interim step.
In addition to the visionary and behavioral components of the previous phases, the company now must accept and be planning for a smaller functionality solution that will be more likely adopted by “innovators” and “early majority” companies. The concept of MVP relative to the Technology Adoption Lifecycle is important here.
Further, the company must be aggressively weeding product, sales, and technology executives who lack the behaviors or skills to be successful and seeding the team with people who have “done this before”. Sales teams must act similarly to used car sales people in that they can only sell “what is on the lot” that will fit customer “need”, as compared to new car sales people who can promise options and colors from the factory (“It will take more time”) that more precisely fit a customer’s “want”.
Phase 4: Fear
The company loses its first major deal or two to a rival product that appears to be truly SaaS and abides by SaaS Principles. They realize that their ASP product simply isn’t going to cut it, and Sales people are finally “getting” that the solution they have simply won’t work. The question is: Is the company too late? The answer depends on how long it took the company to get to this position. A true SaaS solution is at the very least months away and potentially years away. If the company moves initially right to the “Fear” stage and properly understands the concepts behind the TALC, they have a chance.
Commonly heard quotes inside the company:
- “We’re screwed unless we get this thing re-architected in order to properly compete in the XaaS space.”
- “Stop behaving like we used to – stop promising customizations. That’s not who we are anymore.”
- “The new product needs to do everything the old product did.” [Incorrect and prone to failing]
- “Think smaller, and faster. Think some of today’s customers – not all of them – for the first release. Understand MVP is relative to the TALC.” [Correct and will help drive success]
How to Avoid or Limit This Phase
This is the most easily avoided phase. With proper planning and execution in prior phases, a company can completely skip the fear stage.
When companies find themselves here, its typically because they have not addressed the talents and approach of their sales, product, and engineering teams. Sales behaviors must change to only sell what’s “on the car lot”. Engineers must understand how to build the “-ilities” into products from day 1. Product managers must switch to identifying market need, rather than fulfilling customer want. Executives must now run an entirely different company.
Final Phase: Healing or Marginalization
Companies successful enough to achieve this phase do so only through launching a true XaaS product – one that abides by XaaS principles built and run by executives, managers and individual contributors who truly understand or are completely wedded to learning what it means to be successful in the XaaS world.
The phases of grief are common among many of our customers. But unlike grieving for a loved one, they are not necessary. Quick progression, or better yet avoidance, of these phases can be accomplished by:
- Establishing a clear and compelling vision based on market and secular force analysis, and an understanding of the technology adoption lifecycle. As with any vision, it should not only explain the reason for change, but the positive long-term financial impact of change.
- Ensuring that everyone understands the cost of failure, helping to instill some small level in fear that should help drive cultural change.
- Ensuring that a known villain or competitor exists, against which we are competing to help boost performance and speed of transition.
- Aggressively addressing the cultural and behavioral changes necessary to be successful. Anyone who is not committed and displaying the appropriate changes in behavior needs to be weeded from the garden.
This shift often results in a significant portion of the company departing – sometimes willingly and sometimes forcefully. Some people don’t want to change and can find other companies (for a while) where their skills and behaviors are relevant. Some people have the desire, but may not be capable of changing in the time necessary to be successful.
Image Credit: Tim Gouw from Pexels
July 24, 2018 | Posted By: Greg Fennewald
Contemplating how to scale your website is not often a paramount concern in the early days when the focus is on time to market and core functionality. Growth over time will eventually force you to architect a solution to get ahead of the growth and deliver the availability and scalability the business demands. The AKF Scale Cube is one way to frame the scalability challenge.
The AKF Scale Cube has three axes by which a website can be scaled. X axis, or horizontal duplication, is a common first choice. Duplicating web and app servers into load balanced pools is a best practice that provides the ability to conduct maintenance or roll code without taking the site down as well as scalability if the pools are N+1 or even N+2.
The Y axis is a service split - decomposing a monolithic code base into smaller services than can run independently. The Y axis also allows you to scale your technology team - teams focus on the services for which they are responsible and no longer need complete domain expertise. Y axis does require a lot of development work though, competing for resources the business wants to use for new features.
You can read more about the Scale Cube and the X and Y axes here.
The third axis is Z, a lookup oriented split. A common choice is geographic location or customer identity. A Z axis split takes a monolithic stack and slices it into N smaller stacks, using fewer or smaller servers and working with 1/Nth of the data.
A case can be made that a Z axis split is your best option when X axis is losing effectiveness and infrastructure costs are becoming unsustainable. Consider this situation; you have already implemented X axis split by deploying load balanced pools of web and app servers. You’ve gone a step further by deploying read-only DB replicas to handle reporting workloads, preserving compute power for the write DB. It’s not enough though, your production DB is wheezing after going up one flight of stairs.
The business intentionally took on technical debt in the form of stored procedures for business logic early on to improve time to market. Development resources are now slowly removing those and writing the business logic in the app tier. New features are still needed so there are no resources to begin on the development work needed for a Y axis split.
X axis is running out of steam and increasing costs and you do not have resources to work on a Y axis split in the near term. Z axis can save the day and provide some breathing room. Z axis does not take as much development work as Y does and the X axis infrastructure work your team has already done will be similar to building smaller Z axis stacks. Your team develops cookie cutter stacks via automation and scripting that handle 2,500 customers well. You start a new stack when the previous one reaches 2,000 customers to leave some depth of data growth. These smaller stacks reduce your growth costs as compared to Z axis alone. The business has time to complete the stored procedure removal before turning towards Y axis service splits all the while delivering on the feature roadmap. You keep your job.
July 20, 2018 | Posted By: Pete Ferguson
One of the most common questions we get is “What are the most common failures you see tech and product teams make?” To answer that question we queried our database consisting of 11 years of anonymous client recommendations. Here are the top 20 most repeated failures and recommendations:
1) Failing to Design for Rollback
If you are developing a SaaS platform and you can only make one change to your current process make it so that you can always roll back any of your code changes. Yes, we know that it takes additional engineering work and additional testing to make nearly any change backwards compatible but in our experience that work has the greatest ROI of any work you can do. It only takes one really bad release in which your site performance is significantly degraded for several hours or even days while you attempt to “fix forward” for you to agree this is of the utmost importance. The one thing that is most likely to give you an opportunity to find other work (i.e. “get fired”) is to roll a product that destroys your business. In other words, if you are new to your job DO THIS BEFORE ANYTHING ELSE; if you have been in your job for awhile and have not done this DO THIS TOMORROW. (Related Content: Monitoring for Improved Fault Detection)
2) Confusing Product Release with Product Success
Do you have “release” parties? Stop it! You are sending your team the wrong message! A release has nothing to do with creating shareholder value and very often it is not even the end of your work with a specific product offering or set of features. Align your celebrations with achieving specific business objectives like a release increasing signups by 10%, or increasing checkouts by 15% or increasing the average sale price of a all checkouts by 12% or increasing click-through-rates by 22%. See #10 below on incenting a culture of excellence. Don’t celebrate the cessation of work – celebrate achieving the success that makes shareholder’s wealthy! (Related Content: Agile and the Cone of Uncertainty)
3) Insular Product Development / Engineering
How often does one of your engineering teams complain about not “being in the loop” or “being surprised” by a change? Does your operations team get surprised about some new feature and its associated load on a database? Does engineering get surprised by some new firewall or routing infrastructure resulting in dropped connections? Do not let your teams design in a vacuum and “throw things over the wall” to another group. Organize around your outcomes and “what you produce” in cross functional teams rather than around activities and “how you work.” (Related Content: The No Surprises Rule)
4) Over Engineering the Solution
One of our favorite company mottos is “simple solutions to complex problems”. The simpler the solution, the lower the cost and the faster the time to market. If you get blank stares from peers or within your organization when you explain a design do not assume that you have a team of idiots – assume that you have made the solution overly complex and ask for assistance in resolving the complexity.
Image Source: Hackernoon.com
5) Allowing History to Repeat itself
Organizations do not spend enough time looking at past failures. In the engineering world, a failure to look back into the past and find the most commonly repeated mistakes is a failure to maximize the value of the team. In the operations world, a failure to correlate past site incidents and find thematically related root causes is a guarantee to continue to fight the same fires over and over. The best and easiest way to improve our future performance is to track our past failures, group them into groups of causation and treat the root cause rather than the symptoms. Keep incident logs and review them monthly and quarterly for repeating issues and improve your performance. Perform post mortems of projects and site incidents and review them quarterly for themes.
6) Vendor Lock
Every vendor has a quick fix for your scale issues. If you are a hyper growth SaaS site, however, you do not want to be locked into a vendor for your future business viability; rather you want to make sure that the scalability of your site is a core competency and that it is built into your architecture. This is not to say that after you design your system to scale horizontally that you will not rely upon some technology to help you; rather, once you define how you can horizontally scale you want to be able to use any of a number of different commodity systems to meet your needs. As an example, most popular databases (and NoSQL solutions) provide for multiple types of native replication to keep hosts in synch.
7) Relying on QA to Find Your Mistakes
You cannot test quality into a system and it is mathematically impossible to test all possibilities within complex systems to guarantee the correctness of a platform or feature. QA is a risk mitigation function and it should be treated as such. Defects are an engineering problem and that is where the problem should be treated. If you are finding a large number of bugs in QA, do not reward QA – figure out how to fix the problem in engineering! Consider implementing test driven design as part of your PDLC. If you find problems in production, do not punish QA; figure out how you created them in engineering. All of this is not to say that QA should not be held responsible for helping to mitigate risk – they should – but your quality problems are an engineering issue and should be treated within engineering.
8) Revolutionary or “Big Bang” Fixes
In our experience, complete re-writes or re-architecture efforts end up somewhere on the spectrum of not returning the desired ROI to complete and disastrous failures. The best projects we have seen with the greatest returns have been evolutionary rather than revolutionary in design. That is not to say that your end vision should not be to end up in a place significantly different from where you are now, but rather that the path to get there should not include “and then we turn off version 1.0 and completely cutover to version 2.0”. Go ahead and paint that vivid description of the ideal future, but approach it as a series of small (but potentially rapid) steps to get to that future. And if you do not have architects who can help paint that roadmap from here to there, go find some new architects.
9) The Multiplicative Effect of Failure – Eliminate Synchronous Calls
Every time you have one service call another service in a synchronous fashion you are lowering your theoretical availability. If each of your services are designed to be 99.999% available, where a service is a database, application server, application, webserver, etc. then the product of all of the service calls is your theoretical availability. Five calls is (.99999)^5 or 99.995 availability. Eliminate synchronous calls wherever possible and create fault-isolative architectures to help you identify problems quickly.
10) Failing to Create and Incentivize a Culture of Excellence
Bring in the right people and hold them to high standards. You will never know what your team can do unless you find out how far they can go. Set aggressive yet achievable goals and motivate them with your vision. Understand that people make mistakes and that we will all ultimately fail somewhere, but expect that no failure will happen twice. If you do not expect excellence and lead by example, you will get less than excellence and you will fail in your mission of maximizing shareholder wealth. (Related Content: Three Reasons Your Software Engineers May Not Be Successful)
11) Under-Engineer for Scale
The time to think about scale is when you are first developing your platform. If you did not do it then, the time to think about scaling for the future is right now! That is not to say that you have to implement everything on the day you launch, but that you should have thought about how it is that you are going to scale your application services and your database services. You should have made conscious decisions about tradeoffs between speed to market and scalability and you should have ensured that the code will not preclude any of the concepts we have discussed in our scalability postings. Hold quarterly scalability meetings where you discuss what you need to do to scale to 10x your current volume and create projects out of the action items. Approach your scale needs in evolutionary rather than revolutionary fashion as in #8 above.
12) “Not Built Here” Culture
We see this all the time. You may even have agreed with point (6) above because you have a “we are the smartest people in the world and we must build it ourselves” culture. The point of relying upon third parties to scale was not meant as an excuse to build everything yourselves. The real point to be made is that you have to focus on your core competencies and not dilute your engineering efforts with things that other companies or open source providers can do better than you. Unless you are building databases as a business, you are probably not the best database builder. And if you are not the best database builder, you have no business building your own databases for your SaaS platform. Focus on what you should be the best at: building functionality that maximizes your shareholder wealth and scaling your platform. Let other companies focus on the other things you need like routers, operating systems, application servers, databases, firewalls, load balancers and the like.
13) A New PDLC will Fix My Problems
Too often CTO’s see repeated problems in their product development life cycles such as missing dates or dissatisfied customers and blame the PDLC itself.
The real problem, regardless of the lifecycle you use, is likely one of commitment and measurement. For instance, in most Agile lifecycles there needs to be consistent involvement from the business or product owner. A lack of involvement leads to misunderstandings and delayed products. Another very common problem is an incomplete understanding or training on the existing PDLC. Everyone in the organization should have a working knowledge of the entire process and how their roles fit within it. Most often, the biggest problem within a PDLC is the lack of progress measurement to help understand likely dates and the lack of an appropriate “product discovery” phase to meet customer needs. (Related Content: The Top Five Most Common PDLC Failures)
14) Inability to Hire Great People Quickly
Often when growing an engineering team quickly the engineering managers will push back on hiring plans and state that they cannot possibly find, interview, and hire engineers that meet their high standards. We agree that hiring great people takes time and hiring decisions are some of the most important decisions managers can make. A poor hiring decision takes a lot of energy and time to fix. However, there are lots of ways to streamline the hiring process in order to recruit, interview, and make offers very quickly. A useful idea that we have seen work well in the past are interview days, where potential candidates are all invited on the same day. This should be no more than 2 - 3 weeks out from the initial phone screen, so having an interview day per months is a great way to get most of your interviewing in a single day. Because you optimize the interview process people are much more efficient and it is much less disruptive to the daily work that needs to get done the rest of the month. Post interview discussions and hiring decisions should all be made that same day so that candidates get offers or letters of regret quickly; this will increase the likelihood of offers being accepted or make a professional impression on those not getting offers. The key is to start with the right answer that “there is a way to hire great people quickly” and the myriad of ways to make it happen will be generated by a motivated leadership team.
15) Diminishing or Ignoring SPOFs (Single Point of Failure)
A SPOF is a SPOF and even if the impact to the customer is low it still takes time away from other work to fix right away in the event of a failure. And there will be a failure…because that is what hardware and software does, it works for a long time and then eventually it fails! As you should know by now, it will fail at the most inconvenient time. It will fail when you have just repurposed the host that you were saving for it or it will fail while you are releasing code. Plan for the worst case and have it run on two hosts (we actually recommend to always deploy in pools of three or more hosts) so that when it does fail you can fix it when it is most convenient for you.
16) No Business Continuity Plan
No one expects a disaster but they happen and if you cannot keep up normal operations of the business you will lose revenue and customers that you might never get back. Disasters can be huge, like Hurricane Katrina, where it take weeks or months to relocate and start the business back up in a new location. Disasters can also be small like a winter snow storm that keeps everyone at home for two days or a HAZMAT spill near your office that keeps employees from coming to work. A solid business continuity plan is something that is thought through ahead of time, before you need it, and explains to everyone how they will operate in the event of an emergency. Perhaps your satellite office will pick up customer questions or your tech team will open up an IRC channel to centralize communication for everyone capable of working remotely. Do you have enough remote connections through your VPN server to allow for remote work? Spend the time now to think through what and how you will operate in the event of a major or minor disruption of your business operations and document the steps necessary for recovery.
17) No Disaster Recovery Plan
Even worse, in our opinion, than not having a BC plan is not having a disaster recovery plan. If your company is a SaaS-based company, the site and services provided is the company’s sole source of revenue! Moreover, with a SaaS company, you hold all the data for your customers that allow them to operate. When you are down they are more than likely seriously impaired in attempting to conduct their own business. When your collocation facility has a power outage that takes you completely down, think 365 Main datacenter in San Francisco, how many customers of yours will leave and never return? Our preference is to provide your own disaster recovery through multiple collocation facilities but if that is not yet technically feasible nor in the budget, at a minimum you need your code, executables, configurations, loads, and data offsite and an agreement in place for both collocation services as well as hosts. Lots of vendors offer such packages and they should be thought of as necessary business insurance.
If you are cloud hosted, this still applies to you! We often find in technical due diligence reviews that small companies who are rapidly growing haven’t yet initiated a second active tech stack in a different availability zone or with a second cloud provider. Just because AWS, Azure and others have a fairly reliable track record doesn’t mean they always will. You can outsource services, but you still own the liability!
Image Source: Kaibizzen.com.au
18) No Product Management Team or Person
In a similar vein to #13 above, there needs to be someone or a team of people in the organization who have responsibility for the product lines. They need to have authority to make decisions about what features get added, which get delayed, and which get deprecated (yes, we know, nothing ever gets deprecated but we can always hope!). Ideally these people have ownership of business goals (see #10) so they feel the pressure to make great business decisions.
19) Failing to Implement Continuously
Just because you call it scheduled maintenance does not mean that it does not count against your uptime. While some of your customers might be willing to endure the frustration of having the site down when they want to access it in order to get some new features, most care much more about the site being available when they want it. They are on the site because the existing features serve some purpose for them; they are not there in the hopes that you will rollout a certain feature that they have been waiting on. They might want new features, but they rely on existing features. There are ways to roll code, even with database changes, without bringing the site down (back to #17 - multiple active sites also allows for continuous implementation and the ability to roll back). It is important to put these techniques and processes in place so that you plan for 100% availability instead of planning for much less because of planned down time.
20) Firewalls, Firewalls, Everywhere!
We often see technology teams that have put all public facing services behind firewalls while many go so far as to put firewalls between every tier of the application. Security is important because there are always people trying to do malicious things to your site, whether through directed attacks or random scripts port scanning your site. However, security needs to be balanced with the increased cost as well as the degradation in performance. It has been our experience that too often tech teams throw up firewalls instead of doing the real analysis to determine how they can mitigate risk in other ways such as through the use of ACLs and LAN segmentation. You as the CTO ultimately have to make the decision about what are the best risks and benefits for your site.
Whatever you do, don’t make the mistakes above! AKF Partners helps companies avoid costly product and technology mistakes - and we’ve seen most of them. Give us a call or shoot us an email. We’d love to help you achieve the success you desire.
July 13, 2018 | Posted By: AKF
Security culture is one of the hardest aspects of security to get right. Unfortunately, it is also the most important thing for security that needs to be done right. It is so important because your culture has a tremendous impact on a very important aspect of your company, your employees.
Multiple studies have been conducted over the years and the number one cause for breaches is always employees. Whether purposeful or inadvertent, breaches occur and most often traced back to employees. Why would an adversary attempt to gain access to a database by leveraging weaknesses in a web server when they can just compromise a database administrator? The level of access that employees have make them a rich target.
The security culture of a company is like any other culture: cultures thrive when employees embrace them and fail when they do not. When employees subscribe to the security culture, then ultimately the company becomes more secure because they will become harder targets and in the course of their daily work they will be able to spot a compromised machine or weak password because sloppy security stands out in a healthy and thriving security culture.
Five Areas of Focus to Improve Security Culture:
-“No, however” vs “No”
When a new implementation or modification comes down the line and it doesn’t mesh with security principles, do you say “No” or “No, however”? There is a very distinct difference between these two responses. The first response of just “No” means you have drawn a line in the sand and security trumps whatever is being attempted. The issue with this is that no doesn’t help create productivity and solve the business problem. If you allow security to stand in the way of business then you will soon be looking for a new job. The only true way to be secure is to stay out of business, so security needs to find a way to coexist with business.
If your response is “No, however” then you are off to a good start. There are times when new product may not align with the regulations that are required for your company. And that is ok. It is your job to work with the team and figure out how to best implement what they want and not violate those regulations.
By spending your time shutting down new business ideas for the sake of security, you will quickly realize that no one is coming to you anymore. And that is detrimental for security.
If you’ve never heard groans and sighs as you announce the next round of security training, then you can count yourself part of a very exclusive group of people; People with training that is both engaging and beneficial.
With all the meetings and events that are required in order to keep all the employees going in the same direction, adding more time away from keyboards is never easy. If time away is unproductive then not only is no one learning anything, the company is simultaneously losing money. The topics of discussion need to be relevant and beneficial to the employees and the company.
If you find yourself talking about the latest attack vectors to Red Hat based systems and the majority of your systems are Windows or iOS based then you are going down the wrong path. You need to understand your company and the direction it is going in order to gear you training towards those aspects. Your training can either be productive or unproductive. The more productive it is the easier security is in the long run.
Who is required to be a part of the culture of security? Is your CISO the only one pushing training and recommending that security be brought into development and not considered an afterthought? Or does it go higher?
People are very observant. If they notice that security isn’t embraced by everyone up to and including the CEO then why should they embrace it? Additionally, the most sought-after targets for an adversary are usually the C-level employees. A lot of open source information exists about them and they tend to have a lot of access. So, if the most susceptible employees are not required to be a part of the culture, what reason would someone else have to be a part of the culture?
-Level of Attachment
How attached you are to doing security solely in house is indicative of how quickly it will become stagnant. To think that your company is unique in how it is targeted by adversaries only sets you up for failure. You need to open your doors to third parties or similarly based companies when it comes to security in order to ensure that you are staying relevant with the latest trends and threats that exist.
Unless you are a company that does security for companies, then you are already at a competitive disadvantage for solely performing your own scans. Companies exist with the sole purpose of staying current with the tactics being utilized and then provide you with feedback on how to protect yourself; use them. This isn’t to say that you should completely outsource all security responsibility. Depending on how vulnerable your company is looking at bringing in a third party monthly or quarterly. In between those visits conduct scans internally as well.
Additionally, threats that exist today cast wide nets to see what they can compromise. If you run a deli, chances are the bakery down the street has the same potential to get attacked by the same bad actor. Have your security professionals communicate regularly with them. You aren’t sharing any sort of Intellectual Property by talking about the recent scans or attacks you’ve been seeing. You are helping to create a community that is stronger together against outside attacks.
-Best Business Practices
Some policies you can’t get around. Heavily regulated markets require certain steps be taken in order to secure financial data, health data, personally identifiable information, etc. These policies need to be put into place. This will leave gaps in being secure though. That is where best business practices come in.
Security in an information age isn’t something new. People have been doing it for a very long time. Lists of recommendations, and even steps, exist to help people bridge the gap from insecure to secure. Use these and bolster what you have in place. If something doesn’t pass the “smell test”, then discard. Your policies and procedures should be a living document that is reviewed and updated regularly. If you follow the current trends that exist and keep ingesting the best practices then a lot of the work will be done for you.
This list isn’t all inclusive. Maybe you find yourself better situated in one, or more, of the above aspects. If that is the case, then great. But if you need help in shoring up the above five areas or are looking for additional support on how best to secure your company, AKF can help.
July 12, 2018 | Posted By: AKF
Most companies do a thorough job of financial due diligence when they acquire other companies. But all too often, dealmakers simply miss or underestimate the significance of people issues. The consequences can be severe, from talent loss after a deal’s announcement, to friction or paralysis caused by differences in decision-making styles.
When acquirers do their people homework, they can uncover skills & capability gaps, points of friction, and differences in decision making. They can also make the critical people decisions - who stays, who goes, who runs the various lines of business, what to do with the rank and file at the time the deal is announced or shortly thereafter. Making such decisions within the first 90 days is critical to the success of a deal.
Take for example, Charles Schwab’s 2000 acquisition of US Trust. Schwab & the nation’s oldest trust company set out to sign up the newly minted millionaires created by a soaring bull market. But the cultures could not have been farther apart – a discount do-it-yourself stock brokerage style and a full-service provider devoted to pampering multimillionaires can make for a difficult integration. Six years after the merger, Chuck Schwab came out of retirement to fix the issues related to culture clash. The acquisition reflects a textbook common business problem. The dealmakers simply ignored or underestimated the significance of people and cultural issues.
Another example can be found in the 2002 acquisition of PayPal by eBay. The fact that many on the PayPal side referred to it as a merger, sets the stage for conflicting cultures. eBay was often embarrassed by the fact that PayPal invoice emails for a won auction arrived before the eBay end of auction email - PayPal made eBay look bad in this instance and the technology teams were not eager to combine. As well, PayPal titles were discovered to be one level higher than eBay titles considering the scope of responsibilities. Combining the technology teams did not go well and was ultimately scrapped in favor of dual teams - not the most efficient organizational model.
People due diligence lays the groundwork for a smooth integration. Done early enough, it also helps acquirers decide whether to embrace or kill a deal and determine the price they are willing to pay. There’s a certain amount of people due diligence that companies can and must do to reduce the inevitable fallout from the acquisition process and smooth the integration.
Ultimately, the success or failure of any deal has to do with people. Empowering people and putting them in a position where they will be successful is part of our diligence evaluation at AKF Partners. In our experience with clients, an acquiring company must start with some fundamental question:
1. What is the purpose of the deal?
2. Whose culture will the new organization adopt?
3. Will the two cultures mesh?
4. What organizational structure should be adopted?
5. How will rank-and-file employees react to the deal?
Once those questions are answered, people due diligence can focus on determining how well the target’s current structure and culture will mesh with those of the proposed new company, who should be retained and by what means, and how to manage the reaction of the employee base.
In public, deal-making executives routinely speak of acquisitions as “mergers of equals.” That’s diplomatic, politically correct speak and usually not true. In most deals, there is not only a financial acquirer, there is also a cultural acquirer, who will set the tone for the new organization after the deal is done. Often, they are one and the same, but they don’t have to be.
During our Technology Due Diligence process at AKF Partners, we evaluate the product, technology and support organizations with a focus on culture and think through how the two companies and teams are going to come together. Who the cultural acquirer is dependes on the fundamental goal of the acquisition. If the objective is to strengthen the existing product lines by gaining customers and achieving economies of scale, then the financial acquirer normally assumes the role of the cultural acquirer.
People due diligence, therefore, will be to verify that the target’s culture is compatible enough with the acquirers to allow for the building of necessary bridges between the two organizations. Key steps that are often missed in the process:
• Decide how the two companies will operate after the acquisition — combined either as a fully integrated operating company or as autonomous operating companies.
• Determine the new organizational structure and identify areas that will need to be integrated.
• Decide on the new executive leadership team and other key management positions.
• Develop the process for making employment-related decisions.
With regard to the last bullet point, some turnover is to be expected in any company merger. Sometimes shedding employees is even planned. It is important to execute The Weed, Seed & Feed methodology ongoing not just at acquisition time. Unplanned, significant levels of turnover negatively impact a merger’s success.
AKF Partners brings decades of hands-on executive operational experience, years of primary research, and over a decade of successful consulting experience to the realm of product organization structure, due diligence and technology evaluation. We can help your company successfully navigate the people due diligence process.
July 12, 2018 | Posted By: Marty Abbott
XaaS (e.g. SaaS, PaaS) mass-market adoption is largely a demand-side “pull” of a product through the Technology Adoption Lifecycle (TALC). One of the best books describing this phenomenon is the book that made the TALC truly accessible to business executives in technology companies, Crossing the Chasm.
An interesting side note here is that Crossing the Chasm was originally published in 1991. While many people associate the TALC with Geoffrey Moore, it’s actual origin is Everett Rogers’ research on the diffusion of innovations. Rogers’ initial research focused on the adoption of hybrid corn seed by rural farmers. Rogers and team expanded upon and generalized the research between 1957 and 1962 when he released his book Diffusion of Innovations. While the 1962 book was a critical success, I find it interesting that it took nearly 30 years to bring the work to most technology business executives.
Several forces combine to incent companies within each phase of the TALC to consider, and ultimately adopt, XaaS solutions. Progressing from left to right through the TALC, companies within each phase are decreasingly less aware of these forces and increasingly more resistant to them until the forces are overwhelming for that segment or phase. Innovators, therefore, are the most aware of emerging forces, seek to use them for competitive advantage and as a result are least resistant to them. Laggards are least aware (think heads in the sand) and most resistant to change, desiring status quo over differentiation.
While the list below is incomplete, it has a high degree of occurrence across industries and products. Our experience is that even when they are experienced in isolation (without other demand side XaaS forces), they are sufficient to pull new XaaS products across the TALC.
Demand Side Forces
Talent Forces: Opportunity, Supply and Demand
The first two forces represent the talent available to the traditional internally focused corporate IT organization. The rise of XaaS offerings and the equity compensation opportunities inherent to them has been a vacuum sucking what little great talent exists within corporate IT. This first force really serves to add insult to injury to the second force: the initial low supply of talented engineers available in the US.
Think about this: Per-capita (normalized for population growth) US graduates for engineers within the US has remained relatively constant since 1945. While there has been absolute growth, until very recently that growth has slightly underperformed relative to the growth in population. Contrast this with the fact that college graduation rates in 2012 were higher than high school graduation rates in 1940. Add in that most engineering fields were at or near economic full employment during the great recession and we have a clear indication that we simply do not produce enough engineers for our needs. The same is true on a global level.
This constrained supply has led to alternative means of educating programmers, systems administrators, database administrators and data analysts. “Boot Camps”, or short courses meant to help teach people basic skills to contribute to the product and IT workforces, have sprung up around the nation to help fulfill this need. Once trained, the people fulfill many lower level needs in the product and IT space, but they are not typically equivalent to engineers with undergraduate degrees.
Constrained supply, growing demand and a clear differentiation in employment for engineers working in product organizations compared to IT organizations means that IT groups simply aren’t going to get great talent. Without great talent, there can be no great development or operations teams to create or run solutions. As such, the talent forces alone mandate that many IT teams look to outsource as much as they can: everything from infrastructure to the platforms that help make employees more efficient.
Core vs. Context
Why would a company spend time or capital on running third party solutions or developing bespoke solutions unless those activities result in competitive differentiation? Doing so is a waste of precious time and focus. Companies simply can’t waste time running email servers, or building CRM solutions and soon will find themselves not spending time on anything that doesn’t directly lend itself to competitive advantage.
Put simply, anytime a company can’t maximize profit margins through keeping in-house systems highly utilized, they should (and will) seek to outsource the workload to an IaaS provider. Why purchase a server to do work 8 hours a day and sit idle 16? Why purchase a server capable of handling peak period of 10 minutes of traffic, when a server one fifth its size is all that’s needed for the remainder of the day?
Part of staying competitive is focusing on profit maximization to invest those profits in other, higher return opportunities. Companies that don’t at least leverage Infrastructure as a Service to optimize their costs of operations and costs of goods sold (if doing so results in lowered COGS) are arguably violating their fiduciary responsibility.
Risk is the probability that an event occurs, multiplied by the impact of that event should it occur. Companies that specialize in providing a service, whether it be infrastructure as a service or software as a service typically have better skills and more experience in providing that service than other companies. This means that they can both reduce the probability of occurrence and likely the impact of the occurrence should it happen. Anytime such a risk shift is possible, where company A can shift its risk of operating component B to a company C specializing in managing that risk at an affordable price, it should be taken. The XaaS movement provides many such opportunities.
Time to Market
One of the most significant reasons why companies choose either SaaS or IaaS solutions is the time to market benefit that their usage provides. In the SaaS world, it’s faster and easier to launch a solution that provides compelling business efficiencies than purchasing, implementing and running that system in-house. In the IaaS world, infrastructure on demand is possible and companies need not wait for onerous lead times associated with ordering and provisioning equipment.
Focus on Returns
Most projects, including packaged software purchased and implemented on site are delivered late and fail to produce the returns on which the purchase was predicated. Further, these projects are “hard to kill”, with millions invested and long depreciation cycles; companies are financially motivated to “ride” their investments to the bitter end in the hopes of minimizing the loss. In the extreme, XaaS solutions offer a simple way to flee a sinking ship: leave. Because the solution is leased, you can (obviously with some work) switch to a competing provider. Capital is freed up in the company for higher return investments associated with revenue rather than costs.
The outcomes of these factors is clear. In virtually every industry, and for nearly every product, companies will move from on-premise or built-in-house solutions to XaaS solutions. CEOs, COOs, CTOs and CIOs will all seek to “Get this stuff out of my shop!”
Companies that believe it will never happen in their industry are simply kidding themselves and completely ignoring the outcomes predicted within the technology adoption life-cycle. Every company needs to focus on fewer, higher value things. Every company should outsource risky operational activities if there is a better manager of that risk than them. Every company is obligated to seek better margins and greater profitability – even those companies in the not for profit sector need to maximize the benefit of every donated dollar. Every company seeks opportunities to bring their value to market faster. And finally, few companies can find the great talent in today’s market necessary to be successful in their corporate IT and product operations endeavors.
If you produce on-premise, licensed software products and have not yet considered the movement to a SaaS solution, you need to do so now. Failing to do so could mean the demise of your company.
We are experts in helping companies migrate products to the XaaS model. Contact us - we would love to help you.
July 8, 2018 | Posted By: AKF
AKF often recommends to our clients the adoption of business metric monitoring – the use of high-level user activity or transaction patterns that can often provide early warning of an incident. Business metric monitors will not tell you where or what the problem is, rather – and most importantly – they tell you something appears to be abnormal and should be investigated, that something has affected your customer experience.
A significant part of recovery time (and therefore availability) is the time required to detect and localize service incidents. A 2013 study by Business Internet Group of San Francisco found that of the 40 top-performing websites (as identified by KeyNote Systems), 72% had suffered user-visible failures in common functionality, such as items not being added to a shopping cart or an error message being displayed.
Our conversations with clients confirm that detecting these failures is a significant problem. AKF Partners estimates that 75% of the time spent recovering from application-level failures is time spent detecting them! Application-level failures can sometimes take days to detect, though they are repaired quickly once found. Fast detection of these failures (Time to Detect – TTD) is, therefore, a key problem in improving service availability.
The duration of a product impairment is TTR.
To improve TTR, implement a good notification system that first, based on business metrics, tells you that an error affecting your users is happening. Then, rely upon application and system monitoring to inform you on where and what has failed. Make sure to have good and easy view logs for all errors, warnings and other critical data your application creates. We already have many technologies in this space and we just need to employ them in an effective manner with the focus on safeguarding the client experience.
In the form of Statistical Process Control (SPC – defined below) two relatively simple methods to improve TTD:
- Business KPI Monitors (Monitor Real User Behavior): Passively monitor critical user transactions such as logins, queries, reports, etc. Use math to determine abnormal behavior. This is the first line of defense.
- Synthetic Transactions (Simulate User Behavior): Synthetic transactions are scripted actions that attempt to mimic real customer behavior. Examples might be sign-ons, add to cart, etc. They provide a more meaningful view of your customers’ experiences vs. just looking at page load times, error rates, and similar. Do this with Keynote or a similar product and expand it to an internal systems scope. Alerts from a passive monitor can be confirmed or denied and escalated as appropriate. This is the second line of defense.
Monitor the Bad – potential, & actual bad things (alert before they happen), and tune and continuously improve (Iterate!)
If you can’t identify all problem areas, identify as many as possible. The best monitoring starts before there’s a problem and extends beyond the crisis.
Because once the crisis hits, that’s when things get ugly! That’s when things start falling apart and people point fingers.
At times, failures do not disable the whole site, but instead cause brown-outs, where part of a site’s functionality is disabled or only some users are unable to access the site. Many of these failures are application-level failures that change the user-visible functionality of a service but do not cause obvious lower-level failures detectable by service operators. Effective monitoring will detect these faults as well.
The more proactive you can be about identifying the issues, the easier it will be to resolve and prevent them.
In fault detection, the aim is to determine whether an abnormal event happened or when an application being monitored is out of control. The early detection of a fault condition is important in avoiding quality issues or system breakdown, and this can be achieved through the proper design of effective statistical process control with upper & lower limits identified. If the values of the monitoring statistics exceed the control limits of the corresponding statistics, a fault is detected. Once a fault condition has been positively detected, the next step is to determine the root cause of the out-of-control status.
One downside of the SPC method is that significant changes in amplitude (natural increases in your business metrics) can cause problems. An alternative to SPC is First and Second Derivative testing. These tests tell if the actual and expected curve forms are the same.
Here’s a real-world example of where business metrics help us determine changes in normal usage at eBay.
We had near real-time graphs of user metrics such as bids, listings, logins, and new user registrations. The data was graphed week over week. Usage patterns throughout a day followed a readily identifiable pattern with peaks and valleys. These graphs were displayed in the Network Operations Center, which was staffed 24x7. Deviations from the previous week’s pattern had proven useful, identifying issues such as ISP instability in the EU impacting customers trying to access eBay.
Everything seemed normal on a Wednesday evening – right up to the point that bids and listings both took a nosedive. The NOC quickly initiated the SEV1 process and technical resources checked their areas. The site had no identifiable faults, services were confirmed to be working fine, yet the user activity was still markedly lower. Roughly 20 minutes into the SEV1 process, the root cause was identified. The finale episode of American Idol was being broadcast. Our site was fine – but our customers had other things on their mind. The business metric monitors worked – they gave warning of an aberrant usage pattern.
How would your company react to this critical change in normal usage patterns? Use business metric monitors to detect workload shifts.
‹ First < 6 7 8 9 10 > Last ›