August 21, 2019 | Posted By: Bill Armelin
At AKF Partners, we believe in learning aggressively, not just from your successes, but also from your failures. One common failure we see is the service-disrupting incident. These are the events that either make your systems unavailable or significantly degrade performance for your customers. They result in lost revenue, poor customer satisfaction and hours of lost sleep. While there are many things we can do to reduce the probability of an incident occurring or the impact if it does happen, we know that all systems fail.
We like to say, “An incident is a terrible thing to waste.” The damage is already done. Now, we need to learn as much as possible about the causes of the incident to prevent the same failures from happening again. A common process for determining the causes of failure and preventing them from recurring is the postmortem. In the Army, it is called an After-Action Review. In many companies it is called a Root Cause Analysis. It doesn’t matter what you call it, as long as you do it.
We actually avoid using the term Root Cause Analysis. Many of our clients that use the term focus too much on finding that one “root cause” of the issue. There will never be a single cause of an incident. There will always be a chain of problems with a trigger or proximate event. This is the one event that causes the system to finally topple over. We need a process that digs into the entire chain of events inclusive of the trigger. This is where the postmortem comes in. It is a cross-functional brainstorming meeting that not only identifies the root causes of a problem, but also helps identify issues with process and training.
Postmortem Process – TIA
The purpose of a good postmortem is to find all of the contributing events and problems that caused an incident. We use a simple three-step process called TIA. TIA stands for Timeline, Issues, and Actions.
First, we create a timeline of events leading up to the issue, as well as the timeline of all the actions taken to restore service. There are multiple ways to collect the timeline of events. Some companies have a scribe that records events during the incident process. Increasingly, we are seeing companies use chat tools like Slack to record events related to restoration. The timestamp on each Slack message is a good place to extract the timeline. Don’t start your timeline at the beginning of the incident. It starts with the activities prior to the incident that caused the triggering event (e.g. a code deployment). During the postmortem meeting, augment the timeline with additional details.
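As a minimal sketch of assembling such a timeline, the Python below merges two hypothetical event sources (chat messages and deployment records) into one chronological list. The event contents, the tuple layout, and the source names are assumptions made for illustration; this is not a Slack API or any particular tool's export format.

```python
from datetime import datetime

# Hypothetical events: (ISO timestamp, source, description). All values are made up.
chat_events = [
    ("2019-08-01T14:07:00", "chat", "On-call paged: checkout errors rising"),
    ("2019-08-01T14:31:00", "chat", "Rollback started"),
    ("2019-08-01T14:40:00", "chat", "Service restored"),
]
deploy_events = [
    ("2019-08-01T13:55:00", "deploy", "Code deployment to production"),
]

def build_timeline(*sources):
    """Merge event lists from several sources and sort them chronologically."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: datetime.fromisoformat(event[0]))

timeline = build_timeline(chat_events, deploy_events)
for timestamp, source, description in timeline:
    print(f"{timestamp}  [{source}]  {description}")
```

Merging in the deployment record places the pre-incident activity at the start of the timeline, ahead of the first page.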
The second part of TIA is Issues. This is where we walk through the timeline and identify issues. We want to focus on people, process, and technology. We want to capture all of the things that either allowed the incident to happen (e.g. lack of monitoring), directly triggered it (e.g. a code push), or increased the time to restore the system to a stable state (e.g. couldn’t get the right people on the call). List each issue separately. At this point, there is no discussion about fixing issues; we only focus on the timeline and identifying issues. There is also no reference to ownership. We also don’t want to assign blame. We want a process that provides constructive feedback to solve problems.
Avoid the tendency to find a single triggering event and stop. Make sure you continue to dig into the issues to determine why things happened the way they did. We like to use the “5-whys” methodology to explore root causes. This entails repeatedly asking questions about why something happened. The answer to one question becomes the basis for the next. We continue to ask why until we have identified the true causes of the problems.
Here is a summary of anti-patterns we see when companies conduct postmortems:
| Anti-pattern | Recommended practice |
|---|---|
| Not conducting a postmortem after a serious (e.g. Sev 1) incident | Conduct a postmortem within a week after a serious incident |
| | Avoid blame and keep it constructive |
| Not having the right people involved | Assemble a cross-functional team of people involved or needed to resolve problems |
| Using a postmortem block (e.g. multiple postmortems during a 1-hour session every two weeks) | Dedicate time for a postmortem based on the severity of the incident |
| Lack of ownership of identified tasks | Make one person accountable to complete a task within an appropriate timeframe |
| Not digging far enough into issues (finding a single root cause) | Use the 5-whys methodology to identify all of the causes for an issue |
Incidents will always happen. What you do after service restoration will determine if the problem occurs again. A structured, timely postmortem process will help identify the issues causing outages and help prevent their reoccurrence in the future. It also fosters a culture of learning from your mistakes without blame.
Are you struggling with the same issues impacting your site? Do you know you should be conducting postmortems but don’t know how to get started? AKF can help you establish critical incident management and postmortem processes. Call us – we can help!
August 20, 2019 | Posted By: Dave Berardi
If your company doesn’t utilize one of the big cloud providers for either IaaS or PaaS as part of product infrastructure, it’s only a matter of time. We often find our clients in situations where they are pressured to move quickly for benefit-realization to improve many aspects of their business.
Drivers of this trend that exist across our client base and the industry include:
- The Need For Speed and Time To Market: The need to scale capacity quickly without waiting weeks or months for hardware procurement and provisioning in your own datacenter or colo.
- Traditional On-Prem Software Dying by 1000 Cuts: Demand-side (buyer) forces are encouraging companies to get services and software out of data centers. Cloud-native SaaS competition is pressuring what’s left of the on-prem software providers.
- Legacy Company Talent Challenges: The inability of the old economy companies to hire engineering talent to support on-prem software in house.
Several different approaches can be used for migration. We’ve seen many of them and there are two on opposite ends of the spectrum – Lift and Shift and Cloud-Native – that we want to unpack.
The Lift and Shift Approach:
What is it?
Put simply, this is when the same architecture, resources, and services from an on-prem or colo data center are moved up into a cloud provider. Often VMs from on-prem hosting centers are converted and dropped into reserved virtual compute instances. Tools such as AWS Connector for vCenter or GCP’s Velostrata, in theory, allow for an easy transition.
Pros:
- Fastest path to cloud
- Same architecture and tech stack minimizes training need – infrastructure management does require knowledge of the console
- Least costly in terms of planning, architecture changes, refactoring
Cons:
- Monolithic nature of the architecture can prove to be costly through BYOL and compute requirements
- Minimal use of native elasticity and resources creates cost-inefficient use of compute, memory, and storage and may not perform as needed
- Technical debt migrates with the product and cost could be magnified with additional problems and a shift to the pay-for-use model
While Lift and Shift seems to be the easiest path, you need to be aware of the strong potential for an increase in cost in the cloud. Running VMs in your own DC and colo masks the cost inefficiencies since they are all part of Capex for your compute, storage, and network. When you move to public cloud the provider will promise to be cheaper. But in the cloud you will pay for every reserved CPU that isn’t utilized, storage that isn’t used, and other idle resources. Further, your availability can only be as good as the provider’s uptime for a given Region and/or Availability Zone.
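A back-of-the-envelope sketch of why idle reserved capacity matters under pay-for-use pricing. The hourly price and the utilization figures are hypothetical and are not real provider rates; the point is only that the effective price of useful work rises as utilization falls.

```python
def cost_per_utilized_cpu_hour(price_per_cpu_hour, utilization):
    """Effective price of each CPU-hour that does real work.

    utilization is the fraction of reserved CPU actually used (0 < utilization <= 1).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_cpu_hour / utilization

# Hypothetical price: $0.05 per reserved CPU-hour.
for utilization in (1.0, 0.5, 0.2):
    effective = cost_per_utilized_cpu_hour(0.05, utilization)
    print(f"{utilization:.0%} utilized -> ${effective:.2f} per useful CPU-hour")
```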
Cloud Native Approach:
What is it?
A Cloud-Native approach ultimately lets you use, and pay for, a provider’s cloud services only while product users are generating requests and demand. This approach almost always requires investment in splitting the monolith and moving to a services-separated architecture. In addition, it could require you to use native services in your provider of choice. Doing so lets you move from paying for provisioned infrastructure to consumption-based services with better cost-efficiency.
Pros:
- Less time needed to manage infrastructure and more time for features and experimentation
- Easier to scale out using native services
- Most cost-efficient
Cons:
- Slowest path to cloud
- More discovery and training – this approach requires your teams to understand the current tech stack in order to recreate it in the cloud. From a cloud perspective they must understand how the provider of choice works so that decisions can be made on native services.
- Increased risk of vendor lock-in (e.g. building out event-driven services with rules inside of native serverless)
The Cloud Native path is a longer one, but provides several benefits that will yield more value over time. With this approach you must spend time determining how to split up your monolith and how to best leverage the right combination of Availability Zones, Regions, and native services depending on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO). We prefer to solve scalability and availability problems with systems and software architecture to avoid vendor lock-in. All of the trade-offs on such a journey must be understood.
We have helped several companies of various sizes move to the cloud through SaaS transformations and have engaged in reviewing proposed architectures. Contact us to see how we can help.
August 7, 2019 | Posted By: Pete Ferguson
Scalability doesn’t somehow magically appear when you trust a cloud provider to host your systems. While Amazon, Google, Microsoft, and others likely will be able to provide a lot more redundancy in power, network, cooling, and expertise in infrastructure than hosting yourself – how you are set up using their tools is still very much up to your budget and which tools you choose to utilize. Additionally, how well your code is written to take advantage of additional resources will affect scalability and availability.
We see more and more new startups in AWS, Google, and Azure – in addition to assisting well-established companies in making the transition to the cloud. Regardless of the hosting platform, in our technical due diligence reviews, we often see the same scalability gaps common to hosted solutions written about in our first edition of “Scalability Rules.” (Abbott, Martin L. Scalability Rules: Principles for Scaling Web Sites. Pearson Education.)
This blog is a summary recap of the AKF Scale Cube (much of the content contains direct quotes from the original text), an explanation of each axis, and how you can be better prepared to scale within the cloud.
Scalability Rules – Chapter 2: Distribute Your Work
Using ServiceNow as an example of designing, implementing, and deploying for scale early in its life, we outlined how building in fault tolerance helped it scale in early development. More than a decade later, the once little-known company has kept up with fast growth, with over $2B in revenue and some forecasts expecting that number to climb to $15B in the coming years.
So how did they do it? ServiceNow contracted with AKF Partners over a number of engagements to help them think through their future architectural needs and ultimately hired one of the founding partners to augment their already-talented engineering staff.
“The AKF Scale Cube was helpful in offsetting both the increasing size of our customers and the increased demands of rapid functionality extensions and value creation.”
~ Tom Keevan (Founding Partner, AKF Partners & former VP of Architecture at eBay & ServiceNow)
The original scale cube has stood the test of time and we have used the same three-dimensional model with security, people development, and many other crucial organizational areas needing to rapidly expand with high availability.
At the heart of the AKF Scale Cube are three simple axes, each with an associated rule for scalability. The cube is a great way to represent the path from minimal scale (lower left front of the cube) to near-infinite scalability (upper right back corner of the cube). Sometimes, it’s easier to see these three axes without the confined space of the cube.
X Axis – Horizontal Duplication
The X Axis allows transaction volumes to increase easily and quickly. If data is starting to become unwieldy on databases, distributed architecture allows for reducing the degree of multi-tenancy (Z Axis) or splitting discrete services off (Y Axis) onto similarly sized hardware.
A simple example of X Axis splits is cloning web servers and application servers and placing them behind a load balancer. This cloning allows the distribution of transactions across systems evenly for horizontal scale. Cloning of application or web services tends to be relatively easy to perform and allows us to scale the number of transactions processed. Unfortunately, it doesn’t really help us when trying to scale the data we must manipulate to perform these transactions. Memory caching of data unique to several customers or unique to disparate functions might create a bottleneck that keeps us from scaling these services without significant impact on customer response time. To solve these memory constraints we’ll look to the Y and Z Axes of our scale cube.
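To make the cloning idea concrete, here is a minimal sketch of round-robin distribution across identical servers. The server names and the `RoundRobinBalancer` class are hypothetical; a production load balancer would also handle health checks, weighting, and connection draining.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Spreads requests evenly across identical (cloned) servers."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def route(self, request):
        """Return the (server, request) pair chosen for this request."""
        return next(self._servers), request

# Hypothetical clone names.
balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
assignments = [balancer.route(f"req-{i}")[0] for i in range(6)]
print(assignments)
```

Because every clone is interchangeable, adding capacity is just a matter of adding another name to the list.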
Y Axis – Split by Function, Service, or Resource
Looking at a relatively simple e-commerce site, Y Axis splits resources by the verbs of signup, login, search, browse, view, add to cart, and purchase/buy. The data necessary to perform any one of these transactions can vary significantly from the data necessary for the other transactions.
In terms of security, using the Y Axis to segregate and encrypt Personally Identifiable Information (PII) to a separate database provides the required security without requiring all other services to go through a firewall and encryption. This decreases cost, puts less load on your firewall, and ensures greater availability and uptime.
Y Axis splits also apply to a noun approach. Within a simple e-commerce site data can be split by product catalog, product inventory, user account information, marketing information, and so on.
While Y Axis splits are most useful in scaling data sets, they are also useful in scaling code bases. Because services or resources are now split, the actions performed and the code necessary to perform them are split up as well. This works very well for small Agile development teams, as each team can become expert in a subset of the larger system and doesn’t need to worry about or become expert on every other part of the system.
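A Y Axis split by verb can be sketched as a simple routing table from function to the service that owns it. The service names below are hypothetical, and the grouping of verbs into services is an assumption for illustration only.

```python
# Hypothetical mapping from verb (function) to the service pool that owns it.
SERVICE_POOLS = {
    "signup": "account-service",
    "login": "account-service",
    "search": "search-service",
    "browse": "catalog-service",
    "view": "catalog-service",
    "add_to_cart": "cart-service",
    "purchase": "checkout-service",
}

def route_by_function(verb):
    """Pick the service pool for a request based on what the request does."""
    try:
        return SERVICE_POOLS[verb]
    except KeyError:
        raise ValueError(f"no service owns verb {verb!r}") from None

print(route_by_function("search"))
print(route_by_function("purchase"))
```

Each pool can then hold only the data and code its verbs need, and can be owned by a single small team.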
Z Axis – Separate Similar Things
Z Axis splits are effective at helping you to scale customer bases but can also be applied to other very large data sets that can’t be pulled apart using the Y Axis methodology. Z Axis separation is useful for containerizing customers or for geographical replication of data. If Y Axis splits are the layers in a cake with each verb or noun having its own separate layer, a Z Axis split is having a separate cake (sharding) for each customer, geography, or other subset of data.
This means that each larger customer or geography could have its own dedicated Web, application, and database servers. Given that we also want to leverage the cost efficiencies enabled by multitenancy, we also want to have multiple small customers exist within a single shard which can later be isolated when one of the customers grows to a predetermined size that makes financial or contractual sense.
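The mix of dedicated and shared shards described above could look roughly like the following. The shard names, the customer IDs, and the choice of CRC32 modulo hashing are assumptions for illustration, not a description of any specific product.

```python
import zlib

# Hypothetical: large customers pinned to their own shard.
DEDICATED_SHARDS = {"big-customer-a": "shard-dedicated-a"}
# Hypothetical: shared shards hosting many small customers.
SHARED_SHARDS = ["shard-shared-0", "shard-shared-1", "shard-shared-2"]

def shard_for_customer(customer_id):
    """Return the shard that holds this customer's data."""
    if customer_id in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[customer_id]
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    index = zlib.crc32(customer_id.encode("utf-8")) % len(SHARED_SHARDS)
    return SHARED_SHARDS[index]

print(shard_for_customer("big-customer-a"))
print(shard_for_customer("small-customer-17"))
```

In this sketch, isolating a customer that has outgrown its shared shard means migrating its data and adding one entry to `DEDICATED_SHARDS`.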
For hyper-growth companies the speed with which any request can be answered is at least partially determined by the cache hit ratio of near and distant caches. This speed in turn indicates how many transactions any given system can process, which in turn determines how many systems are needed to process a number of requests.
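The chain from cache hit ratio to system count can be illustrated with a deliberately simplified model. All latencies and request rates below are hypothetical, and the model assumes each system handles a fixed number of requests at a time with no queuing effects.

```python
import math

def average_response_seconds(hit_ratio, cache_seconds, origin_seconds):
    """Expected response time given the fraction of requests served from cache."""
    return hit_ratio * cache_seconds + (1 - hit_ratio) * origin_seconds

def systems_needed(requests_per_second, avg_seconds, concurrent_per_system=1):
    """Systems required if each handles concurrent_per_system requests at once."""
    per_system_throughput = concurrent_per_system / avg_seconds
    return math.ceil(requests_per_second / per_system_throughput)

# Hypothetical: 5 ms from cache, 100 ms from the data store, 1000 requests/second.
for hit_ratio in (0.9, 0.5):
    avg = average_response_seconds(hit_ratio, 0.005, 0.100)
    print(f"hit ratio {hit_ratio:.0%}: {avg * 1000:.1f} ms average, "
          f"{systems_needed(1000, avg)} systems")
```

Under these assumptions, a lower hit ratio raises the average response time and, with it, the number of systems needed for the same request volume.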
Splitting up data by geography or customer allows each segment higher availability, scalability, and reliability as problems within one subset will not affect other subsets. In continuous deployment environments, it also allows fragmented code rollout and testing of new features a little at a time instead of an all-or-nothing approach.
This is a quick and dirty breakdown of Scalability Rules that have been applied at thousands of successful companies and provided near infinite scalability when properly implemented. We love helping companies of all shapes and sizes (we have experience with development teams of 2-3 engineers to thousands). Contact us to explore how we can help guide your company to scale your organization, processes, and technology for hyper growth!
July 31, 2019 | Posted By: Marty Abbott
The Easiest Job
Hands down, the easiest job in any company that produces software for on-premise delivery is the sales job.
Everything is magically aligned to make this job “easy” on a relative basis.
- The more a producing company promises, the more the purchasing company wants.
- The more the customer wants, the larger the contract becomes. If the software doesn’t do it today, the producing company adds in professional services fees to customize the software.
- The more you promise, the higher the probability of closing a deal.
In many ways, being an on-premise software salesperson is very much like being a new car dealer. The dealer (or software producer) has several new cars on the lot that the customer can just drive away in today. If the customer wants something special, the dealer can very likely configure it and order it from the factory – the customer just needs to wait awhile. These options cost a bit more and take a bit longer to produce.
Leather seats? Sure – we have that. Different color? We have 25 colors from which to choose. Finally, if the customer’s desire is far afield from what the factory can fulfil (special paint effects, spoilers, ground effect lighting, etc.), the dealership either has an auto shop that can fulfil the request, or a relationship with an auto shop somewhere in town from which the dealer receives a referral fee.
Sure, your software product still must compete well with the other providers in your space just as Chevy’s products must compete well with Ford, Dodge, Toyota, etc. But assuming you have a viable product, meeting a customer’s needs is not very difficult (for the salesperson) and the more the customer wants the more the company (and the salesperson) makes!
That’s not to say that just anyone can be a salesperson or that the sales job is “easy” (remember – we used relative terms like easiest and easier). We know that not all of us get excited about hitting the road every day, living on planes and in hotels, and putting a smile on in front of people we barely know.
We’re just saying that on a comparative basis, it’s easier than most other jobs in the company (engineering, product management, finance, etc.) and a whole lot easier than the alternative job in software sales.
The Hardest Job
[Image: Kurt Russell, Used Cars (1980)]
Contrast the on-premise software sales job with what is very likely the hardest job in the software industry – that of the Software as a Service (SaaS) salesperson. The root of this difference in difficulty lies within the principles necessary for SaaS to be successful – specifically building to market need instead of customer want.
Whereas on-premise sales are bolstered by adding the entirety of a customer’s wants into a contract (thereby increasing the value of each contract), such a process creates bloated and unmaintainable SaaS solutions. To be successful in a SaaS world, we need configuration over customization and homogeneous environments.
To that end, successful SaaS salespeople have a job that’s very similar to that of a used car salesperson. I know, the metaphor conjures up cheap, rumpled suits and people who reek of desperation. But consider the job for a minute – it is quite difficult. The used car salesperson needs to converge customer wants into something that he or she has on the lot or there isn’t going to be a sale.
There is no factory from which to order specific configurations. The dealership isn’t likely to have an after-market shop for bespoke requests. The salesperson must be a bit of a magician in somehow force-fitting a laundry list of customer desires into some vehicle that’s sitting on the lot.
I’ll say that again – the used car salesperson can only sell what’s on the lot. This is similarly true with SaaS salespeople – they need to find a way to convince a customer that their wants are served by a product’s existing capabilities.
Mature SaaS products evolve into having a great deal of customer configurable items that allow for incredible extensibility. But those configurations don’t typically exist in early stage companies. Even more mature SaaS solutions implement APIs that allow for off-platform extensions of product capabilities.
Does your customer want a unique workflow engine? A mature solution allows for the current workflow to be turned off and for another workflow to be plugged into the system using APIs.
Does your customer want to use a different order fulfillment solution? A mature solution allows for the warehouse management component to be disabled and replaced via asynchronous stubs and APIs to another provider’s fulfillment services. Once API extensions are available, professional services teams can again begin to create revenue streams from customer wants.
The key point remains that sales teams in SaaS solutions cannot go “off script”. They must only sell what’s “on the truck” or “on the lot”.
Allowing SaaS sales teams to behave in the same fashion as on-premise sales teams will cause:
- High levels of customer dissatisfaction when products can’t be delivered
- High churn and slow time to market in the engineering organization
- Incredible “whiplash” in product management (rapid change of priorities)
- Soaring software maintenance costs for the company as revisions “in the wild” increase relative to other SaaS firms.
- High costs of goods sold as infrastructure costs rise to meet the needs of product skew and code complexity
All of these will combine to make the company, over time, underperform relative to competitors. Ultimately the company’s SaaS solution will become bloated and noncompetitive, and it will fail.
This change in selling behavior is so significant that we often find on-premise sales organizations incapable of making the transition to the SaaS mentality and necessary way of selling. Many companies with which we work go through a cycle of trying to retain their original sales force, seeing too late that the organization’s behaviors are inconsistent with need as product time to market slows and margins decline, and then replacing large portions of the sales team late in the game.
AKF Partners helps companies transition to a SaaS model and mindset. Give us a call - we can help!
July 29, 2019 | Posted By: Marty Abbott
Asynchronous messaging systems are a critical component of many highly scalable and highly available architectures. But, as with any other architectural component, these solutions also need attention to ensure availability and scalability. The solution should scale along one of the scale cube axes, either X, Y or Z. The solution should also both include and enable the principle of fault isolation. Finally, it should scale both gracefully and cost effectively while enabling high levels of organizational scale. These requirements bring us to the principle of Smart End Points and Dumb Pipes.
Fast time to market within software development teams is best enabled when we align architectures and organizations such that coordination between teams is reduced (see Conway’s Law and our white paper on durable cross functional product teams). When services within an architecture communicate, especially in the case of one service “publishing” information for the consumption of multiple services, the communication often needs to be modified or “transformed” for the benefit of the consumers. This transformation can happen at the producer, the transport mechanism or the consumer. Transformation by the producer for the sake of the consumer makes little sense: the producer service and its associated team have little knowledge of consumer needs, and it creates an unnecessary coordination task between producer and consumer. Transformation “in flight” by the messaging service similarly implies a team of engineers who must be knowledgeable about all producers and consumers, and it adds another unnecessary coordination activity. Transformation by the consumer makes the most sense, as the consumer has the most knowledge of what it needs from the message, and it eliminates reliance upon and coordination with other teams. The principle of smart end points and dumb pipes therefore creates the lowest coordination between teams, the highest level of organizational scale and the best time to market option.
To be successful in achieving a dumb pipe, we introduce the notion of a pipe contract. Such a contract explains the format of messages produced on and consumed from the pipe. It may indicate that the message will be in a tag-delimited format (XML, YAML, etc.), abide by certain start and end delimiters, and for the sake of extensibility allow for custom tags for new information or attributes. The contract may also require that consumption not be predicated on strict order of elements (e.g. title is always first) but rather on strict adherence to tag and value regardless of where each tag is in the message.
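A consumer honoring such a contract might read fields by tag rather than by position and ignore tags it does not recognize. The sketch below assumes an XML message with hypothetical `title` and `price` fields; the field names and the `read_message` helper are illustrative, not part of any defined contract.

```python
import xml.etree.ElementTree as ET

REQUIRED_TAGS = {"title", "price"}  # hypothetical contract fields

def read_message(xml_text):
    """Extract contract fields by tag name, ignoring order and unknown tags."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    missing = REQUIRED_TAGS - set(fields)
    if missing:
        raise ValueError(f"message violates contract, missing: {sorted(missing)}")
    return {tag: fields[tag] for tag in REQUIRED_TAGS}

# Same content, different element order, plus a custom tag the consumer ignores.
a = read_message("<message><title>Widget</title><price>9.99</price></message>")
b = read_message(
    "<message><price>9.99</price><color>red</color><title>Widget</title></message>"
)
print(a == b)
```

Because the consumer keys on tag names, producers can add new tags without breaking existing consumers.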
By ensuring that the pipe remains dumb, the pipe can now scale both more predictably and cost effectively. As no transformation compute happens within the pipe, its sole purpose becomes the delivery of the message conforming to the contract. Large messages do not go through computationally complex transformation, meaning low compute requirements and therefore low cost. The lack of computation also means no odd “spikes” as transforms start to stall delivery and eat up valuable resources. Messages are delivered faster (lower latency). An additional unintended benefit is that because transforms aren’t part of message transit, a type of failure (computational/logical) does not hinder message service availability.
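The division of labor can be sketched with a toy in-process pipe that delivers every message unchanged, leaving all transformation to the consumers. The `DumbPipe` class, the consumer functions, and the message fields are hypothetical, and a real messaging platform would deliver asynchronously and durably rather than through direct function calls.

```python
class DumbPipe:
    """Delivers each message unchanged to every subscriber; no transformation."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        for handler in self._subscribers:
            handler(message)  # every consumer receives the same payload

# Smart end points: each consumer transforms the message for its own needs.
billing_totals = []
search_titles = []

def billing_consumer(message):
    billing_totals.append(round(message["price"] * message["quantity"], 2))

def search_consumer(message):
    search_titles.append(message["title"].lower())

pipe = DumbPipe()
pipe.subscribe(billing_consumer)
pipe.subscribe(search_consumer)
pipe.publish({"title": "Blue Widget", "price": 9.99, "quantity": 3})
print(billing_totals, search_titles)
```

A bug in one consumer's transform stays in that consumer's code and team; the pipe itself contains no transform logic that could fail.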
The 2x2 matrix below summarizes the options here, clearly indicating smart end points and dumb pipes as the best choice.
One important callout here is that “streams processing”, which is off-message platform evaluation of message content, is not a violation of the smart end points, dumb pipes concept. The solutions performing streams processing are simply consumers and producers of messages, subscribing to the contract and transport of the pipe.
Summarizing all of the above, the benefits of smart end points and dumb pipes are:
- Lower cost of messaging infrastructure - pushes the cost of goods sold closer to the producer and consumer. Allows messaging infrastructure to scale by number of messages instead of computational complexity of messages. License cost is reduced as fewer compute nodes are needed for message transit.
- Organization Scalability – teams aren’t reliant on transforms created by a centralized team.
- Low Latency – because computation is limited, messages are delivered more quickly and predictably to end consumers.
- Capacity and scalability of messaging infrastructure – increased significantly as compute is not part of the scale of the platform.
- Availability of messaging infrastructure – because compute is removed, so is a type of failure. As such, availability increases.
Two critical requirements for achieving smart end points and dumb pipes:
- Message contracts – all messages need to be of defined form. Producers must adhere to that form as must consumers.
- Team behaviors – must assure adherence to contracts.
AKF Partners helps companies build scalable, highly available, cost effective, low-latency, fast time to market products. Call us – we can help!
July 29, 2019 | Posted By: Bill Armelin
On February 7, 2019, Wells Fargo experienced a major service interruption to its customer facing applications. The bank blamed a power shutdown at one of its data centers in response to smoke detected in the facility. Customers continued to experience the effects for several days. How could this happen? Aren’t banks required to maintain multiple data centers (DC) to fail over when something like this happens? While we do not know the specifics of Wells Fargo’s situation, AKF has worked at several banks before, and we know the answer is yes. This event highlights an area that we have seen time and time again. Disaster Recovery (DR) usually does not work.
Don’t government regulations require some form of business continuity? If the company loses a data center, shouldn’t the applications run out of a different data center? The answer to the former is yes, and the answer to the latter should be yes. These companies spend millions of dollars setting up redundant systems in secondary data centers. So, what happens? Why don’t these systems work?
The problem is these companies rarely practice for these DR events. Sure, they will tell you that they test DR yearly. But many times, this is simply to check a box on their yearly audit. They will conduct limited tests to bring up these applications in the other data center, and then immediately cut back to the original. Many times, supporting systems such as AuthN, AuthZ and DNS are not tested at the same time. Calls from the tested system go back to the original DC. The capacity of the DR system cannot handle production traffic. They can’t reconcile transactions in the DR instance of ERP with the primary. The list goes on.
What these companies don’t do is prepare for the real situation. There is an old adage in the military that says you must “train like you fight.” This means that your training should be as realistic as possible for the day that you will actually need to fight. From a DR perspective, this means that you need to exercise your DR systems as if they were production systems. You must simulate an actual failure that invokes DR. This means that you should be able to fail over to your secondary DC and run indefinitely. Not only should you be able to run out of your secondary datacenter, you should regularly do it to exercise the systems and identify issues.
Imagine cutting over to a backup data center when doing a deployment. You run out of the backup DC while new code is being deployed to the primary DC. When the deployment is complete, you cut back to the primary. Once the new deployment is deemed stable, you can update the secondary DC. This allows you to deploy without downtime and you exercise your backup systems. You do not impact your customers during the deployment process and you know that your DR systems actually work.
How do companies typically set up their DR? Many times, we see companies use an Active/Passive (Hot/Cold) setup. This is where the primary systems run out of one DC and a second (usually smaller) DC houses a similar setup. Systems synchronize data to backup data stores. The idea is that during a major incident, they start up the systems in the secondary DC and redirect traffic to it. There are several downsides to this configuration. First, it requires running an additional set of servers, databases, storage and networking. This brings the cost to 200% of what it takes to run production traffic. Second, it is slow to get started. For cost reasons, companies keep the majority of systems shut down and start them when needed. It takes time to get the systems warmed up to take traffic. During major incidents, teams avoid failing over to the secondary DC, trying to fix the issues in the primary DC. This extends the outage time. When they do fail over, they find that systems that haven’t run in a long time don’t work properly or are undersized for production traffic.
Companies running this configuration complain that DR is expensive. “We can’t afford to have 100% of production resources sitting idle.” Companies that choose Active/Passive DR typically have not had a complete and total DC failure, yet.
So, companies don’t want to have an additional 100% set of untested resources sitting idle. What can they do? The next configuration to consider is running Active/Active. This means that you run your production systems out of two datacenters, sending a portion of production traffic (usually 50%) to each. Each DC synchronizes its data with the other. If there is a failure of one DC, divert all of the traffic to the other. Fail over usually happens quickly since both DCs are already receiving production traffic.
This doesn’t fix the cost issue of having an additional 100% of resources in a second DC. It does fix the issue of the systems not working in the other DC. Systems don’t sit idle and are exercised regularly.
While this sounds great, it is still expensive. Is there another way to reduce the total cost of DR? The answer is yes. Instead of having two DCs taking production traffic, what if we use three? At first glance, it sounds counterintuitive. Wouldn’t this take 300% of resources? Luckily, by splitting traffic across three (or more) datacenters, we no longer need 100% of the resources in each.
In a three-way active configuration, we only need 50% of the capacity in each DC. From a data perspective, each DC houses its own data and 50% of each of the others’ data (see table below). This configuration can handle a single DC failure with minimal impact to production traffic. However, because each DC needs less capacity, the total cost of three active is approximately 166% (vs. 200% for two). An added benefit is that you can pin your customers to the closest DC, resulting in lower latency.
Distribution of Data in a Multi-site Active Configuration
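The capacity arithmetic behind these percentages can be checked with a short sketch. It assumes that each site must be sized so the remaining sites can absorb the full load after one site fails, and that each site stores its own share of the data plus an equal slice of every other site's share. The function names are illustrative, not part of any AKF tool. Under these assumptions compute comes out at 150% for three sites while stored data stays at 200%, so the article's figure of roughly 166% presumably reflects a blend of components that shrink and components that do not.

```python
from fractions import Fraction


def per_site_capacity(sites: int) -> Fraction:
    """Capacity each site needs, as a fraction of total production load,
    so the surviving sites can absorb one complete site failure."""
    if sites < 2:
        raise ValueError("need at least two sites to survive a site failure")
    return Fraction(1, sites - 1)


def total_compute(sites: int) -> Fraction:
    """Total provisioned compute relative to a single production footprint."""
    return sites * per_site_capacity(sites)


def total_data(sites: int) -> Fraction:
    """Total stored data when each site keeps its own share plus
    1/(sites-1) of every other site's share (an assumed generalization
    of the three-site layout described above)."""
    own = Fraction(1, sites)
    from_others = (sites - 1) * own * Fraction(1, sites - 1)
    return sites * (own + from_others)


for n in (2, 3, 4):
    print(n, f"compute={float(total_compute(n)):.0%}", f"data={float(total_data(n)):.0%}")
```

For three sites, `per_site_capacity(3)` is one half, matching the 50% per DC quoted above.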
Companies that rely on Active/Passive DR typically have not experienced a full datacenter outage that has caused them to run from their backup systems in production. Tests of these systems allow them to pass audits, but that is usually it. Tests do not mimic actual failure conditions. Systems tend to be undersized and may not work. An Active-Active configuration will help but does not decrease costs. Adopting a Multi-Site Active DR configuration will result in improved availability and lower costs over an Active/Passive or Active/Active setup.
Do you need help defining your DR strategy or architecture? AKF Partners can conduct a technology assessment to get you started.
July 26, 2019 | Posted By: James Fritz
Running a technology company is a challenging endeavor. Not only are consumers’ demands changing daily, the technology to deliver upon those demands is constantly evolving. Where you host your infrastructure and software, what your developers code in, what version you are on, and how you are poised to deliver quality product is not the same as it was 20 years ago, probably not even 10 or 5 years ago. And these should all be good things. But underlying all those things is a common denominator: people. In Seed, Feed, Weed I outlined what companies need to do in order to maintain a stable of great employees. This article will delve down into the aspect of Seed a little more.
What is Seed?
At its core, seed is hiring the best people for the job. Unfortunately, it takes a little bit of work to get to that. If it was that easy, then this is where the article would end…
But it doesn’t.
Seed is not just your hiring managers dealing with a specific labor pool available to them. It needs to be more than that. It needs to be an ever evolving, ever responsive organism within your organization.
If your HR Recruiting office is still hiring people like it did in the 90’s, then don’t be surprised when you get talent on par with 90’s capability. No longer can you sit back and wait for the right candidate to come to you because chances are what you are hiring for is buried under a million other similar job postings in your area. Your desired future candidates are out, going to meet ups, conferences, and other networking events. To meet them, you too need to be in attendance.
If you are able to hire a future employee from a conference where other employers are present, that is a great indicator of where your company stands. If you can’t stand at least shoulder-to-shoulder with your competitors, then you will never be able to hire the best people.
There are many great advantages to the minimalization of the world through telecommunications. Now if a certain skillset is only available half-way around the world, today’s technology makes it much easier to overcome the distance challenge. This isn’t to say the debate over off-shore vs. near-shore or in-house has a clear winner, but there are many more options.
So where should you be looking? Do you want quality or quantity? If quality matters, start where competition in your sector is heaviest. If quantity matters, any place will do. But hopefully you want quality. Almost anyone can sit at a desk for 8 hours. Very few talented programmers can adapt your current architecture to meet the demands of a market in 6 months.
If your company is afraid to enter a competitive technology market geography because of fear it won’t be able to hire more employees than the competition, then that should be a red flag. Challenge breeds greatness.
The hiring process itself should be iterative and multi-faceted. Sure, it is nice to be able to tell a prospective candidate they will go through two 30-minute phone screens, followed by two 1-hour onsites, but maybe that job, or that candidate, needs something a little more, or a little less.
Don’t be afraid to deviate your approach based upon the role or the potential future employee. Just make sure they are aware of it and why you are changing from what they were told. This will give them a chance to shine more. Recently, I got to be a part of a hiring process that should’ve involved two 30-minute phone screens and one 2-hour onsite. That 2-hour onsite was deemed not long enough because the candidate and the future employer spent too much time discussing the minutiae of various implementations to an engineering plan. And that’s ok. They then asked the candidate to do a video conference where he stepped through the code base. But they let him know why they needed that follow on. It wasn’t to test him further. It was because he had simply “clicked” too well with the engineering aspect and time ran away from them.
Additionally, it shouldn’t just be technology members involved in hiring developers. Far too often a new employee has trouble meshing with the culture of the organization or team because they were asked purely technical questions or presented with technical scenarios. Have someone from your People Operations or Marketing involved as well. This will help flesh out the entirety of the candidate and provide them with more knowledge of the company.
Far too often companies are so focused on their hyper growth that getting “butts in seats” matters more than getting the right people. Nine times out of 10, one great employee is going to be better than three okay employees.
We’ve helped dozens of companies fill interim roles as we helped find great employees. If you need assistance on how to identify a great employee, and Seed your company appropriately, AKF can help.
July 22, 2019 | Posted By: Eric Arrington
It’s funny how clearly you can remember some events from your childhood. I remember exactly where I was on Jan 28th, 1986.
All the kids in Modoc Elementary School had been ushered into the Multi Purpose Room. It was an exciting day. We were all going to watch the Challenger Shuttle Launch. The school was especially excited about this launch. A civilian school teacher was going into space.
I was sitting right up front (probably so the teacher could keep an eye on me). I had on the paper helmet I had made the day before. I was ready to sign up for NASA. We all counted down and then cheered when the shuttle lifted off.
Seventy-three seconds in something happened.
There was an obvious malfunction. For once the kids were silent. Teachers didn’t know what to do. We all sat there watching. Watching as the Challenger exploded in mid air, taking the lives of all 7 crew members aboard.
How could this have happened? Some fluke accident after all that careful planning? This was NASA. They thought of everything right?
I recently picked up a book by Dr. Diane Vaughan called The Challenger Launch Decision. Vaughan isn’t an engineer, she is a sociologist. She doesn’t study Newtonian Mechanics. She studies social institutions, cultures, organizations, and interactions between people that work together.
She wasn’t interested in O-rings failing. She wanted to understand the environment that led to such a failure.
She realized that it’s easy for people to rationalize shortcuts under pressure. Let’s be honest, do any of us not work under a certain amount of pressure? The rationalization gets even easier when you take a shortcut and nothing bad happens. Lack of a “bad outcome” can actually justify the shortcut.
After studying the Challenger Launch and other failures, Vaughan came up with the theory for this type of breakdown in procedure. She called this theory the normalization of deviance. She defines it as:
The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization
In other words, the gradual breakdown in process where a violation of procedure becomes acceptable. One important key is, it happens even though everyone involved knows better.
Normalization of Deviance and What Happened at NASA
Prior to the launch, NASA became more and more focused on hitting the launch date (sound familiar?). Deviations from established procedures kept popping up. Instead of reevaluating and changing things, the deviations were accepted. Over time these deviations became the new normal.
Erosion between the O-rings had occurred before the date of the launch. It wasn’t a new occurrence. The issue was, erosion past the O-rings wasn’t supposed to happen at all. It was happening on every flight. The engineers scratched their heads and made changes but the erosion kept happening. They argued that yes, it was happening but it was stable so it could be ignored.
In other words, the O-rings didn’t completely fail so it was ok. A condition that was at one time deemed unacceptable was now considered to be acceptable. The deviance had become the new normal. This deviance led to the death of 7 people and scarred a bunch of my classmates for life (don’t worry I was ok).
Normalization of Deviance
Normalization of deviance doesn’t only happen at NASA. Their failures tend to garner more attention though. When you’re sitting on more than 500,000 gallons of liquid oxygen and liquid hydrogen the failures are spectacular.
Most of us don’t work in a job where a failure can cost someone their life. That doesn’t mean these principles don’t apply to us. Normalization of deviance happens in all industries.
There is a study of the normalization of deviance in healthcare delivery. The author, John Banja, identifies 7 factors that contribute to normalizing unacceptable behaviors. These 7 factors are extremely relevant to us in the software industry as well. Here are his seven factors and some takeaways for the software world.
1. The rules are stupid and inefficient!
I am sure you have never heard this at your company before. A good alternative would be, “management doesn’t understand what we are doing. Their rules slow us down.”
In this situation the person violating the rule understands the rule. He just doesn’t think management understands his job. The rule was handed down by someone in management who doesn’t know what it’s like to be “in the trenches.”
Guess what? Sometimes this is true. Sometimes the rules are stupid and inefficient and are created by someone that is out of touch. What is the solution? Don’t ignore the rule. Go find out why the rule is there.
2. Knowledge is imperfect and uneven.
In this case, the “offender” falls under 3 possible categories:
They are unaware that the rule exists.
They might know about the rule but fail to get why it applies to them.
They have been taught deviations by other co-workers.
This is especially a problem in a culture where people are afraid to ask for help. This problem gets compounded with every new hire. Have you ever asked why a certain thing was done at a new job and heard back, “I don’t know, that’s just how things are done here”?
Foster a culture where it is acceptable to ask questions. New hires and juniors should feel empowered to ask “why.”
3. The work itself, along with new technology, can disrupt work behaviors and rule compliance.
We all do complex work in a dynamic environment. It’s unpredictable. New technologies and new environments can lead us to come up with solutions that don’t perfectly fit established procedures. Engineers are forced to come up with answers that might not fit in the old documented standards.
4. I’m breaking the rule for the good of my patient!
We don’t have patients, but we can see this in our world as well. Substitute the word user for patient. Have you ever violated a procedure for the good of the user or ease of integration with a colleague?
What would be a better solution? If it’s a better way and you don’t see any negative to doing it that way, communicate it. It might be beneficial to everyone to not have that rule. Have a discussion with your team about what you are trying to do and why. Maybe the rule can be changed or maybe you aren’t seeing the whole picture.
5. The rules don’t apply to me/you can trust me.
“It’s a good rule for everyone else but I have been here for 10 years. I understand the system better than everyone. I know how and when to break the rules.”
We see this a lot as startups grow up. Employee #2 doesn’t need to follow the rules, right? She knows every line of code in the repo. Here is the problem: developers aren’t known for our humility. We all think we are that person. We all think we understand things so well that we know what we can get away with.
6. Workers are afraid to speak up.
The likelihood of deviant behavior increases in a culture that discourages people from speaking up. Fear of confrontation, fear of retaliation, “not my job”, and lack of confidence make it easier to ignore something even though it’s wrong.
Let’s be honest, as developers we aren’t always highly functioning human beings. We are great when our heads are down and we’re banging on a keyboard but when we are face to face with another human? That’s a different set of tools that most of us don’t have in our quiver.
This is especially difficult in a relationship between a junior and senior engineer. It’s hard for a junior engineer to point out flaws or call out procedure violations to a senior engineer.
7. Leadership withholding or diluting findings on system problems.
We know about deviant behavior, we just dilute it as we pass it up the chain of command. This can happen for many reasons but can mostly be summed up as “company politics.” Maybe someone doesn’t want to look bad to superiors so they won’t report the incident fully. Maybe you don’t discipline a top performer for unacceptable behavior because you are afraid they might leave.
You also see this in companies that have a culture where managers lead with an iron fist. People feel compelled to protect coworkers and don’t pass information along.
How Do You Fix It
This happens everywhere. It happens at your current job, at home, with your personal habits, driving habits, diet and exercise; it’s everywhere. There are 3 important steps to fighting it.
Creating and Communicating Good Processes
It’s simple, bad processes lead to bad results. Good processes that aren’t documented and/or accessible lead to bad results. Detailed and documented processes are the first step to fixing this culture of deviance.
Good documentation helps you maintain operational consistency. The next step is to make sure each employee knows the process.
Create good processes, document them, train employees, and hold everyone accountable for maintaining them.
Create a Collaborative Environment
This is especially true when creating new processes. Bring the whole team in to discuss. People should feel some ownership over the process they are accountable for.
Remember, normalization of deviance is a social problem. If a process is created as a group then the social need to adhere to it as a group is more powerful.
This also solves problem #1 Rules are Stupid. If the team makes the rules then they will be more likely to follow them.
Create a Culture of Communication
The key to fighting normalization of deviance is to understand that everyone knows better. If employees are consistently accepting deviations from accepted procedures, then find out why.
A great way to see this in action is to watch what happens when a new hire comes to the team with an alert. How does the team react? Do they brush them off? If so, then you probably have a team that is accepting deviant practices.
Employees should feel empowered to “hit the e-stop” on their processes and tasks. Employees, especially juniors, should be encouraged to question the established order of things. They need to feel comfortable asking “why?”.
Conventional wisdom needs to be questioned. The people asking will be wrong most of the time. This will give you an opportunity to explain why you do things the way you do. If they are right then you make the procedure better. It’s a no-lose situation.
As you can tell, most of the solutions are the same: Communication. Creating a culture of communication is the only way to keep from falling into this trap. Empower your employees to question the status quo. You will create stronger teams, better ideas, and improved performance.
There is only one way to catch normalization of deviance before it sets in: Create a culture of honesty, communication, and continuous improvement.
Sometimes it’s hard to judge this in your own culture. I call this “ship in a bottle” syndrome. When you’re in the bottle it’s hard to see things clearly. AKF has helped hundreds of software companies change their culture. Give us a call, we can help.
July 21, 2019 | Posted By: Robin McGlothin
Microservices are an architectural approach emerging out of service-oriented architecture, emphasizing self-management and lightweightness as the means to improve software agility, scalability, and autonomy. This article examines microservice definition, how to size, and the benefits and challenges facing microservice development.
What exactly are Microservices?
Microservices is an approach to architecting applications. The approach breaks the application down into multiple services with each service being called a microservice. Nothing mysterious or magical there. The beauty of microservices is that each service should be deployed independently and runs independently of any other service or even the implementation around the service.
Microservices simplify the application because each service constitutes a single business function that does one task. In all cases, one task represents a small piece of business capability.
Figure 1 shows a sample application using microservices.
Small and Focused
Microservice size is not related to the number of lines of code but the service should have a small set of responsibilities. To help answer the sizing question, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By considering these, you can decide if your application should be grouped into a large, monolithic codebase, or split up into smaller, individual services and swim lanes.
You must also keep in mind that splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served to focus their resources on developing a marketable product than by fine-tuning their service sizes using the considerations below.
See the full article here.
The illustration below can be used to quickly determine whether a service or function should be segmented into smaller microservices, be grouped together with similar or dependent services, or remain in a multifunctional, infrequently changing monolith.
Figure 2 - Determine Service Size
A microservice also needs to be treated like an application or a product. It should have its own source code management repository and its own delivery pipeline for builds and deployment.
Loose coupling is an essential characteristic of microservices. You need to be able to deploy a single microservice on its own. There must be zero coordination necessary for the deployment with other microservices. This loose coupling enables frequent and rapid deployments, therefore getting much-needed features and capabilities to clients.
A popular way to implement microservices is to use protocols such as HTTP/REST alongside JSON. As an architectural design pattern, microservices are being adopted by most of the main SaaS providers, from AWS and Microsoft to IBM and many more, in their solutions and services.
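As a minimal sketch of what HTTP/REST alongside JSON can look like, the example below serves one small business capability over HTTP and reads it back as JSON, using only the Python standard library. The inventory service, its `/stock/<sku>` path and the `STOCK` data are hypothetical and chosen purely for illustration. A production service would sit behind a proper web framework and a real data store.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory data for an illustrative inventory service.
STOCK = {"sku-1": 12, "sku-2": 0}


class InventoryHandler(BaseHTTPRequestHandler):
    """Answers GET /stock/<sku> with a JSON document."""

    def do_GET(self):
        prefix = "/stock/"
        sku = self.path[len(prefix):] if self.path.startswith(prefix) else None
        if sku in STOCK:
            self._send(200, {"sku": sku, "quantity": STOCK[sku]})
        else:
            self._send(404, {"error": "not found"})

    def _send(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def demo() -> dict:
    """Start the service on a free local port, call it once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), InventoryHandler)  # port 0: OS picks a port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/stock/sku-1")
        reply = json.loads(conn.getresponse().read())
        conn.close()
        return reply
    finally:
        server.shutdown()
        server.server_close()


print(demo())
```

The caller only depends on the URL and the JSON shape, which is what lets the service be deployed and changed independently of its consumers.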
Microservices are always expressed in plural because we run several of them, not one. Each microservice is further scaled by running multiple instances of it. There are many processes to handle, and memory and CPU requirements are an important consideration when assessing the cost of operation of the entire system. Traditional Java EE stacks are less suitable for microservices from this point of view because they are optimized for running a single application container, not a multitude of containers. Stacks such as Node.js and Go are seen as a go-to technology because they are more lightweight and require less memory and CPU power per instance.
In theory, it is possible to create a microservice system in which each service uses a different stack. In most situations, this would be craziness. Economy of scale, code reuse, and developer skills all limit this number at a level that is around 2 - 3.
Benefits of Microservices
As microservices architecture has grown in popularity in recent years, so have the benefits that it can bring to software development teams and enterprises. As software increases in complexity, being able to componentize functional areas in the application into sets of independent services can yield many benefits, which include, but are not limited to, the following:
- More efficient debugging – no more jumping through multiple layers of an application, in essence, better fault isolation
- Accelerated software delivery – multiple programming languages can be used thereby giving you access to a wider developer talent pool
- Easier to understand the codebase – increased productivity as each service represents a single functional area or business use case
- Scalability – componentized microservices lend themselves to be integrated with other applications or services via industry-standard interfaces such as REST
- Fault tolerance – reduced downtime due to more resilient services
- Reusability – as microservices are organized around business cases and not a particular project, due to their implementation, they can be reused and easily slotted into other projects or services, thereby reducing costs.
- Deployment – as everything is encapsulated into separate microservices, you only need to deploy the services that you’ve changed and not the entire application. A key tenet of microservice development is ensuring that each service is loosely coupled with existing services as mentioned earlier.
Challenges of Microservice Architecture
As with every software architecture, microservices have their own list of pros and cons. It’s not always peaches and cream, and it’s worth pointing some of the challenges out.
- Too many coding languages – yes, we listed this as a benefit, but it can also be a double-edged sword. Too many languages, in the end, could make your solution unwieldy and potentially difficult to maintain.
- Integration – you need to make a conscious effort to ensure your services are as loosely coupled as they possibly can be (yes, mentioned earlier too). Otherwise, a change to one service will ripple through additional services, making service integration difficult and time-consuming.
- Integration test – testing one monolithic system can be simpler as everything is in “one solution”, whereas a solution based on microservices architecture may have components that live on other systems and/or environments thereby making it harder to configure an “end to end” test environment.
- Communication – microservices naturally need to interact with other services. Each service depends on a specific set of inputs and returns specific outputs, so these communication channels need to be defined as specific interface standards and shared with your team. Failures between microservices can occur when interface definitions haven’t been adhered to, which can result in lost time.
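One way to make the communication point concrete is to keep each interface definition in code and reject payloads that do not match it. The sketch below is an assumption about how a team might do this with Python dataclasses. The `OrderPlaced` contract and its fields are hypothetical, and real projects often use a schema language or an interface description format instead.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical shared contract for an 'order placed' message."""
    order_id: str
    customer_id: str
    total_cents: int


def parse_message(contract, payload: dict):
    """Build a contract instance, rejecting payloads that break the interface."""
    expected = {f.name: f.type for f in fields(contract)}
    missing = expected.keys() - payload.keys()
    extra = payload.keys() - expected.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for name, expected_type in expected.items():
        if not isinstance(payload[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")
    return contract(**payload)


order = parse_message(
    OrderPlaced,
    {"order_id": "o-1", "customer_id": "c-9", "total_cents": 1250},
)
print(order)
```

A payload that drops `total_cents` or sends it as a string fails at the service boundary, which is cheaper to diagnose than a failure several calls downstream.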
Don’t even think about Microservices without DevOps
Microservices allow you to respond quickly and incrementally to business opportunities. Incremental and more frequent delivery of new capabilities drives the need for organizations to adopt DevOps practices.
Microservices cause an explosion of moving parts. It is not a good idea to attempt to implement microservices without serious deployment and monitoring automation. You should be able to push a button and get your app deployed. In fact, you should not even do anything. Committing code should get your app deployed through the commit hooks that trigger the delivery pipelines in at least development. You still need some manual checks and balances for deploying into production.
You no longer just have a single release team to build, deploy, and test your application. Microservices architecture results in more frequent and greater numbers of smaller applications being deployed.
DevOps is what enables you to do more frequent deployments and to scale to handle the growing number of new teams releasing microservices. DevOps is a prerequisite to being able to successfully adopt microservices at scale in your organization.
Teams that have not yet adopted DevOps must invest significantly in defining release processes and corresponding automation and tools. This is what enables you to onboard new service teams and achieve efficient release and testing of microservices. Without it, each microservice team must create its own DevOps infrastructure and services, which results in higher development costs. It also means inconsistent levels of quality, security, and availability of microservices across teams.
As you begin to reorganize teams to align with business components and services, also consider creating microservices DevOps teams who provide the cross-functional development teams with tool support, dependency tracking, governance, and visibility into all microservices. This provides business and technical stakeholders greater visibility into microservices investment and delivery as microservices move through their lifecycle.
The DevOps services team provides the needed visibility across the teams as to what services are being deployed, used by other teams, and ultimately used by client applications. This loosely coupled approach provides greater business agility.
Frequent releases keep applications relevant to business needs and priorities. Smaller releases mean fewer code changes, and that helps reduce risk significantly. With smaller release cycles, it is easier to detect bugs much earlier in the development lifecycle and to gain quick feedback from the user base. All these are characteristics of a well-oiled microservices enterprise.
AKF Partners has helped to architect some of the most scalable, highly available, fault-tolerant and fastest response time solutions on the internet. Give us a call - we can help.
July 19, 2019 | Posted By: Marty Abbott
When AKF Partners uses the term asynchronous, we use it in the logical rather than the physical (transport mechanism) sense. Solutions that communicate asynchronously do not suspend execution and wait for a return – they move off to some other activity and resume execution should a response arrive.
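A minimal sketch of this logical sense of asynchronous, using Python's `asyncio`: the caller starts a request, carries on with other work, and resumes only when a response arrives. The service name, the simulated latency and the timeout are assumptions made for illustration. The timeout in particular is an addition that keeps the caller from waiting forever when no response comes.

```python
import asyncio


async def call_service(name: str, delay: float) -> str:
    """Stand-in for a remote call; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def handle_request() -> list:
    # Start the call and move on instead of suspending until it returns.
    pending = asyncio.create_task(call_service("recommendations", 0.05))
    work_done = ["rendered page shell"]  # other activity while the call is in flight
    try:
        # Resume with the response if it arrives within the time budget.
        work_done.append(await asyncio.wait_for(pending, timeout=1.0))
    except asyncio.TimeoutError:
        work_done.append("recommendations: skipped")
    return work_done


print(asyncio.run(handle_request()))
```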
Asynchronous, non-blocking communications between service components help create resilient, fault isolated (limited blast radius) solutions. Unfortunately, while many teams spend a great deal of time ensuring that their services and associated data stores are scalable and highly available, they often overlook the solutions that tend to be the mechanism by which asynchronous communications are passed. As such, these message systems often suffer from single points of failure (physical and logical), capacity constraints and may themselves represent significant failure domains if upon their failure no messages can be passed.
The AKF Scale Cube can help resolve these concerns. The same axes that guide how we think about applications, servers, services, databases and data stores can also be applied to messaging solutions.
Cloning or duplication of messaging services means that anytime we have a logical service, we should have more than one available to process the same messages. This goes beyond ensuring high availability of the service infrastructure for any given message queue, bus or service – it means that where one mechanism by which we send messages exists, another should be there, capable of handling traffic should the first fail.
As with all uses of the X axis, N messaging services (where N>1) can allow the passage of all similar messages. Messages aren’t replicated across the instances, as doing so would eliminate the benefit of scalability. Rather, each message is sent to one instance, but all producers and consumers can send to or consume from each of the N instances. When an instance fails, it is taken out of rotation for production; when it returns, its messages are consumed and producers can resume sending messages through it. Ideally the solution is active-active with producers and consumers capable of interacting with all N copies as necessary.
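The X-axis behavior described above can be sketched with in-memory stand-ins for the messaging instances. Everything here is hypothetical: `QueueInstance` and `ClonedProducer` are illustrative names, and a real deployment would use client libraries for an actual broker. The sketch shows each message going to exactly one of N instances, a failed instance being skipped, and its backlog being consumed once it returns.

```python
import itertools
from collections import deque


class QueueInstance:
    """In-memory stand-in for one messaging instance."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.messages = deque()

    def publish(self, message):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.messages.append(message)


class ClonedProducer:
    """Sends each message to exactly one of N identical instances."""

    def __init__(self, instances):
        self.instances = instances
        self._next = itertools.cycle(range(len(instances)))

    def send(self, message):
        # Try each instance at most once, starting from the next in rotation.
        for _ in range(len(self.instances)):
            instance = self.instances[next(self._next)]
            try:
                instance.publish(message)
                return instance.name
            except ConnectionError:
                continue  # instance is out of rotation; try the next clone
        raise RuntimeError("no messaging instance available")


def drain(instances):
    """Consumer side: read from every instance that is currently healthy."""
    received = []
    for instance in instances:
        while instance.healthy and instance.messages:
            received.append(instance.messages.popleft())
    return received


queues = [QueueInstance("mq-1"), QueueInstance("mq-2")]
producer = ClonedProducer(queues)
print(producer.send("a"), producer.send("b"))
```

Because messages are not copied between instances, adding an instance adds capacity rather than duplicating load.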
The Y axis is segmentation by a noun (resource or message type) or verb (service or action). There is very often a strong correlation between these.
Just as messaging services often have channels or types of communication, so might you segment messaging infrastructure by the message type or channel (nouns). Monitoring messages may be directed to one implementation, analytics to a second, commerce to a third and so on. In doing so, physical and logical failures can be isolated to a message type. Unanticipated spikes in demand on one system, would not slow down the processing of messages on other systems. Scale is increased through the “sharding” by message type, and messaging infrastructure can be increased cost effectively relative to the volume of each message type.
Alternatively, messaging solutions can be split consistent with the affinity between services. Service A, B and C may communicate together but not need communication with D, E and F. This affinity creates natural fault isolation zones and can be leveraged in the messaging infrastructure to isolate A, B and C from D, E and F. Doing so provides similar benefits to the noun/resource approach above – allowing the solutions to scale independently and cost effectively.
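A Y-axis split by message type amounts to a routing table in front of separate messaging implementations. The sketch below assumes three hypothetical bus names and uses in-memory queues in place of real infrastructure. The same shape works for a split by service affinity, with the routing key being the service group instead of the message type.

```python
from collections import defaultdict, deque

# Hypothetical mapping of message type to a dedicated messaging implementation.
ROUTES = {
    "monitoring": "mq-monitoring",
    "analytics": "mq-analytics",
    "commerce": "mq-commerce",
}

# In-memory stand-ins for the separate messaging systems.
buses = defaultdict(deque)


def publish(message_type, payload):
    """Route a message to the implementation that owns its type."""
    try:
        bus = ROUTES[message_type]
    except KeyError:
        raise ValueError(f"no messaging implementation for {message_type!r}") from None
    buses[bus].append(payload)
    return bus


print(publish("commerce", {"order_id": "o-1"}))
print(publish("monitoring", {"cpu": 0.93}))
```

A spike of monitoring messages fills only `mq-monitoring`, so commerce traffic keeps flowing, and each bus can be sized for the volume of its own message type.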
Whereas the Y axis splits different types of things (nouns or verbs), the Z axis splits “similar” things. Very often this is along a customer and geography boundary. You may for instance implement a geographically distributed solution in multiple countries, each country having its own processing center. Large countries may be subdivided, allowing solutions to exist close to the customer and be fault isolated from other geographic partitions.
Your messaging solution should follow your customer-geography partitions. Why would you conveniently partition customers for fault isolation, low latency and scalability but rely on a common messaging solution between all segments? A more elegant solution is to have each boundary have its own messaging solution to increase fault tolerance and significantly reduce latency. Even monitoring-related messages would ideally be handled locally and then forwarded, if necessary, to a common hub.
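The Z-axis arrangement can be sketched the same way: one messaging solution per customer-geography partition, with monitoring handled locally and only a summary forwarded to a shared hub. The customer identifiers, region names and summary format below are hypothetical, and in-memory queues stand in for real infrastructure.

```python
from collections import defaultdict, deque

# Hypothetical assignment of customers to geographic partitions.
CUSTOMER_REGION = {"cust-17": "us-east", "cust-42": "eu-west"}

regional_bus = defaultdict(deque)  # one messaging solution per partition
central_hub = deque()              # common hub for forwarded summaries


def publish(customer_id, message):
    """Keep a customer's messages inside that customer's own partition."""
    region = CUSTOMER_REGION[customer_id]
    regional_bus[region].append(message)
    return region


def forward_monitoring(region):
    """Process monitoring locally, then forward only a summary to the hub."""
    events = [m for m in regional_bus[region] if m.get("type") == "monitoring"]
    summary = {"region": region, "monitoring_events": len(events)}
    central_hub.append(summary)
    return summary


publish("cust-17", {"type": "commerce", "order_id": "o-1"})
publish("cust-17", {"type": "monitoring", "latency_ms": 41})
publish("cust-42", {"type": "monitoring", "latency_ms": 38})
print(forward_monitoring("us-east"))
```

A failure of the `eu-west` bus leaves `us-east` customers untouched, and the hub only ever sees the small forwarded summaries.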
We have held hundreds of on-site and remote architectural 2 and 3-day reviews for companies of all sizes in addition to thousands of due diligence reviews for investors. Contact us to see how we can help!