July 1, 2019 | Posted By: Pete Ferguson
This is one of several articles on recommended architectural principles and goes into deeper depth to our post on the AKF Scale Cube made reference to a concept that we call “Fault Isolation” or more commonly – “Swim lanes” or “Swim-laned Architectures”. We sometimes also call “swim lanes” fault isolation zones or fault isolated architecture.
Fault Isolation Defined
A “swim lane” or fault isolation zone is a failure domain. A failure domain is a group of services within a boundary such that any failure within that boundary is contained within the boundary and the failure does not propagate or affect services outside of the said boundary. Think of this as the “blast radius” of failure meant to answer the question of “What gets impacted should any service fail?” The benefit of fault isolation is twofold:
- Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain. Once something breaks, because the failure is limited in scope, it can be more rapidly identified and fixed. Recovery time objectives (RTO) are subsequently decreased which increases overall availability.
- Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. The “blast radius” of failure is contained. As such, and depending upon approach, only a portion of users or a portion of the functionality of the product is affected. This is akin to circuit breakers in your house – the breaker exists to limit the fault zone for any load that exceeds a limit imposed by the breaker. Failure propagation is contained by the breaker tripping, preserving power to devices which are not affected.
Architecting Fault Isolation
A fault isolated architecture is one in which each failure domain is completely isolated. We use the term “swim lanes” to depict the separations, similar to how a floating line of buoys keeps each swimmer in his or her lane during a race. In order to achieve this in systems architecture, ideally there are no synchronous calls between swimlanes or failure domains made pursuant to a user request.
User-initiated synchronous calls between failure domains are absolutely forbidden in this type of architecture as any user-initiated synchronous call between fault isolation zones, even with an appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures across other domains. Strictly speaking, you do not have a failure domain if that domain is connected via a synchronous call to any other service in another domain, to any service outside of the domain, or if the domain receives synchronous calls from other domains or services.
It is acceptable, but not advisable, to have asynchronous calls between domains and to have non-user initiated synchronous calls between domains (as in the case of a batch job collecting data for the purposes of reporting in another failure domain). If such a communication is necessary it is very important to include failure detection and timeouts even with the asynchronous calls to ensure that retries do not call port overloads on any services.
As previously indicated, a swim lane should have all of its services located within the failure domain. For instance, if database [read/writes] are necessary, the database with all appropriate information for that swim lane should exist within the same failure domain as all of the application and web servers necessary to perform the function or functions of the swim lane. Furthermore, that database should not be used for other requests of service from other swim lanes. Our rule is one production database on one host.
The figure below demonstrates the components of software and infrastructure that are typically fault isolated:
Rarely are shared higher level network components isolated (e.g. border systems and core routers).
Sometimes, if practical, firewalls and load balancers are isolated. These are especially the case under very high demand situations where a single pair of devices simply wouldn’t meet the demand.
The remainder of solutions are always isolated, with web-servers, top of rack switches (in non IaaS implementations), compute (app servers) and storage all being properly isolated.
Applying Fault Isolation with AKF’s Scale Cube
As we have indicated with the AKF Scale Cube in the past, there are many ways in which to think about swimlaned architectures. Swim lanes can be isolated along the axes of the Scale Cube as shown below with AKF’s circuit breaker analogy to fault isolation.
Fault isolation in X-Axis would mean replicating everything for high availability – and performing the replication asynchronously and in an eventually consistent (rather than a consistent) fashion. For example, when a data center fails the fault will be isolated to the one failed data center or multiple availability zones. This is common with traditional disaster recovery approaches, though we do not often advise it as there are better and more cost-effective solutions for recovering from a disaster.
Fault Isolation in the Y-Axis can be thought in terms of a separation of services e.g. “login” and “shopping cart” (two separate swim lanes) with each having the web and app servers as well as all data stores located within the swim lane and answering only to systems within that swim lane. Each portion of a page is delivered from a separate service reducing the blast radius of a potential fault to its swim lane.
The example above of a commerce site shows different components of the page broken down into sections for login, buy again, promotions, shopping cart, and checkout. Each component would reside within separate applications, hosted on different servers with properly isolated services.
Another approach would be to perform a separation of your customer base or separation of your order numbers or product catalog. Assuming an indiscriminate function to perform this separation (like a modulus of id), such a split would be a Z-Axis swimlane along customer, order number, or product ID lines. More beneficially, if we are interested in the fastest possible response times to customers, we may split along geographic boundaries with each pointing to the closest data center within that region. Besides contributing to faster customer response times, these implementations can also help ensure we are compliant with data sovereignty laws (GDPR for example) unique to different countries or even states within the US.
Combining the concepts of service and database separation into several fault isolative failure domains creates both a scalable and highly available platform. AKF has helped achieve high availability through fault isolation. Contact us to see how we can help you achieve the same fault tolerance.
AKF Partners helps companies create highly available, fault-isolated swim lane solutions. Send us a note - we’d love to help you!
June 28, 2019 | Posted By: Roger Andelin
The two most common types of technology due diligence requests we see at AKF are
- Product Technology Due Diligence
- Information Technology, or IT, Due Diligence.
Both types are very different from each other, but often get confused. This article will explain the differences between Product Technology and
Information Technology – and why understanding the difference is critical to a company’s success and profitability.
An IT department is typically led by a Chief Information Officer (CIO). The focus of the CIO is on information technology that supports the ongoing operations of the business. The CIO and the IT team’s key outcomes are typically around employee productivity and efficiency, applying technology to improve productivity and to lower costs. This includes technologies that run the financial and accounting systems, sales and operations systems, customer support systems, and the networks, servers, and storage underlying these systems. The CIO is also responsible for the technologies that are used in the office such as email, chat, video conferencing systems, printers, and employees’ desktop computers.
Conversely, Product Technology (or Digital Product Technology) is typically led by a Chief Technology Officer (CTO). The focus of the CTO is building a product or service for customers out of software and running that product or service on cloud systems or company-owned systems, although the latter is becoming less common. Put another way, CTOs build and run software as revenue generating products and services.
Whereas the CIO runs a cost center and is responsible for employee productivity, the CTO is responsible for revenue and cash-flow. Sales growth, time to market, costs of goods sold, and R&D spend are some of the factors included within key outcomes for the CTO.
For example, if you were running a newspaper business, your primary product is the news. However, you also must build applications to read news like mobile apps and web apps. It is the job of your CTO to build, maintain, and run these apps for your customers. Your CTO would be accountable for business metrics – such as the number of downloads, users, and revenue. If your CTO is distracted by CIO issues of running the day-to-day business of the office, they are being taken away from their work to build and implement the revenue generating products and services your company is trying to create.
Product Technologies and IT Technologies are very different. CIOs and CTOs have very different skills and competencies to manage these differences. For example, a CIO often possesses deep knowledge of back-office applications such as accounting systems, finance systems, and warehouse management systems. In many cases they likely rose up through the technology ranks writing, maintaining, and running those systems for other departments in the company. They are excellent at business analysis, collecting requirements from company users and translating those requirements into project plans. CIOs are often proficient at waterfall development methodologies often used to implement back-office applications.
The IT team is often largely staffed with people who know how to integrate and configure third-party products, with a small amount of custom development. The opposite is true for most product technology teams typically staffed with software engineers who are building new solutions and a smaller number of engineers integrating infrastructure components.
CTOs possess entirely different technology skills needed to build and maintain software as a service (SaaS) for the company’s customers. CTOs possess the skills to architect software applications that are scalable as the company grows its customer base. The AKF Scale Cube is an invaluable reference tool for the CTO building a scalable software solution based on scalable microservices. CTOs must have the skills to run product teams, including user experience design, along with software development. CTOs are more likely to be proficient in Agile development methodologies, such as Scrum. CTOs are expected to know the product development lifecycle (PDLC), mitigate technical debt liability, and know how to build software release pipelines to support continuous integration and delivery(CI/CD).
What We See In The Wild
Regularly, when AKF is called to perform technology due diligence, we often find that Product and IT are combined! This has been especially true for older, established companies, with traditional IT departments. CIOs took on the responsibility of building and running customer-facing internet and mobile software products and services rather than creating a separate Product Development Team under a CTO. The results are often not positive because the technical, product, and process skills are very different between the two.
This mistake is not exclusive to older companies. We see startup CEO’s making the same mistake, often under the rationale of reducing burn. However, when a startup CEO looks to the CTO to help set up new employees’ desktop computers or to fix a problem with email, it is a huge distraction for the CTO who should be focused on building and improving the company’s revenue-generating products.
AKF recommends that CEOs not combine Product Development and IT departments. Understanding this distinction and why these two very different departments need to function separately is critically important. Our primary expertise at AKF is to help successful companies become more successful at delivering digital products. AKF focuses its expertise on helping product development teams succeed. We have developed intellectual property, including the AKF Scale Cube, used to evaluate product architecture and to guide successful product development teams.
We also help CEOs, CTOs, CIOs and IT departments who are looking to improve performance and deliver more business value by creating efficient product development processes and architectures. Give us a call, we can help!
June 27, 2019 | Posted By: Marty Abbott
Bulkhead Pattern Overview
Bulkheads in ships separate components or sections of a ship such that if one portion of a ship is breached, flooding can be contained to that section. Once contained, the ship can continue operations without risk of sinking. In this fashion, ship bulkheads perform a similar function to physical building firewalls, where the firewall is meant to contain a fire to a specific section of the building.
The microservice bulkhead pattern is analogous to the bulkhead on a ship. By separating both functionality and data, failures in some component of a solution do not propagate to other components. This is most commonly employed to help scale what might be otherwise monolithic datastores. The bulkhead is then a pattern for implementing the AKF principle of “swimlanes” or fault isolation.
Problems the Bulkhead Pattern Fixes
The bulkhead pattern helps to fix a number of different quality of service related issues.
- Propagation of Failure: Because solutions are contained and do not share resources (storage, synchronous service-to-service calls, etc), their associated failures are contained and do not propagate. When a service suffers a programmatic (software) or infrastructure failure, no other service is disrupted.
- Noisy Neighbors: If implemented properly, network, storage and compute segmentation ensure that abnormally large resource utilization by a service does not affect other services outside of the bulkhead (fault isolation zone).
- Unusual Demand: The bulkhead protects other resources from services experiencing unpredicted or unusual demand. Other resources do not suffer from TCP port saturation, resulting database deterioration, etc.
Principles to Apply
- Share Nearly Nothing: As much as possible, services that are fault isolated or placed within a bulkhead should not share databases, firewalls, storage, load balancers, etc. Budgetary constraints may limit the application of unique infrastructure to these services. The following diagram helps explain what should never be shared, and what may be shared for cost purposes. The same principles apply, to the extent that they can be managed, within IaaS or PaaS implementations.
- Avoid synchronous calls to other services: Service to service calls extend the failure domain of a bulkhead. Failures and slowness transit blocking synchronous calls and therefore violate the protection offered by a bulkhead.
Put another way, the dimensions of a bulkhead or failure domain is the largest boundary across which no critical infrastructure is shared, and no synchronous inter-service calls exist.
Anti-Patterns to Avoid
The following anti-patterns each rely on either synchronous service to service communication or sharing of data solutions. As such, they represent solutions that should not be present within a bulkhead.
When to use the Bulkhead Pattern
- Apply the bulkhead pattern whenever you want to scale a service independent of other services.
- Apply the bulkhead pattern to fault isolate components of varying risk or availability requirements.
- Apply the bulkhead pattern to isolate geographies for the purposes of increased speed/reduced latency such that distant solutions do not share or communicate and thereby slow response times.
AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures. Give us a call – we can help!
June 24, 2019 | Posted By: Marty Abbott
The microservice CQRS (Command Query Responsibility Seegregation) pattern is most commonly employed to help scale what might be otherwise monolithic datastores. Per the X-axis of the AKF Scale Cube, a write instance of a datastore receives all creates, updates and deletes (collectively the “commands” within CQRS) while one or more read instances receive reads (the “queries” within CQRS).
Most relational (ACID compliant) databases have asynchronous and eventually consistent transfer mechanisms available to create “replicants” or “replica sets” of a “master” database. These are often called master databases (the write database) and slave databases (the read databases). The easiest implementation for CQRS then is to rely on native replication technology to create one or more slave databases with the same schema. The service is then separated into a write service and a read service, each with its own endpoint connected to the appropriate database.
Many NoSQL solutions also offer similar capabilities. Attribute names vary by implementation, but the user may identify the number of copies or replica sets for any piece of information. Further, the user can also often identify how these sets are to be used. MongoDB, for instance, allows users to identify from which elements of a replica set a read should be performed as well as the level of “synchronicity” between a write and the remaining read elements. Ideally, this level of synchronicity should be as loose as possible to maximize the value of BASE and avoid the limitations of Brewer’s Cap Theorem.
Highly tuned solutions, or solutions that need to support significant transaction volumes, may benefit from differentiation between the write and read schema instance. There may be several reasons for this, including:
- A subset of data that needs to be written compared to that which needs to be read. For instance, read data may include “static” elements that add no value to the (C)reat, (U)pdate or (D)elete process.
- An elimination of certain relationships in the read or write schema where those relationships add no value. Read databases, for instance, may need more relationships than the write in order to perform complex reporting.
- A reduction or significant difference in indices between those needed for writes (ideally small given that each index needs to be updated for each write, thereby increasing write response time) and those needed for reads.
In these cases, native replication capabilities in either ACID (relational) or BASE (NoSQL) datastores may not fulfill the desired outcomes. Engineers may need to rely on eventually consistent transfer technologies including queues or buses along with transformation compute capacity.
Another somewhat “advanced” concept within relational solutions is to use “Master-Master” replication technology as “Master-Slave”. At AKF, we prefer not to rely on multi-master solutions for distributing writes across multiple database instances. Very often, “in doubt” transactions can cause a pinging effect against multiple master databases and sometimes result in either starvation or deadlock (the Dining Philosopher’s Problem).
But if you implement a multi-master replication solution and use it as master-slave we avoid the concerns over transactional coherence and consistency with a single write database. Should that database fail, however, we can easily “swing” future transactions to the other master previously operating as a slave with a very low probability of conflict.
Benefits of CQRS
- Scalability of transactions at low development cost and comparatively low conceptual complexity (conceptual cost).
- Availability/redundancy of solutions.
- Distribution of X-axis (read copies) geographically for lower latency on read request.
Drawbacks to CQRS
- Multiple copies of data for scale, leading to potentially higher costs of goods sold
- Higher number of failures in a production environment (but typically with increasing levels of customer perceived availability).
- Reads are very often “eventually consistent” as opposed to immediately consistent. This approach will work for greater than 99.9% of use cases in our experience.
How to Use CQRS
- Split writes (commands) from reads (queries) both in actions/methods/services and the datastores they use.
- Implement an eventual consistency solution between the datastores.
- If implementing more than one read instance, implement a proxy or load balancer with a single service endpoint to “spray” requests across all instances.
What to NEVER do with CQRS
Never employ multiple “write masters” or “command masters” with CQRS where each master shares ownership of the same data elements.
AKF Partners has helped hundreds of companies implement new microservice architectures and migrate existing monolithic products to microservice architectures. Give us a call – we can help!
June 19, 2019 | Posted By: Larry Steinberg
Transforming a traditional on-premise product and company to a SaaS model is currently in vogue and has many broad-reaching benefits for both producers and consumers of services. These benefits span financial, supportability, and consumption simplification.
In order to achieve the SaaS benefits, your company must address the broad changes necessitated by SaaS and not just the product delivery and technology changes. You must consider the impact on employees, delivery, operations/security, professional services, go to market, and financial impacts.
The employee base is one key element of change – moving from a traditional ‘boxed’ software vendor to an ‘as a Service’ company changes not only the skill set but also the dynamics of the engagement with your customers. Traditionally staff has been accustomed to having an arms-length approach for the majority of customer interactions. This traditional process has consisted of a factory that’s building the software, bundling, and a small group (if any) who ‘touch’ the customer. As you move ‘to as a service’ based product, the onus on ensuring the solution is available 24x7 is on everyone within your SaaS company. Not only do you require the skillsets to ensure infrastructure and reliability – but also the support model can and should change.
Now that you are building SaaS solutions and deploy them into environments you control, the operations are much ‘closer’ to the engineers who are deploying builds. Version upgrades happen faster, defects will surface more rapidly, and engineers can build in monitoring to detect issues before or as your customers encounter them. Fixes can be provided much faster than in the past and strict separation of support and engineering organizations are no longer warranted. Escalations across organizations and separate repro steps can be collapsed. There is a significant cultural shift for your staff that has to be managed properly. It will not be natural for legacy staff to adopt a 24x7 mindset and newly minted SaaS engineers likely don’t have the industry or technology experience needed. Finding a balance of shifting culture, training, and new ‘blood’ into the organization is the best course of action.
Passion For Service Delivery
Having Services available 24x7 requires a real passion for service delivery as teams no longer have to wait for escalations from customers, now engineers control the product and operating environment. This means they can see availability and performance in real time. Staff should be proactive about identifying issues or abnormalities in the service. In order to do this the health and performance of what they built needs to be at the forefront of their everyday activities. This mindset is very different than the traditional onpremise software delivery model.
Shifting the operations from your customer environments to your own environments also has a security aspect. Operating the SaaS environment for customers shifts the liability from them to you. The security responsibilities expand to include protecting your customer data, hardening the environments, having a practiced plan of incident remediation, and rapid response to identified vulnerabilities in the environment or application.
Finance & Accounting
Finance and accounting are also impacted by this shift to SaaS - both on the spend/capitalization strategy as well as cost recovery models. The operational and security components are a large cost consideration which has been shifted from your customers to your organization and needs to be modeling into SaaS financials. Pricing and licensing shifts are also very common. Moving to a utility or consumption model is fairly mainstream but is generally new for the traditional product company and its customers. Traditional billing models with annual invoices might not fit with the new approach and systems + processes will need to be enhanced to handle. If you move to a utility-based model both the product and accounting teams need to partner on a solution to ensure you get paid appropriately by your customers.
Think through the impacts on your customer support team. Given the speed at which new releases and fixes become available the support team will need a new model to ensure they remain up to date as these delivery timeframes will be much more rapid than in the past and they must stay ahead of your customer base.
Your go to market strategies will most likely also need to be altered depending on the market and industry. To your benefit, as a SaaS company, you now have access to customer behavior and can utilize this data in order to approach opportunities within the customer base. Regarding migration, you’ll need a plan which ensures you are the best option amongst your competitors.
Most times at AKF we see companies who have only focused on product and technology changes when moving to SaaS but if the whole company doesn’t move in lockstep then progress will be limited. You are only as strong as your weakest link.
We’ve helped companies of all sizes transition their technology – AND organization – from on-premises to the cloud through SaaS conversion. Give us a call – we can help!
June 12, 2019 | Posted By: Marty Abbott
Strangler Pattern Overview
The microservice Strangler pattern is employed when teams migrate functionality from an old solution to one or more new implementations. While the pattern can be used for any old to new service migration, the diagram below depicts the common use case of migrating from a monolithic solution to a microservice implementation:
The solution above starts with a single monolithic application. Step one is to implement a proxy (or context switch) that can separate requests by type (Y-Axis or Service Segmentation), or if attempting to evaluate the solution for efficacy against an existing implementation, the X-Axis by transaction volume.
The pattern progresses with the monolith growing increasingly smaller as newly implemented services begin to take requests previously bound for the monolith.
Finally, the proxy is removed once the service disaggregation is completed.
Benefits of Strangler
- The Strangler pattern allows for graceful migration from a service to one or more replacement services.
- If implemented properly, with the ability to “roll back”, the pattern allows relatively low risk in migrating to new service(s).
- The pattern can be used for versioning of APIs.
- Similar to versioning above, the pattern can be used for legacy interactions (the old service remains for solutions that aren’t or won’t be upgraded).
Drawbacks to Strangler
- If implemented with a “new” or additional service – something in addition to a prior proxy or load balancer, the solution decreases availability through the multiplicative effect of failure.
- Per above, additional (new) services will increase latency.
How to Use Strangler
- Use the Strangler for versioning or migration to new services.
- Employ within an existing proxy – either a software-based load balancer (such as Nginx) or a layer 7 capable load balancer/context switch (F5 or Netscaler).
- If using an existing proxy is not possible, ensure the client is responsible for choosing the resource.
- Keep rules updated as services are migrated.
- Prune rules as no longer needed.
What to NEVER do with Strangler
Never employ a new service and increase the call depth with Strangler. Such a service is often called a “façade”. Your existing proxy or load balancer solutions likely give you this flexibility today.
Never allow the solution to become a bottleneck or a single point of failure. Always ensure that the solution is scalable along the X-axis for both availability and scalability.
The X-Axis approach takes a percentage of calls for any unique service endpoint and splits them between the old service and the new service. This approach is useful to A/B test the solution for end-user efficacy, response time, availability, and cost of operations (cost per transaction). It also allows for graceful “dialing up” or “dialing down” of the transaction volume (say from 1% to 51%).
This is the traditional usage for Strangler. It takes a monolith servicing N unique capabilities and partitions them along either verb/service boundaries (e.g. checkout separated from everything else) or noun/resource boundaries (e.g. catalog related actions from customer data actions).
If time allows, implement both the X and Y approach. X gives you rollback as well as A/B testing and significantly reduces your risk while Y allows service disaggregation.
AKF Partners has helped hundreds of companies implement and improve microservice based architectures. Give us a call, we can help you with your transition.
June 10, 2019 | Posted By: AKF
Agile teams sometimes struggle with the meaning and value of autonomy within a team. Is it the autonomy to decide how best to accomplish a goal? Is it the autonomy to choose what tools and processes they will employ to accomplish that goal? Is it both? To answer this, we need to first determine where autonomy drives value creation and where it destroys value within an enterprise.
Getting Team Autonomy over Anarchy
Consider the following analogy: We have the autonomy to determine what roads and paths we will take to a destination when driving a vehicle, but we are governed by speed limits, right of way (which side to drive on, stop signs, stop signals and other road signs), emission standards, and both vehicle and personal licensing. Put another way, we are completely empowered to determine WHAT path we take, and WHY we take it. We are much more limited in HOW we get to a place (on road, off road, speed, only licensed vehicles, etc) and WHO can do it (only sober, licensed drivers). How does this apply to a fully autonomous technology team?
The value-creation of autonomy within a team deals with what path a team takes to accomplish a goal and the reasoning as to why that approach is valuable. A failure to provide structure and rules for how something is accomplished through architectural principles, coding standards, development standards, etc. can start to escalate the costs of development and destroy value. Also, not sharing best practices through standards, means that teams are bound to repeat mistakes and cause customer interruptions. While there is a narrow line between autonomy and anarchy, the difference on their effect in an organization is significant.
Consider the matrix below which distinguishes between decision making (What path gets taken to a result and Why that path is taken) and governance (the rules around How and Who should do something). In an Autocratic organization there is a high amount of Governance controlled via a Top-Down method of Management. Agile teams here are bounded by how they can accomplish a goal and crippled by what path to take because of the heavy involvement of Management. Here is where you find organizations that are described derogatorily as Bureaucratic.
On the opposite side of the spectrum, if an Agile shop finds itself unbounded by Governance and capable of making all decisions then you run into the possibility of anarchy. This type of environment is what allows start-ups to thrive in their early days, but once you have moved beyond a handful of employees, it is necessary to start building governance.
Where companies with Agile succeed is when Governance is in place and decisions are driven from the bottom. Ownership and appropriately known boundaries allow Agile teams to deftly maneuver and get after the concept of “Build Small, Fail Fast”. Lastly, it is important that the lessons learned from all of the Autonomous teams are continually brought up and shared across the multitude of products and teams you have. It is a waste of time to reinvent the wheel when another team has already solved the problem.
Teams should be built around the suggestions from AKF’s white paper on Organizing Product Teams for Innovation: small, cross functional teams built around a service, who are empowered to be autonomous and work independently on their own. Autonomy should be defined within the rules of the organization, inform the organization’s architectural principles, and drive adherence through leadership. This is by no means a notion that the organization should avoid cross functional empowered teams! As we state in our white paper, “We still have executives developing strategy, functional teams (product management, etc.) defining subservient strategies and road maps. But the primary identities of these folks are embedded within the teams that implement these solutions.”
It is also easy to confuse the notion of empowerment and autonomy. Empowerment is an action of delegation coupled with the assurance of resources and tools to complete the desired outcomes to the delegated party. It is only through empowerment that autonomy can be achieved within an organizational hierarchy. Further, it is only through empowerment that autonomous agile teams can be established. But both empowerment and autonomy need to have rules governing their action or issuance. Specifically, an agile team is empowered to be autonomous with the following constraints: following development protocols and standards, adhering to architectural principals, adhering to established best practices regarding test coverage, etc. and ensuring that you achieve the “non-functional requirements” codified within the Agile Definition of Done.
Have your autonomous technology teams been free to make decisions that do not align with the vision of the company? Are you fearful of switching to Agile because of the rampant anarchy they will exhibit? AKF Partners can help ensure that your organization is aligned to the outcomes it desires.
June 5, 2019 | Posted By: Marty Abbott
Sidecar Pattern Overview
The Sidecar Pattern is meant to allow the deployment of components of an application into separate, isolated, and encapsulated processes or containers. This pattern is especially useful when there is a benefit to sharing common components between microservices, as in the case of logging utilities, monitoring utilities, configuration routines, etc.
The Sidecar Pattern name is an analogy indicating the one-seat cars sometimes bolted alongside a motorcycle.
Benefits of Sidecar
Sidecar comes with many benefits:
- Use of multiple languages (polyglot) or technologies for each component. This is especially useful if a language is especially strong in a necessary area (e.g. Python for Machine Learning, or R for statistical work) or if an opensource solution can be leveraged to eliminate in-house specialization (e.g. the use of NGINX for certain network-related functions).
- Separation of what would otherwise be a monolith, and if used properly, fault isolation of associated services.
- Conceptually easy interactions between components similar to those provided by libraries, or service calls between microservices.
- Lower latency than traditional service calls to “other” services as the Sidecar lives in the same processing environment (VM or physical server) – albeit typically in a separate container.
- Similar to the use of libraries, allows for ownership by individual teams and organizational scalability of a larger team.
- Similar to the use of dynamically-loadable libraries, allows for independent release by teams of various shared usage components.
Drawbacks to Sidecar
Regardless of implementation (poly- or mono- glot), Sidecar has some drawbacks compared to the use of libraries:
- Higher inter-process communication latency – Because most implementations are service calls, the loopback interface on a system (127.0.0.1) will increase latency compared to the transition of call flow through memory.
- Size – especially in polyglot implementations but even in monoglot implementations – Containerization leads to multiple copies of similar libraries and increased memory utilization for comparable operations relative to the use of libraries.
- Environments – it is difficult to create any notion of fault isolation with Sidecar without containerization technologies. VM technologies (Sidecar in a VM separate from the host or calling solution) is not an option as it is then a Fan Out or Mesh anti-pattern rather than a local call.
When to Use Sidecar
Sidecar is a compelling alternative to libraries for cases where the increase in latency associated with local service messaging does not impact end-user response times. Examples of these are asynchronous logging, out of band monitoring, and asynchronous messaging capabilities. Circuit breakers (time-based request/response timeouts) are also a good example of a Sidecar implementation.
When to Avoid Sidecar
Never use a Sidecar Pattern for synchronous activities that must complete prior to generating a user response. Doing so will add some delay to end-user response times.
AKF also advises staying away from Sidecar for synchronous communications between services where doing so requires Sidecar to know all endpoints for each service. A specific example we advise against is having every endpoint (instance) of Service A (e.g. add-to-cart) know of every endpoint (instance) of Service B (e.g. decrement-SKU). A graphic of this example is given below:
The above graphic indicates the coordination between just two services and the instances that comprise that service. Imagine a case where all services may communicate to each other (as in the broader Mesh anti-pattern). Attempting to isolate faults becomes nearly impossible.
If Service A sometimes fails while calling Service B, how do you know which component is failing? Is it a failure in Service A, Service A’s Sidecar Proxy or Service B? Easier is to have a fewer number of proxies (albeit at a higher cost of latency given non-local communication) handle the transactions allowing for easier fault identification.
AKF Partners has helped hundreds of companies move from monolithic solutions to services and microservice architectures. Give us a call, we can help you with your transition.
May 29, 2019 | Posted By: AKF
VMs vs Containers
Inefficiency and down time have traditionally kept CTO’s and IT decision makers up at night. Now, new challenges are emerging driven by infrastructure inflexibility and vendor lock-in, limiting Technology more than ever and making strategic decisions more complex than ever. Both VMs and containers can help get the most out of available hardware and software resources while easing the risk of vendor lock-in limitation.
Containers are the new kids on the block, but VMs have been, and continue to be, tremendously popular in data centers of all sizes. Having said that, the first lesson to learn, is containers are not virtual machines. When I was first introduced to containers, I thought of them as light weight or trimmed down virtual instances. This comparison made sense since most advertising material leaned on the concepts that containers use less memory and start much faster than virtual machines – basically marketing themselves as VMs. Everywhere I looked, Docker was comparing themselves to VMs. No wonder I was a bit confused when I started to dig into the benefits and differences between the two.
As containers evolved, they are bringing forth abstraction capabilities that are now being broadly applied to make enterprise IT more flexible. Thanks to the rise of Docker containers it’s now possible to more easily move workloads between different versions of Linux as well as orchestrate containers to create microservices. Much like containers, a microservice is not a new idea either. The concept harkens back to service-oriented architectures (SOA). What is different is that microservices based on containers are more granular and simpler to manage. More on this topic in a blog post for another day!
If you’re looking for the best solution for running your own services in the cloud, you need to understand these virtualization technologies, how they compare to each other, and what are the best uses for each. Here’s our quick read.
VM’s vs. Containers – What’s the real scoop?
One way to think of containers vs. VMs is that while VMs run several different operating systems on one server, container technology offers the opportunity to virtualize the operating system itself.
Figure 1 – Virtual Machine Figure 2 - Container
VMs help reduce expenses. Instead of running an application on a single server, a virtual machine enables utilizing one physical resource to do the job of many. Therefore, you do not have to buy, maintain and service several servers. Because there is one host machine, it allows you to efficiently manage all the virtual environments with a centralized tool – the hypervisor. The decision to use VMs is typically made by DevOps/Infrastructure Team. Containers help reduce expenses as well and they are remarkably lightweight and fast to launch. Because of their small size, you can quickly scale in and out of containers and add identical containers as needed.
Containers are excellent for Continuous Integration and Continuous Deployment (CI/CD) implementation. They foster collaborative development by distributing and merging images among developers. Therefore, developers tend to favor Containers over VMs. Most importantly, if the two teams work together (DevOps & Development) the decision on which technology to apply (VMs or Containers) can be made collaboratively with the best overall benefit to the product, client and company.
What are VMs?
The operating systems and their applications share hardware resources from a single host server, or from a pool of host servers. Each VM requires its own underlying OS, and the hardware is virtualized. A hypervisor, or a virtual machine monitor, is software, firmware, or hardware that creates and runs VMs. It sits between the hardware and the virtual machine and is necessary to virtualize the server.
IT departments, both large and small, have embraced virtual machines to lower costs and increase efficiencies. However, VMs can take up a lot of system resources because each VM needs a full copy of an operating system AND a virtual copy of all the hardware that the OS needs to run. This quickly adds up to a lot of RAM and CPU cycles. And while this is still more economical than bare metal for some applications this is still overkill and thus, containers enter the scene.
Benefits of VMs
• Reduced hardware costs from server virtualization
• Multiple OS environments can exist simultaneously on the same machine, isolated from each other.
• Easy maintenance, application provisioning, availability and convenient recovery.
• Perhaps the greatest benefit of server virtualization is the capability to move a virtual machine from one server to another quickly and safely. Backing up critical data is done quickly and effectively because you can effortlessly create a replication site.
Popular VM Providers
• VMware vSphere ESXi, VMware has been active in the virtual space since 1998 and is an industry leader setting standards for reliability, performance, and support.
• Oracle VM VirtualBox - Not sure what operating systems you are likely to use? Then VirtualBox is a good choice because it supports an amazingly wide selection of host and client combinations. VirtualBox is powerful, comes with terrific features and, best of all, it’s free.
• Xen - Xen is the open source hypervisor included in the Linux kernel and, as such, it is available in all Linux distributions. The Xen Project is one of the many open source projects managed by the Linux Foundation.
• Hyper-V - is Microsoft’s virtualization platform, or ‘hypervisor’, which enables administrators to make better use of their hardware by virtualizing multiple operating systems to run off the same physical server simultaneously.
• KVM - Kernel-based Virtual Machine (KVM) is an open source virtualization technology built into Linux. Specifically, KVM lets you turn Linux into a hypervisor that allows a host machine to run multiple, isolated virtual environments called guests or virtual machines (VMs).
What are Containers?
Containers are a way to wrap up an application into its own isolated ”box”. For the application in its container, it has no knowledge of any other applications or processes that exist outside of its box. Everything the application depends on to run successfully also lives inside this container. Wherever the box may move, the application will always be satisfied because it is bundled up with everything it needs to run.
Containers virtualize the OS instead of virtualizing the underlying computer like a virtual machine. They sit on top of a physical server and its host OS — typically Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Sharing OS resources such as libraries significantly reduces the need to reproduce the operating system code and means that a server can run multiple workloads with a single operating system installation. Containers are thus exceptionally light — they are only megabytes in size and take just seconds to start. Compared to containers, VMs take minutes to run and are an order of magnitude larger than an equivalent container.
In contrast to VMs, all that a container requires is enough of an operating system, supporting programs and libraries, and system resources to run a specific program. This means you can put two to three times as many applications on a single server with containers than you can with VMs. In addition, with containers, you can create a portable, consistent operating environment for development, testing, and deployment. This is a huge benefit to keep the environments consistent.
Containers help isolate processes through differentiation in the operating system namespace and storage. Leveraging operating system native capabilities, the container isolates process space, may create temporary file systems and relocate process “root” file system, etc.
Benefits of Containers
One of the biggest advantages to a container is the fact you can set aside less resources per container than you might per virtual machine. Keep in mind, containers are essentially for a single application while virtual machines need resources to run an entire operating system. For example, if you need to run multiple instances of MySQL, NGINX, or other services, using containers makes a lot of sense. If, however you need a full web server (LAMP) stack running on its own server, there is a lot to be said for running a virtual machine. A virtual machine gives you greater flexibility to choose your operating system and upgrading it as you see fit. A container by contrast, means that the container running the configured application is isolated in terms of OS upgrades from the host.
Popular Container Providers
1. Docker - Nearly synonymous with containerization, Docker is the name of both the world’s leading containerization platform and the company that is the primary sponsor of the Docker open source project.
2. Kubernetes - Google’s most significant contribution to the containerization trend is the open source containerization orchestration platform it created.
3. Although much of early work on containers was done on the Linux platform, Microsoft has fully embraced both Docker and Kubernetes containerization in general. Azure offers two container orchestrators Azure Kubernetes Service (AKS) and Azure Service Fabric. Service Fabric represents the next-generation platform for building and managing these enterprise-class, tier-1, applications running in containers.
4. Of course, Microsoft and Google aren’t the only vendors offering a cloud-based container service. Amazon Web Services (AWS) has its own EC2 Container Service (ECS).
5. Like the other major public cloud vendors, IBM Bluemix also offers a Docker-based container service.
6. One of the early proponents of container technology, Red Hat claims to be “the second largest contributor to the Docker and Kubernetes codebases,” and it is also part of the Open Container Initiative and the Cloud Native Computing Foundation. Its flagship container product is its OpenShift platform as a service (PaaS), which is based on Docker and Kubernetes.
Uses for VMs vs Uses for Containers
Both containers and VMs have benefits and drawbacks, and the ultimate decision will depend on your specific needs, but there are some general rules of thumb.
• VMs are a better choice for running applications that require all of the operating system’s resources and functionality when you need to run multiple applications on servers or have a wide variety of operating systems to manage.
• Containers are a better choice when your biggest priority is maximizing the number of applications running on a minimal number of servers.
Because of their small size and application orientation, containers are well suited for agile delivery environments and microservice-based architectures. When you use containers and microservices, however, you can easily have hundreds or thousands of components in your environment. You may be able to manually manage a few dozen virtual machines or physical servers, but there is no way you can manage a production-scale container environment without automation. The task of automating and managing a large number of containers and how they interact is known as orchestration.
Scalability of containerized workloads is a completely different process from VM workloads. Modern containers include only the basic services their functions require, but one of them can be a web server, such as NGINX, which also acts as a load balancer. An orchestration system, such as Google Kubernetes is capable of determining, based upon traffic patterns, when the quantity of containers needs to scale out; can replicate container images automatically; and can then remove them from the system.
For most, the ideal setup is likely to include both. With the current state of virtualization technology, the flexibility of VMs and the minimal resource requirements of containers work together to provide environments with maximum functionality.
If your organization is running many instances of the same operating system, then you should look into whether containers are a good fit. They just might save you significant time and money over VMs.
May 22, 2019 | Posted By: Pete Ferguson
Disaster Recovery is the bellybutton of the IT world. Everyone has it (or at least in theory) – However, does anyone know if it can be used for anything useful?
The Disaster When There Is No Active Recovery
Have you been in this scenario: Each year during budget time you SWAG $50-100K to “test disaster recovery.” During the many cycles of budget negotiation, it is dismissed at some point and forgotten until the next budget season? In the meantime, somewhere along the way you are audited or something goes “bang” and DR suddenly becomes a hot topic … for about a week and then it is business as usual.
In many of our technical due diligence reviews we find companies are still stuck in the hypothetical world of table tops and disaster recovery exercises. During our longer term engagements and workshops, we repeatedly hear of DR gone wild - and wrong. The DR exercise is scheduled, then pushed off. Scheduled, pushed off.
Finally the day comes like the first day of school and everyone is aligned to test the DR plan. Unfortunately, too many things have changed since the DR instance was designed and the exercise is a failure. Since business is generally good, the can is kicked down the road to some future utopian alignment of the stars when it can be rescheduled and everyone will have fixed everything on the backlog discovered during the exercise.
Disaster Recovery that Wasn’t
Throughout my career I have been invited to participate and support the physical security for many of these exercises. One was in San Jose over a decade ago. A $1.5M large 53 foot trailer with generator and satellite (but no bathroom or cooking facilities …) was procured and sat on its own concrete pad fenced in and tied into our security system. The day came after months of tabletop planning to exercise the DR plan and nothing worked. The laptops - which were already three years old when reallocated to the trailer several years earlier - couldn’t run the latest service pack and as a result could not VPN into the network. The satellite uplink could barely support two or three laptops with email only. All the money invested was a waste and the trailer was eventually donated to a non-profit and the cement pad turned into additional parking.
Several years later while I was responsible for operations in AsiaPac, IT approached us again to do a tabletop exercise and then a simulated exercise in india. My team worked with local emergency services to simulate a fire and spice up the annual building evacuation exercise and IT was to also exercise their DR plan. The evacuation exercise had a lot of theatrics and was considered a great success with explosions, fire crews rushing in and spraying the building. At the same time, a group of developers boarded a bus with their laptops and after over an hour and a half in traffic showed up to the DR site IT had been paying for only to find a real fire had occurred a few months previous and the vendor had failed to inform anyone. Instead of a “plug and play” workspace, they found a concrete bunker dark and still smelling of smoke with a few folding tables and chairs with extremely slow and limited connectivity. They got back on the bus and found a local coffee shop where they were a lot more successful logging on to the network.
These two examples are of physical failures, but I use them to illustrate the limitations imposed by theoretical DR.
Tom Kevan, a founding partner, recalls how once they had set up three instances at eBay for web traffic, gone were the days of overnight outages because they were able to take down the third instance, perform the upgrades and test, and bring it back online. All the while customers never knew he was running “disaster recovery” because it was part of daily operations for web traffic.
Living in 24/7 Disaster Recovery - Active/Active Architecture
Strong, vibrant companies live in a DR world. This means they design with the “rule of three” - always have three active, complete, and independent stacks in production. This allows for continuous maintenance and upgrades on one system while always having at least two active/active systems in production. During peak times, all three are up and taking traffic. The goal is for DR to become your everyday operating cadence, not an unusual event.
As you formalize your architecutural principles, remove the words “disaster recovery” and plan to be forever resilient as table stakes for being in business.
The information on the following slides are what we cover in our three day workshops and often share in our final report with our clients when architectural principles are discovered to be lacking during a technical due diligence.
Whether you are a Fortune 100 or brand new start up with 3 employees, we have a proven track record of helping companies your size build a strong foundation for scalability through rapid growth, cost savings during downturns, and resiliency regardless of how business is doing. Give us a call and let us help!
‹ First < 2 3 4 5 6 > Last ›