
Growth Blog

Scalability and Technology Consulting Advice for SaaS and Technology Companies

Architecture Principles: Messaging Systems – Smart End Points, Dumb Pipes

July 29, 2019  |  Posted By: Marty Abbott

Asynchronous messaging systems are a critical component of many highly scalable and highly available architectures.  But, as with any other architectural component, these solutions need attention to ensure availability and scalability.  The solution should scale along at least one of the scale cube axes (X, Y or Z).  It should also both include and enable the principle of fault isolation.  Finally, it should scale both gracefully and cost effectively while enabling high levels of organizational scale.  These requirements bring us to the principle of Smart End Points and Dumb Pipes.

Fast time to market within software development teams is best enabled when we align architectures and organizations such that coordination between teams is reduced (see Conway’s Law and our white paper on durable cross functional product teams).  When services within an architecture communicate, especially when one service “publishes” information for consumption by multiple services, the communication often needs to be modified or “transformed” for the benefit of the consumers.  This transformation can happen at the producer, within the transport mechanism, or at the consumer.  Transformation by the producer for the sake of the consumer makes little sense: the producer service and its associated team have little knowledge of consumer needs, and it creates an unnecessary coordination task between producer and consumer.  Transformation “in flight” within the transport similarly implies a team of engineers who must be knowledgeable about all producers and consumers, which is another unnecessary coordination activity.  Transformation by the consumer makes the most sense, as the consumer knows best what it needs from the message, and this approach eliminates reliance upon and coordination with other teams.  The principle of smart end points and dumb pipes therefore yields the lowest coordination between teams, the highest level of organizational scale, and the best time to market.

To be successful in achieving a dumb pipe, we introduce the notion of a pipe contract.  Such a contract explains the format of messages produced on and consumed from the pipe.  It may indicate that the message will be in a tag-delimited or structured format (XML, YAML, etc.), abide by certain start and end delimiters, and, for the sake of extensibility, allow custom tags for new information or attributes.  The contract may also require that consumption not be predicated on a strict order of elements (e.g., title is always first) but rather on strict adherence to tag and value regardless of where each tag appears in the message.
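To make the idea concrete, here is a minimal, hypothetical sketch (the message format, tag names, and helper below are illustrative assumptions, not a published AKF contract): the consumer, acting as the smart end point, extracts the tags it needs by name, tolerates any element ordering, and ignores custom extension tags it does not understand.

```python
import xml.etree.ElementTree as ET

# Hypothetical message conforming to a "dumb pipe" contract: well-formed XML,
# known tags may appear in any order, and unknown custom tags must be ignored.
RAW_MESSAGE = """
<order>
  <custom-promo-code>SUMMER19</custom-promo-code>  <!-- extension tag: safely ignored -->
  <total>129.99</total>
  <order-id>A-1001</order-id>
  <currency>USD</currency>
</order>
"""

def consume(raw: str) -> dict:
    """Consumer-side ("smart end point") transformation.

    The consumer extracts only the tags it needs, by name rather than by
    position, so producers may add new tags or reorder elements without
    coordinating a change with this team.
    """
    root = ET.fromstring(raw)
    wanted = {"order-id", "total", "currency"}
    return {child.tag: child.text for child in root if child.tag in wanted}

if __name__ == "__main__":
    print(consume(RAW_MESSAGE))  # {'total': '129.99', 'order-id': 'A-1001', 'currency': 'USD'}
```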

[Figure: Smart End Points, Dumb Pipes – message contract]

By ensuring that the pipe remains dumb, the pipe can scale both more predictably and more cost effectively.  As no transformation compute happens within the pipe, its sole purpose becomes the delivery of messages conforming to the contract.  Large messages do not go through computationally complex transformation, meaning low compute requirements and therefore low cost.  The lack of computation also means no odd “spikes” as transforms stall delivery and eat up valuable resources.  Messages are delivered faster (lower latency).  An additional benefit is that because transforms aren’t part of message transit, an entire class of failure (computational/logical) cannot hinder message service availability.

The 2x2 matrix below summarizes the options here, clearly indicating smart end points and dumb pipes as the best choice.

[Figure: Smart End Points, Dumb Pipes comparison 2x2 matrix]

One important callout here is that “stream processing” – evaluation of message content performed off the messaging platform – is not a violation of the smart end points, dumb pipes concept.  The solutions performing stream processing are simply consumers and producers of messages, subscribing to the contract and transport of the pipe.

Summarizing all of the above, the benefits of smart end points and dumb pipes are:

  1. Lower cost of messaging infrastructure – pushes the cost of goods sold closer to the producer and consumer.  Allows messaging infrastructure to scale by number of messages instead of by the computational complexity of messages.  License cost is reduced as fewer compute nodes are needed for message transit.
  2. Organizational scalability – teams aren’t reliant on transforms created by a centralized team.
  3. Low latency – because computation is limited, messages are delivered more quickly and predictably to end consumers.
  4. Capacity and scalability of messaging infrastructure – increased significantly as compute is not part of the scale of the platform.
  5. Availability of messaging infrastructure – because compute is removed, so is an entire class of failure.  As such, availability increases.

Two critical requirements for achieving smart end points and dumb pipes:

  • Message contracts – all messages need to be of defined form.  Producers must adhere to that form as must consumers.
  • Team behaviors – must assure adherence to contracts.

AKF Partners helps companies build scalable, highly available, cost effective, low-latency, fast time to market products.  Call us – we can help!


The Fallacies of DR

July 29, 2019  |  Posted By: Bill Armelin


On February 7, 2019, Wells Fargo experienced a major service interruption to its customer-facing applications. The bank blamed a power shutdown at one of its data centers in response to smoke detected in the facility. Customers continued to experience the effects for several days. How could this happen? Aren’t banks required to maintain multiple data centers (DCs) so they can fail over when something like this happens? While we do not know the specifics of Wells Fargo’s situation, AKF has worked with several banks, and we know the answer is yes. This event highlights an issue we have seen time and time again: Disaster Recovery (DR) usually does not work.

Don’t government regulations require some form of business continuity? If the company loses a data center, shouldn’t the applications run out of a different data center? The answer to the former is yes, and the answer to the latter should be yes. These companies spend millions of dollars setting up redundant systems in secondary data centers. So, what happens? Why don’t these systems work?

The problem is these companies rarely practice for these DR events. Sure, they will tell you that they test DR yearly. But many times, this is simply to check a box on their yearly audit. They will conduct limited tests to bring up these applications in the other data center, and then immediately cut back to the original. Many times, supporting systems such as AuthN, AuthZ and DNS are not tested at the same time. Calls from the tested system go back to the original DC. The capacity of the DR system cannot handle production traffic. They can’t reconcile transactions in the DR instance of ERP with the primary. The list goes on.

What these companies don’t do is prepare for the real situation. There is an old adage in the military that says you must “train like you fight.”  This means that your training should be as realistic as possible for the day that you will actually need to fight. From a DR perspective, this means that you need to exercise your DR systems as if they were production systems. You must simulate an actual failure that invokes DR. This means that you should be able to fail over to your secondary DC and run indefinitely. Not only should you be able to run out of your secondary datacenter, you should regularly do it to exercise the systems and identify issues.

Imagine cutting over to a backup data center when doing a deployment. You run out of the backup DC while new code is being deployed to the primary DC. When the deployment is complete, you cut back to the primary. Once the new deployment is deemed stable, you can update the secondary DC. This allows you to deploy without downtime while exercising your backup systems. You do not impact your customers during the deployment process, and you know that your DR systems actually work.

DR Configurations

How do companies typically set up their DR? Many times, we see companies use an Active/Passive (Hot/Cold) setup. This is where the primary systems run out of one DC and a second (usually smaller) DC houses a similar setup. Systems synchronize data to backup data stores.  The idea is that during a major incident, they start up the systems in the secondary DC and redirect traffic to it. There are several downsides to this configuration. First, it requires running an additional set of servers, databases, storage and networking, driving the cost to roughly 200% of what is needed to run production traffic. Second, it is slow to get started. For cost reasons, companies keep the majority of these systems shut down and start them when needed. It takes time to get the systems warmed up to take traffic. During major incidents, teams avoid failing over to the secondary DC, trying instead to fix the issues in the primary DC. This extends the outage time. When they do fail over, they find that systems that haven’t run in a long time don’t work properly or are undersized for production traffic.

Companies running this configuration complain that DR is expensive. “We can’t afford to have 100% of production resources sitting idle.” Companies that choose Active/Passive DR typically have not had a complete and total DC failure, yet.

So, companies don’t want to have an additional 100% set of untested resources sitting idle. What can they do? The next configuration to consider is running Active/Active. This means that you run your production systems out of two datacenters, sending a portion of production traffic (usually 50%) to each. Each DC synchronizes its data with the other. If there is a failure of one DC, divert all of the traffic to the other. Fail over usually happens quickly since both DCs are already receiving production traffic.

This doesn’t fix the cost issue of having an additional 100% of resources in a second DC. It does fix the issue of the systems not working in the other DC. Systems don’t sit idle and are exercised regularly.

While this sounds great, it is still expensive. Is there another way to reduce the total cost of DR? The answer is yes. Instead of having two DCs taking production traffic, what if we use three? At first glance, it sounds counterintuitive. Wouldn’t this take 300% of resources? Luckily, by splitting traffic across three (or more) datacenters, we no longer need 100% of the resources in each.

In a three-way active configuration, we only need 50% of the total capacity in each DC. From a data perspective, each DC houses 100% of its own data and 50% of each of the other DCs’ data (see table below). This configuration can handle a single DC failure with minimal impact to production traffic. And because each DC needs less capacity, the total cost of three active sites is approximately 166% (vs. 200% for two). An added benefit is that you can pin your customers to the closest DC, resulting in lower latency.

                                               
Distribution of Data in a Multi-site Active Configuration

            Datacenter A    Datacenter B    Datacenter C
Data A      100% A          50% A           50% A
Data B      50% B           100% B          50% B
Data C      50% C           50% C           100% C
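A rough, back-of-the-envelope sketch of the compute capacity math follows (a hedged illustration of the underlying formula, not AKF’s cost model; the ~166% figure quoted above presumably also reflects data replication and other overhead that raw compute alone does not capture):

```python
def active_site_capacity(n_sites: int) -> float:
    """Fraction of total production capacity each active site must hold
    so that the remaining sites can absorb all traffic if one site fails."""
    if n_sites < 2:
        raise ValueError("need at least two active sites to survive a failure")
    return 1.0 / (n_sites - 1)

for n in (2, 3, 4):
    per_site = active_site_capacity(n)
    total = n * per_site
    print(f"{n} active sites: {per_site:.0%} capacity per site, {total:.0%} total compute")

# 2 active sites: 100% capacity per site, 200% total compute
# 3 active sites: 50% capacity per site, 150% total compute
# 4 active sites: 33% capacity per site, 133% total compute
```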


Companies that rely on Active/Passive DR typically have not experienced a full datacenter outage that has caused them to run from their backup systems in production. Tests of these systems allow them to pass audits, but that is usually it. Tests do not mimic actual failure conditions. Systems tend to be undersized and may not work. An Active-Active configuration will help but does not decrease costs. Adopting a Multi-Site Active DR configuration will result in improved availability and lower costs over an Active/Passive or Active/Active setup.

Do you need help defining your DR strategy or architecture? AKF Partners can conduct a technology assessment to get you started.


July 26, 2019  |  Posted By: AKF

Running a technology company is a challenging endeavor.  Not only are consumer demands changing daily, the technology to deliver upon those demands is constantly evolving.  Where you host your infrastructure and software, what your developers code in, what version you are on, and how you are poised to deliver a quality product are not the same as they were 20 years ago, probably not even 5 or 10 years ago.  And these should all be good things.  But underlying all of those things is a common denominator: people.  In Seed, Feed, Weed I outlined what companies need to do in order to maintain a stable of great employees.  This article delves into the Seed aspect a little more.

What is Seed?

At its core, seed is hiring the best people for the job.  Unfortunately, it takes a little bit of work to get to that.  If it was that easy, then this is where the article would end…

But it doesn’t.

Seed is not just your hiring managers working with the specific labor pool available to them.  It needs to be more than that.  It needs to be an ever-evolving, ever-responsive organism within your organization.

If your HR recruiting office is still hiring people like it did in the ’90s, then don’t be surprised when you get talent on par with ’90s capability.  No longer can you sit back and wait for the right candidate to come to you, because chances are what you are hiring for is buried under a million other similar job postings in your area.  Your desired future candidates are out going to meetups, conferences, and other networking events.  To meet them, you too need to be in attendance.

If you are able to hire a future employee from a conference where other employers are present, that is a great indicator of where your company stands.  If you can’t stand at least shoulder-to-shoulder with your competitors, then you will never be able to hire the best people.


Location

There are many great advantages to the shrinking of the world through telecommunications.  If a certain skillset is only available half-way around the world, today’s technology makes it much easier to overcome the distance challenge.  This isn’t to say the debate over off-shore vs. near-shore vs. in-house has a clear winner, but there are many more options.

So where should you be looking?  Do you want quality or quantity?  If quality matters, start where competition in your sector is heaviest.  If quantity matters, any place will do.  But hopefully you want quality.  Almost anyone can sit at a desk for 8 hours.  Very few talented programmers can adapt your current architecture to meet the demands of a market in 6 months.

If your company is afraid to enter a competitive technology market geography because of fear it won’t be able to hire more employees than the competition, then that should be a red flag.  Challenge breeds greatness.

Hiring

The hiring process itself should be iterative and multi-faceted.  Sure, it is nice to be able to tell a prospective candidate they will go through two 30-minute phone screens followed by two 1-hour onsite interviews, but maybe that job, or that candidate, needs something a little more, or a little less.

Don’t be afraid to deviate from your standard approach based upon the role or the potential future employee.  Just make sure candidates are aware of the change and why you are deviating from what they were told.  This will give them a chance to shine more.  Recently, I was part of a hiring process that should have involved two 30-minute phone screens and one 2-hour onsite.  That 2-hour onsite was deemed not long enough because the candidate and the future employer spent too much time discussing the minutiae of various implementations of an engineering plan.  And that’s ok.  They then asked the candidate to do a video conference where he stepped through the code base.  But they let him know why they needed that follow-on.  It wasn’t to test him further.  It was because he had simply “clicked” too well with the engineering aspect and time ran away from them.

Additionally, it shouldn’t just be technology team members involved in hiring developers.  Far too often a new employee has trouble meshing with the culture of the organization or team because they were asked purely technical questions or presented only with technical scenarios.  Have someone from People Operations or Marketing involved as well.  This will help form a fuller picture of the candidate and provide them with more knowledge of the company.

Far too often companies are so focused on their hyper growth that getting “butts in seats” matters more than getting the right people.  Nine times out of 10, one great employee is going to be better than three okay employees.

We’ve helped dozens of companies fill interim roles while we helped them find great employees.  If you need assistance in identifying great employees and Seeding your company appropriately, AKF can help.


Normalization of Deviance and Software...Oh and Nasa

July 22, 2019  |  Posted By: Eric Arrington


It’s funny how clearly you can remember some events from your childhood. I remember exactly where I was on Jan 28th, 1986.

All the kids in Modoc Elementary School had been ushered into the Multi Purpose Room. It was an exciting day. We were all going to watch the Challenger Shuttle Launch. The school was especially excited about this launch. A civilian school teacher was going into space.

I was sitting right up front (probably so the teacher could keep an eye on me). I had on the paper helmet I had made the day before. I was ready to sign up for NASA. We all counted down and then cheered when the shuttle lifted off.

Seventy-three seconds in something happened.

There was an obvious malfunction. For once the kids were silent. Teachers didn’t know what to do. We all sat there watching. Watching as the Challenger exploded in mid air, taking the lives of all 7 crew members aboard.

How could this have happened? Some fluke accident after all that careful planning? This was NASA. They thought of everything right?

I recently picked up a book by Dr. Diane Vaughan called The Challenger Launch Decision. Vaughan isn’t an engineer, she is a sociologist. She doesn’t study Newtonian Mechanics. She studies social institutions, cultures, organizations, and interactions between people that work together.

She wasn’t interested in O-rings failing. She wanted to understand the environment that led to such a failure.

She realized that it’s easy for people to rationalize shortcuts under pressure. Let’s be honest, do any of us not work under a certain amount of pressure? The rationalization gets even easier when you take a shortcut and nothing bad happens. Lack of a “bad outcome” can actually justify the shortcut.

After studying the Challenger Launch and other failures, Vaughan came up with the theory for this type of breakdown in procedure. She called this theory the normalization of deviance. She defines it as:

The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization

In other words, the gradual breakdown in process where a violation of procedure becomes acceptable. One important key is, it happens even though everyone involved knows better.


Normalization of Deviance and What Happened at NASA

Prior to the launch, NASA became more and more focused on hitting the launch date (sound familiar?). Deviations from established procedures kept popping up. Instead of reevaluating and changing things, the deviations were accepted. Over time these deviations became the new normal.

Erosion of the O-rings had occurred on flights before the date of the launch. It wasn’t a new occurrence. The issue was, erosion past the O-rings wasn’t supposed to happen at all. Yet it was happening on every flight. The engineers scratched their heads and made changes, but the erosion kept happening. They argued that yes, it was happening, but it was stable, so it could be ignored.

In other words, the O-rings didn’t completely fail so it was ok. A condition that was at one time deemed unacceptable was now considered to be acceptable. The deviance had become the new normal. This deviance led to the death of 7 people and scarred a bunch of my classmates for life (don’t worry I was ok).

Normalization of Deviance

Normalization of deviance doesn’t only happen at NASA. Their failures tend to garner more attention though. When you’re sitting on more than 500,000 gallons of liquid oxygen and liquid hydrogen the failures are spectacular.

Most of us don’t work in a job where a failure can cost someone their life. That doesn’t mean these principles don’t apply to us. Normalization of deviance happens in all industries.

There is a study of how the normalization of deviance affects healthcare. The author, John Banja, identifies 7 factors that contribute to normalizing unacceptable behaviors. These 7 factors are extremely relevant to us in the software industry as well. Here are his seven factors and some takeaways for the software world.

1. The rules are stupid and inefficient!

I am sure you have never heard this at your company before. A good alternative would be, “management doesn’t understand what we are doing. Their rules slow us down.”

In this situation the person violating the rule understands the rule. He just doesn’t think management understands his job. The rule was handed down by someone in management who doesn’t know what it’s like to be “in the trenches.”

Guess what? Sometimes this is true. Sometimes the rules are stupid and inefficient and are created by someone that is out of touch. What is the solution? Don’t ignore the rule. Go find out why the rule is there.

2. Knowledge is imperfect and uneven.

In this case, the “offender” falls under 3 possible categories:

They are unaware that the rule exists.

They might know about the rule but fail to get why it applies to them.

They have been taught the deviant practice by other co-workers.

This is especially a problem in a culture where people are afraid to ask for help. This problem gets compounded with every new hire. Have you ever asked why a certain thing was done at a new job and heard back, “I don’t know, that’s just how things are done here”?

Foster a culture where it is acceptable to ask questions. New hires and juniors should feel empowered to ask “why.”

3. The work itself, along with new technology, can disrupt work behaviors and rule compliance.

We all do complex work in a dynamic environment. It’s unpredictable. New technologies and new environments can lead us to come up with solutions that don’t perfectly fit established procedures. Engineers are forced to come up with answers that might not fit in the old documented standards.  

4. I’m breaking the rule for the good of my patient!

We don’t have patients, but we can see this in our world as well. Substitute the word user for patient. Have you ever violated a procedure for the good of the user or ease of integration with a colleague?

What would be a better solution? If it’s a better way and you don’t see any negative to doing it that way, communicate it. It might be beneficial to everyone to not have that rule. Have a discussion with your team about what you are trying to do and why. Maybe the rule can be changed or maybe you aren’t seeing the whole picture.

5. The rules don’t apply to me/you can trust me.

“It’s a good rule for everyone else but I have been here for 10 years. I understand the system better than everyone. I know how and when to break the rules.”

We see this a lot as startups grow up. Employee #2 doesn’t need to follow the rules, right? She knows every line of code in the repo. Here is the problem: developers aren’t known for our humility. We all think we are that person. We all think we understand things so well that we know what we can get away with.

6. Workers are afraid to speak up.

The likelihood of deviant behavior increases in a culture that discourages people from speaking up. Fear of confrontation, fear of retaliation, “not my job” attitudes, and lack of confidence all make it easier to ignore something even though it’s wrong.

Let’s be honest, as developers we aren’t always highly functioning human beings. We are great when our heads are down and we’re banging on a keyboard, but when we are face to face with another human? That’s a set of tools most of us don’t have in our quiver.

This is especially difficult in a relationship between a junior and a senior engineer. It’s hard for a junior engineer to point out flaws or call out procedure violations to a senior engineer.

7. Leadership withholding or diluting findings on system problems.

We know about deviant behavior, we just dilute it as we pass it up the chain of command. This can happen for many reasons but can mostly be summed up as “company politics.” Maybe someone doesn’t want to look bad to superiors, so they won’t report the incident fully. Maybe you don’t discipline a top performer for unacceptable behavior because you are afraid they might leave.

You also see this in companies that have a culture where managers lead with an iron fist. People feel compelled to protect coworkers and don’t pass information along.

How Do You Fix It?

This happens everywhere. It happens at your current job, at home, with your personal habits, driving habits, diet and exercise; it’s everywhere. There are 3 important steps to fighting it.

Creating and Communicating Good Processes

It’s simple, bad processes lead to bad results. Good processes that aren’t documented and/or accessible lead to bad results. Detailed and documented processes are the first step to fixing this culture of deviance.

Good documentation helps you maintain operational consistency. The next step is to make sure each employee knows the process.

Create good processes, document them, train employees, and hold everyone accountable for maintaining them.

Create a Collaborative Environment

This is especially true when creating new processes. Bring the whole team in to discuss. People should feel some ownership over the process they are accountable for.

Remember, normalization of deviance is a social problem. If a process is created as a group then the social need to adhere to it as a group is more powerful.

This also addresses factor #1, “the rules are stupid and inefficient.”  If the team makes the rules, then they will be more likely to follow them.

Create a Culture of Communication

The key to fighting normalization of deviance is to remember that everyone involved usually knows better. If employees are consistently accepting deviations from accepted procedures, find out why.

A great way to see this in action is to watch what happens when a new hire comes to the team with an alert. How does the team react? Do they brush them off? If so, then you probably have a team that is accepting deviant practices.

Employees should feel empowered to “hit the e-stop” on their processes and tasks. Employees, especially juniors, should be encouraged to question the established order of things. They need to feel comfortable asking “why?”.

Conventional wisdom needs to be questioned. Those questioning it will be wrong most of the time, which gives you an opportunity to explain why you do things the way you do. When they are right, you make the procedure better. It’s a no-lose situation.

Key Takeaways

As you can tell, most of the solutions are the same: Communication. Creating a culture of communication is the only way to keep from falling into this trap. Empower your employees to question the status quo. You will create stronger teams, better ideas, and improved performance.

There is only one way to catch normalization of deviance before it sets in: Create a culture of honesty, communication, and continuous improvement.

Sometimes it’s hard to judge this in your own culture. I call this “ship in a bottle” syndrome. When you’re in the bottle it’s hard to see things clearly. AKF has helped hundreds of software companies change their culture. Give us a call, we can help.


What are Microservices?

July 21, 2019  |  Posted By: AKF

Microservices are an architectural approach that emerged from service-oriented architecture.  The approach emphasizes self-management and lightweight services as the means to improve software agility, scalability, velocity, and team autonomy. In essence, microservices are an approach to solution decomposition as described in the AKF Scale Cube.

The approach decomposes or disintegrates an application into multiple services. Each microservice should be:

  1. independently deployed (or capable of independent deployment)
  2. independently executable (not dependent on another service for execution)
  3. an “owner” of some unique business capability
  4. owned by a single team (no two teams own the same microservice)
  5. an owner of its own data store and ideally the only solution accessing that store

   
The approach logically simplifies a software-centric understanding of business capabilities. 
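As a minimal, hypothetical sketch of the properties listed above (the service name, port, and schema are made-up assumptions, not a reference implementation), a single-team “inventory” service that owns its own data store and exposes one business capability over HTTP might look like this:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# This service "owns" its data store: no other service reads or writes this file.
DB = sqlite3.connect("inventory.db", check_same_thread=False)
DB.execute("CREATE TABLE IF NOT EXISTS inventory (sku TEXT PRIMARY KEY, qty INTEGER)")
DB.execute("INSERT OR IGNORE INTO inventory VALUES ('widget-1', 42)")
DB.commit()

class InventoryHandler(BaseHTTPRequestHandler):
    """One business capability (inventory lookup), independently deployable."""

    def do_GET(self):
        sku = self.path.strip("/")
        row = DB.execute("SELECT qty FROM inventory WHERE sku = ?", (sku,)).fetchone()
        status, body = (200, {"sku": sku, "qty": row[0]}) if row else (404, {"error": "unknown sku"})
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Runs standalone -- no other service is required for it to execute.
    HTTPServer(("0.0.0.0", 8080), InventoryHandler).serve_forever()
```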

Figure 1 - A sample application composed of microservices

Size of a Microservice

Generally, the term “micro” is an unfortunate one, as teams tend to misread it as meaning “a small number of lines of code” or “a single task”.  To help answer the sizing question, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By considering these, you can decide whether your service should be comparatively large or small in lines of code, objects/methods, functions, etc., or split up into smaller, individual services and swim lanes. Put another way, consider “micro” as a comparison to a singular, monolithic or “macro” solution.

Splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served to focus their resources on developing a marketable product than by fine-tuning their service sizes using the considerations below.
See the full article here.

The illustration below can be used to quickly determine the size (in functionality) of any given service.

Figure 2 - Determining service size

Loosely coupled

Loose coupling is an essential characteristic of microservices. Any microservice should be capable of independent deployment; zero coordination with other microservices or other teams should be necessary for a deployment. This loose coupling enables frequent and rapid deployments, decreasing time to market for value creation within a product.

Implementation

Each microservice is scaled by running multiple instances of it as in the X axis of the AKF Scale Cube. There are many processes to handle, and memory and CPU requirements are an important consideration when assessing the cost of operation of the entire system. Container technologies are often employed to aid with ease of deployment.  Traditional Java EE stacks are less desirable for microservices from this point of view because they are optimized for running a single application container, not a multitude of containers.  Node.js and Go are more common as they are more lightweight and require less memory and CPU power per instance.

In theory, it is possible to create a microservice system in which each service uses a different language and stack (a polyglot implementation). Such a polyglot implementation has both advantages and disadvantages.  Generally speaking, in smaller companies, economy of scale, code reuse, and developer skills all set an upper bound of no more than 2 to 3 “stacks”.

Benefits of Microservices

As software increases in complexity, the ability to separate functional areas in what would otherwise be a monolith into sets of independent services can yield many benefits, which include, but are not limited to the following:

  • More efficient debugging – no more jumping through multiple layers of an application; in essence, better fault isolation
  • Accelerated software delivery – smaller, easier to understand code bases owned by a single team increase velocity as a result of lower communication and coordination overhead
  • Scalability – microservices lend themselves to being integrated with other applications or services via industry-standard interfaces such as REST and can be scaled independently relative to their individual request rates
  • Fault tolerance – reduced downtime due to more resilient services, assuming that proper fault isolation and bulkheads are in place
  • Reusability – as microservices are organized around business capabilities and not a particular project, they can be reused and easily slotted into other projects or services, thereby reducing costs
  • Deployment – as everything is encapsulated into separate microservices, you only need to deploy the services that you’ve changed and not the entire application. A key tenet of microservice development is ensuring that each service is loosely coupled with existing services, as mentioned earlier.
  • Polyglot – each service can be developed in its own language and run on its own infrastructure and runtime stack.  This allows teams to diversify to maximize the opportunity to tap a market’s skill sets or to operate across various geographies, where each geography may have unique talent and skills

Challenges of Microservice Architecture

As with any architecture, microservices come with certain concerns and risks. Put another way, the approach is not a panacea.

  • Too many coding languages – yes, we listed this as a benefit, but it can also be a double-edged sword.  Too many languages could, in the end, make your solution unwieldy and potentially difficult to maintain.
  • Integration – you need to make a conscious effort to ensure your services are as loosely coupled as they possibly can be (yes, mentioned earlier too). Otherwise, a change to one service will have a ripple effect on additional services, making service integration difficult and time-consuming.
  • Integration testing – testing one monolithic system can be simpler as everything is in “one solution”, whereas a solution based on a microservices architecture may have components that live on other systems and/or environments, making it harder to configure an “end to end” test environment.
  • Communication – microservices naturally need to interact with other services. Each service will depend on a specific set of inputs and return specific outputs; these communication channels need to be defined as specific interface standards and shared with your team. Failures between microservices can occur when interface definitions haven’t been adhered to, which can result in lost time.
  • Unique failures – microservices can introduce unique failure modes, such as deadlock, when multiple services and data stores are aggregated.  Race conditions are also a more common problem with the proliferation of services.  Teams need to take great care to think through these possibilities when defining service boundaries.
  • Multiplicative effect of failure – deployment architectures are important for microservices, as chaining services together creates a multiplicative effect of failure that reduces availability (see the short calculation after this list).  When developing deployment architectures, choose services for breadth and libraries for depth to increase availability and reduce failure probability.  Peruse our patterns and anti-patterns list for a better understanding of what to do and what not to do with microservices.
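A short, hedged illustration of why chained (depth-wise) service calls multiply failure probability; the service count and availability figure are arbitrary examples, not measurements from any particular system:

```python
def chained_availability(per_service: float, depth: int) -> float:
    """Availability of a request that must traverse `depth` services in series,
    each independently available `per_service` of the time."""
    return per_service ** depth

# Five services at 99.9% availability each, called in series:
print(f"{chained_availability(0.999, 5):.4%}")  # ~99.5010% -- roughly five times the downtime of one service
```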

Don’t even think about Microservices without DevOps

Microservices allow you to respond quickly and incrementally to business opportunities. Incremental and more frequent delivery of new capabilities drives the need for organizations to adopt DevOps practices. 

Microservices cause an explosion of moving parts. It is not a good idea to attempt to implement microservices without serious deployment and monitoring automation. You should be able to push a button and get your app deployed. In fact, you should not even have to do that much: committing code should get your app deployed through the commit hooks that trigger the delivery pipelines, at least in development. You still need some manual checks and balances for deploying into production.

You no longer just have a single release team to build, deploy, and test your application. Microservices architecture results in more frequent and greater numbers of smaller applications being deployed.

DevOps is what enables you to do more frequent deployments and to scale to handle the growing number of new teams releasing microservices. DevOps is a prerequisite to being able to successfully adopt microservices at scale in your organization.

Teams that have not yet adopted DevOps must invest significantly in defining release processes and corresponding automation and tools. This is what enables you to onboard new service teams and achieve efficient release and testing of microservices. Without it, each microservice team must create its own DevOps infrastructure and services, which results in higher development costs. It also means inconsistent levels of quality, security, and availability of microservices across teams.

As you begin to reorganize teams to align with business components and services, also consider creating microservices DevOps teams who provide the cross-functional development teams with tool support, dependency tracking, governance, and visibility into all microservices. This provides business and technical stakeholders greater visibility into microservices investment and delivery as microservices move through their lifecycle.

The DevOps services team provides the needed visibility across the teams as to what services are being deployed, used by other teams, and ultimately used by client applications. This loosely coupled approach provides greater business agility.

Conclusion

Frequent releases keep applications relevant to business needs and priorities. Smaller releases mean fewer code changes, and that helps reduce risk significantly. With smaller release cycles, it is easier to detect bugs much earlier in the development lifecycle and to gain quick feedback from the user base. All of these are characteristics of a well-oiled microservices enterprise.

AKF Partners has helped to architect some of the most scalable, highly available, fault-tolerant and fastest response time solutions on the internet. Give us a call - we can help.


Implementing Scalable, Highly Available Messaging Services

July 19, 2019  |  Posted By: Marty Abbott

When AKF Partners uses the term asynchronous, we use it in the logical rather than the physical (transport mechanism) sense.  Solutions that communicate asynchronously do not suspend execution and wait for a return – they move off to some other activity and resume execution should a response arrive. 

Asynchronous, non-blocking communications between service components help create resilient, fault isolated (limited blast radius) solutions. Unfortunately, while many teams spend a great deal of time ensuring that their services and associated data stores are scalable and highly available, they often overlook the solutions that tend to be the mechanism by which asynchronous communications are passed.  As such, these messaging systems often suffer from single points of failure (physical and logical) and capacity constraints, and they may themselves represent significant failure domains if, upon their failure, no messages can be passed.

The AKF Scale Cube can help resolve these concerns.  The same axes that guide how we think about applications, servers, services, databases and data stores can also be applied to messaging solutions.


[Figure: AKF Scale Cube applied to messaging services]

X Axis

Cloning or duplication of messaging services means that anytime we have a logical service, we should have more than one instance available to process the same messages.  This goes beyond ensuring high availability of the service infrastructure for any given message queue, bus or service – it means that where one mechanism by which we send messages exists, another should be available, capable of handling traffic should the first fail.

As with all uses of the X axis, N messaging services (where N>1) can allow the passage of all similar messages.  Messages aren’t replicated across the instances, as doing so would eliminate the benefit of scalability.  Rather, each message is sent to one instance, while all producers and consumers connect to each of the N instances to produce and consume.  When an instance fails, it is taken out of rotation for production; when it returns, its messages are consumed and producers can resume sending messages through it.  Ideally the solution is active-active, with producers and consumers capable of interacting with all N copies as necessary.
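A minimal sketch of the X axis idea follows, simulated with in-memory queues rather than a real broker (the instance names and health flags are illustrative assumptions): the producer sends each message to exactly one healthy clone, while consumers drain all N clones.

```python
import itertools
import queue

# Simulate N cloned ("X axis") messaging instances with in-memory queues.
instances = {name: queue.Queue() for name in ("mq-a", "mq-b", "mq-c")}
healthy = {name: True for name in instances}   # instance health flags (illustrative)
rotation = itertools.cycle(instances)          # round-robin over the clones

def produce(message: str) -> str:
    """Send the message to exactly one healthy clone; skip failed instances."""
    for _ in range(len(instances)):
        name = next(rotation)
        if healthy[name]:
            instances[name].put(message)
            return name
    raise RuntimeError("no healthy messaging instance available")

def consume_all() -> list:
    """Consumers attach to every clone and drain whatever each one holds."""
    drained = []
    for name, q in instances.items():
        while not q.empty():
            drained.append((name, q.get()))
    return drained

healthy["mq-b"] = False          # take one instance out of rotation
for i in range(4):
    produce(f"event-{i}")        # traffic keeps flowing through the surviving clones
print(consume_all())
```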

Y Axis

The Y axis is segmentation by a noun (resource or message type) or verb (service or action).  There is very often a strong correlation between these.

Just as messaging services often have channels or types of communication, so might you segment messaging infrastructure by the message type or channel (nouns).  Monitoring messages may be directed to one implementation, analytics to a second, commerce to a third and so on.  In doing so, physical and logical failures can be isolated to a message type.  Unanticipated spikes in demand on one system would not slow down the processing of messages on other systems.  Scale is increased through “sharding” by message type, and messaging infrastructure can be grown cost effectively relative to the volume of each message type.

Alternatively, messaging solutions can be split consistent with the affinity between services.  Services A, B and C may communicate with each other but not need communication with D, E and F.  This affinity creates natural fault isolation zones and can be leveraged in the messaging infrastructure to isolate A, B and C from D, E and F.  Doing so provides benefits similar to the noun/resource approach above, allowing the solutions to scale independently and cost effectively.
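A hedged sketch of the Y axis split (the topic names and routing table are made-up examples): each message type, or affinity group of services, maps to its own messaging infrastructure so that a spike or failure in one cannot affect the others.

```python
# Hypothetical Y-axis routing table: each message type (noun) or service-affinity
# group gets its own, independently scaled messaging infrastructure.
Y_AXIS_ROUTES = {
    "monitoring": "mq-monitoring.internal:5672",
    "analytics":  "mq-analytics.internal:5672",
    "commerce":   "mq-commerce.internal:5672",
}

def broker_for(message_type: str) -> str:
    """Resolve which isolated messaging cluster handles this message type."""
    try:
        return Y_AXIS_ROUTES[message_type]
    except KeyError:
        raise ValueError(f"no messaging shard defined for type '{message_type}'") from None

print(broker_for("commerce"))   # a commerce spike never slows monitoring or analytics
```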

Z Axis

Whereas the Y axis splits different types of things (nouns or verbs), the Z axis splits “similar” things.  Very often this is along a customer or geography boundary.  You may, for instance, implement a geographically distributed solution in multiple countries, each country having its own processing center.  Large countries may be subdivided, allowing solutions to exist close to the customer and be fault isolated from other geographic partitions.

Your messaging solution should follow your customer-geography partitions.  Why would you carefully partition customers for fault isolation, low latency and scalability, but then rely on a common messaging solution shared by all segments?  A more elegant solution is for each partition to have its own messaging solution to increase fault tolerance and significantly reduce latency.  Even monitoring-related messages would ideally be handled locally and then forwarded, if necessary, to a common hub.
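Similarly, a hypothetical sketch of the Z axis (region names and hostnames are illustrative assumptions): the customer’s geographic partition selects a local messaging solution, and only summarized monitoring data is forwarded to a common hub.

```python
# Hypothetical Z-axis mapping: each customer-geography partition has its own
# local messaging solution; only summarized/monitoring data is forwarded onward.
Z_AXIS_ROUTES = {
    "us-east": "mq.us-east.internal",
    "eu-west": "mq.eu-west.internal",
    "apac":    "mq.apac.internal",
}
CENTRAL_HUB = "mq.global-hub.internal"

def brokers_for_customer(region: str, is_monitoring_summary: bool = False) -> list:
    """Route customer traffic to its local partition; forward summaries to the hub."""
    local = Z_AXIS_ROUTES.get(region)
    if local is None:
        raise ValueError(f"unknown customer partition '{region}'")
    return [local, CENTRAL_HUB] if is_monitoring_summary else [local]

print(brokers_for_customer("eu-west"))                           # ['mq.eu-west.internal']
print(brokers_for_customer("eu-west", is_monitoring_summary=True))
```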


We have held hundreds of on-site and remote architectural 2 and 3-day reviews for companies of all sizes in addition to thousands of due diligence reviews for investors. Contact us to see how we can help!


Architectural Principle: Use Commodity Hardware (and Cloud tools)

July 18, 2019  |  Posted By: Pete Ferguson

Use commodity hardware: cheaper is better most of the time. Focus on the need, not the frills. Why? Excess capabilities beyond the need add cost for little additional value.

The Tail (pun intended) of Dorothy-Boy the Goldfish

When my now-adult son was 5, he was constantly enamored, in the pet aisle of the local superstore, with the vast variety of fish of many sizes and colors, and he eventually convinced us to buy a goldfish.

We paid under $20, bowl, food, rocks, props and all, and “Dorothy-boy” came home with us (my son’s idea for a name, not mine). Of course, there were several mornings that Dorothy-boy was found upside down and a quick trip to the store and good scrubbing of the bowl remedied potential heartbreak before my son even knew anything was wrong. 

Contrast that to my grandmother’s beloved Yorkie Terrier, Sergeant. When Sarge got sick, my grandmother spent thousands on doctor’s office visits, specialized food, and several surgeries. The upfront cost of a well-bred dog was significant enough, the annual upkeep for poor little Sarge was astronomical, but he lived a good, spoiled, and well-loved life.

That is why at AKF we often use the analogy of “goldfish, not thoroughbreds” with our clients to help them make decisions on hardware and software solutions.

Implement only what you need, when you need it, avoiding extraneous features and capabilities. Why? Repeatable, incremental systems combine cost effectiveness with smaller impact of failures and easier additions to scale

If a “pizza box” 1U Dell or HP (or pick your brand) server dies, no biggie: you probably have a few others lying around or can purchase and spin up new ones in days, not months or quarters of a year. This also allows for quickly adding additional web servers, application servers, test servers, etc. The cost per compute cycle is very low and can be scaled very quickly and affordably.

“Cattle not pets” is another way to think about hardware and software selection. When it comes to your next meal (assuming you are not a vegetarian), what is easier to eat with little thought? A nameless cow or your favorite pet?

If your vendor is sending you on annual vacations (err, I mean business conferences) and providing your entire team with tons of swag, you are likely paying way too much in upfront costs and ongoing maintenance fees, licensing, and service agreements. Sorry, your sales rep doesn’t care that much about you; they like their commissions based on high markups better.

Having an emotional attachment to your vendors is dangerous as it removes objectivity in evaluating what is best for your company’s customers and future.

Untapped Capacity at a Great Cost

It is not uncommon for monolithic databases and mainframes to be overbuilt given the upfront cost, resulting in utilization of only 10-20% of capacity. This means there is a lot of untapped potential that is being paid for, but not utilized, year over year.

Trying to replace large, propriety systems is very difficult due to the lump sum of capital investment required. It is placed on the CapEx budget SWAG year after year and struck early on in the budgeting process as a CFO either dictates or asks, “can you live without the upgrade for one more year?”


We have one client that finally got budget approval for a major upgrade to a large system, and in addition to the substantial costs for hardware and licensing of software, they also have over 100 third-party consultants on-site for 18 months sitting in their cubicles (while they are short on space for their own employees) to help with the transition. The direct and indirect costs are massive, including the innovation that is not happening while resources are focused on an incremental upgrade to keep up, not get ahead.

The bloat is amazing and it is easy to see why startups and smaller companies build in the cloud and use opensource databases and in the process, erode market share from the industry behemoths with a fraction of the investment.

Commodities Defined

The goal of commodity systems and solutions is to get as much value for as minimal of an investment as possible. This allows us to build highly available and scalable solutions.

Focus on getting the maximum performance for the least amount of cost for:

  • Compute
  • Storage
  • Network

We often see an interesting dichotomy of architectural principles within aging companies – teams report there is “no money” for new servers to provide customers with a more stable platform, yet hundreds of thousands of dollars are sunk into massive databases and mainframes.

Vendor lock and budget lock are two reasons why going with highly customized and proprietary systems shackles a company’s growth.

Forget the initial costs for specialized systems – which are substantial – usually the ongoing costs for licensing, service agreements, software upgrade support, etc. required to keep a vendor happy would likely be more than enough to provide a moderately-sized company with plenty of financial headroom to build out many new redundant, and highly available, commodity servers and networks.

Properly implementing along all three axes of the AKF Scale Cube requires a lot of hardware and software - not easily accomplished if providing a DR instance of your database also means giving your first-born and second-born children to Oracle.

Does this principle apply to cloud?

With the majority of startups never racking a single server themselves, and many larger companies migrating to AWS/Azure/Google, etc. – you might think this principle does not apply in the new digital age.

But where there is a will (or rather, profit), there is a way … and as the race for who can catch up to Amazon for hosting market share continues, vendor-specific tools that drive up costs are just as much of a concern as proprietary hardware is in the self-hosting world.

Often our venture capital and investor clients ask us about their startups’ hosting fees: should they be concerned with the cost outpacing financial growth, or is it usual to see costs rise so quickly? Amazon and others have a lot to gain from providing discounted or free trials of proprietary monitoring, database, and other enhancements in hopes that they can ensure better vendor lock-in – and, fair enough, a service that you can’t get with the competition.

We are just as concerned with vendor lock-in the cloud as we are with vendor lock-in for self-hosted solutions during due diligence and architectural reviews of our clients.

Conclusions

  • Commodity hardware allows companies faster time to market, scalability, and availability
  • The ROI on larger systems can rarely compete as the costs are such a large barrier to entry and often compute cycles are underutilized
  • The same principles apply to hosted solutions – beware of vendor specific tools that make moving your platform elsewhere more difficult over time

We have held hundreds of on-site and remote architectural 2 and 3-day reviews for companies of all sizes in addition to thousands of due diligence reviews for investors. Contact us to see how we can help!


Enhancing your Product Security Posture and Shifting Left

July 15, 2019  |  Posted By: Larry Steinberg


It’s never been a better time to be a hacker or a developer of malware, as nearly every company has moved or is moving core functionality online, making these assets an open target for bad actors. Companies that move from the traditional on-premise model to delivering services from the cloud have now taken on the liability for operating their product and the associated security (or lack thereof).  When acquiring or investing in a company, there will be significant damage to value or reputation (or both) if a vulnerability is released into production and impacts the critical customer base.

Product security goes well beyond the network and system perimeter of old. Application functionality drives the need for making the product accessible to many different types of consumers. The traditional approach of locking down the perimeter and performing a late stage penetration test prior to release has many pitfalls:

  • Performing security testing at the end of the development cycle makes planning for a release date nearly impossible, or at the minimum, non-deterministic prior to security test results becoming available.
  • Patterns in coding tend to repeat themselves. So if you poorly code a database interaction (SQL injection), that will probably get replicated 10s or 100s of times. This leads to far more fixes in the later stage of delivery vs setting the best practice at the beginning.
  • Context switching is more costly than you might imagine. When 100k+ lines of code have been written, going back to fix code from the beginning or middle of the cycle requires considerable effort in ramping back up appropriate knowledge.

The overall goals for enhancing product security posture are multi-faceted:

  • Move security controls ‘left’ in the SDLC process (closer to the beginning).
  • Augment existing process with secure coding best practices.
  • Product security as a continuous concern – security must be as agile as your product and team.
  • Implement tools and controls via automation to achieve efficiency and compliance.
  • Enable planning for security remediations.
  • Reduce the expense of building secure software. Many studies have shown the expense of finding a bug late in the SDLC process will be 6-15x more than addressing the bug early in the design and coding process. Security vulnerabilities model the same expense curve.

Here are the phases of SDLC and suggestions for implementation:

Design Phase

During the component and system design phase, architecture reviews should focus on security threats in addition to the standard technical oversight.  Identifying all threats to the system and its design is critical, along with creating mitigations before the implementation phase begins, or at least early in the implementation phase.

Development Phase

There are multiple options for identifying security issues while writing code and performing check-ins. Static code analysis can be triggered on check-in to identify critical issues from the OWASP Top 10 and other similarly known bad patterns. Some IDEs also have the ability to assess code prior to check-in or code review. In both cases, the developer is notified nearly immediately when they’ve produced a potential vulnerability.
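As one hedged example of shifting this control left (assuming a Python codebase and that the open source Bandit static analyzer is installed; the source path and hook placement are illustrative, not prescriptive), a pre-commit hook can block a check-in that introduces high-severity findings:

```python
#!/usr/bin/env python3
"""Pre-commit hook sketch: fail the check-in if high-severity findings appear.

Assumes the Bandit static analyzer (pip install bandit) is available on PATH
and that the repository's source lives under ./src (an illustrative path).
"""
import subprocess
import sys

def main() -> int:
    # -r: scan recursively; -lll: report only high-severity findings.
    result = subprocess.run(["bandit", "-r", "src", "-lll"])
    if result.returncode != 0:
        print("Commit blocked: fix the high-severity findings above before committing.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Saved as a pre-commit hook and made executable, a script like this would run on every local commit; the same command can also run in the CI pipeline on every push so the feedback arrives well before a late-stage penetration test.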

Build Phase

As the code is built and linked with dependencies, you want to scan for free and open source software. Utilizing open source is great, but you need to assess the risk to your IP and the risk of malware. You should always know the origins of your code base and make intentional decisions about inclusion.  If you have a containerized environment, then the build phase is where you would implement some form of image scanning.

If you are leveraging Javascript and utilizing frameworks and package managers, then consider going beyond just scanning for the presence of open source. You are most likely building from internet-based artifacts, so how do you know if a vulnerability has been added to one of your dependencies? This is an on-ramp for malware into your environment. You need a solution that will proactively inform you if a dependent Javascript library has been flagged as containing malware.

Test Phase

The integration test phase (where all of your components come together for verification) provides a great place for gray box security testing. These security tests bypass perimeter-style tools like WAFs and firewalls, allowing the tests to focus on runtime behavior. Common exploits can be validated, like cross-site scripting, SQL injection, XML injection, etc.
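A hedged sketch of such a gray box test is below; the endpoint URL, parameter name, and expected status codes are assumptions about a hypothetical service, and the tests lean on the common requests and pytest libraries:

```python
import requests

BASE_URL = "http://integration-env.internal/api/search"   # hypothetical internal endpoint

SQLI_PROBES = ["' OR '1'='1", "1; DROP TABLE users --"]
XSS_PROBE = "<script>alert(1)</script>"

def test_search_rejects_sql_injection():
    for probe in SQLI_PROBES:
        resp = requests.get(BASE_URL, params={"q": probe}, timeout=5)
        # The service itself (not a WAF) should reject or safely handle the input.
        assert resp.status_code in (400, 422), f"unexpected status for probe {probe!r}"

def test_search_does_not_reflect_unescaped_html():
    resp = requests.get(BASE_URL, params={"q": XSS_PROBE}, timeout=5)
    assert XSS_PROBE not in resp.text, "payload reflected unescaped -- possible XSS"
```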

Runtime Phase

Traditionally this is where 3rd party penetration tests would occur in pre-production or production environments.  Containerization provides another opportunity to oversee the runtime environment to ensure malware has not made its way into the system and is not replicating across it. When running containers you should be very thoughtful in how you compose, group, and manage the assets.

Summary

By implementing the right processes and tools in the right phases, your team will be informed of issues earlier in the SDLC process and reduce the overall cost of development while maintaining the ability to plan for highly secure product deliveries.  In an agile environment, automation and process augmentation can enable security to be a continuous concern. Over time, by building security controls into the SDLC process, security becomes a core part of the culture and everyday awareness.

At AKF we assist companies in enhancing their security posture. Let us know how we can help you.

Top 3 Failures in Digital Transformations

July 11, 2019  |  Posted By: Marty Abbott

Attempting to transform a company to compete effectively in the Digital Economy is difficult, to say the least.  In the experience of AKF Partners, it is easier to be “born digital” than to transform a successful, long-tenured business to compete effectively in the Digital age. 

There is no single guaranteed fail-safe path to transformation.  There are, however, 10 principles by which you should abide and 3 guaranteed paths to failure. 

Avoid these 3 common mistakes at all costs or suffer a failed transformation.

Top 3 Digital Transformation Failures

Having the Wrong Team and the Wrong Structure

If you have a successful business, you very likely have a very bright and engaged team.  But unless a good portion of your existing team has run a successful “born digital” business, or better yet transformed a business in the digital age, they don’t have the experience necessary to complete your transformation in the timeframe necessary for you to compete.  If you needed lifesaving surgery, you wouldn’t bet your life on a doctor learning “on the job”.  At the very least, you’d ensure that doctor was alongside a veteran and more than likely you would find a doctor with a successful track record of the surgery in question.  You should take the same approach with your transformation.

This does not mean that you need to completely replace your team.  Companies have been successful with organizational strategies that augment the current team with veterans.  But you do need new, experienced help as employees on your team. 

Further, to meet the speed demanded by the new digital world, you need to think differently about how you organize.  The best, fastest performing digital teams organize themselves around the outcomes they hope to achieve, not the functions that they perform.  High performing digital teams are cross-functional and durable, owning a business outcome end to end rather than handing work across functional silos.

It also helps to hire a firm that has helped guide companies through a transformation.  AKF Partners can help. 

Planning Instead of Doing

The digital world is ever evolving.  Plans that you make today will be incorrect within 6 months.  In the digital world, no plan survives first contact with the enemy.  In the old days of packaged software and brick and mortar retail, we had to put great effort into planning to reduce the risk associated with being incorrect after rather long lead times to project completion.  In the new world, we can iterate nearly at the speed of thought.  Whereas being incorrect in the old world may have meant project failure, in the new world we strive to be incorrect early such that we can iterate and make the final solution correct with respect to the needs of the market.  Speed kills the enemy.

Eschew waterfall models, prescriptive financial models and static planning in favor of Agile methodologies, near-term adaptive financial plans and OKRs.  Spend 5 percent of your time planning and 95 percent of your time doing.  While in the doing phase, learn to adapt quickly to failures and adjust your approach based on market feedback and available data. 

The successful transformation starts with a compelling vision that is outcome based, followed by a clear near-term path of multiple small steps.  The remainder of the path is unclear as we want the results of our first few steps to inform what we should do in the next iteration of steps to our final outcome.  Transformation isn’t one large investment, but a series of small investments, each having a measurable return to the business.

Knowing Instead of Discovering

Few companies thrive by repeatedly being smarter than the market.  In fact, the opposite is true: the Digital landscape is strewn with the corpses of companies whose hubris prevented them from developing the real-time feedback mechanisms necessary to sense and respond to changing market dynamics.  Yesterday’s approaches to success at best have diminishing returns today and at worst put you at a competitive disadvantage.

Begin your journey as a campaign of exploration.  You are finding the best path to success, and you will do it by ensuring that every solution you deploy is instrumented with sensors that help you identify the efficacy of the solution in real time.  Real-time data allows us to inductively identify patterns that form specific hypotheses.  We then deductively test these hypotheses through comparatively low-cost solutions, the results of which inform further induction.  This cycle of induction and deduction propels us through our journey to success.
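
As a deliberately simple illustration of that feedback loop, the sketch below emits a usage event each time a feature variant is exercised, so adoption and completion can be compared against a hypothesis in near real time. The event names, the checkout example, and the stdout sink are hypothetical stand-ins for whatever analytics pipeline you already run.

    # Minimal sketch of instrumenting a feature so its efficacy can be measured.
    # emit() prints to stdout here; in practice it would publish to your event pipeline.
    import json
    import time

    def emit(event: str, **attrs) -> None:
        record = {"event": event, "ts": time.time(), **attrs}
        print(json.dumps(record))  # stand-in for the real analytics sink

    def checkout(user_id: str, used_new_flow: bool) -> None:
        variant = "new" if used_new_flow else "old"
        emit("checkout_started", user=user_id, variant=variant)
        # ... business logic for the checkout itself ...
        emit("checkout_completed", user=user_id, variant=variant)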

The Role of the Business Analyst in Agile

July 11, 2019  |  Posted By: Marty Abbott

Our clients often ask us the following question: “What is the role of the business analyst in Agile development?”

Spoiler Alert: There isn’t a role for the business analyst in Agile development.

For a longer answer, we need to explore the history of the Business Analyst (BA) role.

Traditional Business Analyst Role

Business analysts (BAs) are typically found within an Information Technology (IT) organization, or adjacent to one (say, within a business unit working with IT).  Here we use IT to designate the organizational group within a company focused on producing or maintaining solutions that support back-office business operations, employee productivity, and the like.

Theoretically, the role focuses on analyzing a business domain, business processes or problem domain with the purpose of improving the domain or solving problems by defining systems or improvements to systems.  The analyst then works with IT to implement the systems or improvements. 

Practically speaking, the Business Analyst is often a bridge between what a business unit or domain “wants” and how that “want” should be implemented with a technical solution.  This bridge is very often implemented through requirements specifications.  The analyst then is responsible for writing and reviewing requirements and may be involved in some level of design to implement requirements.  The Business Analyst is also often involved in validating requirements, evaluating the quality of an end solution and helping to usher the system through the appropriate sign-offs and training to launch the solution.

Business Analysts are very often found within waterfall development lifecycles where solutions move through “phases” in a linear fashion and where business owners and development teams are not integrated.  They exist to solve a gap between independently operating information technology teams and business units.

Business Analysts in Agile

Teams practicing Agile should not need someone with a Business Analyst title.  One of the Agile principles is to have “Business people and developers [working] together daily throughout a project”.  Within Scrum, this daily interaction most often happens through someone with the title of product owner.  The product owner’s role is to optimize the value the entire team delivers through proper prioritization and expression of the product backlog.  To be successful, the product owner must be properly empowered by the business organization he/she represents to achieve the product outcomes.  As the name implies, he/she “owns” the product and the associated business outcomes.  Business analysts traditionally own neither. 

Given that the product owner is responsible for a team of ideally no more than 12 people, there should be no need for an additional person “helping” with story writing, backlog prioritization, and the like. With the product owner working in close proximity to the team doing the work, and little administrative overhead at that team size, there is simply no need for a business analyst on an Agile team.

What Should I Do with My Business Analysts?

If you’ve read this far, you are probably a company transitioning from Waterfall to Agile development.  The question is difficult to answer, and really depends upon the capabilities of the person in question.

Transitioning Business Analysts to Product Owners

We haven’t seen this succeed in many places, but on a case-by-case basis it’s possible.  If the person is smart, really understands the business, is empowered by the business unit he/she represents, and is committed to understanding the role of the product owner, then he/she may be successful.  Frankly, most business analysts have existed for too long as order takers to truly lead business initiatives within a development team.

Transitioning Business Analysts to Scrum Masters

We’ve seen greater success with this approach, but it requires a lot of training.  To be successful in your conversion, you will need at least a core of highly trained and experienced Scrum Masters.  If everyone is learning on the job, your transition to Agile will be slow and flawed.  You can afford to have some (a handful of) people transitioning from the BA role, but be careful.  The role of the Scrum Master and the role of the Business Analyst are very different.  Some people won’t be able to make the transition, and others won’t have the desire.

Keeping Business Analysts in Waterfall Roles

Your company will no doubt continue to have several waterfall-oriented teams.  Waterfall is appropriate anytime negotiation trumps collaboration (again, see the Agile Manifesto), as is the case in most projects involving systems integrators (e.g., back-office corporate systems or employee-facing packaged solutions used by disparate functional teams).  In fact, any contract-based development where outcomes are defined in a contract is ultimately a waterfall process, even if the company has deluded itself into thinking the solution is “Agile”.  Take your best business analysts and transition them into these waterfall projects.

Transitioning Business Analysts “Somewhere Else”

This is the catchall category.  Perhaps some of them are recovering software engineers or infrastructure engineers and will want to go back.  Maybe there is a place for great business analysts within a business unit working on non-technology related initiatives.  In many companies, there may be waterfall projects where they can continue to add value.  Lastly, you may no longer have a role for them.

Conclusion

We’ve seen many of our clients try to plug and play traditional IT roles with a simple name change to Agile terminology, then wonder why it isn’t working.  Successful clients bring in proven Agile Coaches and spend time educating business leaders on how Agile applies throughout the organization.  We’ve advised hundreds of companies on Agile transformations; give us a call, we can help.
