AKF Partners

Abbott, Keeven & Fisher Partners: Partners in Hyper Growth

Communicating Across Swim-lanes

We often emphasize the need to break apart complex applications into separate swim-laned services. Doing so not only improves architectural scalability but also permits the organization to scale and innovate faster as each separate team can operate almost as its own independent startup.

Ideally, there would be no need to pass information between these services, and they could be easily stitched together on the client side with a little HTML for a seamless user experience. However, in the real world you’re likely to find that some message passing between services is needed, if only to save your users the trouble of entering the same information twice.

Every cross-service call adds complexity to your application. Often, teams will implement internal APIs when cross-service communication is needed. This is a good start and solves part of the problem by formalizing communication and discouraging unnecessary cross-service calls.

However, APIs alone don’t address the more important issue with cross-service communication — the reduction in availability. Anytime one service synchronously communicates with another you’ve effectively connected them in series and reduced overall availability. If Service-A calls Service-B synchronously, and Service-B crashes or slows to a crawl, Service-A will do the same. To avoid this trap, communication needs to take place asynchronously between services. This, in fact, is how we define “swim-laning”.

So what are the best means of facilitating asynchronous cross-service communication, and what are the pros and cons of each method?

Client Side

Information that needs to be shared between services can be passed via JavaScript & JSON in the browser. This has the advantage of keeping the backend services completely separate and gives you the option to move business logic to the client side, thus reducing load on your transactional systems.

The downside, however, is the increased security risk. It doesn’t take an expert hacker to manipulate variables in JavaScript. (Just think: $price_var gets set to $0 somewhere between the shopping cart and checkout services.) Furthermore, when data is passed back to the server side, these cross-service calls will suffer from the same latency and reliability issues as any other call over the internet.
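To make this concrete, here is a minimal sketch of a client-side hand-off between two swim-laned front ends, assuming both are served from the same origin (sessionStorage does not cross origins); the data shape and storage key are illustrative assumptions, not a prescribed design.

```typescript
// Minimal sketch of client-side hand-off between two swim-laned services.
// The ProfileHandoff shape and the storage key are illustrative assumptions.

interface ProfileHandoff {
  userId: string;
  displayName: string;
  shippingZip: string; // convenience data only -- never pass prices or entitlements this way
}

const HANDOFF_KEY = "profileHandoff"; // assumed key, shared by both front ends on the same origin

// Service A (e.g. the cart front end) stashes non-sensitive data before handing off.
export function stashProfile(profile: ProfileHandoff): void {
  sessionStorage.setItem(HANDOFF_KEY, JSON.stringify(profile));
}

// Service B (e.g. the checkout front end) reads it to pre-fill forms.
// Anything security-relevant (price, inventory, identity) must still be
// re-validated server side, because the browser copy can be tampered with.
export function readProfile(): ProfileHandoff | null {
  const raw = sessionStorage.getItem(HANDOFF_KEY);
  return raw ? (JSON.parse(raw) as ProfileHandoff) : null;
}
```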

Message Bus / Enterprise Service Bus

Message buses and enterprise service buses provide a ready means to transmit messages between services asynchronously in a pub/sub model. Advantages include a rich set of features for tracking, manipulating, and delivering messages, as well as the ability to centralize logging and monitoring. However, the potential for congestion as well as occasional message loss makes them less desirable in many cases than asynchronous point-to-point calls (discussed below).

To limit congestion, it’s best to implement several independent channels and limit message bus traffic to events that will have multiple subscribers.
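The sketch below illustrates the pub/sub pattern itself with an in-process stand-in; a real deployment would publish to a broker such as RabbitMQ, Kafka, or SNS/SQS, and the event name and shape here are assumptions for illustration.

```typescript
// In-process stand-in for a message bus, to illustrate the pub/sub pattern.
// A production system would publish to a real broker (RabbitMQ, Kafka, SNS/SQS, etc.).

type Handler<T> = (message: T) => Promise<void>;

class Channel<T> {
  private subscribers: Handler<T>[] = [];

  subscribe(handler: Handler<T>): void {
    this.subscribers.push(handler);
  }

  // Fire-and-forget publish: the caller never waits on subscribers,
  // so a slow consumer cannot drag down the publishing service.
  publish(message: T): void {
    for (const handler of this.subscribers) {
      handler(message).catch((err) => console.error("subscriber failed", err));
    }
  }
}

// Illustrative event and channel; the names are assumptions, not a prescribed schema.
interface OrderPlaced { orderId: string; userId: string; total: number; }

const orderEvents = new Channel<OrderPlaced>(); // one independent channel per event type
orderEvents.subscribe(async (e) => console.log("email service: send receipt for", e.orderId));
orderEvents.subscribe(async (e) => console.log("analytics service: record sale of", e.total));

orderEvents.publish({ orderId: "o-123", userId: "u-456", total: 42.5 });
```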

Asynchronous Point-to-Point

Point-to-point communication (i.e. one service directly calling another) is another effective means of message passing. Advantages include simplicity, speed, and reliability. However, be sure to implement it asynchronously, with a queuing mechanism, timeouts, retries (if needed), and exception handling in the event of service failure. This will prevent failures from propagating across service boundaries.

If properly implemented, asynchronous point-to-point communication is excellent for invoking a function in another service or transmitting a small amount of information. This method can be implemented for the majority of cross-service communication needs. However, for larger data transfers, you’ll need to consider one of the methods below.
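Here is a minimal sketch of such a call, assuming Service-B exposes an HTTP endpoint; the URL, timeout, and retry counts are illustrative, and a production implementation would typically also buffer messages on a durable local queue.

```typescript
// Sketch of an asynchronous point-to-point call with a timeout and bounded retries.
// The endpoint, timeout, and retry counts are illustrative assumptions.

async function callService<T>(url: string, body: unknown, timeoutMs = 500, retries = 2): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs); // don't wait on a crawling service
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
        signal: controller.signal,
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return (await res.json()) as T;
    } catch (err) {
      if (attempt === retries) throw err; // surface the failure; caller degrades gracefully
      await new Promise((r) => setTimeout(r, 100 * 2 ** attempt)); // simple backoff between retries
    } finally {
      clearTimeout(timer);
    }
  }
  throw new Error("unreachable");
}

// Usage: Service-A notifies Service-B but never blocks its own request path on the result.
void callService("https://service-b.internal/api/notify", { userId: "u-456" })
  .catch((err) => console.error("Service-B unavailable, continuing without it", err));
```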

ETL

ETL jobs can be used to move a large amount of data from one datastore to another. Since these are generally implemented as separate batch processes, they won’t bring down the availability of your services. However, drawbacks include increased load on transactional databases (unless a read replica of the source DB is used) and the poor timeliness and consistency of data that results from periodic batch processing.

ETL processes are likely best reserved for transferring data from your OLTP to OLAP systems, where immediate consistency isn’t required. If you need both a large amount of data transferred and up-to-the-second consistency, consider a DB read replica.
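As a rough illustration, the sketch below shows the shape of an incremental batch ETL job; the row shape, the source and target interfaces, and the watermark handling are assumptions, not a specific product’s API.

```typescript
// Minimal sketch of an incremental batch ETL job. The row shape, the source/target
// clients, and the watermark handling are illustrative assumptions.

interface OrderRow { id: string; updatedAt: Date; total: number; }

interface SourceStore { fetchChangedSince(since: Date, limit: number): Promise<OrderRow[]>; }
interface TargetStore { upsertBatch(rows: OrderRow[]): Promise<void>; }

// Runs periodically (cron or a scheduler). Reads from a replica of the source DB so the
// transactional primary is not loaded, and advances a watermark so each run only
// moves rows changed since the last run.
export async function runEtl(source: SourceStore, target: TargetStore, lastRun: Date): Promise<Date> {
  const batchSize = 1000;
  let watermark = lastRun;
  for (;;) {
    const rows = await source.fetchChangedSince(watermark, batchSize);
    if (rows.length === 0) break;
    await target.upsertBatch(rows);
    watermark = rows[rows.length - 1].updatedAt; // rows assumed ordered by updatedAt
  }
  return watermark; // persist for the next run; data is only as fresh as the schedule
}
```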

DB Read-Replica

Most common databases (Oracle, MySQL, PostgreSQL) natively support read replicas, with replication lag measured in milliseconds. By placing a read replica of one service’s DB into another service’s swim lane, you can transfer a large amount of data in near real time. The downsides are increased IOPS requirements and a lack of abstraction between services that — in contrast to the abstraction provided by an API — fosters greater service interdependency.

Conclusion

As you can see, there are a variety of cross-service communication methods that preserve fault isolation between services. The trick is knowing the strengths and weaknesses of each and applying the right method where appropriate.


Making Agile Predictable

One of the biggest complaints that we hear from businesses implementing agile processes is that they can’t predict when things will get delivered. Many people believe, incorrectly, that waterfall is a much more predictable development methodology. Let’s review a few statistics that demonstrate the “predictability” of the waterfall methodology.

A study of over $37 billion (USD) worth of US Defense Department projects concluded that 46% of the systems so egregiously failed to meet the real needs (although they met the specifications) that they were never successfully used, and another 20% required extensive rework (Larman, 1995).

In another study of 6,700 projects it was found that four out of five key factors contributing to project failure were associated with or aggravated by the waterfall method, including inability to deal with changing requirements and problems with late integration.

Waterfall is not a bad or failed methodology; it’s just a tool, like any other, that has its limitations. We as the users of that tool misused it and then blamed the tool. We falsely believed that we could fix the project scope (specifications), cost (team size), and schedule (delivery date). No methodology allows for that. As shown in the diagram below, waterfall is meant to fix scope and cost. When we also try to constrain the schedule, we’re destined to fail.

[Figure 1: Waterfall fixes scope and cost, leaving schedule to vary]

Agile, when used properly, fixes the cost (team size) and the schedule (two-week sprints), allowing the scope to vary. This is where most organizations struggle, attempting to predict delivery of features when the scope of stories is allowed to vary. As a side note, if you think you can fix the scope and the schedule and vary the cost (team size), read Brooks’ 1975 book The Mythical Man-Month.

This is where the magical measurement of velocity comes into play. The velocity of a team is simply the number of units of work completed during the sprint. The primary purpose of this measurement is to provide the team with feedback on their estimates. As you can see in the graph below, it usually takes a few sprints to get into a controlled state where the velocity is predictable, and then it usually rises slightly over time as the team becomes more experienced.

[Figure 2: Team velocity over successive sprints]

Using velocity we can now predict when features will be delivered. We simply project out the best and worst velocities, and we can demonstrate with high certainty a best and worst delivery date for the set of stories that make up a feature.
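A quick sketch of that projection, using made-up velocities and backlog numbers:

```typescript
// Sketch of projecting a delivery window from sprint velocity. The sample velocities,
// backlog size, and start date are made-up numbers for illustration.

const recentVelocities = [21, 24, 19, 25, 23]; // story points completed in recent sprints
const remainingPoints = 120;                   // points left in the feature's stories
const sprintLengthDays = 14;
const sprintStart = new Date("2016-04-04");    // assumed start of the next sprint

const bestVelocity = Math.max(...recentVelocities);
const worstVelocity = Math.min(...recentVelocities);

const bestCaseSprints = Math.ceil(remainingPoints / bestVelocity);
const worstCaseSprints = Math.ceil(remainingPoints / worstVelocity);

function addDays(date: Date, days: number): Date {
  const d = new Date(date);
  d.setDate(d.getDate() + days);
  return d;
}

console.log("Earliest delivery:", addDays(sprintStart, bestCaseSprints * sprintLengthDays).toDateString());
console.log("Latest delivery:  ", addDays(sprintStart, worstCaseSprints * sprintLengthDays).toDateString());
// Answer to "when will we have feature X?": between these two dates.
```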

[Figure 3: Best- and worst-case delivery dates projected from velocity]

Velocity helps us answer two types of questions. The first is the fixed scope question “when will we have X feature?” to which the answer is “between A and B dates”. The second question is the fixed time question “what will be delivered by the June releases?” to which the answer is “all of this, some of that, and none of those.” What we can’t answer is fixed time and fixed scope questions.

[Figure 4: Fixed-scope vs. fixed-time questions]

It’s important to remember that agile is not a software development methodology, but rather a business process. This means that all parts of your organization must buy in to and participate in agile product development. Your sales team must get out of the mindset of committing to new product features with fixed time and scope when talking to existing or potential customers. When implemented correctly, agile provides faster time to market and higher levels of innovation than waterfall, which brings greater value to your customers. The tradeoff on the sales side is giving up the long-term product commitments they made in the past (which, more often than not, were missed anyway).

By utilizing velocity, keeping the team consistent, and phrasing the questions properly, agile can be a very predictable methodology. The key is understanding the constraints of the methodology and working within them instead of ignoring them and blaming the methodology.


AKF is launching an Information Security Service for Clients

In today’s digital world, cyber security is one of the biggest areas that keep CEOs and business executives awake at night. Further, regulation and compliance are continually changing, as both governments and customers modify the rules in an attempt to keep pace with rapid changes in technology. Keeping up with these changes in regulation, and particularly how they apply to your business, is a monumental challenge. As a very recent example, if you operate in both the U.S. and Europe, you’re probably familiar with Safe Harbor (https://en.wikipedia.org/wiki/International_Safe_Harbor_Privacy_Principles). Until late 2015, Safe Harbor permitted U.S. companies to store EU customer data on U.S. soil, so long as the infrastructure and usage met the privacy standards of the Safe Harbor framework. However, in October 2015 the European Court of Justice struck down the US-EU Safe Harbor agreement, leaving companies again wondering what approach they should take. Then on Feb 29, 2016, the European Commission introduced the EU-US Privacy Shield (http://europa.eu/rapid/press-release_IP-16-433_en.htm), which is similar to Safe Harbor but, in conjunction with the Judicial Redress Act signed by President Obama, extends the ability of European citizens to challenge U.S. companies that store sensitive personal information. With changes coming this quickly and furiously, what is your company to do?


In addition to those challenges, customers today put more and more demand on companies not only to protect sensitive customer data and IP, but also to store and process that sensitive data in a highly available and scalable manner. Balancing the needs of maintaining data on behalf of customers with regulation and compliance is one of the biggest challenges for any modern company, whether you store or process sensitive data for your customers or not. Sometimes just operating in a regulated industry subjects your company to the long reach of compliance standards.

Finally, who is in charge of security in your company? Is it spread across teams? Is there a single CSO/CISO and security organization? How well do the security organizations work with your technology and business teams?

Our clients ask us frequently if we can assess their security programs and help develop plans for them, similar to the work we do with system architecture and product development. At AKF, we look at security the same way we look at availability and scaling. It is about managing risk. You can build a system that has nearly 100% availability, and you can build a system that scales nearly infinitely… but you don’t always need to, as it’s not always the most cost-effective way to run your business.


You want to build a system that achieves very high availability, up to the point where the cost of “near perfect” availability exceeds the value it brings to the business. Similarly, you may never be able to completely eliminate each and every risk to your information security. Some industries, like health care and credit card processing, need to be “near perfect.” But others may not need to be quite as perfect. Some may just require ample controls, tightly monitored systems, secure coding practices, and top-notch security incident response plans.

Finally, there is plenty of overlap among regulations, so controls applied once can address several of your security compliance concerns. For example, PCI, SOX, and HIPAA all have requirements about auditability and access control, although they apply to different types of data. If you’re subject to more than one of those, why not put controls in place that help solve them for ANY regulation?


We have begun a new program to help our clients with their security needs. Our approach is to help you understand your regulation and compliance landscape, look at the security measures you’ve put in place, and work with you to design a program and projects to close the gaps, focusing on the highest value projects first. We’ll also help you understand where security fits into your organization, and guide you on how to break down barriers that prevent organizations from working cohesively to manage security across the enterprise.

If you are interested in our security program services, please contact us.


Paying Down Your Technical Debt

During the course of our client engagements, there are a few common topics or themes that are almost always discussed, and the clients themselves usually introduce them. One such area is technical debt. Every team has it, almost every team believes they have too much of it, and no team has an easy time explaining to the business why it’s important to address. We’re all familiar with the concept of technical debt, and we’ve shared a client horror story in a previous blog post that highlights the dangers of ignoring it: Technical Debt


The words “technical debt” carry a negative connotation. However, the judicious use of tech debt is a valuable addition to your product development process. Tech debt is analogous to financial debt. Companies can raise capital to grow their business by either issuing equity or issuing debt. Issuing equity means giving up a percentage of ownership in the company and dilutes current shareholder value. Issuing debt requires the payment of interest, but does not give up ownership or dilute shareholder value. Issuing debt is good until you can’t service it. Once you have too much debt and cannot pay the interest, you are in trouble.

Tech debt operates in the same manner. Companies use tech debt to defer performing work on a product. As we develop our minimum viable product, we build a prototype, gather feedback from the market, and iterate. The parts of the product that didn’t meet the definition of minimum or the decisions/shortcuts made during development represent the tech debt that was taken on to get to the MVP. This is the debt that we must service in later iterations. Taking on tech debt early can pay big dividends by getting your product to market faster. However, like financial debt, you must service the interest. If you don’t, you will begin to see scalability and availability issues. At that point, refactoring the debt becomes more difficult and time critical. It begins to affect your customers’ experience.


Many development teams have a hard time convincing leadership that technical debt is a worthy use of their time. Why spend time refactoring something that already “works” when you could use that time to build new features customers and markets are demanding now? The danger with this philosophy is that by the time technical debt manifests itself as a noticeable customer problem, it’s often too late to address it without a major undertaking. It’s akin to not having a disaster recovery plan when a major availability outage strikes. To get the business on board, you must make the case using language business leaders understand, which is often financial in nature. Be clear about the cost of such efforts and quantify the business value they will bring by calculating their ROI. Demonstrate the cost avoidance achieved by addressing critical debt sooner rather than later by calculating how much greater the cost will be in the future if the debt is not addressed now. The best practice is to get leadership to agree and commit to a certain percentage of development time allocated to addressing technical debt on an ongoing basis. If they do, it’s important not to abuse this commitment. Do not let engineers alone determine what technical debt should be paid down and at what rate; the work must have true business value that is greater than or equal to spending that time on other activities.
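A back-of-the-envelope version of that ROI argument might look like the sketch below; every figure in it is a made-up assumption, and the point is the framing, not the numbers.

```typescript
// Back-of-the-envelope ROI for paying down a piece of tech debt now vs. later.
// All figures are made-up assumptions for illustration.

const refactorCostNow = 3 * 2 * 10_000;    // 3 engineers x 2 sprints x assumed $10k per engineer-sprint
const projectedCostLater = 8 * 3 * 10_000; // same work after the debt compounds: more people, more sprints
const incidentCostAvoided = 4 * 15_000;    // assumed outages avoided x estimated cost per outage

const benefit = (projectedCostLater - refactorCostNow) + incidentCostAvoided;
const roi = benefit / refactorCostNow;

console.log(`Cost now: $${refactorCostNow}, benefit: $${benefit}, ROI: ${(roi * 100).toFixed(0)}%`);
// Presenting the trade-off in these terms is what gets leadership to fund the work.
```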


Additionally, be clear about how you define technical debt so time spent paying it down is not commingled with other activities. Generally, bugs in your code are not technical debt. Refactoring your code base to make it more scalable, however, would be. A good test is to ask whether the path you chose was a conscious decision, that is, whether you decided to go in one direction knowing that you would later need to refactor. This, too, is analogous to financial debt: technical debt needs to be a conscious choice. You are making a specific decision to do or not to do something knowing that you will need to address it later. Bugs found in sloppy code are not tech debt; that is just bad code.

So how do you decide what tech debt should be addressed, and how do you prioritize? If you have been tracking work with agile storyboards and product backlogs, you should have an idea where to begin. Also, if you track your problems and incidents as we recommend, they will show the elements of tech debt that have already begun to manifest themselves as scalability and availability concerns. We recommend budgeting 12-25% of your development effort for servicing tech debt. Set a budget and begin paying down the debt. If you are spending less than the lower end of that range, you are not spending enough effort. If you are spending over 25%, you are probably fixing issues that have already manifested themselves, and you are trying to catch up. Setting an appropriate budget and maintaining it over the course of your development efforts will pay down the interest and help prevent issues from arising.

Taking on technical debt to fund your product development efforts is an effective way to get your product to market more quickly. But, just like financial debt, you need to take on an amount of tech debt that you can service by making the necessary interest and principal payments to reduce the outstanding balance. Failing to set an appropriate budget will result in a technical “bankruptcy” that will be much harder to dig yourself out of later.


The Biggest Problem with Big Data

How to Reduce Crime in Big Cities

I’m a huge Malcolm Gladwell fan. Gladwell’s ability to convey complex concepts and virtually incomprehensible academic research in easily understood prose is second to none within his field of journalism. A perfect example of his skill is on display in The Tipping Point, where Gladwell wrestles the topic of Complexity Theory (aka Chaos Theory) into submission, making it accessible to all of us. In The Tipping Point, Gladwell also introduces us to the Broken Windows Theory.

The Broken Windows Theory gets its name from a 1982 article in The Atlantic Monthly. This article asked the reader to imagine a building with a few broken windows. The authors claim that the existence of these windows invites vandals to break still more windows. A continuous cycle of expanding vandalism ensues, with squatters moving in, nearby buildings getting vandalized, etc. Subsequent authors expanded upon the theory, claiming that the presence of vandalism invites other crimes and that crime rates soar in communities where unhandled vandalism is present. A corollary to the Broken Windows Theory is that cities can reduce crime rates by focusing law enforcement on petty crimes. Several high profile examples seem to illustrate the power and correctness of this theory, such as New York Mayor Giuliani’s “Zero Tolerance Program”. The program focused on vandalism, public drinking, public urination, and subway fare evasion. Crime rates dropped over the 10 year period corresponding with the initiation of the program. Several other cities and other experiments showed similar effects. Proof, it would seem, that the hypothesis underpinning the theory is correct.

Not So Fast…

Enter the self-described “Rogue Economist” Steven Levitt and his co-author Stephen Dubner, both of Freakonomics fame. While the two authors don’t deny that the Broken Windows theory may explain some drop in crime, they do cast significant doubt on the approach as the primary explanation for crime rates dropping. Crime rates dropped nationally during the same 10 year period in which New York pursued its Zero Tolerance Program. This national drop in crime occurred in cities that practiced Broken Windows and in those that did not. Further, crime rates dropped irrespective of whether police spending increased or decreased. The explanation, therefore, argue the authors, cannot primarily be Broken Windows. The most likely explanation, and the most highly correlated variable, is a reduction in the pool of potential criminals. Roe v. Wade legalized abortion, and as a result there was a significant decrease in the number of unwanted children, a disproportionately high percentage of whom would have grown up to be criminals.

Gladwell isn’t therefore incorrect in proffering Broken Windows as an explanation for reduction in crime.  But the explanation is not the best one available and as a result it holds residence somewhere between misleading (worst case) and incomplete (best case).

What Happened?

To be fair, it’s hard to hold Gladwell accountable for this oversight. Gladwell is not a scientist and therefore not trained in how to scientifically evaluate the research he reported. Furthermore, his is an oft-repeated mistake, even among highly trained researchers. And what exactly is that mistake? The mistake made here is illustrated by the difference in approach between the Broken Windows researchers and the Freakonomics authors. The Broken Windows researchers started with something like the question “Does the presence of vandalism invite additional vandalism and escalating crime?” Levitt and Dubner first asked the question “What variables appear to explain the rate of crime?”

Broken Windows started with a question focused on deductive analysis.  Deduction starts with a hypothesis – “Evidence of vandalism and/or other petty crimes invites similar and more egregious crimes”.   The process continues to attempt to confirm or disprove the hypothesis.  Deduction starts with a broad and abstract view of the data – a generalization or hypothesis as to relationships – and attempts to move to show specific relationships between data elements.  The Broken Windows folks started with a hypothesis, developed a series of experiments to test the hypothesis and then ultimately evaluated time series data in cities with various Broken Windows approaches to policing.  What they lacked was a broad question that may have developed a range of options indicating possible causes.

The Freakonomics authors started with an inductive question.  Induction is the process of moving from specific observations about data into generalizations.  These generalizations are often in the form of hypothesis or models as to how data interacts.  Induction helps to inform what questions should be asked of the data.  Induction is the asking of “What change in what independent variables seem to correspond with a resulting change in some dependent variable?”  Whereas deduction works from independent variable to dependent variable, induction attempts to work backwards from dependent variable to identify independent variable relationships.

So What?

The jump to deduction, without forming the right questions and hypotheses through induction, is the biggest mistake we see in developing Big Data programs and implementing Big Data solutions. We all approach problems with unique experiences and unique biases. The combination of these often causes us to race to hypotheses and want to test them. The issue here is two-fold. The best case is that we develop an incomplete (and as a result partially or mostly incorrect) answer similar to that of the Broken Windows researchers. The worst case is that we suffer what statisticians call a Type 1 error: confirming an incorrect answer. The probability of Type 1 errors increases when we don’t look for alternative or better explanations for outcomes within our data sets. Induction helps to uncover those alternative or supporting explanations. Exploring the data to discover potential relationships helps us to ask the right questions and form better hypotheses and better models. Skipping induction makes it highly probable that we will get an incorrect, misleading, or substandard answer.

But it is not enough to simply ensure that we practice both induction and deduction. We must also recognize that the solutions that support induction are different from those that support deduction. Further, we must understand that the two processes, while complementary, can actually interfere with each other when performed on the same system. Induction is necessarily a very broad and, as a result, slow and tedious process. Deduction, on the other hand, needs significantly less data and “prefers” to be faster in implementation. Inductive systems are best supported by solutions that impose very few relations or little structure on the data we observe. Systems that support deduction, in order to allow for faster response times, impose increased structure relative to inductive systems. While the two phases of discovery (induction and deduction) support each other, their differences suggest that they should be performed on solutions purpose-built to their specific needs.

Similarly, not everyone is equally qualified to perform both induction and deduction.  Our experience is that the folks who tend to be good at determining how to prove relationships between variables are often not as good at identifying patterns and vice versa.

These two observations, that the systems supporting induction and deduction should be separated and that the people performing these tasks may need to be different, have ramifications for how we develop our analytics systems and organize our Big Data teams. We’ll discuss these ramifications and more in our next post, “10 Anti-Patterns within Big Data”.


Data Driven Product Development

When we bring up the idea of data driven product development, almost every product manager or engineering director will quickly say that of course they do this. Probing further, we usually hear about A/B testing, but of course they only do it when they aren’t sure what the customer wants. To all of this we say “hogwash!” Testing one landing page against another is a good idea, but it’s a far cry from true data driven product development. The bottom line is that if your agile teams don’t have business metrics (e.g. increase the conversion rate by 5% or decrease the shopping cart abandonment rate by 2%) that they measure after every release to see whether they are getting closer to achieving them, then you are not data driven. Most likely you are building products based on the HiPPO: the Highest Paid Person’s Opinion. In case that person isn’t always around, there’s this service.

Let’s break this down. Traditional agile processes have product owners seeking input from multiple sources (customers, stakeholders, the marketplace). All these diverse inputs are considered and eventually combined into a product vision that helps prioritize the backlog. In the most advanced agile processes, the backlog is ranked based on effort and expected benefit. This is almost never done. And if it is done, the effort and benefit are never validated afterwards to see how accurate the estimates were. We even have an agile measurement for the effort side of this: velocity. Most agile teams don’t bother with the velocity measurement and have no hope of measuring the expected benefit.

Would we approach anything else like this? Would we find it acceptable if other disciplines worked this way? What if doctors practiced medicine with input from patients and pharmaceutical companies but without monitoring how their actions impacted the patients’ health? We would probably think that they got off to a good start, understanding the symptoms from the patient and the possible benefits of certain drugs from the pharmaceutical representatives. But, they failed to take into account the most important part of the process. How their treatment was actually performing against their ultimate goal, the improved health of the patient.

We argue that you’re wasting time and effort doing anything that isn’t measured by a success factor. Why code a single line if you don’t have a hypothesis that this feature, enhancement, story, etc. will benefit a customer and in turn provide greater revenue for the business? Without this feedback mechanism, your effort would be better spent taking out the trash, watering the plants, and cleaning up the office. Here’s what we know about most people’s gut instinct with regard to products: it stinks. Read Lund’s very first usability maxim: “Know thy user, and YOU are not thy user.” What does that mean? It means that you know more about your product, industry, service, etc. than any user ever will. Your perspective is skewed because you know too much. Real users don’t know where everything is on the page; they have to search for it. Real users lie when you ask them what they want. Real users have bosses, spouses, and children distracting them while they use your product. As Henry Ford supposedly said, “If I had asked customers what they wanted, they would have said a faster horse.” You can’t trust yourself and you can’t trust users. Trust the data. Form a hypothesis: “If I add a checkout button here, shoppers will convert more.” Then add the button and watch the data.
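A minimal sketch of watching the data for that hypothesis is shown below, using a standard two-proportion z-test; the visitor and conversion counts are made-up, and a real pipeline would pull them from your analytics store.

```typescript
// Sketch of checking the hypothesis "the new checkout button improves conversion".
// The counts below are made-up; a real pipeline would pull them from analytics.

function conversionLift(
  controlVisitors: number, controlConversions: number,
  variantVisitors: number, variantConversions: number
): { lift: number; zScore: number; significant: boolean } {
  const p1 = controlConversions / controlVisitors;
  const p2 = variantConversions / variantVisitors;
  // Pooled proportion for a two-proportion z-test.
  const pooled = (controlConversions + variantConversions) / (controlVisitors + variantVisitors);
  const stdErr = Math.sqrt(pooled * (1 - pooled) * (1 / controlVisitors + 1 / variantVisitors));
  const zScore = (p2 - p1) / stdErr;
  return {
    lift: (p2 - p1) / p1,
    zScore,
    significant: Math.abs(zScore) > 1.96, // ~95% confidence, two-sided
  };
}

const result = conversionLift(50_000, 1_500, 50_000, 1_640);
console.log(`Lift: ${(result.lift * 100).toFixed(1)}%, z = ${result.zScore.toFixed(2)}, significant: ${result.significant}`);
// Keep the change only if the measured business benefit shows up; otherwise iterate or deprecate.
```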

This is not A/B or even multivariate testing. It is much more than that. It’s a change in the way you develop products and services. Remember, agile is a business process, not just a development methodology. Nothing gets shipped without an expected benefit that is measured. If you didn’t achieve the business benefit that was expected, you either try again next sprint or you deprecate the code and start over. Why leave code in a product that isn’t achieving business results? It only serves to bloat the code base and make maintenance harder. Data driven product development means we don’t celebrate shipping code. That’s supposed to happen every two weeks, or even daily if we’re practicing continuous delivery. You haven’t achieved anything by pushing code to production. When the customer interacts with that code and a business result is achieved, then we celebrate.

Data driven product development is how you explain the value of what the product development team is delivering. It’s also how you ensure your agile teams are empowered to make meaningful impact for the business. When the team is measured against real business KPIs that directly drive revenue, it raises the bar on product development. Don’t get caught in the trap of measuring success with feature releases or sprint completions. The real measurement of success is when the business metric shows the predicted improvement and the agile team becomes better informed of what the customers really want and what makes the business better.


Guest Post: Three Mistakes in Scaling Non-Relational Databases

The following guest post comes from another long time friend and business associate Chris LaLonde.  Chris was with us through our eBay/PayPal journey and joined us again at both Quigo and Bullhorn.  After Bullhorn, Chris and co-founders Erik Beebe and Kenny Gorman (also eBay and Quigo alums) started Object Rocket to solve what they saw as a serious deficiency in cloud infrastructure – the database.  Specifically they created the first high speed Mongo instance offered as a service.  Without further ado, here’s Chris’ post – thanks Chris!

Three Mistakes in Scaling Non-Relational Databases

In the last few years I’ve been lucky enough to work with a number of high profile customers who use and abuse non-relational databases (mostly MongoDB) and I’ve seen the same problems repeated frequently enough to identify them as patterns. I thought it might be helpful to highlight those issues at a high level and talk about some of the solutions I’ve seen be successful.

Scale Direction:

At first everyone tries to scale things up instead of out.  Sadly that almost always stops working at some point. So generally speaking you have two alternatives:

  • Split your database – yes, it’s a pain, but everyone gets there eventually. You probably already know some of the data you can split out: users, billing, etc. Starting early will make your life much simpler and your application more scalable down the road. However, the number of people who don’t consider that they’ll ever need a 2nd or a 3rd database is frightening. Oh, and one other thing: put your analytics someplace else; the fewer things placing load on your production database from the beginning, the better. Copying data off of a secondary is cheap (see the sketch after this list).
  • Scale out – Can you offload heavy reads or writes? Most non-relational databases have horizontal scaling functionality built in (e.g. sharding in MongoDB). Don’t let all the negative articles fool you; these solutions do work. However, you’ll want an expert on your side to help in picking the right variable or key to scale against (e.g. shard keys in MongoDB). Seek advice early, as these decisions will have a huge impact on future scalability.
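As a small example of the “copy data off of a secondary” point above, here is a sketch using the Node.js MongoDB driver; the connection string, database, collection, and aggregation are illustrative assumptions.

```typescript
import { MongoClient } from "mongodb";

// Point analytics-style reads at a secondary so they never compete with
// production writes on the primary. Names and URI are illustrative.
async function dailyOrderTotals(): Promise<void> {
  const client = new MongoClient("mongodb://replica-set-host:27017/?replicaSet=rs0", {
    readPreference: "secondaryPreferred", // fall back to the primary if no secondary is available
  });
  try {
    await client.connect();
    const orders = client.db("shop").collection("orders");
    const totals = await orders
      .aggregate([{ $group: { _id: "$day", revenue: { $sum: "$total" } } }])
      .toArray();
    console.log(totals);
  } finally {
    await client.close();
  }
}

void dailyOrderTotals();
```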

Pick the right tool:

Non-relational databases are very different by design, and most “suffer” from some of the “flaws” of the eventually consistent model. My grandpa always said “use the right tool for the right job”; in this case that means if you’re writing an application or product that requires your data to be highly consistent, you probably shouldn’t use a non-relational database. You can make most modern non-relational databases more consistent with configuration and management changes, but almost all lack any form of ACID compliance. Luckily, in the modern world databases are cheap; pick several, get someone to manage them for you, and always use the right tool for the right job. When you need consistency, use an ACID-compliant database; when you need raw speed, use an in-memory data store; and when you need flexibility, use a non-relational database.

Write and choose good code:

Unfortunately, not all database drivers are created equal. While I understand that you should write code in the language you’re strongest in, sometimes it pays to be flexible. There’s a reason why Facebook writes code in PHP, transpiles it to C++, and then compiles it into a binary for production. In the case of your database, the driver is _the_ interface to your database, so if the interface is unreliable, difficult to deal with, or not updated frequently, you’re going to have a lot of bad days. If the interface is stable, well documented, and frequently updated, you’ll have a lot more time to work on features instead of fixes. Make sure to take a look at the communities around each driver; look at the number of bugs reported and how quickly those bugs are being fixed. Another thing about connecting to your database: please remember nothing is perfect, so write code as if it’s going to fail. At some point, the failure of some component (the network, a NIC, a load balancer, a switch, or the database itself) is going to keep your application from talking to your database. I can’t tell you how many times I’ve talked to or heard of a developer assuming that “the database is always up, it’s not my responsibility”, and that’s exactly the wrong assumption. Until the application knows that the data is written, assume that it isn’t. Always assume that there’s going to be a problem until you get an “I’ve written the data” confirmation from the database; to assume otherwise is asking for data loss.
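A minimal sketch of that “assume it isn’t written until the database confirms it” advice, using the Node.js MongoDB driver; the write concern, retry policy, and document shape are assumptions for illustration.

```typescript
import { Collection } from "mongodb";

interface Payment { orderId: string; amount: number; }

// Don't consider the data written until the database acknowledges it.
// Here we ask a majority of the replica set to confirm, and retry on failure.
async function recordPayment(payments: Collection<Payment>, payment: Payment): Promise<void> {
  const maxAttempts = 3; // assumed policy; tune for your workload
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await payments.insertOne(payment, {
        writeConcern: { w: "majority" }, // acknowledged by a majority of nodes
      });
      if (result.acknowledged) return;   // only now do we treat the data as written
    } catch (err) {
      console.error(`write attempt ${attempt} failed`, err);
      if (attempt === maxAttempts) throw err; // surface it; don't silently lose the payment
      await new Promise((r) => setTimeout(r, 200 * attempt)); // brief backoff before retrying
    }
  }
}
```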

These are just a few quick pointers to help guide folks in the right direction. As always the right answer is to get advice from an actual expert about your specific situation.


When Should You Split Services?

The Y axis of the AKF Scale Cube indicates that growing companies should consider splitting their products along service (verb) or resource (noun) oriented boundaries. A common question we receive is “how granular should one make a services split?” A similar question is “how many swim lanes should our application be split into?” To help answer these questions, we’ve put together a list of considerations based on developer throughput, availability, scalability, and cost. By weighing these, you can decide whether your application should be grouped into a large, monolithic codebase or split up into smaller individual services and swim lanes. You must also keep in mind that splitting too aggressively can be overly costly and have little return for the effort involved. Companies with little to no growth will be better served by focusing their resources on developing a marketable product than by fine-tuning their service sizes using the considerations below.

 

Developer Throughput:

Frequency of Change – Services with a high rate of change in a monolithic codebase cause competition for code resources and can create a number of time-to-market-impacting conflicts between teams, including merge conflicts. Such high-change services should be split off into small, granular services and ideally placed in their own fault-isolative swim lane so that the frequent updates don’t impact other services. Services with low rates of change can be grouped together, as there is little value created by disaggregation and a lower risk of being impacted by updates.

The diagram below illustrates the relationship we recommend between functionality, frequency of updates, and relative percentage of the codebase. Your high risk, business critical services should reside in the upper right portion being frequently updated by small, dedicated teams. The lower risk functions that rarely change can be grouped together into larger, monolithic services as shown in the bottom left.

[Figure: Frequency of updates vs. percentage of functionality per service]

Degree of Reuse – If libraries or services have a high level of reuse throughout the product, consider separating and maintaining them apart from code that is specialized for individual features or services. A service in this regard may be something that is linked at compile time, deployed as a shared dynamically loadable library or operate as an independent runtime service.

Team Size – Small, dedicated teams can handle micro services with limited functionality and high rates of change, or large functionality (monolithic solutions) with low rates of change. This will give them a better sense of ownership, increase specialization, and allow them to work autonomously. Team size also has an impact on whether a service should be split. The larger the team, the higher the coordination overhead inherent to the team and the greater the need to consider splitting the team to reduce codebase conflict. In this scenario, we are splitting the product up primarily based on reducing the size of the team in order to reduce product conflicts. Ideally splits would be made based on evaluating the availability increases they allow, the scalability they enable or how they decrease the time to market of development.

Specialized Skills – Some services may need specialized development skills that are distinct from those of the remainder of the team. You may, for instance, need some portion of your product to run very fast. That portion may in turn require a compiled language and a great depth of knowledge in algorithms and asymptotic analysis. The engineers who build it may have a completely different skillset from the rest of your team, whose code may be interpreted and mostly focused on user interaction and experience. In other cases, you may have code that requires deep domain experience in a very specific area like payments. Each of these is an example of a consideration that may indicate a need to split out a service and that may inform the size of that service.

 

Availability and Fault Tolerance Considerations:

Desired Reliability – If other functions can afford to be impacted when the service fails, then you may be fine grouping them together into a larger service. Indeed, sometimes certain functions should NOT work if another function fails (e.g. one should not be able to trade in an equity trading platform if the solution that understands how many equities are available to trade is not available). However, if you require each function to be available independent of the others, then split them into individual services.

Criticality to the Business – Determine how important the service is to business value creation while also taking into account the service’s visibility. One way to view this is to measure the cost of one hour of downtime against a day’s total revenue. If the business can’t afford for the service to fail, split it up until the impact is more acceptable.

Risk of Failure – Determine the different failure modes for the service (e.g. a billing service charging the wrong amount), what the likelihood and severity of each failure mode occurring is, and how likely you are to detect the failure should it happen.   The higher the risk, the greater the segmentation should be.

 

Scalability Considerations:

Scalability of Data – A service may already be a small percentage of the codebase, but as the data that the service needs to operate on scales up, it may make sense to split again.

Scalability of Services – What is the volume of usage relative to the rest of the services? For example, one service may need to support short bursts during peak hours while another has steady, gradual growth. If you separate them, you can address their needs independently without having to over engineer a solution to satisfy both.

Dependency on Other Service’s Data – If the dependency on another service’s data can’t be removed or handled with an asynchronous call, the benefits of disaggregating the service probably won’t outweigh the effort required to make the split.

 

Cost Considerations:

Effort to Split the Code – If the services are so tightly bound that it will take months to split them, you’ll have to decide whether the value created is worth the time spent. You’ll also need to take into account the effort required to develop the deployment scripts for the new service.

Shared Persistent Storage Tier – If you split off the new service, but it still relies on a shared database, you may not fully realize the benefits of disaggregation. Placing a read-only DB replica in the new service’s swim lane will increase performance and availability, but it can also raise the effort and cost required.

Network Configuration – Does the service need its own subdomain? Will you need to make changes to load balancer routing or firewall rules? Depending on the team’s expertise, some network changes require more effort than others. Ensure you consider these changes in the total cost of the split.

 

 

The illustration below can be used to quickly determine whether a service or function should be segmented into smaller microservices, be grouped together with similar or dependent services, or remain in a multifunctional, infrequently changing monolith.

[Figure: Decision matrix for segmenting services]


The Downside of Stored Procedures

I started my engineering career as a developer at Sybase, the company producing a relational database going head-to-head with Oracle. This was in the late 80’s, and Sybase could lay claim to a very fast RDBMS with innovations like one of the earliest client/server architectures, procedural language extensions to SQL in the form of Transact-SQL, and yes, stored procedures. Fast forward a career, and now as an Associate Partner with AKF, I find myself encountering a number of companies that are unable to scale their data tier, primarily because of stored procedures. I guess it is true – you reap what you sow.

Why does the use of stored procedures (sprocs) inhibit scalability? To start with, a database filled with sprocs is burdened by the business logic contained in those sprocs. Rather than relatively straightforward CRUD statements, sprocs typically become the main application ‘tier’, containing hundreds of lines of SQL each. So, in addition to persisting and retrieving your precious data, maintaining referential integrity, and ensuring transactional consistency, you are now asking your database to run the bulk of your application as well. Too many eggs in one basket.

And that basket will remain a single basket, as it is quite difficult to horizontally scale a database filled with sprocs. Of the three axes described in the AKF Scale Cube, the only one that readily provides horizontal scalability here is the Z axis, where you have divided or partitioned your data across multiple shards. The X axis, the classic horizontal approach, works only if you are able to replicate your database across multiple read-only replicas and segregate read-only activity out of your application, which is difficult to do if many of those reads come from sprocs that also write. Additionally, most applications built on top of a sproc-heavy system have no DAL, no data access layer that might control or route read access to a different location than writes. The Y axis, dividing your architecture up into services, is also difficult in a sproc-heavy environment, as these environments are often extremely monolithic, making it very difficult to separate sprocs by service.
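For contrast, here is a minimal sketch of what a thin DAL can look like when business logic lives in the application tier and reads are routed to a replica; the Db interface is an illustrative stand-in for whatever driver or query builder you actually use.

```typescript
// Minimal sketch of a data access layer (DAL) that routes reads to a replica and
// writes to the primary, with business logic living in application code rather
// than in stored procedures. The Db interface is an illustrative stand-in.

interface Db {
  query<T>(sql: string, params: unknown[]): Promise<T[]>;
  execute(sql: string, params: unknown[]): Promise<void>;
}

class AccountDal {
  constructor(private primary: Db, private replica: Db) {}

  // Reads go to the replica (X-axis scaling); add more replicas as read volume grows.
  getBalance(accountId: string): Promise<{ balance: number }[]> {
    return this.replica.query("SELECT balance FROM accounts WHERE id = ?", [accountId]);
  }

  // Writes go to the primary. Business rules (limits, fees, validation) live here in
  // application code, on cheap app servers, instead of inside a sproc on the DB host.
  async withdraw(accountId: string, amount: number): Promise<void> {
    if (amount <= 0) throw new Error("amount must be positive");
    await this.primary.execute(
      "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
      [amount, accountId, amount]
    );
  }
}
```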

Your database hardware is typically the beefiest, most expensive box you will have. Running business logic on it vs. on smaller, commodity boxes typically used for an application tier is simply causing the cost of computation to be inordinately high. Also, many companies will deploy object caching (e.g. memcached) to offload their database. This works fine when the application is separate from the database, but when the reads are primarily done in sprocs? I’d love to see someone attempt to insert memcached into their database.

Sprocs seem to be something of a gateway drug, where once you have tasted the flavor of application logic living in the database, the next step is to utilize abilities such as CLR (common language runtime). Sure, why not have the database invoke C# routines? Why not ask your database machine to also perform functions such as faxing? If Microsoft allows it, it must be good, right?

Many of the companies we visit that are suffering from stored procedure bloat started out building a simple application that was deployed in a turnkey fashion for a single customer with a small number of users. In that world, there are certainly benefits to using sprocs, but that was a 20th-century world, no longer viable in today’s world of cloud and SaaS models.

Just say no to stored procedures!


Definition of MVP

We often use the term minimum viable product, or MVP, but do we all agree on what it means? In the Scrum spirit of the Definition of Done, I believe the Definition of MVP is worth stating explicitly within your tech team. A quick search revealed these three similar yet different definitions:

  • A minimum viable product (MVP) is a development technique in which a new product or website is developed with sufficient features to satisfy early adopters. The final, complete set of features is only designed and developed after considering feedback from the product’s initial users. Source: Techopedia
  • In product development, the minimum viable product (MVP) is the product with the highest return on investment versus risk…A minimum viable product has just those core features that allow the product to be deployed, and no more. Source: Wikipedia
  • When Eric Ries used the term for the first time he described it as: A Minimum Viable Product is that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort. Source: Leanstack

I personally like a combination of these definitions. I would choose something along the lines of:

A minimum viable product (MVP) has sufficient features to solve a problem that exists for customers and provides the greatest return on investment by reducing the risk of what we don’t know about our customers.

Just like no two teams implement agile the same way, we don’t all have to agree on the definition of MVP, but all your team members should agree. Otherwise, what is an MVP to one person is a full-featured product to another. Take a few minutes to discuss this with your cross-functional agile team and come to a decision on your Definition of MVP.
