AKF Partners

Abbott, Keeven & Fisher Partners – Partners In Hyper Growth


Active-Passive and Spare Tires

If a company has already established a disaster recovery plan that involves failover to a cold or passive datacenter, it is often hard to convince them to switch to a solution that takes traffic at both datacenters. We call the failover type of architecture active-passive, and the type that accepts traffic at both datacenters active-active. If you too believe an active-passive architecture is a satisfactory solution for disaster recovery, consider this analogy.

Most cars have four tires and a spare tire in the trunk. However, you might have noticed that semi-trucks have pairs of tires on both the tractor and the trailer, except for the front wheels that steer. An active-passive architecture is like the car; an active-active solution is like the semi-truck. Here are three comparisons that might help sell you on an active-active solution.

1) Even with the best of intentions, active-passive solutions don’t get tested regularly. How often do you practice changing, or even check, your spare tire? If you don’t check your spare tire periodically for the correct air pressure, it might be flat when you need it most. Passive datacenters are the same way. If you don’t roll out code to the passive datacenter every release and occasionally take actual traffic there, it probably won’t work when you need it.

2) Active-active solutions are much faster to take over traffic when there is a disaster. If you’re racing down the road, which is faster when you get a flat: stopping to replace the flat tire with the spare, or continuing to ride on the extra tire? Even if you use a DNS solution like UltraDNS that can fail over quickly, you’ll likely need to warm up caches, apply the last round of data logs, etc. before a passive datacenter can safely take traffic.

3) Active-active solutions make better use of the investment in equipment than active-passive solutions. The spare tire in your trunk might get used once every year, if you’re unlucky. The second tire on the semi gets used every day helping carry a greater load and reducing the wear on the other tire.

While active-passive is better than not having a disaster recovery plan, it’s not the best that you can do. Consider getting to an active-active solution that exercises your DR solution every day and makes use of all that investment.


Alternative Solutions to Old Problems

Are you, like @devops_borat, not a fan of DevOps? Or maybe you think deploying dozens of times each day to production is ludicrous. I’m actually a fan of both DevOps and continuous deployment, but if you’re not, don’t worry: these are just new solutions to old problems, and there are alternatives.


The Problems
As long as people have been divided into separate organizations, there has been strife and competition between teams. In the technology field, nowhere is this more apparent than between development and operations. At least half of the companies we meet with have problems getting these teams to work together. If you’ve been around for a few years, you’ve surely heard one team pointing to the other as the problem, whether that problem is an outage or slow product development.

A solution to this problem is DevOps. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.”

Another common tech problem is that large changes are risky. It is called “Big Bang” for a reason…things go bang! If you’ve been part of an ERP implementation that took months if not years to prepare for, you know how risky these large changes are.

A solution to this problem is to make small changes more frequently. According to Eric Ries, co-founder and former CTO of IMVU, continuous deployment is a method of improving software quality due to the discipline, automation, and rigorous standards that are required in order to accomplish continuous deployment.

Alternative Solutions
Admittedly, DevOps and continuous deployment are somewhat extreme for some teams. For those, or for teams that just don’t believe these are the right solutions, don’t fret: there are alternatives.

JAD/ARB – For improving coordination between development and operations, we recommend the JAD and ARB processes. These are very lightweight processes that force the teams to work together toward better architected and better supported solutions.

Progressive Rollout – For reducing risk by making smaller changes, we recommend progressive rollout. This is a simple concept: first push code to a very small set of servers, monitor for issues, and then progressively increase the percentage of servers that receive the new code. The time between rollouts can be 30 minutes to 24 hours, depending on how quickly you are likely to detect problems. We often suggest the percentages of servers in the progressive rollout be 1%, 5%, 20%, 50%, and 100%.
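As a rough sketch of how such a rollout schedule might be computed (the function name and stage list here are illustrative, not from any particular tool):

```python
import math

# Illustrative progressive rollout schedule: each stage pushes the new
# code to a growing percentage of servers, with a monitoring window
# between stages to detect problems while the blast radius is small.
STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]  # 1%, 5%, 20%, 50%, 100%

def rollout_plan(total_servers, stages=STAGES):
    """Return (stage_percentage, cumulative_server_count) pairs."""
    plan = []
    for pct in stages:
        # Always include at least one server, even for tiny fleets.
        count = max(1, math.ceil(total_servers * pct))
        plan.append((pct, count))
    return plan

for pct, count in rollout_plan(200):
    print(f"push to {pct:.0%} -> {count} servers, then monitor before continuing")
```

For a 200-server fleet this yields stages of 2, 10, 40, 100, and finally all 200 servers, each gated by a monitoring window.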

The bottom line is something technologists know – there are almost always multiple ways to solve a problem. If you don’t like the current or new solution look for an alternative.


Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, because they have properly architected their system, these regional outages didn’t affect ShareThis services at all. Of course kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger problem. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected, they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had a several-hour outage for many users, caused by an invalid configuration value in their caching tier. Every client that saw the value attempted to fix it, which involved a query to the database. The databases were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. There has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 of Scalability Rules states, the most common scalability-related failure is not designing to scale, and the second most common is not designing to fail. Everything fails; plan for it! Utilizing swim lanes or fault isolation zones will certainly minimize the impact of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines whether the request should be executed or whether too many other components are doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle or rate limit them, much as we do with an API. This way a small problem that would otherwise cause a huge cascade of reactions can be paused and handled in a controlled, more graceful manner.
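A minimal sketch of that “check in with an authority” idea, assuming a simple sliding-window throttle (the class and method names are hypothetical, not from any real system):

```python
import time
from collections import deque

# Hypothetical central authority: before a component executes an
# expensive failsafe action (cache refresh, re-mirror), it asks
# whether too many similar actions were already approved recently.
class FailsafeThrottle:
    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window = window_seconds
        self.recent = deque()  # timestamps of approved actions

    def may_proceed(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop approvals that have aged out of the sliding window.
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) < self.max_actions:
            self.recent.append(now)
            return True
        return False  # caller should back off and retry later

# Allow at most 2 re-mirror operations per 60-second window.
throttle = FailsafeThrottle(max_actions=2, window_seconds=60)
```

A component whose `may_proceed()` call returns False simply waits and retries, turning a potential re-mirroring storm into an orderly queue of recovery work.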


Federated Cloud

In an interesting paper in the IBM Journal of Research and Development, the concept of a federated cloud model is introduced. This is a model in which computing infrastructure providers can join together to create a federated cloud. The advantages pointed out in the article include cost savings from not over-provisioning for spikes in capacity demand. To me the biggest advantage of this federated model is the lack of reliance on a single vendor, and likely higher availability due to greater distribution of computing resources across different infrastructure. One of our primary aversions to a complete cloud hosting solution is the reliance on a single vendor for the entire availability of your site. A true federated cloud would eliminate this issue.

However, as the article aptly points out there are many obstacles in the way of achieving such a federated cloud. Not the least of which are technical challenges to architect applications in such a modular manner as to be able to start and stop components in different clouds as demand requires. Other issues include administrative control and monitoring of multiple clouds and security concerns over allowing direct access to hypervisors by other cloud providers.

As we’ve prognosticated, pure VM-based clouds like AWS have had to offer dedicated servers for high-intensity IO systems like large relational databases. We’ve also predicted that with double-digit growth in cloud services forecast for the next several years, providers will resist the commoditization of their offerings through service differentiation. This attempt at differentiation will come in the form of add-on features and simplification across the entire PDLC. Unfortunately, this makes a federated cloud offering in the next couple of years very unlikely.


Availability as a Feature

It doesn’t matter if you run a commerce site, a services product (such as a SaaS offering) or simply use your homepage to distribute information:  The table stakes for playing online is high availability.  So many companies just take for granted that they will be highly available because they have multiple instances of systems and multiple copies of their data.  This assumption of availability will likely, at the very least, cost you significant pain and in the extreme cost you either significant market share or close your doors as a business.  Customers expect the unachievable – 100% availability.  At the very least you need to give them something close to that.  What will happen to you if you have a data center failure?  How about if a DBA accidentally drops a critical table in your production database?  What will you do when that marketing campaign triggers a near overnight doubling of traffic?  What happens when that new feature has a significant performance bug and gets adopted so quickly that it brings your entire site to its knees?

We often tell our clients that they should treat high availability as a feature.  Unfortunately, it is a somewhat expensive feature that requires constant investment over time to achieve and maintain.  It is a must-have feature that will only differentiate your firm if you have competitors who do not make similar investments; when competition exists, customers are more likely to leave a site for a competitor due to availability and performance issues than for nearly any other reason.  If you don’t believe us on this topic, just go ask the folks at Friendster.

Treating availability as a feature means measuring availability from a customer perspective rather than a systems or device perspective.  How many times did customer requests not complete?  In this regard, availability now becomes a percentage of failed transactions against an expected number of transactions.   We define an approach to accomplish this in our first book “The Art of Scalability”.  Every executive in the company should “own” the availability metric and understand its implication to the business.    You should track how much you invest in availability over time and significant decreases in engineering or capital should be questioned as it may be an early indicator that you are under investing and a harbinger of hard times to come.
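The transaction-based availability metric described above can be sketched in a few lines (the function name and sample numbers are hypothetical illustrations, not from the book):

```python
# Customer-perceived availability: rather than measuring device uptime,
# count transactions that failed to complete against the number of
# transactions customers were expected to attempt in the same period.
def availability(expected_transactions, failed_transactions):
    if expected_transactions <= 0:
        raise ValueError("expected_transactions must be positive")
    return 1.0 - failed_transactions / expected_transactions

# Example: 1,000,000 expected requests with 500 failures is "three and
# a half nines" from the customer's point of view.
print(f"{availability(1_000_000, 500):.4%}")
```

Note that by this measure a server can be "up" while availability is still degraded, because slow or erroring transactions count against you.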

One of the most common failures we see is to assume that disaster recovery is something that only big companies need.  Make no mistake about it, disasters do happen and given enough time they will happen to you.  Data centers catch on fire, have water (sprinkler) discharges that ruin equipment, have complete power equipment failures that take hours to fix and are prone to damages from vehicles, earthquakes, employees and tornados.  In our past lives as executives and current roles as advisors we’ve seen no less than 4 data center fires, 2 data centers incapacitated from earthquakes and tornados and one data center leveled by a truck running into it.  You are never too young to invest in DR.

And DR need not break your bank.  The dials of RTO (recovery time objective) and RPO (recovery point objective) allow you to determine how much you will invest.  Perhaps you simply replicate your databases to a smaller set of databases at a remote datacenter and have a copy of each of your systems there with an additional copy ready “in the cloud”.  While you won’t be able to run production from that data center, you may be able to leverage the cloud to add capacity for relatively low cost by cloning the cloud based systems.  Such a solution has a fast recovery point objective (you lose very little data) and a moderate recovery time objective (several hours) for very low comparative cost.  Of course, you would need to test the solution from time to time to show that it is viable, but it’s a cheap and effective insurance policy for the business.

So remember – availability is your most important feature.  Customers expect it always and will run away from you to competitors if you do not have it.  Create an availability metric and ensure that everyone understands it as a critical KPI.  Evaluate the company spend against availability quarterly or annually as an additional indicator of potential problems.   Assume that disasters happen and have a DR plan regardless of your company size.



The Future of IaaS and PaaS

Even though I’m a fan of technology futurists, I’m not much of a prognosticator myself. However, some recent announcements from Amazon and recent work with some clients got me thinking about the future of Infrastructure as a Service (IaaS), such as Amazon’s AWS, and Platform as a Service (PaaS), such as Google’s App Engine or Microsoft’s Azure.

Amazon’s most recent announcement was about Beanstalk. In case you missed it this new service is a combination of existing services, according to their announcement “AWS Elastic Beanstalk is an even easier way for you to quickly deploy and manage applications in the AWS cloud. You simply upload your application, and Elastic Beanstalk automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring.” This sounds like a move towards the PaaS to me but the announcement made a point that users retained the ability for total control if desired. It states “…you retain full control over the AWS resources powering your application and can access the underlying resources at any time.”

Werner Vogels, Amazon’s CTO, stated on his blog that the need for Beanstalk arose from the complexity involved in managing the entire software stack, which to me is the reason the concept of PaaS was developed. He cites examples already in use: Heroku and Engine Yard for Ruby on Rails, CloudFoundry for SpringSource, Acquia for Drupal, and PHP Fog for PHP. He states, “These platforms take away much of the ‘muck’ of software development to the extent that most RoR developers these days will choose to run on a platform instead of managing the whole stack themselves.” This sounds to me like a blurring of the lines between IaaS and PaaS.

Another item, which we actually wrote about at the end of last year, is the concept of DevOps. This idea, which has gained popularity recently, acknowledges the interdependence of development and operations in producing timely software products and services. Software developers in many organizations need simpler, consolidated platform services in order to procure, deploy, and support virtual instances themselves. This is another push toward PaaS platforms, but with the flexibility for control when necessary.

Market predictions for cloud services in 2014 span from $55B according to IDC up to $148B according to Gartner. Regardless of the exact number, the trend is double digit growth for many years to come. While the market will pressure for commoditization of these services, providers will resist this through service differentiation. This attempt at differentiation will come in the form of add-on features and simplification across the entire PDLC.

The future of IaaS and PaaS is a blurring of the lines between the two. IaaS providers will offer simpler alternatives while still offering full control, and PaaS providers will likely start allowing greater control to attract larger markets. Let us know if you have any thoughts on the future of IaaS or PaaS or both.


Why A Technology Leader Should Code

After I left the military, I started in corporate America as a software developer. I spent several years programming on various projects in a variety of languages. Perhaps more quickly than I wanted, I entered the management ranks. Starting as an engineering manager, I progressed into a number of executive roles including VP of Engineering, CIO, and CTO. It has now been well over a decade for me as a manager and executive but through these years I have continued to program. From the technology executives that I’ve met this is fairly unusual. Most tech execs gladly give up programming upon entering management and never look back.

I’ve never considered myself a great programmer and what I do today compared to a professional developer is like comparing a weekend gardener with an industrial farmer. Recently I’ve been considering whether continuing to program is clutching to my technical youth or actually beneficial as a technology leader. We’ve written about How Technical a CTO Should Be but here are a few more specific thoughts on programming.

Technical and Tactical Proficiency
As a junior officer I was taught that in order to lead, one had to be “technically and tactically” proficient. I owed it to the soldiers in my unit to understand the equipment our unit employed and the basic combat tactics that we would be following. This concept has stuck with me, and I believe that technology leaders need to understand the tools their team is working with and the processes they are following. The exact level of understanding is a personal choice and highly debatable. For me, I like, if at all possible, to have hands-on experience. Periodically having to code a feature and deploy it will give an engineering manager a better understanding of, and appreciation for, what her engineers go through on a daily basis.

Tangible Results
Leading people can be one of the most challenging and yet rewarding jobs. Getting a team to buy into a single vision and motivating them to deliver on that vision is a day-to-day challenge that can wear the best of us down. When that team finally delivers, or when the junior employee you’ve been coaching starts performing like the star you knew they could be, it all seems worth it. Unfortunately, those reward days are months or years apart. During the interim days and weeks, the lack of tangible results can be difficult.

This is where programming fits. Coding provides immediate feedback and accomplishment of short-term goals. When your function works perfectly the first time you test it or when the solution to that very difficult problem becomes clear, you receive instant gratification and tangible results.

Some leaders use other hobbies like woodworking or gardening to provide this short-term gratification. Start working on a garden and within a couple of hours or days you can see the impact of your work. The ground is turned over, weeds are removed, seeds are planted. After a couple of weeks or months the project is completed with the results on your dinner table, proof of your achievement.

While these physical activities are enjoyable and rewarding they don’t expand your knowledge of developing systems. Consider deliberate practice by picking up a programming project to receive tangible rewards and improve your technical and tactical proficiency.


Defining Pods, Shards and Swim Lanes

In the course of our engagements we often have to pause for a few minutes to acquaint everyone with a few terms that we use. It is often the case that our clients have heard, or even use, some of the terms common in the industry. Three terms that are often used and/or confused are pods, shards, and swim lanes. Let’s start by defining each one and then explain the differences.

According to Merriam-Webster a shard is a small piece or part. Wikipedia defines a database shard as “…a method of horizontal partitioning in a database or search engine.” The term horizontal partitioning refers to a database design principle whereby rows of a database table are separated possibly onto physically distinct database servers.

A shard, to AKF, is a Z-axis split on the AKF Scale Cube. This involves splitting the tables in the database across two or more database servers based on some appropriate key, such as customer ID or sales item. An X-axis split involves replicas, such as read-only slaves or standbys, that are complete copies of the primary database. Y-axis splits are done by service, which usually aligns to a subset of tables. An example of this would be pulling session data off the primary database and onto its own database server.
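A minimal sketch of the routing logic behind a Z-axis split, assuming simple modulo routing on customer ID (the shard names and function are hypothetical; real systems often use a directory service so shards can be rebalanced):

```python
# Hypothetical Z-axis (shard) lookup: rows are split across database
# servers by customer ID, so every query for a given customer is routed
# to the one shard that holds that customer's data.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for_customer(customer_id, shards=SHARDS):
    # Modulo routing is the simplest scheme; a lookup/directory service
    # is more flexible when shards must be added or rebalanced.
    return shards[customer_id % len(shards)]

print(shard_for_customer(42))  # customer 42 -> db-shard-2
```

The application tier uses this mapping before every query, which is why a Z-axis split, unlike an X-axis replica, requires application awareness.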

One of our clients, Salesforce.com, uses the term pods, especially for its Force.com software-as-a-service platform. Pods are self-contained sets of functionality that can consist of app servers and databases. If a pod goes down, only the customers on that pod will be affected. Salesforce executives claimed that the platform delivered 99.95 percent uptime last year.

Swim Lanes
AKF uses the term “swim lane” to describe a failure domain or fault isolation architecture. A failure domain is a group of services within a boundary such that any failure within that boundary is contained and failures do not propagate outside. The benefit of such a failure domain is two-fold:

  1. Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
  2. Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon approach only a portion of users or a portion of functionality of the product is affected.

Between swim lanes, synchronous calls are absolutely forbidden, because any synchronous call between failure domains, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures. An example of how this happens is in your database, when one long-running query slows down all the other queries competing for locks or resources.

Similarity and Differences
All of these terms describe similar architectures (splitting by customer or a similar key), but they are employed for different purposes. Shards are specific to databases and don’t imply whether or not the application tier is split as well. The purpose of shards is to scale an RDBMS across many servers instead of onto larger hardware. Pods and swim lanes aim to achieve both scalability of the overall system (application and database) and fault isolation.


RAC Rant

We’ve written about trying to use vendor features to scale but given how often we run across companies that have been convinced by vendors to rely on them, this topic is worth revisiting. To state it as directly as possible, every major SaaS company that has relied on a vendor, software or hardware, to scale them through hyper-growth has failed and had to solve the scale problem themselves.

Since Oracle World took place recently, I’ve decided to use Oracle RDBMS as an example of failing to scale with vendor features. We have nothing against using Oracle as an RDBMS, even though there are open source options that can scale just as well, but let’s use one of its scalability features, Real Application Clusters (RAC), as an example. In Oracle’s own words, RAC “…enables a single database to run across a cluster of servers, providing unbeatable fault tolerance, performance, and scalability with no application changes necessary.” A nice concept – scaling with “no application changes” – but it isn’t realistic for hyper-growth companies. One large reason is that RAC does not scale across multiple datacenters, which is a requirement for hyper-growth companies since everything fails eventually, including datacenters. Even the “Extended Distance Clusters” for RAC nodes only extend to 25 kilometers over dark fiber (DWDM or CWDM).

The use of RAC for increased availability is fine but you should review our post on the downside of using vendor features and how to negotiate with vendors. In particular you should be aware that by using this feature you have weakened your position during renewal negotiations. If you think your sales person is being nice by throwing in the RAC feature for a low price, think again. As soon as you start using this feature they have the upper hand in negotiations.

Enough of the RAC rant, especially since this is just one example of many that are out there. Hardware vendors, both servers and storage, are just as guilty of trying to convince SaaS companies to rely on them for scalability. Keep your destiny in your own hands and resist relying on short term solutions to long term problems.


Outsourcing Engineering or Operations

A quick summary of AKF Partners' approach of what, why and how to outsource engineering efforts.

Our clients very often have questions over Why, What and How to outsource software development efforts, infrastructure, hosting, etc.  Readers of our book or frequent readers of our blog will notice that the questions are similar to those we ask in our “Build v. Buy” analysis.   The decision of what to outsource isn’t significantly different than determining when to buy rather than build.

Why outsource?  There are three very good and common reasons to outsource engineering efforts.

1)      You want to reduce your average cost of engineering and outsourcing may provide a way to do that (especially “offshoring”).  The right kind of outsourcing can reduce your unit cost of labor for engineering efforts.  But before you outsource, you should understand the full cost per unit developed of your engineering efforts so that you can measure and validate your cost benefit.

2)      You have a near-term need to increase engineering capacity that you cannot meet with current hiring practices.  If you need to 3x the size of your engineering team in 2 months, you probably need outside help.

3)      You fear that the engineering capacity need will be short lived and do not want the risk of hiring W2 employees.  Sometimes (2) and (3) are bundled together.  If you don’t have follow on work for some new system or product, you probably don’t want to hire and then fire employees.

The “What you should outsource” is very often mistaken as “why one should not outsource”.  There are almost always things you can outsource, and very often there are things you absolutely should not outsource.   We typically discuss 4 areas with our clients to help them understand what can and what should not be outsourced.

1)      Don’t outsource things that create strategic competitive differentiation for your company.  Having a third party develop the thing that differentiates you from your competitors is giving away the secret sauce.  It’s hard enough to protect intellectual property – if you simply give it to someone else you might as well just give it away.  Now probably not everything you do differentiates you from competitors.  For instance, if you run an ecommerce site you might determine that your product proposal system is a differentiator while search is not.  Outsource search, keep the development of your product proposal and analytics system in house.

2)      Don’t outsource product definition.  If you are in a product business, you really can’t outsource the definition of the product that makes you money.  We’ve seen customers try and it’s not pretty.

3)      Don’t outsource your architecture or standards.  Tightly coupled with product definition is the need to set the standards and architecture by which the platform abides.   You may believe that the beauty is in the idea or the specification of the product but if it takes off it will need to scale.  Few outsourcers are adept at defining scalable platforms because the largest and best companies simply don’t outsource that – ever.

4)      Don’t outsource areas where you need rapid response and flexibility.  These things might not be competitive differentiators – but if you expect a turn on a dime response in specific areas you aren’t likely to get those with a contractual relationship.

Finally we come to “How you should outsource”.  Here again, we have three common rules for our clients.

1)      Manage the outsourcer.  That means that you need to add employees to manage the outsourcer and the projects, which in turn means that the actual cost of outsourcing is higher than what the outsourcer has quoted.  Keep this in mind when considering outsourcing to dollar average costs down.

2)      Expect conflict.  Rarely do we see outsourced projects that don’t have conflict between internal engineering teams and the outsourced team.  Expect it and be prepared to manage it quickly.

3)      Deliver standards with specifications.  If you expect something to be 99.99% available, scale to 10x the current volume, and deliver new functionality, be very specific and demand proof.  We’ve even helped negotiate contracts where payment happens after proof in the production environment rather than upon delivery.

Summary:  Look to outsource when you want to manage the risk of growth or contraction and to lower your engineering costs.  Always expect that you will have to aggressively manage your outsourcer and always deliver specific standards of operation with your product specifications.  Never outsource areas that strategically differentiate your company or product offering or where you need strategic or tactical flexibility.
