AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth


5-95 Rule

Previously, I wrote about mitigating risk in the face of uncertainty. I suggested that an agile development model was one way successful companies have been able to mitigate risk. In that post I compared organizing into “Scrums” with how Army Special Forces organize in order to succeed in uncertain operating conditions. In addition to the way a company is organized, the way a company approaches project planning can mitigate risk. One of the most often overlooked aspects of project management is contingency planning.

Army Green Berets are fond of the saying, “No plan survives contact with the enemy,” a quote attributed to the famous Prussian General Helmuth von Moltke. Von Moltke and his Prussian contemporaries were brilliant strategists and sought to perfect the “Theory of War.” Their ideas still serve as the basis for our doctrine today. Von Moltke understood that battles can be broken down into a near-infinite set of complex options, each of them in turn having still more options depending upon the enemy’s response. A plan only goes so far, at which point the enemy ultimately casts his vote. Von Moltke provided his subordinate commanders with his intentions (a method still taught today in military basic leadership courses), and held them responsible for extensively preparing for all plausible contingencies.

As a Special Forces Detachment Commander, once my team received our orders, we immediately began to plan. Along with the mission, we were given a specific deadline by which we needed to brief our commander. Regardless of the deadline, whether it was 6 hours or a week, we spent approximately 5% of the time coming up with a solid, practical, and safe plan. We spent the other 95% of the time “war-gaming” the plan. During the “war-gaming” portion, we would go through every step of the plan from the time we left our barracks to the time we got back. We thought of every possible thing that could go wrong. What if only two helicopters showed up and not four? What if the helicopter couldn’t land where we wanted it? What if someone got injured? What if more bad guys were on the objective than we had originally thought? We would break our mission down into distinct pieces, or milestones, and would work in teams to come up with solutions to every possible contingency. We would collaborate with our partners, the pilots for example, and let the subject matter experts take the lead in their scope of responsibility. Time permitting, we built mock-ups of the buildings we anticipated operating in, and conducted detailed rehearsals. The rehearsals revealed other contingencies we hadn’t planned for, such as equipment that was being carried by the wrong guy or individuals who needed to work together that weren’t located near each other. We did this because we knew the battlefield was going to be fluid and dynamic. Being prepared not only enabled us to be successful, it saved lives.

While the tech industry doesn’t deal in life or death, although it certainly might feel like that at times, Von Moltke’s wisdom still applies, especially in complex project development. What if equipment doesn’t arrive on time during a data center build? What if a critical engineer or developer gets sick during a launch or gets hired away? AKF Partners is a company heavily influenced by veterans and our collective experience on and off the battlefield. We encourage our clients to practice the AKF “5-95 Rule” in their Agile methodology. Having a solid plan is a great start, but understanding the possible variations in the project’s execution will help ensure the project is delivered on time and within budget. Remember the AKF “5-95 Rule”: spend 5 percent of your time planning and 95 percent of your time developing contingencies.


Recent Interviews

Below is a list of recent interviews by AKF Partners, Mike Fisher & Marty Abbott on a variety of topics including scalability, healthcare.gov, and customer misbehavior:

Make sure you follow Marty, Mike, and AKF’s Facebook pages to keep up to date with the latest.



Power of Customer Misbehavior

AKF Partners



A False Sense Of Security and Complacency = Revenue Loss

It’s Monday morning. On Saturday evening, issues in one of your datacenters triggered a failover to your second datacenter to restore service. In other words, all customer traffic has been routed to a single datacenter. The failover was executed flawlessly and the team went back to bed, waiting for Monday morning to permanently fix the issue so traffic could once again run out of both datacenters. This morning you are expecting a flash sale that will make close to $8,000 a minute at peak. All is well and there is nothing to worry about. Right?

Hopefully you cringed at the above scenario. What if the data center you are running out of suffers a failure? Or what if the one data center now carrying all of your traffic simply wasn’t sized for acceptable performance during a traffic spike?

If it hasn’t happened yet, it will, and when it does your business stands to lose significant revenue. We see it over and over again with many clients and have also experienced it in practice. Multiple datacenters can create a false sense of security, and teams can become complacent. Remember, assume everything will fail as a monolith. If you are only running out of a single data center and the other is unable to take traffic, you now have a SPOF and as a whole the DC is a monolith. As a tech ops leader you have to drive the right sense of urgency and lead your team to have the right mindset. Restoring service with a failover is perfectly acceptable. However, the team cannot stop there. They must quickly diagnose the problem and return the site to normal service, which means you are once again running out of two datacenters. Don’t let the false sense of security slip into your ops teams. If you spot it, call it out and explain why.
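
The sizing question in this scenario can be made concrete with a quick headroom check. What follows is only a sketch: the capacity and traffic figures and the function names are hypothetical, and only the $8,000-per-minute figure comes from the scenario above.

```python
def failover_headroom(site_capacity_rps, peak_demand_rps):
    """Spare capacity (requests/sec) when ONE site carries all traffic.

    A negative result means the surviving datacenter cannot serve the peak.
    """
    return site_capacity_rps - peak_demand_rps


def revenue_at_risk(revenue_per_minute, outage_minutes):
    """Revenue lost if the only live datacenter is down for outage_minutes."""
    return revenue_per_minute * outage_minutes


# Hypothetical numbers: each site was sized for 60% of the flash-sale peak.
peak_rps = 10_000
single_site_rps = 6_000

headroom = failover_headroom(single_site_rps, peak_rps)
loss_30_min = revenue_at_risk(8_000, 30)  # $8,000 per minute, from the scenario

print(f"Headroom on one site: {headroom} rps")
print(f"Revenue at risk in a 30-minute outage: ${loss_30_min:,}")
```

A negative headroom is the signal that restoring the second datacenter cannot wait until after the sale.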

To help keep complacency from setting in, we recommend the following:

  1. Run a Morning Ops meeting with your business and review issues from the past 24 hours. Determine which issues need to undergo a postmortem. See one of our earlier blogs for more information: http://akfpartners.com/techblog/2010/08/29/morning-operations-meeting/
  2. Communicate to your team and your business about the failure and what is being done about it.
  3. Run a postmortem to determine multiple causes, along with actions and owners to address each cause: http://akfpartners.com/techblog/2009/09/03/a-lightweight-post-mortem-process/
  4. Always restore your systems to normal service as quickly as possible. If you have split your architecture along the Y or Z axis and one of the swim lanes fails or an entire datacenter fails, you need to bring it back up as quickly as possible. See one of our past blogs for more details on splitting your architecture: http://akfpartners.com/techblog/2008/05/30/fault-isolative-architectures-or-“swimlaning”/



Wine-Dark Sea

“There is a land called Crete in the midst of the wine-dark sea, a fair land and a rich, begirt with water, and therein are many men innumerable, and ninety cities.” – Homer (fl. 850 B.C.), Odyssey, Book XIX

“And some one shall some day say even of men that are yet to be, as he saileth in his many-benched ship over the wine-dark sea…” – Homer (fl. 850 B.C.), Iliad, Book VII

The phrase “wine-dark sea” appears dozens of times in the Iliad and the Odyssey, resulting in much debate about what Homer actually meant by the phrase. One theory is that he was describing an outbreak of red-colored marine algae. Another theory, put forward by researchers Wright and Cattley, was that the Greeks mixed highly alkaline water with their wine, resulting in blue wine.

The explanation that sounds most plausible to me is that ancient Greeks did not have a word for “blue”. At the time of Homer’s writing there were only five colors (metallics, black, white, yellow-green, and red). Lacking the appropriate term to describe the world, they used what they knew.

Many of us are not unlike the ancient Greeks. We can’t describe a new architecture or we can’t imagine our business differently because we don’t have words for it. Whether you are just out of school and don’t have a ton of experience, or you’re more senior but haven’t seen highly scalable system architectures, both leave you blind to “seeing” a different architecture or business model. Another way we’re blinded is just by being at a business for a number of years. Companies and institutions have collective beliefs, cultures, and even memories that become our own. This is completely normal. If you don’t adapt to the standards and norms of the company you’re going to have a rough go of it. Just as the body tries to reject transplanted organs because it doesn’t recognize the foreign object, companies reject employees and even leaders who don’t fit or adapt to fit.

So, what is one to do if you know you suffer from a blind spot? The solution to this is to bring in new people or, occasionally, consultants who have seen this before and can help you learn to describe the new world. As made famous by many twelve-step programs, the first step is to “admit there is a problem”. If you can’t admit that you might not see or be able to accurately describe everything, you’ll eventually get blindsided by either the inability to scale or, even worse, a competitor able to see things differently.


It’s Not About the Technology

Perhaps it’s because we’re technologists that we love shiny new technologies. However, for years now AKF has been telling anyone who will listen or read that “scaling is not about the technology”. Scale comes from system-level or deployment-focused architecture, which is the intersection of software and hardware. No doubt, we have some amazingly scalable technologies available to us today like Hadoop and MongoDB, but when your entire datacenter goes down (like Amazon and Calgary and GoDaddy and Sears and the list goes on…) these scalable technologies don’t keep your service available. And if you think that your customers care whether it was your fault or your vendor’s fault…you’re wrong. They pay you for your service and expect it to be available when they need it.

No doubt, the technology decisions are important. Whether you use Backbone or Knockout, or whether you choose Memcached or Redis, all of these technology decisions have pros and cons which can affect your team for a long time. But at the end of the day these decisions are not ones that will affect whether your application and organization can scale with growth. These technology decisions affect the speed and cost factors of your organization and technology. Some technologies your team knows more about or are naturally faster to learn; therefore, these cost you less. Other technologies are very popular (PHP) and thus engineers’ salaries are lower because there is more supply. Yet still other technologies (assembly language) are complex, appeal to a select group of engineers, and are very costly to develop in, but might cost very little to process transactions because of the efficiency of that technology.

Technology decisions are important but for different reasons than scaling. Relying on a technology or single vendor to scale is risky. To the vendor or open source project, you are one of many customers and the longevity of their business or project doesn’t depend on keeping your service available. However, your business does depend on this. Take control of the future of your business by scaling your service and organization based on systems-level or deployment-focused architecture. Leave the technology decisions outside of the systems architecture.




One of the most important aspects of managing a successful technology organization is ensuring that you are practicing and instilling the concept of enablement at all levels. This concept applies both to the product/service you are producing and to your people. A good example for your organization is enabling decision-making at the lowest levels possible. I have often seen this represented as “delegation”, but I believe that enablement of decision-making is a more powerful concept than delegation, which is driven from the top down. I recently had the opportunity to lead a large infrastructure team, and one of the first changes we made was breaking apart into reasonably sized PODs with the primary purpose of ensuring that decisions for the product and technology were driven from the bottom up while guidance was flowing in from various stakeholders. Many teams practice a flavor of Agile, but without enabling each POD to make the appropriate decisions you will run into organizational scalability problems.

The allure of IaaS & PaaS is firmly rooted in the concept of enablement. Self-service is amazing whether you are a DBA, a developer, or even the end user of your product. The cloud may not be suitable for your needs, but don’t let that stop your organization from thinking strategically about bringing those processes & technologies “in-house” for scalability reasons. Implemented correctly, the infrastructure and platform you are building should enable the users and not hinder them, as is sometimes the case. Reducing the number of dependencies between technology teams for launching products is not only good for the cycle time of product launches but also critical in scaling up.

Consider making enablement part of your technology team’s DNA and you will likely see that employee morale, productivity and other metrics like NPS will rise.

The End of Scalability?

If you received any sort of liberal arts education in the past twenty years you’ve probably read or at least had an assignment to read Francis Fukuyama’s 1989 essay “The End of History?”[1] If you haven’t read the article or the follow on book, Fukuyama argues that the advent of Western liberal democracy is the final form of human government and therefore it is the end point of humanity’s sociocultural evolution. He isn’t arguing that events will stop happening in the future but rather that democracy will become more and more prevalent in the long term, despite possible setbacks such as totalitarian governments for periods of time.

I have been involved, in some form or another, in scaling technology systems for nearly two decades, which does not take into account the decade before when I was hacking on Commodore PETs and Apple IIs learning how to program. Over that time period there have been noticeable trends such as the centralization/decentralization cycle within both technology and organizations. With regards to technology, think about the transitions from mainframe (centralized) to client/server (decentralized) to web 1.0 (centralized) to web 2.0/Ajax (decentralized) as an example of the cycle. The trend that has lately attracted my attention is about scaling. I’m proposing that we’re approaching the end of scalability.

As a scalability consultant who travels almost every week to clients in order to help them scale their systems, I don’t make this end of scalability statement lightly. However, before we jump into my reasoning, we first need to define scalability. To some, scalability means that a system needs to scale infinitely no matter what the load over any period of time. While certainly ideal, the challenge with this definition is that it doesn’t take into account the underlying business needs. Investing too much in scalability before it’s necessary isn’t a wise investment for a business when there are other great projects in which to invest, such as more customer-facing features. A definition that takes this into account defines scalability as “the ability of a system to maintain the satisfaction of its quality goals to levels that are acceptable to its stakeholders when characteristics of the application domain and the system design vary over expected operational ranges.” [2:119]

The most difficult problem with scaling a system is typically the database or persistent data storage. AKF Partners teaches general database and applications scale theory in terms of a three-dimensional cube where the X-axis of the cube represents replication of identical code or data, the Y-axis represents a split by dissimilar functions or services, and the Z-axis represents a split across similar transactions or data.[3] Having taught this scaling technique and seen it implemented in hundreds of systems, we know that by combining all three axes a system can scale infinitely. However, the cost of this scalability is increased complexity for development, deployment, and management of the system. But is this really the only option?
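
As a rough illustration of how the three axes combine for a single request, here is a minimal routing sketch. The service names, shard count, replica count, and the `route` function are all hypothetical and are not taken from any AKF implementation.

```python
import zlib

# Hypothetical topology, for illustration only.
SERVICES = {"search", "checkout"}  # Y-axis: split by function or service
SHARDS = 4                         # Z-axis: split by customer
REPLICAS = 3                       # X-axis: identical clones of each shard


def route(service, customer_id, request_number):
    """Pick (service, shard, replica) for one request."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    # Z-axis: a stable hash keeps each customer on the same shard.
    shard = zlib.crc32(customer_id.encode("utf-8")) % SHARDS
    # X-axis: round-robin across identical replicas of that shard.
    replica = request_number % REPLICAS
    return service, shard, replica


print(route("search", "customer-42", 0))
print(route("search", "customer-42", 1))  # same shard, next replica
```

Even in this toy form the added complexity is visible: every request now needs a routing decision on each axis, which is the development and operational cost the paragraph above describes.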

The NoSQL and NewSQL movement has produced a host of new persistent storage solutions that attempt to solve the scalability challenges without increased complexity. Solutions such as MongoDB, a self-proclaimed “scalable, high-performance, open source NoSQL database”, attempt to solve scaling by combining replica data sets (X-axis splits) with sharded clusters (Y & Z-axis splits) to provide high levels of redundancy for large data sets transparently for applications. Undoubtedly, these technologies have advanced many systems’ scalability and relieved developers of the complexity of managing replica sets and sharding themselves.

But the problem is that hosting MongoDB or any other persistent storage solution requires keeping the hardware capacity on hand for any expected increase in traffic. The obvious solution to this is to host it in the cloud, where we can utilize someone else’s hardware capacity to satisfy our demand. Unless you are utilizing a hybrid-cloud with physical hardware you are not getting direct attached storage. The problem with this is that I/O in the cloud is very unpredictable, primarily because it requires traversing the network of the cloud provider. Enter Solid-State Drives (SSD).

Chris Lalonde, CEO of ObjectRocket, a MongoDB cloud provider hosted entirely on SSDs, states that “Developers have been thinking that they need to keep their data set size the same size as memory because of poor I/O in the cloud and prior to 2.2.x MongoDB had a single lock, both making it unfeasible to go to disk in systems that require high performance. With SSDs the I/O performance gains are so large that it effectively negates this and people need to re-examine how their apps/platforms are architected.”

Lots of forward-thinking technology organizations are moving towards SSDs. Facebook’s appetite for solid-state storage has made it the largest customer for Fusion-io, putting NAND Flash memory products in its new data centers in Oregon and North Carolina. Lalonde says “When I/O becomes cheap and fast it drastically changes how developers think about architecting their application e.g. a flat file might be just fine for storing some data types vs. the heavy over head of any kind of structured data.” ObjectRocket’s service offering also provides some other nice features such as “instant sharding”, where the click of a button provides an entire 3-node shard on demand.

Besides the advances being made in leveraging NoSQL and SSDs to allow applications to scale using Infrastructure as a Service (IaaS), there are advances in Platform as a Service (PaaS) offerings such as Google App Engine (GAE) that are also helping systems scale with little to no burden on developers. GAE allows applications to take advantage of scalable technologies like BigTable that Google applications use, allowing them to claim that, “Automatic scaling is built in with App Engine, all you have to do is write your application code and we’ll do the rest. No matter how many users you have or how much data your application stores, App Engine can scale to meet your needs.” While GAE doesn’t have customers as large as Netflix, which runs exclusively on Amazon’s Web Services (AWS), its customers do include companies like the Khan Academy, which has over 3.8 million monthly unique visitors to its growing collection of over 2,000 videos.

So with solutions like ObjectRocket, GAE, and the myriad of others that make it easier to scale to significant numbers of users and customers without having to worry about data set replication (X-axis splits) or sharding (Y & Z-axis splits), are we at the end of scalability? If we’re not there yet we soon will be. “But hold on” you say, “our systems are producing more and consuming more data…much more.” No doubt the amount of data that we process is rapidly expanding. In fact, according to the EMC sponsored IDC study in 2009, the amount of digital data in the world was almost 500 exabytes and doubling every 18 months. But when we combine the benefits we achieve from such advances as transistor density on circuits (Moore’s Law), NoSQL technologies, cheaper and faster storage (e.g. SSD), IaaS, and PaaS offerings, we are likely to see the end of the need for most applications developers to care about manually scaling their applications themselves. This will at some point in the future all be done for them in “the cloud”.

So What?
Where does this leave us as experts in scalability? Do we close up shop and go home? Fortunately, no, there are still reasons that application developers or technologists need to be concerned with splitting data replica sets and sharding data across nodes. Two of these reasons are 1) reduce risk and 2) improve developer efficiency.

Reducing Risk
As we’ve written about before, risk has several high-level components (probability of an incident, duration, and % of customers impacted). Google “GAE outages” or “AWS outages”, or any other IaaS or PaaS provider and the word “outage”, and see what you find. All hosting providers that I’m aware of have had outages in their not-so-distant past. GAE had a major outage on October 26, 2012 for 4 hours. GAE proudly states at the bottom of their outage post “Since launching the High Replication Datastore in January 2011, App Engine has not experienced a widespread system outage.” That sounds impressive until you do the math and realize that this one outage caused their availability to drop to 99.975% for the nearly two years that the service has been available. Not to mention that they have much more frequent local outages or issues that affect some percentage of their customers. We have been at clients when they’ve experienced incidents caused by GAE.
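
The 99.975% figure can be checked with a few lines of arithmetic. This sketch assumes the measurement window starts on January 1, 2011 (the post only says “January 2011”) and ends on the day of the outage; the function name is illustrative.

```python
from datetime import date


def availability(outage_hours, window_hours):
    """Fraction of the window during which the service was up."""
    return 1 - outage_hours / window_hours


# Assumption: window runs from 1 January 2011 to the 26 October 2012 outage.
window_days = (date(2012, 10, 26) - date(2011, 1, 1)).days
gae = availability(outage_hours=4, window_hours=window_days * 24)

print(f"{window_days} days in the window")
print(f"Availability: {gae:.3%}")
```

Under that assumption a single 4-hour outage over roughly 22 months works out to about 99.975%, matching the figure above.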

The point here is not to call out GAE; trust me, all other providers have the exact same issue. The point is that when you rely on a 3rd party for 100% of your availability, you by definition have their availability as your ceiling. Now add on your own availability issues because of maintenance, code releases, bugs in your code, etc. Why is this? Incidents almost always have multiple root causes that include architecture, people, and process. Everything eventually fails, including our people and processes.
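
One simple way to see the ceiling is to treat the provider and your own stack as layers that must all be up at once. This is a sketch under a stated assumption, namely that failures in the two layers are independent, and the 99.9% figure for your own stack is hypothetical.

```python
def combined_availability(*layers):
    """Availability of layers that must ALL be up (a serial dependency).

    Assumes failures in each layer are independent of one another.
    """
    result = 1.0
    for layer in layers:
        result *= layer
    return result


provider = 0.9995  # the 99.95% SLA figure quoted later in this post
own_stack = 0.999  # hypothetical: your releases, bugs, and maintenance

total = combined_availability(provider, own_stack)
print(f"Provider alone: {provider:.2%}")
print(f"Provider plus your own stack: {total:.2%}")
```

Whatever number you plug in for your own stack, the combined result can only match or fall below the provider's figure.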

Given that you cannot reduce the probability of an incident to 0%, no matter whether you run the datacenter or a 3rd party provider does, you must focus on the other risk factors (reduce the duration and reduce the % of customers impacted). The way you achieve this is by splitting services (Y-axis splits) and by separating customers (Z-axis splits). While leveraging AWS’s RDS or GAE’s HRD provides cross availability zone / datacenter redundancy, in order to have your application take advantage of these you still have to do the work to split it. And if you want even higher redundancy (across vendors) you definitely have to do the work to split applications or customers between IaaS or PaaS providers.
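
To make the three risk factors concrete, the sketch below multiplies them into a single score. Combining them as a product is an assumption made for illustration, as are the probability, duration, and number of swim lanes.

```python
def expected_impact(probability, duration_minutes, fraction_customers):
    """Hypothetical risk score: expected customer-weighted minutes of impact."""
    return probability * duration_minutes * fraction_customers


# Hypothetical incident: 10% chance per month, 60 minutes to restore.
monolith = expected_impact(0.10, 60, 1.0)

# Z-axis split into 4 swim lanes: a failing lane affects 1/4 of customers.
# Simplification: the incident probability is held constant for comparison.
swimlaned = expected_impact(0.10, 60, 1 / 4)

print(f"Monolith: {monolith:.2f}")
print(f"Four swim lanes: {swimlaned:.2f}")
```

The probability never reaches zero in either case; what the split changes is how many customers a given incident can reach.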

Improving Efficiency
Let’s say you’re happy with GAE’s 99.95% SLA, which no doubt is pretty good, especially when you don’t have to worry about scaling. But don’t throw away the AKF Scalability Cube just yet. One of the major reasons we get called in to clients is because their BOD or CEO isn’t happy with how the development team is delivering new products. They recall the days when there were 10 developers and features flew out of the door. Now that they have 100 developers, everything seems to take forever. The reason for this loss of efficiency is that with a monolithic code base (no splits for different services), all 100 developers are trying to make changes and add functionality at once, and they are stepping all over each other. There needs to be much more coordination, more communication, more integration, etc. By splitting the application into separate services (Y-axis splits) with separate code bases, the developers can split into independent teams or pods, which makes them much more efficient.
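
A common rule of thumb, not something claimed in this post, is that coordination overhead grows with the number of pairwise communication paths in a team, n(n-1)/2. The sketch below applies that assumption to the 10 and 100 developer examples, plus a hypothetical split into ten pods of ten.

```python
def communication_paths(team_size):
    """Pairwise communication paths in a fully connected team: n(n-1)/2."""
    return team_size * (team_size - 1) // 2


small_team = communication_paths(10)
monolith_team = communication_paths(100)

# Hypothetical Y-axis split: ten independent pods of ten developers each,
# counting only the paths inside each pod.
ten_pods = 10 * communication_paths(10)

print(small_team, monolith_team, ten_pods)
```

Under this assumption the team grew tenfold while the possible paths grew more than a hundredfold, and splitting into pods brings the within-team count back down by an order of magnitude.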

We are likely to see continued improvement in IaaS and PaaS solutions that auto-scale and perform to such a degree that most applications will not need to worry about scaling because of user traffic. However, this does not obviate the need to consider scaling for greater availability / vendor independence or to improve a development team’s efficiency. All great technologists, CTOs, and application developers will continue to care about scalability for these and other reasons.

1. Fukuyama, F., The End of History? The National Interest, 1989. 16(4).
2. Duboc, L., E. Letier, and D. Rosenblum, Systematic Elaboration of Scalability Requirements through Goal-Obstacle Analysis. 2012.
3. Abbott, M.L. and M.T. Fisher, The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise. 2009: Addison-Wesley Professional.


Signs That You May Be Disconnected From Your Business

We as engineers love to solve problems. In fact, we love to explore new technologies and debate with our colleagues as to what may be the best to use. Should we use Python? Should we use PHP? Should we use Java? Should we try Google’s App Engine and code in Go? Should we use Amazon Web Services exclusively for our product? What database technology should we use? All of these decisions are important, and factors such as skillsets, security, performance, and cost should be considered. We love to code and see the product in action as it provides us with a sense of accomplishment. Once we are done with a project and it’s deployed in production, we typically celebrate all of the hard work that went into the solution and we move on to the next project. Many times, after diving deep into the technical aspects of the solution for weeks and maybe even months, we wake up and discover we are not in touch with the business like we should be. All of us have seen this in practice and many of our clients face this challenge.

What are some of the signs that you may not be aligned closely enough with the business and its performance and what should you do about it?

1) Not understanding feature impact – New features are introduced into your product without an understanding of the business impact.

You should never introduce new features without understanding the impact they are supposed to have on your business. In other words, establish a business goal for new features. For example, a new feature that allows for one-click purchase is expected to improve conversion rates by 15% within 2 months. Remember, all goals should be SMART goals (per chapter 1 in The Art of Scalability – Specific, Measurable, Attainable, Realistic, and Time constrained).

2) Celebrating launch and not success – Upon deployment of your product and confirmation that everything is working as expected, fireworks go off, confetti falls from the ceiling, and the gourmet lunch is ordered.

While recognizing your team for their efforts to launch a feature can be important, you really should celebrate when you have reached the business goal that is associated with that feature (see item #1 above). This might be achieving a revenue target, increasing the number of active accounts, or reaching a conversion target. If you deploy a feature or a product, and your business targets are not met as expected, you have more work to do. This should not be a surprise. Agile development’s basic premise is that we do not know the final state of the feature in advance and thus we must iterate.

3) No business metric monitoring – Your DevOps team is alerted that something is wrong with one of your services but you rely on your customer support or service desk to tell you if customers are impacted.

This is something that many of our clients struggle with. We believe it’s critical to detect a problem before your customers have to tell you or you have to ask them. You should be able to determine if your business is being impacted without asking your customers.

By monitoring real-time business metrics and comparing the actual data to a historical curve, you can more quickly detect if there is a problem and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce. For more details on our preferred strategy, visit one of our earlier blogs. You can also read more about the Design To Be Monitored principle in our book Scalability Rules.
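
A minimal version of that comparison might look like the sketch below. The metric (logins per minute), the sample data, the 25% tolerance, and the function name are all hypothetical; a real historical curve would typically be built from the same time slot over several prior weeks.

```python
def is_anomalous(actual, expected, tolerance=0.25):
    """Flag a reading that deviates from the historical value by more than
    the given fraction (25% by default)."""
    if expected == 0:
        return actual != 0
    return abs(actual - expected) / expected > tolerance


# Hypothetical data: logins per minute, historical average vs. live reading.
historical_curve = {"09:00": 400, "09:01": 410, "09:02": 420}
live = {"09:00": 395, "09:01": 405, "09:02": 150}

alerts = [
    minute
    for minute in sorted(live)
    if is_anomalous(live[minute], historical_curve[minute])
]
print("Minutes needing attention:", alerts)
```

Because the check is on a business metric rather than a system metric, it fires only when customers are actually affected, whatever the underlying cause turns out to be.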

As your company grows, make sure that your product, your engineering team, and your technical operations are closely aligned with your business. We firmly believe that we as technologists must stay aligned with our business for success. In fact, we really are the business. Our solutions enable the company to earn revenue.


Dealing with Shared Services

In our latest Newsletter, we wrote about the importance of aligning your agile teams to the architecture of the system and the trend we are seeing as our clients move towards PODS. We believe and teach that designing your architecture is only the first step in building an organization that can scale in support of your product. Remember, agile autonomous teams are able to act more efficiently which results in a speedier TTM. Ideally, they should almost be able to behave like mini-startups.

Aligning teams to swim lanes is pretty straightforward but what do you do with a team that’s central to multiple services?

We often see clients who have this challenge. They have a shared service or feature that the other clearly split autonomous teams depend on. For example, there are several sites that bring consumers together with businesses in their local area. These sites often have categories or verticals that need search functionality. Asking each team to design its own search functionality would be wasteful as you would end up with engineers designing redundant functionality which would exponentially cost more to operate. It is absolutely feasible to create a team that would focus on search and be used by other teams in this situation. We do recommend that you minimize the existence of these types of teams when possible, as there is always risk that they could slow down TTM.

Great! All of the other teams are going to bombard the shared service team with new development requests. So what do you do to mitigate the risk of over allocating your engineers with such a team?

This risk is real, as the other teams will make requests for enhancements or functionality to support their services and they will want it quickly. To mitigate this risk we suggest thinking of the team almost like you would an open source project. That doesn’t mean you simply open up the search code base to all of your engineers and let them have at it. Rather, it means you put mechanisms in place to help control the quality and design for your business. An open source project often has its own repo and typically only allows trusted developers to commit. In our search example, you could designate a couple of trusted and experienced engineers in the other PODS who can code and commit to the search repo. Engineers on the search team can be focused on making sure new functionality aligns with architectural and design principles that your company has established. This approach should help to mitigate the potential bottleneck such a team could create.

OK, now that you have spread out the development of search, who really owns it?

Remember, ownership by many is ownership by none. In our example, the search team ultimately owns the search feature and code base. As other developers commit new code to the repo, the search team should conduct code, design, and architectural reviews. Just as the other PODS will deploy new features to production, the search team will also own deployment of search. Overall, all of your teams should have objectives that align with a few key business success metrics.
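As a rough illustration of the kind of mechanism described above, the two rules (only trusted engineers commit, and the search team reviews everything) can be written down as a small policy check. This is only a sketch: the engineer names, the `Change` type, and the `may_commit` and `may_merge` functions are all hypothetical and do not correspond to any particular source control tool.

```python
from dataclasses import dataclass, field

# Hypothetical rosters, invented for this example.
SEARCH_TEAM = {"ana", "raj"}           # owners of the search repo
TRUSTED_COMMITTERS = {"lee", "sam"}    # designated engineers from other PODs


@dataclass
class Change:
    author: str
    approvals: set = field(default_factory=set)  # names of reviewers who approved


def may_commit(author: str) -> bool:
    """Only the owning team and designated trusted engineers may commit."""
    return author in SEARCH_TEAM or author in TRUSTED_COMMITTERS


def may_merge(change: Change) -> bool:
    """A change merges only if its author may commit and at least one
    search team member other than the author has approved it."""
    owner_reviews = (change.approvals & SEARCH_TEAM) - {change.author}
    return may_commit(change.author) and bool(owner_reviews)
```

With these rules, a change from a trusted engineer in another POD still needs a search team approval, which keeps ownership with one team while spreading out the development work.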

Remember, whatever mechanisms you put in place, your shared service or tools team should be a gas pedal and not a brake for TTM. Good luck scaling your architecture and your organization. We would love to hear about some of the experiences from those of you who have tried this or other approaches.


Cloud Services

Recently a reader asked whether we thought the points in our 2008 article “The Cloud Isn’t For Everyone” were still valid. At that time we pointed out that there were 5 major concerns: Security, Non-portability, Control, Limitations (persistent storage, public IPs, etc.), and Performance. We also pointed out these pros: Cost, Speed, and Flexibility. Our short response was “yes”: other than the limitations, which have mostly been solved, these are still valid. However, this discussion got us interested in revisiting the topic.

We deal with a wide variety of companies from hospitals to ad tech to ecommerce companies, all of which have different levels of knowledge about cloud computing. In this post we’re going to first define different types of cloud computing and then discuss some of the concerns with each.

A common definition of cloud computing is the delivery of computing and storage capacity as a service. A distinct characteristic is that users are charged based on usage instead of through a more conventional license or upfront purchase. There are three basic types of cloud computing:

  • Infrastructure as a Service (IaaS) – This is the most basic type of cloud service and offers servers (often as virtual machines), networking, and storage as services. Examples of this include Amazon Web Services and Rackspace.
  • Platform as a Service (PaaS) – This cloud offering provides not only the hardware but a layer above, providing platforms to run custom applications usually specific to certain programming languages. Examples of these include Microsoft’s Azure and Google’s App Engine.
  • Software as a Service (SaaS) – This service is an offering of a finished product hosted in a multi-tenant manner (many customers on a single implementation). Examples of this include Gmail, Salesforce, ServiceNow, New Relic, and many more.

In a 2012 survey with 785 respondents by North Bridge Venture Partners, we see trends indicating that companies are much more comfortable with cloud computing offerings. Down from 11% last year, a very low 3% of respondents consider cloud services to be too risky. Only 12% say the cloud platform is too immature, and 50% of the survey respondents now say they have “complete confidence” in the cloud. Eighty-four percent of all net new software will be SaaS-based. The takeaway is that cloud computing is simply becoming the way we do things.

The demand for SaaS offerings is growing rapidly. Gartner predicts that SaaS will hit $14.5 billion in 2012, a 17.9% growth from the previous year, with growth continuing through 2015 when it will be $22.1 billion. This growth is being fueled by several factors, including new software design and delivery models that allow more instances of an application to run simultaneously, bandwidth costs continuing to drop, and customer frustration over the cycle of purchasing, paying for maintenance, and going through time-consuming upgrades. According to CIOZone.com, a list of the top 60 fastest-growing public software companies in 2007 was dominated by companies switching from a proprietary license model to a subscription model.

When you consider implementing or purchasing a SaaS solution, there are a number of questions that you should ask.

  • What if the service is unavailable? Our often repeated mantra is that “Everything Fails.” If you’re in the game long enough you’ve seen it all fail – servers, network, storage devices, ISPs, and even entire datacenters. If the service fails, how does this impact your business or your own services?
  • Where is your data secured and backed up? If you’re storing sensitive data – corporate email, customer names, or PII – it is important that you understand how your data is being handled and protected.
  • What is the cost? Besides the monthly subscription or usage cost, you should investigate the total cost including startup, transfer, and periodic data migration.
  • What level of access? It’s your data but that doesn’t mean you’ll have unlimited access to it. Consider how you retrieve the data today and most importantly how you might want to access it in the future.
  • Does it comply with industry regulations? If your business has regulatory requirements such as SOX, PCI, PA-DSS, etc., you should investigate whether the SaaS provider supports these.
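For the first question above, one common way to limit the impact of an unavailable SaaS dependency is to retry briefly and then degrade to a local fallback. The snippet below is a minimal sketch of that idea, not a recommendation for any particular provider; `call_with_fallback`, `primary`, and `fallback` are hypothetical names, and a real implementation would catch only the specific errors the provider's client raises.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_s: float = 0.0,
) -> T:
    """Call the SaaS-backed function; after the initial attempt plus
    `retries` retries have all failed, return the fallback's result."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Broad catch for illustration only; narrow this in real code.
            if attempt < retries:
                time.sleep(delay_s)
    return fallback()
```

For example, `primary` might query a hosted search or email service while `fallback` returns a cached result or queues the request for later, so that a provider outage degrades your service rather than taking it down.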

The demand for PaaS, at least among our clients, is not that great. As Barb Darrow on GigaOm stated, “For die-hard .Net heads, Azure is probably the PaaS of choice. But for the army of new-age web developers, it’s an also-ran.” In fact, there are early indications that PaaS providers are starting to offer services that are moving them more into the IaaS market. If you’re interested in PaaS, your choices are very limited based on the technology stack that you are using. The real question to ask for PaaS is whether you should go directly to an IaaS provider or whether you gain enough benefits from the PaaS provider to make up for the additional cost.

The spend on IaaS cloud computing is expected to grow 48.7 percent in 2012 to $5.0 billion, up from $3.4 billion in 2011. Some of our clients have 100% of their services hosted on an IaaS provider while others are completely in a colocation facility or datacenter. We are seeing more clients move towards a hybrid model where they make use of colocation for the majority of their hardware but burst demand to the cloud when needed. This burst might be triggered by an unusually high load of user traffic, by nightly batch jobs that process log files, or by QA of an iteration. Wherever you are on the spectrum of IaaS utilization, here are some concerns that you should get comfortable with before diving in.

  • Security – Many IaaS providers are becoming PCI DSS, ISO 27001, and HIPAA compliant. However, simply using their service doesn’t make you compliant, and you need to be responsible for your own auditing. Passing audits is not cut and dried but rather a negotiation with the auditors, and therefore you need to be able to clearly articulate how you are following the standards or guidelines while utilizing cloud computing.
  • Cost – While the cost of IaaS is decreasing, most companies still find that if they run the servers (virtual machines) 24 hours a day, the break-even point against owning the hardware is around 18 months. This is a simple spreadsheet analysis that should be run to determine the most cost-effective solution.
  • Inconsistent I/O – There is a huge amount of variability with some IaaS storage. Some of our clients run Bonnie to test I/O prior to using the instance and then periodically to ensure it hasn’t dropped drastically. You need to make sure your application can handle this variability or create work-arounds to handle this issue.
  • High failure rate – The virtual machines of most IaaS providers seem to be less reliable than bare metal. Netflix, which moved to 100% IaaS, came up with Chaos Monkey and the Simian Army to address this lower reliability.
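The spreadsheet analysis mentioned under Cost can be sketched in a few lines. The version below is a simplified model under stated assumptions: it compares an upfront hardware purchase plus a monthly colocation cost against an hourly cloud rate, and ignores staffing, depreciation, and bandwidth. The function name and all of the dollar figures are made up for illustration.

```python
from typing import Optional


def break_even_months(
    hardware_upfront: float,
    owned_monthly: float,
    cloud_hourly: float,
    hours_per_day: float = 24.0,
    days_per_month: float = 30.0,
) -> Optional[float]:
    """Months after which owning the hardware becomes cheaper than renting
    the equivalent cloud instance, or None if it never does."""
    cloud_monthly = cloud_hourly * hours_per_day * days_per_month
    monthly_saving = cloud_monthly - owned_monthly
    if monthly_saving <= 0:
        # The cloud costs no more per month, so the upfront spend is never recovered.
        return None
    return hardware_upfront / monthly_saving


# Illustrative numbers only: a $3,600 server, $160/month to host it,
# versus a $0.50/hour cloud instance.
always_on = break_even_months(3600, 160, 0.50)                    # 18.0 months
part_time = break_even_months(3600, 160, 0.50, hours_per_day=8)   # None
```

With these invented inputs, an always-on server breaks even at 18 months, while one that runs only 8 hours a day never does, which is one reason bursting to the cloud for occasional load can make sense even when steady load is cheaper to own.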

Whether you are in the market for IaaS, PaaS, or SaaS cloud offerings, there are a variety of things to consider. In general, the concerns revolve around Security, Non-portability, Control, and Performance, while the benefits include Cost, Speed of Deployment, and Flexibility.