AKF Partners

Abbott, Keeven & Fisher PartnersPartners In Hyper Growth

Category » CTO/CIO

Enablement

One of the most important aspects of managing a successful technology organization is ensuring that you are practicing & instilling the concept of enablement at all levels. This concept applies to both the product/service you are producing and for people. A good example for your organization is enabling decision-making at the lowest levels possible. I have often seen this represented as “delegation” but I believe that enablement of decision-making is a more powerful concept than delegation which is driven from the top-down. I recently had the opportunity to lead a large infrastructure team and one of the first changes we made was breaking apart into reasonable sized PODs with the primary purpose of ensuring that decisions for the product & technology were driven from the bottom-up while guidance was flowing in from various stakeholders. Many teams practice a flavor of Agile but without enabling each POD to make the appropriate decisions you will run into organizational scalability problems.

The allure of IaaS & PaaS is firmly rooted in the concept of enablement. Self-service is an amazing if you are a DBA, developer or even the end user of your product. The cloud may not be suitable for your needs but don’t let that stop your organization from thinking strategically about bringing those processes & technologies “in-house” for scalability reasons. Implemented correctly, the infrastructure and platform you are building should enable the users and not hinder them, as is sometimes the case. Reducing the number of dependencies between technology teams for launching products is not only good for cycle time of product launches but also critical in scaling up.

Consider making enablement part of your technology team’s DNA and you will likely see that employee morale, productivity and other metrics like NPS will rise.


The End of Scalability?

If you received any sort of liberal arts education in the past twenty years you’ve probably read or at least had an assignment to read Francis Fukuyama’s 1989 essay “The End of History?”[1] If you haven’t read the article or the follow on book, Fukuyama argues that the advent of Western liberal democracy is the final form of human government and therefore it is the end point of humanity’s sociocultural evolution. He isn’t arguing that events will stop happening in the future but rather that democracy will become more and more prevalent in the long term, despite possible setbacks such as totalitarian governments for periods of time.

I have been involved, in some form or another, in scaling technology systems for nearly two decades, which does not take into account the decade before when I was hacking on Commodore PETs and Apple IIs learning how to program. Over that time period there have been noticeable trends such as the centralization/decentralization cycle within both technology and organizations. With regards to technology, think about the transitions from mainframe (centralized) to client/server (decentralized) to web 1.0 (centralized) to web 2.0/Ajax (decentralized) as an example of the cycle. The trend that has lately attracted my attention is about scaling. I’m proposing that we’re approaching the end of scalability.

As a scalability consultant who travels almost every week to clients in order to help them scale their systems, I don’t make this end of scalability statement lightly. However, before we jump into my reasoning, we first need to define scalability. To some scalability means that a system needs to scale infinitely no matter what the load over any period of time. While certainly ideal, the challenge with this definition is that it doesn’t take into account the underlying business needs. Investing too much in scalability before its necessary isn’t a wise investment for a business when there are other great projects in which to invest such as more customer facing features. A definition that takes this into account defines scalability as “the ability of a system to maintain the satisfaction of its quality goals to levels that are acceptable to its stakeholders when characteristics of the application domain and the system design vary over expected operational ranges.” [2:119]

The most difficult problem with scaling a system is typically the database or persistent data storage. AKF Partners teaches general database and applications scale theory in terms of a three-dimensional cube where the X-axis of the cube represents replication of identical code or data, the Y-axis represents a split by dissimilar functions or services, and the Z-axis represents a split across similar transactions or data.[3] Having taught this scaling technique and seen it implemented in hundreds of systems, we know that by combining all three axes a system can scale infinitely. However, the cost of this scalability is increased complexity for development, deployment, and management of the system. But is this really the only option?

NoSQL
The NoSQL and NewSQL movement has produced a host of new persistent storage solutions that attempt to solve the scalability challenges without increased complexity. Solutions such as MongoDB, a self-proclaimed “scalable, high-performance, open source NoSQL database”, attempt to solve scaling by combining replica data sets (X-axis splits) with sharded clusters (Y & Z-axis splits) to provide high levels of redundancy for large data sets transparently for applications. Undoubtedly, these technologies have advanced many systems scalability and reduced the complexity of requiring developers to address replica sets and sharding.

But the problem is that hosting MongoDB or any other persistent storage solution requires keeping the hardware capacity on hand for any expected increase in traffic. The obvious solution to this is to host it in the cloud, where we can utilize someone else’s hardware capacity to satisfy our demand. Unless you are utilizing a hybrid-cloud with physical hardware you are not getting direct attached storage. The problem with this is that I/O in the cloud is very unpredictable, primarily because it requires traversing the network of the cloud provider. Enter Solid-State Drives (SSD).

SSD
Chris Lalonde, CEO of ObjectRocket a MongoDB cloud provider hosted entirely on SSDs, states that “Developers have been thinking that they need to keep their data set size the same size as memory because of poor I/O in the cloud and prior to 2.2.x MongoDB had a single lock, both making it unfeasible to go to disk in systems that require high performance. With SSDs the I/O performance gains are so large that it effectively negates this and people need to re-examine how their apps/platforms are architected.”

Lots of forward thinking technology organizations are moving towards SSDs. Facebook’s appetite for solid-state storage has made it the largest customer for Fusion-io, putting NAND Flash memory products in its new data centers in Oregon and North Carolina. Lalonde says “When I/O becomes cheap and fast it drastically changes how developers think about architecting their application e.g. a flat file might be just fine for storing some data types vs. the heavy over head of any kind of structured data.” ObjectRocket’s service offering also provides some other nice features such as “instant sharding” where through the click of a button provides an entire 3-node shard on demand.

GAE
Besides the advances being made in leveraging NoSQL and SSDs to allow applications to scale using Infrastructure as a Service (IaaS), there are advances in Platform as a Service (PaaS) offerings such as Google App Engine (GAE) that are also helping systems scale with little to no burden on developers. GAE allows applications to take advantage of scalable technologies like BigTable that Google applications use, allowing them to claim that, “Automatic scaling is built in with App Engine, all you have to do is write your application code and we’ll do the rest. No matter how many users you have or how much data your application stores, App Engine can scale to meet your needs.” While GAE doesn’t have customers as large as Netflix who run exclusively on Amazon’s Web Services (AWS) their customers do include companies like the Khan Academy, which has over 3.8 million monthly unique visitors to their growing collection of over 2,000 videos.

So with solutions like ObjectRocket, GAE, and the myriad of others that make it easier to scale to significant numbers of users and customers without having to worry about data set replication (X-axis splits) or sharding (Y & Z-axis splits), are we at the end of scalability? If we’re not there yet we soon will be. “But hold on” you say, “our systems are producing more and consuming more data…much more.” No doubt the amount of data that we process is rapidly expanding. In fact, according to the EMC sponsored IDC study in 2009, the amount of digital data in the world was almost 500 exabytes and doubling every 18 months. But when we combine the benefits we achieve from such advances as transistor density on circuits (Moore’s Law), NoSQL technologies, cheaper and faster storage (e.g. SSD), IaaS, and PaaS offerings, we are likely to see the end of the need for most applications developers to care about manually scaling their applications themselves. This will at some point in the future all be done for them in “the cloud”.

So What?
Where does this leave us as experts in scalability? Do we close up shop and go home? Fortunately, no, there are still reasons that application developers or technologists need to be concerned with splitting data replica sets and sharding data across nodes. Two of these reasons are 1) reduce risk and 2) improve developer efficiency.

Reducing Risk
As we’ve written about before, risk has several high-level components (probability of an incident, duration, and % of customers impacted). Google “GAE outages” or “AWS outages” or any other IaaS or PaaS provider and the word “outage” and see what you find. All hosting providers that I’m aware of have had outages in their not-so-distant past. GAE had a major outage on October 26, 2012 for 4 hours. GAE proudly states at the bottom of their outage post “Since launching the High Replication Datastore in January 2011, App Engine has not experienced a widespread system outage.” Which sounds impressive until you do the math and realize that this one outage caused their availability to drop to 99.975% for the entire year and a half that the service has been available. Not to mention that they have much more frequent local outages or issues that affect some percentage of their customers. We have been at clients when they’ve experienced incidents caused by GAE.

The point here is not to call out GAE, trust me all other providers have the exact same issue. The point is that when you rely on a 3rd party for 100% of your availability you by definition have their availability as your ceiling. Now add on your availability issues because of maintenance, code releases, bugs in your code, etc. Why is this? Incidents are almost always have multiple root causes that include architecture, people, and process. Everything eventually fails including our people and processes.

Given that you cannot reduce the probability of an incident to 0%, no matter whether you run the datacenter or a 3rd party provider does, you must focus on the other risk factors (reduce the duration and reduce the % of customers impacted). The way you achieve this is by splitting services (Y-axis splits) and by separating customers (Z-axis splits). While leveraging AWS’s RDS or GAE’s HRD provides cross availability zone / datacenter redundancy, in order to have your application take advantage of these you still have to do the work to split it. And if you want even higher redundancy (across vendors) you definitely have to do the work to split applications or customers between IaaS or PaaS providers.

Improving Efficiency
Let’s say you’re happy with GAE’s 99.95% SLA which no doubt is pretty good especially when you don’t have to worry about scaling. But don’t throw away the AKF Scalability Cube just yet. One of the major reasons we get called in to clients is because their BOD or CEO aren’t happy with how the development team is delivering new products. They recall the days when there were 10 developers and features flew out of the door. Now that they have 100 developers, everything seems to take forever. The reason for this loss of efficiency is that with a monolithic code base (no splits for different services) all 100 developers trying to make changes and add functionality, they are stepping all over each other. There needs to be much more coordination, more communication, more integration, etc. By splitting the application in to separate services (Y-axis splits) with separate code bases the developers can split into independent teams or pods that makes them much more efficient.

Conclusion
We are likely to see continued improvement in IaaS and PaaS solutions that auto-scale and perform to such a degree that most applications will not need to worry about scaling because of user traffic. However, this does not obviate the need to consider scaling for greater availability / vendor independence or to improve a development teams efficiency. All great technologist, CTOs, or application developers will continue to care about scalability for these and other reasons.


REFERENCES
1. Francis, F., The End of History? The National Interest, 1989. 16(4).
2. Duboc, L., E. Leiter, and D. Rosenblum, Systematic Elaboration of Scalability Requirements through Goal-Obstacle Analysis. 2012.
3. Abbott, M.L. and M.T. Fisher, The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise. 2009: Addison-Wesley Professional.


Comments Off

Signs That You May Be Disconnected From Your Business

We as engineers love to problem solve. In fact, we love to explore new technologies and debate with our colleagues as to what may be the best to use. Should we use Python? Should we use PHP? Should we use Java? Should we try Google’s App Engine and code in GO? Should we use Amazon Web Services exclusively for our product? What database technology should we use? All of these decisions are important and factors such as skillsets, security, performance, and cost should be considered. We love to code and see the product in action as it provides us with a sense of accomplishment. Once we are done with a project and its deployed in production we typically celebrate all of the hard work that went into the solution and we move on to the next project. Many times after diving deep into the technical aspect of the solution, for weeks and maybe even months, we wake up and we discover we are not in touch with the business like we should be. All of us have seen this in practice and many of our clients face this challenge.

What are some of the signs that you may not be aligned closely enough with the business and its performance and what should you do about it?

1) Not understanding feature impact – New features are introduced into your product without an understanding of the business impact.

You should never introduce new features without understanding the impact it is supposed to have on your business. In other words, establish a business goal for new features. For example, a new feature that allows for one click purchase is expected to improve conversion rates by 15% within 2 months. Remember all goals should be SMART goals (per chapter 1 in The Art of Scalability – Specific, Measurable, Attainable, Realistic, and Time constrained).

2) Celebrating launch and not success – Upon deployment of your product and confirmation that everything is working as expected, fireworks go off, confetti falls from the ceiling, and the gourmet lunch is ordered.

While recognizing your team for their efforts to launch a feature can be important, you really should celebrate when you have reached the business goal that is associated with that feature (see item #1 above). This might be achieving a revenue target, increasing the number of active accounts, or reaching a conversion target. If you deploy a feature or a product, and your business targets are not met as expected, you have more work to do. This should not be a surprise. Agile development’s basic premise is that we do not know what the final state of the feature and thus we must iterate.

3) No business metric monitoring – Your DevOps team is alerted that something is wrong with one of your services but you rely on your customer support or service desk to tell you if customers are impacted.

This is something that many of our clients struggle with. We believe its critical to detect a problem before you customers have to tell you or you have to ask them. You should be able determine if your business is being impacted without asking your customers.

By monitoring real time business metrics and comparing the actual data to a historical curve you can more quickly detect if there is a problem and avoid sifting trough alerting and monitoring white noise that your systems will inevitability produce. For more details on our preferred strategy, visit one of our earlier blogs. You can also read more about the Design To Be Monitored principle in our book Scalability Rules.

As you company grows, make sure that your product, your engineering team, and your technical operations are closely aligned with your business. We firmly believe that we as technologists must stay aligned with our business for success. In fact, we really are the business. Our solutions enable the company to earn revenue.


1 comment

Dealing with Shared Services

In our latest Newsletter, we wrote about the importance of aligning your agile teams to the architecture of the system and the trend we are seeing as our clients move towards PODS. We believe and teach that designing your architecture is only the first step in building an organization that can scale in support of your product. Remember, agile autonomous teams are able to act more efficiently which results in a speedier TTM. Ideally, they should almost be able to behave like mini-startups.

Aligning teams to swim lanes is pretty straightforward but what do you do with a team that’s central to multiple services?

We often see clients who have this challenge. They have a shared service or feature that the other clearly split autonomous teams depend on. For example, there are several sites that bring consumers together with businesses in their local area. These sites often have categories or verticals that need search functionality. Asking each team to design its own search functionality would be wasteful as you would end up with engineers designing redundant functionality which would exponentially cost more to operate. It is absolutely feasible to create a team that would focus on search and be used by other teams in this situation. We do recommend that you minimize the existence of these types of teams when possible, as there is always risk that they could slow down TTM.

Great! All of the other teams are going to bombard the shared service team with new development requests. So what do you do to mitigate the risk of over allocating your engineers with such a team?

This risk is real as the other teams will make requests for enhancements or functionality to support their services and they will want it quickly. To mitigate this risk we suggest thinking of the team almost like you would an open source project. That doesn’t mean you simply open up the search code base to all of your engineers and let them have at it. Rather, it means you put mechanisms in place to help control the quality and design for your business. An open source project often has its own repo and typically only allows trusted developers to commit. In our search example, you could designate a couple trusted and experienced engineers in the other PODS that can code and commit to the search repo. Engineers on the search team can be focused on making sure new functionality aligns with architectural and design principles that your company has established. This approach should help to mitigate the potential bottleneck such a team could create.

OK, now that you have spread out the development of search, who really owns it?

Remember, ownership by many is ownership by none. In our example, the search team ultimately owns the search feature and code base. As other developers commit new code to the repo, the search team should conduct code, design, and architectural reviews. Just as the other PODS will deploy new features to production, the search team will also own deployment of search. Overall, all of your teams should have objectives that align with a few key business success metrics.

Remember, whatever mechanisms you put in place, your shared service or tools team should be a gas pedal and not a break for TTM. Good luck scaling your architecture and your organization. We would love to hear about some of the experiences from those of you that have tried this or other approaches.


6 comments

Cloud Services

Recently a reader asked if we still thought the points in this 2008 article “The Cloud Isn’t For Everyone” were still valid. At that time we pointed out that there were 5 major concerns: Security, Non-portability, Control, Limitations (persistent storage, public IPs, etc), and Performance. We also pointed out these pros: Cost, Speed, and Flexibility. Our short response was “yes”, other than the limitations which have mostly been solved, these are still valid. However, this discussion got us interested in revisiting this topic.

We deal with a wide variety of companies from hospitals to ad tech to ecommerce companies, all of which have different levels of knowledge about cloud computing. In this post we’re going to first define different types of cloud computing and then discuss some of the concerns with each.

A common definition of cloud computing is the delivery of computing and storage capacity as a service. A distinct characteristic is that users are charged based on usage instead of a more conventional licensing or upfront purchase. There are three basic types of cloud computing:

  • Infrastructure as a Service (IaaS) – This is the most basic type of cloud service and offers servers (often as virtual machines), networking, and storage as services. Examples of this include Amazon’s Web Services and Rackspace.
  • Platform as a Service (PaaS) – This cloud offering provides not only the hardware but a layer above, providing platforms to run custom applications usually specific to certain programming languages. Examples of these include Microsoft’s Azure and Google’s App Engine.
  • Software as a Servie (SaaS) – This service is an offering of a finished product hosted in a multi-tenant manner (many customers on a single implementation). Examples of this include Gmail, Sales Force, Service-Now, New Relic, and many more.

In a 2012 survey with 785 respondents by North Bridge Venture Partners, we see trends indicating that companies are much more comfortable with cloud computing offerings. Down from 11% last year, a very low 3% of respondents consider cloud services to be too risky. Only 12% say the cloud platform is too immature and 50% of the survey respondents now say they have “complete confidence” in the cloud. Eighty-four percent of all net new software will be SaaS-based. The take away is that cloud computing is simply becoming the way we do things.

SaaS
The demand for SaaS offerings is growing rapidly. Gartner predicts that SaaS will hit $14.5 billion in 2012, a 17.9% growth from the previous year, with growth continuing through 2015 when it will be $22.1 billion. This growth is being fueled by several factors including new software design and delivery modules allowing for more instances of an application to run simultaneously, bandwidth costs continuing to drop, and customer frustration over the cycle of purchasing, paying for maintenance, and going through time consuming upgrades. According to CIOZone.com, a list of the top 60 fastest-growing public software companies in 2007 was dominated by companies switching from a proprietary license model to a subscription model.

When you consider implementing / purchasing a SaaS solution there are a number of questions that you should ask.

  • What if the service is unavailable? Our often repeated mantra is that “Everything Fails.” If you’re in the game long enough you’ve seen it all fail – servers, network, storage devices, ISPs, and even entire datacenters. If the service fails, how does this impact your business or your own services?
  • Where is your data secured and backed up? If you’re storing sensitive data – corporate email, customer names, or PII – it is important that you understand how your data is being handled and protected.
  • What is the cost? Besides the monthly subscription or usage cost, you should investigate the total cost including startup, transfer, and periodic data migration.
  • What level of access? It’s your data but that doesn’t mean you’ll have unlimited access to it. Consider how you retrieve the data today and most importantly how you might want to access it in the future.
  • Does it comply with industry regulations? If your business has regulatory requirements such as SOX, PCI, PA-DSS, etc you should investigate whether the SaaS provider supports these.

PaaS
The demand for PaaS, at least among our clients, is not that great. As Barb Darrow on GigaOm stated “For die-hard .Net heads, Azure is probably the PaaS of choice. But for the army of new-age web developers, it’s an also-ran.” In fact, there are early indications that PaaS providers are starting to offer services that are moving them more into the IaaS market. If you’re interested in PaaS your choices are very limited based on the technology stack that you are using. The real question to ask for PaaS is whether you should go directly to an IaaS provider or if you gain enough benefits from the PaaS provider to makeup for the additional cost.

IaaS
The spend on IaaS cloud computing is expected to grow 48.7 percent in 2012 to $5.0 billion, up from $3.4 billion in 2011. Some of our clients have 100% of their services hosted on an IaaS provider while others are completely in a collocation facility or datacenter. We are seeing more clients move towards a hybrid model where they make use of collocation for the majority of their hardware but burst demand to the cloud when needed. This burst might be triggered by an unusually high load of user traffic or from nightly batch jobs to process log files or during QA of the iteration. Wherever you are on the spectrum of IaaS utilization here are some concerns that you get comfortable with before diving in.

  • Security – Many IaaS providers are becoming PCI DSS, ISO 27001, and HIPAA compliant. However, simply using their service doesn’t provide you with the compliance and you need to be responsible for your own auditing. Passing audits is not cut and dry but rather a negotiation with the auditors and therefore you need to be able to clearly articulate how you are following the standards or guidelines while utilizing cloud computing.
  • Cost – While the cost of IaaS is decreasing, most companies still find that if they run the servers (virtual machines) 24 hrs / day the break even is around 18 months. This is a simple spreadsheet analysis that should be run to determine the most cost effective solution.
  • Inconsistent I/O – There is a huge amount of variability with some IaaS storage. Some of our clients run Bonnie to test I/O prior to using the instance and then periodically to ensure it hasn’t dropped drastically. You need to make sure your application can handle this variability or create work-arounds to handle this issue.
  • High failure rate – The virtual machines of most IaaS seem to be less reliable than bare metal. Netflix who moved to 100% IaaS came up with Chaos Monkey and Simian Army to address this issue of less reliability.

Conclusion
Whether you are in the market for IaaS, PaaS, or SaaS cloud offerings there are a variety of things to consider. In general the concerns revolve around Security, Non-portability, Control, and Performance while the benefits include Cost, Speed of Deployment, and Flexibility.


11 comments

Active-Passive and Spare Tires

If a company has already established a disaster recovery plan that involves failover to a cold or passive datacenter it is often hard to convince them to switch to a solution that involves taking traffic at both datacenters. We call this failover type of architecture active-passive and active-active when you accept traffic at both datacenters. If you too believe active-passive architecture is a satisfactory solution for disaster recovery consider this analogy.

Flat Tire

Most cars have four tires and a spare tire in the trunk. However, you might have noticed that semi-trucks have pairs of tires on both the tractor and trailer, except for the front wheels that turn. An active-passive architecture is like a car and the active-active solution is like the semi-trucks. Here are three comparisons that might help sell you on an active-active solution.

1) Even with the best of intentions active-passive solutions don’t get tested regularly. How often do you practice changing or even check your spare tire? If you don’t check your spare tire periodically for the correct air pressure it might be flat when you need it most. Passive datacenters are the same way. If you don’t rollout code to it every release and occasionally take actual traffic, it probably won’t work when you need it.

2) Active-active solutions are is much faster to take over traffic when there is a disaster. If your racing down the road which is faster when you get a flat, stop and replace the flat tire with the spare or keep riding on the extra tire? Even if you use a DNS solution like UltraDNS that can failover quickly, you’ll likely need to warm up cache, apply the last round of data logs, etc. before you can take traffic safely in a passive datacenter.

3) Active-active solutions make better use of the investment in equipment than active-passive solutions. The spare tire in your trunk might get used once every year, if you’re unlucky. The second tire on the semi gets used every day helping carry a greater load and reducing the wear on the other tire.

While active-passive is better than not having a disaster recovery plan it’s not the best that you can do. Consider getting to an active-active solution that exercises your DR solution every day and makes use of all that investment.


4 comments

Alternative Solutions to Old Problems

Are you like @devops_borat and not a fan of DevOps? Or, maybe you think deploying dozens of time each day to production is ludicrous. I’m actually a fan of both DevOps and continuous deployment but if you’re not don’t worry these are just new solutions to old problems and there are alternatives.

devops_borat

The Problems
As long as people have been divided into separate organizations there has existed strife and competition between the teams. In the technology field this is no place more apparent than between development and operations. In at least 50+% of the companies that we meet with they have problems getting these teams to work together. If you’ve been around for a few years you’ve surely heard one team pointing to the other as the problem, whether that problem is an outage or slow product development.

A solution to this problem is DevOps. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.”

Another common tech problem is that large changes are risky. It is called “Big Bang” for a reason…things go bang! If you’ve been part of an ERP implementation that took months if not years to prepare for you know how risky these large changes are.

A solution to this problem is to make small changes more frequently. According to Eric Ries, co-founder and former CTO of IMVU, continuous deployment is a method of improving software quality due to the discipline, automation, and rigorous standards that are required in order to accomplish continuous deployment.

Alternative Solutions
Admittedly, DevOps and continuous deployment are somewhat extreme for some teams. For those or for teams that just don’t believe that these are the solutions, don’t fret there are alternatives.

JAD/ ARB – For improving the coordination between development and operations, we’ve recommend the JAD and ARB processes. These are very lightweight processes that force the teams to work together for better architected and better supported solutions.

Progressive Rollout – For reducing risk by making smaller changes, we recommend progressive rollout. This is a simple concept that involves first pushing code to a very small set of servers, monitoring for issues, and then progressively increasing the percentage of servers that receive the new code. The time between rollouts can be 30 min to 24 hours depending on how quickly you are likely to detect problems. We often suggestion the percentage of servers in the progressive rollout to be 1%, 5%, 20%, 50%, 100%.

The bottom line is something technologists know – there are almost always multiple ways to solve a problem. If you don’t like the current or new solution look for an alternative.


Cascading Failures

I was chatting with Nanda Kishore (@nkishore) the ShareThis CTO about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, because they have properly architectured their system, these regional outages didn’t affect ShareThis services at all. Of course kudos to Nanda and his team for their design and implementation but more interesting was our discussion about this being a cascading failure in which one small problem cascades into a much bigger problem. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In Sep 2010 they had a several hour outage for many users caused by an invalid configuration value in their cahcing tier. This caused every client that saw the value to attempt to fix it, which involved a query to the database. The DBs were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how in complex systems, small problems can cascade into large incidents. Of course there has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists and should is a framework to prevent them. As Chapter 9 in Scalability Rules states the most common scalability related failure is not designing to scale and the second most common is not designing to fail. Everything fails, plan for it! Of course utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines if the request should be executed or if too many other components are doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle/rate limit them much like we do in an API. This way a small problem that causes a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.


Federated Cloud

In an interesting paper in the IBM Journal of Research and Development, the concept of a federated cloud model is introduced. This model is one in which computing infrastructure providers can join together to create a federated cloud. The advantages pointed out in the article include cost savings due to not over provisioning for spikes in capacity demand. To me the biggest advantage of this federated model is the lack of reliance on a single vendor and likely higher availability due to greater distribution of computing resources across different infrastructure. One of our primary aversions to a complete cloud hosting solution is the reliance on a single vendor for the entire availability of your site. A true federated cloud would eliminate this issue.

However, as the article aptly points out there are many obstacles in the way of achieving such a federated cloud. Not the least of which are technical challenges to architect applications in such a modular manner as to be able to start and stop components in different clouds as demand requires. Other issues include administrative control and monitoring of multiple clouds and security concerns over allowing direct access to hypervisors by other cloud providers.

As we’ve prognosticated, pure VM based clouds like AWS have had to offer dedicated servers for those high intensity IO systems like large relational databases. We’ve also predicted that with double digit growth in cloud services predicted for the next several years, providers will resist the commoditization of their offerings through service differentiation. This attempt at differentiation will come in the form of add-on features and simplification across the entire PDLC. This unfortunately makes the likelihood of a federated cloud offering happening in the next couple of years very unlikely.


2 comments

Availability as a Feature

It doesn’t matter if you run a commerce site, a services product (such as a SaaS offering) or simply use your homepage to distribute information:  The table stakes for playing online is high availability.  So many companies just take for granted that they will be highly available because they have multiple instances of systems and multiple copies of their data.  This assumption of availability will likely, at the very least, cost you significant pain and in the extreme cost you either significant market share or close your doors as a business.  Customers expect the unachievable – 100% availability.  At the very least you need to give them something close to that.  What will happen to you if you have a data center failure?  How about if a DBA accidentally drops a critical table in your production database?  What will you do when that marketing campaign triggers a near overnight doubling of traffic?  What happens when that new feature has a significant performance bug and gets adopted so quickly that it brings your entire site to its knees?

We often tell our clients that they should treat high availability as a feature.  Unfortunately, it is a somewhat expensive feature that requires constant investment overtime to achieve and maintain. It is a must have feature that will only differentiate your firm if you have competitors who do not make similar investments; when competition exists, customers are more likely to leave a site for a competitor due to availability and performance issues than nearly any other reason.  If you don’t believe us on this topic, just go ask the folks at Friendster.

Treating availability as a feature means measuring availability from a customer perspective rather than a systems or device perspective.  How many times did customer requests not complete?  In this regard, availability now becomes a percentage of failed transactions against an expected number of transactions.   We define an approach to accomplish this in our first book “The Art of Scalability”.  Every executive in the company should “own” the availability metric and understand its implication to the business.    You should track how much you invest in availability over time and significant decreases in engineering or capital should be questioned as it may be an early indicator that you are under investing and a harbinger of hard times to come.

One of the most common failures we see is to assume that disaster recovery is something that only big companies need.  Make no mistake about it, disasters do happen and given enough time they will happen to you.  Data centers catch on fire, have water (sprinkler) discharges that ruin equipment, have complete power equipment failures that take hours to fix and are prone to damages from vehicles, earthquakes, employees and tornados.  In our past lives as executives and current roles as advisors we’ve seen no less than 4 data center fires, 2 data centers incapacitated from earthquakes and tornados and one data center leveled by a truck running into it.  You are never too young to invest in DR.

And DR need not break your bank.  The dials of RTO (recovery time objective) and RPO (recovery point objective) allow you to determine how much you will invest.  Perhaps you simply replicate your databases to a smaller set of databases at a remote datacenter and have a copy of each of your systems there with an additional copy ready “in the cloud”.  While you won’t be able to run production from that data center, you may be able to leverage the cloud to add capacity for relatively low cost by cloning the cloud based systems.  Such a solution has a fast recovery point objective (you lose very little data) and a moderate recovery time objective (several hours) for very low comparative cost.  Of course, you would need to test the solution from time to time to show that it is viable, but it’s a cheap and effective insurance policy for the business.

So remember – availability is your most important feature.  Customers expect it always and will run away from you to competitors if you do not have it.  Create an availability metric and ensure that everyone understands it as a critical KPI.  Evaluate the company spend against availability quarterly or annually as an additional indicator of potential problems.   Assume that disasters happen and have a DR plan regardless of your company size.