AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Tag » Scalability Blogs and Reviews

Scalability Warning Signs

Is your system trying to tell you that you're going to have scalability problems? We like to think that we couldn't have predicted problems at 10x our last year's traffic but there are often warning signs that we can heed if we know what to look for.

Unless you’re one of the incredibly lucky sites where traffic spikes 100x overnight, scalability problems don’t sneak up on you. They give warning signs that, if you recognize them and react appropriately, allow you to stay ahead of the issues. However, we’re often so heads-down getting the next release out the door that we don’t notice the warning signs until they have become huge problems staring us in the face. Here are a few of the warnings we’ve heard teams talk about in the past couple of months that were clearly signs of problems on the horizon.

Not wanting to make changes – If you find yourself denying requests for changes to certain parts of your system, this might be a warning sign that you have scalability issues with that component. A completely scalable system has components that can fail without catastrophic impact to customers. If you’re avoiding changes to a component because of the risk of problems, that is a warning sign that you need to re-architect to eliminate or at least mitigate the risk.

Performance creep – If after each release you need to add hardware to a subsystem, or you accept a performance degradation in a service, you could have a scaling issue approaching quickly. Consistently increasing consumption of CPU or memory in a service with each release leads to an unsustainable situation. If today you’re comfortably sitting at 40% CPU utilization and you allow a modest 10% degradation in each release, compounding takes you past 100% at the tenth release (40% × 1.1^10 ≈ 104%), and you reach 100% after only six releases if each one adds 10 percentage points. In reality you won’t get close to that without significant issues.
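The performance-creep arithmetic can be checked with a short sketch. It assumes the 10% degradation is relative and compounds with each release; the function name and the default 100% ceiling are illustrative choices, not part of any tool.

```python
def releases_until_exceeded(start_pct, growth_per_release, ceiling_pct=100.0):
    """Count releases until utilization first exceeds ceiling_pct.

    Assumes each release multiplies utilization by (1 + growth_per_release).
    """
    if growth_per_release <= 0:
        raise ValueError("growth_per_release must be positive")
    utilization = start_pct
    releases = 0
    while utilization <= ceiling_pct:
        utilization *= 1 + growth_per_release
        releases += 1
    return releases


# Starting at 40% CPU with 10% compounding growth per release.
to_saturation = releases_until_exceeded(40.0, 0.10)         # ceiling of 100%
to_practical_limit = releases_until_exceeded(40.0, 0.10, 70.0)  # ceiling of 70%
```

Under that compounding assumption the 40% example crosses 100% at the tenth release, while a lower practical ceiling such as 70% (an arbitrary figure chosen here for illustration) is crossed at the sixth, which is why the trouble arrives well before the theoretical limit.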

Investigating larger hardware – If you’ve started asking your vendors or VAR about bigger hardware, you’re heading down the path of scalability problems. The cost of additional computational resources does not scale linearly; it grows closer to cubically or even exponentially. Purchasing more expensive hardware might seem like the economical way out when you compare the cost of the first hardware upgrade with developer time, but run the calculation out several iterations. When you get to a Sun Fire™ E25K Server with Oracle Database 10g at a $6M price tag, you might feel differently about the decision.

Asking vendors for advanced features – When you start exploring advanced options of your vendor’s software, you’re likely heading down the path of increased complexity, and this is a clear warning sign of scalability problems in your future. Besides potentially locking you into a vendor, which lowers your negotiating power, it puts the fate of your company in someone else’s hands, which would not let me sleep very well at night. See our post on using vendor features to scale for more information.

Watch out for these or similar warning signs that scalability problems are looming on the horizon. Dealing with the problems today while you have time to plan properly might not get you an award for being a firefighter but you’ll more likely deliver a quality product without costly interruption.



Using Vendor Features to Scale

If you’ve had the opportunity to participate in an engagement with us, you know that we tend to use the Socratic method of asking questions. It’s not uncommon for us to hear a company explain how they plan to utilize a particular vendor’s feature to scale. For databases this is often clustering. We believe relying on a vendor to scale violates several of our architectural and scalability principles. Let us explain a couple of our more significant concerns with this practice.

First and foremost, you should want the fate of your company, your team, and your career in your own hands. Do not look for vendors to relieve you of this burden. As a CTO, if the vendor you vetted and selected fails, causing downtime for your business, you are just as responsible as if you had written every line of code. And if the downtime is significant or frequent enough, you should expect to be relieved of your position. All code has bugs, even vendor-provided code, and personally I would rather have the source code to fix it than have to rely on a vendor to find the problem and provide me with a patch. This should not be taken to imply that you should do everything yourself, such as writing your own database. Use vendors for things that they can do better than you and that are not part of your core competency. See our post on build vs buy for the four questions to answer when making this decision.

With scalability, as with many other things in life, simple is better. The more complex you make your system to provide scalability the more you are likely to suffer from availability issues. More complex systems are more difficult and more costly to maintain. Clustering technologies are much more complex than straightforward log shipping for creating read replicas.

One of our beliefs, and it should be one of yours also, is that it is most cost effective to be vendor neutral. Locking yourself into a single vendor, whether hardware or software, gives them the upper hand in negotiations. If you don’t think the sales rep knows this, ask yourself why they are willing to throw in some of these features at such an initially steep discount. The reason is that they can make up for it next year when you have to renegotiate.

If these concerns resonate with you, check out our database scalability cube post for ideas on how to design a scaling strategy that you can own, is relatively simple, and is vendor agnostic.



Complacency Kills Scalability – A Review of Kotter's "A Sense of Urgency"

We recently read John Kotter’s “A Sense of Urgency”.  Professor Kotter, of the Harvard Business School, is often thought of as one of the premier authorities on change and has written books such as Leading Change and The Heart of Change.

The book is an easy read and we highly recommend it. In it Kotter argues that all companies need to have a sense of urgency to succeed in today’s world of continuous change. Without urgency, companies are doomed and, unfortunately, most companies do not act urgently. Instead, they act with the equally insidious enemies of urgency: complacency and false urgency.

Complacency has its roots in past success and is very pervasive. People feel confident and content that they know what they need to do. Change comes seldom, even while the needs of the business change rapidly. False urgency is equally pervasive and is often mistaken for a true sense of urgency. It springs from recent failures and problems, focusing on short-term results even in the face of long-term decline. Anxiousness, anger, and frustration coupled with frenetic activity netting little benefit are all characteristics of false urgency.

Urgency is rare and critically important. It springs from great leadership and the recognition that opportunities and hazards abound. The focus is on winning and on purging the company of all unnecessary activities. Whereas false urgency is deflating, urgency can be rejuvenating.

In our experience, there are few places where complacency is more to blame than in the failure to scale your product. You are successful and growing. Things are going well and you are profitable or well on your path. The press says great things about you and investors are flocking to your door. Complacency abounds. Why would you do anything other than what you were doing yesterday? Get ready for failure!

Then, when you fail, you look to fix just the current issues. False urgency sets in: people rush about creating spreadsheets and presentations, and meetings occur hourly. But where is the simultaneous focus on long-term needs? Who is focusing on turning the crisis into a future success? How are you ensuring that the processes you need are in place to keep you from having future failures? Whom do you have looking at all the other limitations within your architecture? Without the right focus, you will return to complacency and start a cycle of scale-related failures that will bring your company to its knees.

Your only answer is to set a real sense of urgency.   The strategy, as Kotter recommends it, is to win over the minds and the hearts of your team and company to the scalability initiatives.   Explain why scale is important in a way that speaks to their hearts; make it about the customer!

Tactically, Kotter recommends four steps:

1) Bring the outside in: Focus externally.  What are other companies doing to solve their scale problems?   Get expert help where you need it.

2) Behave with urgency.  We take this to mean “set the example”.  Discuss scale every day and ask scale related questions.

3) Find opportunity in crises. As we discuss in our soon-to-be-released book, a crisis is a chance for you to make your company better!

4) Deal with No-Nos.  These are the people who say “we’ve tried that before” or “that won’t work here”.   See our article entitled “Seed, Feed and Weed to Succeed”.   You can’t afford to have folks diluting your culture and sowing the seeds of complacency.



Scalability Best Practices

Here are a baker’s dozen of items that we feel are Best Practices for Scalability:

  1. Asynchronous – Use asynchronous communication when possible. Synchronous calls tie the availability of the two services together. If one has a failure or is slow the other one is affected.
  2. Swim Lanes – Create fault isolated “swim lanes” of hardware by customer segmentation. This prevents problems with one customer from causing issues across all customers. This also helps with diagnosis of issues and code roll outs.
  3. Cache – Make use of cache at multiple layers including object caches in front of databases (such as memcached), page or item caches for content (such as squid) and edge caches (such as Akamai).
  4. Monitoring – Understand your application’s performance from a customer’s perspective. Monitor outside of your network and have tests that simulate a real user’s experience. Also monitor the internal working of the application in terms of query and transaction execution count and timing.
  5. Replication – Replicate databases for recovery as well as to off load reads to multiple instances.
  6. Sharding – Split the application and databases by service and/or by customer using a modulus. While this requires slightly more complicated logic in the application, it allows for massive scaling.
  7. Use Few RDBMS Features – Use the OLTP database as a persistent storage device as much as possible. The more you rely on the features offered in most RDBMSs for your transactions, the greater the load you put on the hardest item in your system to scale. Move all business logic, such as stored procedures, out of the database and into the application. When significant scaling is required, perform joins in the application rather than in SQL.
  8. Slow Roll – Roll out new code versions slowly, to a small subset of your servers without bringing the entire site down. This requires that all code be backwards compatible because you will have two versions of code running in production during the roll out. This method allows you to find problems that your quality and L&P testing missed while having minimal impact on customers.
  9. Load & Performance Testing – Test the performance of the application version before it goes into production. This will not catch all the issues, which is why you need the ability to roll back, but it is very worthwhile.
  10. Capacity Planning / Scalability Summits – Know how much capacity you have on all tiers and services in your system. Use Scalability Summits to plan for the increased capacity demands.
  11. Rollback – Always have the ability to roll back a code release.
  12. Root Cause Analysis – Ensure you have a learning culture, evidenced by the use of Root Cause Analysis to find and fix the real cause of issues.
  13. Quality From The Beginning – Quality can’t be tested into a product, it must be designed in from the beginning.
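As a minimal sketch of the sharding practice above (item 6), the routing logic for splitting by customer with a modulus might look like the following. The shard names and function names are hypothetical and only illustrate the idea; a real system would map shard indexes to actual database hosts.

```python
# Hypothetical shard names; a real system would map these to database hosts.
SHARDS = ["customers_db_0", "customers_db_1", "customers_db_2", "customers_db_3"]


def shard_for_customer(customer_id, shard_count):
    """Map a numeric customer ID to a shard index using a modulus."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return customer_id % shard_count


def database_for_customer(customer_id):
    """Return the name of the database that holds this customer's rows."""
    return SHARDS[shard_for_customer(customer_id, len(SHARDS))]


# Every query for customer 10 is routed to the same database.
target = database_for_customer(10)
```

One trade-off of a plain modulus is that changing the number of shards remaps most customers to a different shard, so the shard count is worth choosing with future growth in mind.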


To Succeed Big, Think Small

I, like a lot of you, read lots of blogs and articles each week; some of my favorites include Joel on Software, Coding Horror, High Scalability, Seth Godin, and Tim Ferriss. I also read the IEEE publications that I receive just about cover to cover. I use a variety of methods to keep track of the ones that I like so that I can find them easily when I want to reference them. Two of my favorite tools for this are Evernote, where you can clip web pages with tags into your notebook, and ShareThis, where I can drop something in my sharebox and send it to someone at the same time.

I was having a discussion the other day about small services as business opportunities and recalled several threads that seemed to be tangentially related. I parsed through my clipped, starred, and shared items and found these pearls. Whether you are already part of a start-up or considering one here are some ideas that you might consider.

And since one of the technical editors of our book has pounded into my brain the demand to “explain how this relates to scalability”, let me explain. If you’ve read a few of our posts you know that we believe scalability is about more than technology. In fact, done correctly it is technology agnostic and depends greatly on people, process, and architecture. This starts even before the business is founded. The right business plan, one that balances people and investment with real revenue-earning products, is critical to scaling. If your costs are too high to service a given number of customers, you are losing money. You can argue that you’ll get efficiency of scale at some point, but you need to survive long enough for that to occur. Without further ado, on to our amalgamation of advice:

Seth Godin says the way to make money on the Internet is by connecting people with what they need. He gives examples that include the following as well as many more:

Connect advertisers to people who want to be advertised to.
Connect job hunters with jobs.
Connect information seekers with information.
Connect teams to each other.

Kevin Kelly says the solution for inventors who are up against the giant aggregators like Amazon is to find 1,000 “true fans”. If each one is willing to pay $100 each year for something that you invent or a service that you provide, you can make a great living. This might be four CDs that you produce each year. For organizations comprised of more than just a solo artist, Kelly states that an increase in fans is necessary but is linearly proportional to the increase in team size. He continues that, because of the network effect (Metcalfe’s Law), it is likely that the value of your fans increases proportionally to the square of the number of fans, which means the number of true fans does not have to double to support a duet.
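To make the arithmetic concrete, here is a rough sketch. The function names are illustrative, and the square-law calculation is only one reading of the passage (total value grows with the square of the fan count, so the fans needed grow with the square root of team size); it is not a formula Kelly gives.

```python
import math


def annual_income(fans, spend_per_fan):
    """Total yearly revenue if every fan spends the same amount."""
    return fans * spend_per_fan


def fans_needed_linear(team_size, fans_per_person=1000):
    """Fans required if the need scales linearly with team size."""
    return team_size * fans_per_person


def fans_needed_square_law(team_size, fans_per_person=1000):
    """Fans required if total value scales with the square of the fan count.

    Assumption: value is proportional to fans**2, so supporting team_size
    people needs fans_per_person * sqrt(team_size) fans, rounded up.
    """
    return math.ceil(fans_per_person * math.sqrt(team_size))


solo_income = annual_income(1000, 100)
duet_linear = fans_needed_linear(2)
duet_square = fans_needed_square_law(2)
```

Under these assumptions a solo artist with 1,000 fans at $100 each earns $100,000 a year, and a duet needs 2,000 fans on the linear view but only about 1,415 on the square-law view.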

Paul Buchheit, the 23rd employee at Google and creator of Gmail, says to stick with it: overnight success takes a long time. They started Gmail in 2001 and launched it in 2004, and 7 years later it is seen as a huge success with annual growth rates of 40%.

Matt from 37Signals advises keeping your day job and working on your start-up on the side. Even though he concedes that starting a business requires plenty of time and effort, quitting puts a shot clock on your idea. Coupled with Buchheit’s notion that success is a measure of endurance, not speed, this seems like sound advice when possible. For those already under the pressure of the shot clock, just remember that monetization is king and survival is a competitive advantage.

So far in our bucket of advice we have 1) connect people with what they need, 2) find 1,000 fans to support your dream, and 3) don’t jump in until you are able to stick with it for the long haul. What I can add to this is Think Small. Throw away the fifty page business plan that requires $25 million of investment to sustain a profitless company for seven years. Instead think of a single service that people or businesses need.

For Internet companies, there are dozens of services that they need as part of their product offering but only as a small part.  Therefore they either don’t have the expertise or they can’t dedicate the resources to build it well. An example of this is search. Lots of websites need search functionality but few are going to build it themselves when there is a site search tool built by experts available.

Services that come to mind and that either need to be done or need to be done better are contextual classification, yield optimization, micro-payments, recommendations, abstracted scaling, and application monitoring. Go give it a shot and think about scaling from day one of your business.



Scalability Summit

One of the processes that we often recommend to clients is known as a Scalability Summit. The purpose of this summit is to identify which component in your application is most likely to prevent you from scaling. This idea of fixing the next bottleneck, the next thing that is going to prevent you from scaling, is how YouTube.com scaled. You can see a presentation by Cuong Do at a Google Tech Talk. About three minutes into the video Cuong expresses his algorithm for “Handling Rapid Growth” as:

while (true)
{
    identify_and_fix_bottlenecks();
    drink();
    sleep();
    notice_new_bottleneck();
}

YouTube’s growth was so rapid that this cycle of identifying the next bottleneck and fixing it often took weeks or even days. For all other “normal” scaling issues, this bottleneck identification is usually performed on a quarterly basis. When done at this interval, we refer to the sessions as Scalability Summits. We recommend inviting a select group of individuals to participate and discuss what they believe to be the next set of issues the platform will experience. The participants should include people representing architecture, operations, and engineering.

When we run Scalability Summits we generally go through this exercise twice: once for the expected growth rate of the business and once again for the expected growth rate multiplied by 10. So if you plan on growing by 200,000 users over the next quarter, use that number first, then use 2,000,000 users, and identify which components would fail at those usage numbers.
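The two-pass exercise can be sketched in a few lines. Everything here is a hypothetical illustration: the component names, the capacity figures, the current user count, and the idea of expressing each component's capacity as a maximum number of users are all assumptions made for the example.

```python
def components_at_risk(capacities, current_users, expected_growth, multiplier=10):
    """Return the components that fail at expected growth and at multiplier x growth.

    capacities maps a component name to the maximum number of users it can
    serve (hypothetical figures supplied by the summit participants).
    """
    expected_total = current_users + expected_growth
    stress_total = current_users + expected_growth * multiplier
    fails_expected = sorted(n for n, cap in capacities.items() if cap < expected_total)
    fails_stress = sorted(n for n, cap in capacities.items() if cap < stress_total)
    return fails_expected, fails_stress


# Hypothetical capacities, in users, for three tiers.
capacities = {"web": 5_000_000, "database": 1_500_000, "search": 1_100_000}
at_expected, at_ten_x = components_at_risk(capacities, 1_000_000, 200_000)
```

With these made-up numbers, only search fails at the expected growth of 200,000 users, while both the database and search fail at the 2,000,000-user pass, which is the kind of early warning the second pass is meant to surface.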

Once these potential bottlenecks are identified they are prioritized by a return on investment analysis that takes into account factors such as how expensive it is to fix (in terms of both capital expenditure as well as personnel), the component’s Time To Break (how much growth it can sustain), and the severity in the event it does break.
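One way to turn those factors into an ordered list is sketched below. The scoring formula (severity divided by the product of fix cost and time to break) is purely an assumption for illustration, not a method prescribed here, and the field names, units, and sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bottleneck:
    name: str
    fix_cost: float           # capital plus personnel cost, in any consistent unit
    quarters_to_break: float  # Time To Break: how long growth can be sustained
    severity: int             # 1 (minor) to 5 (site down) if the component breaks


def priority_score(bottleneck):
    """Higher score means fix sooner (assumed formula, for illustration only)."""
    if bottleneck.fix_cost <= 0 or bottleneck.quarters_to_break <= 0:
        raise ValueError("fix_cost and quarters_to_break must be positive")
    return bottleneck.severity / (bottleneck.fix_cost * bottleneck.quarters_to_break)


def prioritize(bottlenecks):
    """Sort bottlenecks from highest to lowest priority score."""
    return sorted(bottlenecks, key=priority_score, reverse=True)


candidates = [
    Bottleneck("database", fix_cost=100, quarters_to_break=1, severity=5),
    Bottleneck("cache", fix_cost=50, quarters_to_break=4, severity=2),
    Bottleneck("search", fix_cost=20, quarters_to_break=2, severity=3),
]
ordered_names = [b.name for b in prioritize(candidates)]
```

In this example the cheap, fairly severe search fix ranks first, followed by the database and then the cache. A team would likely tune the weighting to its own cost and risk tolerance.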

The most important step comes after the Scalability Summit. A set amount of labor from each team must be set aside to focus on the scalability-related issues that come out of the summit. If a team spends several hours identifying bottlenecks that then get ignored, no one is going to participate again. As an organization you must take action on these findings, or 1) you will likely experience the very issues that hamper your growth and 2) participants will lose interest in the process.

