AKF Partners

Abbott, Keeven & Fisher PartnersPartners In Hyper Growth

Tag » Scalability

Communicating Across Swim-lanes

We often emphasize the need to break apart complex applications into separate swim-laned services. Doing so not only improves architectural scalability but also permits the organization to scale and innovate faster as each separate team can operate almost as its own independent startup.

Ideally, there would be no need to pass information between these services and they could be easily stitched together on the client-side with a little HTML for a seamless user experience. However, in the real world, you’re likely to find that some message passing between services is needed (If only to save your users the trouble of not entering information twice).

Every cross-service call adds complexity to your application. Often times, teams will implement internal APIs when cross-service communication is needed. This is a good start and solves part of the problem by formalizing communication and discouraging unnecessary cross-service calls.

However, APIs alone don’t address the more important issue with cross-service communication — the reduction in availability. Anytime one service synchronously communicates with another you’ve effectively connected them in series and reduced overall availability. If Service-A calls Service-B synchronously, and Service-B crashes or slows to a crawl, Service-A will do the same. To avoid this trap, communication needs to take place asynchronously between services. This, in fact, is how we define “swim-laning”.

So what are the best means to facilitate asynchronous cross-service communication and what are the pro and cons of each method?

Client Side

Information that needs to be shared between services can be passed via JavaScript & JSON in the browser. This has the advantage of keeping the backend services completely separate and gives you the option to move business logic to the client side, thus reducing load on your transactional systems.

The downside, however, is the increased security risks. It doesn’t take an expert hacker to manipulate variables in JavaScript. (Just think, $price_var gets set to $0 somewhere between the shopping cart and checkout services). Furthermore, data is passed back to the server-side these cross-service calls will now suffer from the same latency and reliability issues as any TCP call over the internet.

Message Bus / Enterprise Service Bus

Message buses and enterprise service buses provide a ready means to transmit messages between services asynchronously in pub/sub model. Advantages include providing a rich set of features for tracking, manipulating, and delivering messages, as well as the ability centralized logging and monitoring. However, the potential for congestion as well as occasional message loss makes them less desirable in many cases than asynchronous point-to-point calls (discussed below).

To limit congestion, it’s best to implement several independent channels and limit message bus traffic to events that will have multiple subscribers.

Asynchronous Point-to-Point

Point-to-point communication (i.e. one service directly calling another) is another effective means of message passing. Advantages include simplicity, speed, and reliability. However, be sure to implement this asynchronously with a queuing mechanism with timeouts, retries (if needed), and exceptions handled in the event of service failure. This will prevent failures from propagating across service boundaries.

If properly implemented, asynchronous point-to-point communication is excellent for invoking a function in another service or transmitting a small amount of information. This method can be implemented for the majority of cross-service communication needs. However, for larger data transfers, you’ll need to consider one of the methods below.


ETL jobs can be used to move a large amount of data from one datastore to another. Since these are generally implemented as separate batch processes they won’t bring down the availability of your services. However, drawbacks include increased load on transactional databases (unless a read replica of the source DB is used) and the poor timeliness/consistency of data resulting from periodic batch process.

ETL processes are likely best reserved for transferring data from your OLTP to OLAP systems, where immediate consistency isn’t required. If you need both a large amount of data transferred and up to the second consistency consider a DB read replica.

DB Read-Replica

Most common databases (Oracle, MySQL, PostgreSQL) support native replication of DB clones with replication lag measured in milliseconds. By placing a read replica of one service’s DB into another service’s swim-lane, you can successfully transfer a large amount of data in near real-time. The downsides are increased IOPS requirements and a lack of abstraction between services that — in contrast to the abstraction provided by an API — fosters greater service interdependency.


In conclusion, you can see there are a variety of cross-service communication methods that permit fault-isolation between services. The trick is knowing the strengths and weaknesses of each and implementing the right method where appropriate.

Comments Off on Communicating Across Swim-lanes

Top 5 Posts

We love data and think most things should be data-driven, as is evident from our newest book that reveals how successful companies learn from customer misbehavior. This isn’t done by asking customers what they want but rather by watching them through monitoring, logs, and alerting. This concept is summarized nicely in this article we published last month. In that vein, we keep track of how people read and use this blog. In this post we thought we’d share some of the most popular posts. Here are the top 5 posts that people have been reading:

  1. The Future of IaaS and Paas
  2. JAD and ARB
  3. Dealing With Shared Services
  4. Be A Leader
  5. Fault Isolation

Not too surprising, our readers are interested in how to scale via clouds, better architectures, and improved processes. They are also interested in leadership, since without that scaling a great organization becomes impossible.

Let us know what your favorite post is. You can find a list of all our posts here.

Comments Off on Top 5 Posts

The End of Scalability?

If you received any sort of liberal arts education in the past twenty years you’ve probably read or at least had an assignment to read Francis Fukuyama’s 1989 essay “The End of History?”[1] If you haven’t read the article or the follow on book, Fukuyama argues that the advent of Western liberal democracy is the final form of human government and therefore it is the end point of humanity’s sociocultural evolution. He isn’t arguing that events will stop happening in the future but rather that democracy will become more and more prevalent in the long term, despite possible setbacks such as totalitarian governments for periods of time.

I have been involved, in some form or another, in scaling technology systems for nearly two decades, which does not take into account the decade before when I was hacking on Commodore PETs and Apple IIs learning how to program. Over that time period there have been noticeable trends such as the centralization/decentralization cycle within both technology and organizations. With regards to technology, think about the transitions from mainframe (centralized) to client/server (decentralized) to web 1.0 (centralized) to web 2.0/Ajax (decentralized) as an example of the cycle. The trend that has lately attracted my attention is about scaling. I’m proposing that we’re approaching the end of scalability.

As a scalability consultant who travels almost every week to clients in order to help them scale their systems, I don’t make this end of scalability statement lightly. However, before we jump into my reasoning, we first need to define scalability. To some scalability means that a system needs to scale infinitely no matter what the load over any period of time. While certainly ideal, the challenge with this definition is that it doesn’t take into account the underlying business needs. Investing too much in scalability before its necessary isn’t a wise investment for a business when there are other great projects in which to invest such as more customer facing features. A definition that takes this into account defines scalability as “the ability of a system to maintain the satisfaction of its quality goals to levels that are acceptable to its stakeholders when characteristics of the application domain and the system design vary over expected operational ranges.” [2:119]

The most difficult problem with scaling a system is typically the database or persistent data storage. AKF Partners teaches general database and applications scale theory in terms of a three-dimensional cube where the X-axis of the cube represents replication of identical code or data, the Y-axis represents a split by dissimilar functions or services, and the Z-axis represents a split across similar transactions or data.[3] Having taught this scaling technique and seen it implemented in hundreds of systems, we know that by combining all three axes a system can scale infinitely. However, the cost of this scalability is increased complexity for development, deployment, and management of the system. But is this really the only option?

The NoSQL and NewSQL movement has produced a host of new persistent storage solutions that attempt to solve the scalability challenges without increased complexity. Solutions such as MongoDB, a self-proclaimed “scalable, high-performance, open source NoSQL database”, attempt to solve scaling by combining replica data sets (X-axis splits) with sharded clusters (Y & Z-axis splits) to provide high levels of redundancy for large data sets transparently for applications. Undoubtedly, these technologies have advanced many systems scalability and reduced the complexity of requiring developers to address replica sets and sharding.

But the problem is that hosting MongoDB or any other persistent storage solution requires keeping the hardware capacity on hand for any expected increase in traffic. The obvious solution to this is to host it in the cloud, where we can utilize someone else’s hardware capacity to satisfy our demand. Unless you are utilizing a hybrid-cloud with physical hardware you are not getting direct attached storage. The problem with this is that I/O in the cloud is very unpredictable, primarily because it requires traversing the network of the cloud provider. Enter Solid-State Drives (SSD).

Chris Lalonde, CEO of ObjectRocket a MongoDB cloud provider hosted entirely on SSDs, states that “Developers have been thinking that they need to keep their data set size the same size as memory because of poor I/O in the cloud and prior to 2.2.x MongoDB had a single lock, both making it unfeasible to go to disk in systems that require high performance. With SSDs the I/O performance gains are so large that it effectively negates this and people need to re-examine how their apps/platforms are architected.”

Lots of forward thinking technology organizations are moving towards SSDs. Facebook’s appetite for solid-state storage has made it the largest customer for Fusion-io, putting NAND Flash memory products in its new data centers in Oregon and North Carolina. Lalonde says “When I/O becomes cheap and fast it drastically changes how developers think about architecting their application e.g. a flat file might be just fine for storing some data types vs. the heavy over head of any kind of structured data.” ObjectRocket’s service offering also provides some other nice features such as “instant sharding” where through the click of a button provides an entire 3-node shard on demand.

Besides the advances being made in leveraging NoSQL and SSDs to allow applications to scale using Infrastructure as a Service (IaaS), there are advances in Platform as a Service (PaaS) offerings such as Google App Engine (GAE) that are also helping systems scale with little to no burden on developers. GAE allows applications to take advantage of scalable technologies like BigTable that Google applications use, allowing them to claim that, “Automatic scaling is built in with App Engine, all you have to do is write your application code and we’ll do the rest. No matter how many users you have or how much data your application stores, App Engine can scale to meet your needs.” While GAE doesn’t have customers as large as Netflix who run exclusively on Amazon’s Web Services (AWS) their customers do include companies like the Khan Academy, which has over 3.8 million monthly unique visitors to their growing collection of over 2,000 videos.

So with solutions like ObjectRocket, GAE, and the myriad of others that make it easier to scale to significant numbers of users and customers without having to worry about data set replication (X-axis splits) or sharding (Y & Z-axis splits), are we at the end of scalability? If we’re not there yet we soon will be. “But hold on” you say, “our systems are producing more and consuming more data…much more.” No doubt the amount of data that we process is rapidly expanding. In fact, according to the EMC sponsored IDC study in 2009, the amount of digital data in the world was almost 500 exabytes and doubling every 18 months. But when we combine the benefits we achieve from such advances as transistor density on circuits (Moore’s Law), NoSQL technologies, cheaper and faster storage (e.g. SSD), IaaS, and PaaS offerings, we are likely to see the end of the need for most applications developers to care about manually scaling their applications themselves. This will at some point in the future all be done for them in “the cloud”.

So What?
Where does this leave us as experts in scalability? Do we close up shop and go home? Fortunately, no, there are still reasons that application developers or technologists need to be concerned with splitting data replica sets and sharding data across nodes. Two of these reasons are 1) reduce risk and 2) improve developer efficiency.

Reducing Risk
As we’ve written about before, risk has several high-level components (probability of an incident, duration, and % of customers impacted). Google “GAE outages” or “AWS outages” or any other IaaS or PaaS provider and the word “outage” and see what you find. All hosting providers that I’m aware of have had outages in their not-so-distant past. GAE had a major outage on October 26, 2012 for 4 hours. GAE proudly states at the bottom of their outage post “Since launching the High Replication Datastore in January 2011, App Engine has not experienced a widespread system outage.” Which sounds impressive until you do the math and realize that this one outage caused their availability to drop to 99.975% for the entire year and a half that the service has been available. Not to mention that they have much more frequent local outages or issues that affect some percentage of their customers. We have been at clients when they’ve experienced incidents caused by GAE.

The point here is not to call out GAE, trust me all other providers have the exact same issue. The point is that when you rely on a 3rd party for 100% of your availability you by definition have their availability as your ceiling. Now add on your availability issues because of maintenance, code releases, bugs in your code, etc. Why is this? Incidents are almost always have multiple root causes that include architecture, people, and process. Everything eventually fails including our people and processes.

Given that you cannot reduce the probability of an incident to 0%, no matter whether you run the datacenter or a 3rd party provider does, you must focus on the other risk factors (reduce the duration and reduce the % of customers impacted). The way you achieve this is by splitting services (Y-axis splits) and by separating customers (Z-axis splits). While leveraging AWS’s RDS or GAE’s HRD provides cross availability zone / datacenter redundancy, in order to have your application take advantage of these you still have to do the work to split it. And if you want even higher redundancy (across vendors) you definitely have to do the work to split applications or customers between IaaS or PaaS providers.

Improving Efficiency
Let’s say you’re happy with GAE’s 99.95% SLA which no doubt is pretty good especially when you don’t have to worry about scaling. But don’t throw away the AKF Scalability Cube just yet. One of the major reasons we get called in to clients is because their BOD or CEO aren’t happy with how the development team is delivering new products. They recall the days when there were 10 developers and features flew out of the door. Now that they have 100 developers, everything seems to take forever. The reason for this loss of efficiency is that with a monolithic code base (no splits for different services) all 100 developers trying to make changes and add functionality, they are stepping all over each other. There needs to be much more coordination, more communication, more integration, etc. By splitting the application in to separate services (Y-axis splits) with separate code bases the developers can split into independent teams or pods that makes them much more efficient.

We are likely to see continued improvement in IaaS and PaaS solutions that auto-scale and perform to such a degree that most applications will not need to worry about scaling because of user traffic. However, this does not obviate the need to consider scaling for greater availability / vendor independence or to improve a development teams efficiency. All great technologist, CTOs, or application developers will continue to care about scalability for these and other reasons.

1. Francis, F., The End of History? The National Interest, 1989. 16(4).
2. Duboc, L., E. Leiter, and D. Rosenblum, Systematic Elaboration of Scalability Requirements through Goal-Obstacle Analysis. 2012.
3. Abbott, M.L. and M.T. Fisher, The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise. 2009: Addison-Wesley Professional.

Comments Off on The End of Scalability?

Scalability Rules Videos and Reviews

Here are a few videos that we did to explain Scalability Rules…ignore the scary faces:

You can find more videos by following this link.

Here are a couple reviews of the book:
Code Ranch
LinkedIn Reviews

Comments Off on Scalability Rules Videos and Reviews

Scalability Rules Android App

Whether you love Scalability Rules or you haven’t gotten around to purchasing your copy, check out the new android application. The app has the what, when, why, and how for each of the 50 rules. Follow this link, scan this QR code, or search for “scalability” in the android market place.

Comments Off on Scalability Rules Android App

Multi-paradigm Data Storage Architectures

We often have clients ask about one or more of the NoSQL technologies as potential replacements for their RDBMS (typically MySQL) to simplify scaling. What I think makes much more sense with regard to these NoSQL and SQL storage systems is an AND instead of an OR discussion. Consider implementing a multi-paradigm data storage layer that provides the appropriate persistent storage system for the different types of data in your application. This approach is similar to our RFM approach to data storage. Consider questions such as how often do you need the data, how quickly do you need it, how relational is the data, etc. There are at least four benefits of this multi-paradigm approach: simpler scaling, improved application performance, easier application development, and reduced cost.

The AKF Scale Cube provides a straightforward way to scale any relational database through the three axes but we know that splitting data relationships once they’ve been established isn’t easy. It requires work and lots of coordination between teams. By limiting what gets stored relationally to only the minimum that is required means fewer splits along any axis. Many of the NoSQL technologies provide auto sharding and asynchronous replication. Re-indexing keys across another node is much simpler than migrating tables into another database.

While relational databases can have great performance, unless the table is pinned in memory or the query results are cached in memory, an in memory data store will always outperform SQL. In many applications we could make use of in memory solutions like Memcache or MongoDB to improve performance of retrieving high value data.

Application Development
As Debasish Ghosh states in his article Multiparadigm Data Storage for Enterprise Applications, storing data the way it is used in an application, simplifies programming and makes it easier to decentralize data processing. If the application treats the data as a document why break it apart to store it relationally when we have viable document storage engines. Storing the data in a more native format allows for quicker development.

For data that’s not needed often, cache it in other places (such as a CDN) or lazy load it from a low cost storage tier such as Amazon’s S3. This might work well for applications hosted in the cloud. The benefit of this a lower cost per byte stored, especially when considering all costs including administrators for the more complex data storage systems such as relational databases.

A final step in implementing a multi-paradigm data storage layer is an asynchronous message queue for data that needs to move up and down the stack. Implementing ActiveMQ or RabbitMQ to asynchronously move data from one layer to another as needed relieves the application of this burden. As an example consider an application that routes picking baskets for inventory in a warehouse. This is typically thought of as a graph with bins of inventory as nodes and the aisles as edges. For faster retrieval you could store this in a graph database such as Neo4J for ease of development and performance reasons. You could then asynchronously persist these maps in a MySQL database for reporting and older versions into an S3 bucket for historic archiving. This combination provides faster performance, easier development, simpler scaling, and reduced cost.

Comments Off on Multi-paradigm Data Storage Architectures

Scalability Rules – Released This Week

Our newest book, Scalability Rules, has just been released. Here are a few places you can purchase the book:

You can also help us get the word out about this book by liking and sharing the book’s Facebook page or the book’s official website, where we’ll keep up to date information about reviews and speaking engagements.

Scalability Rules brings together 50 rules that are grounded in experience garnered from over a hundred companies such as eBay, Intuit, PayPal, Etsy, Folica, and Salesforce. Put together and organized to be easily read and referenced for rapid application to nearly any technical environment. The rules are technology agnostic and have been applied to LAMP, .net, and even midrange system architectures.

We are very thankful for everyone’s help in making this project come together and here are just a few of those folks:

    Technical Reviewers – Robert Guild, Geoffrey Weber, and Jeremy Wright
    Pre-reviewers – Chad Dickerson, Chris Lalonde, Jonathan Heiliger, Jerome Labat, and Nanda Kishore.
    Senior Acquisitions Editor – Trina MacDonald
    Development Editor – Songlin Qiu
    Project Editor – Anne Goebel

We dedicated this book to to our friend and partner Tom Keeven who in our mind is the originator of many of these concepts and has helped countless companies in his nearly 30 years in the business.


Newsletter – Spring 2011

Below is part of our Fall 2010 Newsletter.  If you haven’t subscribed yet, click here to do so.

In this newsletter:

Scalability Rules

Scalability Rules: 50 Principles For Scaling Websites is available for presale. We are just a few short weeks away from the release date and are very excited about this project. This book is meant to serve as a primer, a refresher, and a lightweight reference manual to help engineers, architects, and managers develop and maintain scalable Internet products. It is laid out in a series of rules, each of them bundled thematically by different topics. Most of the rules are technically focused, while a smaller number of them address some critical mindset or process concern – each of which is absolutely critical to building scalable products.

It is available for preorder from these sites:

You can also help us get the word out about this book by liking and sharing the book’s Facebook page or the book’s official website, where we’ll keep up to date information about reviews and speaking engagements.

With the success of The Art of Scalability, we’ve been asked by a few folks, why write another book on scale? Our answer is that there simply aren’t many good books on scalability on the market yet, and Scalability Rules is unique in its approach in this sparse market.  Also, this is the first book to address the topic of scalability in a rules-oriented fashion. One of our most-commented-on blog posts is on the need for scalability to become a discipline. We and the community of technologists that tackle scalability problems believe that scalability architects are needed in today’s technology organizations. This book will help scalability architects, scalability evangelists, and the like to share their knowledge with others in scaling their systems.  See More…

Our first book The Art of Scalability is still available at these retailers:


Most Popular Posts

We know everyone is busy and often our RSS readers get filled with too many interesting articles to keep up with.  Here are summaries of a few of our posts and some by other authors that we particularly enjoyed.

Why A Technology Leader Should Code
The military teaches that a leader should be “technically and tactically” proficient. Military leaders owe it to their subordinates to understand the equipment that the unit employed and the basic combat tactics that would be followed. This concept is transferable to technology companies; the CTO owes it to their subordinates to understand the technology. They also owe it to the business to understand the economic aspects of the business and be able to straddle these two worlds. Additionally, periodically having to code a feature and deploy it will provide the engineering manager a better understanding and appreciation for what her engineers go through on a daily basis. Read more

What Is That Delay Costing?
Most technologists know that the slower the page the more likely the user will flee the page or the transaction flow and not make the purchase.  Research is teaching us that it may be less important to reduce actual delay rather than create a system where users will be less likely to attribute the delay to the site. An example that we sometimes see is to give the user the option of selecting a low or high graphic site in order to provide the users with the control. Users will likely perceive this as an active effort on the part of the SaaS provider to minimize download time and thus attribute delays to themselves, their computer, their ISP, etc but not the site. Read more

DevOps is an umbrella concept that refers to anything that smoothes out the interaction between development and operations and is a response to the growing awareness of the disconnect between development and operations. There is an emerging understanding of the interdependence of development and operations in meeting a business’ goals. While not a new concept, we’ve been living and suggesting ARB and JAD as cornerstones of this coordination for years, DevOps has recently grown into a discipline of its own. Read more

Google Megastore
Google provided a paper detailing their design and development of “Megastore.” This is a storage system developed to meet the requirements of today’s interactive online services and according to the paper it blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, providing strong consistency and high availability. The system’s underlying datastore is Google’s Bigtable but it additionally provides for serializable ACID semantics and fine-grained partitions of data. Read more

Scalability at the Cost of Availability
One subtle concept that is sometimes misunderstood is that if not careful an increase in scalability can actually decrease your availability. The reason for this is the multiplicative affect of failure with items in series.  If two pieces of hardware independently have 99.9% uptime, when combined into a single system that relies on both to respond to requests, the availability of the system to go down to 99.9% x 99.9% = 99.8%. Read more

8 Lessons We Can Learn From The MySpace Incident
Robert Scoble wrote a case study, MySpace’s death spiral: insiders say it’s due to bets on Los Angeles and Microsoft, in which he reports that MySpace insiders blame the Microsoft stack on why they lost to Facebook.  Some lessons can be gleaned from this including All computer companies are technology companies first and Enterprise Programming != Web Programming and Intranet != Intranet. Read more

Aztec Empire Strategy: Use Dual Pipes For High Availability
The Aztecs built the great aqueduct 600 years ago but even then thought about uninterrupted supply.  This post states that the purpose of the twin pipes was to keep water flowing during maintenance.  When one pipe got dirty, the water was diverted to the other pipe while the dirty pipe was cleaned. Read more


Research Update and Request for Help
Marty and Mike will both be presenting their research at the 2011 Academy of Management Conference. Marty’s research deals with tenure based conflct and Mike’s research is focused on social contagion (a.k.a. viral growth). You can read the abstracts and full text for both papers here.

We are continuing our research and could use your help. Please consider completing one or both surveys.

If you are an executive team member at a startup, please take this survey and pass it along to your colleagues within your company.

If you participate in any of the following social networks (Facebook, Friendster, LinkedIn, Twitter, MySpace, Ning, Orkut, or Yahoo!360), please take this survey and pass it along to your friends or colleagues.

Thanks for your support!

Comments Off on Newsletter – Spring 2011

Scalability at the Cost of Availability

Do you associate scalability with availability? Sometimes these go hand-in-hand but sometimes these are at odds with each other. We’re obviously big proponents of architecting your systems so that you have the necessary scalability when you need it but we’re also realistic. We often help young companies make tradeoffs between capital expenditure and scalability. It’s not uncommon for us to spend a good deal of time explaining the concepts of Design-Implement-Deploy and Recency-Frequency-Monetization to help with this discussion.

One subtle concept that is sometimes misunderstood is that if not careful an increase in scalability can actually decrease your availability. In order to understand how this can happen we need to talk about the multiplicative affect of failure with items in series. Let’s take for example a system with a single web server with 99.9% availability, forget about network gear for now but it has the same affect. The availability of the system is 99.9% If we now add a database, also with 99.9% availability, to the system. Assume that the DB is required for the web server to respond i.e. pages are built by querying the DB. This causes the availability of the system to go down to 99.9% x 99.9% = 99.8%. The reason is that with 99.9% availability the system is going to be down for ~43 min per month. The chance that the database experiences its 43 min of downtime at the same time as the web server is down is very small. Much more likely is that you experience 86 min of downtime each month, half caused by the DB and half by the web server.

Back to scaling causing problems with availability. Let’s take the same example, a single web server and a single DB server, both with 99.9% availability. If our database is starting to get busy and we decide to split it, most likely we’d start by adding a read slave (X-Axis split), where the write queries (insert, update, delete) go to the master and the reads (select) go to the slave. To accomplish this we need to introduce another piece of hardware and replicate the database. If the web pages in our system require both read and write queries to the DB, then we’ve just decreased the overall system availability by increasing its scalability. This is a very simplistic example and makes a lot of assumptions but hopefully it gets the point across that you can actually decrease your availability by increasing your scalability.

So why make this tradeoff? In most cases the availability of our hardware is much higher than three-nines so the addition of a small amount of downtime is worth the gain in scalability. Also, by using swim lanes we can mitigate this by splitting our downtime across parts of our users, effectively cutting downtime in half with our first swim lane split.

All of this reminds us that scalability is much more of an art than a science, hence the name of our first book The Art of Scalability. But don’t despair, there are definite rules that govern how to scale effectively, such as the X, Y, or Z Axis splits, and why we’re calling this book Scalability Rules. You just need to use art in applying them. As an analogy, think about an artist painting. Mixing red with blue will always result in purple, a rule, but how the artist applies that color to the canvas is pure art.


Outbrain’s CTO on Scalability

I had the opportunity to speak with Ori Lahav, Outbrain’s co-founder and CTO, about his experience scaling Outbrain to handle 1.5 billion page views in just three years.  Outbrain is a content recommendation service that provides for blogs and articles what Netflix does for videos or Amazon does for products.

Ori is oversees the R&D center for the company located in the heart of Israel’s technology center. Prior to founding Outbrain, Ori led the R&D groups at Shopping.com (acquired by eBay) in search and classification. Before joining the internet revolution Ori led the video streaming Server Group at Vsoft.

Below are my notes from the interview.  Here is the link to the audio version but the quality is fairly poor.

What is your background and how did you come to start Outbrain?
Outbrain was founded by Yaron Galai and myself (Navy friendship).  Before that I spent 3.5 years at shopping.com, which was acquired by eBay, leading software groups.  I liked the challenges of a start-up and joining forces with Yaron was a great opportunity.

How has Outbrain grown over the past couple of years?
We did some false starts with several directions but when we started with recommendations on content sites it started catching fire.  We started with self serve blogs, then professional blogs, small publishers, national publishers, and now we are international.  In 3 years we grew from 0 to over 1.5B page views, 3 production servers to over 150, 0 system administrators to 4, and from 3 developers to 15 today.

How has your architecture changed over this period?
Surprisingly… not much, it was first 2 application servers and a MySQL database.  Then we started adding more application servers and replicating the MySQL.  As the product grew, we added more and more components including memcached, Solr, TokyoTyrant, and Cassandra.  We recently added layer that warms the caches and is being notified by ActiveMQ.

On the backend, we have machines that are fetching content from the sites we are on.  We gather article text and images and save them to MogileFS where we have indexers that index them in Solr.
We started investing in reporting infrastructure and started with MySql but with the amount of data we have we shortly understood that Hadoop is the way to go – we now have a cluster of more than 10 nodes in our Hadoop cluster.

What is Outbrains view of scalability?
First – scalability is culture – if you think big – you will be big.  The regular rules of scalability apply here too:

  • No singles
  • Scale out instead of up
  • Replicate data to ease the load
  • Shard data to scale
  • Utilize commodity hardware

Specifically for Outbrain I would add:

  • open-source is crucial for scalability
  • be cost conscious
  • build many simple/small environments and not a single big one

How have you managed data centers to be so cost effective?
We simplify it!  We create many small datacenters and not a big one.  We have no upfront costs of network gear (stacking) and real-estate (Cages).  We use open-source for our network (load balancer and firewalls) and infrastructure (OS).

In order to ease the step function of cost, grow as you go. Our data center provider, Atlantic Metro, has helped with metered power instead of paying by the circuit.  This way we’re motivated to power down servers not needed.

What has been the most difficult scalability challenge?
Scaling the team…we are more of a patrol boat DNA not an aircraft Carrier DNA.  Technology challenges are fun – never too hard but the team and process are much more challenging.

We have also started using the Continuous Deployment process and this has been a great help. By empowering the team members to act and change the system as they need – you can grow to be an Aircraft carrier and still keep the maneuverability of a patrol boat.

What do you think of the NoSQL movement?
We are big fans.  We use Hadoop, Hive, Cassandra, TokyoTyrant, MySql, and others.  This helps us maintain our low serving cost and attract the type of talent we need on the team.
Outbrain is proud to be 100% open-source.  We use open-source from the office telephony system to all platforms and network infrastructure.

And… we are hiring top technology talent in Israel, so contact Outrbain if you are interested.

1 comment