Scalability at the Cost of Availability
Do you associate scalability with availability? Sometimes they go hand-in-hand, but sometimes they are at odds with each other. We’re obviously big proponents of architecting your systems so that you have the necessary scalability when you need it, but we’re also realistic. We often help young companies make tradeoffs between capital expenditure and scalability. It’s not uncommon for us to spend a good deal of time explaining the concepts of Design-Implement-Deploy and Recency-Frequency-Monetization to help with this discussion.
One subtle concept that is sometimes misunderstood is that, if you are not careful, an increase in scalability can actually decrease your availability. To understand how this can happen, we need to talk about the multiplicative effect of failure with items in series. Let’s take for example a system with a single web server that has 99.9% availability. (Forget about network gear for now, but it has the same effect.) The availability of the system is 99.9%. Now we add a database, also with 99.9% availability, and assume that the DB is required for the web server to respond, i.e., pages are built by querying the DB. This causes the availability of the system to go down to 99.9% x 99.9% = 99.8%. The reason is that at 99.9% availability each component is going to be down for ~43 min per month. The chance that the database experiences its 43 min of downtime at the same time as the web server is down is very small. Much more likely is that you experience about 86 min of downtime each month, half caused by the DB and half by the web server.
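The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `series_availability` and `downtime_minutes` are ours, and we assume a 30-day month and independent failures.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month, 43,200 minutes


def series_availability(*components: float) -> float:
    """Availability of components in series: every one must be up, so multiply."""
    return prod(components)


def downtime_minutes(availability: float, minutes: int = MINUTES_PER_MONTH) -> float:
    """Expected downtime over the period for a given availability fraction."""
    return (1.0 - availability) * minutes


web = 0.999  # web server availability from the example
db = 0.999   # database availability from the example

system = series_availability(web, db)
print(f"web only: {downtime_minutes(web):.1f} min/month")
print(f"web + db: {system:.4%} available, {downtime_minutes(system):.1f} min/month")
```

Run as written, this reports about 43 min per month for the web server alone and about 86 min per month once the database sits in series behind it.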
Back to scaling causing problems with availability. Let’s take the same example, a single web server and a single DB server, both with 99.9% availability. If our database is starting to get busy and we decide to split it, most likely we’d start by adding a read slave (X-Axis split), where the write queries (insert, update, delete) go to the master and the reads (select) go to the slave. To accomplish this we need to introduce another piece of hardware and replicate the database. If the web pages in our system require both read and write queries to the DB, then the web server, the master, and the slave are all in series, and availability drops to 99.9% x 99.9% x 99.9% = 99.7%, or roughly 130 min of downtime per month. We’ve just decreased the overall system availability by increasing its scalability. This is a very simplistic example and makes a lot of assumptions, but hopefully it gets the point across that you can actually decrease your availability by increasing your scalability.
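To make the before-and-after comparison concrete, here is a small Python sketch. It assumes every component has 99.9% availability, that failures are independent, that every page needs both the master and the read slave, and that a month has 30 days. The component names are purely illustrative.

```python
from math import prod

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month

# Assumed per-component availability, matching the example in the text.
before = {"web": 0.999, "db": 0.999}
after = {"web": 0.999, "db_master": 0.999, "db_read_slave": 0.999}

for label, components in (("before split", before), ("after split", after)):
    # Every component is required to serve a page, so they are in series.
    availability = prod(components.values())
    downtime = (1 - availability) * MINUTES_PER_MONTH
    print(f"{label}: {availability:.4%} available, ~{downtime:.0f} min down per month")
```

The extra server buys read capacity, but under these assumptions it also adds a third link to the chain, moving expected downtime from about 86 min to about 129 min per month.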
So why make this tradeoff? In most cases the availability of our hardware is much higher than three-nines, so the addition of a small amount of downtime is worth the gain in scalability. Also, by using swim lanes we can mitigate this by confining each failure to a portion of our users: with our first swim lane split, an outage in one lane takes down only about half of our users, effectively cutting the impact of that downtime in half.
All of this reminds us that scalability is much more of an art than a science, hence the name of our first book, The Art of Scalability. But don’t despair: there are definite rules that govern how to scale effectively, such as the X-, Y-, and Z-Axis splits, which is why we’re calling this book Scalability Rules. You just need to use art in applying them. As an analogy, think about an artist painting. Mixing red with blue will always result in purple, a rule, but how the artist applies that color to the canvas is pure art.