AKF Partners

Abbott, Keeven & Fisher PartnersPartners In Hyper Growth

Google’s Megastore

Papers from the 2011 Conference on Innovative Data Systems Research (CIDR) have been posted and one that is particularly interesting is the Google paper detailing their design and development of Megastore. Megastore is a storage system developed to meet the requirements of today’s interactive online services. According to the paper “Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.” The system’s underlying datastore is Google’s Bigtable but it additionally provides for serializable ACID semantics and fine-grained partitions of data.

Here is the link to the paper for you to read all the details yourself but I thought I’d point out a couple things that I found interesting about the design.

Data Split
The Megastore design is what AKF would call a Z-axis split of the data. Google does this because partitioning allows for the synchronous replication of each write across a wide area network (between datacenters) with reasonable latency. The key being the smaller the amount of data the faster the replication. The paper states “…data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable.”

Joins in Code
While most of our clients are likely to never require extreme scaling on a relational database but if you’re one of the lucky ones, the way to do so is to minimize the use of relational features. This would include things like joins. While joining in the DB is terribly efficient from a coding perspective, by joining in the code you remove load from the DB. You can scale by adding web servers much easier and cheaper than you can add relational databases. Google’s paper states that normalized relational schemas that rely on joins at query time were not the right model for Megastore because high-volume interactive workloads benefit more from predictable performance, reads dominate writes in most web applications so it pays to move work from read time to write time, and key-value stores make querying hierarchical data very simple.

Paxos and Two-Phase Commit
Google’s Megastore utilizes two algorithms that I personally thought would not scale at very large transaction volumes. The first is the Paxos algorithm, which is a way to reach consensus among a group of replicas on a single value. It allows up to F faults with 2F + 1 replicas by essentially voting among the replicas which is notoriously slow. The second algorithm is Two-Phase Commit which allows for atomic updates across entity groups. The paper does admit that these transactions have “much higher latency and increase the risk of contention.” That, in my opinion, is very understated but they do offer the discouragement of applications from using the feature in favor of queues.

I highly recommend you put this paper and some of the others from CIDR on your instapaper list for reading on your next flight or while bored during your next meeting.