AKF Partners

Abbott, Keeven & Fisher Partners – Partners In Hyper Growth

Lazy Summertime

Now that summer has officially arrived, it’s time to talk about how we can justify being lazy. One of my favorite chapters in Scalability Rules is “Use Caching Aggressively.” I like it so much because it reminds me to be lazy. Yes, for all you slackers, here is your excuse to do as little work as possible.


Our first justification for being lazy comes under the category of “how to avoid work.” The best way to scale is to avoid the traffic in the first place. One way to avoid traffic is for users never to come to your site, but this isn’t very desirable. The preferred solution is to use the many layers of caching between your persistent storage (usually a relational database) and the users’ browsers. A few of the caches you can leverage are the O/S DNS cache, browser cache, CDNs, reverse proxies, and object caches.
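As a rough illustration of the last layer in that list, here is a minimal sketch of a read-through object cache in Python. Everything in it is an assumption made for the example: the `ReadThroughCache` class, the `load_user_from_database` stand-in, and the 30-second TTL are hypothetical and not taken from the book. A real deployment would more likely put a shared cache service in front of the database rather than an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-process object cache sitting in front of a slower data source."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical function that does the real work
        self._ttl = ttl_seconds    # how long an entry stays fresh
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.misses = 0            # number of times the loader was called

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: the data source is never touched
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


def load_user_from_database(user_id: str) -> dict:
    # Stand-in for an expensive relational query (assumption for this sketch).
    return {"id": user_id, "name": f"user-{user_id}"}


cache = ReadThroughCache(load_user_from_database, ttl_seconds=30.0)
first = cache.get("42")    # miss: calls the loader
second = cache.get("42")   # hit: served from memory
```

The second lookup never reaches the loader, which is the whole point: the work avoided is work that does not need to scale.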

If one reason wasn’t enough, here is another excuse to be lazy. The best way to avoid errors is to do as little work as possible. The less you do, the less you can screw up. To do as little as possible, you need to automate or script simple tasks. If you find yourself doing something repetitively, such as installing packages, resetting data, or making copies, consider these tasks for automation. Consider these few commands:

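# Copy the source partition block-for-block onto the new volume
dd bs=65536 if=/dev/sda1 of=/dev/sdd
# Check the copied filesystem for errors
fsck /dev/sdd
# Create a mount point and mount the new volume
mkdir /root/ebs-vol
mount /dev/sdd /root/ebs-vol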

It’s much easier and less prone to errors to kick off a shell script than to type all these commands over and over, day after day.

So, there you go, two reasons for you to remain lazy all summer and hopefully enjoy the warm weather.



Alternative Solutions to Old Problems

Are you like @devops_borat and not a fan of DevOps? Or maybe you think deploying to production dozens of times each day is ludicrous. I’m actually a fan of both DevOps and continuous deployment, but if you’re not, don’t worry: these are just new solutions to old problems, and there are alternatives.


The Problems
As long as people have been divided into separate organizations, there has been strife and competition between teams. In the technology field, nowhere is this more apparent than between development and operations. At least half of the companies we meet with have problems getting these teams to work together. If you’ve been around for a few years, you’ve surely heard one team point to the other as the problem, whether that problem is an outage or slow product development.

A solution to this problem is DevOps. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.”

Another common tech problem is that large changes are risky. It is called “Big Bang” for a reason…things go bang! If you’ve been part of an ERP implementation that took months, if not years, to prepare for, you know how risky these large changes are.

A solution to this problem is to make small changes more frequently. According to Eric Ries, co-founder and former CTO of IMVU, continuous deployment is a method of improving software quality due to the discipline, automation, and rigorous standards that are required in order to accomplish continuous deployment.

Alternative Solutions
Admittedly, DevOps and continuous deployment are somewhat extreme for some teams. For those teams, or for teams that just don’t believe these are the solutions, don’t fret: there are alternatives.

JAD/ARB – For improving the coordination between development and operations, we recommend the JAD (Joint Architecture Design) and ARB (Architecture Review Board) processes. These are very lightweight processes that force the teams to work together on better-architected and better-supported solutions.

Progressive Rollout – For reducing risk by making smaller changes, we recommend progressive rollout. This is a simple concept that involves first pushing code to a very small set of servers, monitoring for issues, and then progressively increasing the percentage of servers that receive the new code. The time between rollouts can be 30 minutes to 24 hours, depending on how quickly you are likely to detect problems. We often suggest rolling out to 1%, 5%, 20%, 50%, and then 100% of servers.
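To make the staging concrete, here is a minimal Python sketch of a progressive rollout. The function names and the `deploy`, `healthy`, and `wait` callbacks are hypothetical placeholders for whatever deployment tooling and monitoring you already have; only the 1%, 5%, 20%, 50%, 100% schedule comes from the text above.

```python
import math
from typing import Callable, List, Sequence

ROLLOUT_PERCENTAGES = (1, 5, 20, 50, 100)


def rollout_batches(servers: Sequence[str],
                    percentages: Sequence[int] = ROLLOUT_PERCENTAGES) -> List[List[str]]:
    """Split servers into successive batches so each stage reaches its target percentage."""
    batches: List[List[str]] = []
    deployed = 0
    for pct in percentages:
        # Always push to at least one server, never to more than exist.
        target = min(len(servers), max(1, math.ceil(len(servers) * pct / 100)))
        if target > deployed:
            batches.append(list(servers[deployed:target]))
            deployed = target
    return batches


def progressive_rollout(servers: Sequence[str],
                        deploy: Callable[[str], None],
                        healthy: Callable[[], bool],
                        wait: Callable[[], None]) -> bool:
    """Deploy stage by stage; stop as soon as monitoring reports a problem."""
    for batch in rollout_batches(servers):
        for server in batch:
            deploy(server)
        wait()                 # 30 minutes to 24 hours in practice
        if not healthy():
            return False       # halt here; the remaining servers keep the old code
    return True


servers = [f"web{n:03d}" for n in range(200)]
batch_sizes = [len(batch) for batch in rollout_batches(servers)]
```

With 200 servers, the stages cover 2, 10, 40, 100, and 200 servers cumulatively, so the batches contain 2, 8, 30, 60, and 100 servers. A bad release caught at the first stage touches only two machines.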

The bottom line is something technologists know: there are almost always multiple ways to solve a problem. If you don’t like the current or new solution, look for an alternative.



Multi-paradigm Data Storage Architectures

We often have clients ask about one or more of the NoSQL technologies as potential replacements for their RDBMS (typically MySQL) to simplify scaling. What I think makes much more sense with regard to these NoSQL and SQL storage systems is an AND instead of an OR discussion. Consider implementing a multi-paradigm data storage layer that provides the appropriate persistent storage system for the different types of data in your application. This approach is similar to our RFM approach to data storage. Consider questions such as how often do you need the data, how quickly do you need it, how relational is the data, etc. There are at least four benefits of this multi-paradigm approach: simpler scaling, improved application performance, easier application development, and reduced cost.

Scaling
The AKF Scale Cube provides a straightforward way to scale any relational database along its three axes, but we know that splitting data relationships once they’ve been established isn’t easy. It requires work and lots of coordination between teams. Limiting what gets stored relationally to the minimum required means fewer splits along any axis. Many of the NoSQL technologies provide auto-sharding and asynchronous replication. Re-indexing keys across another node is much simpler than migrating tables into another database.

Performance
While relational databases can have great performance, unless the table is pinned in memory or the query results are cached in memory, an in-memory data store will always outperform SQL. In many applications we could use in-memory solutions like Memcached or MongoDB to improve the performance of retrieving high-value data.

Application Development
As Debasish Ghosh states in his article Multiparadigm Data Storage for Enterprise Applications, storing data the way it is used in an application simplifies programming and makes it easier to decentralize data processing. If the application treats the data as a document, why break it apart to store it relationally when we have viable document storage engines? Storing the data in a more native format allows for quicker development.

Cost
For data that’s not needed often, cache it in other places (such as a CDN) or lazy load it from a low-cost storage tier such as Amazon’s S3. This might work well for applications hosted in the cloud. The benefit is a lower cost per byte stored, especially when you consider all costs, including administrators for the more complex data storage systems such as relational databases.

A final step in implementing a multi-paradigm data storage layer is an asynchronous message queue for data that needs to move up and down the stack. Implementing ActiveMQ or RabbitMQ to asynchronously move data from one layer to another as needed relieves the application of this burden. As an example, consider an application that routes picking baskets for inventory in a warehouse. This is typically thought of as a graph, with bins of inventory as nodes and the aisles as edges. For faster retrieval and easier development, you could store this in a graph database such as Neo4j. You could then asynchronously persist these maps in a MySQL database for reporting and move older versions into an S3 bucket for historic archiving. This combination provides faster performance, easier development, simpler scaling, and reduced cost.
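The shape of that arrangement can be sketched in a few lines of Python. This is only an illustration under loud assumptions: an in-process `queue.Queue` and worker thread stand in for ActiveMQ or RabbitMQ, a dictionary stands in for the graph database, and the `persist` callback stands in for whatever writes to MySQL or S3. The `WriteBehindStore` name is hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List

Route = Dict[str, List[str]]   # bin -> neighbouring bins (adjacency list)


class WriteBehindStore:
    """Serve reads from a fast store and persist to a slower tier asynchronously."""

    def __init__(self, persist: Callable[[str, Route], None]) -> None:
        self._fast: Dict[str, Route] = {}   # stand-in for the graph database
        self._persist = persist             # stand-in for the MySQL/S3 writer
        self._pending: queue.Queue = queue.Queue()
        self.errors: List[Exception] = []   # failed writes, kept for inspection
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, map_id: str, route: Route) -> None:
        self._fast[map_id] = route           # the application sees the write at once
        self._pending.put((map_id, route))   # the slower tier catches up later

    def get(self, map_id: str) -> Route:
        return self._fast[map_id]

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            if item is None:                 # sentinel: shut the worker down
                return
            try:
                self._persist(*item)
            except Exception as exc:         # keep draining even if one write fails
                self.errors.append(exc)

    def close(self) -> None:
        self._pending.put(None)
        self._worker.join()                  # returns once all queued writes ran


archive: Dict[str, Route] = {}
store = WriteBehindStore(lambda map_id, route: archive.update({map_id: route}))
store.put("warehouse-1", {"A1": ["A2"], "A2": ["A1", "B1"], "B1": ["A2"]})
store.close()
```

The application only ever calls `put` and `get`; moving data down the stack is the queue consumer’s job. A real message broker adds durability and delivery guarantees that this in-process sketch does not attempt.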



Cascading Failures

I was chatting with Nanda Kishore (@nkishore), the ShareThis CTO, about the recent problems Amazon had in one of their zones. Even though ShareThis is 100% in the cloud, because they have properly architected their system, these regional outages didn’t affect ShareThis services at all. Of course, kudos to Nanda and his team for their design and implementation, but more interesting was our discussion about this being a cascading failure, in which one small problem cascades into a much bigger problem. A few days later Amazon provided a bit of a postmortem confirming that a simple error during a network change started the problem. The incorrect traffic shift left the primary and secondary EBS nodes isolated, each thinking the other had failed. When they were reconnected, they rapidly searched for free space to re-mirror, which exhausted spare capacity and led to a “re-mirroring storm.”

As we were discussing the Amazon issue, I brought up another recent outage of a major service, Facebook. In September 2010 they had a several-hour outage for many users caused by an invalid configuration value in their caching tier. This caused every client that saw the value to attempt to fix it, which involved a query to the database. The DBs were quickly overwhelmed by hundreds of thousands of queries per second.

Both of these are prime examples of how, in complex systems, small problems can cascade into large incidents. Of course there has been a good deal of research on cascading failures, including models of the probability distributions of outages to predict their occurrence. What I don’t believe exists, and should, is a framework to prevent them. As Chapter 9 in Scalability Rules states, the most common scalability-related failure is not designing to scale, and the second most common is not designing to fail. Everything fails; plan for it! Of course, utilizing swim lanes or fault isolation zones will certainly minimize the impact of any of these issues, but there is a need for handling this at the application layer as well.

As an example, say we have a large number of components (storage devices, caching services, etc.) that have a failsafe plan such as refreshing the cache or re-mirroring the data. Before these actions are executed, the component should check in with an authority that determines whether the request should be executed or whether too many other components are doing similar tasks. Alternatively, a service could monitor for these requests over the network and throttle or rate limit them, much like we do in an API. This way a small problem that causes a huge cascade of reactions can be paused and handled in a controlled and more graceful manner.
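A minimal Python sketch of the “check in with an authority” idea might look like the following. It is an illustration only: `RecoveryAuthority`, `attempt_remirror`, and the limit of two concurrent recoveries are hypothetical, and in a distributed system the authority would be a network service rather than an in-process object.

```python
import threading
from typing import Callable


class RecoveryAuthority:
    """Grants permission for failsafe actions, capping how many run at once."""

    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._active = 0
        self._lock = threading.Lock()

    def request(self) -> bool:
        """Return True if the caller may start its recovery action now."""
        with self._lock:
            if self._active >= self._max:
                return False     # too many peers are already recovering
            self._active += 1
            return True

    def release(self) -> None:
        """Report that a recovery action has finished."""
        with self._lock:
            if self._active > 0:
                self._active -= 1


def attempt_remirror(node: str, authority: RecoveryAuthority,
                     remirror: Callable[[str], None]) -> bool:
    """Run the failsafe only with permission; otherwise back off and retry later."""
    if not authority.request():
        return False
    try:
        remirror(node)
        return True
    finally:
        authority.release()


authority = RecoveryAuthority(max_concurrent=2)
granted = [authority.request() for _ in range(5)]   # five nodes panic at once
```

Here only the first two of the five simultaneous requests are granted. The other three are told to wait, which turns a storm into a queue.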



Rules for Surviving an Amazon Outage

Because of recent issues with Amazon’s services there is a lot of interest in why some companies are able to keep their site up despite their IaaS or PaaS providers experiencing issues. Here is an InformIT article we wrote, outlining a few rules for surviving an Amazon or other cloud provider outage.



Don’t Interrupt the Doers

We get called in occasionally because a company’s leaders don’t feel that their product development is happening rapidly enough. They recall how fast the product evolved when the company was first started and they want that pace again. There are many reasons for the pace of development to have slowed. Certainly one of the more popular catch phrases that people use is technical debt, which is a metaphor to explain the eventual consequences of fast paced development. As you incur technical debt, your pace of development slows.

I think there is another factor that is equally or possibly more responsible for slowing the pace of development: interruptions. Engineers need large blocks of uninterrupted time to think, design, plan, code, and test. Disrupting an engineer during these tasks often requires a wholesale reset of their thought process. Many studies support this. One such study found that when tasks were interrupted, people required upwards of 27% more time to complete them, committed twice the number of errors, and experienced twice the increase in anxiety compared to uninterrupted tasks. And, as a recent CNN article explained, this problem of disruptions affecting our productivity gets worse as we get older.

So what is interrupting engineers? I’d wager it’s predominantly meetings. While communicating, coordinating, interviewing, etc. are all very important for engineers to participate in, doing so in a haphazard manner can be devastating to productivity. In this competitive hiring environment, interruptions might just be driving your engineers out the door. Try a few of these suggestions to reduce interruptions for engineers:

  • Have at least one day per week where meetings are not allowed
  • Only allow meetings at the beginning or end of the day
  • Require all meetings to have agendas and goals
  • Question standing meetings to ensure all participants are necessary

While measuring productivity is incredibly difficult, most organizations can feel when the pace of development has slowed. Reduce the interruptions of your engineers and see if this doesn’t help increase the pace again.



Scalability Rules – Released This Week

Our newest book, Scalability Rules, has just been released. Here are a few places you can purchase the book:

You can also help us get the word out about this book by liking and sharing the book’s Facebook page or the book’s official website, where we’ll keep up to date information about reviews and speaking engagements.

Scalability Rules brings together 50 rules that are grounded in experience garnered from over a hundred companies such as eBay, Intuit, PayPal, Etsy, Folica, and Salesforce. The rules are organized to be easily read and referenced for rapid application to nearly any technical environment. They are technology agnostic and have been applied to LAMP, .NET, and even midrange system architectures.

We are very thankful for everyone’s help in making this project come together and here are just a few of those folks:

    Technical Reviewers – Robert Guild, Geoffrey Weber, and Jeremy Wright
    Pre-reviewers – Chad Dickerson, Chris Lalonde, Jonathan Heiliger, Jerome Labat, and Nanda Kishore.
    Senior Acquisitions Editor – Trina MacDonald
    Development Editor – Songlin Qiu
    Project Editor – Anne Goebel

We dedicated this book to our friend and partner Tom Keeven, who in our minds is the originator of many of these concepts and has helped countless companies in his nearly 30 years in the business.



Federated Cloud

In an interesting paper in the IBM Journal of Research and Development, the concept of a federated cloud model is introduced. This model is one in which computing infrastructure providers can join together to create a federated cloud. The advantages pointed out in the article include cost savings due to not over provisioning for spikes in capacity demand. To me the biggest advantage of this federated model is the lack of reliance on a single vendor and likely higher availability due to greater distribution of computing resources across different infrastructure. One of our primary aversions to a complete cloud hosting solution is the reliance on a single vendor for the entire availability of your site. A true federated cloud would eliminate this issue.

However, as the article aptly points out, there are many obstacles in the way of achieving such a federated cloud. Not the least of these is the technical challenge of architecting applications in such a modular manner that components can be started and stopped in different clouds as demand requires. Other issues include administrative control and monitoring of multiple clouds and security concerns over allowing direct access to hypervisors by other cloud providers.

As we’ve prognosticated, pure VM-based clouds like AWS have had to offer dedicated servers for high-intensity IO systems like large relational databases. We’ve also predicted that, with double-digit growth in cloud services expected for the next several years, providers will resist the commoditization of their offerings through service differentiation. This attempt at differentiation will come in the form of add-on features and simplification across the entire PDLC. Unfortunately, this makes a federated cloud offering very unlikely in the next couple of years.



Newsletter – Spring 2011

Below is part of our Spring 2011 Newsletter. If you haven’t subscribed yet, click here to do so.

In this newsletter:

Scalability Rules

Scalability Rules: 50 Principles For Scaling Websites is available for presale. We are just a few short weeks away from the release date and are very excited about this project. This book is meant to serve as a primer, a refresher, and a lightweight reference manual to help engineers, architects, and managers develop and maintain scalable Internet products. It is laid out in a series of rules, each of them bundled thematically by different topics. Most of the rules are technically focused, while a smaller number of them address some critical mindset or process concern – each of which is absolutely critical to building scalable products.

It is available for preorder from these sites:

You can also help us get the word out about this book by liking and sharing the book’s Facebook page or the book’s official website, where we’ll keep up to date information about reviews and speaking engagements.

With the success of The Art of Scalability, we’ve been asked by a few folks, why write another book on scale? Our answer is that there simply aren’t many good books on scalability on the market yet, and Scalability Rules is unique in its approach in this sparse market.  Also, this is the first book to address the topic of scalability in a rules-oriented fashion. One of our most-commented-on blog posts is on the need for scalability to become a discipline. We and the community of technologists that tackle scalability problems believe that scalability architects are needed in today’s technology organizations. This book will help scalability architects, scalability evangelists, and the like to share their knowledge with others in scaling their systems.  See More…

Our first book The Art of Scalability is still available at these retailers:

 

Most Popular Posts

We know everyone is busy and often our RSS readers get filled with too many interesting articles to keep up with.  Here are summaries of a few of our posts and some by other authors that we particularly enjoyed.

Why A Technology Leader Should Code
The military teaches that a leader should be “technically and tactically” proficient. Military leaders owe it to their subordinates to understand the equipment that the unit employs and the basic combat tactics that it follows. This concept is transferable to technology companies; the CTO owes it to their subordinates to understand the technology. They also owe it to the business to understand the economic aspects of the business and be able to straddle these two worlds. Additionally, periodically having to code a feature and deploy it will give the engineering manager a better understanding of and appreciation for what her engineers go through on a daily basis. Read more

What Is That Delay Costing?
Most technologists know that the slower the page, the more likely the user will flee the page or the transaction flow and not make the purchase. Research is teaching us that it may be less important to reduce actual delay than to create a system in which users are less likely to attribute the delay to the site. An example that we sometimes see is giving users the option of selecting a low- or high-graphics site in order to put them in control. Users will likely perceive this as an active effort on the part of the SaaS provider to minimize download time and thus attribute delays to themselves, their computer, their ISP, etc., but not the site. Read more

DevOps
DevOps is an umbrella concept that refers to anything that smooths out the interaction between development and operations, and it is a response to the growing awareness of the disconnect between the two. There is an emerging understanding of the interdependence of development and operations in meeting a business’ goals. While the concept is not new (we’ve been living and suggesting ARB and JAD as cornerstones of this coordination for years), DevOps has recently grown into a discipline of its own. Read more

Google Megastore
Google provided a paper detailing their design and development of “Megastore.” This is a storage system developed to meet the requirements of today’s interactive online services and according to the paper it blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, providing strong consistency and high availability. The system’s underlying datastore is Google’s Bigtable but it additionally provides for serializable ACID semantics and fine-grained partitions of data. Read more

Scalability at the Cost of Availability
One subtle concept that is sometimes misunderstood is that, if you are not careful, an increase in scalability can actually decrease your availability. The reason for this is the multiplicative effect of failure with items in series. If two pieces of hardware independently have 99.9% uptime, then when they are combined into a single system that relies on both to respond to requests, the availability of the system goes down to 99.9% x 99.9% = 99.8%. Read more
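The arithmetic is easy to check with a short Python sketch. The function name `series_availability` is hypothetical, and the calculation assumes the components fail independently, as the example above does.

```python
import math
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """Availability of a system that needs every component in the chain to be up.

    Assumes failures are independent, so the probabilities simply multiply.
    """
    return math.prod(availabilities)


two_in_series = series_availability([0.999, 0.999])   # the example above
ten_in_series = series_availability([0.999] * 10)     # a longer request path
```

Two 99.9% components in series come out at 0.998001, or roughly 99.8%. Ten of them in series drop to about 99.0%, which is why each additional device a request must pass through deserves scrutiny.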

8 Lessons We Can Learn From The MySpace Incident
Robert Scoble wrote a case study, MySpace’s death spiral: insiders say it’s due to bets on Los Angeles and Microsoft, in which he reports that MySpace insiders blame the Microsoft stack for their loss to Facebook. Some lessons can be gleaned from this, including “All computer companies are technology companies first,” “Enterprise Programming != Web Programming,” and “Intranet != Internet.” Read more

Aztec Empire Strategy: Use Dual Pipes For High Availability
The Aztecs built the great aqueduct 600 years ago but even then thought about uninterrupted supply.  This post states that the purpose of the twin pipes was to keep water flowing during maintenance.  When one pipe got dirty, the water was diverted to the other pipe while the dirty pipe was cleaned. Read more

 

Research Update and Request for Help
Marty and Mike will both be presenting their research at the 2011 Academy of Management Conference. Marty’s research deals with tenure-based conflict, and Mike’s research is focused on social contagion (a.k.a. viral growth). You can read the abstracts and full text for both papers here.

We are continuing our research and could use your help. Please consider completing one or both surveys.

HELP!
If you are an executive team member at a startup, please take this survey and pass it along to your colleagues within your company.

If you participate in any of the following social networks (Facebook, Friendster, LinkedIn, Twitter, MySpace, Ning, Orkut, or Yahoo!360), please take this survey and pass it along to your friends or colleagues.

Thanks for your support!



Battle Captains and Outage Managers

The other day at a client, we were trying to describe what an outage manager does, and a term from my time in the military came back to me: battle captain. The best description I could come up with for an outage manager was that they perform the same duties during an outage that a battle captain does for a unit in battle. For the non-military types, a battle captain resides in the tactical operations center (TOC) of a unit and takes care of tasks such as tracking the battle, enforcing orders, managing information, and making decisions based on commander’s intent when the commander is unavailable. This is exactly what an outage manager does for an outage: keeps track of the outage (timeline), follows up with people to make sure tasks are completed (e.g., investigating logs for errors), and makes sure information is retained and passed along. When the VP of Ops or CTO is briefing the CEO or on the phone with a vendor, the outage manager makes decisions.

From the article What Now, Battle Captain? The Who, What and How of the Job on Nobody’s Books, but Found in Every Unit’s TOC by CPT Marcus F. de Oliveira, Deputy Chief, Leaders’ Training Program, JRTC, here is the definition of the role:

The battle captain should be capable of assisting the command group in controlling the brigade or battalion. Remember, the commander commands the unit, and the XO is the chief of staff; BUT, those officers and the S3 must rest. They will also get pulled away from current operations to plan future operations, or receive orders from higher headquarters. The battle captain’s role then is to serve as a constant in the CP, someone who keeps his head in the current battle, and continuously assists commanders in the command and control of the fight.

A great battle captain can provide a tactical advantage to units in combat. If you have a great outage manager or have seen one work, you know how important they can be in reducing the duration of an outage. Most outage managers have primary jobs, such as managing a shift in the NOC or managing an ops team, but when an outage occurs they jump into the role of outage manager. If you don’t currently have an outage manager, junior military officers (JMOs) just leaving the service often make great ones.

