AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Newsletter – Firesheep

Below is part of our Fall 2010 Newsletter.  If you haven’t subscribed yet, click here to do so.

In this newsletter:

Scalability Rules

In between working with some terrific new clients this year we’ve been busy writing the second book, Scalability Rules.  With the help of some terrific technical reviewers we feel the book is taking shape very nicely, and the first five chapters are now available on Safari Rough Cuts for those interested in helping review.  Scalability Rules brings together 50 rules that we have gathered from our experiences working with over a hundred hyper-growth companies. This format of practical rules of scalability should make it ideal for use as a reference manual in formal meetings and informal discussions.

See More…

If you haven’t picked up your copy of The Art of Scalability or have a technologist on your holiday gift list, here are a couple of links for you:

Putting Out Firesheep (Protecting Your Users’ Cookies)

One of our recent blog posts that we found most interesting was submitted by a guest blogger.  Randy Wigginton is a seasoned technologist who, after a discussion with us about the security risks recently brought to light by the Firefox plugin Firesheep, came up with a solution that we thought should be shared.  Randy’s solution is ideal for companies that want to protect their user session data (login, browsing history, etc.) but don’t want to be encumbered by the overhead of running their entire site behind SSL.

We also have a simple demo set up for those interested in testing this solution.  The way it works is that when a user logs in, TWO cookies are dropped.  In the demo, one is called “session”, the other is called “authenticate”.  These two cookies are identical except for a single attribute: “authenticate” is a secure cookie.  We authenticate users on non-secure pages by including a reference to a secure JavaScript file at the top of each page.  At the top of pages requiring authentication is this simple line of code:

<script type="text/javascript" src="https://verify.akfdemo.com/authenticate.php">
</script>

See More…

Lot18’s Series A

Lot18, a membership-by-invitation marketplace for wine from renowned producers at fantastic values, announced that it has completed a $3 million Series A round of funding led by FirstMark Capital, a New York City-based venture capital firm. Lot18 was founded by Kevin Fortuna and Philip James. Philip was the founder of Snooth.com, the world’s largest wine website, and Kevin was most recently a partner at AKF Partners.

Lot18 Screen Shot

See More…



DevOps

What do you call a set of processes or systems for coordination between development and operations teams? Give up? Try “DevOps”. The concept is not new (we’ve been living and recommending Architecture Review Board (ARB) and Joint Architecture Design (JAD) as cornerstones of this coordination for years), but it has recently grown into a discipline of its own. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.” The history of the DevOps Wikipedia page shows that this topic is a recent entry.

There are a lot of other resources on the web that may not have been using this exact term but have certainly been dealing with the development and operations coordination challenge for years.  Dev2Ops.org is one such group; earlier this year they posted their definition of DevOps: “an umbrella concept that refers to anything that smoothes out the interaction between development and operations.”  Their post goes on to highlight that the concept of DevOps is a response to the growing awareness of a disconnect between development and operations. While I think that is correct, I think it’s only part of the reason for the recent interest in defining DevOps.

With ideas such as continuous deployment and Amazon’s two-pizza rule for highly autonomous dev/ops teams, there is a blurring of roles between development and operations. Another driver of this movement is cloud computing. With the advent of GUI- or API-based cloud control interfaces, developers can procure, deploy, and support virtual instances much more easily than ever before. What used to be clearly defined career paths and sets of responsibilities are now being blended to create a new, more efficient, and highly sought-after technologist. A developer who understands operations support and a system administrator who understands programming are both very valuable utility players.

While DevOps is perhaps a new term for an old problem, it is promising that organizations are taking an interest in the challenges of coordination between development and operations. Given the blurring of roles, it is even more important that organizations pay attention to this topic.



How To Restore Service in Less Than 5 Minutes

What’s the first thing you do when your site is down? Most people pull up Nagios, or the like, and check all the servers, databases, and storage systems. Someone else might start tail’ing or grep’ing the log files. Tech executives by now are answering phone calls or sending email updates about the outage and expected downtime. Software developers are called in to go over the log files in more detail, and network engineers are asked to jump on devices to make sure they are responding properly.

What’s missing from the above scenario? Nobody looked up the last change that went into production. In our experience, 90+% of the problems in production are caused by the latest change, be it a code release, firewall change, or applying DDL or DML to the database. And it’s a sure bet that latest change is the problem if the person who made it says “That couldn’t have caused the outage.” In fact there is probably a high degree of correlation between how emphatically they make their statement and the probability that it is the cause of the incident.

Just the other day one of our friends had an outage call where the network security team was arguing that their latest change could not possibly have caused the outage. Guess what caused the outage… that’s right, the firewall change.

So, how do you solve 90+% of your problems in less than 5 minutes? You immediately roll back the last change you made to your production environment. You might be saying to yourself “But how can I do that when I don’t know all the changes that are happening in my production environment?” And that (as Paul Harvey used to say) is the rest of the story.

You have to keep track of every single change that takes place in your production environment. This is called “change tracking” and is different from “change management”. Change tracking is simply keeping track, in any format, of all the changes that happen in production. These changes can be kept in a Word document, spreadsheet, database, IRC channel, or even an unmonitored email account. Anything works as long as it 1) allows fast entry, so people have no excuse not to use it, and 2) can be retrieved immediately when needed during an outage.
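As a rough illustration of how little tooling change tracking needs, here is a minimal sketch in Python. The file format (one JSON object per line) and the function names `record_change` and `recent_changes` are our own assumptions for the example, not a prescribed tool; any store that is fast to write to and fast to read from would do.

```python
import json
import time
from pathlib import Path


def record_change(log_path, who, what, when=None):
    """Append one production change to a JSON-lines file (fast entry)."""
    entry = {
        "when": time.time() if when is None else when,
        "who": who,
        "what": what,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def recent_changes(log_path, limit=5):
    """Return the most recent changes, newest first (fast retrieval)."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    entries.sort(key=lambda e: e["when"], reverse=True)
    return entries[:limit]
```

During an outage, the first entry returned by `recent_changes` is the first candidate for rollback.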



Designing for Rollback

We’ve several times made reference to the need for organizations to design for rollback to be successful as a SaaS company.  Put simply, given the speed with which we want to make releases, it is critical that we limit our risk in delivering any given release by being able to easily roll back these releases.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns.  Once these standards are implemented every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to roll back any changes.  There are two reasons for this:
      1. The team needs to test the rollback process in QA or staging in order to validate that they have not missed something that would prevent rolling back, and
      2. The script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and adding explicit column names to all INSERT statements, so that adding a column does not break existing statements.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently being used as a status semaphore with three values: assigned, fixed, or closed.  The new version of the application cannot add a fourth status in a single release.  First, code must be released that can handle the new status; only then can a later release start using it.
  • Wire On / Wire Off – The application should have a framework added that allows code paths and features to be accessed by some users and not by others, based on an external configuration.  This setting can be in a configuration file or a database table and should allow for both role-based access and random percentage-based access.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.
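To make the Wire On / Wire Off idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `FEATURE_CONFIG` dictionary stands in for the external configuration file or database table, and the feature names are hypothetical. Note one deliberate substitution: instead of a truly random percentage, the sketch hashes the user ID so the same user always lands in the same bucket and does not see a feature flicker on and off.

```python
import hashlib

# Hypothetical external configuration; in practice this might live in a
# config file or a database table so it can change without a code release.
FEATURE_CONFIG = {
    "new_checkout": {"enabled": True, "roles": {"beta_tester"}, "percent": 10},
    "old_search": {"enabled": False, "roles": set(), "percent": 100},
}


def bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_wired_on(feature, user_id, roles=(), config=FEATURE_CONFIG):
    """Return True if the feature's code path is on for this user."""
    rule = config.get(feature)
    if rule is None or not rule["enabled"]:
        return False  # wired off (or unknown): nobody sees the code path
    if rule["roles"] & set(roles):
        return True  # role-based access
    return bucket(feature, user_id) < rule["percent"]  # percentage rollout
```

Flipping `enabled` to `False` in the configuration removes the code path for everyone without rolling the release back.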


Slaying Firesheep

This is a guest post by Randy Wigginton that started from a conversation about how to better secure cookies. Randy has an incredibly impressive career being one of the earliest employees at Apple and holding Distinguished Engineer and Architect titles at companies such as eBay, Quigo, and Google. Nowadays he is spending most of his time on personal projects that grab his attention such as this issue with unsecured cookies. Randy can be reached directly at this email.

The browser extension Firesheep has deservedly attracted a great deal of attention.  This extension has made it painfully obvious that many major Internet sites have not adequately protected users’ information.  In this article, we present a simple approach that will substantially improve user authentication security, and render Firesheep and other session sidejacking tools mostly useless.

There are at least three different levels of security used on the web:

  1. No security.  Generally used for pages with static content, open for all.
  2. Some security.  Useful for sites with login and customization, such as Facebook or Amazon.  The information on the pages is not particularly sensitive.  This is the majority case for websites.
  3. Full security.  Financial and other sites where all information must be kept confidential.

For #1, any HTTP server is sufficient. For #3, all pages, images and communications must be encrypted. Case #2 is a hybrid.

For most users, there are asymmetrical aspects to logged-in, customized websites.  While I do not care if anyone sniffs the network to get my status updates or discover what I am shopping for, I do NOT want anyone else claiming to be me or buying items on my behalf!  The traditional IT response has been “The only way to be secure is to put all pages and images under SSL”.  The problem with that approach is that SSL is slower and more costly; sites switching to all SSL will need to increase their server farms substantially.  This can be extremely expensive.

Here is a demo of a very simple site.  This site consists of a starting page, a secure login, then two non-secure pages that require users to be logged in.  Here you will find an extension script for Firesheep (right click and ‘save as’); it captures session cookies from the demo domain.  If you attempt to hijack a session on the akfdemo.com domain, you will be redirected to the sidejacking page.

How are sidejackers recognized?  When a user logs in, TWO cookies are dropped.  In our case, one is called “session”, the other is called “authenticate”.  These two cookies are identical except for a single attribute: “authenticate” is a secure cookie, so the browser only sends it over HTTPS.  We authenticate users on non-secure pages by including a reference to a secure JavaScript file at the top of each page.  At the top of pages requiring authentication is this line:

<script type="text/javascript" src="https://verify.akfdemo.com/authenticate.php"></script>

The authenticate.php script is:


<?php
// If this is the original user, they will have one secure and one non-secure cookie.
// Both are set to username:password.
// A real implementation should encrypt values.  This is for demonstration purposes.
$session = isset($_COOKIE['session']) ? $_COOKIE['session'] : '';
$authenticate = isset($_COOKIE['authenticate']) ? $_COOKIE['authenticate'] : '';
if (strlen($session) == 0) {
    // They have not logged in.
    echo "window.location = 'http://" . $_SERVER['HTTP_HOST'] . "/landing.html';";
} else if ($authenticate === $session) {
    // The secure cookie is identical to the non-secure cookie.  Let the user stay.
} else {
    // They do not have the secure cookie we require.  This must be a hacker!
    echo "window.location = 'http://" . $_SERVER['HTTP_HOST'] . "/sidejacked.html';";
}
?>

If the user has no session cookie, they have not logged in; send them to the starting page.  If the user has a session cookie that matches the secure authentication cookie, they are allowed through.  In the last case, they have a session cookie (which could have been obtained from Firesheep or other), but they do not possess the matching authenticate cookie.  This is the sidejacking case; in such a situation, we direct the browser to the sidejacked.html page.
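The login side of this scheme is not shown in the article, so here is a minimal sketch of what dropping the two cookies might look like, written in Python with the standard library’s `http.cookies` module. The function name, the shared `akfdemo.com` cookie domain, and the idea of passing in a single opaque value are assumptions for illustration; the demo itself stores username:password, which a real implementation should not do.

```python
from http.cookies import SimpleCookie


def login_cookie_headers(value, domain="akfdemo.com"):
    """Build Set-Cookie header values for the session/authenticate pair."""
    cookies = SimpleCookie()
    for name in ("session", "authenticate"):
        cookies[name] = value
        cookies[name]["domain"] = domain
        cookies[name]["path"] = "/"
    # The only difference between the two: "authenticate" is flagged Secure,
    # so browsers send it over HTTPS only and a sniffer on HTTP never sees it.
    cookies["authenticate"]["secure"] = True
    return [morsel.OutputString() for morsel in cookies.values()]
```

Each returned string would be sent as the value of a `Set-Cookie` response header from the secure login page.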

It is best to think of the secure cookie as a checksum, or verification, of all the plain non-secure cookies.  With this technique, we improve user security at a fraction of the cost of using full SSL for all resources.  This technique should be used in conjunction with other security best practices to provide a complete security solution for a website.

Another security approach that consumer-based internet companies should consider is using HTTP for the base page and any non-personal information, while collecting and displaying personal user information via HTTPS AJAX calls.   This way the user info is protected, the entire page does not require the overhead of HTTPS, and the browsers don’t alert users of mixed content.

If you haven’t installed Firesheep but are curious how it works, here is what it looks like running.

You can see on the left side that it has captured several cookies from Yahoo, Google, Facebook, Twitter, and our AKF Demo site.  When you click on any one of those captured cookies (except for the AKF Demo) it logs you in to that person’s account. Below is what happens when you try it on the AKF Demo site with Randy’s code.

Notice that it cannot log in to the demo site and is actually identified as a possible sidejacker!



Simultaneous Discovery

The Paleolithic Era (Old Stone Age) lasted roughly from 2.5 million to 10,000 years ago. During this time humans moved around in small bands as hunter/gatherers. Sometime around the Neolithic Age (New Stone Age) humans invented or discovered farming. While turning inedible crops like wheat into food is impressive, what’s even more impressive is that humans separately invented farming at least three times and possibly as many as seven times. Different civilizations from the Eastern Mediterranean to China to Mexico all came up with the idea of farming, presumably without sharing this knowledge in any way.

While the discovery of farming might seem an evolutionary necessity for long-term survival, the coincidental simultaneous invention by disparate individuals is apparently not uncommon at all.  In 1611, sunspots were discovered at least four different times; in 1869 both Cros and du Hauron invented color photography; and one that you might be more familiar with is the invention of the phone by Bell, Gray, and la Cour, to name a few of the individuals involved.  Napier and Briggs are credited with logarithms, but Bürgi also invented them a few years earlier.  Another popular one is the theory of natural selection being developed independently but simultaneously by Wallace and Darwin. There are so many of these simultaneous discoveries or inventions that William F. Ogburn and Dorothy Thomas published a paper, “Are Inventions Inevitable? A Note on Social Evolution”, in 1922 that documented 148 of them.

No one is really sure why this happens. Some believe in a sort of efficient-market hypothesis, which in financial markets means that information is ubiquitous and therefore you cannot consistently beat the market because everyone knows the same information almost simultaneously. Ogburn and Thomas postulated in their paper that because there are very few completely new discoveries, most inventions are inevitable.  Inventions are built on top of other inventions; the steamboat, for example, depended on boats and steam engines being invented first.

While this is a curiosity, you’re probably wondering how it applies to hyper-growth startups. The key takeaway is that while you’re coming up with a great idea, so is everyone else. The ability to iterate quickly on ideas is more critical than ever. Combine this absolute need for quick iterations with the requirement to measure the results of that effort, lest it be completely wasted, and you have A/B testing on features that are launched in weekly sprints. SaaS companies have no excuse for not releasing in very short sprints (if not continuously), watching user behavior to learn what works and what doesn’t, then iterating again.

Despite the plethora of articles and books to the contrary, there are very few million dollar ideas, just million dollar executions of ideas. If investors are looking for key attributes about a team that make them more likely to succeed or not, I’d suggest looking for a team that can deliver quickly and knows the importance of measuring success.



Scalability as a Discipline

Just as we discussed in an earlier post about the evolution of roles in technology startups, we’ve seen the same thing in the technology discipline as a whole. Computer science as a discipline started in mathematics with Kurt Gödel’s incompleteness theorem.  From there Alan Turing and Alonzo Church formalized the notion of an algorithm and the concept of a Turing machine. The first computer that could run stored programs, based on the Turing machine model, was built in 1948 and called the Manchester Baby.

In the beginning there were only programmers, then came system operators, DBAs, architects, etc. We now have many different disciplines that one can specialize in for either part or all of a career. One of the missing disciplines, in my opinion, is the scalability architect, or scalability as a discipline.

While understanding the rules, patterns, and principles of scalability is completely achievable by anyone in the technology organization, this does not mean that they are widely known. Scalability architects would be more like evangelists and teachers than gatekeepers of secret knowledge. Unlike DBAs or network engineers, whose jobs really aren’t to educate other technology people on how to create an index or open a port, the scalability architect would educate tech people. All other disciplines, from software developers to DBAs, could benefit from additional knowledge about scaling.

If you’re serious about scaling, is it time that you looked for or anointed a scalability architect?



Evolution of Roles in a Startup

We often see in the life cycle of startups that the organization starts with a couple of engineers who handle all aspects of technology, and as the team grows specialization starts to be required. At some point, QA engineers are hired, sys admins take over deploying and maintaining hardware, and DBAs are brought on board to tune databases. This is a very natural evolutionary process but does require some adjustment by the individuals as they are forced to give up responsibility and become more specialized. One of the toughest hurdles to overcome is getting engineers to relinquish their access to the production environment. Taking control or responsibility away from someone is very hard on people’s egos.

Another often-seen necessity in hyper-growth startups is to upgrade leaders. A leader who was capable of leading and managing five engineers isn’t necessarily capable of running a 50-person tech organization. Often people in particular leadership roles don’t scale with the fast-paced growth of the organization. In these cases the individuals either need to relinquish their roles or be replaced in order to continue to scale the company. This doesn’t mean pushing them out; more likely it means finding a more suitable role for them. A great role for many CTOs who need to step aside is to remain in a leadership and technical role as chief architect.

The key to being successful in this evolution is to be open and address people’s fears and concerns. It is much better to speak openly during reviews about an individual’s capabilities rather than have that person worry about their future. The same goes for engineers being asked to relinquish control of the production environment. Be open, talk to them, and listen to their concerns. An open dialogue about why the organization needs to change at this particular time in order to continue to grow and scale is usually accepted very well.



Defining Pods, Shards and Swim Lanes

In the course of our engagements we often have to pause for a few minutes to acquaint everyone with a few terms that we use. It is often the case that they have heard or even use some terms common in the industry. Three of these that are often used and/or confused are pods, shards, and swim lanes. Let’s start by defining each one and then explain the differences.

Shards
According to Merriam-Webster a shard is a small piece or part. Wikipedia defines a database shard as “…a method of horizontal partitioning in a database or search engine.” The term horizontal partitioning refers to a database design principle whereby rows of a database table are separated possibly onto physically distinct database servers.

A shard to AKF is a Z-axis split on the AKF Scale Cube. This involves splitting the tables in the database between two or more database servers based on some appropriate key such as customer ID or sales items. An X-axis split involves replicas, such as read-only slaves or standbys, that are complete copies of the primary database. Y-axis splits are done by service, which usually aligns to a subset of tables. An example of this would be pulling session data off the primary database and onto its own database server.
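As a rough sketch of a Z-axis split, the routing decision can be as small as a function that maps a customer ID to a database server. The server names and the use of a CRC32 hash with a modulus are assumptions made for this example; many real systems use a lookup table or consistent hashing instead, because a plain modulus reassigns most customers whenever the number of shards changes.

```python
import zlib

# Hypothetical shard list; real deployments would hold connection settings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]


def shard_for(customer_id, shards=SHARDS):
    """Pick the shard that owns a customer's rows (a Z-axis split)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    key = str(customer_id).encode("utf-8")
    return shards[zlib.crc32(key) % len(shards)]
```

Every query for a given customer is then sent to `shard_for(customer_id)`, so each server holds only a slice of the rows.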

Pods
One of our clients, Salesforce.com, uses the term pods, especially for its Force.com software-as-a-service platform. Pods are self-contained sets of functionality that can consist of an app server or database. If a pod goes down, only the customers on that pod will be affected. Salesforce executives claimed that it delivered 99.95 percent uptime last year.

Swim Lanes
AKF uses the term “swim lane” to describe a failure domain or fault isolation architecture. A failure domain is a group of services within a boundary such that any failure within that boundary is contained and failures do not propagate outside. The benefit of such a failure domain is two-fold:

  1. Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
  2. Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending upon the approach, only a portion of users or a portion of functionality of the product is affected.

Between swim lanes synchronous calls are absolutely forbidden because any synchronous call between failure domains, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures. An example of how this happens is in your database when one long running query slows down all the other queries competing for locks or resources.
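One common way to respect this rule is to hand work to the other swim lane asynchronously, so the caller never waits on it. The sketch below uses Python’s in-process `queue.Queue` purely as a stand-in for whatever message bus sits between lanes; the function name and the choice to drop events when the queue is full are assumptions for illustration.

```python
import queue

# Hypothetical in-process stand-in for a message bus between two swim lanes.
outbox = queue.Queue(maxsize=1000)


def notify_other_lane(event, q=outbox):
    """Hand an event to another swim lane without waiting on it."""
    try:
        q.put_nowait(event)  # never blocks the calling lane
        return True
    except queue.Full:
        return False  # other lane is slow or down: drop or log, do not stall
```

If the receiving lane slows down or fails, the calling lane sees a `False` return value rather than a hung request, so the failure stays inside its own domain.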

Similarity and Differences
All of these terms describe similar architectures (splitting by customer or a similar key), but they serve different purposes. Shards are specific to databases and don’t imply whether the application tier is also sharded. The purpose of shards is to scale an RDBMS onto many different servers instead of larger hardware. Pods and swim lanes aim to achieve both scalability of the overall system (application and database) and fault isolation.



Book Review – Web Operations

Web Operations: Keeping the Data On Time, by John Allspaw and Jesse Robbins, is a collection of essays and interviews dealing specifically with web operations. The book’s stated goals are to explain the skills needed in web operations, demonstrate why it’s important to gather metrics, describe common approaches to database architectures, and define what to do after a problem occurs. I think they succeeded and would recommend this book to any technologist responsible for a highly available system. As one would expect, I enjoyed some essays more than others but overall found myself nodding my head in agreement with many of the authors.

The authors John Allspaw and Jesse Robbins, in addition to a long list of contributors such as Eric Ries, Paul Hammond, and Justin Huff, have terrific CV’s that demonstrate their first hand knowledge of what it takes to run large scale web operations. John is currently a Technical Advisor at Etsy and was formerly the Engineering Manager of Flickr Operations at Yahoo!. Jesse is the CEO & Co-founder of Opscode and worked at Amazon.com with a title of “Master of Disaster”.

Unlike other essay collections such as 97 Things Every Programmer Should Know, which I enjoyed but found disorganized (see my full review here), Web Operations is well organized, moving from general overview discussions to specific and actionable examples. The first chapter is an overview of web operations from a career perspective, and the book continues with chapters discussing such topics as continuous deployment, infrastructure as code, community involvement, dev and ops collaboration, relational databases, and NoSQL databases.

Put this book on your reading list or download it to your Kindle/iPad to read on your next flight. Be prepared to bookmark or highlight many of the authors’ insights that you’ll want to remember and share with your team.

For people interested in more books that we recommend, check them out at our Amazon store.

