AKF Partners

Abbott, Keeven & Fisher Partners
Partners In Hyper Growth

Designing for Rollback

We have referred several times to the need for organizations to design for rollback if they are to be successful as SaaS companies.  Put simply, given the speed at which we want to release, it is critical that we limit the risk of any given release by being able to roll it back easily.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that removes the dependency on those columns.  Once these standards are in place, every release should include a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to roll back any changes.  There are two reasons for this:
  1. The team needs to test the rollback process in QA or staging to validate that nothing has been missed that would prevent rolling back.
  2. The script needs to be tested under some amount of load to ensure it can be executed while the application is using the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and adding explicit column names to all INSERT statements.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently used as a status semaphore indicating one of three values: assigned, fixed, or closed.  The new version of the application cannot simply start writing a fourth status; code that can handle the new status must be released first, and only then can code that uses the new status be released.
  • Wire On / Wire Off – The application should include a framework that allows code paths and features to be accessed by some users and not by others, based on an external configuration.  This setting can live in a configuration file or a database table and should allow for both role-based access and random, percentage-based access.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.
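A minimal sketch of the first three practices, using Python's built-in sqlite3 module and a hypothetical `ticket` table: the next release only adds a table, its rollback script is written alongside it, and the current release's queries name their columns explicitly, so the old code runs unchanged against either schema. The table names, scripts, and helper functions are illustrative assumptions, not a prescribed migration tool.

```python
import sqlite3

# Schema as of release N (hypothetical "ticket" table).
BASELINE = """
CREATE TABLE ticket (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    status TEXT NOT NULL
);
"""

# Release N+1 is additive only: it creates a new table and touches nothing release N uses.
MIGRATE_UP = """
CREATE TABLE ticket_tag (
    ticket_id INTEGER NOT NULL REFERENCES ticket(id),
    tag       TEXT NOT NULL
);
"""

# The rollback is scripted together with the change so it can be rehearsed in QA or staging.
MIGRATE_DOWN = "DROP TABLE ticket_tag;"


def create_ticket(conn: sqlite3.Connection, title: str, status: str) -> int:
    """Release N code: the INSERT names its columns, so added columns cannot break it."""
    cur = conn.execute(
        "INSERT INTO ticket (title, status) VALUES (?, ?)", (title, status)
    )
    return cur.lastrowid


def read_ticket(conn: sqlite3.Connection, ticket_id: int):
    """Release N code: an explicit column list instead of SELECT *."""
    return conn.execute(
        "SELECT id, title, status FROM ticket WHERE id = ?", (ticket_id,)
    ).fetchone()


def table_names(conn: sqlite3.Connection) -> list:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(BASELINE)
    conn.executescript(MIGRATE_UP)                          # deploy the N+1 schema
    tid = create_ticket(conn, "login broken", "assigned")   # old code, new schema
    print(read_ticket(conn, tid))                           # (1, 'login broken', 'assigned')
    conn.executescript(MIGRATE_DOWN)                        # rehearse the rollback
    print(table_names(conn))                                # ['ticket']
```

Because the change is purely additive, rolling back only the application code is also safe: release N never reads `ticket_tag`, so the table can be left in place and dropped by a later cleanup release.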

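A minimal sketch of a wire on / wire off check, assuming the configuration has already been loaded from a file or database table into a dictionary. `FlagRule`, `is_enabled`, `bucket`, and the flag names are hypothetical. The random percentage is implemented here as a stable hash of the flag name and user ID, a common design choice so that a given user does not flip between code paths on every request.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlagRule:
    enabled: bool = True                                 # False = wired off for everyone
    roles: frozenset = field(default_factory=frozenset)  # roles that always get the feature
    percent: int = 0                                     # share of other users, 0-100


def bucket(flag: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(config: dict, flag: str, user_id: str, user_roles=()) -> bool:
    rule = config.get(flag)
    if rule is None or not rule.enabled:
        return False          # unknown or wired-off features take the old code path
    if rule.roles & set(user_roles):
        return True           # role-based access, e.g. beta testers
    return bucket(flag, user_id) < rule.percent   # percentage-based access


# Example configuration, as it might be loaded from a config file or database table.
FLAGS = {
    "new_checkout": FlagRule(roles=frozenset({"beta"}), percent=10),
    "search_v2": FlagRule(enabled=False, percent=100),   # wired off after a major bug
}

if __name__ == "__main__":
    print(is_enabled(FLAGS, "new_checkout", "user-42", ["beta"]))  # True
    print(is_enabled(FLAGS, "search_v2", "user-42", ["beta"]))     # False
```

Switching `search_v2` off is a configuration change rather than a code rollback, which is the point of the pattern: the faulty path disappears for every user while the rest of the release stays in production.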

Slaying Firesheep

This is a guest post by Randy Wigginton that started from a conversation about how to better secure cookies. Randy has had an incredibly impressive career: he was one of the earliest employees at Apple and has held Distinguished Engineer and Architect titles at companies such as eBay, Quigo, and Google. Nowadays he spends most of his time on personal projects that grab his attention, such as this issue with unsecured cookies. Randy can be reached directly at this email.

The browser extension Firesheep has deservedly attracted a great deal of attention.  This extension has made it painfully obvious that many major Internet sites have not adequately protected users’ information.  In this article, we present a simple approach that will substantially improve user authentication security and render Firesheep and other session sidejacking tools mostly useless.

There are three different levels of security used on the web:

  1. No security.  Generally used for pages with static content, open for all.
  2. Some security.  Useful for sites with login and customization, such as Facebook or Amazon.  The information on the pages is not particularly sensitive.  This is the majority case for websites.
  3. Full security.  Financial and other sites where all information must be kept confidential.

For #1, any HTTP server is sufficient. For #3, all pages, images, and communications must be encrypted. Case #2 is a hybrid.

For most users, there are asymmetrical aspects to logged-in, customized websites.  While I do not care if anyone sniffs the network to get my status updates or discover what I am shopping for, I do NOT want anyone else claiming to be me or buying items on my behalf!  The traditional IT response has been “The only way to be secure is to put all pages and images under SSL”.  The problem with that approach is that SSL is slower and more costly; sites switching to all SSL will need to increase their server farms substantially.  This can be extremely expensive.

Here is a demo of a very simple site.  It consists of a starting page, a secure login, and then two non-secure pages that require users to be logged in.  Here you will find an extension script for Firesheep; it captures session cookies from the demo domain.  If you attempt to hijack a session on the akfdemo.com domain, you will be redirected to the sidejacking page.

How are sidejackers recognized?  When a user logs in, TWO cookies are dropped.  In our case, one is called “session”, the other is called “authenticate”.  These two cookies are identical except for a single attribute: “authenticate” is a secure cookie.  We authenticate users on non-secure pages by including a reference to a secure JavaScript file at the top of each page.  At the top of pages requiring authentication is this line:

<script type="text/javascript" src="https://verify.akfdemo.com/authenticate.php"></script>

The authenticate.php script is:


<?php
// authenticate.php is served over HTTPS and included with a <script> tag, so its output is JavaScript.
// If this is the original user, they will have one secure and one non-secure cookie.
// Both are set to username:password.
// A real implementation should encrypt values.  This is for demonstration purposes.
header('Content-Type: application/javascript');

$session      = $_COOKIE['session'] ?? '';
$authenticate = $_COOKIE['authenticate'] ?? '';
$host         = $_SERVER['HTTP_HOST'];

if ($session === '') {
    // They have not logged in.
    echo "window.location = 'http://" . $host . "/landing.html';";
} elseif (hash_equals($authenticate, $session)) {
    // The secure cookie is identical to the non-secure cookie (constant-time compare).  Let the user stay.
} else {
    // They do not have the secure cookie we require.  This must be a hacker!
    echo "window.location = 'http://" . $host . "/sidejacked.html';";
}
?>

If the user has no session cookie, they have not logged in; send them to the starting page.  If the user has a session cookie that matches the secure authenticate cookie, they are allowed through.  In the last case, they have a session cookie (which could have been obtained with Firesheep or another tool), but they do not possess the matching authenticate cookie.  This is the sidejacking case; in such a situation, we direct the browser to the sidejacked.html page.

It is best to think of the secure cookie as a checksum, or verification, of all the plain non-secure cookies.  With this technique, we improve user security at a fraction of the cost of using full SSL for all resources.  This technique should be used in conjunction with other security best practices to provide a complete security solution for a website.
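A minimal Python sketch of the same two-cookie scheme, putting both halves (issuing and checking) in one place. It is an illustration, not Randy's code: it issues a random token instead of username:password (one way to act on the "real implementation" note in the script), sets it as both a plain `session` cookie and a `Secure` `authenticate` cookie, and `classify` mirrors the three branches of authenticate.php. The function names and the `akfdemo.com` domain argument are assumptions; the cookie domain must also cover the verify subdomain for the secure cookie to reach it.

```python
from __future__ import annotations

import hmac
import secrets
from http.cookies import SimpleCookie


def issue_login_cookies(domain: str) -> list[str]:
    """Return Set-Cookie header values for a successful login made over HTTPS."""
    token = secrets.token_urlsafe(32)   # random value rather than username:password
    jar = SimpleCookie()
    for name in ("session", "authenticate"):
        jar[name] = token
        jar[name]["domain"] = domain    # must also cover verify.<domain>
        jar[name]["path"] = "/"
    jar["authenticate"]["secure"] = True  # the browser only sends this one over HTTPS
    return [morsel.OutputString() for morsel in jar.values()]


def classify(cookies: dict) -> str:
    """Mirror of the three branches in authenticate.php."""
    session = cookies.get("session", "")
    authenticate = cookies.get("authenticate", "")
    if not session:
        return "landing"       # not logged in
    if hmac.compare_digest(session.encode(), authenticate.encode()):
        return "ok"            # the secure copy matches the plain copy
    return "sidejacked"        # plain cookie present, secure twin missing or different


if __name__ == "__main__":
    for header in issue_login_cookies("akfdemo.com"):
        print("Set-Cookie:", header)
    print(classify({"session": "abc"}))  # sidejacked
```

A sniffer on an open network only ever sees the `session` cookie, because the browser never sends `authenticate` over plain HTTP; replaying the stolen cookie therefore lands in the `sidejacked` branch.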

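Another security approach that consumer-based Internet companies should consider is serving the base page and any non-personal information over HTTP, while retrieving and displaying personal user information via AJAX calls over HTTPS.  This way the user information is protected, the entire page does not incur the overhead of HTTPS, and browsers don’t warn users about mixed content (that warning is triggered when an HTTPS page loads HTTP resources, not the reverse).  Because http:// and https:// URLs are different origins, the HTTPS endpoints must allow the page’s origin via CORS.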

If you haven’t installed Firesheep but are curious how it works, here is what it looks like running.

You can see on the left side that it has captured several cookies from Yahoo, Google, Facebook, Twitter, and our AKF Demo site.  Clicking any one of those captured cookies (except for the AKF Demo one) logs you into that person’s account. Below is what happens when you try it on the AKF Demo site with Randy’s code.

Notice that it cannot login to the demo site and is actually identified as a possible sidejacker!



Simultaneous Discovery

The Paleolithic Era (Old Stone Age) lasted roughly from 2.5 million to 10,000 years ago. During this time humans moved around in small bands as hunter-gatherers. Sometime around the Neolithic Age (New Stone Age) humans invented or discovered farming. While turning inedible crops like wheat into food is impressive, what’s even more impressive is that humans separately invented farming at least three times and possibly as many as seven times. Different civilizations, from the Eastern Mediterranean to China to Mexico, all came up with the idea of farming, presumably without sharing this knowledge in any way.

While the discovery of farming might seem an evolutionary necessity for long-term survival, coincidental simultaneous invention by disparate individuals is apparently not uncommon at all.  In 1611, sunspots were discovered at least four different times; in 1869 both Cros and du Hauron invented color photography; and, in an example you might be more familiar with, the telephone was invented by Bell, Gray, and la Cour, to name a few of the individuals involved.  Napier and Briggs are credited with logarithms, but Bürgi also invented them a few years earlier.  Another popular example is the theory of natural selection, developed independently but simultaneously by Wallace and Darwin. There are so many of these simultaneous discoveries or inventions that William F. Ogburn and Dorothy Thomas published a paper, “Are Inventions Inevitable? A Note on Social Evolution,” in 1922 that documented 148 of them.

No one is really sure why this happens. Some believe in a sort of efficient-market hypothesis, which in financial markets means that information is ubiquitous and therefore you cannot consistently beat the market because everyone knows the same information almost simultaneously. Ogburn and Thomas postulated in their paper that because there are very few completely new discoveries, most inventions are inevitable.  Inventions are built on top of other inventions; the steamboat, for example, depended on the prior invention of boats and steam engines.

This is a curiosity, but you’re probably wondering how it applies to hyper-growth startups. The key takeaway is that while you’re coming up with a great idea, so is everyone else. The ability to iterate quickly on ideas is more critical than ever. Combine this absolute need for quick iteration with the requirement to measure the results of that effort, lest it be completely wasted, and you have A/B testing of features launched in weekly sprints. SaaS companies have no excuse for not releasing in very short sprints (if not continuously), watching user behavior to learn what works and what doesn’t, and then iterating again.

Despite the plethora of articles and books to the contrary, there are very few million-dollar ideas, just million-dollar executions of ideas. If investors are looking for attributes that make a team more likely to succeed, I’d suggest looking for a team that can deliver quickly and knows the importance of measuring success.



Scalability as a Discipline

Just as we discussed in an earlier post about the evolution of roles in technology startups, we’ve seen the same thing happen in the technology discipline as a whole. Computer science as a discipline started in mathematics with Kurt Gödel’s incompleteness theorems.  From there Alan Turing and Alonzo Church formalized the notion of an algorithm, Church with the lambda calculus and Turing with the Turing machine. The first computer that could run stored programs was built in 1948 and called the Manchester Baby.

In the beginning there were only programmers; then came system operators, DBAs, architects, and so on. We now have many different disciplines that one can specialize in for part or all of one’s career. One of the missing disciplines, in my opinion, is the scalability architect, or scalability as a discipline.

While understanding the rules, patterns, and principles of scalability is completely achievable by anyone in the technology organization, this does not mean that they are widely known. Scalability architects would be more like evangelists and teachers than gatekeepers of secret knowledge. Unlike DBAs or network engineers, whose jobs really aren’t to educate other technology people on how to create an index or open a port, the scalability architect would educate technologists. All other disciplines, from software developers to DBAs, could benefit from additional knowledge about scaling.

If you’re serious about scaling, is it time that you looked for or anointed a scalability architect?



Evolution of Roles in a Startup

We often see in the life cycle of startups that the organization starts with a couple of engineers who handle all aspects of technology, and as the team grows, specialization starts to be required. At some point, QA engineers are hired, sys admins take over deploying and maintaining hardware, and DBAs are brought on board to tune databases. This is a very natural evolutionary process, but it does require some adjustment by the individuals as they are forced to give up responsibility and become more specialized. One of the toughest hurdles to overcome is getting engineers to relinquish their access to the production environment. Taking control or responsibility away from someone is very hard on people’s egos.

Another often seen necessity in hyper-growth startups is to upgrade leaders. A leader who was capable of leading and managing five engineers isn’t necessarily capable of running a 50-person tech organization. Often people in particular leadership roles don’t scale with the fast growth rate of the organization. In these cases the individuals either need to relinquish their roles or be replaced in order to continue to scale the company. This doesn’t necessarily mean pushing them out; more likely it means finding a more suitable role for them. A great role for many CTOs who need to step aside is to remain in a leadership and technical role as chief architect.

The key to being successful in this evolution is to be open and address people’s fears and concerns. It is much better to speak openly during reviews about an individual’s capabilities rather than have that person worry about their future. The same goes for engineers being asked to relinquish control of the production environment. Be open, talk to them, and listen to their concerns. An open dialogue about why the organization needs to change at this particular time in order to continue to grow and scale is usually accepted very well.



Defining Pods, Shards and Swim Lanes

In the course of our engagements we often have to pause for a few minutes to acquaint everyone with a few terms that we use. Often our clients have heard, or even use, some terms common in the industry. Three that are often used and/or confused are pods, shards, and swim lanes. Let’s start by defining each one and then explain the differences.

Shards
According to Merriam-Webster, a shard is a small piece or part. Wikipedia defines a database shard as “…a method of horizontal partitioning in a database or search engine.” The term horizontal partitioning refers to a database design principle whereby the rows of a database table are held separately, possibly on physically distinct database servers.

To AKF, a shard is a Z-axis split on the AKF Scale Cube. This involves splitting the data in the database between two or more database servers based on an appropriate key such as customer ID or sales item. An X-axis split involves replicas, such as read-only slaves or standbys, that are complete copies of the primary database. Y-axis splits are done by service, which usually aligns to a subset of tables. An example would be moving session data off the primary database and onto its own database server.
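A minimal routing sketch of the three axes as described above, with placeholder hostnames: the session service lives on its own database (Y-axis), order data is split by customer ID (Z-axis, i.e. a shard), and each database has read-only replicas (X-axis). The `route` function, the host names, and the modulus scheme are assumptions for illustration only.

```python
# Y-axis: each service has its own database (e.g. session moved off the primary).
SERVICE_DBS = {"session": "session-db", "catalog": "catalog-db"}

# Z-axis: order data is split across shards by customer ID.
ORDER_SHARDS = ["orders-db-0", "orders-db-1", "orders-db-2", "orders-db-3"]

# X-axis: every database has full read-only copies.
REPLICAS_PER_DB = 2


def route(service: str, customer_id: int, read_only: bool = False) -> str:
    """Return the (hypothetical) host that should serve this request."""
    if service == "orders":
        host = ORDER_SHARDS[customer_id % len(ORDER_SHARDS)]   # Z-axis
    else:
        host = SERVICE_DBS[service]                            # Y-axis
    if read_only:
        # Any replica holds a complete copy; pick one deterministically here.
        return f"{host}-replica-{customer_id % REPLICAS_PER_DB}"  # X-axis
    return host


if __name__ == "__main__":
    print(route("orders", 5))                  # orders-db-1
    print(route("orders", 5, read_only=True))  # orders-db-1-replica-1
    print(route("session", 5))                 # session-db
```

A plain modulus means that adding a shard changes where most customers map, forcing data to move; a lookup table from customer ID (or ID range) to shard is a common alternative when shards are expected to grow.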

Pods
One of our clients, Salesforce.com, uses the term pods, especially for its Force.com software-as-a-service platform. Pods are self-contained sets of functionality that can consist of application servers and databases. If a pod goes down, only the customers on that pod are affected. Salesforce executives claimed that it delivered 99.95 percent uptime last year.

Swim Lanes
AKF uses the term “swim lane” to describe a failure domain or fault isolation architecture. A failure domain is a group of services within a boundary such that any failure within that boundary is contained and failures do not propagate outside. The benefit of such a failure domain is two-fold:

  1. Fault Detection: Given a granular enough approach, the component of availability associated with the time to identify the failure is significantly reduced. This is because all effort to find the root cause or failed component is isolated to the section of the product or platform associated with the failure domain.
  2. Fault Isolation: As stated previously, the failure does not propagate or cause a deterioration of other services within the platform. As such, and depending on the approach, only a portion of users or a portion of the product’s functionality is affected.

Synchronous calls between swim lanes are absolutely forbidden, because any synchronous call between failure domains, even with appropriate timeout and detection mechanisms, is very likely to cause a cascading series of failures. This is similar to what happens in a database when one long-running query slows down all the other queries competing for the same locks or resources.
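A minimal sketch of swim lanes, assuming customers are assigned to lanes by a hash of their ID (the same kind of key used for pods and shards). Synchronous work stays inside a lane; anything that must reach another lane is placed on that lane's queue without waiting for it. `Lane`, `lane_for`, and `notify` are hypothetical names, and an in-process `queue.Queue` stands in for a real message broker.

```python
import queue
import zlib


class Lane:
    """One failure domain: its own app servers, databases, and inbound queue."""

    def __init__(self, name: str, max_backlog: int = 1000):
        self.name = name
        self.healthy = True
        self.inbox = queue.Queue(maxsize=max_backlog)

    def handle(self, customer_id: str, action: str) -> str:
        # Synchronous calls stay within this lane's boundary.
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name}: {action} for {customer_id}"


LANES = [Lane("lane-a"), Lane("lane-b"), Lane("lane-c")]


def lane_for(customer_id: str) -> Lane:
    """Stable assignment of each customer to exactly one lane."""
    return LANES[zlib.crc32(customer_id.encode()) % len(LANES)]


def notify(target: Lane, event: dict) -> bool:
    """Cross-lane communication: enqueue and return at once; never wait on the other lane."""
    try:
        target.inbox.put_nowait(event)
        return True
    except queue.Full:
        return False   # caller can retry later; its own lane keeps serving requests


if __name__ == "__main__":
    LANES[1].healthy = False                       # simulate a failure in lane-b
    for cid in ("cust-1", "cust-2", "cust-3", "cust-4"):
        lane = lane_for(cid)
        try:
            print(lane.handle(cid, "view"))
        except RuntimeError as err:
            print(err)                             # only lane-b's customers see this
    print(notify(LANES[1], {"type": "invoice"}))   # True: queued, not blocked
```

The failed lane only affects the customers assigned to it, and a caller in a healthy lane is never stuck waiting on it, because `notify` returns immediately whether or not the target lane is up.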

Similarity and Differences
All of these terms describe similar architectures (splitting by customer or a similar key), but they are done for different purposes. Shards are specific to databases and don’t imply whether the application tier is also split. The purpose of shards is to scale an RDBMS across many servers instead of onto larger hardware. Pods and swim lanes aim to achieve both scalability of the overall system (application and database) and fault isolation.



Book Review – Web Operations

Web Operations: Keeping the Data On Time, by John Allspaw and Jesse Robbins, is a collection of essays and interviews dealing specifically with web operations. The book’s stated goals are to explain the skills needed in web operations, demonstrate why it’s important to gather metrics, describe common approaches to database architectures, and define what to do after a problem occurs. I think they succeeded and would recommend this book to any technologist responsible for a highly available system. As one would expect, I enjoyed some essays more than others, but overall I found myself nodding my head in agreement with many of the authors.

The authors, John Allspaw and Jesse Robbins, along with a long list of contributors such as Eric Ries, Paul Hammond, and Justin Huff, have terrific CVs that demonstrate their firsthand knowledge of what it takes to run large-scale web operations. John is currently a Technical Advisor at Etsy and was formerly the Engineering Manager of Flickr Operations at Yahoo!. Jesse is the CEO & Co-founder of Opscode and worked at Amazon.com with the title of “Master of Disaster”.

Unlike other essay collections such as 97 Things Every Programmer Should Know, which I enjoyed but found disorganized (see my full review here), Web Operations is well organized, moving from general overview discussions to specific and actionable examples. The first chapter is an overview of web operations from a career perspective, and the book continues with chapters discussing such topics as continuous deployment, infrastructure as code, community involvement, dev and ops collaboration, relational databases, and NoSQL databases.

Put this book on your reading list or download it to your Kindle/iPad to read on your next flight. Be prepared to bookmark or highlight many of the authors’ insights that you’ll want to remember and share with your team.

For people interested in more books that we recommend, check them out at our Amazon store.



RAC Rant

We’ve written about trying to use vendor features to scale but given how often we run across companies that have been convinced by vendors to rely on them, this topic is worth revisiting. To state it as directly as possible, every major SaaS company that has relied on a vendor, software or hardware, to scale them through hyper-growth has failed and had to solve the scale problem themselves.

Since Oracle OpenWorld took place recently, I’ve decided to use the Oracle RDBMS as an example of failing to scale with vendor features. We have nothing against using Oracle as an RDBMS, even though there are open source options that can scale just as well, but let’s use one of its scalability features, Real Application Clusters (RAC), as an example. In Oracle’s own words, RAC “…enables a single database to run across a cluster of servers, providing unbeatable fault tolerance, performance, and scalability with no application changes necessary.” Scaling with “no application changes” is a nice concept, but it isn’t realistic for hyper-growth companies. One major reason is that RAC does not scale across multiple data centers, which is a requirement for hyper-growth companies since everything fails eventually, including data centers. Even with “Extended Distance Clusters,” RAC nodes only extend to 25 kilometers using dark fiber (DWDM or CWDM) technology.

The use of RAC for increased availability is fine, but you should review our post on the downside of using vendor features and how to negotiate with vendors. In particular, you should be aware that by using this feature you have weakened your position during renewal negotiations. If you think your salesperson is being nice by throwing in the RAC feature for a low price, think again. As soon as you start using this feature, they have the upper hand in negotiations.

Enough of the RAC rant, especially since this is just one example of many that are out there. Hardware vendors, both servers and storage, are just as guilty of trying to convince SaaS companies to rely on them for scalability. Keep your destiny in your own hands and resist relying on short term solutions to long term problems.



Asking For Help

There was a study by Viswanath Venkatesh and Michael G. Morris, “Why Don’t Men Ever Stop to Ask for Directions? Gender, Social Influence, and Their Role in Technology Acceptance and Usage Behavior,” that looked at 342 workers over five months to observe usage and adoption of technology. The researchers found that men considered perceived usefulness to a greater extent than women in making decisions about the use of new technology. On the other hand, perceived ease of use was more important to women than to men, even after initial training. Perhaps even more interesting, men’s perception of ease of use went up after using the system, while women’s remained unchanged. Additionally, women considered subjective norms much more than men did. This suggests that women are much more balanced in their technology decisions with regard to perceived usefulness, ease of use, and subjective norms.

Another study by Fiona Lee, “When the Going Gets Tough, Do the Tough Ask for Help? Help Seeking and Power Motivation in Organizations,” revealed that individuals do not seek help, even when it is needed and available, because doing so implies incompetence, dependence, and powerlessness.  There is even a whimsical study from insurer Sheilas’ Wheels that claims that the average male drives around lost 276 miles each year, costing over $3,000 in fuel.

Our experience with hundreds of clients has been that technologists in general hate to ask for help and would much rather struggle to solve the problem themselves. Unfortunately, they do so even at the risk of being very inefficient and wasting organizational resources. Whether the reason is genetic or a fear of being seen as incompetent, the problem is costly to organizations.

Asking for help or advice from someone who has been down the road you’re traveling is, in my opinion, not only the fastest way to get there but also a sign of great courage and confidence. You have to be completely comfortable with what you know in order to admit in front of peers or bosses what you don’t know. Some of this confidence comes from experience, but some of it also comes from the culture of the organization. If you’ve built or inherited an organizational culture where everyone pretends to know everything and is afraid to ask for help, you’re guaranteed to be very inefficient. If you find yourself faced with this situation, be the leader and step forward to show people how to ask for help. Start by calling this issue out at your next all-hands meeting or staff meeting. Then show people it’s more important to be unbelievably curious and passionate about your craft than to appear like you know everything.



“Internal Customer”: The “C” Word of SaaS Companies

If you are a technology organization within a Software as a Service (SaaS) company, there is no such thing as an “internal customer”.

We often see this anachronistic IT phrase thrown around in web X.0 companies by executives and engineers who simply have not adopted the new SaaS mindset.  Do you think you’ll hear the left offensive tackle of an NFL team refer to the quarterback as his “internal customer”?  The quarterback consumes the services (energy spent blocking opponents) of the left tackle, so why wouldn’t he be a customer?  The answer is simple: the notion of a customer relationship is different from the notion of a relationship between teammates.

The first reason your teammate isn’t your customer is that he or she is, well, your TEAMMATE.  Customers are people for whom you produce a service or product; teammates are people with whom you work to accomplish a goal.  The difference between working FOR someone and working WITH someone is HUGE.  This difference creates a contextually activated identity that forces you to think about customers in a different light than you would a teammate.  Very often, as we’ve written before, this can result in affective (role-based or bad) conflict between teams.  Affective conflict is bad and it destroys shareholder value.  Working as a team is important, and customers aren’t part of your team.

The next reason that your teammate isn’t your customer is that the customer is always right.  Your teammate isn’t always right.  You need to debate certain points as a team to come to better solutions.  This isn’t affective conflict, it is cognitive conflict and if handled properly it is good and helps to create shareholder value.

The most important reason there isn’t a customer relationship here is that your teammate isn’t paying!   “Servicing” your teammate (uggh…that’s an ugly term) doesn’t create shareholder value.  Working as a team to deliver a service or product to your “real” customer is what creates shareholder value.  One design, one approach, one ruthless drive as a team to get across the goal line is what is necessary to thrive and succeed.

So stop using the ugly “internal” C word in your SaaS company.  It doesn’t have a place there.  Let the old-world internal IT folks continue to provide services to their internal customers.  Start acting like a team, designing and building services rather than software.

