AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Google’s Megastore

Papers from the 2011 Conference on Innovative Data Systems Research (CIDR) have been posted and one that is particularly interesting is the Google paper detailing their design and development of Megastore. Megastore is a storage system developed to meet the requirements of today’s interactive online services. According to the paper “Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability.” The system’s underlying datastore is Google’s Bigtable but it additionally provides for serializable ACID semantics and fine-grained partitions of data.

Here is the link to the paper so you can read all the details yourself, but I thought I’d point out a couple of things that I found interesting about the design.

Data Split
The Megastore design is what AKF would call a Z-axis split of the data. Google does this because partitioning allows for the synchronous replication of each write across a wide area network (between datacenters) with reasonable latency. The key is that the smaller the amount of data, the faster the replication. The paper states “…data for most Internet services can be suitably partitioned (e.g., by user) to make this approach viable.”
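
As a rough illustration of a Z-axis split by user, the sketch below maps a user ID to one of a fixed number of partitions. The shard count and the CRC-modulo scheme are assumptions chosen for brevity; this is not how Megastore assigns its own partitions.

```shell
# Number of partitions is a hypothetical value for illustration
SHARD_COUNT=4

# Map a user ID to a shard number in 0..SHARD_COUNT-1.
# cksum prints "<crc> <byte count>"; only the CRC is used.
shard_for_user() {
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % SHARD_COUNT ))
}

for user in alice bob carol
do echo "$user -> shard $(shard_for_user "$user")"
done
```

Because the same user always lands on the same shard, each write only needs to be replicated for that one small partition.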

Joins in Code
Most of our clients are likely to never require extreme scaling on a relational database, but if you’re one of the lucky ones, the way to do so is to minimize the use of relational features such as joins. While joining in the DB is terribly efficient from a coding perspective, by joining in the code you remove load from the DB. You can scale by adding web servers much more easily and cheaply than you can by adding relational databases. Google’s paper states that normalized relational schemas that rely on joins at query time were not the right model for Megastore for three reasons: high-volume interactive workloads benefit more from predictable performance; reads dominate writes in most web applications, so it pays to move work from read time to write time; and key-value stores make querying hierarchical data very simple.
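
To make “joining in the code” concrete, here is a minimal sketch. It assumes the application has already fetched two simple single-table result sets (the file layouts and the function name are hypothetical) and combines them itself instead of asking the database to join.

```shell
# Hypothetical results of two simple single-table queries, one row per line:
#   users:  "<user_id> <name>"
#   orders: "<order_id> <user_id> <total>"
join_orders_to_users() {
  # Pass 1 (first file) builds an id -> name lookup; pass 2 emits matching orders
  awk 'NR == FNR { name[$1] = $2; next }
       ($2 in name) { print $1, name[$2], $3 }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '1 alice\n2 bob\n' > "$tmp/users"
printf '100 1 25.00\n101 2 9.50\n102 1 4.25\n' > "$tmp/orders"
join_orders_to_users "$tmp/users" "$tmp/orders"
rm -r "$tmp"
```

The lookup work runs on the application tier, which is the tier that is cheap to add servers to.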

Paxos and Two-Phase Commit
Google’s Megastore utilizes two algorithms that I personally thought would not scale at very large transaction volumes. The first is the Paxos algorithm, which is a way to reach consensus among a group of replicas on a single value. It tolerates up to F faults with 2F + 1 replicas by essentially voting among the replicas, which is notoriously slow. The second algorithm is Two-Phase Commit, which allows for atomic updates across entity groups. The paper does admit that these transactions have “much higher latency and increase the risk of contention.” That, in my opinion, is very understated, but the authors do discourage applications from using the feature in favor of queues.
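
The 2F + 1 relationship is easy to check with a little arithmetic. The helper names below are made up for this sketch, which only computes replica and majority counts and does not implement Paxos itself.

```shell
# Replicas required to tolerate F simultaneous faults: 2F + 1
replicas_for_faults() {
  echo $(( 2 * $1 + 1 ))
}

# Smallest majority (quorum) among N replicas
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

F=2
N=$(replicas_for_faults "$F")
echo "tolerate $F faults: $N replicas, quorum of $(quorum_size "$N")"
```

With F = 2 there are five replicas and a quorum of three, so the three replicas that survive two failures can still form a majority.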

I highly recommend you put this paper and some of the others from CIDR on your Instapaper list for reading on your next flight or while bored during your next meeting.


How to Setup a Failover Server on EC2

We started working on Amazon’s EC2 instances several years ago. Eventually we moved several of our hosted environments to the cloud and used scripts to back up the MySQL DBs and file systems to S3. While the EC2 instances are pretty stable, like everything else they do occasionally fail. Since Amazon offers an elastic load balancer solution I started there. The setup is incredibly simple through the AWS UI and the cost is pretty reasonable at $0.025 per hour and $0.008 per GB. The problem with Amazon’s elastic load balancer solution is that you can’t associate an IP with it and can only address it by the domain name that Amazon has assigned. This prevents Amazon’s elastic LB from being used for a primary domain; you can only use it for sub-domains. This wasn’t acceptable so I started looking at alternatives.

HAProxy was top of my list for an open source LB because of its ease of configuration, performance, and wide adoption. What I didn’t like about this solution is that because it is in the path of traffic it requires two servers set up in HA mode, lest I cause more issues than I solve. This unfortunately doubles the cost of server instances. Additionally, several environments that I was considering load balancing were running CMS systems not designed for active-active, so without some hacking they would be running in active-passive mode. I started thinking about an alternative solution.

What I came up with was setting up a failover server with a script to monitor and control the failover execution. I believe this solution balances cost, complexity, and availability for small sites that are not critical, e.g. a company’s blog. If your site IS your business then you need to move forward with a properly load balanced, active-active solution.

The first thing you’ll need to do is to set up two additional servers. One is your replica or failover server that you’ll host your site/DB from when the primary fails. The second server is for monitoring and controlling the failover. For my failover server I used MySQL master-slave replication, which is pretty straightforward to set up and not going to be covered here. On the monitoring server my plan was to rely on Amazon’s AWS API tools to disassociate my IP and re-associate it with my failover server. In order to use these tools you need a JRE on your monitoring server. For setting this up I followed the instructions on this site.

Once you’ve set up the replica and monitoring servers, you need a script to monitor and control the failover. I used a shell script that curls the desired test page and greps for something that I know loads at the bottom of the page, such as a Google Analytics ID. If the load fails, the script appends the current timestamp to a file. If the page loads successfully, it empties the file. The reason for this is that I didn’t want to alert or fail over just because of one missed page load or because of missed page loads that were not sequential.

#!/bin/sh
# Holds one timestamp per consecutive failed page load
FILE=akf_blog_err_cnt.txt
# grep -q exits 0 only when the marker string is found in the page
if curl -s http://mysite.com/ | grep -q UA-12345
then : > "$FILE"       # page loaded: empty the file
else date >> "$FILE"   # page failed: record the current timestamp
fi

The next step is to add the logic for counting the number of timestamps in the file.

# One line per failure, so the line count is the number of consecutive failures
ERR_CNT=$(( $(wc -l < "$FILE") ))

Now compare that count to a maximum allowable number of failures. In my case if I don’t get a successful page response in 5 attempts I want to initiate the alert and failover. Since this script is designed to run via cron periodically and not as a persistent service, I’ve added a semaphore file to identify if the site has failed over. This will prevent the script from continuously trying to fail over.

The actual failover control has a few steps. The first is to send out an email alert so that I know something has gone awry. The next is to stop the MySQL slave on the failover server. Since this is going to start taking traffic I don’t want it applying any more logs from the master. I’m using SSH with a key to execute a remote command. The last two steps are to disassociate the IP from the failed server and re-associate it to the failover server. These commands are part of the AWS API tools.

MAX_ERR=5
FAILED_FLAG=akf_blog_fail.txt
# Fail over once MAX_ERR consecutive failures are recorded, unless already failed over
if [ "$ERR_CNT" -ge "$MAX_ERR" ] && [ ! -f "$FAILED_FLAG" ]
then
# Send email about failure
echo "The page failed to load $MAX_ERR times in a row. Shifting to backup server." | /bin/mail -s "Site NOT Loading" michael@akfpartners.com
# Stop slave
ssh -i /key.pem user@ec2-IP-address.amazonaws.com 'mysql -Bse "stop slave"'
# Shift IP to secondary server
ec2-disassociate-address 50.72.23.173
ec2-associate-address -i i-3950994 50.72.23.173
# Mark as failed over
touch "$FAILED_FLAG"
else echo "The test page has fewer than $MAX_ERR errors or has already failed over"
fi

Now, place this script in your cron jobs to run every minute. That’s it for setting up the failover monitor and control script. Because this monitoring server is not in the direct route of traffic I don’t need it setup as HA. A total failure of the system would require both the monitoring server and the primary site server to fail simultaneously. But because I’m pretty paranoid I do have an external monitoring service watching over the site and the monitoring server.


Setting Up CloudFront with an Origin Server

We have a couple of sites hosted on Amazon’s EC2 and I wanted to implement the CDN product from Amazon called CloudFront to see what performance improvements we could achieve. Having set up other CDNs for sites, I figured this would be a pretty straightforward setup, not worthy of a post. Unfortunately, this turned out to not be the case and thus I thought I should write something up for anyone else interested in a similar setup.

As background, a CDN (Content Delivery Network) is used to host mostly static content (files that don’t change often) on what are called “edge servers” instead of just your servers, called origin servers. Typically there are many hundreds or thousands of edge servers that are geographically distributed across multiple backbone providers. This makes them much closer to your customers resulting in faster download of your files to their browsers and thus better page performance while on your site.

CloudFront is designed to use Amazon’s S3 storage as its source for objects (static files like images or videos). I didn’t want to pay for the additional storage, although it is very cheap, but most importantly I did not want another failure point in the architecture. This setup might also be useful for sites not hosted on EC2 but wanting to use CloudFront. Wanting CloudFront to pull objects directly from my server, I went looking for how others had solved this problem. It turns out it is possible to set up a CloudFront “distribution” (a term Amazon uses to refer to an implementation) using an origin server instead of S3, but only through Amazon’s CloudFront API, documentation here. Once the distribution is set up you can administer it from the AWS web interface.

I started playing with the API using CURL but realized after a few attempts that the process was a little more complicated and in order to have something repeatable I’d need to write a little code. Since I had already borrowed the HMAC-SHA1 function, required for API authorization, from here which was in PHP, I continued with PHP. Here is the complete program if you’re interested but below are the major steps.

Major Steps
Here are the major steps in the program.

1) Define XML Payload: Using the “DistributionConfig” method, you set the “CustomOrigin” instead of “S3” and define the following variables:

  • DNSName – this is the domain you are setting up the CDN for.
  • HTTPPort – what ports are your secure and unsecure traffic on?
  • CNAME – what subdomain will you use in DNS to refer to the CDN? I used “cdn1.akfpartners.com” because I planned on changing all my references to static items (images, js, css, etc) to call this subdomain.
  • Enabled – do you want this enabled right away?
  • CallerReference – this is an ID to keep your requests unique.
  • DefaultRootObject – this is the default file that will be requested if no file is explicitly called.

2) Encode Authorization String: The CloudFront API requires that you encode the date formatted as such “Thu, 30 Dec 2010 16:05:21 EST” using HMAC-SHA1 with your secret access key.

3) Set Headers: The most important header is the “Authorization” header that requires the following format “Authorization: AWS public_access_key:encoded_date”.

4) Set CURL Options: There are a few CURL options that are required

  • URL – the URL to be called is “https://cloudfront.amazonaws.com/2010-11-01/distribution/”
  • POST – the API is RESTful and a distribution is created with a POST, so you need to set CURL to POST
  • TIMEOUT – how long before the request times out

5) Execute API Request: I wrapped the request in microtime calls to see how long the transaction took and captured the results of the request.

6) Parse Results: If successful the result will be a 201 response meaning “created”. Otherwise there are a bunch of errors that can be sent back.
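
Steps 2 and 3 can be sketched from the command line as follows. This assumes the openssl tool is installed and that the encoded date is the Base64 form of the HMAC-SHA1 digest; the credentials and the function name are placeholders, not real values.

```shell
# Hypothetical credentials, for illustration only
ACCESS_KEY="AKIAEXAMPLEKEY"
SECRET_KEY="example-secret-key"

# Sign a date string: Base64(HMAC-SHA1(secret key, date))
sign_date() {
  printf '%s' "$1" | openssl dgst -sha1 -hmac "$2" -binary | openssl base64
}

REQ_DATE="Thu, 30 Dec 2010 16:05:21 EST"
SIGNATURE=$(sign_date "$REQ_DATE" "$SECRET_KEY")
AUTH_HEADER="Authorization: AWS $ACCESS_KEY:$SIGNATURE"
echo "$AUTH_HEADER"
```

The same date string that was signed also has to be sent with the request, otherwise the signature will not match on Amazon’s side.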

Once your program is ready just execute it and hopefully you get back a 201. Once you’re successful jump into the AWS console and you should see your distribution being created. It usually takes about 5 minutes until the distribution is completely ready.

DNS & Application Changes
The next step is to setup your DNS to use this CloudFront distribution. In the AWS console you will see the URL that Amazon has assigned to your CDN distribution, something like “d75x0jxgmx7op.cloudfront.net”. Simply take that URL and create a CNAME through your DNS provider to point your subdomain to the Amazon URL. My entry looked like this:

cdn1.akfpartners.com Alias (CNAME) d75x0jxgmx7op.cloudfront.net

Once you have DNS set up and propagated (remember that depending on your DNS provider’s TTL this might take 24 hrs or more), you can change your application’s references to static images. For the sites that I was implementing this for we used MediaWiki, Expression Engine (EE), and WordPress. The wiki just required a change to one PHP file, LocalSettings.php. For EE it took a change to a CSS file and in several templates replacing the {site_url} with a reference to the CNAME. For WordPress there is a plugin that helps with this reference replacement if you don’t want to hack the file by hand.

That’s it! Your site should now be up and running with Amazon’s CloudFront CDN.

Was There a Performance Improvement?

This is really the big question: was this exercise, which was slightly more than the point & click I thought it would be, worth it? Well, the wiki that I set this up for was ridiculously fast already and it had almost no images so the results weren’t that impressive. Our site, akfpartners.com, was already pretty fast as well but it does contain numerous images, JS, and CSS files. Using webpagetest.org, I ran the test several times, averaging the results. The table below shows the results.

Here is a screenshot of the output of WebPageTest.org for a run with CloudFront enabled. Notice that it assigned us an “A” for use of a CDN whereas before we received an “F”.

A 6.1% improvement doesn’t seem like that much until you consider Google’s statement that decreasing web search latency from 400 ms to 100 ms increases the daily number of searches per user by up to 0.6%. Increasing your site’s speed by just a small amount can have significant increases in repeat visitors and time on site.

Good luck with your CloudFront implementation.


DevOps

What do you call a set of processes or systems for coordination between development and operations teams? Give up? Try “DevOps”. The concept is not new (we’ve been living and suggesting ARB and JAD as cornerstones of this coordination for years), but it has recently grown into a discipline of its own. Wikipedia states that DevOps “relates to the emerging understanding of the interdependence of development and operations in meeting a business’ goal to producing timely software products and services.” Tracking down the history of the DevOps Wikipedia page shows that this topic is a recent entry.

There are a lot of other resources on the web that may not have been using this exact term but have certainly been dealing with the development and operations coordination challenge for years. Dev2Ops.org is one such group; earlier this year they posted their definition of DevOps: “an umbrella concept that refers to anything that smoothes out the interaction between development and operations.” They continue in their post, highlighting that the concept of DevOps is a response to the growing awareness of a disconnect between development and operations. While I think that is correct, I think it’s only part of the reason for the recent interest in defining DevOps.

With ideas such as continuous deployment and Amazon’s two-pizza rule for highly autonomous dev/ops teams, there is a blurring of roles between development and operations. Another driver of this movement is cloud computing. Developers can procure, deploy, and support virtual instances much more easily than ever before with the advent of GUI or API based cloud control interfaces. What used to be clearly defined career paths and sets of responsibilities are now being blended to create a new, more efficient and highly sought after technologist. A developer who understands operations support or a system administrator who understands programming is a very valuable utility player.

While DevOps is perhaps a new term for an old problem, it is promising that organizations are taking an interest in the challenges of coordination between development and operations. It is even more important that organizations pay attention to this topic given the blurring of roles.


Designing for Rollback

We’ve several times made reference to the need for organizations to design for rollback to be successful as a SaaS company.  Put simply, given the speed with which we want to make releases, it is critical that we limit our risk in delivering any given release by being able to easily roll back these releases.

Here are some hints on how to develop systems such that they can be easily rolled back in the event of a problem in production.

  • Database changes must only be additive – Columns or tables should only be added, not deleted, until a version of code is released that deprecates the dependency on those columns.  Once these standards are implemented every release should have a portion dedicated to cleaning up data from previous releases that is no longer needed.
  • DDL & DML scripted and tested – DBMS changes for a release must be scripted ahead of time instead of applied by hand.  This should include the script used to roll back any changes.  The two reasons for this are that:
      1. The team needs to test the rollback process in QA or staging in order to validate that they have not missed something that would prevent rolling back, and
      2. The script needs to be tested under some amount of load to ensure it can be executed while the application is utilizing the database.
  • Restricted SQL queries in the application – The development team needs to disambiguate all SQL by removing all SELECT * queries and adding explicit column lists to all INSERT statements.
  • Semantic changes of data – The development team must not change the definition of data within a release.  An example would be a column in a ticket table that is currently being used as a status semaphore indicating three values such as assigned, fixed, or closed.  The new version of the application cannot add a fourth status until code is first released to handle the new status and then code can be released to utilize the new status.
  • Wire On / Wire Off – The application should have a framework added that allows code paths and features to be accessed by some users and not by others, based on an external configuration.  This setting can be in a configuration file or a database table and should allow for both role-based access and random percentage-based access.  This framework allows for beta testing of features with a limited set of users and allows for quick removal of a code path in the event of a major bug in the feature, without rolling the entire code base back.
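
A percentage-based Wire On / Wire Off check might look something like the sketch below. The setting name, the user ID, and the CRC-based bucketing are all assumptions made for illustration; in practice the percentage would be read from the external configuration file or database table.

```shell
# Hypothetical config: percentage of users who see the feature (0 wires it off)
NEW_CHECKOUT_PERCENT=10

# Map a user ID to a stable bucket in 0..99 using the CRC from cksum
bucket_for_user() {
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % 100 ))
}

# Succeeds (exit 0) when the feature is wired on for this user
feature_enabled() {
  [ "$(bucket_for_user "$1")" -lt "$2" ]
}

if feature_enabled "user-42" "$NEW_CHECKOUT_PERCENT"
then echo "new code path"
else echo "old code path"
fi
```

Setting the percentage to 0 removes the code path for everyone without a release, which is the quick removal described above.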

Slaying Firesheep

This is a guest post by Randy Wigginton that started from a conversation about how to better secure cookies. Randy has an incredibly impressive career being one of the earliest employees at Apple and holding Distinguished Engineer and Architect titles at companies such as eBay, Quigo, and Google. Nowadays he is spending most of his time on personal projects that grab his attention such as this issue with unsecured cookies. Randy can be reached directly at this email.

The browser extension Firesheep has deservedly attracted a great deal of attention.  This extension has made it painfully obvious that many major Internet sites have not adequately protected users’ information.  In this article, we present a simple approach that will substantially improve user authentication security, and render Firesheep and other session sidejacking tools mostly useless.

There are three different levels of security used on the web:

  1. No security.  Generally used for pages with static content, open for all.
  2. Some security.  Useful for sites with login and customization, such as Facebook or Amazon.  The information on the pages is not particularly sensitive.  This is the majority case for websites.
  3. Full security.  Financial and other sites where all information must be kept confidential.

For #1, any http server is sufficient. For #3, all pages, images and communications must be encrypted. Case #2 is a hybrid.

For most users, there are asymmetrical aspects to logged-in, customized websites.  While I do not care if anyone sniffs the network to get my status updates or discover what I am shopping for, I do NOT want anyone else claiming to be me or buying items on my behalf!  The traditional IT response has been “The only way to be secure is to put all pages and images under SSL”.  The problem with that approach is that SSL is slower and more costly; sites switching to all SSL will need to increase their server farms substantially.  This can be extremely expensive.

Here is a demo of a very simple site.  This site consists of a starting page, a secure login, then two non-secure pages that require users to be logged in.  Here you will find an extension script for Firesheep (right click and ‘save as’); it captures session cookies from the demo domain.  If you attempt to hijack a session on the akfdemo.com domain, you will be redirected to the sidejacking page.

How are sidejackers recognized?  When a user logs in, TWO cookies are dropped.  In our case, one is called “session”, the other is called “authenticate”.  These two cookies are identical except for a single attribute: “authenticate” is a secure cookie.  We authenticate users on non-secure pages by including a reference to a JavaScript file served over HTTPS.  At the top of pages requiring authentication is this line:

<script type="text/javascript" src="https://verify.akfdemo.com/authenticate.php"></script>

The authenticate.php script is:


<?php
// If this is the original user, they will have one secure and one non-secure cookie
// Both are set to username:password
// A real implementation should encrypt values.  This is for demonstration purposes.
if (!isset($_COOKIE['session']) || $_COOKIE['session'] === '') {
// They have not logged in.
echo "window.location = 'http://".$_SERVER['HTTP_HOST']."/landing.html';";
} else if (isset($_COOKIE['authenticate']) && $_COOKIE['authenticate'] === $_COOKIE['session']) {
// The secure cookie is identical to the non-secure cookie.  Let the user stay.
} else {
// They do not have the secure cookie we require.  This must be a hacker!
echo "window.location = 'http://".$_SERVER['HTTP_HOST']."/sidejacked.html';";
}
?>

If the user has no session cookie, they have not logged in; send them to the starting page.  If the user has a session cookie that matches the secure authentication cookie, they are allowed through.  In the last case, they have a session cookie (which could have been obtained from Firesheep or another tool), but they do not possess the matching authenticate cookie.  This is the sidejacking case; in such a situation, we direct the browser to the sidejacked.html page.

It is best to think of the secure cookie as a checksum, or verification, of all the plain non-secure cookies.  With this technique, we improve user security at a fraction of the cost of using full SSL for all resources.  This technique should be used in conjunction with other security best practices to provide a complete security solution for a website.
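
For reference, the two cookies dropped at login differ only in the Secure attribute. The sketch below just prints what the two response headers might look like; the token value, the function name, and the Domain attribute are assumptions for illustration, and the demo site’s actual login code is not shown in this post.

```shell
# Print the two Set-Cookie headers for a freshly logged-in user.
# The cookie names come from the article; the Domain value is an assumption.
login_cookie_headers() {
  value=$1
  domain=$2
  # Plain cookie: sent on both HTTP and HTTPS requests, so it can be sniffed
  printf 'Set-Cookie: session=%s; Domain=%s; Path=/\n' "$value" "$domain"
  # Secure cookie: browsers send it only over HTTPS, so a sniffer never sees it
  printf 'Set-Cookie: authenticate=%s; Domain=%s; Path=/; Secure\n' "$value" "$domain"
}

login_cookie_headers "example-session-token" ".akfdemo.com"
```

A sidejacker who copies the plain cookie off the wire ends up with “session” but not “authenticate”, which is exactly the mismatch the authenticate.php check looks for.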

Another security approach that consumer based internet companies should consider is using HTTP for the base page and any non-personal information, while collecting and displaying personal user information via HTTPS AJAX calls.  This way the user info is protected, the entire page does not require the overhead of HTTPS, and the browsers don’t alert users of mixed content.

If you haven’t installed Firesheep but are curious how it works, here is what it looks like running.

You can see on the left side that it has captured several cookies from Yahoo, Google, Facebook, Twitter, and our AKF Demo site.  When you click on any one of those captured cookies (except for the AKF Demo) it logs you in to that person’s account. Below is what happens when you try it on the AKF Demo site with Randy’s code.

Notice that it cannot log in to the demo site and is actually identified as a possible sidejacker!


How To Say "No"

What do you do when your largest customer asks for a special feature or tells you that they cannot be upgraded to your next release?

It’s not uncommon to hear from our clients that they have multiple versions of their application running, many of them dedicated to a single “special customer”. While such implementations destroy the economics of a SaaS company, sometimes we must bend a few rules and accept certain risks early in our company’s lives to stay afloat. A startup that is desperate for cash will do just about anything to attract and retain its bigger customers, especially when this customer interaction is led by sales staff unfamiliar with the hidden costs of customizations.

As technologists, maintaining multiple versions of our SaaS software makes us cringe. We all know that this is a recipe for a maintenance nightmare. Multiple versions require bug fixes, patches, and features to be tested and deployed on different versions, as well as dedicated hardware or virtual instances to run the different versions. The cost of this can quickly double a technology organization’s budget when additional QA, developers, and operations staff are accounted for as well as the environments. In fact, when running a SaaS platform with dedicated instances (remember the original ASP days?) one can quickly create the worst of both the hosted and packaged software worlds and sink a company.  So how do you know when to say “No” and how do you do it?

The way to make decisions, whether build vs buy or feature prioritization or customizations, is to gather the appropriate data to establish a well reasoned argument without exhibiting “analysis paralysis”. For customizations this generally includes the upside in terms of future revenue, the downside in terms of development, environment, and maintenance costs, and the risk. No matter what the sales reps are saying, there is no guarantee that with the customization you will win the business or without it you won’t. This is the reason for adding a risk variable (beta) that attempts to quantify the likelihood of success or failure.
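
One simple way to put those inputs together is to weight the upside by beta and subtract the costs. The formula and all of the figures below are assumptions made up for this sketch, not a model taken from the post.

```shell
# All figures are hypothetical; beta is expressed as a whole-number percentage
expected_net_value() {
  revenue=$1
  beta_pct=$2
  dev_cost=$3
  env_cost=$4
  maint_cost=$5
  # Risk-weighted upside minus the three cost categories
  echo $(( revenue * beta_pct / 100 - dev_cost - env_cost - maint_cost ))
}

# $500,000 of future revenue at a 60% chance of winning it,
# against $120,000 development, $30,000 environment, $90,000 maintenance
expected_net_value 500000 60 120000 30000 90000
```

In this made-up case the risk-weighted upside of $300,000 only narrowly beats $240,000 in costs, so a slightly lower beta would tip the decision toward “no”.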

If the decision is to say yes, then spend some time thinking about what level of customization will occur and how you can possibly reuse the customization. According to authors Pine and Gilmore in “The Experience Economy” there are four types of customization: transparent, collaborative, adaptive, and cosmetic. Transparent is when the customer isn’t aware that their version is different from others, such as internationalization. Collaborative is when the product or service works with the customer to achieve a custom solution. Adaptive is allowing the customers to change the product or service themselves, such as choosing personal feeds. Cosmetic is re-skinning the product without modifying the functionality.

If you’ve decided to say no to the customer’s request for customization, don’t just say “no”. Think about how you can use a carrot and/or stick approach to guide the customer to where you would like them. Sticks are things like additional costs. If a customer wants to stay on a previous version and you’ve decided that’s not acceptable to you, consider charging them progressively higher rates until they either leave or make the upgrade. If they are going to leave anyway then the temporary additional revenue will help offset the loss. A carrot approach might be refusing to add new features to a custom solution. Eventually the customer, hopefully, will want the newest features and acquiesce on the customization. Instead of just saying “no”, employ carrots and sticks in an attempt to move your customer to where you both will be happy.


Moving from Packaged Software to SaaS

You can be successful both shipping software and delivering services through software. But you can't be successful at both without distinct architectures.

It’s probably no surprise to our readers that many old packaged software companies are attempting to take their software and hence their business models “online”.  And why not?  The model is attractive and benefits accrue to both the providers of service through software and those who outsource portions of what was once bothersome internally hosted software.  The providers benefit from economies of scale in hosting that generate attractive profits for the provider and savings for the customer, lower maintenance costs resulting from custom customer deployments, predictable revenue streams fostered through closer customer contact, more frequent and smaller releases that reduce risk and faster implementation times that result in faster profit recognition.  Customers benefit from outsourcing non-core IT functions, providers who specialize in delivering specific services, lower capital expenditures and faster deployment times.  SaaS is both a dessert topping and a floor wax!  It’s the cure for cancer and the answer to the riddle of life!

But what many of these companies don’t realize is that the way one architects a product and runs a company focused on service delivery is simply different than the approach of a company focused on delivering software.  Customers expect that you are going to give them higher availability and fewer headaches.  Software alone simply won’t meet this goal; it is imperative that one design SaaS systems holistically which in turn requires skills in both infrastructure and software architecture (or “systems” architecture).   The cost leverage necessary to both increase profit margins and decrease customer cost typically requires multi-tenancy which has its own share of headaches.  Fault isolation and rollback capabilities are a must to minimize customer impact and mitigate rapid deployment risks.

It is not enough to simply bundle up an application in a hosted fashion and label yourself a “SaaS” company.  If you don’t work aggressively to increase availability and decrease your cost of operations, someone with greater experience will come along and simply put you out of business.  After all, your business is now about SERVICE – not SOFTWARE.  This is a fundamental mind-shift that some companies simply can’t overcome or maybe simply don’t recognize.  This isn’t to say that a good engineer or product manager can’t be equally good at developing packaged and SaaS applications, but it does mean that the approach is completely different.

Stop trying to figure out how to leverage your existing assets with minimal work and start thinking about having two different products.  Or, determine which business you want and kill the other one off.  If you decide to keep both products alive, you can share services and code between these platforms, but you should not do so at the expense of optimizing your SaaS solution.  Attempting to satisfy both with a single architecture will likely result in you failing at both.


Evolving Architecture And Software

Are your software and architecture aligned? Ensuring that they are aligned is one of the key elements in managing complex software systems.

When asked by a team what they should prepare for an engagement with us, we usually tell them to not prepare anything. Instead of PowerPoint slides showing the architecture, network, etc we prefer for people to jump to the white board and draw. One of the primary reasons is that we often find people debating how the architecture actually exists. How does your architecture diagrams or institutional knowledge reflect reality of the software?

In the May issue of Computer, there is an article, “Evolving Software Architecture Descriptions of Critical Systems”, by Tom Mens, Jeff Magee, and Bernhard Rumpe, in which the authors state:

An explicit architecture description is important but not sufficient to manage the complexity of developing, maintaining, and evolving a critical software-intensive system.

The authors go on to explain that the architecture description must be accurate and traceably linked to the actual implementation in software, so that changes in the architecture are reflected in the implementation and vice versa.

If your team has spent a bunch of time creating an architecture that will scale, all that effort is wasted when the software implementation doesn’t abide by the architecture. Because of the ever-evolving nature of complex software systems, it is admittedly difficult to keep the architecture description and software artifacts aligned. The authors of the article suggest that evolving architecture descriptions requires co-evolution of different viewpoints, such as the structural and the behavioral. With this I completely agree, but they approach the solution from the standpoint of Architecture Description Languages (ADLs). The problem with this approach is that I don’t know of many, if any, SaaS companies using ADLs. Therefore, in order to accomplish this co-evolution of software artifacts and architecture descriptions, we have to seek a different solution.

To ensure that architectural changes are reflected in the software, we typically suggest that companies rely on architecture principles. We’ve dedicated an entire chapter in The Art of Scalability to this subject, but I’ll try to summarize it here. Architectural principles are a set of ideas that the team has determined will, when used as guidelines during the design and development of the software, yield a scalable, available, and cost-effective system. Principles should help influence the behavior and the culture of the team. We often use the SMART acronym to describe good principles as being Specific, Measurable, Achievable, Realistic, and Testable.

So how about the other direction: how do we ensure the architecture description accurately reflects the software? By using Joint Architecture Design (JAD) and Architecture Review Board (ARB) processes, which we’ve covered in detail before on this blog as well as in the book, we can help ensure that software artifacts that deviate from the established architecture are discussed and noted by the appropriate individuals and teams.

Remember that the co-evolution of the software as well as the architecture design is critical in order to manage the development and maintenance of complex, critical software systems. Implement simple but efficient processes to ensure these remain synchronized.


Scalability Warning Signs

Is your system trying to tell you that you're going to have scalability problems? We like to think that we couldn't have predicted problems at 10x last year's traffic, but there are often warning signs that we can heed if we know what to look for.

Unless you’re one of the incredibly lucky sites where the traffic spikes 100x overnight, scalability problems don’t sneak up on you. They give you warning signs that, if you are able to recognize them and react appropriately, allow you to stay ahead of the issues. However, we’re often so heads-down getting the next release out the door that we don’t take the time to notice the warning signs until they become huge problems staring us in the face. Here are a few of the warnings that we’ve heard teams talk about in the past couple of months that were clearly signs of problems on the horizon.

Not wanting to make changes – If you find yourself denying requests for changes to certain parts of your system, this might be a warning sign that you have scalability issues with that component. A completely scalable system has components that can fail without catastrophic impact to customers. If you’re avoiding changes to a component because of the risk of problems, this is a warning sign that you need to re-architect to eliminate or at least mitigate the risk.

Performance creep – If after each release you need to add hardware to a subsystem, or you accept a performance degradation in a service, you could have a scaling issue approaching quickly. Consistently increasing consumption of CPU or memory resources in a service with each release will lead you into an unsustainable situation. If today you’re comfortably sitting at 40% CPU utilization and you allow a modest 10% degradation in each release, you have fewer than nine releases before you are well above 100%, and in reality you won’t get close to that without significant issues.
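
The arithmetic behind that estimate is easy to run for your own numbers. The sketch below is only an illustration: the function name `releases_until_over` and the 80% comfort threshold in the usage note are assumptions, not something taken from a particular capacity-planning tool. It also makes explicit that "10% per release" can be read two ways, either as ten percentage points added each release or as a relative 10% that compounds.

```python
def releases_until_over(start_pct, per_release, limit_pct=100.0, relative=True):
    """Count releases until utilization first exceeds limit_pct.

    start_pct   -- utilization today, in percent (e.g. 40)
    per_release -- degradation allowed per release, in percent (e.g. 10)
    relative    -- True: each release multiplies utilization by (1 + per_release/100)
                   False: each release adds per_release percentage points
    """
    utilization = float(start_pct)
    releases = 0
    while utilization <= limit_pct:
        releases += 1
        if relative:
            utilization *= 1 + per_release / 100
        else:
            utilization += per_release
    return releases


if __name__ == "__main__":
    # Ten percentage points per release: 40 -> 50 -> ... -> 110
    print(releases_until_over(40, 10, relative=False))
    # Relative 10% per release, compounding: 40 -> 44 -> 48.4 -> ...
    print(releases_until_over(40, 10, relative=True))
    # Same compounding growth, but against an assumed 80% comfort threshold
    print(releases_until_over(40, 10, limit_pct=80, relative=True))
```

Read as ten percentage points per release, utilization reaches 100% on the sixth release and 110% on the seventh. Read as a compounding 10%, it sits at roughly 94% after nine releases and passes 100% on the tenth. Either way, a more realistic ceiling such as 80% arrives noticeably sooner.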

Investigating larger hardware – If you’ve started asking your vendors or VAR about bigger hardware, you’re heading down the path of scalability problems. The cost of additional computational resources does not grow linearly; it is closer to cubic or even exponential. Purchasing more expensive hardware might seem like the economical way out when you compare the cost of the first hardware upgrade versus developer time, but run the calculation out several iterations. When you get to a Sun Fire™ E25K Server with Oracle Database 10g at a $6M price tag, you might feel differently about the decision.
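
Running the calculation out several iterations might look like the sketch below. Every number in it is a made-up assumption chosen only to show the shape of the curves: a $10,000 base server, a cubic price curve for bigger boxes, and a one-time $150,000 developer cost to re-architect for horizontal scaling. Substitute your own vendor quotes and engineering estimates.

```python
def scale_up_cost(capacity_multiple, base_price, exponent=3):
    """Assumed price of ONE bigger box with capacity_multiple times the capacity.

    The cubic exponent is an illustrative assumption, not vendor data.
    """
    return base_price * capacity_multiple ** exponent


def scale_out_cost(capacity_multiple, base_price, rearchitect_cost):
    """Assumed cost of many base servers plus a one-time re-architecture effort."""
    return rearchitect_cost + base_price * capacity_multiple


def compare(upgrades, base_price=10_000, rearchitect_cost=150_000):
    """Return (capacity multiple, scale-up cost, scale-out cost) per doubling."""
    rows = []
    for i in range(1, upgrades + 1):
        multiple = 2 ** i  # each iteration doubles the required capacity
        rows.append((
            multiple,
            scale_up_cost(multiple, base_price),
            scale_out_cost(multiple, base_price, rearchitect_cost),
        ))
    return rows


if __name__ == "__main__":
    for multiple, up, out in compare(3):
        print(f"{multiple}x capacity: scale up ${up:,} vs scale out ${out:,}")
```

Under these assumptions the first doubling favors the bigger box ($80,000 versus $170,000), which is exactly why the upgrade looks economical at first. By the second doubling the comparison has flipped ($640,000 versus $190,000), and the gap keeps widening from there.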

Asking vendors for advanced features – When you start exploring advanced options of your vendor’s software, you’re likely heading down the path of increased complexity, and this is a clear warning sign of scalability problems in your future. Besides potentially locking you into a vendor, which lowers your negotiating power, it puts the fate of your company in someone else’s hands, which wouldn’t let me sleep very well at night. See our post on using vendor features to scale for more information.

Watch out for these or similar warning signs that scalability problems are looming on the horizon. Dealing with the problems today, while you have time to plan properly, might not get you an award for being a firefighter, but you’ll be more likely to deliver a quality product without costly interruption.

