Archive for the ‘Engineering’ Category

Engineering or Craftsmanship

Monday, July 20th, 2009

Having gone through a computer science program at a school that required many engineering courses such as mechanical engineering, fluid dynamics, and electrical engineering as part of the core curriculum, I have a good appreciation of the differences between classic engineering work and computer science. One of the other AKF partners attending this same program along with me and we often debate whether our field should be considered an engineering discipline or not.

Jeff Atwood posted recently about how floored he was reading Tom DeMarco’s article in IEEE Software, where Tom stated that he has come to the conclusion that “software engineering is an idea that whose time has come and gone.” Tom DeMarco is one of the leading contributors on software engineering practices and has written such books as Controlling Software Projects: Management, Measurement, and Estimation, in which it’s first line is the famously quoted “You can’t control what you can’t measure.” Tom has come to the conclusion that:

For the past 40 years, for example, we’ve tortured ourselves over our inability to finish a software project on time and on budget. But as I hinted earlier, this never should have been the supreme goal. The more important goal is transformation, creating software that changes the world or that transforms a company or how it does business….Software development is and always will be somewhat experimental.

Jeff concludes his post with this statement, ”…control is ultimately illusory on software development projects. If you want to move your project forward, the only reliable way to do that is to cultivate a deep sense of software craftsmanship and professionalism around it.”

All this reminded me of a post that Jeffrey Zeldman made about design management in which he states:

The trick to great projects, I have found, is (a.) landing clients with whom you are sympatico, and who understand language, time, and money the same way you do, and (b.) assembling teams you don’t have to manage, because everyone instinctively knows what to do.

There seems to be a theme among these thought leaders that you cannot manage your way into building great software but rather you must hone software much like a brew-master does for a micro-brew or a furniture-maker does for a piece of furniture. I suspect the real driver behind this notion of software craftsmanship is that it if you don’t want to have to actively manage projects and people you need to be highly selective in who joins the team and limit the size of the team. You must have management and process in larger organizations, no matter how professional and disciplined the team. There is likely some ratio of professionalism and size of team where if you fall below this your projects breakdown without additional process and active management. As in the graph below, if all your engineers are apprentices or journeymen and not master craftsmen they would be lower on the craftsmanship axis and you could have fewer of them on your team before you required some increased process or control.

craftsmanship

Continuing with the microbrewery example, you cannot provide the volume and consistency of product for the entire beer drinking population of the US with four brewers in a 1,000 sq ft shop. You need thousands of people with management and process. The same goes for large software projects, you eventually cannot develop and support the application with a small team.

But wait you say, how about large open source projects? Let’s take Wikipedia, perhaps the largest open project that exists. Jimbo Wales, the co-founder of Wikipedia states “…it turns out over 50% of all the edits are done by just .7% of the users – 524 people. And in fact the most active 2%, which is 1400 people, have done 73.4% of all the edits.” Considering there are almost 3 million English articles in Wikipedia, this means each team working on a article is very small, possibly a single person.

Speaking of Wikipedia, one of those 524 people defines software engineer as “a person who applies the principles of software engineering to the design, development, testing, and evaluation of the software and systems that make computers or anything containing software, such as chips, work.” To me this is too formulaic and doesn’t describe accurately the application of style, aesthetics, and pride in one’s work. I for one like the notion that software development is as much craftsmanship as it is engineering and if that acknowledgement requires us to give up the coveted title of engineer so be it. But I can’t let this desire to be seen as a craftsman obscure the concept that the technology organization has to support the business as best as possible. If the business requires 750 engineers then there is no amount of proper selection or craftsmanship that is going to replace control, management, measurement, and process.

Perhaps not much of a prophecy but I predict the continuing divergence among software development organizations and professionals. Some will insist that the only way to make quality software is in small teams that do not require management nor measurement while others will fall squarely in the camp of control, insisting that any sizable project requires too many developers to not also require measurements. The reality is yes to both, microbrews are great but consumers still demand that a Michelob purchased in Wichita taste the same as it does in San Jose. Our technological world will be changed dramatically by a small team of developers but at the same time we will continue to run software on our desktops created by armies of engineers.

What side of the argument do you side with engineer or craftsman?

Tuning versus Scaling

Wednesday, July 1st, 2009

We often talk to clients about the differences between tuning an application or platform and scaling it.  Often the clients intuitively see them as different things, but oddly still refer to them similarly as in “We scaled the database last year by tuning the top SQL queries and getting 20% better performance”.

Tuning is something that you absolutely must do, but we stress that it’s more of an operational endeavor than an architectural endeavor.  In the ideal world, as a typical part of your operation, you would focus some time monthly or quarterly on identifying the slowest portions of your product or the most costly in terms of hardware, engineering time or clock cycles.  These would then be prioritized based on cost to fix and expected return with changes being made to them as engineering allocations allow.  This process is just good housecleaning; throwing out old clothes, and books or getting rid of the motorcycle and lawn mower that no longer work to make room in a garage bay for other “stuff”.

And that’s just what you’re doing with tuning – throwing out old stuff to make room for new stuff.  Just as with the house analogy, you aren’t really making your house “bigger”, you are making the current house capable of holding different and ideally more valuable “stuff”.   If the occupants of your house aren’t growing (i.e. your business isn’t growing quickly), as with a new addition to a family or several new roommates, then this disposition of old less valuable stuff (old code that is seldom executed but costly, often executed code that is very costly) for newer more valuable stuff (new functionality or more efficient code) may be all you ever need to do.

Scaling, however, is an architectural effort that implies significant modifications to your current structure.  In the house analogy, it might mean adding a second floor or expanding the house.  Structurally the house is different and is capable of holding more “stuff” because the total floor space has increased, as compared to the tuning example where you just freed up existing floor space.  Scaling involves architectural effort in redesign of the underlying fundamentals of your system such as adding replicated read databases or splitting databases along data boundaries while tuning is just “spring cleaning” of your platform.

If you are experiencing rapid or hyper growth, you need to both tune your system for the sake of efficiency and performance and scale it for future growth.  Scaling is what allows you to achieve orders of magnitude of improvement (10x, 100x, etc), while tuning should afford linear growth and more efficient operation over time.

Continuous Deployment

Monday, June 22nd, 2009

You probably have heard of continuous integration that is the practice of checking code into the source code repository early and often.  The goal of which is to ease the often very painful process of integrating multiple developer’s code after weeks of independent work. If you have never had the pleasure of experiencing this pain, let me give you another example that we have experienced recently. In the process of writing The Art of Scalability, we have seven editors including an acquisition editor, a development editor, and five technical editors who all provide feedback on each chapter. Our job is to take all of this separate input and merge it back into a single document, which at times can be challenging when editors have different opinions for the direction of certain parts of the chapter. The upside of this process is that it does make the manuscript much better for having gone through the process. Luckily software engineering has developed the process of continuous integration designed to reduce wasted engineering effort. In order to make this process the most effective the automation of builds and smoke tests are highly recommended. For more information on continuous integration there are a lot of resources such as books and articles.

The topic of this post is taking continuous integration to an extreme and performing continuous deployment. And it is exactly what it sounds like, all code that is written for an application is immediately deployed into production. If you haven’t heard of this before you’re first thought is probably that this is the ultimate in Cowboy Coding but it is in use by some household technology names like Flickr and IMVU. If you don’t believe this check out code.flickr.com and look at the bottom of the page, last time I checked it said:

Flickr was last deployed 20 hours ago, including 1 change by 1 person.

In the last week there were 34 deploys of 385 changes by 17 people.

Eric Ries, co-founder and former CTO of IMVU, is a huge proponent of continuous deployment as a method of improving software quality due to the  discipline, automation, and rigorous standards that are required in order to accomplish continuous deployment. Other folks at IMVU also seem to be fans of the continuous deployment methodology as well from the post by Timothy Fitz. Eric suggest a 5 step approach for moving to a continuous deployment environment.

The topic of this post is taking continuous integration to an extreme and performing continuous deployment. And it is exactly what it sounds like, all code that is written for an application is immediately deployed into production. If you haven’t heard of this before you’re first thought is probably that this is the ultimate in ‘Cowboy Coding’ but it is in use by some household technology names like Flickr and IMVU. If you don’t believe this check out code.flickr.com and look at the bottom of the page, last time I checked it said:
Flickr was last deployed 20 hours ago, including 1 change by 1 person.
In the last week there were 34 deploys of 385 changes by 17 people.
Eric Ries, CTO of IMVU, is a huge proponent of continuous deployment as a method of improving software quality due to the  discipline, automation, and rigorous standards that are required in order to accomplish continuous deployment. Eric suggest a 5 step approach for moving to a continous deployment environment.
  1. Continuous Integration – Obviously before moving beyond integration into full deployment, this is a prerequisite that must be in place.
  2. Source Code Commit Checks – This feature which is available in almost all modern source code control systems,  allows the process of checking in code to halt if one of the tests fail.
  3. Simple Deployment Script – Deployment must be automated and have the ability to rollback, which we wholeheartedly agree with here and here.
  4. Real-time altering – Bugs will slip through so you must be monitoring for problems and have the processes in place to react quickly
  5. Root Cause Analysis – Eric recommends the Five Why’s approach to find root cause, whatever the approach, finding and fixing the root cause of problems is critical to stop repeating them.

Admittedly, this concept of developers pushing code straight to production scares me quite a bit, since I’ve seen the types of horrific bugs that can make their way into pre-production environments. However, I think Eric and the other continuous deployment proponents are onto something that perhaps the reason so many bugs are found by weeks of testing is a self-fulfilling prophecy. If engineers know their code is moving straight into production upon check in they might be a lot more vigilant about their code, I know I would be. How about you, what do you think about this development model?

Monitoring Strategies

Monday, June 15th, 2009

What questions do each of your system monitors answer? You probably think they answer questions such as “Is there a problem?” and if so “Where is the problem?” Most likely this is not the case and instead of telling you “Is there a problem?” it really only tells you “Where” or “What” the problem might be. Before we continue this, first a quick detour to discuss metrics, which while different than monitoring are very similar in many ways.

Eric Ries, co-founder and CTO of IMVU, posted an article about the difference between vanity metrics and actionable metrics.  The entire article and accompaning video are worth a read and listen but the take away is that most people are using and looking for metrics that are great soundbites but do not offer any definable actions.  One example is the total number of hits to a website.  Eric ask the questions “Now what? Do you really know what actions you took in the past that drove those visitors to you, and do you really know which actions to take next?” This makes total sense to me as we  often see teams misusing monitoring in an attempt to determine what actions to take with their systems.

Back to our discussion of what question your monitoring is attempting to answer. We think there are five evolutionary questions that monitoring should answer:

  1. Is there a problem?
  2. Where is the problem?
  3. What is the problem?
  4. Why is there a problem?
  5. Will there be a problem?

Where most people fail is using a monitoring tool that is designed to answer “Where” or “What” and try to use it to answer “Is”. For example, if you are monitoring all of your servers vitals such as CPU, memory, and I/O what is the appropriate action for your team to take when the CPU utilization goes to 100%? The reason that might be a tough question is that you are missing the vital piece of information “Is this affecting my customers?”.  The “Is there a problem” is intended to be a proxy for customer impact in order to help determine the degree and speed of escalation of the issue.

If you have monitoring services in place now it is worthwhile to determine what question each one answers. If you are missing a monitor for a particular question, the time to remedy it is before you need that question answered.

Scaling & Availability Anti-patterns

Tuesday, May 12th, 2009

Most of you are familiar with patterns in software development. If you are not a great reference is Patterns of Enterprise Application Architecture by Martin Fowler or Code Complete by Steve McConnell. The concept of a pattern is a reusable design that solves a particular problem. No sense in having every software engineer reinvent the wheel. There is also the concept of an anti-pattern, which as the name implies, is a design or behavior that appears to be useful but results in less than optimal results and thus something that you do not want engineers to follow. We’ve decided to put a spin on the anti-pattern concept by coming up with a list of anti-patterns for scaling and availability. These are practices or actions that will lead to pitfalls for your application or ultimately your business.

  1. SPOF – Lots of people still insist on deploying single devices.  The rationale is that either it cost too much to deploy in pairs or that the software running on it is standalone.  The reality is that hardware will always fail it’s just a matter of time and when originally deployed that software might have been standalone but in the releases since then other features might depend on it.  Having a single point of failure is asking for that failure to impact your customer experience or worse bring down the entire site.
  2. Synchronous calls – Synchronous calls are unavoidable but engineers should be aware of the potential problems that this can cause.  Daisy chaining applications together in a serial fashion decreases the availability due to the multiplicative affect of failure.  If two independent devices both have 99.9% expected availability, connecting them through synchronous calls causes the overall system to have 99.9% x 99.9% = 99.8% expected availability.
  3. No ability to rollback – Yes, building and testing for the ability to rollback every release can be costly but eventually you will have a release that causes significant problems for your customers. You’ve probably heard the mantra “Failing to plan is planning to fail.”  Take it one step further and actually plan to fail and then plan a way out of that failure back to a known good state.
  4. Not logging - We’ve written a lot about the importance of logging and reviewing these logs.  If you are not logging and not periodically going through those logs you don’t know how your application is behaving.
  5. Testing quality into the product – Testing is important but quality is designed into a product. Testing validates that the functionality works as predicted and that you didn’t break other things while adding the new feature. Expecting to find performance flaws, scalability issues, or terrible user experiences in testing and have them resolved is a total waste of time and resources as well as likely to fail at least 50% of the time.
  6. Monolithic databases – If you get big enough you will need the ability to split your database on at least one axis as described in the AKF Scale Cube. Planning on Moore’s law to save you is planning to fail.
  7. Monolithic application – Ditto for your application. You will eventually need to scale one or more of your application tiers in order to continue growing. 
  8. Scaling through 3rd parties – If you are relying on a vendor for your ability to scale such as with a database cluster you are asking for problems. Use clustering and other vendor features for availability, plan on scaling by dividing your users onto separate devices, sharding.
  9. No culture of excellence – Restoring the site is only half the solution. If your site goes down and you don’t determine the root cause of the failure (after you get the site back up not while the site is down) then you are destined to repeat that failure. Establish a culture of not allowing the same mistake or failure to happen twice and introduce processes such as root cause analysis and postmortems for this. 

Anti-patterns can be incorporated into design and architecture reviews to compliment a set of architectural principles.  Together, they form the “must not’s” and “must do’s” against which all designs and architectures can be evaluated.