Morning Operations Meeting

August 29th, 2010 by Wabb

You get paid to deliver a service.  You want to deliver that service to the level of your customers’ expectations, or at least to some internally defined level.  So how often do you meet to discuss your service delivery quality?

In our experience, most companies meet only when there is a problem.  Day in and day out, many Software as a Service companies operate their services and simply never take the time to step back, review the last day’s issues along with all open issues that are not yet resolved, and diagnose service delivery problems.  How could this be, you ask?  Well, we honestly don’t know!

As we’ve written before, if you are a SaaS company your business is predicated first and foremost on SERVICE DELIVERY!  Developing software is important – but what makes you money is the delivery of a service.  Get this straight folks, because it is a major mind shift.

In our view, it is absolutely critical to start the business day with a review of the past day’s service delivery.  We call this the “Morning Operations Meeting” or “Morning Operations Review”.  Every day we ask our clients to review major issues from the previous day, overall service quality (response times, availability, major interruptions or bugs live on the site, etc), and all major open issues identified in past days.  Ideally the notion of an incident (a thing that happens in production and causes customer complaint) and the notion of a problem (a thing that causes an incident) are separated in this meeting.  Both should be discussed – but they are really two separate things.

Ideally this meeting will have representatives from your customer support organization, technical operations and infrastructure teams and software development teams.  Inputs to the meeting are a representation of customer complaints, complaints regarding service within the company, manual identification of issues, automated identification of issues (such as through a monitoring system to include Service Level metrics), predictive identification of future problems (such as might be the case from a capacity management team) and all appropriate service level information.

Open incidents and problems from the issue tracking system are discussed and updated.  Owners are assigned to new incidents and problems (if they haven’t been already), and any issues that were missed during the previous day’s operations are added.
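
To make the incident/problem split concrete, here is a minimal sketch of how tracker records might be grouped into a meeting agenda. It is only an illustration under our own assumptions: the `Incident` and `Problem` classes, their field names, the `build_agenda` function, and the sample records are all hypothetical and do not correspond to any particular issue tracking system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Problem:
    """An underlying cause; one problem may produce many incidents."""
    problem_id: str
    summary: str
    owner: Optional[str] = None
    resolved: bool = False


@dataclass
class Incident:
    """Something that happened in production and caused customer complaint."""
    incident_id: str
    summary: str
    occurred_on: date
    owner: Optional[str] = None
    closed: bool = False
    problem_id: Optional[str] = None  # link to the suspected cause, if known


def build_agenda(incidents, problems, today):
    """Group records into the lists the morning meeting walks through."""
    yesterday = today - timedelta(days=1)
    open_incidents = [i for i in incidents if not i.closed]
    open_problems = [p for p in problems if not p.resolved]
    return {
        # Everything that happened during the previous day's operations.
        "yesterday": [i for i in incidents if i.occurred_on == yesterday],
        # Incidents from earlier days that are still unresolved.
        "older_open_incidents": [i for i in open_incidents
                                 if i.occurred_on < yesterday],
        "open_problems": open_problems,
        # Open records that still need an owner assigned in the meeting.
        "needs_owner": [r for r in open_incidents + open_problems
                        if r.owner is None],
    }


# Hypothetical sample data for a meeting held on August 30th, 2010.
today = date(2010, 8, 30)
incidents = [
    Incident("INC-1", "Checkout timeouts", date(2010, 8, 29), problem_id="PRB-1"),
    Incident("INC-2", "Slow search", date(2010, 8, 25), owner="ops"),
    Incident("INC-3", "Login errors", date(2010, 8, 20), owner="dev", closed=True),
]
problems = [Problem("PRB-1", "Connection pool exhaustion")]
agenda = build_agenda(incidents, problems, today)
```

Keeping incidents and problems as separate record types, linked by an optional `problem_id`, is one way to let several incidents roll up to a single cause so that recurring problems stand out.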

Outputs from the meeting are updated service level reports, scheduling of post mortems for large incidents, updated problem reports and data for monthly or quarterly look backs or reviews (more on this later).

If done well, the morning meeting helps inform architectural changes that are necessary in the scalability summits or in other product development and architecture meetings.  Recurring problems should be easily identified within the issue management system as a result of heightened oversight and analysis of the system.

Delayed Replication

August 22nd, 2010 by Fish

The MySQL Performance Blog recently had a post that did a great job of explaining a problem that we often try to warn our clients about. The crux of the problem is that if you are relying only on a replica for disaster recovery, then you are going to lose data when something bad happens.

To minimize the impact of eventual consistency in our BASE applications, we want our replicas to be very near real time. Unfortunately, this can have unintended consequences in a disaster. Whether you’re relying on MySQL’s statement-based replication or on Oracle’s redo apply, which replicates at the block level, both are vulnerable to data corruption.

Any data corruption on the primary will immediately be replicated to the standby. If a DBA drops a table, by the time he stops cursing the drop has been replicated to the standby. A storage subsystem failure or an HA failover can also corrupt data files, and that corruption can propagate to the standby.

The solution to this problem is to create a standby or replica that has a delay on applying the log files. We recommend a delay of 6 to 12 hours, which gives you plenty of time to catch a logical corruption and stop the replication. You don’t need a large production-sized server for this, since you’ll never run production traffic on this database; you’ll simply recover data from it. Do this simple thing and it might save your data.
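
The idea of delayed apply can be sketched in a few lines. This is a toy simulation, not how MySQL or Oracle implement it: the `LogEvent` and `DelayedReplica` classes, the timestamps, and the sample statements are all hypothetical and exist only to show that log events are shipped immediately but applied only once they are older than the configured delay.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LogEvent:
    committed_at: float  # seconds since an arbitrary start, on the primary
    statement: str


class DelayedReplica:
    """Toy replica that applies an event only once it is `delay_s` old."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.pending = deque()  # received from the primary, not yet applied
        self.applied = []
        self.stopped = False

    def receive(self, event):
        # Shipping the log is immediate; only the apply step is delayed.
        self.pending.append(event)

    def apply_due(self, now):
        """Apply every pending event older than the delay, unless stopped."""
        while (not self.stopped and self.pending
               and now - self.pending[0].committed_at >= self.delay_s):
            self.applied.append(self.pending.popleft().statement)

    def stop(self):
        # What the DBA does after spotting a logical corruption.
        self.stopped = True


HOUR = 3600
replica = DelayedReplica(delay_s=6 * HOUR)
replica.receive(LogEvent(0 * HOUR, "INSERT INTO orders ..."))
replica.receive(LogEvent(7 * HOUR, "DROP TABLE orders"))

replica.apply_due(now=8 * HOUR)   # the insert is 8h old and is applied; the drop is held
replica.stop()                    # the mistake is noticed inside the delay window
replica.apply_due(now=20 * HOUR)  # nothing further is applied after the stop
```

In this run the accidental drop never reaches the delayed copy, because replication was stopped while the statement was still waiting out its six hours.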

Book Number 2: Scalability Rules

August 15th, 2010 by Wabb

Thanks to everyone who has supported The Art of Scalability with purchases, reviews, tweets, posts, and more. Because of the interest shown in this topic of scalability we’ve put our heads together with our publisher, Addison-Wesley, and have come up with an idea that we think is pretty exciting. Fish and I have just signed a contract for our second book with the working title of Scalability Rules. We are looking forward to working with the talented team of editors and reviewers that our publisher has put together.  The book will be a short, technically focused book offering the most common principles we use with our clients to help them scale their hyper growth technology platforms.

We expect the book to come in at about 200 pages and sell for considerably less than The Art of Scalability.  Our intent is to make it useful both as a primer on scalability and as a reference manual for technical organizations.

We’ll keep you posted on our progress and we’d love to hear your ideas on topics that should be covered. Scalability Rules should be available mid to late 2011.  Now – back to writing our new book!

Attitude #Fail

August 9th, 2010 by Fish

Most of the time the individuals that we interact with during engagements are intelligent, intellectually curious, and open to suggestions for improving their service offering’s scalability and availability.  On rare occasions, however, we run into an individual who either argues just for the sake of arguing or thinks he/she can’t learn anything from anyone. While these people might be brilliant they are likely going to ultimately fail and in doing so negatively impact the company. A rule to live by is that an architecture designed by one person is much poorer than one designed by a diverse group of individuals with different skill sets. This is one of the driving principles behind the JAD and ARB.

This is not to say that all conflict is bad. As Marty mentioned in his post on Team Conflict, cognitive conflict is desired. It is the affective conflict that is not good for the team. An easy way to think about the difference between cognitive (good) and affective (bad) is that arguing over “what” to do is good, while arguing over “who” should do it (territorial) or “how” it should be done (micromanagement) can be harmful if not carefully kept in check. Once someone has been assigned as the “R” (see Chapter 2 in The Art of Scalability), let them own the project.

We’ve posted a lot on our blog about hiring A players, tending your team like a garden, and building high performance teams. Keeping someone who displays an attitude of arrogance and superiority in a leadership position is not just annoying; it is harmful to the team. Junior engineers will not push back on this person’s decisions for fear of humiliation, younger leaders are being taught to act this way in order to succeed, and the experiential chasm between this person and other executives is only widening.  No matter how brilliant this person is, they are causing more problems than they are solving. Take steps today to remove them from the organization as quickly as possible.

Please Be Quiet

August 2nd, 2010 by Fish

I’ve noticed lately that more companies are putting up signs in hallways and cube farms requesting that people avoid having conversations in these areas. While having a nice quiet work environment makes sense to me as a developer, doesn’t this completely defeat the purpose of having people work beside each other? The ad hoc/hallway/water cooler/coffee machine conversations, or the ones overheard when cube mates are chatting about a new feature, are one of the primary benefits of having people work in small open environments.

I haven’t done any sort of scientific study, but it seems that these sorts of “please be quiet” signs are more prevalent at larger companies. These are the same companies that are trying to mimic the small startup with agile development processes or open work spaces in order to compete in a fast moving SaaS marketplace. Imitating the actions without understanding their purpose, or allowing old school corporate policies to overrule them, are surefire ways to tank the initiative.

A parallel to the “please be quiet” sign is allowing corporate IT to dictate the architecture of the SaaS offering based on a corporate standard that works for the ERP system. Running Oracle ERP on a 16-way system might be the vendor-recommended, preferred approach, but for scaling a SaaS offering this is a quick way to run up the costs and ensure lower availability. We often use the analogy of goldfish and thoroughbreds for comparing small, cheap 1U servers with large, expensive multi-processor boxes with lots of memory. The goldfish (small, cheap servers) are inexpensive to purchase and replaceable, while the thoroughbreds (large servers) are expensive to purchase and maintain and cause big impacts when they go down.
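
A rough bit of arithmetic shows why a failed thoroughbred hurts more than a failed goldfish. The sketch below assumes load is spread evenly across identical servers; the fleet sizes of 2 and 20 and the `capacity_lost_pct` helper are hypothetical numbers and names chosen for illustration, not figures from any client.

```python
def capacity_lost_pct(server_count, failed=1):
    """Percent of total capacity lost when `failed` of `server_count`
    identical, evenly loaded servers go down."""
    return 100.0 * failed / server_count


# Hypothetical fleets sized to carry the same total load.
thoroughbreds = capacity_lost_pct(server_count=2)   # two large boxes
goldfish = capacity_lost_pct(server_count=20)       # twenty small 1U servers

print(f"One large box down:    {thoroughbreds:.0f}% of capacity lost")
print(f"One small server down: {goldfish:.0f}% of capacity lost")
```

Under these assumptions, losing one of two large boxes takes out half of the capacity, while losing one of twenty small servers takes out a twentieth.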

The takeaway from all of this is that if you’re part of a corporate initiative to run an internal startup or deliver Software as a Service from inside a larger organization, don’t allow corporate policies to prevent your success. The differences in approaches, architectures, organizations, and offices have a purpose and should not be discounted as non-critical to the success of your initiative.