AKF Partners

Abbott, Keeven & Fisher Partners: Partners In Hyper Growth

Category » Operations

VP of Operations

One of the most common questions we get from individuals is “what is the path to becoming a CTO?” We posted about this before and focused on the skill sets required as opposed to the path to get there. We highlighted 1) good knowledge of business in general, 2) great technical experience, 3) great leadership, 4) great management, 5) great communication, and 6) a willingness to let go. This time we’re going to look at one of the jobs that is often a stepping stone to the CTO job.

The VP of Operations is the person who leads the Technology Operations or Production Operations team. This team has responsibility for running the hardware and software systems of the company. For SaaS or Web 2.0 companies these are the revenue generating systems. For corporate IT these are the ERP, CRM, HRM, etc. This team is often composed of project managers, operations managers, and technical leads. As the head of the Operations team the VP of Operations has responsibility for monitoring, escalating, and managing issues, and for reporting on availability, capacity, and utilization. Incident and problem management as well as root cause analysis (postmortem) are some of the most important jobs that the team accomplishes. In order to perform this role well the VP of Operations must have good process skills, a strong leadership presence, the ability to remain calm under fire, and good overall knowledge of the system.
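As a rough illustration of the availability reporting mentioned above, here is a minimal Python sketch that turns a list of outage durations into an availability percentage for a reporting period. The function name and the sample numbers are our own assumptions for illustration; they do not come from any particular monitoring tool.

```python
from datetime import timedelta


def availability_percent(period: timedelta, outages: list[timedelta]) -> float:
    """Return the percentage of the period during which the service was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta(0))
    # Cap downtime at the period length so the result never goes negative.
    downtime = min(downtime, period)
    return 100.0 * (1 - downtime / period)


if __name__ == "__main__":
    # Hypothetical month: 30 days with two outages of 30 and 13 minutes.
    month = timedelta(days=30)
    outages = [timedelta(minutes=30), timedelta(minutes=13)]
    print(f"{availability_percent(month, outages):.3f}%")
```

Reporting the same figure every period, computed the same way, matters more than the exact formula; teams that weight downtime by customer impact would need a different calculation.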

The VP of Operations is often also responsible for the Infrastructure team. This team is usually composed of system administrators, database administrators, and network engineers. This team procures, deploys, maintains, and retires systems. As the head of this team the VP of Operations is responsible for budgeting and for balancing time between longer term projects and daily operations on the systems. This team understands the system holistically and is often the most useful when performing scalability summits. In order to perform this role well, the VP of Operations must have a good understanding of each of the technical roles that this team is responsible for, including the databases, operating systems, and the network. This doesn’t mean that a person must be able to do each of these jobs in order to succeed in this role, but they do need a good, solid understanding in order to converse, brainstorm, debate, and make decisions in each of these technical realms.

If you compare the list of skills that we mentioned at the top of this post with those mentioned as necessary to succeed as the VP of Operations, you’ll see they overlap a good deal. Great technical experience, great leadership, and great management skills will serve you well as the head of operations and will also go a long way to developing most of the skills you will need as a CTO.

We’re approaching the end of the year, a time that many people and organizations use to reflect on what they have accomplished and what they want to accomplish next year. A good idea as part of your personal growth is to use the list above and score yourself as honestly as possible in terms of skills. If you’re missing some of them, make sure you have some goals in place that help you acquire a few more of these each year. Do this and not only will you succeed in one of the important jobs that lead to the CTO job, but when you do arrive at the CTO position you will be one of the successful ones.



Splunk

This time we have a guest post from a long time friend and colleague, Chris Lalonde. During a conversation a couple of weeks ago Chris told us what he was doing with a product called Splunk, to provide monitoring, alerting, and visibility into log files. Given the importance that we place on logging and monitoring, what Chris was doing sounded interesting. We asked Chris if he would be willing to share a little about his implementation. Chris has over 15 years of experience in information technology and has provided technology solutions for Fortune 1000 companies such as eBay, TD Bank as well as government agencies like DoD, RCMP and USSS. He holds a bachelor of mechanical engineering with a concentration in robotics from Carleton University. Chris also has three patents for authentication systems and has several others pending. He was the recipient of the Director’s Award from the United States Secret Service. And now from Chris:

Having worked in technology for 15+ years I understand how challenging it can be to get visibility across your entire platform. Apparently the folks who started Splunk understood this as well. What is Splunk? Well, it’s a tool you can use to collect data from virtually any server in virtually any format and carry out alerting, ad-hoc searches and large scale reporting.

You install the Splunk client on each of the servers you want to monitor. Currently Splunk supports Windows, Linux, OSX, Solaris, BSD and AIX, so the vast majority of platforms are covered. The client typically installs in a few minutes, and if you spend a little time pre-configuring things it’s easy to automate the install. Once it’s installed you have a few options: you can either run a full server on each machine or turn the machines into lightweight forwarders and consolidate your logs. I’d recommend the latter since it gives you the log aggregation you more than likely want, and the lightweight clients use fewer resources than the full blown server. Additionally, you can point traditional data sources at Splunk. Have lots of syslogs running around? Why not point them at your central Splunk server and get them indexed and searchable as well?

Once you’ve got Splunk installed on your servers the uses are pretty much endless. The simplest and most obvious is system/log monitoring and alerting. You can just point Splunk at a log directory and it will automatically analyze and index all the files in that directory. Splunk will also attempt to parse the elements of the files so you can do searches on virtually every element of your file. Need to know how many times that URL was called with variable “foo” in it? Just go into the search app and type in something like “http://” “foo” and all the results are shown in two ways: 1) in a graphical timeline showing the count per unit of time (hours, days, minutes) and 2) in detail below the graph, where you can expand out and literally get the section of the specific log file with those results.
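To make the idea concrete, here is a minimal Python sketch of what a search like that computes: it filters log lines by a set of terms and counts the matches per hour. This is not Splunk’s API or its search language. The log format (an ISO-style timestamp at the start of each line), the sample lines, and the function name are all assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def count_matches_per_hour(lines, terms):
    """Count lines containing every term, bucketed by the hour of their timestamp.

    Assumes each line starts with a timestamp such as '2009-06-01T14:03:22'.
    Lines without a parseable timestamp are skipped.
    """
    buckets = Counter()
    for line in lines:
        if not all(term in line for term in terms):
            continue
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue
        buckets[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical access log lines.
    sample = [
        "2009-06-01T14:03:22 GET http://example.com/search?foo=1 200",
        "2009-06-01T14:45:10 GET http://example.com/search?bar=2 200",
        "2009-06-01T15:01:05 GET http://example.com/item?foo=9 500",
    ]
    counts = count_matches_per_hour(sample, ["http://", "foo"])
    for hour, count in sorted(counts.items()):
        print(hour, count)
```

The per-hour counts correspond to the timeline view, and the matching lines themselves correspond to the detail view below it.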

That’s the simple version; let’s try something more interesting. Say you automatically roll code to your site and you’d like a way to verify that the code rolling tool did its job. Just point Splunk at the directory your code rolls to, configure Splunk to monitor changes to that directory, and bingo: once the code is in place Splunk will see the file changes. Now you can either manually search for those files or have Splunk send you an alert with the list of files that have changed in those directories.
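The underlying check can be sketched in a few lines of Python: take a snapshot of the directory before and after the roll and compare the two. This shows the general technique only and is not how Splunk implements change monitoring; the function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Map each file's relative path to a SHA-256 digest of its contents."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before, after):
    """Return the files that were added, removed, or modified between snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            name for name in set(before) & set(after) if before[name] != after[name]
        ),
    }
```

Comparing the resulting lists against the release manifest tells you whether the roll touched exactly the files it was supposed to.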

Not enough? Splunk has a PCI compliance suite that covers all twelve PCI DSS requirements and all 228 sub-requirements, including live controls monitoring, process workflow, checklists and reporting. How about indexing firewall logs? Yes. How about sending those IDS logs to Splunk? Sure, it’ll swallow those as well. Would you like to get an alert when, say, you suddenly get a large increase in logs from a firewall or an IDS? Sure, no problem.
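An alert on a sudden increase in log volume boils down to comparing the current count against a recent baseline. The Python sketch below shows one simple way to do that. The threshold factor, the minimum number of samples, and the function name are assumptions for illustration, and this is not Splunk’s alerting configuration.

```python
from statistics import mean


def is_volume_spike(history, current, factor=3.0, min_samples=5):
    """Return True when the current count exceeds `factor` times the recent average.

    `history` holds event counts for earlier intervals (for example, per minute).
    With fewer than `min_samples` intervals there is no baseline, so no alert fires.
    """
    if len(history) < min_samples:
        return False
    baseline = mean(history)
    return current > factor * baseline
```

A fixed multiple of the average is crude; sources with strong daily cycles usually need a baseline taken from the same time of day.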

OK, that’s all great for the system administrators, security folks, and build and release, but how about the network folks? Sure, Splunk has you covered as well. Have some Cisco gear? Splunk has a Cisco suite that covers that. How about something for the DBAs? Yes, MySQL and Oracle are covered as well. Again, all this data is indexed and is now not only searchable, but you can also create custom dashboards and alerts off of all that data.

You are probably saying “Yes, I’m sure you can do that, but it’s probably a nightmare to configure.” In fact nothing could be further from the truth. You can use the UI to configure the report in less than 5 minutes. If you don’t like to use a GUI, the same thing can be done via the CLI in about the same amount of time.

OK, now for the bad news. Splunk is free up to the first 500MB of indexes per day; after that it starts costing money. The free version is also missing some of the reporting extras. 500MB sounds like a lot until you actually start indexing things. By default Splunk indexes many, many things that you might not want, so you need to be very clear about what you do and do not want it to index. There are several ways of dealing with the indexes, such as aging data out more quickly or limiting the size of the indexes. Currently I’m restricting index sizes so that I’m keeping about 30 days’ worth of data, and I haven’t had any issues.

Having said all of this, I haven’t yet explored the limits of Splunk. In my current configuration I’m only indexing data from about 50 servers and I’ve not run into any issues. I should note that Splunk is designed to scale horizontally, and I know of people indexing data from thousands of boxes, so I’m not expecting there to be a scale issue. In my experience the data is indexed in less than 5 minutes, so I am currently using Splunk as part of our monitoring and alerting system, and it has helped identify issues that would otherwise have been hidden in the logs.

Splunk saved me from having to build a centralized logging infrastructure plus all the tools one needs to monitor and manage that infrastructure. It’s allowed me to instantly search across all our logs and systems to identify system issues, db issues, and code issues. Something that would have taken me weeks took me 1 day to install and get working on 30 boxes, and within 1 week I’d found critical events that saved my company $$.

NOTE: Thanks to Steve from Splunk for identifying that it is 500MB of indexes per day.



A Lightweight Post Mortem Process

We discussed the need to perform post mortems or AARs in our post entitled “After Action Reviews”.   Our new book includes a description of how these meetings should be run, but given the amount of time we spend teaching companies our light weight post mortem process we thought it useful to describe it in a blog post as well.

First, please understand that we think onerous processes result in the death of an organization.  We’ve often said that the point at which a company begins to hire “process engineers” is the point at which processes have gotten a bit too far.  Startups need light, adaptable processes that can grow as their needs grow over time.  The post mortem process described here is one such process.

Ideally everyone will be gathered in a single room and the room will have whiteboards that can be used during the process.  Attendees should include everyone involved with the issue or crisis and who can contribute either to a complete and accurate timeline or contribute to issues identified within the timeline.  Managers who might be assigned action items, be they process, organizational or technical should also attend the post mortem.  A single person should be identified as the Post Mortem process facilitator.

Our post mortem process consists of three phases:

  1. Phase 1 focuses on generating a timeline of the events leading up to the issue or crisis.  Nothing is discussed other than the timeline during this first phase.  The phase is complete once everyone in the room agrees that there are no more items to be added to the timeline.  We typically find that even after we’ve completed the timeline phase, people will continue to remember or identify timeline worthy events in the next phase of the post mortem.
  2. Phase 2 of the post mortem consists of issue identification.  The process facilitator walks through the timeline and works with the team to identify issues.  Was it OK that the first monitor identified customer failures at 8 AM but that no one responded until noon?  Why didn’t the auto-failover of the database occur as expected?  Why did we believe that dropping the user_authorization table would allow the application to start running again?  Each and every issue is identified from the timeline, but no corrections or actions are allowed to be made until the team is done identifying issues.  Invariably, team members will start to suggest actions but it is the responsibility of the process facilitator to focus the team on issue identification during Phase 2.
  3. Phase 3 of the post mortem focuses on actions.  Each issue should have at least one action associated with it.  The process facilitator walks down the list of issues and works with the team to identify an action, an owner, an expected result and a time by which it should be completed.  Using the SMART principles, each action should be specific, measurable, attainable, realistic and timely.  A single owner should be identified, even though the action may take a group or team to accomplish.
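
The output of the three phases fits in a very small data structure, which helps keep the process light. The Python sketch below is one hypothetical way to record it; the class and field names are our own assumptions and are not taken from any tool. The one rule it enforces is the Phase 3 requirement that no issue is left without an action.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class Action:
    description: str
    owner: str             # a single named owner, even if a team does the work
    expected_result: str
    due: date


@dataclass
class Issue:
    description: str
    actions: list[Action] = field(default_factory=list)


@dataclass
class PostMortem:
    timeline: list[tuple[datetime, str]] = field(default_factory=list)  # Phase 1
    issues: list[Issue] = field(default_factory=list)                   # Phase 2

    def issues_without_actions(self) -> list[Issue]:
        """Phase 3 check: every issue needs at least one action."""
        return [issue for issue in self.issues if not issue.actions]
```

The facilitator can close the meeting once `issues_without_actions()` returns an empty list.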


Scaling and Monitoring the Clouds

The usefulness of Amazon’s EC2 cloud took a step forward recently with the introduction of Amazon’s own real-time monitoring, auto scaling, and load balancing product offerings. Most of these were already services offered by third parties built on top of Amazon’s and other providers’ clouds, such as Mosso, or through custom implementations of HAProxy, etc. However, the integration should allow for easier administration and better support.

There continue to be reservations by many companies over the feasibility of running critical systems or placing sensitive data on third party clouds. While we have not lost a major cloud computing provider yet, undoubtedly because they are all still so new, other third party storage providers have recently shut down, as noted in PCWorld’s article Will Your Data Disappear When Your Online Storage Site Shuts Down?  Granted, these storage providers’ business models were very different, often giving away storage for free in hopes of upselling users on other products such as photo printing. Still, ISPs and hosting providers do go out of business all the time, leaving customers in the lurch. Failure is not reserved for small businesses, as we’ve seen recently with banks and car companies. As Alan Williamson, co-founder of AW2.0, a cloud computing firm, stated, “Users cannot absolve themselves from being 100 percent responsible for their own data.” The cloud computing offerings are becoming more mature, but they still require companies to understand the pros and cons in order to make wise decisions and plans in the event of service outages or business failures.



Incidents and Problems

On 19 April 1951, MacArthur gave a farewell speech to Congress upon being relieved of his command in Korea. It included the following: “But once war is forced upon us, there is no other alternative than to apply every available means to bring it to a swift end. War’s very object is victory, not prolonged indecision. In war there is no substitute for victory.” Reading this recently, I was reminded of how tech teams should approach service outages. Too often teams get confused about the priority of restoring service versus finding the root cause. We will be the first ones to tell you that you need to instill a culture of excellence that does not allow mistakes or issues to happen twice. However, during the outage, the first priority should be to restore service as quickly as possible. If you have time to gather data, like core dumps, that later will be valuable for determining root cause, great, but focus on getting the site or service restored. 

The Information Technology Infrastructure Library does a great job explaining the differences between what they refer to as Incidents and Problems. An Incident is “an event which is not part of the standard operation of a service and which causes or may cause disruption to or a reduction in the quality of services…” A Problem, by contrast, is “the unknown root cause of one or more existing or potential Incidents.” The ITIL has different processes for managing each. The goal of Incident Management is to “restore normal operations as quickly as possible…” while the goal of Problem Management is “to minimize the impact of problems…”

As you can imagine, there is often conflict between these two goals. A possible solution offered by the ITIL is to form a plan of attack for the next occurrence of the problem that outlines the following:

  • What diagnostics to collect
  • How long to allow for diagnostics before service is restored
  • Prepare the necessary resources (people, process, and technology) prior to the incident
  • Communicate the plan to the stakeholders
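
The first two items in the plan above amount to a time box on diagnostics. As a minimal sketch, assuming a plan is just a list of diagnostics plus a time limit, the decision to stop collecting and restore service could look like this in Python. The names and the structure are hypothetical and are not part of the ITIL.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DiagnosticsPlan:
    diagnostics: tuple[str, ...]   # what to collect, in priority order
    time_box: timedelta            # how long to collect before restoring service


def should_restore_now(plan: DiagnosticsPlan, elapsed: timedelta, collected: set[str]) -> bool:
    """Restore service once the time box expires or everything has been collected."""
    return elapsed >= plan.time_box or set(plan.diagnostics) <= collected
```

Agreeing on the time box before the incident is what keeps the decision from being argued during the outage.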

If you like this topic you’ll enjoy Chapters 8 and 9 of The Art of Scalability, where the management of issues and crises is discussed in detail.



After Action Review

Is your company a “learning organization”, committed to continuous improvement and not willing to repeat mistakes?  If not, you should be, and if you are, you should be performing postmortems or After Action Reviews (AAR) on all your projects and releases. Before we get into the purpose of the AAR we should address those organizations that are not dedicated to learning.  If your organization continues to stumble in the dark, stubbing its proverbial toe on the same piece of furniture but refusing to move it, stop and move the furniture!  If your site continues to have availability issues then apologize, mean it, and fix it. Sooner or later your customers will leave, frustrated at your inability to learn from your mistakes.

If you’re part of the other type of organization that strives to learn from mistakes and not repeat them, After Action Reviews are for you.  As covered in the Inc.com article Leadership: Armed with Data, companies repeat mistakes because they either fail to figure out what went wrong or they fail to institutionalize the fix.

In a typical AAR, the project’s stated goals or objectives are compared with observed results by the project team and a discussion is conducted to identify why the results differed. Sometimes the team can identify what went wrong and why. Other times the team will know what went wrong but perhaps not the reason why. It is okay to leave with open action items for investigations. It is not okay to leave without people assigned to document, implement, study, or report to the team later on how the discrepancy is going to be addressed. If the team only identifies the problem but doesn’t do something to keep from experiencing it again, they are only halfway done.  Don’t let these lessons learned drop out of your organization’s collective memory. Complete the process by institutionalizing the solution.

There are lots of resources for learning how to perform an effective AAR.  Get in the habit of conducting them after projects and institutionalizing the solutions.



Checklists

The “Annals of Medicine: The Checklist” is an article from the New Yorker in Dec 2007.  Besides reminding us that we really want to avoid a trip to the Intensive Care Unit, it also spells out how checklists are important when performing complex tasks, even if they tend to be routine.  One study showed that a strictly followed 5 step process prevented eight deaths in just over a year’s time.  The article states “Checklists established a higher standard of baseline performance.”

Another article “Study: A Simple Surgery Checklist Saves Lives” in Time, describes similar studies and findings.  In the study described, death rates dropped from 1.5% to 0.8%.  Both articles mention the use of checklists by pilots, due to the complexity of the systems and machines that they operate.

Your system, including the application as well as the entire development and deployment process, is likely to be very complex. The lesson we should take away from these articles for all our technology teams is that checklists are important because they reduce the number of problems caused by human error.  You don’t need hundreds of steps; a few key steps are all that is required, along with strict adherence to them.  When you finish the release at 2am you’re probably not thinking as clearly as you normally do, so don’t rely on your memory for checking the site.  Have a checklist for critical parts of the application to verify before you head to bed.
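
A post-release checklist can be as simple as a short list of named checks run in order. The Python sketch below is one hypothetical way to do that; the check names are placeholders, and real checks would probe your own site.

```python
def run_checklist(checks):
    """Run every check in order and return the names of the ones that failed.

    `checks` is a list of (name, callable) pairs; each callable returns True on
    success. A check that raises is recorded as a failure rather than aborting
    the rest of the list.
    """
    failures = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        if not passed:
            failures.append(name)
    return failures


if __name__ == "__main__":
    # Placeholder checks standing in for real probes of the site.
    checks = [
        ("home page loads", lambda: True),
        ("login works", lambda: True),
        ("checkout completes", lambda: False),
    ]
    failed = run_checklist(checks)
    print("FAILED: " + ", ".join(failed) if failed else "All checks passed")
```

Running every check even after one fails is deliberate: at 2am you want the full list of what is broken, not just the first item.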



Sounds of hard drives dying

If your laptop starts making any of these sounds you are probably about to have a bad day.  What happens when your servers start making these sounds?  Pulling a server out of rotation for a hard drive failure and then having to ghost, jumpstart, or kickstart it again is a waste of time and resources.

Depending on your application it might make sense to consider ordering your next set of app servers with solid state drives (SSD) instead of hard disk drives (HDD). When the disk storage is used primarily for the operating system, web/app server, and code, these configurations start to make sense. The higher price might very well be offset by the faster speed and lower heat, as well as lower energy consumption and greater mean time between failures.



Storage Headaches 2

We posted a blog last week about how many of our clients decided a year or two ago that as part of their product offering they would provide storage of user data. We pointed out that this occurred with no foresight or cost calculations, and so these companies decided that this storage was either unlimited in amount, perpetual in duration, or worse, both. Today these companies are scrambling to figure out ways to lower the storage cost or charge customers.  I received this notice in my inbox today. While Yahoo is not a client of ours, it looks like they are facing the same problem.

 

 

       

Dear Yahoo! Briefcase user,

We will be officially closing Yahoo! Briefcase on March 30, 2009. Until then, we are offering you the opportunity to download your files back to your computer. You will need to take action before we close, after which any files remaining on Yahoo! Briefcase will be deleted and no longer accessible.

To access your Yahoo! Briefcase account, click the link below:

 

 

 

 



Storage Headaches

There are numerous companies who decided a year or two ago to provide storage of user data as part of their product offering.  Usually this occurred with no foresight or cost calculations, and so these companies decided that this storage was either unlimited in amount, perpetual in duration, or worse, both.  Fast forward to the present and these companies are scrambling to figure out ways to lower the storage cost or charge customers for this service.  Of course, hindsight is 20/20, but in our opinion this should be taken as a lesson to all companies: a product roadmap that does not consider the revenue versus cost equation is more than likely to result in future problems, with features either not being used by customers or not generating enough revenue to cover their cost.

 

 

For companies with data storage problems our recommendations are very dependent on their business model, user agreements, customer contracts, etc. So unfortunately there is no panacea or one size fits all solution. In general we usually walk down the following steps, attempting to achieve an acceptable solution:

  1. Delete what data you can
  2. Archive to very low cost storage data that is not being accessed
  3. Establish tiers of storage based on speed, reliability, and availability

Consider situations in which you have a significant amount of archival data such as former employees or customers who are no longer active.  The cost of keeping this on your primary storage is not only the space on your fastest and most expensive storage but also the backup and archiving of this data that occurs every day even though it never changes.  Incremental backups help this but more than likely you have full backups periodically as well.  If this data is in a primary database, you are likely to have one or more standby databases as well as a tape backup.  All of that unchanging and rarely accessed data continues to take up storage and bandwidth to move it around.  
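
As a rough sketch of how steps 2 and 3 might be automated, the Python function below picks a tier from how recently the data was read and whether its owner is still active. The tier names, the thresholds, and the function itself are assumptions for illustration; appropriate values depend on your business model and customer agreements.

```python
from datetime import datetime, timedelta


def choose_tier(last_accessed, now, owner_active=True,
                warm_after=timedelta(days=30), archive_after=timedelta(days=365)):
    """Pick a storage tier from how recently the data was read.

    The thresholds are placeholders, not recommendations.
    """
    age = now - last_accessed
    if not owner_active or age >= archive_after:
        return "archive"   # very low cost storage, e.g. tape
    if age >= warm_after:
        return "warm"      # cheaper, slower disk
    return "primary"       # fastest, most expensive storage
```

Data that lands in the archive tier also drops out of the daily backup cycle on primary storage, which is where much of the saving described above comes from.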

Possible storage alternatives include the myriad of SAN offerings, NAS devices, open source storage, SATA drive farms, tape, and cloud storage.  We recommend that you implement one or more of these in your solution depending upon your particular needs.  We also encourage you to consider ahead of time your need for scalability and availability.  For a sample architecture of a scalable read or search subsystem check out our previous article.

