What question does each of your system monitors answer? You probably think they answer questions such as “Is there a problem?” and, if so, “Where is the problem?” Most likely this is not the case: instead of telling you whether there is a problem, a monitor really only tells you “Where” or “What” the problem might be. Before we continue, a quick detour to discuss metrics, which, while different from monitoring, are similar in many ways.
Eric Ries, co-founder and CTO of IMVU, posted an article about the difference between vanity metrics and actionable metrics. The entire article and accompanying video are worth a read and a listen, but the takeaway is that most people use and look for metrics that make great soundbites but do not suggest any definable actions. One example is the total number of hits to a website. Eric asks, “Now what? Do you really know what actions you took in the past that drove those visitors to you, and do you really know which actions to take next?” This makes total sense to me, as we often see teams misusing monitoring in an attempt to determine what actions to take with their systems.
Back to our discussion of what question your monitoring is attempting to answer. We think there are five evolutionary questions that monitoring should answer:
- Is there a problem?
- Where is the problem?
- What is the problem?
- Why is there a problem?
- Will there be a problem?
Where most people fail is in taking a monitoring tool that is designed to answer “Where” or “What” and trying to use it to answer “Is”. For example, if you are monitoring all of your servers' vitals, such as CPU, memory, and I/O, what is the appropriate action for your team to take when CPU utilization goes to 100%? That is a tough question because you are missing the vital piece of information: “Is this affecting my customers?” The “Is there a problem?” question is intended to be a proxy for customer impact, which helps determine the degree and speed of escalation.
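To make the CPU example concrete, here is a minimal sketch of an escalation decision in which customer impact, not the CPU reading, drives urgency. Everything in it is an assumption made for illustration: the `Signals` fields, the `request_success_rate` measure used as the customer-impact proxy, the thresholds, and the three response levels are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of monitoring data; both fields are illustrative."""
    cpu_utilization: float       # 0.0-1.0, a "Where"/"What" signal
    request_success_rate: float  # 0.0-1.0, assumed proxy for customer impact


# Assumed thresholds, chosen only for this example.
CPU_ALERT = 0.95
SUCCESS_RATE_FLOOR = 0.99


def escalation(signals: Signals) -> str:
    """Pick a response level, letting customer impact decide the urgency."""
    customers_affected = signals.request_success_rate < SUCCESS_RATE_FLOOR
    cpu_hot = signals.cpu_utilization >= CPU_ALERT

    if customers_affected:
        # "Is there a problem?" is answered yes: escalate immediately and
        # treat the CPU reading only as a hint about where to look.
        return "page on-call"
    if cpu_hot:
        # High CPU with no customer impact: worth a look, not an emergency.
        return "open ticket"
    return "no action"
```

With these assumed thresholds, `escalation(Signals(1.0, 0.999))` yields a ticket rather than a page, because a pegged CPU on its own says nothing about whether customers are affected.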
If you have monitoring services in place now, it is worthwhile to determine which question each one answers. If you are missing a monitor for a particular question, the time to remedy that is before you need the question answered.
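One lightweight way to run that audit is to write down which of the five questions each monitor answers and list the questions left uncovered. The sketch below assumes a hand-maintained inventory; the monitor names `host-vitals` and `log-search` are made up for the example.

```python
from enum import Enum


class Question(Enum):
    IS = "Is there a problem?"
    WHERE = "Where is the problem?"
    WHAT = "What is the problem?"
    WHY = "Why is there a problem?"
    WILL = "Will there be a problem?"


def unanswered(monitors: dict[str, set[Question]]) -> list[Question]:
    """Return the questions that no monitor answers, in the order defined above."""
    covered = set().union(*monitors.values())
    return [q for q in Question if q not in covered]


# Hypothetical inventory: monitor name -> questions it answers.
inventory = {
    "host-vitals": {Question.WHERE, Question.WHAT},
    "log-search": {Question.WHY},
}

for question in unanswered(inventory):
    print("No monitor answers:", question.value)
```

For this made-up inventory the gaps are “Is there a problem?” and “Will there be a problem?”, which is the situation described above: plenty of detail about where and what, and nothing that speaks to customer impact.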