A Framework for Maturing SaaS Monitoring
Far too often we see clients attempting to implement monitoring solutions intended to tell them the root cause of any potential problem they might be facing. This sounds great, but this monitoring panacea rarely works, and the failures largely trace back to two issues:
1) The systems they are attempting to monitor aren’t designed to be monitored.
2) The company does not approach monitoring in a methodical evolutionary fashion.
Designing Systems to be Monitored
Honestly, you should not expect a monitoring system to correctly identify faults within your platform if you did not design the platform to be monitored, with near real time fault detection in mind. This goes beyond logging events and errors; it is something we often refer to as “real time application monitoring”.
The best designed SaaS systems build monitoring of their platform into their code and systems. As an example, world class real time monitoring solutions have the capability to log the times and errors for each internal call to a service, where the service may be a call to a data store, another web service that exposes account information, etc. The resulting times, rates and types of errors might be plotted in real time on a statistical process control (SPC) chart, with out-of-bound conditions highlighted as alerts on some sort of monitoring panel. The mean of the SPC chart may be calculated from the previous 30 similar calendar days (for instance the previous 30 Mondays) for that time of day (say 12:10 PM).
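As a minimal sketch of this SPC approach, control limits can be derived from the same time slot on prior similar days and a breach flagged as an alert. The response times below are invented for illustration; a real implementation would pull them from the logged service-call data described above:

```python
import statistics

def spc_check(history, current, sigmas=3.0):
    """Flag `current` as anomalous if it falls outside control limits
    derived from `history` (e.g., response times for the same call at
    12:10 PM on each of the previous Mondays)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    upper = mean + sigmas * stdev
    lower = mean - sigmas * stdev
    return not (lower <= current <= upper)

# Hypothetical response times (ms) for one service call, sampled at
# 12:10 PM on prior Mondays.
mondays = [110, 105, 98, 120, 102, 115, 108, 99, 112, 104,
           118, 101, 107, 113, 96, 109, 111, 103, 106, 100]

print(spc_check(mondays, 112))   # within limits -> False
print(spc_check(mondays, 450))   # far above the upper limit -> True
```

In practice the same check runs continuously, one control chart per service call, with breaches pushed to the monitoring panel.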
Additionally, world class teams include an architectural principle addressing the need to be monitored as a criterion for release of any new functionality. An Architecture Review Board (ARB) is a process or meeting in which that criterion is evaluated. Questions such as “How will we know the system is functioning properly?” are asked; a bad answer sounds like “Because we log errors to a log file,” whereas a good answer might be “Because we plot the rate of errors and timeliness of responses in real time and alert on statistically significant anomalies.”
While having “Designed to be Monitored” as an architectural principle is necessary to be world class, it is not sufficient if you really want to resolve issues quickly. The only silver bullet for monitoring solutions that help quickly identify and resolve issues is a combination of time, planning and reaction to past events.
First you should plan a system that identifies that something is wrong from the perspective of your customer. In this step you are answering the question of “Is there a problem my customers can see?” Far too many companies bypass this step. Incorporate a real time, third party system that interacts with your platform in the same fashion as your customers – from the “last mile” – and performs your most critical transactions. Throw an alert when the system is outside of your internally generated SLAs.
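A sketch of such a last-mile probe is below. The transaction itself is simulated so the example is self-contained; a real probe would be a third-party agent driving your actual critical transactions (login, search, checkout) over the public internet, and the SLA value is an invented placeholder:

```python
import time

SLA_SECONDS = 2.0  # placeholder for an internally generated SLA

def run_critical_transaction():
    """Stand-in for a synthetic 'last mile' transaction; a real probe
    would exercise the platform exactly as a customer would."""
    time.sleep(0.05)  # simulated network round trip and page render
    return {"status": 200}

def probe():
    """Run the transaction, time it, and decide whether to alert."""
    start = time.monotonic()
    result = run_critical_transaction()
    elapsed = time.monotonic() - start
    breached = result["status"] != 200 or elapsed > SLA_SECONDS
    return breached, elapsed

breached, elapsed = probe()
print("ALERT" if breached else "OK")
```

The key design point is that the alert condition is expressed entirely in customer terms (did the transaction succeed, and fast enough), not in terms of any internal system metric.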
The next step is to implement systems that answer the question of “Which systems are causing the problem?” In the ideal world you will have developed a fault isolative architecture that creates “failure domains” to isolate failures and help you determine the systems causing the problem. Failing that, you need monitoring that can indicate the rough areas of concern. These are typically aggregated system statistics and monitoring similar to the real time application monitoring above (subsystem X is throwing errors at a rate 3 standard deviations above normal) or aggregated load, CPU, etc. for a group of systems (rather than a single system). You want to ensure that this level of monitoring does not create a level of noise that forces your team to ignore the alerts.
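The aggregation idea can be sketched as follows; the host names, subsystems, error counts and history are all invented for illustration. Per-host counts roll up to the subsystem (the failure domain), and only the rolled-up rate is compared against its historical control limits:

```python
import statistics
from collections import defaultdict

def aggregate_errors(samples):
    """Sum per-host error counts into per-subsystem totals, so alerts
    fire on the failure domain rather than on any single box."""
    totals = defaultdict(int)
    for host, subsystem, errors in samples:
        totals[subsystem] += errors
    return dict(totals)

def anomalous_subsystems(history, current, sigmas=3.0):
    """Return subsystems whose aggregated error count exceeds the
    historical mean by more than `sigmas` standard deviations."""
    flagged = []
    for subsystem, value in current.items():
        past = history[subsystem]
        mean = statistics.mean(past)
        stdev = statistics.stdev(past)
        if value > mean + sigmas * stdev:
            flagged.append(subsystem)
    return flagged

# Hypothetical per-host samples: (host, subsystem, errors last minute)
samples = [("web1", "search", 2), ("web2", "search", 3),
           ("db1", "billing", 40), ("db2", "billing", 45)]
history = {"search": [4, 5, 6, 5, 4, 6, 5, 5],
           "billing": [8, 9, 10, 9, 8, 10, 9, 9]}

current = aggregate_errors(samples)
print(anomalous_subsystems(history, current))  # -> ['billing']
```

Here the search subsystem's total of 5 errors is normal, while billing's total of 85 is far outside its historical band, pointing the on-call engineer at the right failure domain.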
The third step is to answer the question of “What exactly is the problem?” This is the step that everyone immediately jumps to, implementing a host of alarms and monitors on everything from individual application logs to individual load, CPU utilization, memory utilization, port utilization, etc. The problem is that these alerts have a high rate of false positives and aren’t necessarily useful in determining that there is a problem that needs to be resolved RIGHT NOW – they are more useful in helping to isolate and determine what the problem is. If you alert based on aggregate subsystem and customer perceived data, you will have less noise in general and you can use this level of data to help pinpoint the problem, perform capacity analysis, etc.
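A small simulation illustrates the false-positive argument. On a perfectly healthy fleet (all numbers invented), per-host CPU thresholds still fire regularly on random jitter, while the same threshold applied to the fleet-wide average stays quiet:

```python
import random
import statistics

random.seed(7)  # fixed seed so the simulation is repeatable

HOSTS = 50
SAMPLES = 60       # one reading per minute over an hour
THRESHOLD = 90.0   # CPU % alert threshold

# Healthy fleet: each host hovers around 60% CPU with normal jitter.
per_host = [[min(100.0, random.gauss(60, 15)) for _ in range(SAMPLES)]
            for _ in range(HOSTS)]

# Per-host alerting: every individual reading over the threshold pages.
host_alerts = sum(1 for series in per_host for v in series if v > THRESHOLD)

# Aggregate alerting: only the fleet-wide average is checked.
fleet_avg = [statistics.mean(col) for col in zip(*per_host)]
fleet_alerts = sum(1 for v in fleet_avg if v > THRESHOLD)

print(host_alerts, fleet_alerts)
```

The per-host readings still have diagnostic value once you know something is wrong; the point is simply not to page on them.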
The final step is to implement monitoring systems that help you identify that there will be a problem in the future. This is the most mature step, but one that should be tackled only after you’ve implemented the prior three steps, including real time application monitoring. These systems are predictive in nature and should use data collected at the third level of maturity (discrete and granular system monitoring) to feed a modeling program that can ultimately help plan capacity, determine system break points, etc.
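As a deliberately simple stand-in for such a modeling program, a least-squares trend line fitted to weekly peak utilization (hypothetical numbers below) can project how long until a system reaches its break point:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def weeks_until_breakpoint(utilization, capacity=100.0):
    """Extrapolate weekly peak utilization to estimate when the
    system hits its break point; None if there is no growth trend."""
    weeks = list(range(len(utilization)))
    slope, intercept = linear_fit(weeks, utilization)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope

# Hypothetical weekly peak CPU utilization (%) from granular monitoring.
peaks = [40, 43, 45, 49, 52, 54, 58, 61]
print(round(weeks_until_breakpoint(peaks), 1))  # -> 20.1
```

A real capacity model would account for seasonality, headroom policy, and nonlinear break points, but even this trivial projection turns reactive granular data into a forward-looking answer.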