Archive for the ‘Operations’ Category

Storage Headaches

Saturday, February 21st, 2009

Numerous companies decided a year or two ago to provide storage of user data as part of their product offering.  Usually this decision was made with no foresight or cost calculations, and so these companies offered storage that was either unlimited in amount, perpetual in duration, or worse, both.  Fast forward to the present and these companies are scrambling to figure out ways to lower the storage cost or charge customers for this service.  Of course, hindsight is 20/20, but in our opinion this should be taken as a lesson to all companies: a product roadmap built without considering the revenue versus cost equation is more than likely to result in future problems, with features either not being used by customers or not generating enough revenue to cover their cost.


For companies with data storage problems, our recommendations depend heavily on their business model, user agreements, customer contracts, etc., so unfortunately there is no panacea or one-size-fits-all solution. In general we walk through the following steps in an attempt to reach an acceptable solution:

  1. Delete what data you can
  2. Archive to very low cost storage data that is not being accessed
  3. Establish tiers of storage based on speed, reliability, and availability
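
As a rough illustration of the three steps above, the sketch below assigns a data object to a tier based on how long ago it was last accessed. The function name, tier names, and thresholds are hypothetical and chosen only for illustration; appropriate cutoffs depend on your own contracts and access patterns.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds (assumptions, not recommendations).
ARCHIVE_AFTER = timedelta(days=365)
WARM_AFTER = timedelta(days=30)


def choose_tier(last_accessed, now, deletable=False):
    """Return 'delete', 'archive', 'warm' or 'primary' for one data object."""
    if deletable:
        return "delete"      # step 1: delete what data you can
    age = now - last_accessed
    if age >= ARCHIVE_AFTER:
        return "archive"     # step 2: very low cost storage for untouched data
    if age >= WARM_AFTER:
        return "warm"        # step 3: a slower, cheaper tier
    return "primary"         # recently used data stays on fast storage


now = datetime(2009, 2, 21)
print(choose_tier(datetime(2007, 1, 1), now))   # archive
print(choose_tier(datetime(2009, 2, 20), now))  # primary
```

Whether data is deletable is passed in as a flag because that decision comes from user agreements and contracts, not from access times.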

Consider situations in which you have a significant amount of archival data, such as data for former employees or customers who are no longer active.  The cost of keeping this on your primary storage is not only the space on your fastest and most expensive storage but also the backup and archiving of this data that occurs every day even though it never changes.  Incremental backups help with this, but more than likely you take full backups periodically as well.  If this data is in a primary database, you are likely to have one or more standby databases as well as a tape backup.  All of that unchanging and rarely accessed data continues to take up storage, and bandwidth is consumed moving it around.
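
To make the multiplication effect concrete, here is a back-of-the-envelope sketch. It assumes every standby database and every retained full backup holds a complete copy of the data; the function name and the figures in the example are made up for illustration.

```python
def total_footprint_gb(primary_gb, standby_copies, full_backups_retained):
    """Storage consumed by one data set once all of its copies are counted.

    Assumes each standby and each retained full backup is a complete copy.
    """
    return primary_gb * (1 + standby_copies + full_backups_retained)


# Illustrative numbers only: 500 GB of inactive customer data,
# two standby databases, four full backups kept on hand.
print(total_footprint_gb(500, standby_copies=2, full_backups_retained=4))  # 3500
```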

Possible storage alternatives include the myriad of SAN offerings, NAS devices, open source storage, SATA drive farms, tape, and cloud storage.  We recommend that you implement one or more of these in your solution depending upon your particular needs.  We also encourage you to consider ahead of time your need for scalability and availability.  For a sample architecture of a scalable read or search subsystem check out our previous article.

A Framework for Maturing SaaS Monitoring

Tuesday, September 9th, 2008

Far too often we see clients attempting to implement monitoring solutions intended to tell them the root cause of any potential problem they might be facing.  This sounds great, but this monitoring panacea rarely works, and the failures are largely attributable to two issues:
1) The systems they are attempting to monitor aren’t designed to be monitored.
2) The company does not approach monitoring in a methodical evolutionary fashion.

Designing Systems to be Monitored
Honestly, you should not expect a monitoring system to correctly identify the faults within your platform if you did not design your platform to be monitored with near-real time fault detection in mind.  This goes beyond logging events and errors; it is something that we often refer to as “real time application monitoring”. 

The best designed SaaS systems build the monitoring of their platform into their code and systems.  As an example, world class real time monitoring solutions have the capability to log the times and errors for each internal call to a service.  Here the service may be a call to a data store or another web service that exposes account information, etc.  The resulting times, rates and types of errors might be plotted in real time in a statistical process control (SPC) chart, with out-of-bounds conditions highlighted as an alert on some sort of monitoring panel.  The mean of the SPC chart may be calculated from the previous 30 similar calendar days (for instance the previous 30 Mondays) for that time of the day (say 12:10 PM).
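
A minimal sketch of the SPC check described above follows. It assumes the history is a list of error counts observed at the same time of day on comparable previous days (for example, 12:10 PM on the previous 30 Mondays) and that control limits sit three sample standard deviations from the mean. The function names and sample data are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, center, upper) control limits from historical samples."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation; needs >= 2 points
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_bounds(value, history, sigmas=3.0):
    """True when the current observation falls outside the control limits."""
    lower, _, upper = control_limits(history, sigmas)
    return value < lower or value > upper


# Made-up history: error counts at 12:10 PM on the previous 30 Mondays.
history = [10 + (i % 5) for i in range(30)]

print(out_of_bounds(13, history))  # False
print(out_of_bounds(25, history))  # True
```

In a real system the history would be pulled from the logged call times and error rates, and an out-of-bounds result would raise the alert on the monitoring panel.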

Additionally, world class teams include an architectural principle addressing the need to be monitored as a criterion for release for any new functionality.  The Architecture Review Board (ARB) is a process or meeting in which this criterion is evaluated.  Questions such as “How will we know the system is functioning properly?” are asked, and a bad answer is one that sounds like “Because we log errors to a log file” whereas a good answer might be “Because we plot the rate of errors and timeliness of responses in real time and alert on statistically significant anomalies”.

Maturing Monitoring
While having “Designed to be Monitored” as an architectural principle is necessary to be world class, it is not sufficient if you really want to resolve issues quickly.  The only silver bullet for monitoring solutions that help quickly identify and resolve issues is a combination of time, planning and a reaction to past events.

First you should plan a system that identifies that something is wrong from the perspective of your customer.  In this step you are answering the question of “Is there a problem my customers can see?”  Far too many companies bypass this step.  Incorporate a real time, third party system that interacts with your platform in the same fashion as your customers – from the “last mile” – and performs your most critical transactions.   Throw an alert when the system is outside of your internally generated SLAs. 
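
The core of such a customer-perspective check can be sketched as below: run one critical transaction, time it, and compare the result against an SLA. The `run_synthetic_check` helper and the stand-in transaction are hypothetical. This sketch runs in-process, whereas the check described above would be driven by a third party system from outside your network.

```python
import time


def run_synthetic_check(transaction, sla_seconds):
    """Run one critical transaction and report whether it met the SLA.

    `transaction` is any zero-argument callable that raises on failure.
    """
    start = time.monotonic()
    try:
        transaction()
        succeeded = True
    except Exception:
        succeeded = False
    elapsed = time.monotonic() - start
    return {
        "ok": succeeded and elapsed <= sla_seconds,  # alert when this is False
        "succeeded": succeeded,
        "elapsed": elapsed,
    }


def fake_login():
    """Stand-in for a real customer transaction (assumption for the demo)."""
    time.sleep(0.01)


result = run_synthetic_check(fake_login, sla_seconds=2.0)
if not result["ok"]:
    print("ALERT: customer-facing transaction outside SLA", result)
```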

The next step is to implement systems that answer the question of “Which systems are causing the problem?”  In the ideal world you will have developed a fault isolative architecture to create “failure domains” that will isolate failures and help you determine the systems causing the problem.  Failing that, you need monitoring that can help indicate the rough areas of concern.  These are typically aggregated system statistics and monitoring similar to the real time application monitoring above (subsystem X is throwing errors at a rate 3 standard deviations above normal) or aggregated load, CPU, etc. for a group of systems (rather than a single system).  You want to ensure that this level of monitoring does not create a level of noise that forces your team to ignore the alerts.
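
One way to picture this aggregate level of monitoring is sketched below. It pools error counts across all hosts in a subsystem and flags the subsystem only when the pooled rate is more than three standard deviations above its own baseline. The subsystem names, data layout, and sample numbers are assumptions made for the example.

```python
from statistics import mean, stdev


def subsystem_error_rate(host_samples):
    """Pool (errors, requests) pairs from every host into one error rate."""
    errors = sum(e for e, _ in host_samples)
    requests = sum(r for _, r in host_samples)
    return errors / requests if requests else 0.0


def suspect_subsystems(current, baselines, sigmas=3.0):
    """Return names of subsystems whose pooled error rate is abnormally high."""
    suspects = []
    for name, hosts in current.items():
        rate = subsystem_error_rate(hosts)
        base = baselines[name]  # historical pooled error rates for this subsystem
        if rate > mean(base) + sigmas * stdev(base):
            suspects.append(name)
    return suspects


# Made-up data: two subsystems, two hosts each.
baselines = {
    "search": [0.010, 0.012, 0.011, 0.009, 0.010],
    "login": [0.020, 0.022, 0.019, 0.021, 0.020],
}
current = {
    "search": [(50, 1000), (45, 1000)],
    "login": [(20, 1000), (21, 1000)],
}
print(suspect_subsystems(current, baselines))  # ['search']
```

Alerting on the pooled rate rather than on each host keeps the noise down: a single misbehaving host only raises an alert when it moves the subsystem as a whole.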

The third step is to answer the question of “What exactly is the problem?”  This is the step that everyone immediately jumps to when they implement a host of alarms and monitors on everything from individual application logs to individual load, CPU utilization, memory utilization, port utilization, etc.  The problem with this is that these alerts have a high rate of false positives and aren’t necessarily useful in determining that there is a problem that needs to be resolved RIGHT NOW; they are more useful in helping to isolate and determine what the problem is.  If you alert based on aggregate subsystem and customer perceived data, you will have less noise in general, and you can use this granular level of data to help pinpoint the problem, perform capacity analysis, etc.

The final step is to implement monitoring systems that help you identify that there will be a problem in the future.  This is the most mature step, but one that should be tackled only after you’ve implemented the prior three steps to include real time application monitoring.  These systems are predictive in nature and should use data collected from the third level of maturity (discrete and granular system monitoring) to feed into a modeling program that can ultimately help plan capacity, determine system break points, etc.
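
As a very simple stand-in for such a modeling program, the sketch below fits a straight line to daily usage samples and estimates how many days remain before the trend reaches a capacity limit. A linear trend is an assumption made for illustration, and real capacity models are usually more involved. The function names and sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(daily_usage, limit):
    """Days after the last sample until the fitted trend reaches `limit`.

    Returns None when usage is flat or shrinking. Needs at least two samples.
    """
    xs = list(range(len(daily_usage)))
    slope, intercept = fit_line(xs, daily_usage)
    if slope <= 0:
        return None
    day_at_limit = (limit - intercept) / slope
    return max(0.0, day_at_limit - xs[-1])


# Made-up samples: storage used (GB) on five consecutive days, 200 GB capacity.
print(days_until_limit([100, 110, 120, 130, 140], limit=200))  # 6.0
```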