To Log or Not To Log?
That is the question that has caused debate for many years among operations and engineering staffs. We’ve recently read a couple of very well written and well thought out articles on this topic and wanted to offer our ideas on the debate. The first, Log Everything All the Time by Todd Hoff of HighScalability.com, advocates, as the title implies, that everything should be logged for potential use. Todd has another article describing Facebook’s open source Scribe, Product: Scribe – Facebook’s Scalable Logging System, where he observes that Facebook, by virtue of developing the product, must agree with his logging approach. The other article, The Problem With Logging by Jeff Atwood of CodingHorror.com, argues for a more tempered approach. Jeff summarizes his position as “Start small and simple, logging only the most obvious and critical of errors.”
Our position is squarely in the camp of log everything, but with a few caveats. These ignore-at-your-application’s-peril cautions are: 1) logging must not impede the performance of the application, 2) use a common framework, and 3) look at the data. Let’s go through these one at a time.
1) Logging must not impede the performance of the application – As Jeff points out, “logging isn’t free” and we agree with that, but we would add that the potential benefit of the data outweighs the resource cost, unless it negatively affects performance. Get ready for one of our repeating themes: Do it if the BUSINESS benefit of logging outweighs the cost of logging. Most web / application servers are not fully utilized because most teams don’t know precisely the performance parameters and resource constraints of their application, especially as those change with each release. If you are fortunate enough to be in an organization that really understands the bottlenecks and performance of the application on specific hardware, more than likely there is a single resource that is the bottleneck, e.g. memory, I/O, or CPU. Your logging service should not put further demand on a constrained resource; all surpluses are fair game. And what should go without saying is that all logging must be done asynchronously. Losing a log event is acceptable but impacting a transactional event is not.
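To make the “asynchronous, drop rather than block” rule concrete, here is a minimal sketch using Python’s standard library. The bounded queue size, logger name, and file name are illustrative assumptions, not a prescribed framework:

```python
import logging
import logging.handlers
import queue

# Bounded queue: if logging falls behind, events are dropped rather
# than ever stalling the transaction path (size is an assumption).
log_queue = queue.Queue(maxsize=10000)

class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue log records without ever blocking the caller."""
    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            pass  # losing a log event is acceptable; blocking is not

# The listener drains the queue on a background thread, so the slow
# file I/O never happens on the request path.
file_handler = logging.FileHandler("app.log")
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(DroppingQueueHandler(log_queue))

logger.info("checkout completed")
listener.stop()  # flush any remaining records on shutdown
```

The key design choice is that the hot path only does an in-memory `put_nowait`; all formatting and I/O cost is paid by the background listener thread.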
2) Use a common framework – Choose or build a common framework that is used throughout the application and that includes common definitions. Just as Priority and Severity are defined for bugs, logging definitions must be determined and adhered to. Code reviews are one way to ensure common usage. Data being sent to five different files in different formats defeats the purpose of logging; common usage, format, gathering, and analysis are where the payoff is realized.
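One way a common framework might look in practice: a single entry point that every module uses to obtain a logger, with one shared output format. The JSON field names and the `get_logger` helper below are hypothetical conventions for illustration:

```python
import json
import logging

class JsonLineFormatter(logging.Formatter):
    """Team-wide convention (assumed): every service emits one JSON
    line per event, so downstream gathering and analysis tools can
    rely on a single format."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        })

def get_logger(name):
    """Single entry point for loggers, enforcing the common format."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonLineFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = get_logger("checkout")
log.warning("payment retry")  # emits one parseable JSON line
```

Because every team calls `get_logger` instead of configuring handlers ad hoc, nobody ends up writing a sixth file in a sixth format, and a code review only has to check that the common entry point was used.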
3) Look at the data – Logging tons of data (and when we say tons, think of Scribe, which claims to handle tens of billions of messages each day) can be completely overwhelming to look at. But looking through the data, via some mechanism automated or manual, is mandatory for the benefit to be gained. As Todd points out, there are products like Hadoop to help process the data into viewable and actionable information. Jeff makes the point that “the more you log, the less you find”, but our point is that by the time you know you have a problem and need to inject logging, you’re too late. Properly logging and analyzing the data will identify problems early and make diagnosis easier. We think products such as SCAMP application monitoring software are excellent for creating an easy way to see inside the application.
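Even without Hadoop-scale tooling, the “look at the data” step can start as a tiny aggregation pass that makes error spikes stand out. This sketch assumes a simple “LEVEL logger: message” line layout, which is purely illustrative:

```python
import collections
import re

# Assumed line layout: "LEVEL logger: message" (illustration only).
LINE = re.compile(r"^(?P<level>[A-Z]+) (?P<logger>\S+): (?P<msg>.*)$")

def summarize(lines):
    """Count events per (logger, level) so anomalies surface quickly."""
    counts = collections.Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[(m.group("logger"), m.group("level"))] += 1
    return counts

sample = [
    "INFO checkout: order placed",
    "ERROR checkout: payment timeout",
    "ERROR checkout: payment timeout",
    "INFO search: query served",
]
for (name, level), n in sorted(summarize(sample).items()):
    print(f"{name} {level}: {n}")
```

The same counting idea scales up naturally to a MapReduce job over billions of lines; the point is that some automated summary, however small, turns raw volume into something a human can actually act on.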
As long as you avoid the pitfalls stated above, we feel that logging can be a very beneficial addition to your quality assurance, scalability, and availability initiatives. We highly encourage you to read all the articles cited; both HighScalability and CodingHorror are on our must-read list of blogs that we subscribe to. As always, let us know what you think. I’m sure we have not heard the last of this great debate.