Posts Tagged ‘logging’

Splunk

Monday, September 21st, 2009

This time we have a guest post from a longtime friend and colleague, Chris Lalonde. During a conversation a couple of weeks ago, Chris told us what he was doing with a product called Splunk to provide monitoring, alerting, and visibility into log files. Given the importance that we place on logging and monitoring, what Chris was doing sounded interesting, so we asked him if he would be willing to share a little about his implementation. Chris has over 15 years of experience in information technology and has provided technology solutions for Fortune 1000 companies such as eBay and TD Bank, as well as government agencies like the DoD, RCMP and USSS. He holds a bachelor of mechanical engineering with a concentration in robotics from Carleton University. Chris also has three patents for authentication systems and has several others pending. He was the recipient of the Director’s Award from the United States Secret Service. And now from Chris:

Having worked in technology for 15+ years, I understand how challenging it can be to get visibility across your entire platform. Apparently the folks who started Splunk understood this as well. What is Splunk? It’s a tool you can use to collect data from virtually any server in virtually any format and carry out alerting, ad hoc searches and large-scale reporting.

You install the Splunk client on each of the servers you want to monitor. Currently Splunk supports Windows, Linux, OS X, Solaris, BSD and AIX, so the vast majority of platforms are covered. The client typically installs in a few minutes, and if you spend a little time pre-configuring things it’s easy to automate the install. Once it’s installed you have a couple of options: you can either run a full server on each machine, or turn the clients into lightweight forwarders and consolidate your logs on a central server. I’d recommend the latter, since it gives you the log aggregation you more than likely want and the lightweight forwarders use fewer resources than the full-blown server. Additionally, you can point traditional data sources at Splunk. Have lots of syslogs running around? Why not point them at your central Splunk server and get them indexed and searchable as well?
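
To make the "point your syslogs at a central server" idea concrete, here is a minimal sketch of an application shipping its own log lines to a central collector over syslog. It is not Splunk-specific: it only uses Python's standard `logging.handlers.SysLogHandler`, and it assumes the central server has been configured to listen for syslog over UDP on the port you pass in. The function name and the host in the usage comment are hypothetical.

```python
import logging
import logging.handlers


def build_central_logger(host, port, name="app"):
    """Return a logger that sends records to a syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default, so a slow collector cannot block us.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Usage (hypothetical collector address):
#   log = build_central_logger("logs.example.internal", 514)
#   log.info("user signed in")
```

UDP is a deliberate choice in this sketch: delivery is not guaranteed, but the application never waits on the log collector.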

Once you’ve got Splunk installed on your servers the uses are pretty much endless. The simplest and most obvious is system/log monitoring and alerting. You can just point Splunk at a log directory and it will automatically analyze and index all the files in that directory. Splunk will also attempt to parse the elements of the files so you can search on virtually every element of a file. Need to know how many times that URL was called with variable “foo” in it? Just go into the search app and type in something like “http://” “foo” and the results are shown in two ways: 1) in a graphical timeline showing the count per unit of time (minutes, hours, days) and 2) in detail below the graph, where you can expand a result and see the exact section of the log file it came from.
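
For readers who want to see what that kind of search amounts to, here is a rough sketch in plain Python of "find lines containing all of these terms and count them per hour". This is not how Splunk works internally and it is not Splunk's search language; the function name and the assumed line format (an ISO-8601 timestamp followed by a space) are illustrative only.

```python
from collections import Counter
from datetime import datetime


def count_matches_per_hour(lines, terms):
    """Count lines containing every term, bucketed by hour.

    Assumes each line starts with an ISO-8601 timestamp, for example
    '2009-09-21T14:05:09 GET http://example.com/?foo=1'.
    """
    buckets = Counter()
    for line in lines:
        if not all(term in line for term in terms):
            continue
        stamp = line.split(" ", 1)[0]
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        buckets[when.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(buckets.items()))
```

Calling `count_matches_per_hour(open("access.log"), ["http://", "foo"])` on a file in that format would give the per-hour counts that the timeline graph plots.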

That’s the simple version, so let’s try something more interesting. Say you automatically roll code to your site and you’d like a way to verify that the code rolling tool did its job. Just point Splunk at the directory your code rolls to and configure it to monitor changes to that directory. Once the code is in place, Splunk will see the file changes. Now you can either manually search for those files or have Splunk send you an alert with the list of files that have changed in those directories.
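
The underlying idea, comparing what is in the deploy directory before and after a code roll, can be sketched in a few lines of Python. This is an illustration of the technique only, not a description of how Splunk detects changes; the function names are made up for this example.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Map each file's relative path to a SHA-256 digest of its contents."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return files added, removed, or modified between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(
        path for path in set(before) & set(after) if before[path] != after[path]
    )
    return {"added": added, "removed": removed, "modified": modified}
```

Take one snapshot before the roll and one after, and `changed_files` gives you the list you would want in the alert. Hashing contents rather than comparing timestamps means a file that was rewritten with identical bytes is not reported as changed.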

Not enough? Splunk has a PCI compliance suite that covers all twelve PCI DSS requirements and all 228 sub-requirements, including live controls monitoring, process workflow, checklists and reporting. How about indexing firewall logs? Yes. How about sending those IDS logs to Splunk? Sure, it’ll swallow those as well. Would you like to get an alert when, say, you suddenly get a large increase in logs from a firewall or an IDS? Sure, no problem.
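
The "sudden large increase in logs" alert boils down to comparing the current event count against a recent baseline. Here is a minimal sketch of that logic; the function name and both thresholds are assumptions chosen for illustration, not Splunk defaults.

```python
from statistics import mean


def is_volume_spike(history, current, factor=3.0, min_events=100):
    """Return True when `current` greatly exceeds the recent average.

    history: event counts for previous intervals (for example, per 5 minutes).
    factor and min_events are illustrative thresholds, not Splunk defaults.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    # min_events keeps a quiet source (1 event, then 5) from raising alerts.
    return current >= min_events and current > factor * baseline
```

For example, `is_volume_spike([100, 120, 110], 900)` flags a spike, while a rise to 150 events in the same interval does not.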

OK, that’s all great for the system administrators, the security folks, and build and release, but how about the network folks? Sure, Splunk has you covered as well. Have some Cisco gear? Splunk has a Cisco suite that covers that. How about something for the DBAs? Yes, MySQL and Oracle are covered as well. Again, all this data is indexed, so it is not only searchable, but you can also create custom dashboards and alerts off of all that data.

You are probably saying, “Yes, I’m sure you can do that, but it’s probably a nightmare to configure.” In fact nothing could be further from the truth. You can use the UI to configure a report in less than 5 minutes. If you don’t like to use a GUI, the same thing can be done via the CLI in about the same amount of time.

OK, now for the bad news. Splunk is free for the first 500MB of indexes per day; after that it starts costing money. The free version is also missing some of the reporting extras. 500MB sounds like a lot until you actually start indexing things. By default Splunk indexes many, many things that you might not want, so you need to be very clear about what you do and do not want it to index. There are several ways of keeping the indexes under control, such as aging data out more quickly or limiting the size of the indexes. Currently I’m restricting index sizes so that I’m keeping about 30 days’ worth of data, and I haven’t had any issues.
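
Before you decide what to index, it helps to do the arithmetic. The sketch below adds up estimated daily log volume per server and compares it with the 500MB daily figure mentioned above. The server names and volumes in the usage note are made-up numbers, and the retention estimate is raw volume only, so treat the output as a rough planning aid.

```python
FREE_LIMIT_MB = 500  # daily indexing allowance cited in the post


def daily_volume_mb(per_server_mb):
    """Total megabytes indexed per day across all servers."""
    return sum(per_server_mb.values())


def budget_report(per_server_mb, limit_mb=FREE_LIMIT_MB, retention_days=30):
    """Summarize daily volume against the limit and the data kept on disk."""
    total = daily_volume_mb(per_server_mb)
    return {
        "daily_mb": total,
        "over_limit": total > limit_mb,
        "headroom_mb": limit_mb - total,
        "retained_mb": total * retention_days,  # raw volume, ignores compression
    }
```

With hypothetical volumes of `{"web1": 120, "web2": 130, "db1": 300}`, the report shows 550MB a day, which is 50MB over the allowance. Spread evenly, 500MB a day across 50 servers is only 10MB per server, which is why being selective about what gets indexed matters.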

Having said all of this, I haven’t yet explored the limits of Splunk. In my current configuration I’m only indexing data from about 50 servers, and I’ve not run into any issues. I should note that Splunk is designed to scale horizontally, and I know of people indexing data from thousands of boxes, so I’m not expecting there to be a scale issue. In my experience the data is indexed in less than 5 minutes, so I am currently using Splunk as part of our monitoring and alerting system, and it has helped identify issues that would otherwise have been hidden in the logs.

Splunk saved me from having to build a centralized logging infrastructure plus all the tools one needs to monitor and manage that infrastructure. It’s allowed me to instantly search across all our logs and systems to identify system issues, DB issues, and code issues. Something that would have taken me weeks to build took me 1 day to install and get working on 30 boxes, and within 1 week I’d found critical events that saved my company $$.

NOTE: Thanks to Steve from Splunk for identifying that it is 500MB of indexes per day.

To Log or Not To Log?

Monday, December 8th, 2008

That is the question that has caused debate for many years among operations and engineering staffs. We’ve recently read a couple of very well written and well thought out articles on this topic and wanted to offer our ideas on the debate. The first article is by Todd Hoff from HighScalability.com, who advocates in Log Everything All the Time, as the title implies, that everything should be logged for potential use. Todd has another article describing Facebook’s open source Scribe, Product: Scribe – Facebook’s Scalable Logging System, where he observes that Facebook must agree with his logging approach, given that they developed this product. The other article, The Problem With Logging by Jeff Atwood of CodingHorror.com, argues for a more tempered approach. Jeff summarizes his position as “Start small and simple, logging only the most obvious and critical of errors.”

Our position is squarely in the camp of log everything, but with a few caveats. These ignore-at-your-application’s-peril cautions are: 1) logging must not impede the performance of the application, 2) use a common framework, and 3) look at the data. Let’s go through these one at a time.

1) Logging must not impede the performance of the application – As Jeff points out, “logging isn’t free,” and we agree with that, but we would add that the potential benefit of the data outweighs the resource cost unless it negatively affects performance. Get ready for one of our repeating themes: do it if the BUSINESS benefit of logging outweighs the cost of logging. Most web and application servers are not fully utilized, because most teams don’t know precisely the performance parameters and resource constraints of their application, especially as it changes with each release. If you are fortunate enough to be in an organization that really understands the bottlenecks and performance of the application on specific hardware, more than likely a single resource is the bottleneck, e.g. memory, I/O, or CPU. Your logging service should not put further demand on a constrained resource; all surpluses are fair game. And it should go without saying that all logging must be done asynchronously. Losing a log event is acceptable, but impacting a transactional event is not.
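
One way to honor both halves of that rule (log asynchronously, and prefer dropping a log event over slowing a transaction) is a bounded queue drained by a background thread. The sketch below uses Python's standard `QueueHandler` and `QueueListener`; the `DroppingQueueHandler` class, the `start_async_logging` helper, and the queue capacity are our own illustrative choices, not part of any particular framework.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Queue records without blocking; drop them when the queue is full."""

    dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1  # losing a log event is acceptable


def start_async_logging(target_handler, capacity=10_000, name="app"):
    """Return (logger, listener); the listener writes on a background thread."""
    log_queue = queue.Queue(maxsize=capacity)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(DroppingQueueHandler(log_queue))
    logger.propagate = False  # keep records off the root logger's handlers
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener
```

The request thread only pays for putting a record on an in-memory queue; the slow work of writing to disk or the network happens on the listener's thread. Call `listener.stop()` at shutdown to flush whatever is still queued, and watch the `dropped` counter to see whether the capacity is too small.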

2) Use a common framework – Choose or build a common framework that is used throughout the application and that includes common definitions. Just as Priority and Severity are defined for bugs, logging definitions must be determined and adhered to. Code reviews are a way to ensure common usage. Data being sent to five different files in different formats defeats the purpose of logging; common usage, format, gathering, and analysis are where the payoff is realized.
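
As a small illustration of what a common framework can look like, here is a sketch of a shared logging module: every component asks for its logger from one place, and every record comes out with the same fields in the same format. The field names, the JSON-per-line format, and the `get_logger` helper are assumptions made for this example; the point is that they are defined once and reused everywhere.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render every record with the same fields, one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "severity": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        })


def get_logger(component, stream=None):
    """Return a logger for `component` that uses the shared format."""
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False
    return logger
```

A team would then write `log = get_logger("checkout")` and `log.warning("card declined for order %s", order_id)` everywhere, and the severity names come from one shared definition rather than from each developer's preference.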

3) Look at the data – When you log tons of data, and when we say tons think of Scribe, which claims to handle tens of billions of messages each day, looking at it all is completely overwhelming. But looking at it through some mechanism, automated or manual, is mandatory for the benefit to be gained. As Todd points out, there are products like Hadoop to help process the data into viewable and actionable information. Jeff makes the point that “the more you log, the less you find,” but our point is that by the time you know you have a problem and need to inject logging, you’re too late. Proper logging and analysis of the data will identify problems early and make diagnosis easier. We think products such as SCAMP application monitoring software are excellent for creating an easy way of seeing inside the application.
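
Even a very simple automated pass over the logs is better than none. As a sketch, the function below counts lines by severity and lists the most frequent error messages. It assumes a hypothetical line format of `SEVERITY component: message`; at real volumes you would use a tool built for this, but the shape of the analysis is the same.

```python
from collections import Counter


def summarize(lines, top=3):
    """Count lines per severity and list the most frequent error messages.

    Assumes lines look like 'SEVERITY component: message'.
    """
    by_severity = Counter()
    errors = Counter()
    for line in lines:
        severity, _, rest = line.partition(" ")
        by_severity[severity] += 1
        if severity == "ERROR":
            errors[rest] += 1
    return dict(by_severity), errors.most_common(top)
```

Run daily, a summary like this turns an unreadable pile of lines into a short list of the errors that are growing, which is exactly the early warning we are arguing for.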

 

As long as you avoid the pitfalls stated above, we feel that logging can be a very beneficial addition to your quality assurance, scalability, and availability initiatives. We highly encourage you to read all the articles cited; both HighScalability and CodingHorror are on our must-read list of blogs that we subscribe to. As always, let us know what you think. I’m sure we have not heard the last of this great debate.