This time we have a guest post from a long-time friend and colleague, Chris Lalonde. During a conversation a couple of weeks ago, Chris told us what he was doing with a product called Splunk to provide monitoring, alerting, and visibility into log files. Given the importance that we place on logging and monitoring, what Chris was doing sounded interesting, so we asked him if he would be willing to share a little about his implementation. Chris has over 15 years of experience in information technology and has provided technology solutions for Fortune 1000 companies such as eBay and TD Bank, as well as government agencies like the DoD, RCMP, and USSS. He holds a bachelor of mechanical engineering with a concentration in robotics from Carleton University. Chris also holds three patents for authentication systems and has several others pending, and he was the recipient of the Director’s Award from the United States Secret Service. And now from Chris:
Having worked in technology for 15+ years, I understand how challenging it can be to get visibility across your entire platform; apparently the folks who started Splunk understood this as well. What is Splunk? It’s a tool you can use to collect data from virtually any server in virtually any format and then carry out alerting, ad-hoc searches, and large-scale reporting.
You install the Splunk client on each of the servers you want to monitor. Currently Splunk supports Windows, Linux, OS X, Solaris, BSD, and AIX, so the vast majority of platforms are covered. The client typically installs in a few minutes, and if you spend a little time pre-configuring things it’s easy to automate the install. Once it’s installed you have a few options: you can either run a full server on each machine or turn the clients into lightweight forwarders and consolidate your logs. I’d recommend the latter, since it gives you the log aggregation you more than likely want, and the lightweight clients use fewer resources than the full-blown server. You can also point traditional data sources at Splunk: if you have lots of syslogs running around, why not point them at your central Splunk server and get them indexed and searchable as well?
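As a rough sketch, the forwarder setup on each monitored host looks something like this (the hostname, credentials, and monitored path below are placeholders, and the exact steps can vary by Splunk version and install location):

```shell
# On each monitored host: start the forwarder, point it at the
# central Splunk indexer, and tell it which logs to watch.
cd /opt/splunkforwarder/bin
./splunk start --accept-license
./splunk add forward-server splunk-central.example.com:9997 -auth admin:changeme
./splunk add monitor /var/log
./splunk enable boot-start
```

With that in place, the forwarder ships everything under /var/log to the central indexer over port 9997, the conventional Splunk receiving port.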
Once you’ve got Splunk installed on your servers, the uses are pretty much endless. The simplest and most obvious is system/log monitoring and alerting. You can just point Splunk at a log directory and it will automatically analyze and index all the files in that directory. Splunk will also attempt to parse the elements of the files so you can search on virtually every element of your file. Need to know how many times that URL was called with variable “foo” in it? Just go into the search app and type in something like “http://” “foo”, and the results are shown in two ways: 1) in a graphical timeline showing the count per unit of time (hours, days, minutes), and 2) in detail below the graph, where you can expand each result and literally get the section of the specific log file that matched.
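To make that concrete, a search along those lines might look like the following; the sourcetype name is an assumption for a typical web access log, and the timeline view is what Splunk’s `timechart` command produces:

```
sourcetype=access_combined "http://" "foo" | timechart span=1h count
```

Dropping the `timechart` clause leaves you with the raw matching events, which is the detail view below the graph.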
That’s the simple version; let’s try something more interesting. Say you automatically roll code to your site and you’d like a way to verify that the code-rolling tool did its job. Just point Splunk at the directory your code rolls to and configure Splunk to monitor changes to that directory; once the code is in place, Splunk will see the file changes. Now you can either manually search for those files or have Splunk send you an alert with the list of files that have changed in those directories.
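A minimal sketch of that setup, assuming Splunk’s fschange file-system change monitor in inputs.conf (the path below is a placeholder for wherever your code rolls land):

```ini
# inputs.conf on the server receiving code rolls
# [fschange:...] watches a directory tree for adds/changes/deletes
[fschange:/var/www/myapp]
recurse = true
pollPeriod = 60
```

Each detected change becomes an indexed event, so the follow-up search or alert is just a query against that source.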
Not enough? Splunk has a PCI compliance suite that covers all twelve PCI DSS requirements and all 228 sub-requirements, including live controls monitoring, process workflow, checklists, and reporting. How about indexing firewall logs? Yes. How about sending those IDS logs to Splunk? Sure, it’ll swallow those as well. Would you like to get an alert when, say, you suddenly get a large increase in logs from a firewall or an IDS? No problem.
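As a sketch of that last case, a scheduled saved search in savedsearches.conf can fire an email whenever the event count crosses a threshold over a short window; the host pattern, threshold, and address here are all placeholders:

```ini
# savedsearches.conf: alert on a sudden spike in firewall/IDS events
[Firewall log spike]
search = host=fw-* sourcetype=syslog | stats count
dispatch.earliest_time = -5m
cron_schedule = */5 * * * *
counttype = number of events
relation = greater than
quantity = 1000
action.email = 1
action.email.to = ops@example.com
```

The same thing can be set up through the Manager UI without touching the config file.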
OK, that’s all great for the system administrators, security folks, and build-and-release teams, but what about the network folks? Splunk has them covered as well. Have some Cisco gear? Splunk has a Cisco suite that covers it. How about something for the DBAs? Yes, MySQL and Oracle are covered too. Again, all this data is indexed and is now not only searchable, but you can also create custom dashboards and alerts off of all that data.
You are probably saying, “Yes, I’m sure you can do that, but it’s probably a nightmare to configure.” In fact, nothing could be further from the truth. You can use the UI to configure a report in less than five minutes, and if you don’t like to use a GUI, the same thing can be done via the CLI in about the same amount of time.
OK, now for the bad news: Splunk is free up to the first 500MB of indexes per day; after that it starts costing money, and the free version is also missing some of the reporting extras. 500MB sounds like a lot until you actually start indexing things. By default, Splunk indexes many, many things that you might not want, so you need to be very clear about what you do and do not want it to index. There are several ways of dealing with the indexes, such as aging data out more quickly or limiting the size of the indexes. Currently I’m restricting index sizes so that I keep about 30 days’ worth of data, and I haven’t had any issues.
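For reference, both knobs live in indexes.conf; a minimal sketch, with illustrative numbers rather than what any particular site should run:

```ini
# indexes.conf: cap the index size and age events out after ~30 days
[main]
maxTotalDataSizeMB = 10000          # overall size cap on the index
frozenTimePeriodInSecs = 2592000    # 30 days, then data rolls to frozen
```

Whichever limit is hit first wins, so a size cap alone can quietly shorten your retention on busy days.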
Having said all of this, I haven’t yet explored the limits of Splunk. In my current configuration I’m only indexing data from about 50 servers, and I’ve not run into any issues. I should note that Splunk is designed to scale horizontally, and I know of people indexing data from thousands of boxes, so I’m not expecting a scaling issue. In my experience the data is indexed in less than five minutes, so I am currently using Splunk as part of our monitoring and alerting system, and it has helped identify issues that would have otherwise been hidden in the logs.
Splunk saved me from having to build a centralized logging infrastructure, plus all the tools one needs to monitor and manage that infrastructure. It has allowed me to instantly search across all our logs and systems to identify system issues, db issues, and code issues. Something that would have taken me weeks took me one day to install and get working on 30 boxes, and within one week I’d found critical events that saved my company $$.
NOTE: Thanks to Steve from Splunk for identifying that it is 500MB of indexes per day.