Trialling the ELK stack from Elastic

As part of my ongoing research into big-data visualisation and infrastructure and application management tools it came to give ELK a test run to check if it’s suited to my needs. I’ve already looked at a few others (which I will detail in another article), and so far haven’t found somethings suitable for a SMB to collate and process both “live” information from applications, e.g. from existing databases or APIs, combined with “passive” information from log files.

Some of the applications I work with are modifiable so we can take advantage of APIs to push event-driven data out to analytics or monitoring platforms, but some legacy components are just too hard to upgrade thus log-scraping could be the only viable option. I already use tools at various stack levels such as New Relic (APM), Datadog/AWS (infrastructure), Google Analytics (web/user) and my special interest: custom business-event monitoring. Ideally these various sources could be combined to produce some extremely powerful emergent information, but the tools at these different levels are often specific to the needs of that level such as hardware metrics vs. user events and thus difficult to integrate.

My Experiences Trialling Elastic 

It seemed from an initial look that the Elastic stack was flexible and agnostic enough to be able to provide any of these aspects. But would it be a jack-of-all and master-of-none?

To give it a go, I looked at the three main components separately at first. Simply put:

  1. LogStash – data acquisition pipeline
  2. Elastic Search – search engine & API
  3. Kibana – visualisations dashboard

The provide hosted services but I didn’t feel like committing just yet and didn’t want to be rushed in a limited-time trial, so I downloaded and installed the servers locally. I mentally prepared myself for hours of installation and dependency hell after my experiences with Graphite and Datalab.

But – these were my only notes during the fantastically quick set-up:

  • Too easy to setup, built in Java,  it just ran on my macbook!
  • Tutorial: input from my local apache logs files -> elasticsearch, processed really quick!
  • Logstash Grok filter would be key to parsing our log files…

I just unzipped it and ran it and it worked. I know that’s how the world should work, but this is the first time I’ve experienced that for years.

Interest piqued, I decided to run through the tutorials and then move on to setting up a real-world log import scenario for a potential client. I noted down the things I discovered, hopefully they will help other people on a similar first journey. At least it will help me when I return to this later and have predictably forgotten it all.

LogStash – data acquisition pipeline

I ran through the Apache logs tutorial, after completing the basic demos.

The default index of logstash-output-elasticsearch is “logstash-%{+YYYY.MM.dd}”, this is not mentioned in the tutorials. Thus all Apache logs are indexed under this, hence the default search like http://localhost:9200/logstash-2016.07.11/_search?q=response=200

I don’t think this will be useful in reality – having an index for every day, but I guess we’ll get to that later. Furthermore the timestamp imported is today’s date, i.e. the time of the import, not the time parsed from the logs. [I will address this later, below]

Interesting initial API calls to explore:

http://localhost:9200/_cat/indices?v – all (top level) indices, v = verbose, with headers.

http://localhost:9200/_cat/health?v   – like Apache status

http://localhost:9200/_cat/ – all top level information

Grok – import parser

Grok is one of the most important filter plugins– enabling you to parse any log file format, standard or custom. So I quickly tried to write an info trace log grok filter for some legacy logs I often analyse manually and thus know very well. This would make it easier to evaluate the quality and depth of the tools – “how much more will these tools let me see?”

My first noobish attempt was an “inline” pattern in the Grok configuration. A toe in the water.

filter {
   grok {

        # example 01/03/2016 23:59:43.15 INFO: SENT: 11:59:43 PM DEVICE:123:<COMMAND>
        match => { "message" => "%{DATESTAMP:date} %{WORD:logType}: %{WORD:direction}: %{TIME} %{WORD:ampm} DEVICE:%{INT:deviceId}:%{GREEDYDATA:command}"}
   }
}

I found it hard to debug at first: it wasn’t importing but I saw no errors. This was because it was working! It took me a little while to get the trial-and-error configuration debug cycle right. Tips:

  • This grok-debug tool was good.
  • This one also.
  • Core Grok patterns reference is vital
  • A regex cheat sheet also helps, as Grok is built on it.
  • Start Logstash with -v to get verbose log output, or even more with –debug
  • Restart logstash when you make config changes (duh)
  • The config is not JSON. Obvious but this kept catching me out because most other aspects you’ll need to learn simultaneously are in JSON. (what’s with the funny PHP-looking key => values, are they a thing?)

OK. I got my custom log imported fairly easily – but how to I access it?

Elastic Search – search engine & API

During my first 5 minutes going through the ES tutorials and I noted:

  • Search uses a verbose JSON DSL via POST (not so handy for quick browser hackery)
  • However I found you can do quick GET queries via mini-language
  • Search properties: http://localhost:9200/logstash-2016.07.12/_search?q=deviceId:1234&response=200&pretty
  • To return specific properties e.g.: &_source=logType,direction,command
  • Scores – this is search-engine stuff (relevance, probabilities, distances) as opposed to SQL “perfect” responses.
  • Aggregates (like GROUP) for stats, didn’t try them, POST only and I’m lazy today, but they look good. Would probably explore them more in Kibana

Great – the REST API looks really good, easily explorable with every feature I had hoped for and more. The “scores” aspect made me realise that this isn’t just a data API, this is a proper search engine too, with interesting features such as fuzziness and Levenshtein distances. I hadn’t really thought of using that – from a traditional data accuracy perspective this seemed all a bit too gooey, but perhaps there will be a niche I could use it for.

Kibana – for visualisations

  • Download “installed” tar.gz – again worked perfectly, instantly.
  • Ran on http://0.0.0.0:5601, it set itself up
  • Created default index pattern on logstash-* and instantly saw all the data from above import.

So again great, this was “too easy” to get up and running, literally within 5 minutes I was exploring the data from the Apache tutorial in wonderful technicolor.

So after a broad sweep I was feeling good about this stack so I felt it was time to go a bit deeper.

Mappings

It seems the next (it should have been first!) major task is to map the fields to types, else they all end up as fieldname.raw as strings.

But… you cannot change mappings on existing data – so you must set them up first! However you can create a new index, and re-index the data… somehow, but I found it easier to start again for the moment.

I couldn’t figure out (or there isn’t) a mini-language for a GET to create mappings, so I used the example CURL commands which weren’t as annoying as I’d thought they’d be – except I do use my browser URL history as my memory. It’s just a bit harder to re-edit and re-use SSH shell histories, than in-browser URLs.

curl -XPUT http://localhost:9200/ra-info-tracelog -d '
{
 "mappings": {
   "log": {
     "properties": {
       "date": { "type" : "date" },
       "deviceId": { "type" : "integer" }
       }
     }
   }
 }
}
';

 

Getting the date format right…

The legacy server, from which these logs came, doesn’t use a strict datetime format (sigh), and  Logstash was erroring.

# example 01/03/2016 23:59:43.15

Initially I tried to write a “Custom Pattern” but then I found the Grok date property format should be able to handle it, even with 2 digits of microseconds (default is 3 digits). To figure this out, I had to manually traverse the tree of patterns in the core library from DATESTAMP down through its children. This was actually a good exercise to learn how the match patterns work – very much like an NLP grammar definition (my AI degree is all coming back to me).

So why is Grok erroring when the pattern is correct?

It took me a while to realise it’s because the Grok DATESTAMP pattern is just a regexp to parse the message data into pieces but is more permissive than the default date field mapping specification in the subsequent Elasticsearch output stage. So it tokenises the date syntactically, but it’s the field mapping which then tries to interpret it semantically which fails.

OK, so I felt I should write a custom property mapping to accommodate the legacy format.

    "date": {
        "type" : "date" ,
        "format": "dd/MM/yyyy HH:mm:ss.SS"
    },

Mild annoyance alert: To do these changes I had to keep re-creating the indexes and changing the output spec, restarting Logstash and changing my debug query URLs. So it’s worth learning how to re-index data, or (when doing it for real) get this right first in an IA scoping stage.

Tip: I debugged this by pasting very specific single logs one line at a time into the test.log file which Logstash is monitoring. Don’t just point it a huge log file!

So many formats!

The date mappings are in yet another language/specification/format/standard called Joda. At this point I started to feel a little overwhelmed with all the different formats you need to learn to get a job done. I don’t mind learning a format, but I’m already juggling three or four new syntaxes and switching between them when I realise I need to move a filtering task to a different layer is an off-putting mix of laborious and confusing.

For example I just learned how to make optional matches in Grok patterns, but I can’t apply it here and do “HH:mm:ss.SS(S)?” to cope with log-oddities, which is a frustrating dead-end for this approach. So I have to look again at all the other layers to see how I can resolve this with a more flexible tool.

OK, once the date mapping works… it all imports successfully.

Creating Kibana visualisation

To use this time field you create a new index pattern in Kibana>Settings>Indices and select the “date” field above as the “Time-field name”, otherwise it will use the import time  as the timestamp – which won’t be right when we’re importing older logs. (It will be almost right if logs are being written and processed immediately but this won’t be accurate enough for me).

I loaded in a few hundred thousand logs, and viewed them immediately in Kibana… which looks good! There are immediately all the UI filters from my dreams, it looks like it will do everything I want.

But there’s an easier way!

The Date filter allows you to immediately parse a date from a field into the default @timestamp field. “The date filter is especially important for sorting events and for backfilling old data.” which is exactly what I’m doing.

So I made a new filter config:

filter {
    grok {
        match => { "message" => "%{DATESTAMP:date} ... “}
    }
    date {
        match => ["date", "dd/MM/yyyy HH:mm:ss.SS"]
    }
}

And it turned up like this (note: without using the field mapping above, so the redundant “date” property here is just a string). Also note the UTC conversion, which wqs a little confusing at first especially as I unwittingly chose a to test a log across the rare 29th of Feb!

    "@timestamp" : "2016-02-29T13:00:23.600Z",
    "date" : "01/03/2016 00:00:23.60",

The desired result was achieved: this showed up in Kibana instantly, without having to specify a custom time-field name.

I got it wrong initially, but at least that helped me to understand what these log-processing facilities are saving you from having to do later (possibly many times).

Making more complex patterns

The legacy log format I’m trialling has typical info/warning/error logs, but each type also has a mix of a few formats for different events. To break down these various log types, you need to implement a grammar tree of expressions in custom Grok Patterns.

The first entries should be the component “elements” such as verbose booleans or enumerations

    ONOFF (?:On|Off)
    LOG_TYPE (?:INFO|WARNING|ERROR|DEBUG).

If a log file has various entry types, like sent/received, connection requests and other actions then the next entries should be matching the various log-line variants composed of those custom elements and any standard patterns from the core libraries.
e.g.

# e.g. 01/02/2016 01:02:34.56 INFO: Connection request: device:1234 from 123.45.56.78:32770 ()
INFO_CONNECTION_REQUEST %{DATESTAMP:date} %{LOG_TYPE:logType}: Connection request: device:%{INT:deviceId} from %{HOSTPORT:deviceIP} \(%{HOSTNAME}\)

Then finally you have one log-line super-type which matches any log-line variants
e.g.:

INFO_LINE %{INFO_COMMAND}|%{INFO_CONNECTION_REQUEST}|%{INFO_CONNECTION_CLOSED}|%{INFO_STATUS}

Again the tools mentioned above were crucial in diagnosing the records which arrive in Elasticsearch tagged with _grokparsefailure while you are developing these patterns.

For annoyingly “flexible” legacy log formats I found these useful:

  • Optional element: ( exp )?          e.g. (%{HOSTNAME})?
  • Escape brackets:    \(                     e.g. \(%{INT:userId}\)
  • Non-capturing group: (?: exp ) e.g. (?:INFO|WARNING|ERROR|DEBUG)

Differentiating log variants in the resulting data

Next I wanted to be able to differentiate which log-line variant each log had actually matched. This turned out to be harder than I had thought. There doesn’t seem to be a mechanism within the regular-expression matching capabilities of the Grok patterns such as to be able to “set a constant” when a specific pattern matches.

The accepted method is to use logic in the pipeline configuration file plus the abilities to add_tags or add_fields in the Grok configuration. This approach is sadly a bit wet (not DRY) as you have to repeat the common configuration options for each variant. I tried to find other solutions, but currently I haven’t resolved the repetitions.
e.g.

grok {
    match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
    patterns_dir => ["mypatterns"]
    add_tag => [ "connection" ]
}
grok {
    match => { "message" => "%{RA_INFO_LINE}" }
    patterns_dir => ["mypatterns"]
}

However this can also result in a false _grokfailure tag, because the two configurations are run sequentially, regardless of a match. So if the first one matches, the second will fail.

One solution is to use logic to check the results of the match as you progress.

    grok {
        match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
        patterns_dir => ["mypatterns"]
        add_tag => [ "connection" ]
    }
    if ("connection" not in [tags]) {
        grok {
            match => { "message" => "%{INFO_LINE}" }
            patterns_dir => ["mypatterns"]
        }
    }

This works well, and for these log-line variants, I’m now getting a “connection” tag, which can enable API queries/Kibana to know to expect a totally different set of properties for items in the same index. I see this tag as a kind of “classname” – but I don’t know yet if I’m going down the right road with that OO thought train!

    "@timestamp" : "2016-02-29T13:00:25.520Z",
    "logType" : [ "INFO", "INFO" ],
    "deviceId" : [ "123", "123" ],
    "connection_age" : [ "980", "980" ],
    "tags" : [ "connection" ]

Another method is to “pre-parse” the message and only perform certain groks for specific patterns. But again it still feels like this is duplicating work from the patterns.

    if [message] =~ /took\s\d+/ { grok { ... } }

Even with the conditional in place above, the first filter technically fails before the second one succeeds. This means the first failure will still add a “_grokparsefailure” tag to an eventually successful import!

The final workaround is to manually remove the failure tags in all but the last filter:

    grok {
        match => { "message" => "%{INFO_CONNECTION_CLOSED}" }
        add_tag => [ "connection" ]
        # don't necessarily fail yet...
        tag_on_failure => [ ]
    }

So while I am still very impressed with the ELK stack, I am starting to see coping with real-world complexities isn’t straightforward and is leading to some relatively hacky and unscalable techniques due to the limited configuration language. It’s these details that will sway people from one platform to another, but it’s difficult to find those sticking points until you’ve really fought with it – as Seraph so wisely put it.

Loading up some “big” data

Now I was ready to import a chunk of old logs and give it a good test run. I have a lot – a lot – of potential archive data going back years. It seemed to import fairly quickly even on my old 6yr-old MacBookPro Logstash chewed 200,000 logs into Eleasticsearch in a few minutes. (I know this is tiny data, but it’s all I had in my clipboard at the time.) I’m looking forward to testing millions of logs on a more production tuned server and benchmarking it with proper indexing and schema set up.

Heading back to Kibana, I was able to explore the data more thoroughly now it’s a bit more organised. The main process goes through:

  1. data discovery
  2. to making visualisations
  3. and then arranging them on dashboards.

This process is intuitive and exactly what you want to do. You can explore the data by building queries with the help of the GUI, or hand-craft some of it with more knowledge of the Elasticsearch API, then you can save these queries for re-use later in the visualisation tools.

Even in the default charts, I instantly saw some interesting patterns including blocks of missing data which looked like a server outage, unusual spikes of activity, and the typical camel-humps of the weekend traffic patterns. These patterns are difficult to spot in the raw logs, unless you have Cipher eyes.

I had a quick look at the custom visualisations, particularly the bar-charts, and found you can quite easily create sub-groups from various fields and I started to realise how powerful the post-processing capabilities of Kibana could be in slicing up the resulting data further.

Summary thoughts

In summary I feel the ELK stack can certainly do what I set out to achieve – getting business value out of gigabytes of old logs and current logs without having to modify legacy servers. I feel it could handle both infrastructure level monitoring and the custom business-events both stored in logs and fired from our APIs and via MQs.

The component architecture and exposed REST API is also flexible enough to be able to easily feed into other existing data-processing pipelines instead of Kibana, including my latest pet-project Logline which visualises mashups of event-driven logs from various sources using the Vis.org Timeline.

Next steps

I feel now I’m ready to present this back to the folks at the organisations I consult for and confidently offer it as a viable solution. It offers tools for building a business intelligence analysis platform and with the addition of the monitoring tools such as Watcher potentially bring that post-rational intelligence into real-time.

Beyond that – the next step could even be predicting the future, but that’s another story.

This entry was posted in articles and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *