Proposal: JSON support in Filebeat #1069
Conversation
```go
@@ -86,6 +87,12 @@ type MultilineConfig struct {
	Timeout string `yaml:"timeout"`
}

type JsonDecoderConfig struct {
	OnUnmarshalError string `yaml:"on_unmarshal_error"`
```
I would like to see this field validated at startup to guarantee that it is `ignore`, `add_error_key`, or empty. This way the user gets an error instead of it just defaulting to `ignore`.
And maybe even take the value and run `strings.ToLower()` on it so that it's case insensitive.
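Taken together, the two suggestions above could be sketched as a startup check like the following. This is a hypothetical `Validate` method, not code from the PR: it lowercases the value and rejects anything other than `ignore`, `add_error_key`, or empty, so the user gets an error instead of a silent default.

```go
package main

import (
	"fmt"
	"strings"
)

// JsonDecoderConfig mirrors the struct added in this PR.
type JsonDecoderConfig struct {
	OnUnmarshalError string `yaml:"on_unmarshal_error"`
}

// Validate is a hypothetical startup check: normalize case and
// reject any value other than "ignore", "add_error_key", or empty.
func (c *JsonDecoderConfig) Validate() error {
	v := strings.ToLower(c.OnUnmarshalError)
	switch v {
	case "", "ignore", "add_error_key":
		c.OnUnmarshalError = v
		return nil
	}
	return fmt.Errorf("invalid on_unmarshal_error value: %q", c.OnUnmarshalError)
}

func main() {
	c := JsonDecoderConfig{OnUnmarshalError: "Add_Error_Key"}
	fmt.Println(c.Validate(), c.OnUnmarshalError)
}
```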
LGTM
Two notes from my side:

1. Instead of putting the keys at the top level, I would put them under a namespace, something like "message", which then contains the JSON document. In the case of application logging, not all keys are known in advance, meaning that if every engineer can add their own fields to a log message, it can happen that things are overwritten or lost in ways that were not expected. I would still offer the choice of using `json_under_root` (or similar) to put the fields on the top level. By default I would disable it.
2. The second thing I would allow is to set a mapping for the timestamp. I think the only thing which is common across all log messages is that they contain a timestamp. But the timestamp will not always be called `timestamp`. So if someone wants to overwrite the timestamp created by Filebeat, the following config can be chosen for the example log in this PR:
The question here is if the `@timestamp` field will just be overwritten or if Filebeat should try to convert it to a default format (I would suggest not to, as this only leads to problems).
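The config example referred to above was not captured in this excerpt. As a hypothetical sketch only (all key names illustrative, not the actual proposal), a namespaced decoder with a timestamp mapping could look something like:

```yaml
filebeat:
  prospectors:
    - paths: ["log.json"]
      json:
        under_root: false        # keep decoded keys under a "message" namespace
        timestamp_field: time    # map this JSON key onto @timestamp
```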
```go
if err != nil {
	logp.Err("Error decoding JSON: %v", err)
	if f.jsonDecoder.OnUnmarshalError == "add_error_key" {
		event["json_error"] = fmt.Sprintf("Error decoding JSON: %v", err)
```
Another option could be to introduce a general "error" field under `beat.error` that we could also use in the future for other errors. This would make it possible to then search for all log events with an error.
Question is, what if there are more errors on the same event? Do we keep only the latest one? Do we make it an array?
I'm not sure if there are cases with multiple errors, as I assume that every time we hit an error we would stop executing the next steps and send the event with the error. For example, if there is a multiline error we should not continue with JSON decoding.
I was thinking about the order of operations and how it interacts with filtering (and possibly other future filters). If the configuration were a bit more flexible then you could specify the order of operations in the config. You could also apply the same filter more than once if needed. Essentially you could compose a filter chain and apply it to published events. Consider configuration like this:

```yaml
# NOT REAL CONFIG, DO NOT COPY
filebeat:
  prospectors:
    - paths: ['log.json']
      filters:
        message_json:
          type: json
          field: message
        drop_debug_events:
          type: drop_event
          condition: level == DEBUG
        request_json_decode:
          # request was a string in the message that also needs decoding (a contrived example)
          type: json
          field: request
```

The filters and the code to build the filter chain from config could live outside of Filebeat so that all beats could take advantage of it as needed.
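The filter-chain idea above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation; the `Event`, `Filter`, and filter-constructor names are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Event is a published event; a Filter transforms it or drops it (returns nil).
type Event map[string]interface{}
type Filter func(Event) Event

// jsonFilter sketches the "type: json" filter: decode the JSON string found
// under field and merge its keys into the event. On decode errors the event
// is passed through unchanged.
func jsonFilter(field string) Filter {
	return func(e Event) Event {
		s, ok := e[field].(string)
		if !ok {
			return e
		}
		var decoded map[string]interface{}
		if err := json.Unmarshal([]byte(s), &decoded); err != nil {
			return e
		}
		for k, v := range decoded {
			e[k] = v
		}
		return e
	}
}

// dropEventFilter sketches "type: drop_event" with a level condition.
func dropEventFilter(level string) Filter {
	return func(e Event) Event {
		if l, _ := e["level"].(string); strings.EqualFold(l, level) {
			return nil // event is dropped
		}
		return e
	}
}

// runChain applies filters in config order, stopping once an event is dropped.
func runChain(e Event, filters ...Filter) Event {
	for _, f := range filters {
		if e = f(e); e == nil {
			return nil
		}
	}
	return e
}

func main() {
	e := Event{"message": `{"level":"DEBUG","msg":"x"}`}
	out := runChain(e, jsonFilter("message"), dropEventFilter("DEBUG"))
	fmt.Println(out == nil) // the decoded DEBUG event is dropped
}
```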
Order of operations I was thinking about:
@urso In case the JSON document is spread across multiple lines, wouldn't it be necessary to do multiline before JSON parsing?
For me it's a new prospector/harvester type, parsing content as it arrives. Not line based, but JSON-document based. No line limits, e.g. support for multiple events in one line, an event split among multiple lines, or the next event starting on the same line where the last one ended in a multiline scenario.
I see, so start and end of an event would always be defined by `{ ... }`?
@ruflin yes
What about going with the simple implementation first, which only supports clean JSON on one line (this PR), and then going into more complex and powerful solutions in a second step?
I'd like to be a little conservative here and not go with this PR. I'd rather see another input_type for JSON-based logs instead of adding to the current harvester type. On the other hand, I'm fine with a simple solution first, but as another prospector type, so as not to generate too much technical debt.
I kind of think we'll need both. For Docker support we'll need to do the JSON decoding twice, because the app inside the container might also use structured logging. I guess this is also why @andrewkroh proposed the fully configurable option, but I worry about the complexity there. So the order of operations for structured logs from a Docker app would be:

Like @urso, I see step 2 as a new harvester type, but step 5 can be a simple addition to the current one. This PR is essentially adding step 5. One challenge we'll have when adding a new harvester type is that it will need to duplicate a lot of functionality from the current one. I can see the argument of not adding more stuff to the current harvester until we have figured this out. On the other hand, the change is small enough that I don't worry that much.
In the Docker case I'd even do:

Maybe multiline and line filtering are not required at all for the JSON logger. The difference is that for line/multiline, the line is our event; but for JSON, the JSON object itself is our event.
@urso I thought of a problem with doing the JSON decoding as a stream, rather than assuming one document per line. If there's any invalid JSON in any of the messages, the whole file is lost. One JSON document per line seems more robust to me in general. I wonder if there are any systems that output JSON, but not in a line-by-line fashion.
Closing in favor of #1143.
Played with this while in the air. Basic support for JSON in Filebeat seems pretty easy to add, although I didn't yet add any tests as I first want to check some design choices with the team.
Design choices
Order of operations
With this PR, Filebeat will apply transformations in the following order on each log line. The JSON decoding is at step 4:
Config
This block can be placed per prospector:
The JSON decoder is enabled if the section is un-commented. In case of unmarshaling errors from JSON, you can choose between ignoring the error, which means that the event will still be published with the original `message` key, and adding a dedicated `json_error` key with information about the error. Default is `ignore`.

The `overwrite_keys` setting decides what to do if a top-level key from the decoded JSON already exists in the dictionary (things like `@timestamp`, `message`, `offset`, etc.). Default is `false`.

If set to `true`, the `keep_original` setting allows removing the `message` key in case of successful unmarshaling. In case of errors, the original `message` key is always kept.

Pinging @ruflin @urso @andrewkroh @monicasarbu
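The config block itself was not captured in this excerpt. A hypothetical sketch, reconstructed from the three settings described above (the section name and path are guesses, only the setting names come from the PR text):

```yaml
filebeat:
  prospectors:
    - paths: ["/var/log/app.json"]
      # Hypothetical layout; un-commenting the section enables the decoder.
      json_decoder:
        on_unmarshal_error: ignore   # or: add_error_key
        overwrite_keys: false
        keep_original: true
```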