
Proposal: JSON support in Filebeat #1069

Closed · wants to merge 2 commits
Conversation

@tsg (Contributor) commented Feb 29, 2016

Played with this while in the air. Basic support for JSON in Filebeat seems pretty easy to add, although I didn't yet add any tests, as I first want to check some design choices with the team.

Design choices

  • Keys are copied top level into the document, to make it simple for the most common use cases (see the example after this list). There is some configuration flexibility in terms of what to do in case of field conflicts. See the Config section.
  • The decoding happens after line filtering and before enriching the event with metadata. See the Order of operations section.
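
As an illustration of the first point, here is a minimal sketch in Go of what copying the decoded keys to the top level would produce (the log line, timestamp, and offset are made up):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// A structured log line as it might appear in the file (made up).
	line := `{"level":"info","msg":"user logged in","user":"alice"}`

	// The event as it looks before JSON decoding (simplified).
	event := map[string]interface{}{
		"@timestamp": "2016-02-29T10:00:00.000Z",
		"message":    line,
		"offset":     int64(0),
	}

	// Copy the decoded keys to the top level of the document.
	var decoded map[string]interface{}
	if err := json.Unmarshal([]byte(line), &decoded); err == nil {
		for k, v := range decoded {
			event[k] = v // conflicts are handled per the Config section below
		}
	}

	out, _ := json.MarshalIndent(event, "", "  ")
	fmt.Println(string(out))
	// The event now has level, msg, and user next to @timestamp, message, offset.
}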

Order of operations

With this PR, Filebeat will apply transformations in the following order on each log line. The JSON decoding is at step 4:

  1. Encoding decode
  2. Multiline
  3. Line Filtering
  4. JSON decoding
  5. Add custom and metadata fields
  6. Generic filtering

Config

This block can be placed per prospector:

      json_decoder:
        on_unmarshal_error: ignore | add_error_key
        overwrite_keys: true | false
        keep_original: false

The JSON decoder is enabled if the section is uncommented. In case of unmarshaling errors, you can choose between ignoring the error, which means the event is still published with the original message key, and adding a dedicated json_error key with information about the error. The default is ignore.

The overwrite_keys setting decides what to do if a top-level key from the decoded JSON already exists in the dictionary (things like @timestamp, message, offset, etc.). The default is false.

Unless keep_original is set to true, the message key is removed after successful unmarshaling. In case of errors, the original message key is always kept.
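
For illustration, a minimal sketch of how these three settings could interact during decoding. The OverwriteKeys and KeepOriginal struct fields are assumptions here; only OnUnmarshalError appears in the diff further down.

// Package name and struct shape are assumptions for this sketch.
package jsondecoder

import (
	"encoding/json"
	"fmt"
)

type JsonDecoderConfig struct {
	OnUnmarshalError string // "ignore" or "add_error_key"
	OverwriteKeys    bool
	KeepOriginal     bool
}

// decodeJSON applies the json_decoder settings to a single event.
func decodeJSON(event map[string]interface{}, cfg JsonDecoderConfig) {
	msg, _ := event["message"].(string)

	var decoded map[string]interface{}
	if err := json.Unmarshal([]byte(msg), &decoded); err != nil {
		// On unmarshal errors the original message key is always kept.
		if cfg.OnUnmarshalError == "add_error_key" {
			event["json_error"] = fmt.Sprintf("Error decoding JSON: %v", err)
		}
		return
	}

	if !cfg.KeepOriginal {
		delete(event, "message") // drop the raw line after a successful decode
	}

	for k, v := range decoded {
		if _, exists := event[k]; exists && !cfg.OverwriteKeys {
			continue // keep existing keys such as @timestamp, offset, message
		}
		event[k] = v
	}
}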

Pinging @ruflin @urso @andrewkroh @monicasarbu

@tsg added the Filebeat label Feb 29, 2016
@@ -86,6 +87,12 @@ type MultilineConfig struct {
    Timeout string `yaml:"timeout"`
}

type JsonDecoderConfig struct {
    OnUnmarshalError string `yaml:"on_unmarshal_error"`

Member commented:

I would like to see this field validated at startup to guarantee that it is ignore, add_error_key, or empty. This way the user gets an error instead of it just defaulting to ignore.

Member commented:

And maybe even take the value and run strings.ToLower() on it so that it's case insensitive.
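
A minimal sketch of that kind of startup validation, including the lowercasing (hypothetical helper, not part of this PR):

package main

import (
	"fmt"
	"strings"
)

// validateOnUnmarshalError normalizes and checks the on_unmarshal_error value
// at startup, so that a typo fails fast instead of silently meaning "ignore".
func validateOnUnmarshalError(value string) (string, error) {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "", "ignore":
		return "ignore", nil
	case "add_error_key":
		return "add_error_key", nil
	default:
		return "", fmt.Errorf("invalid on_unmarshal_error value: %q (expected ignore or add_error_key)", value)
	}
}

func main() {
	normalized, err := validateOnUnmarshalError("Add_Error_Key")
	fmt.Println(normalized, err) // add_error_key <nil>
}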

@andrewkroh (Member) commented:

LGTM

@andrewkroh changed the title from "Proposal: JSON support in Filbeat" to "Proposal: JSON support in Filebeat" Mar 1, 2016
@ruflin (Contributor) commented Mar 1, 2016

Two notes from my side:

Instead of putting the keys at the top level, I would put them under a namespace, something like "message", which then contains the JSON document. In the case of application logging, not all keys are known in advance; if every engineer can add their own fields to a log message, things can be overwritten or lost in ways that weren't expected. I would still offer the choice of using json_under_root (or similar) to put the fields at the top level, but I would disable it by default.

The second thing I would allow is setting a mapping for the timestamp. I think the only thing that is common across all log messages is that they contain a timestamp, but the timestamp will not always be called timestamp. So if someone wants to overwrite the timestamp created by filebeat, the following config could be chosen for the example log in this PR:

json_decoder:
  timestamp_field: "timestamp"

The question here is whether the @timestamp field should just be overwritten or whether filebeat should try to convert it to a default format (I would suggest not to, as this only leads to problems).
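
A minimal sketch of what such a timestamp_field mapping could do; timestamp_field is the setting proposed above, the rest is illustrative:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Made-up structured log line whose timestamp field is called "timestamp".
	line := `{"timestamp":"2016-03-01T09:30:00Z","level":"info","msg":"started"}`

	event := map[string]interface{}{
		"@timestamp": "2016-03-01T09:30:05.123Z", // read time, set by filebeat
	}

	var decoded map[string]interface{}
	if err := json.Unmarshal([]byte(line), &decoded); err != nil {
		return
	}

	// timestamp_field: "timestamp" -- copy the value as-is, no format conversion.
	timestampField := "timestamp"
	if ts, ok := decoded[timestampField]; ok {
		event["@timestamp"] = ts
	}

	fmt.Println(event["@timestamp"]) // 2016-03-01T09:30:00Z
}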

if err != nil {
    logp.Err("Error decoding JSON: %v", err)
    if f.jsonDecoder.OnUnmarshalError == "add_error_key" {
        event["json_error"] = fmt.Sprintf("Error decoding JSON: %v", err)

Contributor commented:

Another option could be to introduce a general "error" field under beat.error that we can also use for other errors in the future. This would make it possible to search for all log events with an error.

@tsg (Author) commented:

The question is, what if there are multiple errors on the same event? Do we keep only the latest one? Do we make it an array?

Contributor commented:

I'm not sure there are cases with multiple errors, as I assume that every time we hit an error we would stop executing the next steps and send the event with the error. For example, if there is a multiline error, we should not continue with JSON decoding.

@andrewkroh (Member) commented:

I was thinking about the order of operations and how it interacts with filtering (and possibly other future filters). If the configuration were a bit more flexible, then you could specify the order of operations in the config. You could also apply the same filter more than once if needed. Essentially you could compose a filter chain and apply it to published events. Consider configuration like this:

# NOT REAL CONFIG, DO NOT COPY
filebeat:
  prospectors:
    - paths: ['log.json']
      filters:
        message_json:
          type: json
          field: message
        drop_debug_events:
          type: drop_event
          condition: level == DEBUG
        request_json_decode:
          # request was a string in the message that also needs to be decoded (a contrived example)
          type: json
          field: request

The filters and the code to build the filter chain from config could live outside of Filebeat so that all beats could take advantage of it as needed.
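
A minimal sketch of what composing such a filter chain could look like (the Filter interface and the concrete filter types are hypothetical, not an existing Filebeat API):

package main

import (
	"encoding/json"
	"fmt"
)

// Filter is a hypothetical interface: each configured filter transforms an
// event, or drops it by returning nil.
type Filter interface {
	Apply(event map[string]interface{}) map[string]interface{}
}

// jsonFilter decodes the named field and merges its keys into the event.
type jsonFilter struct{ field string }

func (f jsonFilter) Apply(event map[string]interface{}) map[string]interface{} {
	raw, _ := event[f.field].(string)
	var decoded map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &decoded); err == nil {
		for k, v := range decoded {
			event[k] = v
		}
	}
	return event
}

// dropEventFilter drops events where a field equals a given value.
type dropEventFilter struct{ field, value string }

func (f dropEventFilter) Apply(event map[string]interface{}) map[string]interface{} {
	if event[f.field] == f.value {
		return nil // drop the event
	}
	return event
}

func main() {
	// A chain built from configuration like the example above.
	chain := []Filter{
		jsonFilter{field: "message"},
		dropEventFilter{field: "level", value: "DEBUG"},
	}

	event := map[string]interface{}{
		"message": `{"level":"INFO","request":"{\"path\":\"/\"}"}`,
	}
	for _, f := range chain {
		if event = f.Apply(event); event == nil {
			return // event was dropped
		}
	}
	fmt.Println(event)
}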

@urso commented Mar 2, 2016

The order of operations I was thinking about:

  1. File encoding decode (json should be utf-8, utf-16 or utf-32)
  2. Parse json (do not depend on any line vs. multiline for parsing)
  3. multiline on configurable field (e.g. for docker support)
  4. Add custom and metadata fields
  5. Generic filtering

@ruflin (Contributor) commented Mar 2, 2016

@urso In case the json document is spread across multiple lines, wouldn't it be necessary to do multiline before json parsing?

@urso commented Mar 2, 2016

For me it's a new prospector/harvester type that parses content as it arrives: not line based, but JSON-document based. No line limits, e.g. support for multiple events on one line, an event split across multiple lines, or the next event starting on the same line the previous one ended on in a multiline scenario.
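
A minimal sketch of that kind of document-based parsing, using the standard library's json.Decoder on a stream (an illustration, not the proposed harvester):

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// Two events on one line, then one event spread over two lines.
	input := `{"msg":"a"} {"msg":"b"}
{"msg":
 "c"}`

	dec := json.NewDecoder(strings.NewReader(input))
	for {
		var event map[string]interface{}
		if err := dec.Decode(&event); err != nil {
			break // io.EOF at the end of input, or a decode error
		}
		fmt.Println(event["msg"]) // prints a, b, c
	}
}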

@ruflin (Contributor) commented Mar 2, 2016

I see, so start and end of an event would always be defined by { ... } ?

@urso commented Mar 2, 2016

@ruflin yes

@ruflin (Contributor) commented Mar 2, 2016

What about going with the simple implementation first, which only supports clean JSON on one line (this PR), and then going into more complex and powerful solutions in a second step?

@urso commented Mar 2, 2016

I'd like to be a little conservative here and not go with this PR. I'd rather see another input_type for JSON-based logs instead of adding this to the current harvester type.

On the other hand, I'm fine with a simple solution first, but as another prospector type, so as not to generate too much technical debt.

@tsg (Author) commented Mar 2, 2016

I kind of think we'll need both.

It's because for Docker support we'll need to do the JSON decoding twice: the app inside the container might also use structured logging. I guess this is also why @andrewkroh proposed the fully configurable option, but I worry about the complexity there.

So the order of operations for structured logs from a docker app would be:

  1. Encoding
  2. JSON decoding (outer layer, added by Docker)
  3. Multiline
  4. Line Filtering
  5. JSON decoding (inner layer, added by the app)
  6. Add custom and metadata fields
  7. Generic filtering

Like @urso, I see step 2 as a new harvester type, but step 5 can be a simple addition to the current one. This PR is essentially adding step 5.
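
As an illustration of steps 2 and 5 above, a minimal sketch of decoding a line from Docker's json-file log driver and then the application's own JSON inside it (the log line is made up):

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// One made-up line from Docker's json-file log driver: the outer object is
	// added by Docker, and its "log" field holds the app's own structured line.
	line := `{"log":"{\"level\":\"error\",\"msg\":\"boom\"}\n","stream":"stdout","time":"2016-03-02T12:00:00Z"}`

	// Step 2: decode the outer layer added by Docker.
	var outer struct {
		Log    string `json:"log"`
		Stream string `json:"stream"`
		Time   string `json:"time"`
	}
	if err := json.Unmarshal([]byte(line), &outer); err != nil {
		return
	}

	// Step 5: decode the inner layer added by the application.
	var inner map[string]interface{}
	if err := json.Unmarshal([]byte(strings.TrimSpace(outer.Log)), &inner); err != nil {
		return
	}

	fmt.Println(outer.Stream, outer.Time, inner["level"], inner["msg"])
}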

I see a challenge that we'll have when adding a new harvester type: it will need to duplicate a lot of functionality from the current one. I can see the argument for not adding more stuff to the current harvester until we have figured this out. On the other hand, the change is small enough that I don't worry that much.

@urso commented Mar 2, 2016

In the Docker case I'd even do:

  1. Encoding
  2. JSON decoding (outer layer, added by Docker)
  3. JSON decoding (inner layer, added by the app)
  4. Multiline
  5. Line Filtering
  6. Add custom and metadata fields
  7. Generic filtering

Maybe multiline and line filtering are not required at all for a JSON file logger.

The difference with line/multiline is that the line is our event, but for JSON the JSON object itself is our event.

@tsg (Author) commented Mar 3, 2016

@urso I thought of a problem with doing the JSON decoding as a stream rather than assuming one JSON object per line: if there's any invalid JSON in any of the messages, the whole file is lost. One JSON per line seems more robust to me in general.

I wonder if there are any systems that output JSON, but not in line-by-line fashion.
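
A minimal sketch of why one JSON object per line is forgiving: with line-based reading, a broken line only costs that one line (made-up input):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// Made-up input: the second line is broken JSON.
	input := `{"msg":"first"}
this line is not JSON at all
{"msg":"third"}`

	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		var event map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &event); err != nil {
			// Only this line is affected; the next line can still be decoded.
			fmt.Println("skipping invalid line:", err)
			continue
		}
		fmt.Println(event["msg"])
	}
}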

@tsg (Author) commented Mar 17, 2016

Closing in favor of #1143.

@tsg closed this Mar 17, 2016
@tsg deleted the json_support branch August 25, 2016