Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multiple inputs per dataset #346

Merged
merged 7 commits into from
Apr 22, 2020

Conversation

ruflin
Copy link
Contributor

@ruflin ruflin commented Apr 15, 2020

A dataset can support more then one input. For example an nginx access log can come from file or syslog. To support this, each stream definition inside a dataset must be able to reference its own config template file. Because of this, a config option file was added to the dataset manifest. It only has to be specified if there is more then one input, otherwise it falls back to stream.yml as the default.

I don't expect this to have any effects on the data source creation at the moment as this is only additional information and we don't support multiple inputs in the Ingest Manager yet.

@ruflin ruflin requested review from nchaulet and mtojek April 15, 2020 06:42
@ruflin ruflin self-assigned this Apr 15, 2020
util/dataset.go Outdated
@@ -44,6 +44,7 @@ type Stream struct {
Input string `config:"input" json:"input" validate:"required"`
Vars []Variable `config:"vars" json:"vars,omitempty" yaml:"vars,omitempty"`
Dataset string `config:"dataset" json:"dataset,omitempty" yaml:"dataset,omitempty"`
File string `config:"file" json:"file,omitempty" yaml:"file,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if File is the best name here? Also was undecided if I should add the .yml to it or not. Feedback welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does File contain the same content as Template, but inside the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, file is basically the path (should we call it path?) and template is the content of the file.

Copy link
Contributor

@mtojek mtojek Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there is a chance that the same Stream has defined both, File and Template?

If not, then I suggest to consider a following structure:

...
template: {
    type: "text"
    source: "foo: bar"
}
...
...
template: {
    type: "file"
    source: "/foo/bar/baz.yml"
}
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user must define the file path here as File, the Template part will then automatically be read in through the code: #335 So yes, both exist.

@mtojek
Copy link
Contributor

mtojek commented Apr 15, 2020

I might have missed something, but does it mean that all conditions like the following are not supported?

{{#if input == "httpjson"}}
input: httpjson
api_key: {{this.api_key}}
authentication_scheme: {{this.authentication_scheme}}
http_client_timeout: {{this.http_client_timeout}}
http_method: {{this.http_method}}
http_headers: {{this.http_headers}}
http_request_body: {{this.http_request_body}}
no_http_body: {{this.no_http_body}}
interval: {{this.interval}}
json_objects_array: {{this.json_objects_array}}
pagination: {{this.pagination}}
rate_limit: {{this.rate_limit}}
url: {{this.url}}
ssl: {{this.ssl}}
{{else if input == "file"}}
input: log
paths:
{{#each paths}}
  - {{this}}
{{/each}}
exclude_files: [".gz$"]
{{/if}}

If so, then I need to adjust the import-beats script.

@ruflin
Copy link
Contributor Author

ruflin commented Apr 15, 2020

@mtojek Yes, it would change the above. But still debating if it is the right approach to go. Initially I wanted to have both in one file as I was thinking it would simplify building the stream template and we already have it this way. The change was initially triggered by {{#if input == "httpjson"}} not being valid handlebar but it also made me think if having both together is really the best approach or was only built because it is the way it works today in Beats. My conclusion was we better separate them as it makes it much easier to just thinking about 1 stream and see directly all configs. Also it should simplify the generation of the config in the ingest manager.

@@ -265,14 +275,25 @@
}
],
"dataset": "datasources.examplelog2",
"template": "foo: bar\n",
"template.path": "stream.yml",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtojek Got stock on this PR as I realised the . in the output here will probably cause some troubles in Kibana to read it :-( I still like the naming but need to find a way to make sure we expand it here but still have a short version in the yaml for devs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving this as is if it brings more problems that helps ;)

@mtojek
Copy link
Contributor

mtojek commented Apr 15, 2020

BTW I see that the CI is failing, you can convert this PR to draft to let us know when it's ready for review.

@exekias
Copy link

exekias commented Apr 16, 2020

This approach could potentially explode in having all the inputs available per stream isn't it? for example, I could also be consuming nginx logs from Amazon S3.

We had this discussion about having a routing pipeline matching some field (for instance it could be using stream.*). Would that be another option?

@ruflin
Copy link
Contributor Author

ruflin commented Apr 17, 2020

@exekias Yes, there are many options how to consume the same data. My assumption is that each config will look quite different for the different inputs.

For the routing pipeline: I think there are 2 parts here. One is same log data coming from different inputs and having slightly different content. Because of this is needs different processing. In this case I expect a dataset to have multiple ingest pipelines with a routing pipeline that decides on some fields / data on how to process it. This is supported already today for example with json and non json logs.

The other part is around having lots of data coming from a single input (s3 as an example) and containing data for nginx access and error logs in one stream. The above will not solve this. Is that what you are referring to @exekias ?

@exekias
Copy link

exekias commented Apr 17, 2020

Yes, I'm referring to this later case. Just wondering if we should go deeper in this configuration of input and integration together, or work on disconnect this concerns, where the input should make sure that the data reaches ingestion (with the proper fields), and then it gets routed to the integration pipelines

@ruflin
Copy link
Contributor Author

ruflin commented Apr 17, 2020

My current thinking: For the specific service packages like nginx we are in the first case. But there will probably be more generic packages like S3 that basically configure the input and there we potentially not some routing.

Had a quick chat about this @exekias and we need to do some more brainstorming here and then get back to this issue with an update.

ruflin added 7 commits April 22, 2020 14:59
A dataset can support more then one input. For example an nginx access log can come from file or syslog. To support this, each stream definition inside a dataset must be able to reference its own config template file. Because of this, a config option `file` was added to the dataset manifest. It only has to be specified if there is more then one input, otherwise it falls back to `stream.yml` as the default.

I don't expect this to have any effects on the data source creation at the moment as this is only additional information and we don't support multiple inputs in the Ingest Manager yet.
@ruflin ruflin force-pushed the multi-stream-support branch from d4decfc to 4e3bace Compare April 22, 2020 13:03
@ruflin
Copy link
Contributor Author

ruflin commented Apr 22, 2020

@mtojek I slightly modified this PR to for now stick to template for the content and template_path for the path. The reason is I do not want to deal with the . issue directly in this PR and also not break the existing Kibana implementation. But before we release this we need to revisit it and fix this together with Kibana. The internal Golang variables should already have the names that we discussed. Let me know what you think and if it is ok for your to move forward with this.

@mtojek
Copy link
Contributor

mtojek commented Apr 22, 2020

@mtojek I slightly modified this PR to for now stick to template for the content and template_path for the path. The reason is I do not want to deal with the . issue directly in this PR and also not break the existing Kibana implementation. But before we release this we need to revisit it and fix this together with Kibana. The internal Golang variables should already have the names that we discussed. Let me know what you think and if it is ok for your to move forward with this.

Sure, let's move on. Thank you for heads up. I assume that the next step would be adjusting the import-beats script to support multiple stream templates?

@ruflin
Copy link
Contributor Author

ruflin commented Apr 22, 2020

I wonder if it is worth the effort and complexity to support this in the import script. In how many cases do we have this? 2-3? I'll leave it up to you to see if the effort is worth it.

I will also follow up with either an issue or PR to make sure we don't forget about renaming the config options later.

Could you give "approve" PR if ok to merge?

@ruflin ruflin merged commit a79edc3 into elastic:master Apr 22, 2020
@ruflin ruflin deleted the multi-stream-support branch April 22, 2020 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants