Add support for multiple inputs per dataset #346

ruflin · 2020-04-15T06:42:07Z

A dataset can support more then one input. For example an nginx access log can come from file or syslog. To support this, each stream definition inside a dataset must be able to reference its own config template file. Because of this, a config option file was added to the dataset manifest. It only has to be specified if there is more then one input, otherwise it falls back to stream.yml as the default.

I don't expect this to have any effects on the data source creation at the moment as this is only additional information and we don't support multiple inputs in the Ingest Manager yet.

ruflin · 2020-04-15T06:42:55Z

util/dataset.go

@@ -44,6 +44,7 @@ type Stream struct {
 	Input       string     `config:"input" json:"input" validate:"required"`
 	Vars        []Variable `config:"vars" json:"vars,omitempty" yaml:"vars,omitempty"`
 	Dataset     string     `config:"dataset" json:"dataset,omitempty" yaml:"dataset,omitempty"`
+	File        string     `config:"file" json:"file,omitempty" yaml:"file,omitempty"`


I'm not sure if File is the best name here? Also was undecided if I should add the .yml to it or not. Feedback welcome.

Does File contain the same content as Template, but inside the file?

Yes, file is basically the path (should we call it path?) and template is the content of the file.

Is there is a chance that the same Stream has defined both, File and Template?

If not, then I suggest to consider a following structure:

... template: { type: "text" source: "foo: bar" } ...

... template: { type: "file" source: "/foo/bar/baz.yml" } ...

The user must define the file path here as File, the Template part will then automatically be read in through the code: #335 So yes, both exist.

mtojek · 2020-04-15T07:20:12Z

I might have missed something, but does it mean that all conditions like the following are not supported?

{{#if input == "httpjson"}}
input: httpjson
api_key: {{this.api_key}}
authentication_scheme: {{this.authentication_scheme}}
http_client_timeout: {{this.http_client_timeout}}
http_method: {{this.http_method}}
http_headers: {{this.http_headers}}
http_request_body: {{this.http_request_body}}
no_http_body: {{this.no_http_body}}
interval: {{this.interval}}
json_objects_array: {{this.json_objects_array}}
pagination: {{this.pagination}}
rate_limit: {{this.rate_limit}}
url: {{this.url}}
ssl: {{this.ssl}}
{{else if input == "file"}}
input: log
paths:
{{#each paths}}
  - {{this}}
{{/each}}
exclude_files: [".gz$"]
{{/if}}

If so, then I need to adjust the import-beats script.

ruflin · 2020-04-15T07:27:34Z

@mtojek Yes, it would change the above. But still debating if it is the right approach to go. Initially I wanted to have both in one file as I was thinking it would simplify building the stream template and we already have it this way. The change was initially triggered by {{#if input == "httpjson"}} not being valid handlebar but it also made me think if having both together is really the best approach or was only built because it is the way it works today in Beats. My conclusion was we better separate them as it makes it much easier to just thinking about 1 stream and see directly all configs. Also it should simplify the generation of the config in the ingest manager.

ruflin · 2020-04-15T08:54:36Z

testdata/public/package/datasources/1.0.0/index.json

@@ -265,14 +275,25 @@
                }
              ],
              "dataset": "datasources.examplelog2",
-              "template": "foo: bar\n",
+              "template.path": "stream.yml",


@mtojek Got stock on this PR as I realised the . in the output here will probably cause some troubles in Kibana to read it :-( I still like the naming but need to find a way to make sure we expand it here but still have a short version in the yaml for devs.

I'm fine with leaving this as is if it brings more problems that helps ;)

mtojek · 2020-04-15T09:51:43Z

BTW I see that the CI is failing, you can convert this PR to draft to let us know when it's ready for review.

exekias · 2020-04-16T09:44:35Z

This approach could potentially explode in having all the inputs available per stream isn't it? for example, I could also be consuming nginx logs from Amazon S3.

We had this discussion about having a routing pipeline matching some field (for instance it could be using stream.*). Would that be another option?

ruflin · 2020-04-17T09:16:53Z

@exekias Yes, there are many options how to consume the same data. My assumption is that each config will look quite different for the different inputs.

For the routing pipeline: I think there are 2 parts here. One is same log data coming from different inputs and having slightly different content. Because of this is needs different processing. In this case I expect a dataset to have multiple ingest pipelines with a routing pipeline that decides on some fields / data on how to process it. This is supported already today for example with json and non json logs.

The other part is around having lots of data coming from a single input (s3 as an example) and containing data for nginx access and error logs in one stream. The above will not solve this. Is that what you are referring to @exekias ?

exekias · 2020-04-17T09:25:47Z

Yes, I'm referring to this later case. Just wondering if we should go deeper in this configuration of input and integration together, or work on disconnect this concerns, where the input should make sure that the data reaches ingestion (with the proper fields), and then it gets routed to the integration pipelines

ruflin · 2020-04-17T11:36:33Z

My current thinking: For the specific service packages like nginx we are in the first case. But there will probably be more generic packages like S3 that basically configure the input and there we potentially not some routing.

Had a quick chat about this @exekias and we need to do some more brainstorming here and then get back to this issue with an update.

A dataset can support more then one input. For example an nginx access log can come from file or syslog. To support this, each stream definition inside a dataset must be able to reference its own config template file. Because of this, a config option `file` was added to the dataset manifest. It only has to be specified if there is more then one input, otherwise it falls back to `stream.yml` as the default. I don't expect this to have any effects on the data source creation at the moment as this is only additional information and we don't support multiple inputs in the Ingest Manager yet.

ruflin · 2020-04-22T13:04:51Z

@mtojek I slightly modified this PR to for now stick to template for the content and template_path for the path. The reason is I do not want to deal with the . issue directly in this PR and also not break the existing Kibana implementation. But before we release this we need to revisit it and fix this together with Kibana. The internal Golang variables should already have the names that we discussed. Let me know what you think and if it is ok for your to move forward with this.

mtojek · 2020-04-22T13:06:59Z

@mtojek I slightly modified this PR to for now stick to template for the content and template_path for the path. The reason is I do not want to deal with the . issue directly in this PR and also not break the existing Kibana implementation. But before we release this we need to revisit it and fix this together with Kibana. The internal Golang variables should already have the names that we discussed. Let me know what you think and if it is ok for your to move forward with this.

Sure, let's move on. Thank you for heads up. I assume that the next step would be adjusting the import-beats script to support multiple stream templates?

ruflin · 2020-04-22T13:20:14Z

I wonder if it is worth the effort and complexity to support this in the import script. In how many cases do we have this? 2-3? I'll leave it up to you to see if the effort is worth it.

I will also follow up with either an issue or PR to make sure we don't forget about renaming the config options later.

Could you give "approve" PR if ok to merge?

ruflin requested review from nchaulet and mtojek April 15, 2020 06:42

ruflin self-assigned this Apr 15, 2020

ruflin commented Apr 15, 2020

View reviewed changes

ruflin mentioned this pull request Apr 20, 2020

Add Netflow integration 0.0.2 #338

Merged

ruflin added 7 commits April 22, 2020 14:59

update generated files

2f5a0c2

update to use template.path and template.content

23754f8

update test files

bd1e1ab

add syslog path

f48b0b2

temporary commit

2f42b84

use template_path

4e3bace

ruflin force-pushed the multi-stream-support branch from d4decfc to 4e3bace Compare April 22, 2020 13:03

mtojek approved these changes Apr 22, 2020

View reviewed changes

ruflin merged commit a79edc3 into elastic:master Apr 22, 2020

ruflin deleted the multi-stream-support branch April 22, 2020 13:51

This was referenced Apr 23, 2020

Add Cisco integration 0.0.2 #370

Closed

Adjust config options and naming for template.* #394

Closed

mtojek mentioned this pull request Feb 16, 2021

Adding package changelogs elastic/integrations#675

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for multiple inputs per dataset #346

Add support for multiple inputs per dataset #346

ruflin commented Apr 15, 2020

ruflin Apr 15, 2020

mtojek Apr 15, 2020

ruflin Apr 15, 2020

mtojek Apr 15, 2020 •

edited

Loading

ruflin Apr 15, 2020

mtojek commented Apr 15, 2020

ruflin commented Apr 15, 2020

ruflin Apr 15, 2020

mtojek Apr 15, 2020

mtojek commented Apr 15, 2020

exekias commented Apr 16, 2020

ruflin commented Apr 17, 2020

exekias commented Apr 17, 2020

ruflin commented Apr 17, 2020

ruflin commented Apr 22, 2020

mtojek commented Apr 22, 2020 •

edited

Loading

ruflin commented Apr 22, 2020

Add support for multiple inputs per dataset #346

Add support for multiple inputs per dataset #346

Conversation

ruflin commented Apr 15, 2020

ruflin Apr 15, 2020

Choose a reason for hiding this comment

mtojek Apr 15, 2020

Choose a reason for hiding this comment

ruflin Apr 15, 2020

Choose a reason for hiding this comment

mtojek Apr 15, 2020 • edited Loading

Choose a reason for hiding this comment

ruflin Apr 15, 2020

Choose a reason for hiding this comment

mtojek commented Apr 15, 2020

ruflin commented Apr 15, 2020

ruflin Apr 15, 2020

Choose a reason for hiding this comment

mtojek Apr 15, 2020

Choose a reason for hiding this comment

mtojek commented Apr 15, 2020

exekias commented Apr 16, 2020

ruflin commented Apr 17, 2020

exekias commented Apr 17, 2020

ruflin commented Apr 17, 2020

ruflin commented Apr 22, 2020

mtojek commented Apr 22, 2020 • edited Loading

ruflin commented Apr 22, 2020

mtojek Apr 15, 2020 •

edited

Loading

mtojek commented Apr 22, 2020 •

edited

Loading