Optionally use ECS conventions for dynamic mappings #85692
Comments
Pinging @elastic/es-search (Team:Search) |
This will greatly simplify integration development, since developers will no longer have to remember and correlate which type belongs to which field. |
We discussed this with the team, and later I had a chat with @ruflin about it. Will add some more info so we can discuss it further with the team. ECS defines a set of around 100 basic fields that are shared across different datasets, but the total number of fields it defines is over 1000. ECS differs from the Elasticsearch default dynamic mappings as follows:
The ip fields issue would be solved by adding auto-detection for ip fields (#64400), and the keyword/text multi-field issue by formalizing the default mappings for string fields (#53181). However, while the existing dynamic mappings are type based (driven by the JSON type of each field being parsed), ECS is name based, and there aren't necessarily common naming patterns between different fields of the same type. As a result, such fields need to be manually mapped. Dynamic templates can be used so that only fields that actually appear in documents are mapped, but the dynamic templates for all those 1000+ fields take space in the mappings anyway, and for each field that gets mapped, its mapping would then be effectively duplicated under the properties section. Ideally, fields that don't appear in any document would not take any space in the mappings, and fields that do would be mapped without appearing both under dynamic templates and under properties. When it comes to maintaining these mappings, the 100 basic fields are well defined and never change. The remaining, less commonly used fields may change, but mostly what happens is that additional fields are added. |
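To make the type-based versus name-based distinction concrete, here is a minimal Python sketch that simulates how a short list of dynamic templates could resolve a mapping from a field's path and JSON type. The rule names and the three rules themselves are illustrative assumptions, not the real ECS definition, and glob matching via `fnmatchcase` only approximates how Elasticsearch evaluates `path_match`.

```python
from fnmatch import fnmatchcase

# Illustrative rules only: a tiny, hand-picked subset, not the real ECS definition.
DYNAMIC_TEMPLATES = [
    {"ecs_ip": {"path_match": "*.ip", "match_mapping_type": "string",
                "mapping": {"type": "ip"}}},
    {"ecs_status_code": {"path_match": "*.status_code", "match_mapping_type": "long",
                         "mapping": {"type": "long"}}},
    # Type-based fallback: any other string becomes a keyword.
    {"strings_as_keyword": {"match_mapping_type": "string",
                            "mapping": {"type": "keyword", "ignore_above": 1024}}},
]


def json_type(value):
    """Return a simplified name for the JSON type of a parsed value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "object"


def resolve_mapping(path, value, templates=DYNAMIC_TEMPLATES):
    """Return (template_name, mapping) for the first matching rule, else None."""
    detected = json_type(value)
    for entry in templates:
        name, rule = next(iter(entry.items()))
        if rule.get("match_mapping_type", detected) != detected:
            continue
        if not fnmatchcase(path, rule.get("path_match", "*")):
            continue
        return name, rule["mapping"]
    return None


def map_document(flat_doc):
    """Build a flat {field_path: mapping} view of what one document would add."""
    properties = {}
    for path, value in flat_doc.items():
        match = resolve_mapping(path, value)
        if match is not None:
            properties[path] = match[1]
    return properties
```

In this sketch, only fields that actually show up in a document end up in the resulting properties, while the rule list itself stays the same size no matter how many documents are indexed. That is the trade-off the comment describes: the templates still occupy space, and each matched field is then listed a second time under properties.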
Adding some thoughts from another round of discussion with the team. We are open to adding auto-detection for ip addresses; it would work similarly to how date auto-detection works. This is not a particularly popular feature request (surprisingly?). If we introduced it, we may not want to enable it by default because of backwards compatibility concerns, but then few users would benefit from it if it's opt-in, which is not ideal either. It was mentioned that ip auto-detection would solve a big part of the problems that triggered the creation of this very issue, but we could not exactly understand why: ip fields need to be mapped manually, and the disadvantage is that incoming documents will hold only one or two of those, while ECS mappings need to define all the possible ip fields to ensure that if they get indexed they get the right type. How many are there in total, out of curiosity? Would it be possible to define them in a dynamic template? Is there some common naming convention for ip fields? It was unclear how much addressing two of the three issues above (default mapping of text fields and ip auto-detection) would help. It feels like the size of these mappings is the main factor for integrating the ECS conventions into Elasticsearch over using, for instance, dynamic templates to define their mappings, as a lot of these fields need to be mapped one by one. When it comes to cluster state size there are two aspects:
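As a rough illustration of the opt-in ip auto-detection idea, the Python sketch below guesses a type for a string value. It is not how Elasticsearch implements detection: the `ip_detection` flag is a hypothetical name chosen by analogy with date detection, and the `text` fallback is a simplification of the default string mapping.

```python
import ipaddress


def detect_string_type(value, ip_detection=False):
    """Guess a mapping type for a string value.

    `ip_detection` is a hypothetical opt-in flag, off by default to mirror
    the backwards-compatibility concern raised in the discussion.
    """
    if ip_detection:
        try:
            ipaddress.ip_address(value)  # accepts IPv4 and IPv6 literals
            return "ip"
        except ValueError:
            pass
    return "text"
```

With the flag off, `"192.168.1.10"` is treated like any other string, which is the behaviour existing users rely on. With the flag on, the same value would be detected as an ip regardless of the field's name.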
To clarify, cluster state size is a concern, which both dynamic templates and mapped fields contribute to. However, the biggest issue we currently have is the memory footprint of mappings, which only mapped fields contribute to (see #86440). A reasonable approach would be to try to move as many of the field definitions as possible to the dynamic templates section, while working out a way to keep the size of dynamic templates contained. I wonder if the exercise of moving the current ECS mappings to dynamic templates is a valid path forward regardless of the outcome of the discussion around integrating ECS mappings within Elasticsearch. I think that if we decided to adopt ECS conventions in Elasticsearch dynamic mappings, we would very likely load these from a file holding a set of dynamic templates. There was also some resistance around ownership in case ECS mappings become part of Elasticsearch: who is then responsible for maintaining them, adding fields, etc.? We will continue the discussion once we receive feedback on these thoughts. |
This has also come up in the context of providing better default mappings for logs (LX). We didn't come to a definite answer whether or not we'd want all ECS fields to be mapped and indexed by default. As dynamic runtime fields also increase the cluster state (#88265), I think it would be better to disable dynamic mapping but to apply runtime field-like semantics to unmapped fields (look them up from _source) (#81357). |
I am removing the team-discuss label for now: there is ongoing work in defining the set of core fields that we would like to integrate within Elasticsearch, and once that is defined, we will look at their mappings and re-evaluate how to move forward. There is agreement on using dynamic templates to ensure that fields that never appear in documents are not mapped. It's important to reiterate what the main goal of this effort is: make it easier for users to use ECS, without them even knowing what ECS is. Relevant indices should get ECS mappings applied automatically. |
I have opened #89743 as a Draft PR to discuss potential default mappings in detail. It is not a full implementation; the idea is to first agree on what mappings we need to solve this use case, and then in a second step discuss the implementation in Elasticsearch. At first it could be just a component template that can be referenced, and later on a config flag to make usage much easier. Please have a look at the PR and especially the comments for each field. There is quite a list of open questions and discussion points, but it is much easier to have this directly in the PR itself. |
Is this related to elastic/integrations#3642? |
I had some good follow-up discussions with @P1llus around this topic. We can split the problem into two parts:
The way I think of it is that 1. will be a subset of 2. The following focuses mostly on 2. and belongs more to integrations development. I'm updating the conversation here as we already touched on many of the points. The goal is to simplify the building of integrations and allow everyone to use ECS. Some of the core guiding principles we discussed:
Proposal
Based on the above, we came up with the following proposal. This is a high-level proposal; details would have to be worked out.
Installation of ECS fields
An integration package installed on Elasticsearch 8.1 can require ECS 8.3. This means bundling ECS component templates with Elasticsearch is not a viable solution. Instead, Fleet should install a package which contains the component templates for the different ECS versions. The package contains all previous ECS versions, which means 8.0, 8.1 etc. of ECS are available in parallel. These component templates are also available to any user that wants to use ECS.
Packages require ECS component template with version
A developer of an integration package can specify which ECS version should be used in the package. When the package is installed by Fleet, the correct ECS component template is referenced to be used by the package. Depending on how many component templates there are per ECS version, the reference might be slightly different.
Dynamic mappings for ECS fields
ECS has grown a lot over the last years. The challenge is that we do not want to map all these fields by default, as it would create a pretty large component template. Instead, as ECS follows conventions, most fields can be mapped with just a few conventions. @P1llus worked on such a dynamic template for all the current ECS fields which can be found here. It is not split up into Core or Extended. My assumption is that removing extended would shorten the file by ~1/3, but it is not clear if it is worth doing this. For testing this dynamic template, @P1llus has also created an example doc with all ECS fields. Having these dynamic templates for ECS available would mean integration developers can stop adding ECS fields and only reference a version. ECS fields could no longer be forgotten by accident. If users add their own fields, these would also be correctly mapped to ECS.
It goes further, if a user creates
All fields are still indexed
Taking the above approach means all ECS fields are indexed by default. In #89743 we are discussing partially moving away from this. But the above proposal is only for the integrations and replaces the complexity we have today around ECS fields; suddenly removing indexing would be a breaking change. The above approach also allows us to do improvements like offer |
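The version pinning described in the proposal could be sketched as follows in Python. The `ecs@<version>` naming scheme, the `<package>@custom` template name, and the priority value are assumptions made for illustration; the proposal does not specify them. Only the general shape of an index template body (`index_patterns`, `data_stream`, `composed_of`, `priority`) follows the existing Elasticsearch API.

```python
def ecs_component_template_name(ecs_version):
    """Hypothetical naming scheme for one ECS version's component template."""
    return f"ecs@{ecs_version}"


def index_template_for_package(package, dataset, ecs_version):
    """Build an index template body that pins a package to one ECS version."""
    return {
        "index_patterns": [f"logs-{dataset}-*"],
        "data_stream": {},
        "composed_of": [
            # Shared ECS mappings for the requested version.
            ecs_component_template_name(ecs_version),
            # Hypothetical package-specific overrides.
            f"{package}@custom",
        ],
        "priority": 200,
    }
```

For example, `index_template_for_package("netflow", "netflow.log", "8.3")` would reference `ecs@8.3` even on a stack running 8.1, provided that component template has been installed alongside the older ones.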
heya @ruflin thanks a lot for the update. I looked at the linked mappings and left a couple of comments, very much in line with yours.
I am trying to figure out what this means in practice. In order to have ECS mappings integrated into Elasticsearch, wouldn't it be a requirement that they are not installed by an external component but rather managed internally by Elasticsearch? |
Not necessarily, but it should be a service that you trust and that is always running beside Elasticsearch. In our case this is Kibana. The way it would work: Elasticsearch & Kibana 8.1 are running. A new version of the ECS package is published. Kibana / Fleet detects it and in the background installs the new ECS component templates which do not exist yet. |
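The "install only what does not exist yet" step could look roughly like the Python sketch below. It only computes which templates are missing; the `ecs@<version>` naming scheme is an assumption, and the actual detection and installation logic in Kibana / Fleet is not described in this thread.

```python
def templates_to_install(published_versions, installed_names):
    """Return names of ECS component templates that are published but not yet installed."""
    # Hypothetical naming scheme: one component template per ECS version.
    wanted = [f"ecs@{version}" for version in published_versions]
    installed = set(installed_names)
    return [name for name in wanted if name not in installed]
```

Because the result depends only on what is already installed, running this check repeatedly in the background is harmless: once every published version is present, the list is empty.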
Thanks for expanding. If ECS mappings though are ordinary component templates that are provided externally, what is the plan for a tighter integration? We were initially thinking of something like |
I think that is where the separation between 1 and 2 in #85692 (comment) is. I still would like to get to a point where users do not even have to use component templates but just have to enable it with a config setting but to me it seems going with 2 is a low hanging fruit and 1 is the long term goal. I could also see that |
Adding some comments here after speaking with @felixbarny. If the plan is to add the dynamic ECS template to the global
Only my 2 cents, there might be other views from other people around this subject as well of course. |
I discussed this with @ruflin today. He brought up the idea of creating an ECS integration package which contains the component templates for all ECS versions and it's ensured that this is always up-to-date and pre-installed. The benefit compared to bundling the different ECS versions in Elasticsearch is that this ECS integration can be installed in older ES versions. Therefore, an integration can reference an ECS template that's newer than the stack version it gets installed on. ES would then only bundle the current version of ECS, to be used for the default index templates (see also #95538). |
Yes! I would like to see this! Anything and everything (Beats, Integrations, SIEM detection indices...) should share a common set of Component Templates. This would 100% eliminate field conflicts. I thought that was the entire idea behind shared Component Templates, so I was a bit confused when I saw that Integrations are not using them. |
In #96171 we added the |
Sounds good to me, and great news! I just want to confirm that the original idea of introducing a new ECS dynamic mode is no longer a goal here, and we are good with the current approach that is based on component templates for the time being. |
With the logs-- templates in Elasticsearch we managed to roll this out to all users that are using the data stream naming scheme. It would be convenient for others to just turn it on with a flag, but I would not optimise for it, as I would rather help users migrate to the data stream naming scheme, since with it they also get all the other benefits. We should have a discussion in the context of logsdb about whether there is an option to have these templates in there automatically (also outside logs--). |
Ok thanks for the feedback. I am going to close this, we can always re-discuss the possibility of a boolean flag in the future, for now that's not something we are going to focus on. |
Description
The Elastic Common Schema (ECS) has naming and mapping conventions for an increasing set of fields. You might have already seen data sets that included fields called @timestamp, host.name or http.response.status_code; all these fields are standardized in ECS.
Having normalized field names and mappings is in the best interest of users, as it makes it easier to correlate events and metrics that come from different data sources. By the way, our integrations rely heavily on ECS.
One frustrating point is that even though some end users would leverage the ECS logging library for logging, which ensures that ECS field names get used (@timestamp for the time field, host.name for the name of the host, etc.), Elasticsearch would not always honor the mappings that ECS suggests for these fields, because Elasticsearch simply doesn't know about ECS. The only way users can work around this problem is by creating an index template themselves that includes ECS mappings for the fields that they are using. Note that it's not desirable to create an index template that includes a dynamic template for every possible ECS field, as there are now several thousand fields that are standardized in ECS.
Could we instead package ECS within Elasticsearch and introduce a new option for dynamic mappings to prefer ECS conventions for mappings when they exist? This would significantly simplify ingestion of custom sources of data that follow ECS conventions for field names, such as datasets produced by ECS logging.
For reference, it would also likely help simplify some of our integrations. Some sources of data have optional fields that might differ depending on vendors and other factors. Currently, the practice consists of creating index templates that include field mappings for every possible optional field, which results in pretty large templates such as the one for the Netflow integration, where lots of fields end up never being populated. If Elasticsearch had ECS built-in, these integrations could simply not map these optional vendor-specific fields and rely on Elasticsearch to map them automatically by following ECS conventions.
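For illustration, the workaround mentioned in the description (a user-maintained index template covering only the ECS fields in use) might be assembled like this in Python. The index pattern is a made-up example, and the helper is just one way to expand dotted names into the nested `properties` layout; the three field types are the ones ECS defines for these fields.

```python
# ECS types for the three fields named in the description.
ECS_FIELD_TYPES = {
    "@timestamp": "date",
    "host.name": "keyword",
    "http.response.status_code": "long",
}


def nested_properties(field_types):
    """Turn dotted field names into the nested `properties` layout of a mapping."""
    root = {}
    for path, field_type in field_types.items():
        node = root
        *parents, leaf = path.split(".")
        for part in parents:
            # Create the intermediate object on first use, then descend into it.
            node = node.setdefault(part, {"properties": {}})["properties"]
        node[leaf] = {"type": field_type}
    return root


index_template = {
    "index_patterns": ["my-ecs-logs-*"],  # hypothetical index pattern
    "template": {"mappings": {"properties": nested_properties(ECS_FIELD_TYPES)}},
}
```

Every additional ECS field a data source starts emitting would need another entry in `ECS_FIELD_TYPES`, which is the maintenance burden this issue asks Elasticsearch to take over.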