Nested fields and data types #1406

Closed
WillDrug opened this issue Dec 20, 2019 · 2 comments · Fixed by #2095
Labels
transform: lua Anything `lua` transform related

Comments

@WillDrug

There seems to be a problem with data types and nested fields.
For a log event with this structure:

{
  "str_data_a": "this is a string",
  "container_a": {
    "str_data_b": "this is another string",
    "int_data_a": 5
  },
  "int_data_b": 2
}

1. Lua transform
Containers do not evaluate: events["container_a"] is nil, while events["container_a.str_data_b"] is a string.
Each assignment drops the type. After events["container_a.int_data_a"] = events["int_data_b"], container_a.int_data_a is the string "2" rather than the integer 2.
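
The behaviour described above is consistent with the event being held as a flat map keyed by dotted paths, with values crossing into Lua as strings. The following Python sketch models that idea only; the `flatten` helper is hypothetical and is not Vector's implementation.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


event = flatten({
    "str_data_a": "this is a string",
    "container_a": {"str_data_b": "this is another string", "int_data_a": 5},
    "int_data_b": 2,
})

# The container itself is not a key; only its leaves are.
container = event.get("container_a")
leaf = event.get("container_a.str_data_b")

# Assumption: values are handed over as strings, so copying a field loses its integer type.
event["container_a.int_data_a"] = str(event["int_data_b"])
```

Under this model, looking up `container_a` finds nothing because no such key exists, which matches the nil seen in the transform.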

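2. Coercer
int_data_b = "int" works.
The unquoted key container_a.int_data_a = "int" fails with "expected str, got map" (presumably because TOML parses an unquoted dotted key as a nested table), while the quoted key "container_a.int_data_a" = "int" works, which is a bit unwieldy.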


I would suggest digging into the Lua transform (if that isn't already planned for an upcoming update), changing the event from userdata into a table, and somehow enforcing data types.

As it stands, the elasticsearch sink does not have a template mapping, so each transformation requires at least one coercer.

@ghost added the transform: lua and transform: coercer labels Dec 20, 2019
@ghost added this to the Improve data processing milestone Dec 20, 2019
@ghost

ghost commented Dec 20, 2019

Thank you for creating this issue! We are in the process of improving the processing of nested events; see issue #704 for details.

In the meantime, all nested fields are, as you noticed, flattened into keys such as container_a.str_data_b. Until actual nesting is implemented in the lua transform, you can interact with them from Lua using the following workaround:

  • Use the pairs method added in version 0.6.0 to list all nested field names in the form a.b.c;
  • Then use Lua string manipulation functions to detect or change nested fields, encoding newly created fields using the same dot notation.
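
As a rough illustration of the two steps, here is the same idea in Python rather than Lua: iterate over the flattened keys, pick out those under a prefix, and write new keys in the same dot notation. The helper names `fields_under` and `rename_prefix` are hypothetical, and the event is modelled as a plain dict.

```python
def fields_under(event, prefix):
    """Return {relative_name: value} for every flattened key below `prefix`."""
    marker = prefix + "."
    return {k[len(marker):]: v for k, v in event.items() if k.startswith(marker)}


def rename_prefix(event, old, new):
    """Return a copy of the event with keys under `old.` moved under `new.`."""
    marker = old + "."
    return {
        (new + "." + k[len(marker):] if k.startswith(marker) else k): v
        for k, v in event.items()
    }


event = {
    "str_data_a": "this is a string",
    "container_a.str_data_b": "this is another string",
    "container_a.int_data_a": "5",
}

nested = fields_under(event, "container_a")
moved = rename_prefix(event, "container_a", "payload")
```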

For example, adding a new field named x.y.z with value 'a' in the Lua transform results in the object {"x": {"y": {"z": "a"}}} being written to sinks when JSON encoding is used.
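
The expansion from dotted key to nested object can be sketched as follows. This is a Python model of the idea, not the encoder Vector uses, and the `unflatten` helper is hypothetical.

```python
import json


def unflatten(flat):
    """Expand dotted keys into nested dicts.

    Assumes no key is both a leaf and a parent of another key.
    """
    nested = {}
    for path, value in flat.items():
        node = nested
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


encoded = json.dumps(unflatten({"x.y.z": "a"}))
```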

The same applies to the coercer transform, so "container_a.int_data_a" = "int" is currently the way to address nested fields from it, but this should change with the upcoming improvements to nested field handling.
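
To make the lookup concrete: if the coercer matches its configured names against the flattened keys, then only the full dotted string identifies a nested field. A minimal Python sketch under that assumption (the `coerce` helper and the `CASTS` table are hypothetical, not the transform's actual code):

```python
# Assumption: a small set of type names mapped to conversion functions.
CASTS = {"int": int, "float": float, "string": str}


def coerce(event, types):
    """Return a copy of the flat event with the listed fields converted."""
    out = dict(event)
    for field, type_name in types.items():
        if field in out:  # fields missing from the event are skipped
            out[field] = CASTS[type_name](out[field])
    return out


event = {"int_data_b": "2", "container_a.int_data_a": "5"}
coerced = coerce(event, {"int_data_b": "int", "container_a.int_data_a": "int"})
```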

> As it stands, the elasticsearch sink does not have a template mapping, so each transformation requires at least one coercer.

This is interesting. Do you have a specific idea for improving the elasticsearch sink so that fewer coercer transforms are needed?

@WillDrug
Author

WillDrug commented Dec 20, 2019

@a-rodin
Nested fields are a bugbear, but they can be worked around. I'm more afraid of arrays, to be honest; I haven't checked them yet. In Lua, the userdata key is the string "container_a.array_data_a[0]". If that breaks, it can't be coerced back, since the coercer has no such data type.
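
One way to recover array structure from keys of that shape is to parse the bracketed index out of the string. The Python sketch below assumes keys look exactly like "name.name[0]"; `parse_path` and `collect_array` are hypothetical helpers, not part of Vector.

```python
import re

# A segment is either a field name or a bracketed integer index.
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(key):
    """Split 'a.b[0]' into ['a', 'b', 0]: names as str, indexes as int."""
    return [int(idx) if idx else name for name, idx in _SEGMENT.findall(key)]


def collect_array(event, array_key):
    """Gather 'array_key[i]' entries from a flat event into an ordered list."""
    target = parse_path(array_key)
    items = {}
    for key, value in event.items():
        path = parse_path(key)
        if path[:-1] == target and isinstance(path[-1], int):
            items[path[-1]] = value
    return [items[i] for i in sorted(items)]


event = {
    "container_a.array_data_a[1]": "b",
    "container_a.array_data_a[0]": "a",
    "str_data_a": "this is a string",
}
array = collect_array(event, "container_a.array_data_a")
```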

For ES, I was thinking specifically about ad-hoc typing, which ES supports. You can't use the API to change a data type that has already been set, but you can update the template with every event (Logstash can do this natively).

So, let's say I've forgotten to coerce some data. The string representation is still JSON, and if my index has a template, ES will coerce it. But on the first message, ES will judge the field to be "string", and the API cannot change it afterwards. So I would go into the elasticsearch sink configuration and add something like
[sinks.elastic_sink.template] with my_data = "int" (which in ES would be {"data": {"type": "int"}}, IIRC). Now, even if I've forgotten the coercer, ES will try to convert the field. Since ES was made for log data, it works with partial templates easily.

The idea is to send a template only for the fields I need as a particular type, which can then be changed without a full re-index. Note that I don't actually know how this works; I've just seen a Logstash implementation :) And, obviously, since ES would be doing the coercing, I wouldn't need the coercer that much.
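
A partial template of that kind could be built roughly as below. This is a sketch under several assumptions: the `[sinks.elastic_sink.template]` option is only the proposal above and does not exist, the short type names and the `vector-*` index pattern are made up for illustration, and the body follows the typeless mapping layout of Elasticsearch 7 index templates. Nothing is sent over the network here.

```python
import json

# Assumption: short names from the proposed sink option, mapped to Elasticsearch field types.
ES_TYPES = {"int": "integer", "float": "float", "bool": "boolean"}


def build_template(index_pattern, fields):
    """Build an index template body that types only the listed fields."""
    properties = {name: {"type": ES_TYPES[t]} for name, t in fields.items()}
    return {
        "index_patterns": [index_pattern],
        "mappings": {"properties": properties},
    }


template = build_template("vector-*", {"my_data": "int"})
body = json.dumps(template)  # what a sink would send as the template request body
```

Fields not listed in the template would still be typed dynamically by ES, which is what makes a partial template workable.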

This issue was closed.