Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(parsers.avro): Do not force addition of timestamp as a field #13856

Merged
merged 3 commits into from
Sep 11, 2023

Conversation

athornton
Copy link
Contributor

Required for all PRs

  • [ n/a ] Updated associated README.md.
  • [ X ] Wrote appropriate unit tests.
  • [ X ] Pull request title or commits are in conventional commit format

resolves #13821

Dropped the forced addition of the timestamp as a field, and touched up the JSON test case to note that I did want the timestamp as a field.

Note that this also includes the fix for Issue #13854 / PR #13855, because it impacts correct timestamp specification. We might want to just merge this PR instead and drop that one, although it's just going to be "skipping already-applied-commit" if we don't.

@athornton athornton changed the title Fix/timestamp handling fix: Do not force addition of timestamp as a field Sep 1, 2023
@athornton
Copy link
Contributor Author

athornton commented Sep 1, 2023

Ah, I see the problem.

The field selection is done before the parsing-as-timestamp, so if you tell the plugin you want as a timestamp, and then it's not one of your fields: could not parse \"testdata/json-array/message.json\": could not parse '<nil>' to 'unix_ms' because the field has already been dropped.

So all I need to do there is take the timestamp from the raw input data, and not from the stuff already parsed into fields.

@athornton
Copy link
Contributor Author

There's another case here, and I think what I am doing is the right thing but I do not know.

What I have is working fine for flattening or not flattening maps.

But even if you are not flattening the map (which is to say, you don't get measurements with the data type stitched to the field), I'm still flattening arrays, to {field_name}{field_sep}{index}, because I don't see a better way to do array values that will work with InfluxDB line protocol.

Example: you have data = [ 0, 1, 2, 3 ] and avro_field_separator = "_" ; this goes into your measurement as data_0=0,data_1=1,data_2=2,data_3=3

(The only other remotely plausible way seems like it would be outputting a string which is the JSON representation of the array, but that requires knowledge of how to interpret that string on the receiving end.)

And while it's true that people may not be outputting to InfluxDB, the telegraf testutils definitely expect to be able to parse metrics from a file that's in line protocol, so I think it's safe to consider that our default format.

@srebhan
Copy link
Member

srebhan commented Sep 5, 2023

@athornton before giving this a review I'm trying to understand why the behavior was what it was... Please bear with me as I'm not an Avro expert. Some questions:

  1. Is my assumption correct that you always get a timestamp in an Avro message?
  2. I read the code as, we output the timestamp as a field only if you set the timestamp option?
  3. Is there another way to extract the timestamp field from the Avro message with the existing means in the parser?

@athornton
Copy link
Contributor Author

I'm not an Avro expert either, but I can ask @afausti about, at least, how we're doing it here.

  1. I think that Avro messages do not necessarily have a timestamp. But you are sending it to Telegraf and putting it into InfluxDB line protocol, which necessarily DOES have a timestamp. So if there's not a field in that message you can identify as a timestamp, you should just use the time you handled the message as the timestamp. At least, I don't see anything in https://avro.apache.org/docs/1.11.1/specification/ that says to me that you must have a timestamp.

  2. The prior behavior was that, if we did identify a field in the incoming Avro message as a timestamp, that would always be added to our InfluxDB fields, even if we were explicitly listing the things we wanted as fields and that was not one of them. This is what the issue was about: since line protocol always has a timestamp, it's not necessary for it to be a separate field. Although in my use case we always do want it as a field, this turns out to not be generally true.

  3. Not that I can think of. Avro contains some logical types (basically type aliases) for date and time encoding (https://avro.apache.org/docs/1.11.1/specification/#timestamp-millisecond-precision) but they boil down to a long for number-of-millis-or-micros-since-the-Unix-epoch. We could attempt autodetection of the timestamp types, but that seems very fragile. I know in our own use case that by the time a message gets to telegraf it has several fields of that type, as each stage of the processing pipeline stamps its own "I handled this message at" field into it, and Rubin is probably not unique in this regard. Just saying what the timestamp field is will be simpler and less fragile, I think.

And actually I have a question about #1: The test utils all assume that you're basically parsing to line protocol. Is that a good generic assumption about telegraf? Obviously there are at least some people who do not output to InfluxDB at all, but in my mind, the point of telegraf is primarily to feed InfluxDB and other output cases are secondary--that is, I should think of it as a very flexible converter for InfluxDB measurements, and not as a generic converter like some kind of imagemagick equivalent for metric events. But I don't know if I'm actually correct about that.

@athornton athornton force-pushed the fix/timestamp-handling branch from c479139 to 48cb0e2 Compare September 6, 2023 17:35
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@athornton can you please rebase this after PR #13855 is merged!? Furthermore, I do have one question regarding the supplied_timestamp_fields_unspecified test case...

@srebhan srebhan changed the title fix: Do not force addition of timestamp as a field fix(parsers.avro): Do not force addition of timestamp as a field Sep 7, 2023
@srebhan srebhan added fix pr to fix corresponding bug plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins labels Sep 7, 2023
@srebhan srebhan self-assigned this Sep 7, 2023
@athornton athornton force-pushed the fix/timestamp-handling branch 2 times, most recently from 4879d28 to 26a2140 Compare September 7, 2023 23:23
@athornton athornton force-pushed the fix/timestamp-handling branch from 26a2140 to a790583 Compare September 8, 2023 20:57
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @athornton!

@srebhan srebhan added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Sep 11, 2023
@srebhan srebhan assigned powersj and unassigned srebhan Sep 11, 2023
@powersj powersj merged commit 855b25d into influxdata:master Sep 11, 2023
@github-actions github-actions bot added this to the v1.28.0 milestone Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avro Parser put timestamp to the metric fields even if it is not specified in the field list
3 participants