Timestamp inconsistencies when wrong HTTP response received #5189

Closed
lmangani opened this issue Dec 25, 2018 · 11 comments
@lmangani

This is a low-impact, potentially no-impact issue. Not sure if this was already documented or observed; if so, please feel free to dismiss and/or close without mercy :)

Our team noticed unexpected timestamping behavior in all recent releases of Telegraf. It is triggered when output plugins receive a wrong HTTP response (200 instead of 204) for a prolonged amount of time, which causes the emitted timestamps to fall behind. This is easily replicated with the InfluxDB output plugin using Influx line protocol.

Relevant telegraf.conf:

  • default influxdb output and agent metrics

Steps to reproduce:

  • Use telegraf to send line protocol through an HTTP proxy returning status 200 instead of 204

Expected behavior:

  • Consistent time stamping on all new line emissions

Actual behavior:

  • Timestamp lags behind and eventually stops incrementing after about 1h of execution

Thanks & Merry Holidays!

@glinton
Contributor

glinton commented Dec 26, 2018

Thanks for the report, we'll take a look.

@glinton glinton added this to the 1.10.0 milestone Dec 26, 2018
@glinton
Contributor

glinton commented Dec 26, 2018

What telegraf version were you using, and could you also provide more direct steps to reproduce to speed up the investigation?

@lmangani
Author

lmangani commented Dec 26, 2018

Thanks Greg,

Hopefully it's not a waste of your time, as this does not impact any official implementation, only proxies generating their own responses. I've tried 1.8.x and 1.9.x with apparently identical results. I suppose the easiest way to replicate would be to spin up a dummy endpoint in Express or Fastify and point Telegraf's InfluxDB output at it, i.e.:

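const express = require('express');
const app = express();
app.post('/write', function(req, res) {
  res.sendStatus(200); // should be 204
});
app.listen(8086); // 8086 is InfluxDB's default port; adjust to match Telegraf's output URL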

Telegraf will complain about not being able to reach any of the InfluxDB endpoints despite receiving the 200 OK, and within 30-some minutes the emitted timestamps should start shifting off...
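
For anyone who wants to try the reproduction without installing Express, the same idea can be sketched with Node's built-in `http` module. This is only an illustrative stand-in for the proxy described above: the function name `createStubServer` and the port in the usage comment are assumptions, not something taken from the report.

```javascript
// Hypothetical stand-in for the misbehaving proxy (not part of Telegraf or InfluxDB).
const http = require('http');

// Returns a server that answers every POST to /write with `statusCode`.
// InfluxDB itself answers a successful write with 204; the report used 200.
function createStubServer(statusCode = 200) {
  return http.createServer((req, res) => {
    req.resume(); // drain and discard the request body
    req.on('end', () => {
      if (req.method === 'POST' && req.url.startsWith('/write')) {
        res.statusCode = statusCode;
      } else {
        res.statusCode = 404;
      }
      res.end();
    });
  });
}

// To run it by hand (the port is an assumption; match Telegraf's output URL):
// createStubServer(200).listen(8086);
```

Switching the argument to 204 gives a quick control run against the same endpoint.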

@danielnelson
Contributor

Just so we have a rationale for the potential change, can you explain briefly what proxy you are using and why it can't be made to relay the actual 204 response code?

@lmangani
Author

@danielnelson this was discovered during very innocent testing. We already patched our little proxy's responses to 204 and the issue has completely disappeared, so there's no request or reason to report other than that we thought it was weird and concerning for a wrong response to impact emitted timestamps, with potential snowball effects, assuming this can be replicated by others.

@danielnelson
Contributor

What version of Telegraf are you using?

@lmangani
Author

The very latest, 1.9.1, but I've tried previous releases without noticeable differences.

@danielnelson
Contributor

I'd like to make sure that Telegraf is degrading properly. Even though Telegraf thinks the metrics haven't been written successfully, they actually have, so I would expect most metrics to be written once.

Timestamp lags behind and eventually stops incrementing after about 1h of execution

Does this mean the metrics completely stop being sent after about an hour?

@lmangani
Author

@danielnelson in our test scenario, metrics keep on coming and get relayed through the input plugins within the expected bucket duration, but they are tagged with progressively delayed timestamps.
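
One way to make this delay visible at the receiving end is to compare the timestamp carried by each line-protocol line with the time it arrives. The helper below is a minimal sketch: the name `timestampLagMs` is hypothetical, and it assumes every line ends with an integer timestamp in nanoseconds, which may not hold if a different write precision is configured.

```javascript
// Sketch only: how far does a line-protocol timestamp lag behind a reference time?
// Assumption: the line ends with an integer timestamp expressed in nanoseconds.
function timestampLagMs(line, nowMs = Date.now()) {
  const match = /\s(-?\d+)\s*$/.exec(line);
  if (!match) return null; // no trailing timestamp on this line
  // BigInt keeps full precision; nanosecond epochs exceed Number's safe integer range.
  const tsMs = Number(BigInt(match[1]) / 1000000n);
  return nowMs - tsMs;
}

// Example: a point stamped 30 seconds before the reference time.
const lag = timestampLagMs(
  'cpu,host=a usage=1.5 1545696000000000000',
  1545696030000
);
console.log(lag); // 30000
```

Logging this value for each received batch should show whether the lag stays flat or grows over time, as described above.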

@danielnelson
Contributor

Okay, I see what you are experiencing. This is definitely a bug, but it is separate from accepting 200 status codes, so I opened a new issue: #5194.

I think for now we won't change the output to support 200, since there is a minor risk that it could treat the wrong response as a success. If we run into a popular proxy that will only return 200s then we can reevaluate.
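
The trade-off can be illustrated with two hypothetical checks. Neither is Telegraf's actual code, and both function names are made up for this sketch. A strict check accepts only the 204 that InfluxDB returns for a successful write, while a lenient check would also accept a 200 coming from something that is not an InfluxDB write endpoint at all.

```javascript
// Hypothetical helpers for illustration; not taken from Telegraf.

// Strict: only 204 No Content, the status InfluxDB returns for a successful write.
function isWriteSuccessStrict(statusCode) {
  return statusCode === 204;
}

// Lenient: any 2xx counts, which would also accept an unrelated endpoint's 200.
function isWriteSuccessLenient(statusCode) {
  return statusCode >= 200 && statusCode < 300;
}

console.log(isWriteSuccessStrict(200));  // false
console.log(isWriteSuccessLenient(200)); // true
```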

@lmangani
Author

lmangani commented Dec 26, 2018

@danielnelson thanks! Reading the other issue helped me understand things better. The intent here was not to support the 200 response but rather to figure out the underlying bug, so thank you very much for your time!
