Containers in restarting mode are still reported but the timestamp falls outside of retention policy... #3383
Comments
Ironically, the container stuck in restart mode is the Prometheus container that I was trying to replace... LOL |
Can you enable a file output and reproduce the error? This will let us see exactly what we are submitting to InfluxDB. |
I will put this on my todo list for this week |
We were just diagnosing this issue and tracked it down to a container that we were stopping.
|
@jspiewak Can you add the output of:
|
@danielnelson our system runs customer workloads and we create, start, stop, and then destroy containers all the time. That container is long gone. That said, we are regularly seeing the negative timestamp occur, so if there is a different thing you would like me to try, happy to give it a go. In case it is at all useful, we are running Docker 17.04 and telegraf 1.4.1 |
Do you think you can capture this from another container exhibiting the bug? |
Doubtful. The scenario we have is that our system decides the container should no longer be running, it is stopped, and then it is destroyed. My guess is that the telegraf docker plugin obtains a list of all running containers, then while it is obtaining the stats for the containers, the container in question is stopped and destroyed. When telegraf goes to obtain stats, something doesn't go quite right, and the negative timestamp is the consequence. FWIW, we likely run too many containers per host, so there will be times when 90 containers need metrics collected in a pass. |
I could create a special build of Telegraf that will dump the JSON when an error occurs, would you be able to run this until a negative timestamp is found? |
Sure thing. Would it be helpful to upgrade to 1.4.3 first? |
Here is a build of 1.4.3 with the logging patch; it will log the JSON from the endpoint if the timestamp is negative. It is a Debian package, so let me know if you need a different package. |
We are running telegraf as a Docker container, actually |
|
FYI, looks like the image is missing the entrypoint.sh, and is much smaller than 1.4.3:
We are using the Chef Docker cookbook to configure the container, and something about not having the entrypoint broke the update to use this image. Anyway, it is now deployed in production, so we should have something by tomorrow. What can I grep for in the output, since we have it log the metrics as well? |
I don't see any JSON debugging, but I had the processors.printer enabled anyway:
|
By the way, -6795364578871345152 appears to be a magic number: https://groups.google.com/d/msg/influxdb/ZT1NOsKIHCs/9Ojqh083IyAJ |
Okay, it all makes sense now, and it also explains why the logging didn't work: this is just the zero value for a time.Time. I think the best solution is to ignore the time reported by Docker and always use the current time according to Telegraf. |
Bug report
I just started using the TICK stack in favor of our old Prometheus stack and found this. When a container is stuck in restarting mode, the Telegraf docker plugin generates points that fall outside my already pretty wide retention policy (180d), which probably means it is sending a timestamp that is extremely low or maybe even a 0 value.
Relevant telegraf.conf:
Nothing special, just:
System info:
Ubuntu 16.04 LTS: 4.4.0-66-generic
Docker version 1.12.6, build 78d1802
Telegraf (running from docker) using "telegraf:alpine"
Steps to reproduce:
1. A container is stuck in restarting mode (a custom container that fails to start because there is an error in the docker-entrypoint.sh usually does the trick)
2. An InfluxDB instance is running
3. Telegraf is running with the docker plugin active (no special config)
Expected behavior:
You should not be submitting points that result in an invalid timestamp
Actual behavior:
You are submitting points that result in an invalid timestamp which generates error messages