Prometheus output does not remove metrics that are no longer published #1334
Comments
Sounds like the issue could generally be that the prometheus output is not getting refreshed properly?
@sparrc yes it seems so, but I need to find a way / the time to prove that.
@sparrc ok so I just confirmed the bug. It seems not related at all to the rabbitmq input. Here are the steps to reproduce:
telegraf configuration:
test.sh
This means that when a metric is no longer published by an input, it is not removed from the prometheus output.
I can confirm this is happening.
I'm actually not sure that this is possible with prometheus. Prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval. If you want to "unregister" a metric, it can only be done on a completely permanent basis. This means that, in the above example, if we were to "unregister" the queue metrics once the queue disappears, they could never be reported again if the queue came back.

Telegraf (and influxdb), on the other hand, is quite a bit less stateful, and doesn't particularly care if a metric is reported at one timestamp and then not in another.

It's not entirely clear to me what the best solution is here. I don't think it's a good solution for ephemeral metrics to stick around forever reporting "0" values from all plugins, so I'm not sure there is a generic solution that will work.

For an overview of the register/unregister model of Prometheus you can view their Golang client instrumentation example code here: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register
I think more fine-grained control over what is being published in the prometheus Output may be the solution here. A couple of examples:
With this functionality, I could simply make sure that only the stat "metric" was set to 42 and 43, and that a counter was incremented whenever foo=bar. Also, I did not know that a metric must be reported at every interval. This would be useful to have in the README.md for the Prometheus Output.
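As one illustration of that kind of fine-grained control, here is a hedged sketch using Telegraf's existing measurement filtering on the prometheus output; the listen address and measurement names below are hypothetical examples, not settings taken from this issue:

[[outputs.prometheus_client]]
  # Address for the /metrics endpoint (example value)
  listen = ":9126"
  # Only publish an explicit whitelist of measurements to Prometheus
  # (hypothetical measurement names)
  namepass = ["rabbitmq_queue_messages", "rabbitmq_queue_messages_ready"]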
@sparrc it looks like the prometheus developers do not agree with the statement that "prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval": https://groups.google.com/forum/#!topic/prometheus-developers/iP2k68eUVrM
It's not that it has to be reported, it's more that if it's removed it can't be re-reported again. AFAICT this is exactly the situation that they state as impossible in this example: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

Specifically, this part of the code:

// Try registering taskCounterVec again.
if err := prometheus.Register(taskCounterVec); err != nil {
fmt.Println("taskCounterVec not registered:", err)
} else {
fmt.Println("taskCounterVec registered.")
}
// Bummer! Still doesn't work.
// Prometheus will not allow you to ever export metrics with
// inconsistent help strings or label names. After unregistering, the
// unregistered metrics will cease to show up in the /metrics HTTP
// response, but the registry still remembers that those metrics had
// been exported before. For this example, we will now choose a
// different name. (In a real program, you would obviously not export
// the obsolete metric in the first place.)
From what I read there, I understand that we would have to track all metrics ever reported and unregistered through the prometheus endpoint, and then artificially rename metrics that might have "returned" before we could re-register them.
I think there is in fact a way to do this. From that mailing list thread it looks like telegraf needs to become a prometheus "collector" rather than doing "direct instrumentation". I'm still not 100% sure I understand the distinction, but implementing the prometheus.Collector interface will allow Telegraf to report metrics that it collected on an interval in an on-demand manner.

This being said, it also would be going against one of the tenets of prometheus, which is that metrics are supposed to be collected at the time the http request is made. Telegraf will still be collecting on the interval and the cached values will get presented when the http request is made.
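For illustration, here is a minimal, hedged sketch of that "collector" approach written against the prometheus/client_golang API; it is not the actual Telegraf implementation, and the type names, cache layout and port are hypothetical. The idea is that values gathered on an interval are cached and handed to Prometheus on demand, so metrics that stopped being reported simply drop out of the /metrics response:

package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// sample is one cached measurement (hypothetical structure).
type sample struct {
	desc  *prometheus.Desc
	value float64
}

// cachedCollector implements prometheus.Collector over a cache that an
// input loop would overwrite on every collection interval.
type cachedCollector struct {
	mu      sync.Mutex
	samples map[string]sample // keyed by metric name + labels
}

// Describe is intentionally left empty; recent client_golang versions
// treat such a collector as "unchecked", which suits a metric set that
// changes over time.
func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) {}

// Collect emits whatever is currently cached; anything not reported in
// the last interval is simply absent from the scrape.
func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, s := range c.samples {
		ch <- prometheus.MustNewConstMetric(s.desc, prometheus.GaugeValue, s.value)
	}
}

func main() {
	c := &cachedCollector{samples: map[string]sample{}}

	// In a real setup the input side would refresh this cache on each
	// interval; here we just seed one example value.
	c.samples[`rabbitmq_queue_messages{queue="myqueue"}`] = sample{
		desc: prometheus.NewDesc("rabbitmq_queue_messages",
			"Number of messages in the queue.", nil,
			prometheus.Labels{"queue": "myqueue"}),
		value: 42,
	}

	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9126", nil))
}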
What is the solution to this?
@sidsingla If this is still an issue in version 1.5, can you open a new issue?
Bug report
System info:
Telegraf 0.13.1
Docker container: Ubuntu 16.04
RabbitMQ: 3.5.7
Steps to reproduce:
Expected behavior:
As the queue no longer exists, telegraf should output 0 as the message count for rabbitmq_queue_messages / rabbitmq_queue_messages_ready.
Actual behavior:
I still see non-zero values for the rabbitmq_queue_messages / rabbitmq_queue_messages_ready metrics of the non-existing queue.
Additional info:
With the influxdb output, telegraf seems to stop sending the metric with the corresponding tag.
The problem of the prometheus output not being refreshed / showing wrong metrics may be present for other inputs too.
Use case:
This is important because we are sending metrics that are wrong.