Prometheus output does not remove metrics that are no longer published #1334
Comments
Sounds like the issue could generally be that the prometheus output is not getting refreshed properly?
@sparrc yes it seems so, but I need to find a way / the time to prove that.
@sparrc ok so I just confirmed the bug. It seems not related at all to the rabbitmq input. Here are the steps to reproduce:
telegraf configuration:
test.sh
This means that when a metric is no longer published by an input, it is not removed from the prometheus output.
I can confirm this is happening.
I'm actually not sure that this is possible with prometheus. Prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval. If you want to "unregister" a metric, it can only be done on a completely permanent basis. This means that, in the above example, if we were to "unregister" the queue metrics once the queue disappears, they could never be reported again if the queue came back.

Telegraf (and influxdb), on the other hand, is quite a bit less stateful, and doesn't particularly care if a metric is reported at one timestamp and then not in another.

It's not entirely clear to me what the best solution is here. I don't think it's a good solution for ephemeral metrics to stick around forever reporting "0" values from all plugins, so I'm not sure there is a generic solution that will work.

For an overview of the register/unregister model of Prometheus you can view their Golang client instrumentation example code here: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register
I think more fine-grained control over what is being published in the prometheus Output may be the solution here. A couple of examples:
With this functionality, I could simply make sure that only the stat "metric" was set to 42 and 43, and that a counter was incremented whenever foo=bar. Also, I did not know that a metric must be reported at every interval. This would be useful to have in the README.md for the Prometheus Output.
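As one illustration of that kind of fine-grained control, here is a hedged sketch using Telegraf's existing measurement filtering on the prometheus output; the listen address and measurement names below are hypothetical examples, not settings taken from this issue:

[[outputs.prometheus_client]]
  # Address for the /metrics endpoint (example value)
  listen = ":9126"
  # Only publish an explicit whitelist of measurements to Prometheus
  # (hypothetical measurement names)
  namepass = ["rabbitmq_queue_messages", "rabbitmq_queue_messages_ready"]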
@sparrc it looks like the prometheus developers do not agree with the statement that "prometheus basically comes with the assumption that once a metric has been reported, it must be reported at every interval": https://groups.google.com/forum/#!topic/prometheus-developers/iP2k68eUVrM
It's not that it has to be reported, it's more that if it's removed it can't be re-reported again. AFAICT this is exactly the situation that they state as impossible in this example: https://godoc.org/github.com/prometheus/client_golang/prometheus#example-Register

Specifically, this part of the code:

// Try registering taskCounterVec again.
if err := prometheus.Register(taskCounterVec); err != nil {
fmt.Println("taskCounterVec not registered:", err)
} else {
fmt.Println("taskCounterVec registered.")
}
// Bummer! Still doesn't work.
// Prometheus will not allow you to ever export metrics with
// inconsistent help strings or label names. After unregistering, the
// unregistered metrics will cease to show up in the /metrics HTTP
// response, but the registry still remembers that those metrics had
// been exported before. For this example, we will now choose a
// different name. (In a real program, you would obviously not export
// the obsolete metric in the first place.)
From what I read there, I understand that we would have to track all metrics ever reported and unregistered through the prometheus endpoint, and then artificially rename metrics that might have "returned" before we could re-register them.
I think there is in fact a way to do this. From that mailing list thread it looks like telegraf needs to become a prometheus "collector" rather than doing "direct instrumentation". I'm still not 100% sure I understand the distinction, but implementing the prometheus.Collector interface will allow Telegraf to report metrics that it collected on an interval in an on-demand manner.

This being said, it also would be going against one of the tenets of prometheus, which is that metrics are supposed to be collected at the time the http request is made. Telegraf will still be collecting on the interval and the cached values will get presented when the http request is made.
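For illustration, here is a minimal, hedged sketch of that "collector" approach written against the prometheus/client_golang API; it is not the actual Telegraf implementation, and the type names, cache layout and port are hypothetical. The idea is that values gathered on an interval are cached and handed to Prometheus on demand, so metrics that stopped being reported simply drop out of the /metrics response:

package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// sample is one cached measurement (hypothetical structure).
type sample struct {
	desc  *prometheus.Desc
	value float64
}

// cachedCollector implements prometheus.Collector over a cache that an
// input loop would overwrite on every collection interval.
type cachedCollector struct {
	mu      sync.Mutex
	samples map[string]sample // keyed by metric name + labels
}

// Describe is intentionally left empty; recent client_golang versions
// treat such a collector as "unchecked", which suits a metric set that
// changes over time.
func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) {}

// Collect emits whatever is currently cached; anything not reported in
// the last interval is simply absent from the scrape.
func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, s := range c.samples {
		ch <- prometheus.MustNewConstMetric(s.desc, prometheus.GaugeValue, s.value)
	}
}

func main() {
	c := &cachedCollector{samples: map[string]sample{}}

	// In a real setup the input side would refresh this cache on each
	// interval; here we just seed one example value.
	c.samples[`rabbitmq_queue_messages{queue="myqueue"}`] = sample{
		desc: prometheus.NewDesc("rabbitmq_queue_messages",
			"Number of messages in the queue.", nil,
			prometheus.Labels{"queue": "myqueue"}),
		value: 42,
	}

	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9126", nil))
}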
What is the solution to this?
@sidsingla If this is still an issue in version 1.5, can you open a new issue?
Bug report
System info:
Telegraf 0.13.1
Docker container: Ubuntu 16.04
RabbitMQ: 3.5.7
Steps to reproduce:
Expected behavior:
As the queue no longer exists, telegraf should output 0 as the message count for rabbitmq_queue_messages / rabbitmq_queue_messages_ready.
Actual behavior:
I still see non-zero values for the rabbitmq_queue_messages / rabbitmq_queue_messages_ready metrics of the non-existing queue.
Additional info:
With the influxdb output, telegraf seems to stop sending the metric with the corresponding tag.
The problem of the prometheus output not being refreshed / showing wrong metrics may be present for other inputs too.
Use case:
This is important because we are sending metrics that are wrong.