Allow timeout for metrics #19
Comments
You may be interested in the textfile module of the node_exporter. It allows you to export information on the local filesystem, and since it's on the node, it will go away when the node does.
@matthiasr Actually this is a great point by @brian-brazil. We should simply move chef-client exporting to the node exporter, since it can be considered a per-node metric. Then the time series will go away automatically if the host is gone.
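For illustration, here is a minimal sketch of that textfile-collector approach, assuming client_golang's prometheus.WriteToTextfile helper; the metric name, file path, and collector directory are made up for the example and are not taken from this thread:

```go
// Sketch: export a per-node batch result via the node_exporter textfile
// collector instead of pushing it to the Pushgateway. Assumes node_exporter
// is started with --collector.textfile.directory pointing at the same path.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	lastRun := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "chef_client_last_run_timestamp_seconds",
		Help: "Unix timestamp of the last successful chef-client run.",
	})
	reg.MustRegister(lastRun)
	lastRun.SetToCurrentTime()

	// Written atomically (temp file + rename). The metric lives and dies with
	// the node, which is exactly the lifecycle coupling discussed above.
	if err := prometheus.WriteToTextfile(
		"/var/lib/node_exporter/textfile_collector/chef_client.prom", reg); err != nil {
		log.Fatal(err)
	}
}
```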
A use case for this has appeared. It may be a way to allow clients who really, really want to push to do so, while offering some garbage collection.
@juliusv Agreed, that side-steps the issue in our case. But I think it's still something needed, e.g. for cron jobs – an hourly cronjob may report to the pushgateway, but after >1h that metric is no longer valid.
@matthiasr Hourly cronjobs are service-level monitoring of batch jobs, which is the primary use case for the Pushgateway. You'd export that without any expiry, timestamps, or other advanced things like that.
Not necessarily … I'm not necessarily monitoring the job itself, but instead e.g. some complex calculated value from a Hadoop job. But even when monitoring, say, the runtime of my cronjob, how would I tell whether it just always takes the same time or whether it has simply never run again? I'd rather have no metric if it didn't run than the metric from the last time it ran. At least in some cases, which is why I think it should be optional.
@matthiasr To expand on what Brian said, an hourly cronjob would push its last completion timestamp to the pushgateway. That way you can monitor (e.g. via an expression comparing time() against that timestamp) whether the job has completed recently enough.
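To make that pattern concrete, a minimal sketch using client_golang's push package; the Pushgateway URL, job name, and metric name are placeholders for this example:

```go
// Sketch: a batch job pushes its last completion timestamp to the Pushgateway.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "my_batch_job_last_completion_timestamp_seconds",
		Help: "Unix timestamp of the last completed batch job run.",
	})
	completionTime.SetToCurrentTime()

	// Push the metric to the Pushgateway under the given job name.
	if err := push.New("http://pushgateway:9091", "my_batch_job").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal("could not push completion time to Pushgateway: ", err)
	}
}
```

An alerting expression along the lines of `time() - my_batch_job_last_completion_timestamp_seconds > 2 * 3600` would then fire when the hourly job misses its schedule, without the pushed series ever needing to expire.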
In some situations it is very useful to be able to submit values to the pushgateway that disappear after a certain while (if they are not refreshed). The lifetime is specified by adding a "Lifetime" field to the HTTP headers. The value is a string in a format accepted by Go's built-in ParseDuration function. Implements prometheus#19
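As a rough illustration of the approach that patch describes (this is not code from the actual change; the function name and the "keep forever when absent" convention are assumptions):

```go
// Sketch of reading a per-push lifetime from a "Lifetime" HTTP header.
package lifetime

import (
	"fmt"
	"net/http"
	"time"
)

// lifetimeFromRequest parses the requested lifetime of a pushed group.
// A zero duration means no expiry was requested.
func lifetimeFromRequest(r *http.Request) (time.Duration, error) {
	raw := r.Header.Get("Lifetime")
	if raw == "" {
		return 0, nil // no lifetime requested: keep the group until deleted
	}
	d, err := time.ParseDuration(raw) // accepts e.g. "90s", "15m", "2h"
	if err != nil {
		return 0, fmt.Errorf("invalid Lifetime header %q: %v", raw, err)
	}
	return d, nil
}
```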
After some discussions, the conclusion is that we don't want this feature for now (in the spirit of https://twitter.com/solomonstre/status/715277134978113536 ). In most cases, this feature is requested to implement anti-patterns in the monitoring set-up. There might still be a small number of legitimate use cases, but in view of the huge potential for abusing the feature, and also the semantic intricacies that would be hard to get right when implementing it, we declare it a bad trade-off.
I would like a TTL too; in the end I've created a very bare while loop in bash to accomplish that. Anyway, I don't see why you are so strongly opinionated against a TTL: it's a feature that is not hard to implement, and a lot of people want it. I understand that you can say everyone is using the pushgateway in the wrong manner, but maybe that's not true; a lot of people have different problems to solve. Now, if you have a good alternative for my use case, at the end of which I have a lot of duplicated metrics like this, another example, let me know. P.S.
@fvigotti Since the statefulset itself is fundamentally long-running and discoverable via Kubernetes SD (which gives you all the discovery-metadata benefits), this seems like a similar case to using the Node Exporter's textfile module for metrics tied to a specific host (just that here it's a statefulset's pod and not a host). So I'd expect the recommended approach would be to have a sidecar in each pod that serves metrics (either a specialized exporter or the Node Exporter with only the textfile collector enabled) instead of pushing the metrics to a PGW and then not having pod and PGW lifecycles tied together. This will enable clean cuts of metrics as well, since even with a TTL you will either lose metrics too early or you will have a lot of overlap between dead and alive instances.
It makes more sense to have a discussion like this on the various Prometheus mailing lists rather than in a GitHub issue (in particular a closed one). A straightforward feature request might still be a fit for a (new) GitHub issue, but where it is already apparent that it is more complicated than "Good idea, PRs welcome", I'd strongly recommend starting a thread on the prometheus-developers mailing list. If you are seeking advice on how to implement your use case with the given tooling, the prometheus-users mailing list is a good place. On both mailing lists, more people are available to potentially chime in, and the whole community can benefit.
@juliusv Yes, the statefulsets (i.e. MySQL, Jenkins, etc.) export their metrics using standard patterns, since they are long-running services. But the job in preDestroy (which triggers a snapshot backup plus some checks and then pushes metrics about the backup/destroy process; the statefulset is long-running, but it is about to shut down, so I can't wait for the next Prometheus scrape interval), or a sidecar pod with a bash script that performs backups and different/custom health checks, are better exported via the PGW with a simple curl, without having to integrate node exporter textfiles or web services into every sidecar. I'm saying that not because I'm looking for advice on how I set up metrics (even if advice is always welcome :) ), but to show you how I use the PGW and why I'm also interested in a TTL. It seems to me that the design you have in mind for the PGW covers a very limited use case, and you don't want to extend it so as not to create possible anti-patterns. That's also why I'm telling you my use case: to let you decide whether mine is an anti-pattern or not.
I'll lock this issue now. That's not to stifle the discussion but, on the contrary, to not let it rot in a closed issue in a repo that not every developer is tracking. Whoever is interested in convincing the Prometheus developer community to revert the decision of not implementing a TTL/timeout for pushed metrics, please open a thread on the prometheus-developers mailing list. (TODO for @beorn7: Once such a thread exists, link it here and in the README.md.) @fvigotti I understand that you are not keen on subscribing to a mailing list for every piece of software you use. However, the Prometheus developers are not keen, either, to track all the (open and closed) issues of all repos in the Prometheus org (there are 38 of them!). As the Prometheus developers are doing all the work you are benefiting from, I think it is fair to ask that you play by their rules of how to tell them about your request.
In some scenarios, a client will stop pushing metrics because it has gone away. Currently every node needs to be deleted explicitly or the last value will stick around forever. It would be good to be able to configure a timeout after which a metric is considered stale and removed. I think it would be best if the client could specify this.
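Purely as an illustration of what the requested behavior would amount to (the Pushgateway never implemented this; the type, field names, and sweep mechanism below are hypothetical):

```go
// Hypothetical sketch of a staleness sweep for pushed groups with a
// client-specified timeout. The real Pushgateway keeps pushed groups
// until they are deleted explicitly.
package ttl

import (
	"sync"
	"time"
)

type staleGroups struct {
	mu       sync.Mutex
	lastPush map[string]time.Time     // when each push group was last updated
	ttl      map[string]time.Duration // timeout requested by the client, 0 = never expire
}

// sweep removes every group whose client-specified timeout has elapsed,
// so its series go stale in Prometheus instead of sticking around forever.
func (s *staleGroups) sweep(deleteGroup func(key string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	for key, t := range s.lastPush {
		if ttl := s.ttl[key]; ttl > 0 && now.Sub(t) > ttl {
			deleteGroup(key)
			delete(s.lastPush, key)
			delete(s.ttl, key)
		}
	}
}
```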