Allow timeout for metrics #19
Comments
You may be interested in the textfile module of the node_exporter. It allows you to export information on the local filesystem, and since it's on the node, it will go away when the node does.
@matthiasr Actually this is a great point by @brian-brazil. We should simply move chef-client exporting to the node exporter, since it can be considered a per-node metric. Then the time series will go away automatically if the host is gone.
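For illustration, here is a minimal sketch of that textfile-collector approach, assuming client_golang's prometheus.WriteToTextfile helper; the metric name, file path, and collector directory are made up for the example and are not taken from this thread:

```go
// Sketch: export a per-node batch result via the node_exporter textfile
// collector instead of pushing it to the Pushgateway. Assumes node_exporter
// is started with --collector.textfile.directory pointing at the same path.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	lastRun := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "chef_client_last_run_timestamp_seconds",
		Help: "Unix timestamp of the last successful chef-client run.",
	})
	reg.MustRegister(lastRun)
	lastRun.SetToCurrentTime()

	// Written atomically (temp file + rename). The metric lives and dies with
	// the node, which is exactly the lifecycle coupling discussed above.
	if err := prometheus.WriteToTextfile(
		"/var/lib/node_exporter/textfile_collector/chef_client.prom", reg); err != nil {
		log.Fatal(err)
	}
}
```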
A use case for this has appeared. It may be a way to allow clients who really, really want to push to do so, while offering some garbage collection.
@juliusv Agreed, that side-steps the issue in our case. But I think it's still something needed, e.g. for cron jobs – an hourly cronjob may report to the pushgateway, but after >1h that metric is no longer valid.
@matthiasr Hourly cronjobs are service-level monitoring of batch jobs, which is the primary use case for the Pushgateway. You'd export that without any expiry, timestamps, or other advanced things like that.
Not necessarily … I'm not necessarily monitoring the job itself, but instead e.g. some complex calculated value from a Hadoop job. But even when monitoring, say, the runtime of my cronjob, how would I tell whether it just always takes the same time or whether it has simply never run again? I'd rather have no metric if it didn't run than the metric from the last time it ran. At least in some cases, which is why I think it should be optional.
@matthiasr To expand on what Brian said, an hourly cronjob would push its last completion timestamp to the pushgateway. That way you can monitor (e.g. via an expression comparing time() against that timestamp) whether the job has completed recently enough.
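To make that pattern concrete, a minimal sketch using client_golang's push package; the Pushgateway URL, job name, and metric name are placeholders for this example:

```go
// Sketch: a batch job pushes its last completion timestamp to the Pushgateway.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "my_batch_job_last_completion_timestamp_seconds",
		Help: "Unix timestamp of the last completed batch job run.",
	})
	completionTime.SetToCurrentTime()

	// Push the metric to the Pushgateway under the given job name.
	if err := push.New("http://pushgateway:9091", "my_batch_job").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal("could not push completion time to Pushgateway: ", err)
	}
}
```

An alerting expression along the lines of `time() - my_batch_job_last_completion_timestamp_seconds > 2 * 3600` would then fire when the hourly job misses its schedule, without the pushed series ever needing to expire.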
In some situations it is very useful to be able to submit values to the pushgateway that disappear after a certain while (if they are not refreshed). The lifetime is specified by adding a "Lifetime" field to the HTTP headers. The value is a string in a format accepted by Go's built-in ParseDuration function. Implements prometheus#19
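As a rough illustration of the approach that patch describes (this is not code from the actual change; the function name and the "keep forever when absent" convention are assumptions):

```go
// Sketch of reading a per-push lifetime from a "Lifetime" HTTP header.
package lifetime

import (
	"fmt"
	"net/http"
	"time"
)

// lifetimeFromRequest parses the requested lifetime of a pushed group.
// A zero duration means no expiry was requested.
func lifetimeFromRequest(r *http.Request) (time.Duration, error) {
	raw := r.Header.Get("Lifetime")
	if raw == "" {
		return 0, nil // no lifetime requested: keep the group until deleted
	}
	d, err := time.ParseDuration(raw) // accepts e.g. "90s", "15m", "2h"
	if err != nil {
		return 0, fmt.Errorf("invalid Lifetime header %q: %v", raw, err)
	}
	return d, nil
}
```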
After some discussions, the conclusion is that we don't want this feature for now (in the spirit of https://twitter.com/solomonstre/status/715277134978113536 ). In most cases, this feature is requested to implement anti-patterns in the monitoring set-up. There might still be a small number of legitimate use cases, but in view of the huge potential for abusing the feature, and also the semantic intricacies that would be hard to get right when implementing it, we declare it a bad trade-off.
I would like a TTL too; in the end I've created a very bare while loop in bash to accomplish that. Anyway, I don't see why you are so strongly opinionated against a TTL: it's a feature that is not hard to implement, and a lot of people want it. I understand that you can say everyone is using the pushgateway in the wrong manner, but maybe that's not true; a lot of people have different problems to solve. Now, if you have a good alternative for my use case, at the end of which I have a lot of duplicated metrics like this, another example, let me know. P.S.
@fvigotti Since the statefulset itself is fundamentally long-running and discoverable via Kubernetes SD (which gives you all the discovery-metadata benefits), this seems like a similar case to using the Node Exporter's textfile module for metrics tied to a specific host (just that here it's a statefulset's pod and not a host). So I'd expect the recommended approach would be to have a sidecar in each pod that serves metrics (either a specialized exporter or the Node Exporter with only the textfile collector enabled) instead of pushing the metrics to a PGW and then not having pod and PGW lifecycles tied together. This will enable clean cuts of metrics as well, since even with a TTL you will either lose metrics too early or you will have a lot of overlap between dead and alive instances.
It makes more sense to have a discussion like this on the various Prometheus mailing lists rather than in a GitHub issue (in particular a closed one). A straightforward feature request might still be a fit for a (new) GitHub issue, but where it is already apparent that it is more complicated than "Good idea, PRs welcome", I'd strongly recommend starting a thread on the prometheus-developers mailing list. If you are seeking advice on how to implement your use case with the given tooling, the prometheus-users mailing list is a good place. On both mailing lists, more people are available to potentially chime in, and the whole community can benefit.
@juliusv Yes, the statefulsets (i.e. MySQL, Jenkins, etc.) export their metrics using standard patterns, since they are long-running services. But the job in preDestroy (which triggers a snapshot backup plus some checks and then pushes metrics about the backup/destroy process; the statefulset is long-running, but it is about to shut down, so I can't wait for the next Prometheus scrape interval), or a sidecar pod with a bash script that performs backups and different/custom health checks, are better exported via the PGW with a simple curl, without having to integrate node exporter textfiles or web services into every sidecar. I'm saying that not because I'm looking for advice on how I set up metrics (even if advice is always welcome :) ), but to show you how I use the PGW and why I'm also interested in a TTL. It seems to me that the design you have in mind for the PGW covers a very limited use case, and you don't want to extend it so as not to create possible anti-patterns. That's also why I'm telling you my use case: to let you decide whether mine is an anti-pattern or not.
I'll lock this issue now. That's not to stifle the discussion but, on the contrary, to not let it rot in a closed issue in a repo that not every developer is tracking. Whoever is interested in convincing the Prometheus developer community to revert the decision of not implementing a TTL/timeout for pushed metrics, please open a thread on the prometheus-developers mailing list. (TODO for @beorn7: Once such a thread exists, link it here and in the README.md.) @fvigotti I understand that you are not keen on subscribing to a mailing list for every piece of software you use. However, the Prometheus developers are not keen, either, to track all the (open and closed) issues of all repos in the Prometheus org (there are 38 of them!). As the Prometheus developers are doing all the work you are benefiting from, I think it is fair to ask that you play by their rules of how to tell them about your request.
In some scenarios, a client will stop pushing metrics because it has gone away. Currently every node needs to be deleted explicitly or the last value will stick around forever. It would be good to be able to configure a timeout after which a metric is considered stale and removed. I think it would be best if the client could specify this.
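Purely as an illustration of what the requested behavior would amount to (the Pushgateway never implemented this; the type, field names, and sweep mechanism below are hypothetical):

```go
// Hypothetical sketch of a staleness sweep for pushed groups with a
// client-specified timeout. The real Pushgateway keeps pushed groups
// until they are deleted explicitly.
package ttl

import (
	"sync"
	"time"
)

type staleGroups struct {
	mu       sync.Mutex
	lastPush map[string]time.Time     // when each push group was last updated
	ttl      map[string]time.Duration // timeout requested by the client, 0 = never expire
}

// sweep removes every group whose client-specified timeout has elapsed,
// so its series go stale in Prometheus instead of sticking around forever.
func (s *staleGroups) sweep(deleteGroup func(key string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	for key, t := range s.lastPush {
		if ttl := s.ttl[key]; ttl > 0 && now.Sub(t) > ttl {
			deleteGroup(key)
			delete(s.lastPush, key)
			delete(s.ttl, key)
		}
	}
}
```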