Recorded metrics are not sent to Stackdriver #104
Thanks for the report. I filed an internal tracker issue and someone will try to help you soon. If you are willing to try something, could you see if using a metric name without any colons (:) works? It's just a hunch, but worth trying if you have the time. |
Nope, replacing the colons didn't help. |
I have a similar issue. I'm exporting a gauge metric and the metric itself is autocreated in Stackdriver with no problems; labels are correct, etc. But all the values are 0 in Stackdriver, even though Prometheus clearly shows them varying as containers change status. I've even verified this via the API by fetching the raw time series from Stackdriver.
Other metrics that I've checked seem to work. Let me know if there's anything I can do to assist. |
^ worth noting: this is not a recorded rule, just a collected metric from kube_status. I merely did the explicit static_metadata entry to see if it changed anything. |
@oliver-neubauer-ecobee please file a new issue if you're not using recorded metrics. Also the original issue is about metrics not being sent. Metrics sent as 0 values sounds different. |
The same issue is present in v0.4.1. |
Thanks for reporting, @soymo. Just want to ask a clarifying question: do you see the same error? Also, if possible, can you create a new issue describing your setup.
Thanks! |
I confirmed it's a bug still happening in version 0.4.3, and I see the same error. I reproduced it by setting recording rules as written in the first post.
I observed that I can see the recorded metric in Prometheus. I also added a static_metadata section to stackdriver-prometheus-sidecar's configuration, along the lines of the sketch below:
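(A static_metadata stanza of this general shape; the metric name is a placeholder matching the illustrative rule from the first post, not the real one.)

```yaml
static_metadata:
  - metric: job:request_errors:rate5m   # placeholder recorded-metric name
    type: gauge
    help: recorded 5m error rate, aggregated per job and instance
```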
I also turned on the debug log and observed that series_cache couldn't find a target matching the recorded series. Tomorrow I will spend some time troubleshooting the issue. |
I can confirm that this is still happening with version 0.5.0 of the sidecar and Prometheus v2.10.0. Better yet, I believe I can tell you why: the targets cache never finds a matching target for series produced by recording rules, because the label comparison it performs (via the Prometheus Labels struct) is stricter than it needs to be. Loosening that comparison, however, alters enough of the caching behavior that targets/cache_test expects that I'm not sure how to validate that the change is, or could be, safe. |
@StevenYCChou I've spent a bit more time trying to debug this and have not gotten very far: if you've got any suggestions I'm all ears. |
Hi @memory - sorry I haven't provided an update since my last comment. Last time we suspected that the logic in targetMatch may not be correct: based on the comment on that function, a series should match a target if every label the target defines is present on the series with the same value. I didn't commit the change I was debugging, so I just created a branch in my personal repository to track the fix. I don't think it's ready to send as a PR because I haven't fully tested it yet. Can you try it out? If you want, you can build on top of it and send a PR for us to fix the bug you are facing. Last thing - a metric name with ":" in it is not a valid metric type (see the naming rules), so you may still not see metrics if the recorded name contains colons. Let us know what you think, @memory. |
@StevenYCChou so, some progress:
The API documentation for metricDescriptors only hints at these naming restrictions. |
Further testing: if I drastically shorten the name of the recorded metric and remove the colons:
...the errors go away, which is good, but now we're back to literally nothing happening on the stackdriver side: no debug output at all from the sidecar, and nothing new showing up in Stackdriver. |
Okay, realized that the shortened names weren't being caught by my filters. With that fixed, the debug output shows the series being dropped, I believe per https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/retrieval/series_cache.go#L403-L407, which leads me to the smoking gun in https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/master/metadata/cache.go#L124-L133.
So the metadata cache will only detect a recorded rule if there is a colon in the metric name, but colons aren't allowed in stackdriver custom metric names, so we're stuck: if we don't put colons in the name, the metadata cache returns nil, but if we do put colons in the name, stackdriver will never accept them. 🤦♂ My best suggestion here is to check for metric names that begin with a configurable prefix (something like recorded_) and treat those as recorded rules as well. |
A fix for most of the issues noted in Stackdriver#104:
- update targets.targetMatch() to match labels correctly in the case of a recorded metric (thank you @StevenYCChou)
- update metadata.Cache.Get() to treat metrics with names prefixed with the value of the `--stackdriver.recorded-metric-prefix` flag (default: `recorded_`) as recorded metrics and log an error if a colon is found in the metric name (stackdriver metric names cannot have those)
- update README with specific notes for forwarding recorded metrics
Hi @memory, thanks for sharing all the troubleshooting information with us; it is really valuable, and I appreciate your efforts! First, some comments on the job/instance labels:
The labels job and instance are required for integration with recording rules, as described under Integration Issues [1].
The reason is that stackdriver-prometheus-sidecar uses these two well-known labels to construct a Stackdriver MonitoredResource for your Kubernetes object (this is also answered under Integration Issues [1]). As for the PR you created, I have briefly reviewed it, and I have been thinking about two higher-level questions for any recording rules with colons.
Unfortunately, I don’t have the bandwidth before the end of September to answer the two questions above; it requires coming up with reasonable options and analyzing their pros and cons. For the issue you have here, I’d like to suggest a direction for fixing the code and a potential workaround for recording rules, if you don’t mind trying it out, though I won’t be able to verify it myself before the end of September.
Let me know what you think, @memory. [1] https://cloud.google.com/monitoring/kubernetes-engine/prometheus#prometheus_integration_issues |
Okay, we're in agreement that forwarding metrics that lack instance/job labels isn't currently supported and would require a substantial rewrite of the sidecar to support. I'd gently suggest that, given stackdriver's stratospheric per-metric cost, this feels like a limitation that should be addressed at some point: "just forward the raw metrics and do the aggregation on the stackdriver side" is a strategy that in many cases has a large number of dollar signs attached to it. But this is obviously a larger issue than this one particular bug and isn't going to get solved in the short term. For the nearer-term issues, I'll try to update my PR with the test cases you suggested and update the documentation with a suggested configuration for renaming and properly labelling recorded metrics. |
@memory have you checked out the Counter Aggregator feature? It was added for your use case. |
@jkohen I did, but unfortunately (a) the raw metrics in question here were histograms, not gauges, and (b) I needed to group them by the labels that I intended to preserve (ie: for every deployment, aggregate the 50/95/99th histogram_quantile for request latency across all of the deployment's pods without having to manually specify each deployment name in the sidecar config); there did not seem to be a way to do either of those things with the counter aggregator although if I missed it I'd be happy to hear it. |
@memory it works for counters; I don't think it'll give you the right result for gauges and histograms. It should in principle work for histograms, with some additional work. We need to prioritize that with other work we have on our plates. Once we're ready, we'll make sure it covers your use case. Thanks for the feedback and sorry we don't have a better answer right now. |
@memory Just curious: based on the recording rules you provided, are you trying to get a quantile from a Prometheus histogram? The Prometheus histogram type is transformed into a Stackdriver Distribution time series, and you can aggregate by the Xth percentile and group by your labels as well. If you have any use case beyond what I described above, please let me know too. |
@StevenYCChou yeah, I'm using recording rules to get a quantile from the prometheus histogram and forwarding it to stackdriver as a simple metric, e.g.:
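(Something of roughly this shape, with placeholder metric and label names:)

```yaml
groups:
  - name: latency
    rules:
      # Placeholder rule: compute the 99th percentile latency per deployment
      # by aggregating the histogram buckets across that deployment's pods.
      - record: deployment:request_latency_seconds:p99
        expr: histogram_quantile(0.99, sum by (deployment, le) (rate(request_latency_seconds_bucket[5m])))
```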
Unfortunately, simply forwarding the histogram directly isn't possible for us and is how we got into this situation in the first place. The source histogram has more labels than the current stackdriver limit, so we were attempting to use recording rules to drop/aggregate across labels that we didn't need in stackdriver. |
Not sure if this is related, but we also have some metrics that are not properly forwarded to StackDriver. While metrics collected from the services directly are properly propagated, metrics collected from the Pushgateway do not seem to work. When setting the log level to debug, we also got the same debug messages about the target not being found. |
Versions in use are:
I am not sure if this fits into the same issue discussed here; if not, let me know and I will create a separate issue. Edit: I just converted the service to be scraped by Prometheus directly instead of going through the Pushgateway, and the metrics were properly forwarded, so there is some issue with the Pushgateway path. |
I am also getting the same issue.
I can see the metric in Prometheus, but it is not visible in Stackdriver; somehow the Stackdriver sidecar is not sending this metric. Please help me here. |
It's been about 4 months since the last update to this issue, and nearly a year since it was first opened. I'm currently tottering along with the workaround from #159 but it would be nice to have some indication of whether and when this is likely to be fixed upstream. |
Gentle ping, one quarter later. Obviously the pandemic has derailed many a timetable; I hope everyone involved here is healthy and safe and keeping it that way should be everyone's #1 priority. Somewhere deep below that on the list of priorities: is this on anyone's radar? :) |
I have the same problem when trying to send data from a local Kubernetes setup. Everything is complicated by the fact that I cannot find any similar complete examples in the documentation. Please help if someone has something similar. I am not talking about custom metrics, but about the Kubernetes metrics themselves.
My docker-compose is set up along these lines:
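(Illustrative sketch only: the image tags, flag names, and file paths below are assumptions based on the sidecar's README, not a verified working configuration.)

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus:v2.19.0           # placeholder version
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus      # WAL ends up in /prometheus/wal
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom-data:/prometheus

  sidecar:
    image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.8.1  # placeholder tag
    command:
      - --stackdriver.project-id=my-gcp-project        # placeholder project
      - --prometheus.wal-directory=/prometheus/wal     # shared with prometheus
      - --prometheus.api-address=http://prometheus:9090/
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/etc/creds/key.json
    volumes:
      - prom-data:/prometheus
      - ./key.json:/etc/creds/key.json:ro    # service account key (placeholder path)
    depends_on:
      - prometheus

volumes:
  prom-data: {}
```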
|
@StevenYCChou it's approaching a year since the last substantial update on this issue. Is this expected to be addressed any time soon? |
This has been a bit of a thorn in our side too; we would love some idea of whether something like @memory's change could get into upstream. :) |
Hi @qingling128 and welcome to this extremely fun ticket! :) Let me know if I can offer any other details here. |
Is there any news on this? |
This should address most of Stackdriver#104:
- fix target cache lookup for recorded metrics
- update documentation with examples
Also: allow overriding the docker FROM target with a DOCKER_IMAGE_BASE env var (helpful for testing and runtime scenarios where /bin/sh is needed)
Coming back to this after a long absence, I think several of us (myself very much included) got stuck in a rabbit hole when there is a much simpler solution available for most cases. With only the code change to the targets cache in place, the rest can be handled in the sidecar configuration. For example, a recording rule like the sketch below:
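(Placeholder names; the point is a colon-separated recorded name that preserves the job and instance labels.)

```yaml
groups:
  - name: recorded
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job, instance) (rate(http_requests_total[5m]))
```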
Forwarding it to Stackdriver is then just a matter of mapping the colon-separated recorded name to a name Stackdriver will accept in the sidecar configuration, along the lines of the sketch below:
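(Again a sketch with placeholder names; depending on the sidecar version you may also need a static_metadata entry for the recorded metric, as sketched earlier in this thread.)

```yaml
metric_renames:
  - from: job:http_requests:rate5m       # recorded name with colons
    to: recorded_http_requests_rate5m    # colon-free name acceptable to Stackdriver
```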
I have not yet tested to see if this would work for prometheus histograms / stackdriver distributions. It's a little unclear to me if this was something that would have worked all along and we were all too deep in the weeds to see it, or if something in between the 0.5.0 and 0.8.1 releases made this possible. In any case I think the main need at this point is the fix to targets/cache and some explicit documentation around the case of recorded metrics: #266 |
(As a small followup: exporting recorded histograms does not seem to currently be possible. The current code assumes that all recorded metrics are gauges, and attempting to pin the value/value_type as counter/distribution produces an error that distribution is not a supported value type.) |
With the merge of #266 I believe this bug can, at long last, be closed. |
#266 has been merged. |
I have the following recording rule:
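(Not the original rule; an illustrative placeholder of the same general shape, with a colon-separated recorded name:)

```yaml
groups:
  - name: example
    rules:
      - record: job:request_errors:rate5m
        expr: sum by (job, instance) (rate(request_errors_total[5m]))
```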
The rule is evaluated and recorded correctly in Prometheus.
I also have the following config file:
No metric appears in Stackdriver, and there are no errors in the stackdriver-prometheus-sidecar logs.
This is just one example of a recorded metric; all other recorded metrics are also not sent to Stackdriver. Raw metrics are exported to Stackdriver as they should be.