
[kube-prometheus-stack] Feat: Improve cAdvisor metrics scrape #5063

Merged: 1 commit into main, Dec 16, 2024

Conversation

@SuperQ (Contributor) commented Dec 15, 2024

What this PR does / why we need it

Improve the collection of cAdvisor metrics by adjusting the scrape interval to match the kubelet hardcoded minimum housekeeping interval.

  • Set the cAdvisor and resource scrape intervals to 10s by default.
  • Use the user-configured kubelet metrics interval if it is not the default.
  • Enforce honorTimestamps for cAdvisor and resource if timestamp staleness tracking is enabled (see the sketch below).
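
A minimal sketch of what these defaults might look like as values.yaml overrides, assuming the chart's existing kubelet.serviceMonitor block; the cAdvisorInterval and resourceInterval key names here are illustrative and may not match the chart exactly:

    kubelet:
      serviceMonitor:
        # Scrape the cAdvisor and resource endpoints every 10s to match the
        # kubelet's hardcoded minimum housekeeping interval.
        cAdvisorInterval: 10s    # illustrative key name
        resourceInterval: 10s    # illustrative key name
        # Keep cAdvisor's own sample timestamps and emit staleness markers.
        honorTimestamps: true
        trackTimestampsStaleness: true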

Which issue this PR fixes

(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged)

  • fixes #

Special notes for your reviewer

Checklist

  • DCO signed
  • Chart Version bumped
  • Title of the PR starts with chart name (e.g. [prometheus-couchdb-exporter])

@SuperQ (Contributor, Author) commented Dec 15, 2024

Follow-up to #5050 (CC @Breee).

For reference, this is the hardcoded housekeeping interval.

Note, this will likely cause PrometheusDuplicateTimestamps to fire for most users. We should probably remove/disable this alert.

We could disable it by default in the values.yaml, or we could remove it from the upstream kube-prometheus alert mixin.
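
A hedged sketch of the values.yaml route, assuming the chart's defaultRules.disabled map can switch off individual rules by name:

    defaultRules:
      disabled:
        # Assumed toggle: disable the duplicate-timestamp alert shipped
        # with the upstream kube-prometheus alert mixin.
        PrometheusDuplicateTimestamps: true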

@jkroepke (Member) commented:

Hey @SuperQ

I feel the information about trackTimestampsStaleness and cAdvisor is interesting for some users (and very important for maintainers).

WDYT about leaving a note or a link about trackTimestampsStaleness that explains this a little bit? The link to the Slack discussion will be unavailable in 90-180 days; one or two sentences are fine.


Note, this will likely cause PrometheusDuplicateTimestamps to fire for most users. We should probably remove/disable this alert.

Does this also produce a lot of "Error on ingesting samples with different value but same timestamp" warnings in the Prometheus logs? If so, I see some incoming GitHub issues where users complain about this. :/

From my point of view, it should be sufficient to exclude the kubelet scrape job from the PrometheusDuplicateTimestamps alert, or am I wrong?

@SuperQ (Contributor, Author) commented Dec 16, 2024

Does this include a lot of warnings Error on ingesting samples with different value but same timestamp

It shouldn't. This is more related to ingesting the same timestamps more than once. Maybe this is a Prometheus metrics "bug", in that we're counting the same condition for multiple reasons. I haven't investigated the code for this.

@SuperQ (Contributor, Author) commented Dec 16, 2024

Sure, so here are the details on trackTimestampsStaleness.

Normal metrics exposed to Prometheus do not include timestamps. Metrics are supposed to be produced "on demand", so the data timestamp is the same as the scrape timestamp.

cAdvisor does not follow this best practice. Instead, cAdvisor has its own internal data-gathering loop and exposes timestamps on container_... metrics for the exact time at which the data was collected.
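
For illustration, a cAdvisor exposition line carries an explicit millisecond timestamp after the sample value, whereas a typical exporter omits it (the labels and values here are made up):

    # cAdvisor: value followed by the collection timestamp (ms since epoch)
    container_memory_rss{namespace="default",pod="example-pod"} 12345678 1734363000123
    # typical exporter: no timestamp, so Prometheus uses the scrape time
    process_resident_memory_bytes 23456789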

Prometheus scrapes have millisecond time accuracy and are intentionally splayed over time to avoid correlated network traffic. But user queries do have a correlated timestamp, and systems like Grafana will snap the query to a round number like 00:30:00. This means that when you query at 00:30:00 but the metric you want was scraped at 00:29:55.231, Prometheus has to "look back" some amount of time to find that sample.

This lookback defaults to 5 minutes in order to handle a variety of scrape interval use cases. The downside to lookback is that if a metric is removed from an endpoint, or a target is down, Prometheus would continue to show data for up to 5 minutes after it really existed. This was confusing for users.
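
This window corresponds to Prometheus's query lookback delta, which defaults to 5 minutes and can be tuned with a server flag:

    # Prometheus server flag controlling how far a query looks back for the
    # most recent sample (default 5m).
    --query.lookback-delta=5m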

In Prometheus 2.0 we improved on this situation by introducing a staleness handler. When a metric/label combination is removed from a scrape, it will be marked as "stale" in the TSDB.

For historical reasons, Prometheus handles metrics that include the optional timestamp field without this staleness marker.

Because cAdvisor has its own internal data-gathering loop, the timestamps it exposes on container_... metrics mark the exact time at which the data was collected. This makes the data more accurate, since many seconds can pass between when the data is collected and when Prometheus scrapes it.

But now, without the staleness markers, Prometheus will continue to apply the lookback of up to 5 minutes to find metrics. This can be very confusing for users, as they will "see" cAdvisor metrics that don't exist anymore. A common example is container_memory_rss: for 5 minutes after a Pod is deleted, its memory RSS appears to still exist.

The new track-timestamp-staleness feature was introduced to fix this behavior. With trackTimestampsStaleness enabled, metrics like container_memory_rss are correctly marked stale when the Pod is removed.

The previous workaround was to disable "honorTimestamps" for cAdvisor. This fixed the staleness handling issue, but it makes the data slightly inaccurate because cAdvisor's internal loop is not in sync with Prometheus. So in order to fully enable trackTimestampsStaleness, we also need to force-enable the honorTimestamps feature.
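
Roughly, the cAdvisor endpoint in the generated ServiceMonitor would then carry both settings. The sketch below uses the Prometheus Operator Endpoint fields, with the port name illustrative and surrounding fields abbreviated:

    - port: https-metrics        # illustrative port name
      path: /metrics/cadvisor
      scheme: https
      interval: 10s
      # Keep cAdvisor's own timestamps ...
      honorTimestamps: true
      # ... and let Prometheus write staleness markers for timestamped series.
      trackTimestampsStaleness: true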

The second half of this change has to do with the fact that cAdvisor does not have a strict timer for internal data collection; it uses variable timing of 10 to 15 seconds in Kubernetes. So in order to capture every sample produced by cAdvisor, we need to scrape slightly faster than the minimum housekeeping interval.

@SuperQ force-pushed the superq/cadvisor_improvement branch from 2adb9e3 to 23c1934 on December 16, 2024 at 14:27
@jkroepke (Member) left a comment

I have some questions here, and I guess one Go template condition is set to the wrong endpoint.

Thanks for the detailed comment. The link should be added to kubelet.serviceMonitor.trackTimestampsStaleness in the values.yaml!

@SuperQ force-pushed the superq/cadvisor_improvement branch from 23c1934 to 5727923 on December 16, 2024 at 20:03
Improve the collection of cAdvisor metrics by adjusting the scrape
interval to match the kubelet hardcoded minimum housekeeping interval.
* Set the cAdvisor and resource intervals to 10s by default.
* Use the user-configured kubelet metrics interval if not default.
* Enforce honorTimestamps for cAdvisor and resource if timestamp
  staleness is enabled.

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ force-pushed the superq/cadvisor_improvement branch from 5727923 to eece396 on December 16, 2024 at 20:07
@SuperQ marked this pull request as ready for review December 16, 2024 20:07
@SuperQ requested a review from jkroepke December 16, 2024 20:07
@jkroepke (Member) commented:

@SuperQ what about the PrometheusDuplicateTimestamps alert that you mentioned earlier? Should it be covered in this PR or in a follow-up, if users complain about it?

@SuperQ (Contributor, Author) commented Dec 16, 2024

I've been testing the 10s interval on one of my k3s test clusters all day and haven't been able to reproduce the duplicate timestamp issue there. Perhaps there is something abnormal about my production setup. Maybe it's a cAdvisor-under-heavy-load problem? I'm not sure at this point.

@jkroepke (Member) left a comment

Thanks a lot!

@jkroepke merged commit 22d1213 into main on Dec 16, 2024
5 checks passed
@jkroepke deleted the superq/cadvisor_improvement branch on December 16, 2024 at 21:26
@festeveira (Contributor) commented Dec 18, 2024

Hello, after these changes I started repeatedly getting log messages like the following one, for every kubelet target, in the Prometheus container:

time=2024-12-18T17:08:26.335Z level=WARN source=scrape.go:1877 msg="Error on ingesting out-of-order samples" component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-agent-kube-prom-kubelet/1 target=https://10.134.60.34:10250/metrics/cadvisor num_dropped=2646

My Prometheus setup (multiple agents and one receiver) is configured with a scrapeInterval of 1m. If I set the kubelet scrape interval to 1m as well, the messages disappear. Furthermore, when scraping at 10s, data points in Grafana stay 1m apart (querying the remote receiver Prometheus, not the Prometheus scraping these metrics). However, the graphs produced by Prometheus have gaps in them for cAdvisor metrics.
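
For reference, aligning the kubelet scrape with the global interval looks roughly like this in values.yaml (the interval key name under kubelet.serviceMonitor is an assumption here):

    kubelet:
      serviceMonitor:
        # Assumed key: match the 1m global scrapeInterval used by the agents.
        interval: 1m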

Are these log messages expected? Thank you in advance.
