Description
Following up on #425, user problems in the metric agent are currently neglected. The typical user problems occurring in the agent are:
A scrape target for the prometheus or istio input is down (not reachable, for example because of a NetworkPolicy)
A scrape target for the prometheus or istio input returns too much data, exceeding the configured sample limit
Goals:
Reflect these problems as warnings in the pipeline status
Make the diagnostic metrics indicating scrape problems accessible for operations
Criteria
The MetricPipeline status reflects the two problems mentioned above, either in the existing dataFlow condition or in a new agent-specific ScrapeHealthiness condition
There is a troubleshooting section describing the typical reasons for the problems
The condition reason lists the top 5 scrape problems, not only a single problem (see the query sketch after this list)
The status reflects the situation dynamically, possibly with a delay, but without requiring any restart
All diagnostic metrics of a scrape loop are accessible for troubleshooting via a port of the agent, but are not exposed to the user
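As a rough illustration of the "top 5" criterion, a recording rule could pre-compute the heaviest scrape targets as input for the condition reason. This is a hypothetical sketch; the group name, rule name, and expression are illustrative and not part of any existing configuration:

```yaml
# Hypothetical recording rule; all names are illustrative only.
groups:
  - name: telemetry-metric-agent-scrape-problems
    rules:
      # Rank scrape targets by sample volume; the five largest targets are
      # the most likely candidates for exceeding the sample limit.
      - record: telemetry:scrape_samples_scraped:top5
        expr: topk(5, scrape_samples_scraped)
```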
Implementation Ideas
The used prometheusreceiver provides diagnostic metrics which the user can already enable. However, they are neither available for operations nor accessible by the self-monitor. We could therefore introduce a new otel-collector pipeline in the metric agent (enabled only if there is a prometheusreceiver) which takes all prometheusreceivers as input, filters for the relevant metrics only (maybe even for unhealthy ones only, to save time series), and exports them under a new dedicated port using the prometheusexporter. The self-monitor would then be configured to scrape the new endpoint. For troubleshooting, the self-monitor dashboard can be used to inspect the selected metrics, or the new port can be accessed directly to inspect all scrape-related metrics.
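A minimal sketch of such a diagnostic pipeline, assuming a single prometheusreceiver named prometheus/app and a freely chosen port 9091; the receiver name, the port, and the pipeline names are assumptions, not existing configuration:

```yaml
receivers:
  prometheus/app:          # assumed existing receiver; real scrape_configs omitted
    config:
      scrape_configs: []

processors:
  # Keep only the per-scrape diagnostic metrics relevant for scrape health.
  filter/scrape-diagnostics:
    metrics:
      include:
        match_type: strict
        metric_names:
          - up
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added

exporters:
  # Expose the filtered metrics on a dedicated port for the self-monitor.
  prometheus/scrape-diagnostics:
    endpoint: 0.0.0.0:9091

service:
  pipelines:
    metrics/scrape-diagnostics:
      receivers: [prometheus/app]
      processors: [filter/scrape-diagnostics]
      exporters: [prometheus/scrape-diagnostics]
```

The self-monitor would then get an additional scrape job pointing at the metric agent pods on port 9091, so the diagnostic series stay internal and never enter the user-facing pipeline.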
Potentially interesting metrics for realizing the goal are:
scrape_samples_scraped: The number of samples the target exposed
scrape_samples_post_metric_relabeling: The number of samples remaining after metric relabeling was applied
scrape_series_added: The approximate number of new series in this scrape
up: Whether the scrape was successful (1) or failed (0)
Items
Preparation
Understand which metrics need to be collected and how the alert rules must look to have a status available for the described situations (see the rule sketch below)
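A hedged sketch of what such alert rules could look like, assuming the self-monitor is a Prometheus instance scraping the diagnostic endpoint; the alert names, durations, and the sample-limit value are placeholders:

```yaml
groups:
  - name: metric-agent-scrape-health
    rules:
      # Problem 1: the scrape target is down (e.g. blocked by a NetworkPolicy).
      - alert: MetricAgentScrapeTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning

      # Problem 2: the target exposes more samples than the configured limit.
      # 50000 is a placeholder; substitute the sample limit configured on the
      # scrape job. This assumes scrape_samples_scraped still reports the
      # scraped count when the limit aborts ingestion.
      - alert: MetricAgentSampleLimitExceeded
        expr: scrape_samples_scraped >= 50000
        for: 5m
        labels:
          severity: warning
```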
This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.