Description
Following up on #425, user problems in the metric agent are currently neglected. The typical user problems occurring in the agent are:
A scrape target for the prometheus or istio input is down (not reachable, for example because of a NetworkPolicy)
A scrape target for the prometheus or istio input returns too much data, exceeding the configured sample limit
Goals:
Reflect these problems as warnings in the pipeline status
Make the diagnostic metrics indicating scrape problems accessible for operations
Criteria
The MetricPipeline status reflects the two problems mentioned above, either in the existing dataFlow condition or in a new agent-specific ScrapeHealthiness condition
There is a troubleshooting section describing the typical reasons for the problems
The condition reason lists the top 5 scrape problems, not only a single problem (see the query sketch after this list)
The status reflects the situation dynamically, possibly with a delay, but without requiring any restart
All diagnostic metrics of a scrape loop are accessible for troubleshooting via a port of the agent, but are not exposed to the user
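As a rough illustration of the "top 5" criterion, a recording rule could pre-compute the heaviest scrape targets as input for the condition reason. This is a hypothetical sketch; the group name, rule name, and expression are illustrative and not part of any existing configuration:

```yaml
# Hypothetical recording rule; all names are illustrative only.
groups:
  - name: telemetry-metric-agent-scrape-problems
    rules:
      # Rank scrape targets by sample volume; the five largest targets are
      # the most likely candidates for exceeding the sample limit.
      - record: telemetry:scrape_samples_scraped:top5
        expr: topk(5, scrape_samples_scraped)
```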
Implementation Ideas
The used prometheusreceiver provides diagnostic metrics which the user can already enable. However, they are neither available for operations nor accessible by the self-monitor. We could therefore introduce a new otel-collector pipeline in the metric agent (enabled only if there is a prometheusreceiver) which takes all prometheusreceivers as input, filters for the relevant metrics only (maybe even for unhealthy ones only, to save time series), and exports them under a new dedicated port using the prometheusexporter. The self-monitor would then be configured to scrape the new endpoint. For troubleshooting, the self-monitor dashboard can be used to inspect the selected metrics, or the new port can be accessed directly to inspect all scrape-related metrics.
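A minimal sketch of such a diagnostic pipeline, assuming a single prometheusreceiver named prometheus/app and a freely chosen port 9091; the receiver name, the port, and the pipeline names are assumptions, not existing configuration:

```yaml
receivers:
  prometheus/app:          # assumed existing receiver; real scrape_configs omitted
    config:
      scrape_configs: []

processors:
  # Keep only the per-scrape diagnostic metrics relevant for scrape health.
  filter/scrape-diagnostics:
    metrics:
      include:
        match_type: strict
        metric_names:
          - up
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added

exporters:
  # Expose the filtered metrics on a dedicated port for the self-monitor.
  prometheus/scrape-diagnostics:
    endpoint: 0.0.0.0:9091

service:
  pipelines:
    metrics/scrape-diagnostics:
      receivers: [prometheus/app]
      processors: [filter/scrape-diagnostics]
      exporters: [prometheus/scrape-diagnostics]
```

The self-monitor would then get an additional scrape job pointing at the metric agent pods on port 9091, so the diagnostic series stay internal and never enter the user-facing pipeline.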
Potentially interesting metrics for realizing the goal are:
scrape_samples_scraped: The number of samples the target exposed
scrape_samples_post_metric_relabeling: The number of samples remaining after metric relabeling was applied
scrape_series_added: The approximate number of new series in this scrape
up: Whether the scrape was successful (1) or failed (0)
Items
Preparation
Understand which metrics need to be collected and how the alert rules must look to have a status available for the described situations (see the rule sketch below)
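A hedged sketch of what such alert rules could look like, assuming the self-monitor is a Prometheus instance scraping the diagnostic endpoint; the alert names, durations, and the sample-limit value are placeholders:

```yaml
groups:
  - name: metric-agent-scrape-health
    rules:
      # Problem 1: the scrape target is down (e.g. blocked by a NetworkPolicy).
      - alert: MetricAgentScrapeTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning

      # Problem 2: the target exposes more samples than the configured limit.
      # 50000 is a placeholder; substitute the sample limit configured on the
      # scrape job. This assumes scrape_samples_scraped still reports the
      # scraped count when the limit aborts ingestion.
      - alert: MetricAgentSampleLimitExceeded
        expr: scrape_samples_scraped >= 50000
        for: 5m
        labels:
          severity: warning
```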
This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs.
Thank you for your contributions.