
[exporter/prometheus] Expired metrics were not deleted #17306

Closed · ralphgj opened this issue Dec 30, 2022 · 17 comments

@ralphgj (Contributor) commented Dec 30, 2022

Component(s)

exporter/prometheus

What happened?

Description

Hi, I am trying to use the spanmetrics processor and the prometheus exporter to transform spans into metrics. However, I found that some expired metrics seem to reappear whenever new metrics are received, and memory usage keeps rising. Is this a bug in the prometheus exporter?

Steps to Reproduce

Example:
When I posted a span to the collector, the prometheus exporter exposed this metric:

calls_total{db_instance="N/A",db_name="name-KSDORKdOKV",db_sql_table="table-IWstkE",db_system="redis",operation="get",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1

After 5 seconds, the metric disappears. Then I posted another span to the collector, and the prometheus exporter exported two metrics, including the expired one:

calls_total{db_instance="N/A",db_name="name-KSDORKdOKV",db_sql_table="table-IWstkE",db_system="redis",operation="get",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1
calls_total{db_instance="N/A",db_name="name-QSvMJKDYso",db_sql_table="table-ZHdGvF",db_system="redis",operation="set",service_name="go-project.examples",span_kind="SPAN_KIND_CLIENT",status_code="STATUS_CODE_UNSET"} 1
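For reference, a minimal client that posts one such span over OTLP gRPC could look like the sketch below (OpenTelemetry Go SDK; the endpoint and attribute values mirror the collector configuration further down, and this is not necessarily the exact client used here):

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	ctx := context.Background()

	// Export to the collector's OTLP gRPC receiver from the configuration below.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:55677"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("go-project.examples"),
		)),
	)
	// Shutdown flushes the batch span processor before exit.
	defer func() { _ = tp.Shutdown(ctx) }()

	// One client span carrying the db.* attributes used as spanmetrics dimensions.
	_, span := tp.Tracer("repro").Start(ctx, "get", trace.WithSpanKind(trace.SpanKindClient))
	span.SetAttributes(
		attribute.String("db.system", "redis"),
		attribute.String("db.name", "name-KSDORKdOKV"),
		attribute.String("db.sql.table", "table-IWstkE"),
	)
	span.End()
}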

Expected Result

Expired metrics will be deleted.

Actual Result

Expired metrics seem to still be stored in the in-memory cache.

Collector version

2ada50f

Environment information

Environment

OS: macOS 13.0.1
Compiler (if manually compiled): go 1.19.3

OpenTelemetry Collector configuration

receivers:
  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"

  otlp:
    protocols:
      grpc:
        endpoint: "localhost:55677"

processors:
  batch:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    latency_histogram_buckets: [10ms, 100ms]
    dimensions:
      - name: db.system
        default: N/A
      - name: db.name
        default: N/A
      - name: db.sql.table
        default: N/A
      - name: db.instance
        default: N/A
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

exporters:
  logging:
    verbosity: basic

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging]

    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [otlp/spanmetrics]

    metrics:
      receivers: [otlp]
      exporters: [prometheus]

Log output

No response

Additional context

No response

ralphgj added the "bug (Something isn't working)" and "needs triage (New item requiring triage)" labels on Dec 30, 2022
ralphgj changed the title from "Expired metrics were not be deleted" to "[exporter/prometheus] Expired metrics were not be deleted" on Dec 30, 2022
@Aneurysm9 (Member)

Can you put a logging exporter on both of the metrics pipelines? I'd like to localize the issue to either the spanmetrics processor or the prometheus exporter. My first impression is that this is likely a result of the spanmetrics processor keeping every metric it has generated in memory and emitting their last-known state.

@albertteoh (Contributor)

I think @Aneurysm9 is right. Should the spanmetrics processor introduce some sort of expiry on its in-memory collection of metrics as well, to stay consistent with the prometheus exporter?
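Something along these lines, as a purely hypothetical sketch (none of these names exist in the processor today):

package spanmetricssketch

import "time"

// histogramEntry stands in for the processor's per-series accumulation state.
type histogramEntry struct {
	lastSeen time.Time // updated whenever a span contributes to this series
	// bucket counts, sum, and count elided
}

// pruneStale would run before each export, dropping series that have not seen
// a span within staleTimeout, so they stop being re-emitted downstream and the
// prometheus exporter's metric_expiration can finally take effect.
func pruneStale(histograms map[string]*histogramEntry, staleTimeout time.Duration, now time.Time) {
	for key, entry := range histograms {
		if now.Sub(entry.lastSeen) > staleTimeout {
			delete(histograms, key)
		}
	}
}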

@ralphgj (Contributor, issue author) commented Jan 4, 2023

@Aneurysm9 Sorry for the late reply. I updated the config with logging exporters and the debug log level:

receivers:
  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"

  otlp:
    protocols:
      grpc:
        endpoint: "localhost:55677"

processors:
  batch:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    latency_histogram_buckets: [10ms, 100ms]
    dimensions:
      - name: db.system
        default: N/A
      - name: db.name
        default: N/A
      - name: db.sql.table
        default: N/A
      - name: db.instance
        default: N/A
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

exporters:
  logging:
    verbosity: basic
  logging/1:
    verbosity: normal
  logging/2:
    verbosity: normal

  otlp/spanmetrics:
    endpoint: "localhost:55677"
    tls:
      insecure: true

  prometheus:
    endpoint: "0.0.0.0:8889"
    metric_expiration: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging]

    # The exporter name must match the metrics_exporter name.
    # The receiver is just a dummy and never used; added to pass validation requiring at least one receiver in a pipeline.
    metrics/spanmetrics:
      receivers: [otlp/spanmetrics]
      exporters: [otlp/spanmetrics, logging/1]

    metrics:
      receivers: [otlp]
      exporters: [prometheus, logging/2]
  telemetry:
    logs:
      level: debug

When I repeated the earlier steps, I got logs like this:

2023-01-04T17:15:03.247+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.247+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: calls_total        {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: latency    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:03.747+0800    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging/2", "#metrics": 2}
2023-01-04T17:15:03.755+0800    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 1}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:05.784+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}



2023-01-04T17:15:11.680+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:288   metric expired: latency {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:11.680+0800    debug   prometheusexporter/accumulator.go:288   metric expired: calls_total     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}



2023-01-04T17:15:18.675+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.675+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: calls_total        {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    debug   prometheusexporter/accumulator.go:90    accumulating metric: latency    {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:18.746+0800    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "logging/2", "#metrics": 2}
2023-01-04T17:15:18.804+0800    info    TracesExporter  {"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 1}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:343     collect called  {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/accumulator.go:279   Accumulator collect called      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "latency", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]}     {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
2023-01-04T17:15:23.098+0800    debug   prometheusexporter/collector.go:367     metric served: Desc{fqName: "calls_total", help: "", constLabels: {}, variableLabels: [service_name operation span_kind status_code db_system db_name db_sql_table db_instance]} {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}

@github-actions (bot) commented Mar 6, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions bot added the Stale label on Mar 6, 2023
@danielgblanco commented Mar 8, 2023

We're also seeing this in some collectors where we run spanmetrics. We had to set the prometheus exporter's expiration to a large value, otherwise any cumulativetodelta calculations are affected, but we limit the cardinality of these metrics very drastically (around 100k time series per day). It does sound like it's a spanmetrics issue, though: p.histograms is only reset for delta metrics here, and it seems that's what's being used to produce metrics on every export interval.
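For illustration, that reset behaves roughly like the following sketch (names are illustrative only; the real reset lives in the processor's export path):

package spanmetricssketch

// histogram stands in for the processor's per-series accumulation state.
type histogram struct {
	count uint64
	sum   float64
	// bucket counts elided
}

type processorSketch struct {
	deltaTemporality bool                  // true when aggregation_temporality is AGGREGATION_TEMPORALITY_DELTA
	histograms       map[string]*histogram // keyed by the serialized metric dimensions
}

// resetAfterExport mirrors what happens at the end of each export interval:
// only delta temporality clears the accumulated series, so with cumulative
// temporality every series ever seen keeps being re-emitted on every interval.
func (p *processorSketch) resetAfterExport() {
	if p.deltaTemporality {
		p.histograms = make(map[string]*histogram)
	}
}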

@github-actions (bot) commented Mar 9, 2023

Pinging code owners for processor/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions (bot) commented Mar 9, 2023

Pinging code owners for connector/spanmetrics: @albertteoh. See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme removed the "needs triage (New item requiring triage)" label on Mar 10, 2023
@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on May 10, 2023
@github-actions (bot) commented Jul 9, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 9, 2023
@mfilipe commented Sep 18, 2023

@albertteoh @kovrus We are seeing this problem with the spanmetrics connector and the prometheus exporter. The collectors use spanmetrics.aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE" and prometheus.metric_expiration: 60s, but the metrics remain available in /metrics for days.

The panel below shows how /metrics grows over time and stays flat when nothing is in use:

rate(otelcol_exporter_sent_metric_points[1m])

(Screenshot of the panel omitted.)

@albertteoh (Contributor)

Hi @mfilipe, the spanmetrics connector will continue to emit metrics at a configurable time interval (default every 15s) even when no new spans are sent to it, so I believe this is what's preventing the prometheus.metric_expiration from kicking in.

There are some knobs available to influence the number of metrics the spanmetrics connector emits: the dimensions_cache_size and buckets config parameters.

The former ensures no more than this many metrics will be stored in memory and emitted downstream.

The latter speaks to the resolution of histograms, so a smaller number should reduce the number of metrics exported.

@nijave (Contributor) commented Sep 22, 2023

> Hi @mfilipe, the spanmetrics connector will continue to emit metrics at a configurable time interval (default every 15s) even when no new spans are sent to it, so I believe this is what's preventing the prometheus.metric_expiration from kicking in.
>
> There are some knobs available to influence the number of metrics the spanmetrics connector emits: the dimensions_cache_size and buckets config parameters.
>
> The former ensures no more than this many metrics will be stored in memory and emitted downstream.
>
> The latter speaks to the resolution of histograms, so a smaller number should reduce the number of metrics exported.

dimensions_cache_size doesn't actually control how many metric series are kept. See grafana/agent#5271

tl;dr: processorImp.histograms isn't pruned in sync with metricKeyToDimensions.
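Roughly, the mismatch looks like this simplified sketch (types and names are stand-ins; the real metricKeyToDimensions is an LRU cache bounded by dimensions_cache_size, not a plain map):

package spanmetricssketch

// processorStateSketch stands in for the spanmetrics processor's state.
type processorStateSketch struct {
	metricKeyToDimensions map[string][]string // stand-in for the LRU cache bounded by dimensions_cache_size
	histograms            map[string]struct{} // per-series accumulation state, which only grows for cumulative temporality
}

// evictFromCache mimics an LRU eviction of a metric key: the dimensions cache
// forgets the key, but nothing removes the corresponding entry from histograms,
// which is the gap described in grafana/agent#5271.
func (p *processorStateSketch) evictFromCache(key string) {
	delete(p.metricKeyToDimensions, key)
	// no matching delete(p.histograms, key)
}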

@nijave (Contributor) commented Sep 22, 2023

It appears the updated timestamp on the metric is set when it's "accumulated" (called by ConsumeMetrics here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L266C39-L266C39), so metric_expiration only applies to series that the spanmetricsprocessor stops sending to the metricsConsumer.

Put another way, the spanmetricsprocessor needs to expire metrics internally; otherwise the downstream prometheusexporter thinks the metrics keep getting updated regardless of whether their values changed (see the sketch after the call chain below).

spanmetricsprocessor.exportMetrics -> ConsumeMetrics
prometheusexporter.ConsumeMetrics -> processMetrics (for each metric)
prometheusexporter.processMetrics -> Accumulate
prometheusexporter.Accumulate (gets now as timestamp) -> addMetric
prometheusexporter.addMetric -> accumulate<T> (timestamp now gets set to updated field)
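
To make that concrete, here is a simplified sketch of the accumulator behaviour (field and method names only approximate what is in exporter/prometheusexporter/accumulator.go; the real signatures differ):

package accumulatorsketch

import (
	"sync"
	"time"
)

// accumulatedValue stands in for the exporter's stored entry per metric series.
type accumulatedValue struct {
	updated time.Time // refreshed with time.Now() on every Accumulate call
	// metric payload elided
}

type accumulator struct {
	metricExpiration  time.Duration
	registeredMetrics sync.Map // series signature -> *accumulatedValue
}

// Accumulate refreshes the updated timestamp whenever an upstream component
// (here, the spanmetrics processor) re-sends the series.
func (a *accumulator) Accumulate(signature string) {
	a.registeredMetrics.Store(signature, &accumulatedValue{updated: time.Now()})
}

// Collect only drops series whose last update is older than metric_expiration.
// Because spanmetrics re-sends every cumulative series on each interval, the
// timestamp keeps refreshing and nothing ever expires.
func (a *accumulator) Collect(now time.Time) {
	a.registeredMetrics.Range(func(key, value any) bool {
		if now.Sub(value.(*accumulatedValue).updated) > a.metricExpiration {
			a.registeredMetrics.Delete(key)
		}
		return true
	})
}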

@matej-g (Contributor) commented Dec 20, 2023

I believe this issue should be reopened; it was closed due to inactivity but still persists. I also tried to reach out on Slack some time ago, unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).

I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.

From my understanding, this matches the plan proposed by @nijave, but it seems it has not been implemented yet. Are there any objections or known blockers?

@nijave (Contributor) commented Dec 20, 2023

> I believe this issue should be reopened; it was closed due to inactivity but still persists. I also tried to reach out on Slack some time ago, unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).
>
> I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.
>
> From my understanding, this matches the plan proposed by @nijave, but it seems it has not been implemented yet. Are there any objections or known blockers?

You would need a feature request, assuming it's related to what I thought it was. The spanmetricsprocessor doesn't take time into account and keeps publishing metrics subject to dimensions_cache_size (so it always keeps the last ###).

However, the spanmetricsprocessor should actually respect dimensions_cache_size now (#27083). Also worth noting that the spanmetricsprocessor is deprecated and replaced with the spanmetrics connector.

@nijave (Contributor) commented Dec 20, 2023

> I believe this issue should be reopened; it was closed due to inactivity but still persists. I also tried to reach out on Slack some time ago, unfortunately with no luck (https://cloud-native.slack.com/archives/CJFCJHG4Q/p1700844297561809).
>
> I'm facing the same scenario as @mfilipe, but none of the suggested solutions really satisfies my needs, as I simply want to stop exporting metrics for services that no longer produce spans. I don't think configuring the cache size or buckets will help me achieve that. I'd like a mechanism where such metrics are "expired" once no new spans are being produced.
>
> From my understanding, this matches the plan proposed by @nijave, but it seems it has not been implemented yet. Are there any objections or known blockers?

Actually, I might have lied. Setting aggregation_temporality to AGGREGATION_TEMPORALITY_DELTA probably does what you want, but keep in mind that this changes the output to gauges instead of counters (actually, not sure if this is true). See https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/spanmetricsprocessor. In that mode, the series get purged every cycle: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/spanmetricsprocessor/processor.go#L256

@rotscher (Contributor) commented Jan 5, 2024

We're also facing the case where metrics are not deleted, e.g. when a service is stopped and no longer produces spans. In that case the metric stops counting up but is still exported. As a consequence, the target_info metric of the prometheusexporter also never stops being exported.

Everything gets back (of course) to a clean state when the OpenTelemetry Collector instance is restarted.

To me this issue seems related to #21101.

Setting aggregation_temporality to AGGREGATION_TEMPORALITY_DELTA does not resolve the issue, as the span metrics (which are of type Sum and are monotonic) get dropped by the prometheus exporter, if I understand open-telemetry/opentelemetry-java#5062 correctly.

I've used version 0.90.1 of the opentelemetry-collector.
