
[connector/spanmetrics] Metrics keep being produced for spans that are no longer being received #30559

Closed
matej-g opened this issue Jan 15, 2024 · 14 comments
Labels
bug Something isn't working connector/spanmetrics

Comments

@matej-g
Contributor

matej-g commented Jan 15, 2024

Component(s)

connector/spanmetrics

What happened?

Description

I have an application that is sending spans to the collector, which are subsequently run through the connector. However, once that application is shut down, I'm seeing metrics for the spans previously generated by the app being produced indefinitely. This is despite the fact that no new traces are being emitted by the application (since it has already been shut down, as stated above). This is particularly problematic for applications with a large number of operations (spans), since I keep receiving tons of data indefinitely (i.e. until I restart the collector).

Steps to Reproduce

The easiest way to reproduce is with telemetrygen. For example:

  1. Run a collector with a simple pipeline that accepts OTLP traces -> exports the traces to the spanmetrics connector -> receives the generated metrics -> exports the metrics to the debug exporter
  2. Send a couple of traces from telemetrygen to the collector
  3. Observe on stdout that the duration histogram is being produced by the spanmetrics connector for the spans emitted by telemetrygen
  4. Terminate the telemetrygen pod
  5. Observe on stdout that the duration histogram data points are still being produced for the spans previously created by telemetrygen, even though the service has already been terminated

Expected Result

The metrics should stop being produced eventually.

Actual Result

The metrics keep getting exported indefinitely (until I restart the collector).

Collector version

v0.91.0

Environment information

Environment

Local kind cluster

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  debug:

connectors:
  spanmetrics:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [debug]

Log output

No response

Additional context

There have been a couple of similar issues flying around (e.g. #29604, #17306), although it's not 100% clear whether those users are describing the same issue as here, since previously there were also related reports of memory leaks.

Some users have been advised to adjust the config (e.g. this suggestion #17306 (comment)), but these changes unfortunately do not address the cause of the issue. As a side note, even when decreasing the size of the cache, the number of metrics that keep being produced does not change according to my tests; at least for cumulative temporality, cache eviction does not actually seem to take place, but this is only my deduction after glancing at the connector code.
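For illustration, that cache workaround amounts to something like the following sketch, assuming the connector's dimensions_cache_size option from its README (the value shown is arbitrary):

connectors:
  spanmetrics:
    # Hypothetical smaller cache, per the workaround suggested in #17306;
    # as noted above, in my tests this did not reduce the number of exported series.
    dimensions_cache_size: 500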

I would imagine that ideally this could be solved if we could implement logic where "if span X is not seen for Y amount of time, stop producing metrics for this span".

@matej-g matej-g added bug Something isn't working needs triage New item requiring triage labels Jan 15, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@portertech
Contributor

@albertteoh seems like we need to implement an expiration mechanism. Even if it's last_touched < Time.now - 5m etc.

@albertteoh
Contributor

> @albertteoh seems like we need to implement an expiration mechanism. Even if it's last_touched < Time.now - 5m etc.

> if we could implement logic where "if span X is not seen for Y amount of time, stop producing metrics for this span".

This suggestion sounds quite reasonable to me. I agree, the spanmetrics connector should stop producing metrics after its source stops producing spans for some configurable period of time.

@matej-g
Contributor Author

matej-g commented Jan 22, 2024

Great, thanks for the feedback, folks. If there's agreement on the feature, I'd be happy to give it a try.

@tqi-raurora

Hello everyone, just to let you know, after reading through the comments, I do believe this is the same behavior causing this issue.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Jan 31, 2024
@portertech
Contributor

@matej-g are you able to have a go at the feature?

@matej-g
Contributor Author

matej-g commented Feb 5, 2024

@portertech Hey, sorry for the delay, it's on my plate right now. I'll come back with a PR soon.

@jaskarnshergillCE

I have the exact same issue with an identical config. Would love to see the PR merged soon :)

I have an application pushing spans and metrics via opentelemetry-java-agent to an otel collector.

When I stop the application, or when the application is running but not doing anything, span metrics are still being pushed to Prometheus.

When I restart the otel collector, the spanmetrics connector stops pushing metrics to Prometheus.

Using otel/opentelemetry-collector-contrib:0.95.0

TylerHelmuth pushed a commit that referenced this issue Mar 12, 2024
**Description:** 
Adds a new feature to expire metrics that are considered stale. If no
new spans are received within the given time frame, then on the next
export cycle the metrics are considered expired and will no longer be
exported by the `spanmetricsconnector`.

This intends to solve a situation where a service is no longer producing
spans (e.g. because it was removed), but the metrics for such spans keep
being produced indefinitely. See the linked issue for more details.

The feature can be configured by setting the `metrics_expiration`
option. The current behavior (metrics never expire) is kept as the
default.

**Link to tracking Issue:** #30559

**Testing:** Added unit tests and tested manually as well.

**Documentation:** Updated in-code documentation and README.

---------

Signed-off-by: Matej Gera <matejgera@gmail.com>
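For reference, enabling this on top of the configuration from the original report is a one-line change; a minimal sketch, assuming the option syntax documented in the connector README (the 5m value is only an illustration):

connectors:
  spanmetrics:
    # Hypothetical expiration window; omit the option to keep the default
    # behavior, in which metrics never expire.
    metrics_expiration: 5m

The rest of the pipeline (OTLP receiver, debug exporter) stays as in the original report.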
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024 (…#31106)

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024 (…#31106)
@matej-g
Contributor Author

matej-g commented Mar 25, 2024

@jaskarnshergillCE this change has already been merged; it's now just a matter of releasing 0.97.0, in which it will be available.

@atoulme
Contributor

atoulme commented Mar 26, 2024

Now that 0.97.0 is out, can we close this issue as complete?

@matej-g
Contributor Author

matej-g commented Apr 4, 2024

Yes, looks like this was left open accidentally, closing. Thanks everyone!

@shicli

shicli commented Jun 24, 2024

@matej-g Thanks, this is indeed the problem I encountered. #29604

@manojksardana

I am using version 0.103 of the collector and still see this problem. I have set the following configuration:

metrics_expiration: 15m

However, even after the K8s pod emitting the data has been removed, I continue to see the aggregated metric series (_total, _sum, _count and _bucket) hours later. They get cleaned up only after a collector restart.

@tqi-raurora

Hi everyone!

Version: 0.110.0

I'm still experiencing behavior like this and would like to know whether it's intended or not.

Steps to Reproduce

  • Use the following configuration:

    connectors:
      spanmetrics:
        dimensions:
          - name: host.name
        metrics_expiration: 2m

  • Send a trace with host.name="A", then continue sending traces with host.name="B".

  • As long as traces with host.name="B" are sent, metrics for host.name="A" continue to be generated.

Issue

This behavior creates issues for services running in ephemeral containers, where new host.name values are frequently generated while old ones stop being used. Consequently, the cardinality of the host.name label keeps growing for as long as the service remains active.
