
[connector/spanmetrics] Metrics keep being produced for spans that are no longer being received #30559

Closed
matej-g opened this issue Jan 15, 2024 · 14 comments
Labels
bug Something isn't working connector/spanmetrics

Comments

@matej-g
Contributor

matej-g commented Jan 15, 2024

Component(s)

connector/spanmetrics

What happened?

Description

I have an application that is sending spans to the collector, which are subsequently run through the connector. However, once that application is shut down, I'm seeing metrics for the spans previously generated by the app being produced indefinitely. This is despite the fact that no new traces are being emitted by the application (since it has already been shut down, as stated above). This is particularly problematic for applications with a large number of operations (spans), since I keep receiving tons of data indefinitely (i.e. until I restart the collector).

Steps to Reproduce

The easiest way to reproduce is with telemetrygen. For example:

  1. Run a collector with a simple pipeline that accepts OTLP traces -> exports the traces to the spanmetrics connector -> receives the generated metrics -> exports the metrics to the debug exporter
  2. Send a couple of traces from telemetrygen to the collector
  3. Observe on stdout that the duration histogram is being produced by the spanmetrics connector for the spans emitted by telemetrygen
  4. Terminate the telemetrygen pod
  5. Observe on stdout that the duration histogram data points are still being produced for the spans previously created by telemetrygen, even though the service has already been terminated

Expected Result

The metrics should stop being produced eventually.

Actual Result

The metrics keep getting exported indefinitely (until I restart the collector).

Collector version

v0.91.0

Environment information

Environment

Local kind cluster

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      http:
      grpc:

exporters:
  debug:

connectors:
  spanmetrics:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [debug]

Log output

No response

Additional context

There have been a couple of similar issues flying around (e.g. #29604, #17306), although it's not 100% clear whether those users are describing the same issue as here, since previously there were also related reports of memory leaks.

Some users have been advised to adjust the config (e.g. this suggestion #17306 (comment)), but these changes unfortunately do not address the cause of the issue. As a side note, even when decreasing the size of the cache, the number of metrics that keep being produced does not change according to my tests; at least for cumulative temporality, cache eviction does not actually seem to take place, but this is only my deduction after glancing at the connector code.
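For illustration, that cache workaround amounts to something like the following sketch, assuming the connector's dimensions_cache_size option from its README (the value shown is arbitrary):

connectors:
  spanmetrics:
    # Hypothetical smaller cache, per the workaround suggested in #17306;
    # as noted above, in my tests this did not reduce the number of exported series.
    dimensions_cache_size: 500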

I would imagine that ideally this could be solved if we could implement logic where "if span X is not seen for Y amount of time, stop producing metrics for this span".

@matej-g matej-g added bug Something isn't working needs triage New item requiring triage labels Jan 15, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@portertech
Contributor

@albertteoh seems like we need to implement an expiration mechanism. Even if it's last_touched < Time.now - 5m etc.

@albertteoh
Contributor

> @albertteoh seems like we need to implement an expiration mechanism. Even if it's last_touched < Time.now - 5m etc.

> if we could implement logic where "if span X is not seen for Y amount of time, stop producing metrics for this span".

This suggestion sounds quite reasonable to me. I agree, the spanmetrics connector should stop producing metrics after its source stops producing spans for some configurable period of time.

@matej-g
Contributor Author

matej-g commented Jan 22, 2024

Great, thanks for the feedback, folks. If there's agreement on the feature, I'd be happy to give it a try.

@tqi-raurora

Hello everyone, just to let you know, after reading through the comments, I do believe this is the same behavior causing this issue.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Jan 31, 2024
@portertech
Contributor

@matej-g are you able to have a go at the feature?

@matej-g
Contributor Author

matej-g commented Feb 5, 2024

@portertech Hey, sorry for the delay, it's on my plate right now. I'll come back with a PR soon.

@jaskarnshergillCE

I have the exact same issue with an identical config. Would love to see the PR merged soon :)

I have an application pushing spans and metrics via opentelemetry-java-agent to an otel collector.

When I stop the application, or when the application is running but not doing anything, span metrics are still being pushed to Prometheus.

When I restart the otel collector, the spanmetrics connector stops pushing metrics to Prometheus.

Using otel/opentelemetry-collector-contrib:0.95.0

TylerHelmuth pushed a commit that referenced this issue Mar 12, 2024
**Description:** 
Adds a new feature to expire metrics that are considered stale. If no
new spans are received within the given time frame, then on the next
export cycle the metrics are considered expired and will no longer be
exported by the `spanmetricsconnector`.

This intends to solve a situation where a service is no longer producing
spans (e.g. because it was removed), but the metrics for such spans keep
being produced indefinitely. See the linked issue for more details.

The feature can be configured by setting the `metrics_expiration`
option. The current behavior (metrics never expire) is kept as the
default.

**Link to tracking Issue:** #30559

**Testing:** Added unit tests and tested manually as well.

**Documentation:** Updated in-code documentation and README.

---------

Signed-off-by: Matej Gera <matejgera@gmail.com>
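For reference, enabling this on top of the configuration from the original report is a one-line change; a minimal sketch, assuming the option syntax documented in the connector README (the 5m value is only an illustration):

connectors:
  spanmetrics:
    # Hypothetical expiration window; omit the option to keep the default
    # behavior, in which metrics never expire.
    metrics_expiration: 5m

The rest of the pipeline (OTLP receiver, debug exporter) stays as in the original report.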
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024 (…#31106)

XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024 (…#31106)
@matej-g
Contributor Author

matej-g commented Mar 25, 2024

@jaskarnshergillCE this change has already been merged; it's now just a matter of releasing 0.97.0, in which it will be available.

@atoulme
Contributor

atoulme commented Mar 26, 2024

Now that 0.97.0 is out, can we close this issue as complete?

@matej-g
Contributor Author

matej-g commented Apr 4, 2024

Yes, looks like this was left open accidentally, closing. Thanks everyone!

@shicli

shicli commented Jun 24, 2024

@matej-g Thanks, this is indeed the problem I encountered. #29604

@manojksardana

I am using version 0.103 of the collector and still see this problem. I have set the following configuration:

metrics_expiration: 15m

However, even after the K8s pod emitting the data has been removed, I continue to see the aggregated metric series (_total, _sum, _count and _bucket) hours later. They get cleaned up only after a collector restart.

@tqi-raurora

Hi everyone!

Version: 0.110.0

I'm still experiencing behavior like this and would like to know whether it's intended or not.

Steps to Reproduce

  • Use the following configuration:

    connectors:
      spanmetrics:
        dimensions:
          - name: host.name
        metrics_expiration: 2m

  • Send a trace with host.name="A", then continue sending traces with host.name="B".

  • As long as traces with host.name="B" are sent, metrics for host.name="A" continue to be generated.

Issue

This behavior creates issues for services running in ephemeral containers, where new host.name values are frequently generated while old ones stop being used. Consequently, the cardinality of the host.name label keeps growing for as long as the service remains active.
