Spanmetrics connector is not giving correct metrics for spans #21101
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Does the collector keep restarting?
No, the collector is not restarting.
I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.
Could you share your configuration? @aptomaKetil
Sure @fatsheep9146:

```yaml
---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    compression: gzip
    balancer_name: round_robin
    tls:
      insecure: true
  otlphttp/mimir:
    endpoint: http://mimir-distributor:8080/otlp
    tls:
      insecure: true
    compression: gzip

connectors:
  spanmetrics:
    histogram:
      explicit: null
      exponential:
        max_size: 64
    dimensions:
      - name: http.route
      - name: http.method
      - name: db.system
      - name: service.namespace
    namespace: spanmetrics

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: __app
          key: app.kubernetes.io/name
          from: pod
        - tag_name: __app
          key: app
          from: pod
        - tag_name: __app
          key: k8s-app
          from: pod
        - tag_name: service.version
          key: app.kubernetes.io/version
          from: pod
    pod_association:
      - sources:
          - from: connection
  resource:
    attributes:
      - key: service.name
        from_attribute: __app
        action: upsert
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlphttp/mimir]
    logs: null
```
@harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is if you have jobs that can change host dynamically on restart while the collectors run stably on the hosts. A job starting on host A and sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will continue sending the higher number it last had for the job when it was running there. This shows up as huge differences in short spaces of time as the two collectors' results interleave. You can fix this by upserting the collector hostname onto your spans and then using that as a dimension for spanmetrics.
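For illustration, a minimal sketch of that workaround, assuming the collector's hostname is exposed in a `HOSTNAME` environment variable (for example via the Kubernetes downward API); the `collector.hostname` attribute name and the processor name are placeholders, not part of the original report:

```yaml
processors:
  resource/collector-id:
    attributes:
      - key: collector.hostname    # hypothetical attribute name
        value: ${env:HOSTNAME}     # assumes HOSTNAME is set on the collector pod/host
        action: upsert

connectors:
  spanmetrics:
    dimensions:
      - name: collector.hostname   # each collector instance now emits its own timeseries
```

The `resource/collector-id` processor would have to run in the traces pipeline before the spans reach the spanmetrics connector.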
We actually had this setup initially with the spanmetrics processor, but still ran into a similar problem with the connector. The issue as I see it is that internally it takes ALL resource attributes into account, disregarding the connector configuration, and then sends the result down the line to the Prometheus exporter with only the preferred label set, which causes metric collisions if the resource contains some changing value (we had this problem with PHP and the process id, as it starts a new process for every incoming request). Once we sanitized resource attributes before the spanmetrics connector, the issue was resolved.
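For illustration, a minimal sketch of that kind of sanitization, assuming `process.pid` is the volatile resource attribute (as in the PHP case above); the processor name and pipeline contents are assumptions:

```yaml
processors:
  resource/sanitize:
    attributes:
      - key: process.pid    # changes for every PHP request, so drop it before the connector
        action: delete

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource/sanitize]
      exporters: [otlp/tempo, spanmetrics]
```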
I experience the same behaviour - I have a collector running as a DaemonSet and ... Edit:
I am seeing the same behaviour with spans that are not at the root of the trace. For root spans, all counter metrics are monotonic and going up. What is the
@dan-corneanu I am experiencing this issue with all span_kind values.
@harshraigroww what version of otel/opentelemetry-collector-contrib are you using?
Collector-contrib version is 0.75.0
Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now... @harshraigroww have you found a solution or workaround?
No, @mshebeko-twist.
It looks like there are two competing sets of metrics (under the same metric name and labels) being emitted to the metrics storage. Perhaps there are multiple instances of the otel collector running the spanmetrics connector?
@albertteoh, thanks for the advice! Unfortunately I validated this before the upgrade: I port-forwarded to the otel-collector's metrics endpoint that Prometheus scrapes, and after refreshing a couple of times I saw the calls_total metric fluctuate for the problematic service. At one point I get a high value that indeed represents the number of calls, and at another I get a low value, which is the cause of this issue when calculating rates. I have the same setup in multiple environments, and every environment has multiple instrumented services written in different languages/frameworks. What's interesting is that each environment has this issue occur for different services, which points me to the fact that it's not the OTEL instrumentation that causes it but the connector itself. P.S.
Are there multiple instances of otel-collector pods running behind a service? Even though Prometheus is scraping from a single otel-collector port, the service could be load balancing across the otel-collector pods. These metrics are all held in memory on the otel-collector instance; there's no federation across otel-collector instances. I wouldn't expect the spanmetrics processor or connector to produce fluctuating metrics like that, especially when we can see a monotonically increasing pattern as in the screenshot above.
They are running on each Kubernetes node and exposed via
Okay, thanks for checking that. Is it possible to create a local reproducer through Docker containers? You could use https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor as a template.
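Not the linked template itself, but a hypothetical starting point for such a reproducer: two collector instances sharing one config, with Prometheus alongside; image tags, file names, and ports are all assumptions:

```yaml
# docker-compose.yml (sketch)
services:
  otel-collector-a:
    image: otel/opentelemetry-collector-contrib:0.92.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otelcol-config.yaml:/etc/otelcol/config.yaml
  otel-collector-b:
    image: otel/opentelemetry-collector-contrib:0.92.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otelcol-config.yaml:/etc/otelcol/config.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
```

Sending the same spans alternately to otel-collector-a and otel-collector-b should reproduce the interleaving counters described above.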
Hey, just an idea. Could we somehow make the
This is how I can reproduce the issue: app → agent collector → gateway collector. The spanmetrics connector is configured on the gateway collector. The agent collector is configured as follows:
The following steps are done:
After the restart, the span metrics start going "up and down" (red circle in the screenshot). Another hint: when the flag
As of version 0.92.0 there is a new configuration option, resource_metrics_key_attributes. Updating to 0.92.0 and configuring this option resolved the issue for me.
Out of curiosity, what value did you set for resource_metrics_key_attributes?
It is also mentioned in the example in the spanmetrics connector's README.
Thanks @rotscher, from what I see right now:

```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
I filed open-telemetry/opentelemetry.io/issues/4368 since this seems like a common point of confusion, and it can potentially happen with other components. We can consider adding a link to this docs section on the spanmetrics connector page when we fix that. My recommendation would be to use the resource detection processor or the k8sattributes processor to add an appropriate label.
If you want to do this, you can use the
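A minimal sketch of that recommendation using the resource detection processor; whether host.name (versus, say, k8s.pod.name from k8sattributes) is the right per-instance label is an assumption about the deployment, not part of the original comment:

```yaml
processors:
  resourcedetection:
    detectors: [env, system]   # the system detector sets host.name on the resource

connectors:
  spanmetrics:
    dimensions:
      - name: host.name        # a per-host label keeps each collector's series apart
```

As with the earlier sketch, resourcedetection has to run in the traces pipeline before the spans reach the spanmetrics connector.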
Hi, I am also having a similar issue: using span-metrics, it keeps giving wrong metrics. Some help would be appreciated!
Hi @harshraigroww, any update on this issue?
Hi @harshraigroww @aptomaKetil, as suggested, I upserted the collector hostname onto spans and used it as a dimension for spanmetrics, but the issue is not resolved. Also, as @rotscher suggested, I added resource_metrics_key_attributes, but the graph still has ups and downs and does not increase monotonically.
Component(s)
connector/spanmetrics
What happened?
Description
I am using the spanmetrics connector to generate metrics from spans. The calls metric is a counter, so its value should always increase, but I can see the graph of the metrics generated by this connector going up and down.
Steps to Reproduce
Use the spanmetrics connector, passing it as an exporter in the traces pipeline and receiving it in the metrics pipeline (a minimal sketch follows).
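A minimal sketch of that wiring (the debug exporter and the empty settings are placeholders for this sketch, not the reporter's actual configuration, which appears earlier in the thread):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  debug: {}              # placeholder backend for the sketch

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, debug]   # the connector consumes traces as an exporter
    metrics:
      receivers: [spanmetrics]          # and produces metrics as a receiver
      exporters: [debug]
```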
Expected Result
calls and duration_count are counter metrics, so their values should always increase.
Actual Result
When a graph is plotted using these metrics, it goes up and down.
Collector version
0.75.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response