Spanmetrics connector is not giving correct metrics of spans #21101

Open
harshraigroww opened this issue Apr 21, 2023 · 30 comments
Labels
bug (Something isn't working), connector/spanmetrics, never stale (issues with this label will never be staled and automatically removed)

Comments


Component(s)

connector/spanmetrics

What happened?

Description

I am using the spanmetrics connector to generate metrics from spans. The calls metric is a counter, so its value should always increase, but I can see the graph of the metrics generated by this connector going up and down.

Steps to Reproduce

Use the spanmetrics connector config below, pass it as an exporter in the traces pipeline, and receive it in the metrics pipeline.

connectors:
    count:
    servicegraph:
      latency_histogram_buckets: [5ms, 30ms, 100ms, 500ms, 2s]
      dimensions:
        - span.kind
      #   - dimension-2
      store:
        ttl: 5s
        max_items: 5000
    spanmetrics:
      histogram:
        explicit:
          buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 2s]
      dimensions:
        # - name: http.method
        #   default: GET
        - name: http.status_code
      dimensions_cache_size: 100000
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

Expected Result

calls and duration_count are counter metrics, so their value should always increase.

Actual Result

When a graph is plotted using these metrics, it goes up and down.
Screenshot 2023-04-21 at 7 58 29 PM

Collector version

0.75.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

  exporters:
    otlp:
      headers:
        x-scope-orgid: ****
      endpoint: *.*.*.*:4317
      tls:
        insecure: true
    prometheus:
      endpoint: "0.0.0.0:8081"
  extensions:
    health_check: {}
    memory_ballast: {}
  processors:
    batch: {}
    tail_sampling:
      decision_wait: 10s
      num_traces: 100
      expected_new_traces_per_sec: 100
      policies:
        [{ name: latency_policy, type: latency, latency: { threshold_ms: 1 } }]
    memory_limiter: null
  connectors:
    count:
    servicegraph:
      latency_histogram_buckets: [5ms, 30ms, 100ms, 500ms, 2s]
      dimensions:
        - span.kind
      store:
        ttl: 5s
        max_items: 5000
    spanmetrics:
      histogram:
        explicit:
          buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 2s]
      dimensions:
        - name: http.status_code
      dimensions_cache_size: 100000
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
  receivers:
    otlp:
      protocols:
        grpc:
          include_metadata: true
          endpoint: 0.0.0.0:4317
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      traces:
        exporters:
          - otlp
          - count
          - servicegraph
          - spanmetrics
        processors:
          - memory_limiter
          - batch
        receivers:
          - otlp
      metrics:
        receivers:
          - count
          - servicegraph
          - spanmetrics
        exporters:
          - prometheus

Log output

No response

Additional context

No response

harshraigroww added the bug and needs triage labels on Apr 21, 2023
@github-actions (Contributor):

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@fatsheep9146 (Contributor) commented Apr 23, 2023:

Does the collector keep restarting?

@harshraigroww (Author):

No, the collector is not restarting.

atoulme removed the needs triage label on Apr 24, 2023
@aptomaKetil:

I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.

@fatsheep9146 (Contributor):

> I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.

Could you share your configuration? @aptomaKetil

@aptomaKetil:

Sure @fatsheep9146:

---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    compression: gzip
    balancer_name: round_robin
    tls:
      insecure: true
  otlphttp/mimir:
    endpoint: http://mimir-distributor:8080/otlp
    tls:
      insecure: true
    compression: gzip

connectors:
  spanmetrics:
    histogram:
      explicit: null
      exponential:
        max_size: 64
    dimensions:
      - name: http.route
      - name: http.method
      - name: db.system
      - name: service.namespace
    namespace: spanmetrics

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: __app
          key: app.kubernetes.io/name
          from: pod
        - tag_name: __app
          key: app
          from: pod
        - tag_name: __app
          key: k8s-app
          from: pod
        - tag_name: service.version
          key: app.kubernetes.io/version
          from: pod
    pod_association:
      - sources:
          - from: connection
  resource:
    attributes:
      - key: service.name
        from_attribute: __app
        action: upsert
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlphttp/mimir]
    logs: null

@garry-cairns (Contributor) commented May 19, 2023:

@harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is jobs that can change host dynamically on restart while collectors run stably on the hosts. A job starting on host A and sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will keep sending the higher number it last had for the job when it was running there. This shows up as huge differences in short spaces of time as the two collectors' results interleave.

You can fix this by upserting the collector hostname onto your spans and then using that as a dimension for spanmetrics:

attributes/collector_info:
  actions:
    - key: collector.hostname
      value: $HOSTNAME
      action: upsert
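
For completeness, a hedged sketch of how the added attribute can then surface as a metric label; it assumes the attributes/collector_info processor above runs in the traces pipeline before the connector, and the http.status_code dimension is only illustrative:

connectors:
  spanmetrics:
    dimensions:
      - name: http.status_code
      # promote the span attribute added above to a metric label,
      # so each collector instance emits its own series
      - name: collector.hostname

With the extra dimension in place, each collector exports a distinct series, so the counters stay monotonic per series and rates can be summed across collectors.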

@equinsuocha:

> @harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is jobs that can change host dynamically on restart while collectors run stably on the hosts. A job starting on host A and sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will keep sending the higher number it last had for the job when it was running there. This shows up as huge differences in short spaces of time as the two collectors' results interleave.
>
> You can fix this by upserting the collector hostname onto your spans and then using that as a dimension for spanmetrics:
>
> attributes/collector_info:
>   actions:
>     - key: collector.hostname
>       value: $HOSTNAME
>       action: upsert

We actually had this setup initially with the spanmetrics processor, but still ran into a similar problem with the connector. The issue as I see it is that internally it takes into account ALL resource entry labels, disregarding the connector configuration, and then sends it down the line to the prometheus exporter with only the preferred label set, which causes metric collisions if the resource entry contains some changing value (we had this problem with PHP and the process id, as it starts a new process for every incoming request). Once we sanitized resource entries before the spanmetrics connector, the issue was resolved.
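
As an illustration of that sanitization step, a minimal sketch (not from the original comment) that drops a volatile resource attribute before spans reach the connector; process.pid is only an example of a changing value, substitute whatever attribute varies in your setup:

processors:
  resource/sanitize:
    attributes:
      # remove the per-process value so it cannot create a new
      # resource identity for every batch of spans
      - key: process.pid
        action: delete

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource/sanitize, batch]
      exporters: [spanmetrics]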

@mshebeko-twist commented Jun 18, 2023:

I experience the same behaviour - I have the collector deployed as a DaemonSet and the calls_total metric goes up and down.
I've port-forwarded to the pod to see what values the Prometheus exporter reports, and indeed there is some fluctuation.
In my case it is the same pod on the same host, so I'm not sure it is a label-set issue.

Edit:
Forgot to mention that if I reload the configs and restart the pod, there is no fluctuation in calls_total for a while, but it appears again later.

@dan-corneanu:

I am seeing the same behaviour with spans that are not at the root of the trace. For root spans all counter metrics are monotonic and going up.

What is the span.kind of your metric, @harshraigroww?

@harshraigroww (Author):

@dan-corneanu I am experiencing this issue with all span_kind values.

Screenshot 2023-06-27 at 12 45 55 PM

@dan-corneanu commented Jul 12, 2023:

@harshraigroww what version of otel/opentelemetry-collector-contrib are you using?
I have just updated my docker image to use the latest version from Docker Hub and at first glance it seems that I do not have this problem anymore. I'll keep investigating, though.

@harshraigroww (Author):

The collector-contrib version is 0.75.0.
I will check with the latest version again.

@mshebeko-twist:

Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now...

@harshraigroww have you found a solution/workaround?

Screen Shot 2023-08-28 at 11 29 54

@harshraigroww (Author):

No @mshebeko-twist,
I am currently using the spanmetrics processor instead of the connector.

fatsheep9146 added the never stale label on Sep 4, 2023
@albertteoh (Contributor):

> Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now...
>
> @harshraigroww have you found a solution/workaround?
>
> Screen Shot 2023-08-28 at 11 29 54

It looks like there are two competing sets of metrics (under the same metric name and labels) being emitted to the metrics storage. Perhaps there are multiple instances of the otel collector running the spanmetrics connector?

@mshebeko-twist:

@albertteoh, thanks for the advice! Unfortunately I validated this before the upgrade: I port-forwarded to the otel-collector's metrics endpoint that Prometheus scrapes, and after refreshing a couple of times I saw fluctuation in the calls_total metric for the problematic service. At one point I get a high value that indeed represents the number of calls, and at another I get a low value, which is the cause of this issue; when calculating a rate on this metric it results in a really high value...

I have the same setup in multiple environments; every environment has multiple instrumented services written in different languages/frameworks. What's interesting is that every environment has this issue occur for different services, which points me to the fact that it's not the OTEL instrumentation that causes this issue but the connector itself.

P.S.
@harshraigroww said that it works well for him using the processor. In my case both of them eventually produce this issue.

@albertteoh (Contributor):

Are there multiple instances of otel-collector pods running behind a service? Even though Prometheus is scraping from a single otel-collector port, the service could be load balancing across the otel-collector pods.

These metrics are all held in memory on the otel-collector instance; there's no federation across otel-collector instances.

I wouldn't expect the spanmetrics processor or connector to produce fluctuating metrics like that, especially when we can see a monotonically increasing pattern as in the screenshot above.

@mshebeko-twist:

They are running on each Kubernetes node and exposed via NodePort. For validation I port-forwarded to a specific pod to isolate the problem; this is how I confirmed the fluctuation of the value. So it's the same OTEL collector, monitoring the same pod, producing different results.

@albertteoh (Contributor):

Okay, thanks for checking that.

Is it possible to create a local reproducer through docker containers? You could use https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor as a template.

@dan-corneanu:

Hey, just an idea. Could we somehow make the spanmetrics connector tag its metrics with a unique UUID? This would allow us to detect whether the collector process itself gets restarted or whether there are multiple processes sending metrics with the same dimensions. What do you think?

@rotscher (Contributor) commented Jan 5, 2024:

This is how I can reproduce the issue.

App middleware.tc10java17 (instrumented with javaagent) => otelcol (v0.86.0, as agent) => otelcol (v0.90.1, as gateway) => prometheus

The spanmetricsconnector is configured on the gateway collector.

The agent collector is configured as follows:

(...)
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: resource
        statements:
          - set(attributes["provider.observability"], "true")
(...)
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter
        - transform
        - attributes
        - batch
      exporters:
        - otlp/gateway

The following steps are done:

  1. Start telemetry processing; everything is OK.
  2. Change the attribute provider.observability from true to false and restart the agent collector.

After the restart, the span metrics start going "up and down" (red circle in the screenshot).
The issue is resolved as soon as the gateway collector is restarted (blue circle in the screenshot).

[screenshot]

Another hint: when the resource_to_telemetry_conversion flag of the prometheusremotewrite exporter is enabled, the metrics behave correctly, as a new time series is created due to the changed attribute. The fact that the "expired" metric never vanishes seems to be another issue (#17306).
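
For reference, a hedged sketch of that flag on the prometheusremotewrite exporter (the endpoint is a placeholder); it copies resource attributes onto the exported metrics as labels, so the changed attribute starts a new time series instead of colliding with the old one:

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true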

@rotscher (Contributor):

As of version 0.92.0 there is a new configuration option resource_metrics_key_attributes (See #29711).

Updating to 0.92.0 and configuring this option resolved the issue for me.

@chewrocca:

> As of version 0.92.0 there is a new configuration option resource_metrics_key_attributes (See #29711).
>
> Updating to 0.92.0 and configuring this option resolved the issue for me.

Out of curiosity, what value did you set for resource_metrics_key_attributes?

@rotscher (Contributor):

> > As of version 0.92.0 there is a new configuration option resource_metrics_key_attributes (See #29711).
> > Updating to 0.92.0 and configuring this option resolved the issue for me.
>
> Out of curiosity, what value did you set for resource_metrics_key_attributes?

    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name

It is also mentioned in the example in the spanmetrics connector's README.
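
For placement, a hedged sketch of where the option sits in the connector configuration; the histogram and dimension settings are only illustrative:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 2s]
    dimensions:
      - name: http.status_code
    # key the cached metric streams by a stable subset of resource
    # attributes instead of the full, possibly changing, resource
    resource_metrics_key_attributes:
      - service.name
      - telemetry.sdk.language
      - telemetry.sdk.name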

@mshebeko-twist:

Thanks @rotscher, from what I see right now calls_total has stopped fluctuating after setting resource_metrics_key_attributes as you mentioned:

resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name

You can see counters are behaving properly now:
Screen Shot 2024-01-29 at 12 51 30

@mx-psi (Member) commented Apr 26, 2024:

I filed open-telemetry/opentelemetry.io/issues/4368 since this seems like a common point of confusion, and it can potentially happen with other components. We can consider adding a link to this docs section on the spanmetrics connector page when we fix that.

My recommendation would be to use the resource detection processor or the k8sattributes processor to add an appropriate label.
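
As a hedged illustration of that recommendation, a minimal resource detection processor setup that stamps spans with a stable host.name before the connector and exposes it as a metric label; the detector choice is an assumption, and in Kubernetes the k8sattributes processor is the usual alternative:

processors:
  resourcedetection:
    detectors: [system]
    system:
      hostname_sources: [os]

connectors:
  spanmetrics:
    dimensions:
      # dimensions are looked up in span attributes first,
      # then resource attributes, so host.name becomes a label
      - name: host.name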

> Could we somehow make the spanmetrics connector tag its metrics with a unique UUID

If you want to do this and the above solution does not work for you, you can use the UUID() function in the transform processor. This runs the risk of producing a cardinality explosion if restarts are happening frequently, so use it at your own risk :) I also think it's less useful than the above suggestion, since the UUID does not have any meaning.

@ramanjaneyagupta:

Hi, I am also having a similar issue: using span metrics, it keeps giving wrong metrics.
This is the issue I raised - #32043
I tried with a central gateway, and also tried sending the data by routing to a second layer using service-name load balancing, but it keeps giving wrong metrics.

Some help would be appreciated!

@vaibhhavv:

Hi @harshraigroww, any update on this issue?
I am also facing the same issue: when we use the Tempo metrics generator, we see a graph with a monotonic increase.
But with the spanmetrics connector, the graph is not monotonic.
As @rotscher suggested, I added resource_metrics_key_attributes, but still no luck: the graph has ups and downs and is not monotonic.

@VijayPatil872:

Hi @harshraigroww @aptomaKetil, as suggested, we upserted the collector hostname onto spans and used it as a dimension for spanmetrics, but the issue is not resolved. Also, as @rotscher suggested, we added resource_metrics_key_attributes, but the graph still has ups and downs and does not increase monotonically.
Here we are facing the same issue: when we use the Tempo metrics generator, we see a graph with a monotonic increase.
But with the spanmetrics connector, the graph is not monotonic.
