[Connector/Servicegraph] Servicegraph Connector are not giving correct metrics of spans #34170

Open
VijayPatil872 opened this issue Jul 19, 2024 · 4 comments
Labels: bug, connector/servicegraph

Comments

VijayPatil872 commented Jul 19, 2024

Component(s)

connector/servicegraph

What happened?

Description

I am using the servicegraph connector to generate a service graph and metrics from spans. The metrics emitted by the connector fluctuate up and down.
We have deployed a layer of Collectors running the load-balancing exporter in front of the trace Collectors that do the spanmetrics and servicegraph connector processing. The load-balancing exporter hashes the trace ID consistently to determine which backend Collector receives the spans for that trace.
The servicegraph metrics are exported to Grafana Mimir with the prometheusremotewrite exporter.
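
The front tier described above could look roughly like the following. This is only a sketch of a typical load-balancing exporter setup, not the configuration actually deployed: the resolver hostname and the pipeline name are placeholders, and the DNS resolver is an assumption (a static or Kubernetes resolver would serve the same purpose).

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    # Route by trace ID so that all spans of one trace reach the same backend Collector.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Placeholder: headless service name of the backend trace Collectors.
        hostname: traces-collector-headless.example.svc.cluster.local

service:
  pipelines:
    traces:
      receivers:
        - otlp
      exporters:
        - loadbalancing
```

Routing by trace ID matters here because the servicegraph connector pairs the client and server spans of a request, which only works when both arrive at the same Collector instance.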

Steps to Reproduce

Expected Result

The metrics emitted by the connector should be correct.

Actual Result

image

Collector version

0.104.0

Environment information

No response

OpenTelemetry Collector configuration

config:
  exporters:
    prometheusremotewrite/mimir-default-processor-spanmetrics:
      endpoint: 
      headers:
        x-scope-orgid: ********
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500        

    prometheusremotewrite/mimir-default-servicegraph:
      endpoint: 
      headers:
        x-scope-orgid: **********
      resource_to_telemetry_conversion:
        enabled: true
      timeout: 30s  
      tls:
        insecure: true
      remote_write_queue:
        enabled: true
        queue_size: 100000
        num_consumers: 500


  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [100ms, 500ms, 2s, 5s, 10s, 20s, 30s]
      aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
      metrics_flush_interval: 15s
      metrics_expiration: 5m
      exemplars:
        enabled: false
      dimensions:
        - name: http.method
        - name: http.status_code
        - name: cluster
        - name: collector.hostname
      events:
        enabled: true
        dimensions:
          - name: exception.type
      resource_metrics_key_attributes:
        - service.name
        - telemetry.sdk.language
        - telemetry.sdk.name
    servicegraph:
      latency_histogram_buckets: [100ms, 250ms, 1s, 5s, 10s]
      store:
        ttl: 2s
        max_items: 10

  receivers:
    otlp:
      protocols:
        http:
          endpoint: ${env:MY_POD_IP}:*****
        grpc:
          endpoint: ${env:MY_POD_IP}:*****
  service:
    pipelines:
      traces/connector-pipeline:
        exporters:
          - otlphttp/tempo-processor-default
          - spanmetrics
          - servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - otlp
     
      metrics/spanmetrics:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-processor-spanmetrics
        processors:
          - batch          
          - memory_limiter
        receivers:
          - spanmetrics

      metrics/servicegraph:
        exporters:
          - debug
          - prometheusremotewrite/mimir-default-servicegraph
        processors:
          - batch          
          - memory_limiter
        receivers:
          - servicegraph

Log output

No response

Additional context

No response

VijayPatil872 added the bug and needs triage labels on Jul 19, 2024

@VijayPatil872

Is there any update on this issue?

mapno commented Oct 3, 2024

Can you provide more information on why the metrics are incorrect? A test or test data that reproduces the behaviour would be very helpful.

VijayPatil872 commented Oct 10, 2024

@mapno If we consider the traces_service_graph_request_total or traces_service_graph_request_failed_total metrics, these should be counters, but they are seen fluctuating up and down.
Similarly, calls_total from spanmetrics should be a counter, but its graph sometimes goes up and down.
Also, can you explain what kind of test or test data you need, given the configuration shown above? Let me know if additional details are required.
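
One possible explanation, which this thread does not confirm, is that several backend Collectors each keep their own cumulative counters and write them to Mimir under identical label sets, so samples from different instances interleave in a single series. A minimal sketch of how that hypothesis could be tested follows. It assumes the `MY_POD_IP` environment variable already used in the configuration above is unique per Collector. The processor name `resource/collector-instance` and the attribute key `collector.instance` are hypothetical.

```yaml
processors:
  # Hypothetical: stamp every metric resource with an identifier for this Collector.
  resource/collector-instance:
    attributes:
      - key: collector.instance
        value: ${env:MY_POD_IP}
        action: upsert

service:
  pipelines:
    metrics/servicegraph:
      receivers:
        - servicegraph
      processors:
        - resource/collector-instance
      exporters:
        - prometheusremotewrite/mimir-default-servicegraph
```

Because `resource_to_telemetry_conversion` is enabled on the exporter, the resource attribute would be written as a metric label, so each Collector would produce its own series. If the per-instance series then increase monotonically, the fluctuation likely came from series collisions. Queries would need to aggregate across the new label to get totals.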

atoulme removed the needs triage label on Oct 12, 2024