Spanmetrics connector is not giving correct metrics for spans #21101
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Does the collector keep restarting?
No, the collector is not restarting.
I'm seeing the same issue when running more than one instance of the collector with the spanmetrics connector enabled. The generated metrics do not have a label unique to that instance of the collector, so any metrics generated for a given span end up in the same timeseries even though they are different counters from different collectors.
Could you share your configuration? @aptomaKetil
Sure @fatsheep9146:

```yaml
---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    compression: gzip
    balancer_name: round_robin
    tls:
      insecure: true
  otlphttp/mimir:
    endpoint: http://mimir-distributor:8080/otlp
    tls:
      insecure: true
    compression: gzip

connectors:
  spanmetrics:
    histogram:
      explicit: null
      exponential:
        max_size: 64
    dimensions:
      - name: http.route
      - name: http.method
      - name: db.system
      - name: service.namespace
    namespace: spanmetrics

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: __app
          key: app.kubernetes.io/name
          from: pod
        - tag_name: __app
          key: app
          from: pod
        - tag_name: __app
          key: k8s-app
          from: pod
        - tag_name: service.version
          key: app.kubernetes.io/version
          from: pod
    pod_association:
      - sources:
          - from: connection
  resource:
    attributes:
      - key: service.name
        from_attribute: __app
        action: upsert
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
      - key: service.namespace
        from_attribute: k8s.namespace.name
        action: upsert

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors:
        - memory_limiter
        - k8sattributes
        - resource
        - batch
      exporters: [otlphttp/mimir]
    logs: null
```
@harshraigroww @aptomaKetil One possible cause of this I've seen in the wild is if you have jobs that can change host dynamically on restart while the collectors run stably on the hosts. A job starting on host A and sending metrics to collector A restarts and comes back up on host B. The collector on host B will now send lower numbers for spanmetrics, but the collector on host A will continue sending the higher number it last had for the job when it was running there. This shows up as huge differences in short spaces of time as the two collectors' results interleave. You can fix this by upserting the collector hostname onto your spans and then using that as a dimension for spanmetrics.
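For illustration, a minimal sketch of that workaround, assuming the collector's hostname is exposed in a `HOSTNAME` environment variable (for example via the Kubernetes downward API); the `collector.hostname` attribute name and the processor name are placeholders, not part of the original report:

```yaml
processors:
  resource/collector-id:
    attributes:
      - key: collector.hostname    # hypothetical attribute name
        value: ${env:HOSTNAME}     # assumes HOSTNAME is set on the collector pod/host
        action: upsert

connectors:
  spanmetrics:
    dimensions:
      - name: collector.hostname   # each collector instance now emits its own timeseries
```

The `resource/collector-id` processor would have to run in the traces pipeline before the spans reach the spanmetrics connector.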
We actually had this setup initially with the spanmetrics processor, but still ran into a similar problem with the connector. The issue as I see it is that internally it takes ALL resource attributes into account, disregarding the connector configuration, and then sends the result down the line to the Prometheus exporter with only the preferred label set, which causes metric collisions if the resource contains some changing value (we had this problem with PHP and the process id, as it starts a new process for every incoming request). Once we sanitized resource attributes before the spanmetrics connector, the issue was resolved.
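For illustration, a minimal sketch of that kind of sanitization, assuming `process.pid` is the volatile resource attribute (as in the PHP case above); the processor name and pipeline contents are assumptions:

```yaml
processors:
  resource/sanitize:
    attributes:
      - key: process.pid    # changes for every PHP request, so drop it before the connector
        action: delete

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource/sanitize]
      exporters: [otlp/tempo, spanmetrics]
```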
I experience the same behaviour - I have a collector running as a DaemonSet and ... Edit:
I am seeing the same behaviour with spans that are not at the root of the trace. For root spans, all counter metrics are monotonic and going up. What is the
@dan-corneanu I am experiencing this issue with all span_kind values.
@harshraigroww what version of otel/opentelemetry-collector-contrib are you using?
Collector-contrib version is 0.75.0
Updated to v0.83.0 and the issue still happens. This issue really bothers me because developers lose trust in the whole setup when they see a wrong request rate. And it's impossible to scale based on request rate now... @harshraigroww have you found a solution or workaround?
No, @mshebeko-twist.
It looks like there are two competing sets of metrics (under the same metric name and labels) being emitted to the metrics storage. Perhaps there are multiple instances of the otel collector running the spanmetrics connector?
@albertteoh, thanks for the advice! Unfortunately I validated this before the upgrade: I port-forwarded to the otel-collector's metrics endpoint that Prometheus scrapes, and after refreshing a couple of times I saw the calls_total metric fluctuate for the problematic service. At one point I get a high value that indeed represents the number of calls, and at another I get a low value, which is the cause of this issue when calculating rates. I have the same setup in multiple environments, and every environment has multiple instrumented services written in different languages/frameworks. What's interesting is that each environment has this issue occur for different services, which points me to the fact that it's not the OTEL instrumentation that causes it but the connector itself. P.S.
Are there multiple instances of otel-collector pods running behind a service? Even though Prometheus is scraping from a single otel-collector port, the service could be load balancing across the otel-collector pods. These metrics are all held in memory on the otel-collector instance; there's no federation across otel-collector instances. I wouldn't expect the spanmetrics processor or connector to produce fluctuating metrics like that, especially when we can see a monotonically increasing pattern as in the screenshot above.
They are running on each Kubernetes node and exposed via
Okay, thanks for checking that. Is it possible to create a local reproducer through Docker containers? You could use https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor as a template.
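Not the linked template itself, but a hypothetical starting point for such a reproducer: two collector instances sharing one config, with Prometheus alongside; image tags, file names, and ports are all assumptions:

```yaml
# docker-compose.yml (sketch)
services:
  otel-collector-a:
    image: otel/opentelemetry-collector-contrib:0.92.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otelcol-config.yaml:/etc/otelcol/config.yaml
  otel-collector-b:
    image: otel/opentelemetry-collector-contrib:0.92.0
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otelcol-config.yaml:/etc/otelcol/config.yaml
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
```

Sending the same spans alternately to otel-collector-a and otel-collector-b should reproduce the interleaving counters described above.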
Hey, just an idea. Could we somehow make the
This is how I can reproduce the issue: app → agent collector → gateway collector. The spanmetrics connector is configured on the gateway collector. The agent collector is configured as follows:
The following steps are done:
After the restart, the span metrics start going "up and down" (red circle in the screenshot). Another hint: when the flag
As of version 0.92.0 there is a new configuration option, resource_metrics_key_attributes. Updating to 0.92.0 and configuring this option resolved the issue for me.
Out of curiosity, what value did you set for resource_metrics_key_attributes?
It is also mentioned in the example in the spanmetrics connector's README.
Thanks @rotscher, from what I see right now:

```yaml
resource_metrics_key_attributes:
  - service.name
  - telemetry.sdk.language
  - telemetry.sdk.name
```
I filed open-telemetry/opentelemetry.io/issues/4368 since this seems like a common point of confusion, and it can potentially happen with other components. We can consider adding a link to this docs section on the spanmetrics connector page when we fix that. My recommendation would be to use the resource detection processor or the k8sattributes processor to add an appropriate label.
If you want to do this, you can use the
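A minimal sketch of that recommendation using the resource detection processor; whether host.name (versus, say, k8s.pod.name from k8sattributes) is the right per-instance label is an assumption about the deployment, not part of the original comment:

```yaml
processors:
  resourcedetection:
    detectors: [env, system]   # the system detector sets host.name on the resource

connectors:
  spanmetrics:
    dimensions:
      - name: host.name        # a per-host label keeps each collector's series apart
```

As with the earlier sketch, resourcedetection has to run in the traces pipeline before the spans reach the spanmetrics connector.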
Hi, I am also having a similar issue: using span-metrics, it keeps giving wrong metrics. Some help would be appreciated!
Hi @harshraigroww, any update on this issue?
Hi @harshraigroww @aptomaKetil, as suggested, I upserted the collector hostname onto spans and used it as a dimension for spanmetrics, but the issue is not resolved. Also, as @rotscher suggested, I added resource_metrics_key_attributes, but the graph still has ups and downs and does not increase monotonically.
Component(s)
connector/spanmetrics
What happened?
Description
I am using the spanmetrics connector to generate metrics from spans. The calls metric is a counter, so its value should always increase, but I can see the graph of the metrics generated by this connector going up and down.
Steps to Reproduce
Use the spanmetrics connector, passing it as an exporter in the traces pipeline and receiving it in the metrics pipeline (a minimal sketch follows).
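A minimal sketch of that wiring (the debug exporter and the empty settings are placeholders for this sketch, not the reporter's actual configuration, which appears earlier in the thread):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  debug: {}              # placeholder backend for the sketch

connectors:
  spanmetrics: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, debug]   # the connector consumes traces as an exporter
    metrics:
      receivers: [spanmetrics]          # and produces metrics as a receiver
      exporters: [debug]
```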
Expected Result
calls and duration_count are counter metrics, so their values should always increase.
Actual Result
When a graph is plotted using these metrics, it goes up and down.
Collector version
0.75.0
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
Log output
No response
Additional context
No response