
prometheusremotewrite emits noisy errors on empty data points #4972

Closed
nicks opened this issue Dec 1, 2020 · 18 comments

Comments

nicks commented Dec 1, 2020

Describe the bug
Here's the error message:

2020-12-01T00:03:54.421Z	error	exporterhelper/queued_retry.go:226	Exporting failed. The error is not retryable. Dropping data.	{"component_kind": "exporter", "component_type": "prometheusremotewrite", "component_name": "prometheusremotewrite", "error": "Permanent error: [Permanent error: nil data point. image_build_count is dropped; Permanent error: nil data point. image_build_duration_dist is dropped]", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/circleci/project/exporter/exporterhelper/queued_retry.go:226
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	/home/circleci/project/exporter/exporterhelper/metricshelper.go:115
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/home/circleci/project/exporter/exporterhelper/queued_retry.go:128
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
	/home/circleci/go/pkg/mod/github.com/jaegertracing/jaeger@v1.21.0/pkg/queue/bounded_queue.go:77

Here's the metric descriptor emitted by logging exporter for the same metric:

Metric #0
Descriptor:
     -> Name: image_build_count
     -> Description: Image build count
     -> Unit: ms
     -> DataType: IntSum
     -> IsMonotonic: true
     -> AggregationTemporality: AGGREGATION_TEMPORALITY_CUMULATIVE

I don't know enough about the contracts here to know if this is a bug in the opencensus code that I'm using to send the metric, or in the batcher, or in the prometheusremotewrite exporter, or something else entirely.

What did you expect to see?
No error messages

What did you see instead?
An error message

What version did you use?
Docker image: otel/opentelemetry-collector:0.15.0

What config did you use?

    extensions:
      health_check:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679
    
    receivers:
      opencensus:
        endpoint: "0.0.0.0:55678"
    
    processors:
      memory_limiter:
        check_interval: 5s
        limit_mib: 4000
        spike_limit_mib: 500
      batch:
    
    exporters:
      logging:
        loglevel: debug
      prometheusremotewrite:
        endpoint: "http://.../api/v1/prom/write?db=tilt"
        insecure: true
    
    service:
      extensions: [health_check, pprof, zpages]
      pipelines:
        metrics:
          receivers: [opencensus]
          exporters: [logging, prometheusremotewrite]


@bogdandrutu bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Aug 30, 2021
@alolita alolita added the ci-cd CI, CD, testing, build issues label Sep 2, 2021
github-actions bot commented Nov 4, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions github-actions bot added the Stale label Nov 4, 2022
@atoulme atoulme added exporter/prometheusremotewrite and removed ci-cd CI, CD, testing, build issues labels Mar 11, 2023
@github-actions
Pinging code owners for exporter/prometheusremotewrite: @Aneurysm9. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot removed the Stale label Mar 12, 2023

kovrus commented Mar 13, 2023

The part about the noisy error message should be fixed now; each metric type now has a check to handle the case of empty data points.
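
For reference, here is a minimal sketch of the kind of per-type check being described, not the exporter's actual code, and assuming the collector's current pdata metric API:

    // Sketch only: report whether a metric carries at least one data point,
    // so empty metrics can be skipped instead of becoming a permanent error.
    package prwsketch

    import "go.opentelemetry.io/collector/pdata/pmetric"

    func hasDataPoints(m pmetric.Metric) bool {
        switch m.Type() {
        case pmetric.MetricTypeGauge:
            return m.Gauge().DataPoints().Len() > 0
        case pmetric.MetricTypeSum:
            return m.Sum().DataPoints().Len() > 0
        case pmetric.MetricTypeHistogram:
            return m.Histogram().DataPoints().Len() > 0
        case pmetric.MetricTypeExponentialHistogram:
            return m.ExponentialHistogram().DataPoints().Len() > 0
        case pmetric.MetricTypeSummary:
            return m.Summary().DataPoints().Len() > 0
        default:
            return false // MetricTypeEmpty: nothing to export
        }
    }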


@github-actions github-actions bot added the Stale label May 15, 2023
kovrus commented May 15, 2023

@nicks do you still have this issue?

nicks commented May 15, 2023

nope, let's close it!

@nicks nicks closed this as completed May 15, 2023
gogreen53 commented Jun 27, 2023

I think there may have been a regression: I'm seeing this behavior and it makes the logs completely unusable for debugging. Version: opentelemetry-collector-contrib:0.75.0

2023-06-27T17:55:17.713Z	error	exporterhelper/queued_retry.go:401	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. xxxx_download_size is dropped", "dropped_items": 16}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/queued_retry.go:401
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/metrics.go:136
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector/exporter@v0.75.0/exporterhelper/internal/bounded_memory_queue.go:58

Boeller666 commented Jul 5, 2023

Updated to the latest version of the opentelemetry-operator (0.33.0), but same here:

2023-07-05T08:38:26.666Z	error	exporterhelper/queued_retry.go:391	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. XXX is dropped", "dropped_items": 38}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/queued_retry.go:391
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/metrics.go:125
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/queued_retry.go:195
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector/exporter@v0.80.0/exporterhelper/internal/bounded_memory_queue.go:47


@github-actions github-actions bot added the Stale label Oct 25, 2023
cyberw commented Nov 26, 2023

@crlnz’s PR looked like a nice fix/workaround, but it was closed.

We’re still experiencing this issue (on v0.83, but we'll try to update to v0.89 and see if that helps). @Aneurysm9 @rapphil

2023-11-24T14:25:25.375+0100 error exporterhelper/queued_retry.go:391 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/mimir", "error": "Permanent error: empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped; empty data points. xxx.socket.io.push.size is dropped; empty data points. xxx.socket.io.push.elapsed is dropped", "dropped_items": 12077}

@github-actions github-actions bot removed the Stale label Nov 27, 2023
crlnz commented Nov 28, 2023

@cyberw We're currently in the process of completing the EasyCLA internally, so this will be re-opened eventually. It's taking a little longer because we need to poke around regarding this issue. If anyone that has already signed the EasyCLA would like to take ownership of these changes, please feel free to fork my changes and open a new PR.


@github-actions github-actions bot added the Stale label Jan 29, 2024
@esuh-descript
We are still running into this issue on 0.95.0:

2024-02-21T23:08:53.528Z	error	exporterhelper/queued_retry.go:391	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: empty data points. [metric_name] is dropped", "dropped_items": 19}

@github-actions github-actions bot removed the Stale label Feb 22, 2024
alxbl commented Apr 11, 2024

Hello, I've done some investigation on my end because this issue is still affecting us in 0.97 as well. I think the main problem is that there is an inconsistency in what receivers push into the pipeline and what (some) exporters expect.

In our case, the issue is happening with the windowsperfcounters receiver, because it works by pre-allocating all metrics in the OTLP object before attempting to scrape. If it fails to scrape (usually because it couldn't open the counter in the first place), it does not remove the metric from the OTLP message, but it also does not add any data points. The final message is then pushed down the pipeline, processed, and when prometheusremotewrite finally receives it, it loops through the metrics and complains about every metric without data points.

I haven't checked other receivers, but any receiver that does something like this will cause prometheusremotewrite to log that error.
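
For illustration, a tiny sketch of the pattern described above (hypothetical metric name, current pdata API assumed): the descriptor is created up front, but a failed scrape leaves it with zero data points, which is exactly what prometheusremotewrite rejects.

    package main

    import (
        "fmt"

        "go.opentelemetry.io/collector/pdata/pmetric"
    )

    func main() {
        // The receiver pre-allocates the Metric before scraping.
        md := pmetric.NewMetrics()
        m := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Metrics().AppendEmpty()
        m.SetName("some_perf_counter") // hypothetical name
        sum := m.SetEmptySum()         // descriptor exists...

        // ...but if the scrape fails, no data points are ever appended,
        // and the batch goes downstream looking like this:
        fmt.Println(sum.DataPoints().Len()) // prints 0
    }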

There are a few ways that I can think of for fixing this:

  1. Receivers should not be allowed to send metrics with empty points down the pipeline (Not sure how this plays with the spec though)
  2. There could be a processor that drops empty metrics? (maybe transform/filter can already do this?)
  3. Exporters should silently ignore empty metrics that are not a direct result of their manipulation of the data

I personally feel like option 1 is the best, as it reduces unnecessary processing/exporting work. Option 2 might be a decent temporary workaround, and option 3 seems like it could lead down a path of undetected data loss.

In the broader sense, there is also the problem of an OTLP client pushing a message with empty metrics. In that case it's not clear whether the OTLP receiver should reject the message as malformed or drop the empty metrics before pushing them into the pipeline. (I haven't checked, but if this is already specified, then receivers should probably implement a similar behavior.)

In my specific case of the Windows Perf Counters, the error is already being logged by the receiver (once) at startup, and then results in one error from prometheusremotewrite per scrape cycle.


My plan is to open a PR that fixes the scrape behavior of windowsperfcounters and link it to this issue, but it will not be a silver bullet. I think the community/maintainers will need to decide how we want to handle empty points in general for this to be fully fixed.


@github-actions github-actions bot added the Stale label Jun 11, 2024
alxbl commented Jun 12, 2024

/label -Stale

The PR for fixing windowsperfcounters (#32384) is pending review/merge, but this issue still needs to be addressed on a receiver-by-receiver basis, unless prometheusremotewrite decides that empty datapoints are not a log-worthy error (maybe a counter instead?)

@github-actions github-actions bot removed the Stale label Jun 13, 2024
dmitryax pushed a commit that referenced this issue Jul 15, 2024
…s. (#32384)

**Description:** When scraping Windows Performance Counters, it's
possible that some counter objects do not exist. When that is the case,
`windowsperfcounters` will still create the `Metric` object with no
datapoints in it. Some exporters throw errors when encountering this.
The fix proposed in this PR does an extra pass after all metrics have
been scraped and removes the `Metric` objects for which no datapoints
were scraped.

**Link to tracking Issue:** #4972 

**Testing:** 
- Confirmed that `debug` exporter sees `ResourceMetrics` with no metrics
and doesn't throw.
- Confirmed that `prometheusremotewrite` exporter no longer complains
about empty datapoints and that it skips the export when no metrics are
available
- ~No unit tests added for now. I will add a unit test once I have confirmation that this is the right way to remove empty datapoints~ Added a unit test covering the changes and enabled fixture validation, which was not previously implemented.
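
For illustration, here is a minimal sketch of the kind of post-scrape cleanup pass the commit message describes; the function name and placement are assumptions, not the PR's actual code, and the current pdata API is assumed.

    package scrapersketch

    import "go.opentelemetry.io/collector/pdata/pmetric"

    // dropEmptyMetrics walks the scraped batch and removes every Metric that
    // ended up with no data points, so nothing empty reaches exporters.
    func dropEmptyMetrics(md pmetric.Metrics) {
        rms := md.ResourceMetrics()
        for i := 0; i < rms.Len(); i++ {
            sms := rms.At(i).ScopeMetrics()
            for j := 0; j < sms.Len(); j++ {
                sms.At(j).Metrics().RemoveIf(func(m pmetric.Metric) bool {
                    switch m.Type() {
                    case pmetric.MetricTypeGauge:
                        return m.Gauge().DataPoints().Len() == 0
                    case pmetric.MetricTypeSum:
                        return m.Sum().DataPoints().Len() == 0
                    case pmetric.MetricTypeHistogram:
                        return m.Histogram().DataPoints().Len() == 0
                    case pmetric.MetricTypeExponentialHistogram:
                        return m.ExponentialHistogram().DataPoints().Len() == 0
                    case pmetric.MetricTypeSummary:
                        return m.Summary().DataPoints().Len() == 0
                    default:
                        return true // MetricTypeEmpty: nothing to export
                    }
                })
            }
        }
    }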

@github-actions github-actions bot added the Stale label Aug 13, 2024
This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Oct 12, 2024