
[exporter/loadbalancing] Data loss occurring when change in downstream DNS records or pods #35378

Closed
jamesmoessis opened this issue Sep 24, 2024 · 10 comments · Fixed by #36094
Labels
bug, exporter/loadbalancing

Comments

@jamesmoessis (Contributor) commented Sep 24, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

We have used both the DNS resolver and the Kubernetes resolver, and both appear to cause data loss because they route based on IP addresses.

We are using the loadbalancing exporter for tail sampling, but this problem applies to this exporter in general.

When our downstream service scales or deploys, the IP addresses change. Since each sub-exporter is tied to an IP address, spans get trapped in that exporter's queue with no way out: the exporter is deleted along with the data that lives in its queue.

Every time we deploy, we see a burst of dropped spans as a result.

Steps to Reproduce

Set up the loadbalancing exporter with the DNS resolver, then change the IP address that the DNS record resolves to. You will see `otelcol_exporter_send_failed_spans` tick up.

Expected Result

Data is not dropped but re-balanced / successfully exported.

Actual Result

Data is lost along with the exporter that is deleted due to its IP address no longer being active.

Potential solutions

We are using a StatefulSet, so the DNS records / pod names are stable, i.e. `pod-0`, `pod-1`, ..., `pod-n`. If the exporters were attached to these names instead of the IP addresses, this would solve the issue. This may be possible using the pod hostname or an SRV record.
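For context, the stable per-pod names above come from pairing a StatefulSet with a headless Service: each pod then gets a stable DNS record of the form `<pod>.<service>.<namespace>`. The manifest below is an illustrative sketch only, using a Service named `stateful-sampler` in a `tailsampling` namespace to match the configuration in this issue; all names are assumptions.

apiVersion: v1
kind: Service
metadata:
  name: stateful-sampler          # hypothetical name, matching the resolver hostname in this issue
  namespace: tailsampling
spec:
  clusterIP: None                 # headless Service: DNS exposes per-pod records instead of a single VIP
  selector:
    app: stateful-sampler
  ports:
    - name: otlp-grpc
      port: 4317
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pod                       # yields pod-0, pod-1, ... (name chosen only to match this thread)
  namespace: tailsampling
spec:
  serviceName: stateful-sampler   # ties each pod's DNS name to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: stateful-sampler
  template:
    metadata:
      labels:
        app: stateful-sampler
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:0.109.0
          ports:
            - containerPort: 4317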

Another potential solution, which may or may not be possible, is to release the data trapped in the exporter queue, when an exporter is deleted.

Collector version

v0.109.0

Environment information

Environment

Kubernetes, arm64

OpenTelemetry Collector configuration

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 1000
    resolver:
      dns:
        hostname: stateful-sampler.tailsampling
        port: "4317"

Log output

No response

Additional context

No response

@jamesmoessis added the bug and needs triage labels Sep 24, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bvsvas commented Oct 5, 2024

We noticed a similar issue (#35512) for metrics; we are using this OTel load balancer since our custom exporter is stateful.

Do you see any errors in the loadbalancing exporter logs in your case? We only see 503s plus the `http_server_duration` and `refused_metric_point` metrics.

@jamesmoessis (Contributor, Author)

@bvsvas our primary signal is the `otelcol_exporter_send_failed_spans` metric. I don't think there's a 503, because we are using a headless service and the exporter is trying to send to an IP address that has nothing running on it.

We have temporarily solved the issue by falling back to the static resolver, but it means we have to hardcode all the hostnames:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 500
    resolver:
      static:
        hostnames:
          - pod-0.stateful-sampler.tailsampling:4317
          - pod-1.stateful-sampler.tailsampling:4317
          - pod-2.stateful-sampler.tailsampling:4317
          - ....

Would appreciate some thoughts from the maintainers. @jpkrohling @MovieStoreGuy, any ideas? I would be happy to take a closer look and contribute a fix; I'd just want some input from the code owners first.

@MovieStoreGuy (Contributor)

I'm not overly familiar with the loadbalancer code base, but it sounds like the component meant to provide discovery is either failing to refresh or is persisting old exporters.

I can take a look some time this week if no one else has the time.

@MovieStoreGuy (Contributor)

Okay, I think I have somewhat deduced what the issue is, but it is unclear what the best solution is going forward.

So in the event that a load-balanced exporter is removed from the internal ring, there is a chance that the following scenario has happened:

flowchart LR
  A[Prior Pipeline] --> B[Load Balance Exporter]
  B --> C[Exporter #0]
  B --> D[Exporter #1]
  C --> E(((Endpoint #0)))
  D --> F(((Endpoint #1)))

If Endpoint #1 is no longer viable, causing the internal resolver to remove that endpoint from the returned endpoints, there is a moment where the diagram looks like this:

flowchart LR
  A[Prior Pipeline] --> B[Load Balance Exporter]
  B --> C[Exporter #0]
  B --> D[Exporter #1]
  C --> E(((Endpoint #0)))
  D --> F@{ shape: cross-circ }

There is a moment where data is still being routed to Exporter #1 and, from the load balancer's perspective, this looks successful because some exporters have queueing functionality by default. But it means there are now potentially 1000 (I forget the default queue size) records queued inside that exporter which are lost once it is shut down due to the missing endpoint.

Now, this is the part where I am unsure of the best path forward, since each idea I currently have comes with some cons.

One potential solution is for exporters with a queue to expose a method that effectively allows work to be stolen from them. This would mean the queue interface would need some means of being exposed so its contents could be redistributed by the exporter.

Another solution would be for the loadbalancing exporter to implement its own queue system as a wrapper around the sub-exporters, effectively doing the same as above.

@jamesmoessis (Contributor, Author) commented Oct 29, 2024

Thanks for the explanation @MovieStoreGuy, this certainly makes sense and it's been our suspicion for a while that this is exactly what's happening. This architecture doesn't allow for auto scaling (or any change to downstream IP addresses) without loss of data in the exporter queue.

In my opinion, the way forward is to restructure the loadbalancing exporter so that it has its own queue, with each exporter acting as a consumer of that queue. The "worker" exporters should not have their own queues, as this is where the data is lost. One consumption from the queue would include routing a batch and successfully exporting all items in that batch; if it fails, the batch stays in the top-level queue.

Of course easier said than done, and I am by no means an expert on this code. I would be interested to see @jpkrohling's comment too.

@jpkrohling (Member)

The understanding of this issue is correct. Originally, the load balancing exporter was developed with the idea of doing an A query to obtain the backends for a specific service name. Once the IP became unavailable, data meant for it would start failing. In ideal scenarios, the IP is first removed from the DNS, the change is picked up by the load balancer, and the internal ring is changed to reassign the trace IDs for that IP to another IP. In cases where IPs simply disappear, this means that we'd have data loss.

Data loss was (is?) an acceptable compromise at the beginning for traces, as the assumption is that it would only happen in small quantities and, given enough traffic, similar traces would be available before or after the data loss event. This allowed us to have a load balancer that is horizontally scalable, without need for data sync among instances of the load balancer.

Note that the load balancer's main job is to balance the trace IDs among backends in a consistent manner, so that multiple instances of the load balancer can make the same decision without coordination between themselves (which is the consistent hashing part of the solution). So, if an IP is not available anymore, it should be removed from all load balancer instances at about the same time, otherwise spans from the same trace will be routed to different backing collectors.

@jamesmoessis (Contributor, Author)

@Fiery-Fenix thank you for your contribution!!

shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this issue Dec 5, 2024
…e and timeout settings (open-telemetry#36094)

#### Description
##### Problem statement
The `loadbalancing` exporter is actually a wrapper that creates and manages a set of actual `otlp` exporters.
Those `otlp` exporters technically share the same configuration parameters defined at the `loadbalancing` exporter level, including the `sending_queue` configuration. The only difference is the `endpoint` parameter, which is substituted by the `loadbalancing` exporter itself.
This means that the `sending_queue`, `retry_on_failure` and `timeout` settings can be defined only on the `otlp` sub-exporters, while the top-level `loadbalancing` exporter is missing all of those settings.
This configuration approach produces several issues that have already been reported by users:
* It is impossible to use a persistent queue in the `loadbalancing` exporter (see open-telemetry#16826). This happens because the `otlp` sub-exporters share the same configuration, including the queue configuration, i.e. they would all use the same `storage` instance at the same time, which is not possible at the moment.
* Data is lost even when a `sending_queue` is configured (see open-telemetry#35378). This happens because the queue is defined at the level of the `otlp` sub-exporters, and if such an exporter cannot flush data from its queue (for example, because the endpoint is no longer available) there is no option other than discarding the data, i.e. there is no higher-level queue or persistent storage the data can be returned to in case of a permanent failure.

There might be other potential issues already tracked that relate to the current configuration approach.

##### Proposed solution
The easiest way to solve the issues above is to use the standard approach for queue, retry and timeout configuration via `exporterhelper`.
This brings queue, retry and timeout functionality to the top level of the `loadbalancing` exporter, instead of the `otlp` sub-exporters.
In relation to the issues mentioned above, it brings:
* A single persistent queue, used by all `otlp` sub-exporters (not directly, of course)
* The queue is not discarded/destroyed if any (or all) of the endpoints become unreachable; the top-level queue keeps the data until new endpoints become available
* Scale-up and scale-down events for the next layer of OpenTelemetry Collectors in K8s environments become more predictable and no longer involve data loss (a potential fix for open-telemetry#33959). There is still a good chance of inconsistency when some data is sent to an incorrect endpoint, but that is already a better state than we have right now

##### Noticeable changes
* The top-level `loadbalancing` exporter now uses `exporterhelper` with all the functionality it supports
* `sending_queue` is automatically disabled on the `otlp` sub-exporters when it is already present on the top-level `loadbalancing` exporter. This change is made to prevent data loss on the `otlp` exporters, because a queue there does not provide the expected result. It also prevents potential misconfiguration on the user side and, as a result, irrelevant reported issues
* The `exporter` attribute for metrics generated by the `otlp` sub-exporters now includes the endpoint, for better visibility and to segregate them from the top-level `loadbalancing` exporter - was `"exporter": "loadbalancing"`, now `"exporter": "loadbalancing/127.0.0.1:4317"`
* Logs generated by the `otlp` sub-exporters now include an additional `endpoint` attribute with the endpoint value, for the same reasons as for metrics

#### Link to tracking issue
Fixes open-telemetry#35378
Fixes open-telemetry#16826

#### Testing
The proposed changes were heavily tested in a large K8s environment with a set of different scale-up/scale-down events using a persistent queue configuration - no data loss was detected, everything works as expected

#### Documentation
`README.md` was updated to reflect the new configuration parameters available. The sample `config.yaml` was updated as well
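For illustration, a configuration in the shape this PR describes might look like the sketch below. This is a hedged example based on the description above, not copied from the exporter's README; the endpoint, queue size, consumer count and the `file_storage` extension reference are placeholder assumptions.

exporters:
  loadbalancing:
    # Queue, retry and timeout handled at the top level via exporterhelper (per the description above)
    sending_queue:
      enabled: true
      queue_size: 1000
      num_consumers: 10           # placeholder; tune for throughput
      storage: file_storage       # optional persistent queue; assumes a file_storage extension is configured
    retry_on_failure:
      enabled: true
    timeout: 10s
    protocol:
      otlp:
        # Per the description above, any sending_queue configured here is
        # automatically disabled when the top-level queue is present
        tls:
          insecure: true
    resolver:
      dns:
        hostname: stateful-sampler.tailsampling
        port: "4317"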
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this issue Dec 6, 2024
@jamesmoessis (Contributor, Author) commented Dec 9, 2024

@jpkrohling @Fiery-Fenix @MovieStoreGuy I commented this on the PR, but since it's closed I'll reiterate: I don't think this is actually fixed.

#36094 (comment)

@Fiery-Fenix @MovieStoreGuy revisiting this, I think we actually do need a way of disabling the sub-exporter queue without blocking in series here.
It's not really solving the problem if the sub-exporter still has a queue, because that's where the data loss occurred. But if we disable the sub-exporter queue, it's extremely slow (exporting blocks in series for each endpoint).
So I think this part of the code needs more work. Perhaps we should reopen this issue or start a new one.

The PR unfortunately needs more work before it avoids data loss for us. We confirmed our production load cannot handle this, even with 500 consumers on the top-level queue. With the sub-exporter queue, data loss occurs on scaling; without it, performance problems occur which cause data loss anyway.

@jamesmoessis (Contributor, Author)

Will create a new issue for this

sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024