
[exporter/loadbalancing] Data loss occurring when change in downstream DNS records or pods #35378

Closed
jamesmoessis opened this issue Sep 24, 2024 · 10 comments · Fixed by #36094
Labels
bug, exporter/loadbalancing

Comments

@jamesmoessis (Contributor) commented Sep 24, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

We have used both the DNS resolver and the Kubernetes resolver, and both appear to cause data loss because they route based on IP addresses.

We are using the loadbalancing exporter for tail sampling, but this problem applies to this exporter in general.

When our downstream service scales or deploys, the IP addresses change. Since each sub-exporter is tied to an IP address, spans get trapped in that exporter's queue with no way out: the exporter is deleted along with the data that lives in its queue.

Every time we deploy, we see a burst of dropped spans as a result.

Steps to Reproduce

Set up the loadbalancing exporter with the DNS resolver, then change the IP address that the DNS record resolves to. You will see `otelcol_exporter_send_failed_spans` tick up.

Expected Result

Data is not dropped but re-balanced / successfully exported.

Actual Result

Data is lost along with the exporter that is deleted due to its IP address no longer being active.

Potential solutions

We are using a StatefulSet, so the DNS records / pod names are stable, i.e. `pod-0`, `pod-1`, ..., `pod-n`. If the exporters were attached to these names instead of the IP addresses, this would solve the issue. This may be possible using the pod hostname or an SRV record.
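For context, the stable per-pod names above come from pairing a StatefulSet with a headless Service: each pod then gets a stable DNS record of the form `<pod>.<service>.<namespace>`. The manifest below is an illustrative sketch only, using a Service named `stateful-sampler` in a `tailsampling` namespace to match the configuration in this issue; all names are assumptions.

apiVersion: v1
kind: Service
metadata:
  name: stateful-sampler          # hypothetical name, matching the resolver hostname in this issue
  namespace: tailsampling
spec:
  clusterIP: None                 # headless Service: DNS exposes per-pod records instead of a single VIP
  selector:
    app: stateful-sampler
  ports:
    - name: otlp-grpc
      port: 4317
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pod                       # yields pod-0, pod-1, ... (name chosen only to match this thread)
  namespace: tailsampling
spec:
  serviceName: stateful-sampler   # ties each pod's DNS name to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: stateful-sampler
  template:
    metadata:
      labels:
        app: stateful-sampler
    spec:
      containers:
        - name: otelcol
          image: otel/opentelemetry-collector-contrib:0.109.0
          ports:
            - containerPort: 4317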

Another potential solution, which may or may not be possible, is to release the data trapped in the exporter queue, when an exporter is deleted.

Collector version

v0.109.0

Environment information

Environment

Kubernetes, arm64

OpenTelemetry Collector configuration

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 1000
    resolver:
      dns:
        hostname: stateful-sampler.tailsampling
        port: "4317"

Log output

No response

Additional context

No response

@jamesmoessis added the bug and needs triage labels Sep 24, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@bvsvas commented Oct 5, 2024

We noticed a similar issue (#35512) for metrics; we are using this OTel load balancer since our custom exporter is stateful.

Do you see any errors in the loadbalancing exporter logs in your case? We only see 503s plus the `http_server_duration` and `refused_metric_point` metrics.

@jamesmoessis (Contributor, Author)

@bvsvas our primary signal is the `otelcol_exporter_send_failed_spans` metric. I don't think there's a 503, because we are using a headless service and the exporter is trying to send to an IP address that has nothing running on it.

We have temporarily solved the issue by falling back to the static resolver, but it means we have to hardcode all the hostnames:

exporters:
  loadbalancing:
    protocol:
      otlp:
        sending_queue:
          queue_size: 500
    resolver:
      static:
        hostnames:
          - pod-0.stateful-sampler.tailsampling:4317
          - pod-1.stateful-sampler.tailsampling:4317
          - pod-2.stateful-sampler.tailsampling:4317
          - ....

Would appreciate some thoughts from the maintainers. @jpkrohling @MovieStoreGuy, any ideas? I would be happy to take a closer look and contribute a fix; I'd just want some input from the code owners first.

@MovieStoreGuy (Contributor)

I'm not overly familiar with the loadbalancer code base, but it sounds like the component meant to provide discovery is either failing to refresh or is persisting old exporters.

I can take a look some time this week if no one else has the time.

@MovieStoreGuy (Contributor)

Okay, I think I have somewhat deduced what the issue is, but it is unclear what the best solution is going forward.

So in the event that a load-balanced exporter is removed from the internal ring, there is a chance that the following scenario has happened:

flowchart LR
  A[Prior Pipeline] --> B[Load Balance Exporter]
  B --> C[Exporter #0]
  B --> D[Exporter #1]
  C --> E(((Endpoint #0)))
  D --> F(((Endpoint #1)))

If Endpoint #1 is no longer viable, causing the internal resolver to remove that endpoint from the returned endpoints, there is a moment where the diagram looks like this:

flowchart LR
  A[Prior Pipeline] --> B[Load Balance Exporter]
  B --> C[Exporter #0]
  B --> D[Exporter #1]
  C --> E(((Endpoint #0)))
  D --> F@{ shape: cross-circ }

There is a moment where data is still being routed to Exporter #1 and, from the load balancer's perspective, this looks successful because some exporters have queueing functionality by default. But it means there are now potentially 1000 (I forget the default queue size) records queued inside that exporter which are lost once it is shut down due to the missing endpoint.

Now, this is the part where I am unsure of the best path forward, since each idea I currently have comes with some cons.

One potential solution is for exporters with a queue to expose a method that effectively allows work to be stolen from them. This would mean the queue interface would need some means of being exposed so its contents could be redistributed by the exporter.

Another solution would be for the loadbalancing exporter to implement its own queue system as a wrapper around the sub-exporters, effectively doing the same as above.

@jamesmoessis (Contributor, Author) commented Oct 29, 2024

Thanks for the explanation @MovieStoreGuy, this certainly makes sense and it's been our suspicion for a while that this is exactly what's happening. This architecture doesn't allow for auto scaling (or any change to downstream IP addresses) without loss of data in the exporter queue.

In my opinion, the way forward is to restructure the loadbalancing exporter so that it has its own queue, with each exporter acting as a consumer of that queue. The "worker" exporters should not have their own queues, as this is where the data is lost. One consumption from the queue would include routing a batch and successfully exporting all items in that batch; if it fails, the batch stays in the top-level queue.

Of course easier said than done, and I am by no means an expert on this code. I would be interested to see @jpkrohling's comment too.

@jpkrohling (Member)

The understanding of this issue is correct. Originally, the load balancing exporter was developed with the idea of doing an A query to obtain the backends for a specific service name. Once the IP became unavailable, data meant for it would start failing. In ideal scenarios, the IP is first removed from the DNS, the change is picked up by the load balancer, and the internal ring is changed to reassign the trace IDs for that IP to another IP. In cases where IPs simply disappear, this means that we'd have data loss.

Data loss was (is?) an acceptable compromise at the beginning for traces, as the assumption is that it would only happen in small quantities and, given enough traffic, similar traces would be available before or after the data loss event. This allowed us to have a load balancer that is horizontally scalable, without need for data sync among instances of the load balancer.

Note that the load balancer's main job is to balance the trace IDs among backends in a consistent manner, so that multiple instances of the load balancer can make the same decision without coordination between themselves (which is the consistent hashing part of the solution). So, if an IP is not available anymore, it should be removed from all load balancer instances at about the same time, otherwise spans from the same trace will be routed to different backing collectors.

@jamesmoessis (Contributor, Author)

@Fiery-Fenix thank you for your contribution!!

shivanthzen pushed a commit to shivanthzen/opentelemetry-collector-contrib that referenced this issue Dec 5, 2024
…e and timeout settings (open-telemetry#36094)

#### Description
##### Problem statement
The `loadbalancing` exporter is actually a wrapper that creates and manages a set of actual `otlp` exporters.
Those `otlp` exporters technically share the same configuration parameters defined at the `loadbalancing` exporter level, including the `sending_queue` configuration. The only difference is the `endpoint` parameter, which is substituted by the `loadbalancing` exporter itself.
This means that the `sending_queue`, `retry_on_failure` and `timeout` settings can be defined only on the `otlp` sub-exporters, while the top-level `loadbalancing` exporter is missing all of those settings.
This configuration approach produces several issues that have already been reported by users:
* It is impossible to use a persistent queue in the `loadbalancing` exporter (see open-telemetry#16826). This happens because the `otlp` sub-exporters share the same configuration, including the queue configuration, i.e. they would all use the same `storage` instance at the same time, which is not possible at the moment.
* Data is lost even when a `sending_queue` is configured (see open-telemetry#35378). This happens because the queue is defined at the level of the `otlp` sub-exporters, and if such an exporter cannot flush data from its queue (for example, because the endpoint is no longer available) there is no option other than discarding the data, i.e. there is no higher-level queue or persistent storage the data can be returned to in case of a permanent failure.

There might be other potential issues already tracked that relate to the current configuration approach.

##### Proposed solution
The easiest way to solve the issues above is to use the standard approach for queue, retry and timeout configuration via `exporterhelper`.
This brings queue, retry and timeout functionality to the top level of the `loadbalancing` exporter, instead of the `otlp` sub-exporters.
In relation to the issues mentioned above, it brings:
* A single persistent queue, used by all `otlp` sub-exporters (not directly, of course)
* The queue is not discarded/destroyed if any (or all) of the endpoints become unreachable; the top-level queue keeps the data until new endpoints become available
* Scale-up and scale-down events for the next layer of OpenTelemetry Collectors in K8s environments become more predictable and no longer involve data loss (a potential fix for open-telemetry#33959). There is still a good chance of inconsistency when some data is sent to an incorrect endpoint, but that is already a better state than we have right now

##### Noticeable changes
* The top-level `loadbalancing` exporter now uses `exporterhelper` with all the functionality it supports
* `sending_queue` is automatically disabled on the `otlp` sub-exporters when it is already present on the top-level `loadbalancing` exporter. This change is made to prevent data loss on the `otlp` exporters, because a queue there does not provide the expected result. It also prevents potential misconfiguration on the user side and, as a result, irrelevant reported issues
* The `exporter` attribute for metrics generated by the `otlp` sub-exporters now includes the endpoint, for better visibility and to segregate them from the top-level `loadbalancing` exporter - was `"exporter": "loadbalancing"`, now `"exporter": "loadbalancing/127.0.0.1:4317"`
* Logs generated by the `otlp` sub-exporters now include an additional `endpoint` attribute with the endpoint value, for the same reasons as for metrics

#### Link to tracking issue
Fixes open-telemetry#35378
Fixes open-telemetry#16826

#### Testing
The proposed changes were heavily tested in a large K8s environment with a set of different scale-up/scale-down events using a persistent queue configuration - no data loss was detected, everything works as expected

#### Documentation
`README.md` was updated to reflect the new configuration parameters available. The sample `config.yaml` was updated as well
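For illustration, a configuration in the shape this PR describes might look like the sketch below. This is a hedged example based on the description above, not copied from the exporter's README; the endpoint, queue size, consumer count and the `file_storage` extension reference are placeholder assumptions.

exporters:
  loadbalancing:
    # Queue, retry and timeout handled at the top level via exporterhelper (per the description above)
    sending_queue:
      enabled: true
      queue_size: 1000
      num_consumers: 10           # placeholder; tune for throughput
      storage: file_storage       # optional persistent queue; assumes a file_storage extension is configured
    retry_on_failure:
      enabled: true
    timeout: 10s
    protocol:
      otlp:
        # Per the description above, any sending_queue configured here is
        # automatically disabled when the top-level queue is present
        tls:
          insecure: true
    resolver:
      dns:
        hostname: stateful-sampler.tailsampling
        port: "4317"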
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this issue Dec 6, 2024
@jamesmoessis (Contributor, Author) commented Dec 9, 2024

@jpkrohling @Fiery-Fenix @MovieStoreGuy I commented this on the PR, but since it's closed I'll reiterate: I don't think this is actually fixed.

#36094 (comment)

@Fiery-Fenix @MovieStoreGuy revisiting this, I think we actually do need a way of disabling the sub-exporter queue without blocking in series here.
It's not really solving the problem if the sub-exporter still has a queue, because that's where the data loss occurred. But if we disable the sub-exporter queue, it's extremely slow (exporting blocks in series for each endpoint).
So I think this part of the code needs more work. Perhaps we should reopen this issue or start a new one.

The PR unfortunately needs more work before it avoids data loss for us. We confirmed our production load cannot handle this, even with 500 consumers on the top-level queue. With the sub-exporter queue, data loss occurs on scaling; without it, performance problems occur which cause data loss anyway.

@jamesmoessis (Contributor, Author)

Will create a new issue for this

sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this issue Dec 17, 2024