Timeout and retry of exporters is broken #4043

Open
LarsMichelsen opened this issue Jul 10, 2024 · 9 comments · May be fixed by #4183
Labels
bug Something isn't working

Comments

@LarsMichelsen

Describe your environment

OS: Ubuntu 24.04
Python version: Python 3.12
SDK version: 1.25.1
API version: 1.25.1

What happened?

I am piloting the OpenTelemetry Python implementation in order to integrate it into several Python applications. During my experiments I observed major issues when a non-reachable OTLP target is configured: during shutdown of the instrumented application, it seems to hang forever. Even if one might see this as an edge case, to me it is a blocker to adopting the library, since it affects the instrumented application quite negatively.

I tried to fix this locally, then researched here and came across many other reports (e.g. #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read through the code, bugs, and PRs and tried to understand everything, including the history.

I am now writing to open a discussion on how to approach this. I'll start with a rough conceptual proposal:

  1. The exporter uses self._timeout (based on OTEL_EXPORTER_OTLP_TIMEOUT, defaulting to 10 seconds) as the maximum time to spend on an export.
  2. The exponential backoff retry is kept, but stays within the bounds of self._timeout.
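To make the first point concrete, here is a minimal sketch of how the timeout could be read; `read_export_timeout` is a hypothetical helper, not actual SDK code, and it assumes OTEL_EXPORTER_OTLP_TIMEOUT holds a number of seconds:

```python
import os

# Hypothetical helper mirroring the proposal: read the exporter timeout
# from OTEL_EXPORTER_OTLP_TIMEOUT (in seconds), defaulting to 10.
def read_export_timeout(env=os.environ):
    raw = env.get("OTEL_EXPORTER_OTLP_TIMEOUT")
    if raw is None:
        return 10.0  # the spec default of 10 seconds
    return float(raw)
```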

On a more technical level:

  1. At the beginning of OTLPExporterMixin._export, a deadline is calculated which then serves as the basis for the rest of the function's execution.
  2. The time needed for each self._client.Export call is subtracted from the remaining deadline budget.
  3. As long as the deadline has not been reached, we retry with growing waits in between.
  4. When shutdown is called, the signal is stored in self._shutdown.
  5. Once the shutdown signal is set, we can either (a) continue the export until its deadline, including additional retries, and then shut down, or (b) interrupt the export the next time _export gets control (when it returns from self._client.Export, or by interrupting the retry sleep). I tend towards the second option. If there is no agreement on how to proceed, we could also make this configurable to give users more control depending on their use case.

This way, we would have a clear maximum time to shut down. What do you think?

I also propose changing the semantics of shutdown. Instead of waiting for permission to set the shutdown signal and then setting it, I would simplify it to just "set the shutdown signal". This would return immediately and not interrupt the export; it is entirely up to _export to pick up the signal as needed. The caller of shutdown would have to join the export thread anyway to be sure it shut down correctly. Do you think this is acceptable?
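These proposed semantics can be sketched in a few lines; `ExporterWorker` is a hypothetical name used only for illustration, not a class from the SDK:

```python
import threading

# Hypothetical sketch of the simplified shutdown semantics: shutdown()
# only sets the signal and returns immediately; callers join the export
# thread to confirm that the export loop actually exited.
class ExporterWorker:
    def __init__(self, run_export):
        self._shutdown = threading.Event()
        self._thread = threading.Thread(
            target=run_export, args=(self._shutdown,)
        )
        self._thread.start()

    def shutdown(self):
        # Only set the signal; never block waiting for the exporter.
        self._shutdown.set()

    def join(self, timeout=None):
        # Returns True once the export thread has really finished.
        self._thread.join(timeout)
        return not self._thread.is_alive()
```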

Steps to Reproduce

  1. Configure a non-reachable OTLP target
  2. Export a span
  3. Try to stop the application
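A minimal reproduction sketch of these steps, assuming the OTLP gRPC exporter packages (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-grpc) are installed and that nothing is listening on the endpoint below:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)

# Point the exporter at an address where no collector is running.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://127.0.0.1:4317"))
)

# Export a span.
with provider.get_tracer(__name__).start_as_current_span("demo"):
    pass

# With an unreachable collector, this shutdown hangs far longer than
# the 10 second export timeout would suggest.
provider.shutdown()
```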

Expected Result

The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.

Actual Result

A hanging application which does not shut down within a few seconds as before.

Additional context

No response

Would you like to implement a fix?

Yes

@LarsMichelsen LarsMichelsen added the bug Something isn't working label Jul 10, 2024
@xrmx
Contributor

xrmx commented Jul 11, 2024

We failed at reviewing #3764 because it's huge; if you think the code is fine, cutting it into smaller pieces could be a path to move forward. If not, please still try to work in smaller batches.

@LarsMichelsen
Author

This change looks like the most promising of the ones I've seen so far. Of course, it is far too big a chunk for readers to digest easily. I also haven't looked closely at the commit yet to check whether it fits my suggestion above; I will do that later.

Regardless of the PR, does my approach above sound like a viable path or are there any concerns?

@xrmx
Contributor

xrmx commented Jul 11, 2024

@LarsMichelsen I think that ensuring the retries happen inside OTEL_EXPORTER_OTLP_TIMEOUT is correct, so I agree with the conceptual proposal.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 19, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 26, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@LarsMichelsen
Author

I took a deep look into #3764 and tried to separate the individual changes and bring them into a meaningful order. If you are interested, you can find the current state of my attempt at https://github.com/LarsMichelsen/opentelemetry-python/tree/exporter_timeout .

While I think the approach of the PR is good in general, I changed some aspects of it. For example, I don't agree with the change picking the lowest of the given timeouts (see here).

Even with the change split into multiple steps, it's still a big undertaking, so I guess a reviewer will have to take some time to get into it. I hope someone can be found who feels up to it.

I'd also be fine with applying the changes separately, to shrink the set of pending changes step by step. I don't know how you generally deal with such changes; just let me know.

Overall, I must say I had a hard time with several tests. I think it would help to get rid of mocks and patches and make the code more easily testable, for example through more dependency injection. However, I held back on this for now, as the changes are already extensive enough.

Specifically, I am currently facing issues in the Windows test runs (https://github.com/LarsMichelsen/opentelemetry-python/actions/runs/10451069489). It is not clear to me whether this is a known general weakness or whether my changes broke something. Any guidance on this?

In my organization, this issue raises questions about whether the library is production-ready. So I see it as something that needs to be solved, one way or another, sooner rather than later. If we cannot solve it, it may even prevent adoption in our stack. I wonder whether this is simply less relevant in other environments, and how others deal with the current behavior?

@lzchen
Contributor

lzchen commented Aug 26, 2024

@LarsMichelsen

Thanks for looking into this.

I'd also be fine applying the changes separately to get the set of pending changes smaller step by step.

If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic, so anything that makes the PR more readable and easier to comprehend would be most welcome. Also, if you have any design or architectural decisions you would like to run by the community to get more eyes on them, feel free to join the weekly Python SIG (https://calendar.google.com/calendar/u/0/embed?src=c_2bf73e3b6b530da4babd444e72b76a6ad893a5c3f43cf40467abc7a9a897f977@group.calendar.google.com) and bring up this topic; feel free to add it to the agenda. We meet every Thursday at 9am PST.

@LarsMichelsen
Author

If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic so anything that can make the PR more readable and easier to comprehend would be most optimal.

Have you had a look at the first link I provided? It points to a branch where I split up the changes. Here is the list of current changes: main...LarsMichelsen:opentelemetry-python:exporter_timeout

In principle, is this in a form that could be acceptable, or do you see any obvious issues?

If there are none, I'd try to iron out the Windows-related issues reported by the CI and send a PR.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 29, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@lzchen
Contributor

lzchen commented Sep 4, 2024

I'd try to iron out the windows related issues reported by the CI and send a PR.

This would probably be preferred and would get more eyes on it.

@LarsMichelsen
Author

Unfortunately, I only get the error in the Windows jobs executed on the GitHub CI. Having no Windows system at hand right now, I am having a hard time figuring out what's going on. I'll have to get my hands on a Windows system to track this down; it's impossible to do via the CI, as the turnaround is simply too long.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Sep 7, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@LarsMichelsen LarsMichelsen linked a pull request Sep 8, 2024 that will close this issue
@LarsMichelsen
Author

LarsMichelsen commented Sep 8, 2024

I got my hands on a Windows system and debugged my way through it. It seems I had set timeouts that were too tight, so the tests could not finish successfully in all environments.

I just created the pull request #4183. Now I hope that someone has the time and patience to review these changes.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Sep 14, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
CheckmkCI pushed a commit to Checkmk/checkmk that referenced this issue Oct 6, 2024
It would be nice if we could enable tracing in all cases, but the current timeout and retry
logic of the OpenTelemetry SDK does not work as intended, which can lead to long-hanging
executions in case the configured collector (e.g. Jaeger) is not reachable.

For now only enable tracing in case the caller provides a tracing context. This way we produce
traces only when the caller is interested in them.

This condition can be removed once the timeout and retry logic is fixed
(open-telemetry/opentelemetry-python#4043).

Change-Id: Id9ce70cf3232a5e1bbe9497e8c30a568ac693b78