Timeout and retry of exporters is broken #4043

Open
LarsMichelsen opened this issue Jul 10, 2024 · 9 comments · May be fixed by #4183
Labels
bug Something isn't working

Comments

@LarsMichelsen

Describe your environment

OS: Ubuntu 24.04
Python version: Python 3.12
SDK version: 1.25.1
API version: 1.25.1

What happened?

I am piloting the OpenTelemetry Python implementation in order to integrate it into several Python applications. During my experiments I observed major issues when a non-reachable OTLP target is configured: during shutdown of the instrumented application, it seems to hang forever. Even if one might see this as an edge case, to me it is a blocker to adopting the library, since it affects the instrumented application quite negatively.

I tried to fix this locally, then researched here and came across many other reports (e.g. #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read through the code, bugs, and PRs and tried to understand everything, including the history.

I am now writing to open a discussion on how to approach this. I'll start with a rough conceptual proposal:

  1. The exporter uses self._timeout (based on OTEL_EXPORTER_OTLP_TIMEOUT, defaulting to 10 seconds) as the maximum time to spend on an export.
  2. The exponential backoff retry is kept, but stays within the bounds of self._timeout.
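To make the first point concrete, here is a minimal sketch of how the timeout could be read; `read_export_timeout` is a hypothetical helper, not actual SDK code, and it assumes OTEL_EXPORTER_OTLP_TIMEOUT holds a number of seconds:

```python
import os

# Hypothetical helper mirroring the proposal: read the exporter timeout
# from OTEL_EXPORTER_OTLP_TIMEOUT (in seconds), defaulting to 10.
def read_export_timeout(env=os.environ):
    raw = env.get("OTEL_EXPORTER_OTLP_TIMEOUT")
    if raw is None:
        return 10.0  # the spec default of 10 seconds
    return float(raw)
```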

On a more technical level:

  1. At the beginning of OTLPExporterMixin._export, a deadline is calculated which then serves as the basis for the rest of the function's execution.
  2. The time needed for each self._client.Export call is subtracted from the remaining deadline budget.
  3. As long as the deadline has not been reached, we retry with growing waits in between.
  4. When shutdown is called, the signal is stored in self._shutdown.
  5. Once the shutdown signal is set, we can either (a) continue the export until its deadline, including additional retries, and then shut down, or (b) interrupt the export the next time _export gets control (when it returns from self._client.Export, or by interrupting the retry sleep). I tend towards the second option. If there is no agreement on how to proceed, we could also make this configurable to give users more control depending on their use case.

This way, we would have a clear maximum time to shut down. What do you think?

I also propose changing the semantics of shutdown. Instead of waiting for permission to set the shutdown signal and then setting it, I would simplify it to just "set the shutdown signal". This would return immediately and not interrupt the export; it is entirely up to _export to pick up the signal as needed. The caller of shutdown would have to join the export thread anyway to be sure it shut down correctly. Do you think this is acceptable?
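These proposed semantics can be sketched in a few lines; `ExporterWorker` is a hypothetical name used only for illustration, not a class from the SDK:

```python
import threading

# Hypothetical sketch of the simplified shutdown semantics: shutdown()
# only sets the signal and returns immediately; callers join the export
# thread to confirm that the export loop actually exited.
class ExporterWorker:
    def __init__(self, run_export):
        self._shutdown = threading.Event()
        self._thread = threading.Thread(
            target=run_export, args=(self._shutdown,)
        )
        self._thread.start()

    def shutdown(self):
        # Only set the signal; never block waiting for the exporter.
        self._shutdown.set()

    def join(self, timeout=None):
        # Returns True once the export thread has really finished.
        self._thread.join(timeout)
        return not self._thread.is_alive()
```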

Steps to Reproduce

  1. Configure a non-reachable OTLP target
  2. Export a span
  3. Try to stop the application
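A minimal reproduction sketch of these steps, assuming the OTLP gRPC exporter packages (opentelemetry-sdk, opentelemetry-exporter-otlp-proto-grpc) are installed and that nothing is listening on the endpoint below:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)

# Point the exporter at an address where no collector is running.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://127.0.0.1:4317"))
)

# Export a span.
with provider.get_tracer(__name__).start_as_current_span("demo"):
    pass

# With an unreachable collector, this shutdown hangs far longer than
# the 10 second export timeout would suggest.
provider.shutdown()
```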

Expected Result

The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.

Actual Result

A hanging application which does not shut down within a few seconds as before.

Additional context

No response

Would you like to implement a fix?

Yes

@LarsMichelsen LarsMichelsen added the bug Something isn't working label Jul 10, 2024
@xrmx
Contributor

xrmx commented Jul 11, 2024

We failed at reviewing #3764 because it's huge; if you think the code is fine, cutting it into smaller pieces could be a path to move forward. If not, please still try to work in smaller batches.

@LarsMichelsen
Author

This change looks like the most promising of the ones I've seen so far. Of course, it is far too big a chunk for readers to digest easily. I also haven't looked closely at the commit yet to check whether it fits my suggestion above; I will do that later.

Regardless of the PR, does my approach above sound like a viable path or are there any concerns?

@xrmx
Contributor

xrmx commented Jul 11, 2024

@LarsMichelsen I think that ensuring the retries happen inside OTEL_EXPORTER_OTLP_TIMEOUT is correct, so I agree with the conceptual proposal.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 19, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 26, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@LarsMichelsen
Author

I took a deep look into #3764 and tried to separate the individual changes and bring them into a meaningful order. If you are interested, you can find the current state of my attempt at https://github.com/LarsMichelsen/opentelemetry-python/tree/exporter_timeout .

While I think the approach of the PR is good in general, I changed some aspects of it. For example, I don't agree with the change picking the lowest of the given timeouts (see here).

Even with the change split into multiple steps, it's still a big undertaking, so I guess a reviewer will have to take some time to get into it. I hope someone can be found who feels up to it.

I'd also be fine with applying the changes separately, to shrink the set of pending changes step by step. I don't know how you generally deal with such changes; just let me know.

Overall, I must say I had a hard time with several tests. I think it would help to get rid of mocks and patches and make the code more easily testable, for example through more dependency injection. However, I held back on this for now, as the changes are already extensive enough.

Specifically, I am currently facing issues in the Windows test runs (https://github.com/LarsMichelsen/opentelemetry-python/actions/runs/10451069489). It is not clear to me whether this is a known general weakness or whether my changes broke something. Any guidance on this?

In my organization, this issue raises questions about whether the library is production-ready. So I see it as something that needs to be solved, one way or another, sooner rather than later. If we cannot solve it, it may even prevent adoption in our stack. I wonder whether this is simply less relevant in other environments, and how others deal with the current behavior?

@lzchen
Contributor

lzchen commented Aug 26, 2024

@LarsMichelsen

Thanks for looking into this.

I'd also be fine applying the changes separately to get the set of pending changes smaller step by step.

If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic, so anything that makes the PR more readable and easier to comprehend would be most welcome. Also, if you have any design or architectural decisions you would like to run by the community to get more eyes on them, feel free to join the weekly Python SIG (https://calendar.google.com/calendar/u/0/embed?src=c_2bf73e3b6b530da4babd444e72b76a6ad893a5c3f43cf40467abc7a9a897f977@group.calendar.google.com) and bring up this topic; feel free to add it to the agenda. We meet every Thursday at 9am PST.

@LarsMichelsen
Author

If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic so anything that can make the PR more readable and easier to comprehend would be most optimal.

Have you had a look at the first link I provided? It points to a branch where I split up the changes. Here is the list of current changes: main...LarsMichelsen:opentelemetry-python:exporter_timeout

In principle, is this in a form that could be acceptable, or do you see any obvious issues?

If there are none, I'd try to iron out the Windows-related issues reported by the CI and send a PR.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Aug 29, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@lzchen
Contributor

lzchen commented Sep 4, 2024

I'd try to iron out the windows related issues reported by the CI and send a PR.

This would probably be preferred and would get more eyes on it.

@LarsMichelsen
Author

Unfortunately, I only get the error in the Windows jobs executed on the GitHub CI. Having no Windows system at hand right now, I am having a hard time figuring out what's going on. I'll have to get my hands on a Windows system to track this down; it's impossible to do via the CI, as the turnaround is simply too long.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Sep 7, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
@LarsMichelsen LarsMichelsen linked a pull request Sep 8, 2024 that will close this issue
@LarsMichelsen
Author

LarsMichelsen commented Sep 8, 2024

I got my hands on a Windows system and debugged my way through it. It seems I had set timeouts that were too tight, so the tests could not finish successfully in all environments.

I just created the pull request #4183. Now I hope that someone has the time and patience to review these changes.

LarsMichelsen added a commit to LarsMichelsen/opentelemetry-python that referenced this issue Sep 14, 2024
This unifies the implementation of the OTLP exporters and the HTTP log
exporter.

Next step is to consolidate the remaining HTTP exporters.

Fixes open-telemetry#4043.
CheckmkCI pushed a commit to Checkmk/checkmk that referenced this issue Oct 6, 2024
It would be nice if we could enable tracing in all cases, but the current timeout and retry
logic of the OpenTelemetry SDK does not work as intended, which can lead to long-hanging
executions in case the configured collector (e.g. Jaeger) is not reachable.

For now only enable tracing in case the caller provides a tracing context. This way we produce
traces only when the caller is interested in them.

This condition can be removed once the timeout and retry logic is fixed
(open-telemetry/opentelemetry-python#4043).

Change-Id: Id9ce70cf3232a5e1bbe9497e8c30a568ac693b78