-
Notifications
You must be signed in to change notification settings - Fork 625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timeout and retry of exporters is broken #4043
Comments
We failed at reviewing #3764 because it's huge, if you think the code is fine cutting it in more smaller pieces could be a path to move forward. If not still please still try to work in smaller batches. |
This change looks like the most promising of the ones I've seen so far. Of course, this is far too big a chunk for readers to understand easily. I also haven't had a closer look at the commit yet to understand if it fits my suggestion above. I will do that later. Regardless of the PR, does my approach above sound like a viable path or are there any concerns? |
@LarsMichelsen I think that assuring the retries should happen inside the |
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
I took a deep look into #3764, tried to separate the individual changes and bring them into a meaningful order. If you are interested, you can find the current state of my try in https://github.com/LarsMichelsen/opentelemetry-python/tree/exporter_timeout . While I think the approach of the PR in general is good, I changed some aspects of the change. For example, I don't agree with the change picking the lowest of the given timeouts (see here). Having the change split into multiple steps, it's still a big undertaking. So I guess a reviewer will have to take some time to get into it. I hope that someone will be found who feels up to it. I'd also be fine applying the changes separately to get the set of pending changes smaller step by step. Don't know how you generally deal with such changes. Just let me know. Overall, I must say that I had a hard time with several tests. I think it would help to get rid of mocks and patches and make the code more easily testable, for example by using more dependency injection. However, I preferred to hold back at this point, as the changes are already extensive enough. Specifically I am currently facing issues on Windows test runs (https://github.com/LarsMichelsen/opentelemetry-python/actions/runs/10451069489). To me it is not clear whether this a known general weakness or my changes broke something. Any guidance on this? In my organization this issue raises questions whether the library is production ready. So I see this as something that in the some or the other way needs to be solved sooner than later. If we can not solve it, it may even prevent adoption in our stack. I wonder if this is not that relevant to other environments and how others deal with the current behavior? |
Thanks for looking into this.
If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic so anything that can make the PR more readable and easier to comprehend would be most optimal. Also, if you have any design/architectural decisions you would like to run by the community to get more eyes on, feel free to join the weekly Python SIG (https://calendar.google.com/calendar/u/0/embed?src=c_2bf73e3b6b530da4babd444e72b76a6ad893a5c3f43cf40467abc7a9a897f977@group.calendar.google.com) and bring up this topic. Feel free to add your topic to the agenda. We meet every Thursday 9am PST. |
Have you had a look at the first link I provided? It points to a branch where I split the changes. Here is the list of curent changes: main...LarsMichelsen:opentelemetry-python:exporter_timeout In principle, is this in a form which could be acceptable or do you see any obvious issues? If not, I'd try to iron out the windows related issues reported by the CI and send a PR. |
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
This would probably be preferred and would get more eyes on. |
Unfortunately I only get the error reported by the Windows jobs executed on the Github CI. Having no Windows system at hand right now, I am having a hard time figuring out whats going on. I'll have to get my hands on some Windows system to track this down. Impossible to do it via the CI, the turnaround is simply too long. |
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
Got my hands on a Windows system and debugged my way through it. Seems I had too tight timeouts set so that it could not finish successfully in all environments. I just created the pull request #4183. Now I hope that someone has the time and patience to review these changes. |
This unifies the implementation of the OTLP exporters and the HTTP log exporter. Next step is to consolidate the remaining HTTP exporters. Fixes open-telemetry#4043.
Would be nice if we could enable tracing in all cases. But the current timeout and retry logic of the Opentelemetry SDK does not work as intended, which can lead to longer hanging executions in case the configured collector (e.g. Jaeger) is not reachable. For now only enable tracing in case the caller provides a tracing context. This way we produce traces only when the caller is interested in them. This condition can be removed once the timeout and retry logic was fixed (open-telemetry/opentelemetry-python#4043). Change-Id: Id9ce70cf3232a5e1bbe9497e8c30a568ac693b78
Describe your environment
OS: Ubuntu 24.04
Python version: Python 3.12
SDK version: 1.25.1
API version: 1.25.1
What happened?
I am piloting the opentelemetry Python implementation in order to integrate it in several Python applications. During my experiments I observed major issues when a non reachable OTLP target is configured. During shutdown of the instrumented application, it seems to hang forever. Even if one might see this as edge case, to me this is really a blocker to start using the library as it affects the instrumented application in a quite negative way.
I tried to fix this locally, then researched here and came across many other reports (e.g #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read into the code, bugs and PRS and, tried to understand everything, including the history.
I am now writing to open the discussion how to approach this. I'll start with a rough conceptual proposal:
self._timeout
(based onOTEL_EXPORTER_OTLP_TIMEOUT
, defaults to 10 sec) as maximum time to spend on an export.self._timeout
.On a more technical level:
OTLPExporterMixin._export
a deadline is calculated which is then the base for the further execution of this function.self._client.Export
call is subtracted from the deadline.shutdown
is called, the signal is stored inself._shutdown
._export
gets control over it (when it returns fromself._client.Export
or we interrupt the retry sleep). I tend to the 2nd option. If there is no agreement on how to proceed with this, we could also make it configurable to give users more control depending on their use case.This way, we would have a clear maximum time to shutdown. What do you think?
I also propose to change the semantic of
shutdown
. Instead of waiting for getting acceptance to set the shutdown signal and then setting the shutdown signal, I'd rather simplify it to just "set the shutdown signal". This would return immediately and not interrupt the export. It's totally up to_export
to pick up the signal as needed. The caller of theshutdown
function would have to join the thread anyways to be sure the export thread correctly shut down. Do you think this is acceptable?Steps to Reproduce
Expected Result
The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.
Actual Result
A hanging application which would not shut down in a few seconds as before.
Additional context
No response
Would you like to implement a fix?
Yes
The text was updated successfully, but these errors were encountered: