
uwsgi + flask, got Respawned uWSGI worker, very frequently #3640

Closed
Liubey opened this issue Jan 22, 2024 · 32 comments
Labels
bug Something isn't working

Comments

Liubey commented Jan 22, 2024

Describe your environment
Describe any aspect of your environment relevant to the problem, including your Python version, platform, version numbers of installed dependencies, information about your cloud hosting provider, etc. If you're reporting a problem with a specific version of a library in this repo, please check whether the problem has been fixed on main.

Steps to reproduce
I use code like this example: https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/fork-process-model/flask-uwsgi/app.py
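
For reference, a minimal sketch along the lines of that example (endpoint defaults, route name, and structure are assumptions, not copied from the linked file):

from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Set up a provider that exports spans in batches over gRPC (localhost:4317 assumed).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(insecure=True)))
trace.set_tracer_provider(provider)

application = Flask(__name__)
tracer = trace.get_tracer(__name__)

@application.route("/")
def index():
    # One span per request; the BatchSpanProcessor flushes it from a background thread.
    with tracer.start_as_current_span("index"):
        return "ok"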

What is the expected behavior?
Spans should be reported correctly.
What is the actual behavior?
The log shows lines like the following:

..The work of process 534 is done. Seeya!
Mon Jan 22 17:17:43 2024 - worker 4 (pid: 449) is taking too much time to die...NO MERCY !!!
worker 4 killed successfully (pid: 449)
Respawned uWSGI worker 4 (new pid: 568)

This happens very frequently, which can't be normal, but uwsgi provides no further logs.
The config looks like this:

master = true
enable-threads = True
harakiri-graceful-timeout = 5
threads = 2
processes = 4

Additional context
Add any other context about the problem here.

Liubey added the bug label on Jan 22, 2024
Liubey (Author) commented Jan 22, 2024

Update: with threads = 1 everything is OK, but with threads = 2 the ninth request hangs until the worker is respawned.

Liubey (Author) commented Jan 24, 2024

Same code, but when I use thrift.JaegerExporter (replacing OTLPSpanExporter(endpoint="http://localhost:6831", insecure=True)), everything is OK.
It's weird.
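
For reference, the swap described here would look roughly like this (a sketch assuming the opentelemetry-exporter-jaeger-thrift package; 6831 is the Jaeger agent's UDP port):

from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to the local Jaeger agent over UDP/thrift instead of over a gRPC channel.
jaeger_exporter = JaegerExporter(agent_host_name="localhost", agent_port=6831)
span_processor = BatchSpanProcessor(jaeger_exporter)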

lzchen (Contributor) commented Jan 25, 2024

@Liubey

What commands are you running to run the example? Is there a specific process runner you are using?

Liubey (Author) commented Jan 26, 2024

@lzchen
uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4
It's weird because replacing OTLPSpanExporter with JaegerExporter works fine. I don't know why OTLPSpanExporter doesn't work under uwsgi with multiple threads.

methane (Contributor) commented Feb 15, 2024

Try --lazy-apps
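
For context, --lazy-apps makes uwsgi load the application in each worker after forking instead of once in the master; with the command used earlier in this thread that would look like this (a sketch):

uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4 --lazy-apps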

Liubey (Author) commented Mar 4, 2024

It's not easy to change the base config of a mature business project 😂

methane (Contributor) commented Mar 4, 2024

Is it harder than making the whole of otel-python[-contrib] fork-safe?

Liubey (Author) commented Mar 6, 2024

😭 You are right. I will try to research it and drive the change.

methane (Contributor) commented Mar 6, 2024

Liubey (Author) commented Mar 6, 2024

@methane Oh, thanks for the additional information.
But the strange thing is that I have two projects; one uses django + uwsgi + OTLPSpanExporter and works fine without lazy-apps set.

methane (Contributor) commented Mar 6, 2024

lazy-apps is the most straightforward option.
But there are other ways to avoid forking after the app is loaded, for example initializing tracing in a post-fork hook; see the sketch below.
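
A minimal sketch of the post-fork approach (assuming the uwsgidecorators module that ships with the uwsgi Python plugin; exporter settings are placeholders). Tracing is initialized in each worker after the fork, so the BatchSpanProcessor thread and the gRPC channel are never inherited from the master:

from uwsgidecorators import postfork
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

@postfork
def init_tracing():
    # Runs in every worker right after fork, so the exporter and its
    # background machinery are created in the child process, not the master.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(insecure=True)))
    trace.set_tracer_provider(provider)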

methane (Contributor) commented Mar 6, 2024

Liubey (Author) commented Mar 7, 2024

@methane Yes, I saw that before. In my uwsgi + flask project I tested with @postfork, and it still doesn't work; as I said, I get "Respawned uWSGI worker" very frequently.

methane (Contributor) commented Mar 7, 2024

I use the code like:https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/fork-process-model/flask-uwsgi/app.py

"Like"? Did you reproduce it with exactly that code?

Unless there are reproducible steps, any effort to investigate will be in vain.

Liubey (Author) commented Mar 7, 2024

@methane Yes. When I use the config shown in the screenshot below, I get the same thing after the server has been up for a few minutes before receiving requests.
[screenshot of the uwsgi config]
The key parts are the max-requests and threads settings.

Liubey (Author) commented Mar 7, 2024

Maybe the key setting is max-requests; when I remove it, everything looks normal.

Liubey (Author) commented Mar 7, 2024

When I set max-requests=50 and processes=4, requests 201~204 hang for 50+ seconds
[screenshot of the hanging requests]
and I get "Respawned uWSGI worker" for all 4 processes.
[screenshot of the respawn log]

Liubey (Author) commented Mar 7, 2024

@jpkrohling @methane @lzchen Do you have any ideas about this? Sorry to disturb you.

Liubey (Author) commented Mar 7, 2024

The command is:
uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4 --max-requests 50

methane (Contributor) commented Mar 7, 2024

OK. I got it.

  • BatchSpanProcessor uses a daemon thread, so BatchSpanProcessor.shutdown() is needed to stop it.

  • TracerProvider uses atexit to call shutdown(), and its shutdown() calls BatchSpanProcessor.shutdown():

if shutdown_on_exit:
    self._atexit_handler = atexit.register(self.shutdown)

  • But uwsgi may not invoke atexit, so the daemon thread is not stopped and the Python process keeps running.

unbit/uwsgi#2063
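
As a workaround sketch (an assumption on my part, not something proposed in this thread), the provider shutdown could be hooked into uwsgi's own atexit hook from the uwsgi Python API instead of relying on the interpreter's atexit:

import uwsgi  # only available when running under uwsgi
from opentelemetry import trace

def _flush_telemetry():
    # Shut down the provider explicitly; this stops the BatchSpanProcessor
    # and flushes any spans still queued for export.
    trace.get_tracer_provider().shutdown()

# uwsgi's exit hook; whether it fires on worker recycling depends on the
# uwsgi behavior discussed below, so treat this as a best-effort measure.
uwsgi.atexit = _flush_telemetry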

Liubey (Author) commented Mar 8, 2024

@methane Thank you very much for your careful analysis. This is unbelievable.
Is there an easy way to bypass the bug, such as patching the OTel BatchSpanProcessor thread to daemon=False?

Liubey (Author) commented Mar 8, 2024

@methane Oh, sorry, I should mention: when I use JaegerExporter I also use BatchSpanProcessor, so I think the key is the OTLP exporter's shutdown or export?

Liubey (Author) commented Mar 8, 2024

Tested with http.OTLPSpanExporter: it looks normal, not hanging, just this:
[screenshot of the log]

methane (Contributor) commented Mar 8, 2024

I can reproduce it with http.OTLPSpanExporter.
I haven't tested JaegerExporter yet.

The grpc exporter's shutdown is not related because it doesn't use a daemon thread.

Liubey (Author) commented Mar 8, 2024

@methane
Maybe the key thing is grpc, because thrift.JaegerExporter is OK while grpc.JaegerExporter is not.

I found some discussion in the grpc repo, see grpc/grpc#23796 (comment).
He said:
numbsafari commented on Feb 2, 2022
Hi, 2022 checking in. It appears that gRPC doesn’t work with any kind of pre-fork (eg uWSGI) model system unless you specify the environment variables noted above.

Perhaps a solution would be to add some documentation?
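
For reference, the variables that gRPC thread refers to are, to the best of my knowledge, GRPC_ENABLE_FORK_SUPPORT and GRPC_POLL_STRATEGY; under uwsgi they could be set with --env so they are in place before grpc is imported (a sketch, not verified against that thread):

uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4 --env GRPC_ENABLE_FORK_SUPPORT=true --env GRPC_POLL_STRATEGY=poll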

methane (Contributor) commented Mar 8, 2024

That is a separate issue: it is a fork-safety problem, and --lazy-apps (recommended) or @postfork can avoid it.

methane (Contributor) commented Mar 10, 2024

I made a pull request to uwsgi that fixes the issue of atexit not being called when mixing threads and max-requests.

Liubey (Author) commented Mar 12, 2024

@methane Thank you so much. I hope uwsgi merges the request quickly. 😂

methane (Contributor) commented Mar 19, 2024

UPDATE: This is not caused by BatchSpanProcessor. Python doesn't wait for daemon threads at all. I'm sorry about that.

This is caused by uwsgi. uwsgi uses pthread_cancel() to stop a worker thread even for graceful shutdown, which is a really bad idea: pthread_cancel() is almost impossible to use safely, and I don't think either uwsgi or Python is cancel-safe.

For the record, I can reproduce it easily even when I remove all opentelemetry imports.

Liubey (Author) commented Mar 21, 2024

@methane Thank you for your comment.
But what I want to understand is: if threads is set to 1, this case doesn't seem to happen, even with max-requests set.

For example:

uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 1 --processes 4 --max-requests 50

stops the worker very quickly.
[screenshot of the worker recycling log]

methane (Contributor) commented Mar 21, 2024

When threads=1, there is only the main thread. When the main thread reaches max-requests, it tries to stop the other threads, but there are none, so pthread_cancel() is not called and there is no problem.

I guess the hang happens when a worker thread other than the main thread (core_id=0) reaches max-requests and calls pthread_cancel() on the main thread, but I am not sure. Threads are hard, and pthread_cancel() is even harder; sometimes I cannot even attach to the broken process.

To fix it, please try unbit/uwsgi#2615 and unbit/uwsgi#2619.

Anyway, I can reproduce it without any opentelemetry imports, so this is not an issue in otel.
I think this issue should be closed.

Liubey (Author) commented Mar 21, 2024

Thank you for the additional information. I will close this issue. Once again, I appreciate your help with this issue, and I wish you thread-safety forever. Haha 😝

Liubey closed this as completed on Mar 21, 2024
xrmx pushed a commit to xrmx/uwsgi that referenced this issue Apr 6, 2024
The worker stops when it reaches max_requests or a reload_on_* condition.

https://github.com/unbit/uwsgi/blob/39f3ade88c88693f643e70ecf6c36f9b375f00a2/core/utils.c#L1216-L1251

`goodbye_cruel_world()` is not graceful; it causes `atexit` handlers not to be called.
If an atexit handler is what stops daemon threads, the worker won't stop until it is killed by the master.

Using a reproducer similar to tests/threads_atexit.py:

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 93920)
spawned uWSGI worker 1 (pid: 93921, cores: 80)
...The work of process 93921 is done (max requests reached (641 >= 20)). Seeya!
worker 1 killed successfully (pid: 93921)
Respawned uWSGI worker 1 (new pid: 94019)
...The work of process 94019 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94019)
Respawned uWSGI worker 1 (new pid: 94099)
...The work of process 94099 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94099)
Respawned uWSGI worker 1 (new pid: 94179)
...The work of process 94179 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94179)
Respawned uWSGI worker 1 (new pid: 94260)
...The work of process 94260 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94260)
Respawned uWSGI worker 1 (new pid: 94340)

atexit is not called.

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 94781)
spawned uWSGI worker 1 (pid: 94782, cores: 80)
...The work of process 94782 is done (max requests reached (402 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94782)
Respawned uWSGI worker 1 (new pid: 94880)
...The work of process 94880 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94880)
Respawned uWSGI worker 1 (new pid: 94960)
...The work of process 94960 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94960)
Respawned uWSGI worker 1 (new pid: 95040)
...The work of process 95040 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95040)
Respawned uWSGI worker 1 (new pid: 95120)
...The work of process 95120 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95120)
Respawned uWSGI worker 1 (new pid: 95200)

atexit is called

Related issue:

open-telemetry/opentelemetry-python#3640
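
A minimal sketch of what such a reproducer might look like (the print format mirrors the "on_exit: uwsgi.worker_id()=..." lines in the log above; the rest is an assumption):

import atexit
import uwsgi  # provided by the uwsgi Python plugin at runtime

def on_exit():
    # With the fix, this runs when a worker is recycled after max-requests.
    print("on_exit: uwsgi.worker_id()=%d" % uwsgi.worker_id())

atexit.register(on_exit)

def application(environ, start_response):
    # Trivial WSGI app so requests can drive the worker toward max-requests.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]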
xrmx pushed a commit to xrmx/uwsgi that referenced this issue Apr 7, 2024 (same commit message as above)