
graceful stop worker when max_requests/reload_on_* #2615

Closed · wants to merge 6 commits

Conversation

@methane (Contributor) commented Mar 8, 2024

The worker stops when it reaches max_requests or a reload_on_* limit.

uwsgi/core/utils.c, lines 1216 to 1251 at 39f3ade:

if (uwsgi.max_requests > 0 && uwsgi.workers[uwsgi.mywid].delta_requests >= (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta))
        && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
    goodbye_cruel_world("max requests reached (%llu >= %llu)",
            (unsigned long long) uwsgi.workers[uwsgi.mywid].delta_requests,
            (unsigned long long) (uwsgi.max_requests + ((uwsgi.mywid-1) * uwsgi.max_requests_delta))
    );
}
if (uwsgi.reload_on_as && (rlim_t) vsz >= uwsgi.reload_on_as && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
    goodbye_cruel_world("reload-on-as limit reached (%llu >= %llu)",
            (unsigned long long) (rlim_t) vsz,
            (unsigned long long) uwsgi.reload_on_as
    );
}
if (uwsgi.reload_on_rss && (rlim_t) rss >= uwsgi.reload_on_rss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
    goodbye_cruel_world("reload-on-rss limit reached (%llu >= %llu)",
            (unsigned long long) (rlim_t) rss,
            (unsigned long long) uwsgi.reload_on_rss
    );
}
#ifdef __linux__
if (uwsgi.reload_on_uss && (rlim_t) uss >= uwsgi.reload_on_uss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
    goodbye_cruel_world("reload-on-uss limit reached (%llu >= %llu)",
            (unsigned long long) (rlim_t) uss,
            (unsigned long long) uwsgi.reload_on_uss
    );
}
if (uwsgi.reload_on_pss && (rlim_t) pss >= uwsgi.reload_on_pss && (end_of_request - (uwsgi.workers[uwsgi.mywid].last_spawn * 1000000) >= uwsgi.min_worker_lifetime * 1000000)) {
    goodbye_cruel_world("reload-on-pss limit reached (%llu >= %llu)",
            (unsigned long long) (rlim_t) pss,
            (unsigned long long) uwsgi.reload_on_pss
    );
}

goodbye_cruel_world() is not graceful: it causes atexit handlers not to be called.
If an atexit handler is used to stop daemon threads, the worker won't stop until it is killed by the master.
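
For context, the general rule at play here (a minimal standalone C sketch, not the uWSGI code path itself): handlers registered with atexit(3) run on a normal exit(), but are skipped when the process is torn down abruptly, e.g. via _exit().

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void on_exit_handler(void) {
    /* Runs only on a normal exit() or return from main. */
    puts("atexit handler ran");
}

int main(int argc, char **argv) {
    atexit(on_exit_handler);
    if (argc > 1) {
        /* Abrupt teardown: skips all atexit handlers,
         * analogous to a non-graceful worker shutdown. */
        _exit(0);
    }
    /* Graceful path: atexit handlers run before termination. */
    exit(0);
}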

Reproduce

# mt_sample.py
import atexit, time, sys
import uwsgi

@atexit.register
def on_exit():
    print(f"on_exit: {uwsgi.worker_id()=}")
    sys.stdout.flush()

def application(env, start_response):
    time.sleep(1)
    start_response('200 OK', [('Content-Type','text/html')])
    return [b"Hello World"]
./uwsgi --wsgi-file=mt_sample.py --master --http-socket=:8000 --workers=1 --threads=80 --max-requests=20 --min-worker-lifetime=10 -L

I used hey to send requests, but any other load-testing tool is fine.

./hey -c 80 -z 1m  'http://127.0.0.1:8000/'

master branch

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 93920)
spawned uWSGI worker 1 (pid: 93921, cores: 80)
...The work of process 93921 is done (max requests reached (641 >= 20)). Seeya!
worker 1 killed successfully (pid: 93921)
Respawned uWSGI worker 1 (new pid: 94019)
...The work of process 94019 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94019)
Respawned uWSGI worker 1 (new pid: 94099)
...The work of process 94099 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94099)
Respawned uWSGI worker 1 (new pid: 94179)
...The work of process 94179 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94179)
Respawned uWSGI worker 1 (new pid: 94260)
...The work of process 94260 is done (max requests reached (721 >= 20)). Seeya!
worker 1 killed successfully (pid: 94260)
Respawned uWSGI worker 1 (new pid: 94340)

The atexit handler is not called.

After this patch

*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 94781)
spawned uWSGI worker 1 (pid: 94782, cores: 80)
...The work of process 94782 is done (max requests reached (402 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94782)
Respawned uWSGI worker 1 (new pid: 94880)
...The work of process 94880 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94880)
Respawned uWSGI worker 1 (new pid: 94960)
...The work of process 94960 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 94960)
Respawned uWSGI worker 1 (new pid: 95040)
...The work of process 95040 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95040)
Respawned uWSGI worker 1 (new pid: 95120)
...The work of process 95120 is done (max requests reached (721 >= 20)). Seeya!
on_exit: uwsgi.worker_id()=1
worker 1 killed successfully (pid: 95120)
Respawned uWSGI worker 1 (new pid: 95200)

The atexit handler is called.

Related issues

open-telemetry/opentelemetry-python#3640

@methane (Contributor, Author) commented Mar 8, 2024

This PR includes #2484.

@methane (Contributor, Author) commented Mar 10, 2024

opentelemetry uses atexit not only for stopping daemon threads, but also for flushing buffered metrics/logs/traces.
Please take this into consideration.

@xrmx (Collaborator) commented Mar 13, 2024

@methane thanks for the PR. Could you please add the reproducer to a file in tests/, with both the uwsgi command line to run it and the hey invocation as comments at the top?

Review comment on core/loop.c (outdated, resolved)
@methane requested a review from xrmx on March 19, 2024 06:29
@methane (Contributor, Author) commented Mar 19, 2024

Although this PR fixes the worker hanging, hey still reports EOF errors.
I believe pthread_cancel() is a really bad API and uwsgi must not use it.
May I remove the pthread_cancel() call in wait_for_threads()? This is the only caller of wait_for_threads().

If a thread takes too long to finish, the master process sends a KILL signal anyway.
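
A sketch of the cooperative-shutdown alternative suggested here (illustrative only, not the actual patch; the name stop_requested is made up): set a shared flag and pthread_join() the threads instead of cancelling them at an arbitrary point.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int stop_requested = 0;

static void *worker_thread(void *arg) {
    (void)arg;
    /* Cooperative shutdown: check the flag between units of work
     * instead of being cancelled mid-request by pthread_cancel(). */
    while (!atomic_load(&stop_requested)) {
        usleep(100 * 1000);  /* stand-in for serving one request */
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, worker_thread, NULL);

    /* Graceful stop: raise the flag and wait for the thread to
     * finish its current work, rather than pthread_cancel(tid). */
    atomic_store(&stop_requested, 1);
    pthread_join(tid, NULL);

    puts("all threads joined; safe to run atexit handlers and exit");
    return 0;
}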

@methane (Contributor, Author) commented Mar 19, 2024

For the record, atexit is not related at all; I can see many issues without atexit.
The root cause was pthread_cancel() called from subthreads.

@svanoort commented

@xrmx I know you're probably quite busy, but may I humbly request a re-review of this PR & merge/release if it looks good to you?

Unfortunately for us, the issue this PR fixes causes a hard failure in production when our services using uWSGI are upgraded to Python versions containing https://bugs.python.org/issue44434 -- so the upgrade from 3.9.7 to 3.9.8 is a hard breaking change (and potentially 3.10+ upgrades too), and we've been holding them back at 3.9.7 or lower. We're at the point where that upgrade is needed. I imagine we're probably not the only ones facing this situation, since several others have reported the threading issue as well.

And kudos to @methane for doing the deep investigation to provide this fix!

Thank you so much, to both of you!

@xrmx (Collaborator) commented Apr 6, 2024

Squashed and merged in #2626. Thanks!

@xrmx closed this Apr 6, 2024
xrmx pushed a commit to xrmx/uwsgi that referenced this pull request Apr 7, 2024
When working to reproduce unbit#2615 I saw many strange "defunct" (zombie)
workers.
The master called waitpid(-1, ...) but it returned 0 even though there were some
zombies.
Finally, the master sends the KILL signal (MERCY) and the worker is restarted.

I believe these strange zombies were born from pthread_cancel. Subthreads
call pthread_cancel() for the main thread, and it causes a strange process state.

pthread_cancel() is very hard to use and debug. I could not even attach to the
strange zombie with gdb --pid. I think it is not maintainable.

In the end we can remove six_feet_under_lock and make wait_for_threads()
static.
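
For reference, the waitpid(2) semantics involved (a small self-contained sketch, not the uWSGI master code): with WNOHANG, waitpid(-1, ...) returns 0 when children exist but none has changed state yet, so a non-blocking reaping loop has to treat 0, ECHILD, and EINTR differently.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    if (fork() == 0)
        _exit(0);          /* child exits immediately and becomes reapable */
    sleep(1);              /* give the child time to exit */
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {          /* reaped one child; there may be more */
            printf("reaped child %d\n", (int)pid);
        } else if (pid == 0) {  /* children exist but none has exited yet */
            break;
        } else if (errno == ECHILD) {  /* no children left to wait for */
            break;
        } else if (errno == EINTR) {   /* interrupted by a signal; retry */
            continue;
        } else {
            perror("waitpid");
            break;
        }
    }
    return 0;
}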
xrmx pushed a commit to xrmx/uwsgi that referenced this pull request Apr 7, 2024 (same commit message as above)
@methane deleted the worker-graceful-stop branch July 8, 2024 08:06