Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing #15989
Conversation
Force-pushed from 1e88cd4 to d7e7035.
We also had the same intermittent issues on our production servers. I wanted to help test this, so I deployed this PR, and the problem hasn't happened since. Previously, the scheduler would hang once or twice per day on average. It's been running fine for 48 hours now. I will update this comment if something happens, but you can assume that no news is good news. Thank you for figuring this out!
This one looks good. @ashb - maybe you want to take a look also with your multiprocessing experience.
Oh yes, will take a look.
Confirmed the test hung for me without this change 👍
Could you just address the few small niggles (mostly to minimize the change so it is easier to backport and easier to read).
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Force-pushed from d7e7035 to 048c493.
Thanks. These have been addressed.
Force-pushed from 048c493 to 7caba4f.
Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing (#15989), fixing #15938; full description below. (cherry picked from commit f75dd7a)
The test_send_tasks_to_celery_hang test hangs on self-hosted runners more often than not. It was introduced in #15989, and while it does not usually hang on regular GitHub runners or when run locally (I could not make it fail), it hangs almost always on self-hosted runners. Marking it as quarantined for now. Issue #16168 created to keep track of it.
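As a side note, quarantining a flaky test like this is typically done with a pytest marker. The sketch below assumes a `quarantined` marker name and elides the test body; it is illustrative only, not the actual commit.

```python
import pytest


# Assumed marker; quarantined tests can then be selected or excluded in CI
# (e.g. `pytest -m quarantined` or `pytest -m "not quarantined"`).
@pytest.mark.quarantined
def test_send_tasks_to_celery_hang():
    # Body elided; the real test repeatedly dispatches tasks to check that
    # the executor pool does not deadlock.
    ...
```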
Fixes #15938

multiprocessing.Pool is known to often become stuck, and it causes celery_executor to hang randomly. This happens at least on Debian and Ubuntu using Python 3.8.7 and Python 3.8.10. The issue is reproducible by running test_send_tasks_to_celery_hang from this PR several times (with the db backend set to something other than sqlite, because sqlite disables some parallelization). The issue goes away once we switch to concurrent.futures.ProcessPoolExecutor. In Python 3.6 and earlier, ProcessPoolExecutor has no initializer argument. Fortunately, it is not needed: reset_signals is no longer required because the signal handler now checks whether the current process is the parent.
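For illustration, here is a minimal sketch of the pattern described above: replace multiprocessing.Pool with concurrent.futures.ProcessPoolExecutor and guard the signal handler with a parent-PID check so that forked workers ignore the signal. The names send_tasks_to_celery, send_task_to_executor, _exit_gracefully, and _PARENT_PID are simplified stand-ins, not the actual Airflow code.

```python
import concurrent.futures
import os
import signal

# Hypothetical sketch: remember the parent PID at import time so the
# signal handler can tell whether it runs in the parent process or in a
# forked pool worker.
_PARENT_PID = os.getpid()


def _exit_gracefully(signum, frame):
    # Only the parent process reacts to the signal; forked workers
    # inherit this handler but treat the signal as a no-op.
    if os.getpid() == _PARENT_PID:
        raise SystemExit(signum)


signal.signal(signal.SIGTERM, _exit_gracefully)


def send_task_to_executor(task_tuple):
    # Stand-in for the real work of submitting one task to Celery.
    return task_tuple


def send_tasks_to_celery(task_tuples, sync_parallelism):
    # concurrent.futures.ProcessPoolExecutor replaces multiprocessing.Pool,
    # which could deadlock. No initializer is needed, because the PID check
    # above already keeps the handler inert in child processes.
    with concurrent.futures.ProcessPoolExecutor(max_workers=sync_parallelism) as pool:
        return list(pool.map(send_task_to_executor, task_tuples, chunksize=1))
```

With the fork start method used on Linux, workers inherit both the handler and _PARENT_PID, so the os.getpid() comparison is what makes a separate reset_signals initializer unnecessary in this sketch.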