
Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing #15989

Merged: yuqian90 merged 6 commits into apache:master from the celery_executor_stuck branch on May 29, 2021

Conversation

yuqian90 (Contributor)

Fixes #15938

multiprocessing.Pool is known to become stuck intermittently, which causes celery_executor to hang randomly. This happens at least on Debian and Ubuntu with Python 3.8.7 and Python 3.8.10. The issue is reproducible by running test_send_tasks_to_celery_hang from this PR several times (with the db backend set to something other than sqlite, because sqlite disables some parallelization).

The issue goes away once the code is switched to concurrent.futures.ProcessPoolExecutor. In Python 3.6 and earlier, ProcessPoolExecutor has no initializer argument, but fortunately one is no longer needed: reset_signals can be dropped because the signal handler now checks whether the current process is the parent.
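For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the approach described above. It is not the actual Airflow code: the names (`PARENT_PID`, `_exit_gracefully`, `send_task`) are illustrative, and it assumes the default `fork` start method on Linux, matching the Debian/Ubuntu environments mentioned.

```python
import os
import signal
from concurrent.futures import ProcessPoolExecutor

# Sketch only (not Airflow's code): the parent records its own PID, and the
# SIGTERM handler returns early inside forked children, so no per-worker
# reset_signals initializer is required.
PARENT_PID = os.getpid()


def _exit_gracefully(signum, frame):
    # Forked workers inherit this handler; they simply ignore the signal
    # and let the parent process drive the shutdown.
    if os.getpid() != PARENT_PID:
        return
    print(f"parent {PARENT_PID} received signal {signum}, shutting down")


signal.signal(signal.SIGTERM, _exit_gracefully)


def send_task(task):
    # Stand-in for the real per-task work (e.g. sending the task to Celery).
    return task * 2


if __name__ == "__main__":
    # concurrent.futures.ProcessPoolExecutor replaces multiprocessing.Pool;
    # no initializer argument is passed, which also keeps Python 3.6 support.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(send_task, range(10)))
    print(results)
```

Because forked workers inherit the parent's signal handlers, the early-return PID check is enough to keep children from acting on the signal, which is why a separate per-worker initializer to reset signals is no longer needed.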

yuqian90 force-pushed the celery_executor_stuck branch from 1e88cd4 to d7e7035 on May 22, 2021, 11:45
yuqian90 added this to the Airflow 2.1.1 milestone on May 22, 2021
davidcaron (Contributor)

We had the same intermittent issue on our production servers. I wanted to help test this, so I deployed this PR, and the problem hasn't happened since.

Previously, the scheduler would hang once or twice per day on average. It's been running fine for 48 hours now. I will update this comment if something happens, but you can assume that no news is good news.

Thank you for figuring this out!

potiuk (Member) left a comment

This one looks good. @ashb - maybe you want to take a look also with your multiprocessing experience.

scripts/ci/docker-compose/base.yml (review thread resolved)
ashb (Member)

ashb commented May 27, 2021

Oh yes, will take a look.

ashb (Member) left a comment

Confirmed the test hung for me without this change 👍

airflow/jobs/scheduler_job.py (review thread resolved, outdated)
tests/executors/test_celery_executor.py (review thread resolved, outdated)
ashb (Member)

ashb commented May 27, 2021

Could you just address the few small niggles (mostly to minimize the change so it is easier to backport, and easier to read)?

github-actions bot added the "full tests needed" label (We need to run full set of tests for this PR to merge) on May 27, 2021
github-actions (bot)

The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

yuqian90 force-pushed the celery_executor_stuck branch from d7e7035 to 048c493 on May 28, 2021, 07:22
yuqian90 (Contributor, Author)

> Could you just address the few small niggles (mostly to minimize the change so it is easier to backport, and easier to read)?

Thanks. These have been addressed.

yuqian90 force-pushed the celery_executor_stuck branch from 048c493 to 7caba4f on May 28, 2021, 13:59
yuqian90 merged commit f75dd7a into apache:master on May 29, 2021
yuqian90 added a commit that referenced this pull request on May 29, 2021
Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing (#15989)

Fixes #15938

(cherry picked from commit f75dd7a)
potiuk added a commit to potiuk/airflow that referenced this pull request on May 30, 2021
The test_send_tasks_to_celery_hang test hangs on self-hosted runners more often than not.

It was introduced in apache#15989, and while it does not usually hang on regular GitHub runners or when run locally (I could not make it fail), it hangs almost always when run on self-hosted runners.

Marking it as quarantined for now.

Issue apache#16168 created to keep track of it.
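As an aside, "quarantining" a test in the Airflow test suite is done with a pytest marker; the sketch below shows what that might look like. The marker name and the empty test body are assumptions for illustration, not code taken from the actual commit.

```python
import pytest


# Assumed marker name based on Airflow's convention for flaky tests;
# quarantined tests are excluded from the default CI test selection.
@pytest.mark.quarantined
def test_send_tasks_to_celery_hang():
    # Placeholder body: the real test repeatedly sends tasks through the
    # Celery executor's process pool to try to reproduce the hang.
    ...
```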
potiuk added a commit that referenced this pull request on May 30, 2021
jhtimmins pushed a commit to astronomer/airflow that referenced this pull request on Jun 3, 2021
Fix Celery executor getting stuck randomly because of reset_signals in multiprocessing (apache#15989)

Fixes apache#15938

(cherry picked from commit f75dd7a)
Labels
area:dev-tools, area:Scheduler, full tests needed
Development

Successfully merging this pull request may close these issues.

celery_executor becomes stuck if child process receives signal before reset_signals is called
4 participants