Mark task as failed when it fails "sending" in Celery #10881

Merged
merged 1 commit into from
Sep 14, 2020

Conversation

ashb
Member

@ashb ashb commented Sep 11, 2020

If a task failed hard on Celery before it could execute any Airflow
code, the task would end up stuck in the queued state. This change
makes it get retried.

This was discovered while load testing the HA work (but is unrelated to
the HA changes): I swamped the kube-dns pod, so the worker was
sometimes unable to resolve the DB hostname via DNS, and the state in
the DB was never updated.
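
For readers skimming the thread, here is a minimal sketch of the idea described above. It is not the code in this PR: `reconcile`, its parameters, and the state constants are hypothetical names chosen for illustration, and the retry rule is a simplifying assumption. It only shows the shape of the fix: when the executor reports a failure but the DB row still says queued (because the worker died before it could write anything), the task is moved on instead of being left in queued.

```python
# Hypothetical state names for this sketch; not imported from Airflow.
QUEUED = "queued"
FAILED = "failed"
UP_FOR_RETRY = "up_for_retry"


def reconcile(db_state, executor_state, try_number, max_tries):
    """Return the state to store once the executor has reported a result.

    Assumption for this sketch: a task that is still QUEUED in the DB while
    the executor says FAILED never got far enough to update its own state,
    so the caller has to move it on explicitly.
    """
    if db_state == QUEUED and executor_state == FAILED:
        # Retry while attempts remain, otherwise give up and mark it failed.
        return UP_FOR_RETRY if try_number < max_tries else FAILED
    # In every other case the DB state is trusted as-is.
    return db_state


if __name__ == "__main__":
    # Worker crashed before touching the DB, first of three attempts.
    print(reconcile(QUEUED, FAILED, try_number=1, max_tries=3))
```

Without the first branch, the queued row would never change, which matches the stuck-in-queued symptom described above.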



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 11, 2020
@ashb ashb force-pushed the hard-celery-error-fails-task branch from 7f8dcfd to 4300c2b Compare September 11, 2020 13:55
@ashb ashb marked this pull request as draft September 11, 2020 14:12
@kaxil
Member

kaxil commented Sep 11, 2020

One test is failing

FAILED tests/executors/test_celery_executor.py::TestCeleryExecutor::test_error_sending_task

@turbaszek
Member

@olchas this sounds like it may solve the issue you observed when the cluster was scaling up

@ashb ashb force-pushed the hard-celery-error-fails-task branch from 4300c2b to fc119a0 Compare September 14, 2020 08:00
@ashb ashb force-pushed the hard-celery-error-fails-task branch from fc119a0 to 77cb73d Compare September 14, 2020 08:03
@ashb ashb marked this pull request as ready for review September 14, 2020 08:03
@ashb ashb merged commit 9e42a97 into apache:master Sep 14, 2020
@ashb ashb deleted the hard-celery-error-fails-task branch September 14, 2020 09:40
@mik-laj mik-laj added the AIP-15 label Sep 14, 2020