Mark task as failed when it fails "sending" in Celery #10881

ashb · 2020-09-11T13:52:46Z

If a task failed hard on Celery before it was able to execute any
Airflow code, the task would end up stuck in the queued state. This
change marks such a task as failed so that it gets retried.
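A minimal, self-contained sketch of the idea, not the actual diff in this PR: when handing a task to the broker raises, record a failed state for that task instead of leaving it queued forever. Everything here (`SketchExecutor`, `submit`, `in_flight`, `flaky_send`, and the `event_buffer` layout) is a hypothetical stand-in for illustration and does not use the real Airflow or Celery APIs.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict

log = logging.getLogger(__name__)

FAILED = "failed"  # hypothetical state constant, stands in for a task state


@dataclass
class SketchExecutor:
    """Toy executor: `send` hands a task key to a broker and returns a handle."""

    send: Callable[[str], str]
    # Terminal states reported back to the scheduler (assumed layout).
    event_buffer: Dict[str, str] = field(default_factory=dict)
    # Tasks that were sent successfully and are awaiting a result.
    in_flight: Dict[str, str] = field(default_factory=dict)

    def submit(self, key: str) -> None:
        try:
            handle = self.send(key)
        except Exception as exc:
            # Sending failed before any task code could run. Report the task
            # as failed so the scheduler can apply its normal retry logic,
            # rather than leaving the task stuck in the queued state.
            log.error("Error sending task %s: %s", key, exc)
            self.event_buffer[key] = FAILED
            return
        self.in_flight[key] = handle


def flaky_send(key: str) -> str:
    """Hypothetical broker call that fails for one particular key."""
    if key == "bad":
        raise ConnectionError("broker unreachable")
    return f"handle-{key}"


executor = SketchExecutor(send=flaky_send)
executor.submit("good")
executor.submit("bad")
```

In this sketch the failed send surfaces through the same channel as any other terminal state, so whatever consumes `event_buffer` needs no special case for it.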

This was discovered while load testing the HA work (though it is
unrelated to the HA changes): I swamped the kube-dns pod, so the worker
was sometimes unable to resolve the db name via DNS, and the state in
the DB was never updated.

kaxil · 2020-09-11T15:50:38Z

One test is failing

FAILED tests/executors/test_celery_executor.py::TestCeleryExecutor::test_error_sending_task

turbaszek · 2020-09-11T17:52:41Z

@olchas this sounds like it may solve the issue you observed when the cluster was scaling up

airflow/executors/celery_executor.py

ashb requested review from potiuk, kaxil and mik-laj September 11, 2020 13:52

boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Sep 11, 2020

ashb force-pushed the hard-celery-error-fails-task branch from 7f8dcfd to 4300c2b Compare September 11, 2020 13:55

ashb marked this pull request as draft September 11, 2020 14:12

kaxil approved these changes Sep 11, 2020

turbaszek reviewed Sep 11, 2020

airflow/executors/celery_executor.py (outdated)

turbaszek approved these changes Sep 11, 2020

ashb force-pushed the hard-celery-error-fails-task branch from 4300c2b to fc119a0 Compare September 14, 2020 08:00

ashb force-pushed the hard-celery-error-fails-task branch from fc119a0 to 77cb73d Compare September 14, 2020 08:03

ashb marked this pull request as ready for review September 14, 2020 08:03

ashb merged commit 9e42a97 into apache:master Sep 14, 2020

ashb deleted the hard-celery-error-fails-task branch September 14, 2020 09:40

mik-laj added the AIP-15 label Sep 14, 2020

ashb mentioned this pull request Sep 16, 2020

Officially support HA for scheduler component (AIP-15) #9630

Closed

10 tasks
