Infinite loop on scheduler when kubernetes state event is None along with state in database also None #35888
Closed
2 tasks done
Labels
area:core
kind:bug
This is a clearly a bug
needs-triage
label for new issues that we didn't triage yet
Apache Airflow version
2.7.3
What happened
We are facing an issue with the Kubernetes Executor where `process_watcher_task` gets a None state, which is pushed to `result_queue`. When the state is fetched from the queue in `kubernetes_executor.py`, it is passed to `_change_state`. If the state is None, the state is fetched from the database. When that is also None for some reason, `TaskInstanceState(state)` throws `ValueError`, which is caught by the generic exception handler, and the result is added to the queue again. This sends the scheduler into an infinite loop trying to set the state, and we need to restart the scheduler to make it run again. If the state from the database query is also None, then we shouldn't set the state, or we should catch `ValueError` explicitly instead of relying on generic exception handling, so that the same result is not pushed back to the queue for a retry. The validation was introduced by this change: 9556d6d#diff-11bb8713bf2f01502e66ffa91136f939cc8445839517187f818f044233414f7eR459
airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py
Lines 453 to 465 in 5d74ffb
airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Lines 379 to 393 in f3ddefc
airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Lines 478 to 485 in 5d74ffb
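A minimal sketch of the loop described above, to make the failure mode concrete. This is a simplified model and not the actual Airflow code: the `TaskInstanceState` enum here is a stand-in, and `drain`, `db_states`, `guard`, and `max_iterations` are hypothetical names introduced only for illustration. The iteration cap exists only so the sketch terminates; the real scheduler has no such cap.

```python
import enum
import queue


class TaskInstanceState(str, enum.Enum):
    """Stand-in enum; only the behaviour of looking up an invalid value matters."""

    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


def drain(result_queue, db_states, guard, max_iterations=10):
    """Process queued (key, state) results and return the iterations used.

    db_states is a hypothetical dict standing in for the database lookup.
    With guard=False the generic handler requeues the same result forever
    (bounded here by max_iterations). With guard=True a result whose state
    is None in both the event and the database is skipped.
    """
    iterations = 0
    while not result_queue.empty() and iterations < max_iterations:
        iterations += 1
        key, state = result_queue.get_nowait()
        try:
            if state is None:
                # Fall back to the state stored in the database.
                state = db_states.get(key)
                if state is None and guard:
                    # Proposed behaviour: nothing to set, so do not retry.
                    continue
                # Raises ValueError when state is None.
                state = TaskInstanceState(state)
        except Exception:
            # Generic handler: the same result goes back on the queue.
            result_queue.put((key, state))
    return iterations
```

Under these assumptions, a queue holding a single `("ti-1", None)` result with an empty `db_states` uses every allowed iteration when `guard=False` and still leaves the result on the queue, whereas `guard=True` consumes it in one pass. Catching `ValueError` separately and not requeueing would have the same effect.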
What you think should happen instead
The scheduler should not retry infinitely.
How to reproduce
We are not sure of the exact scenario in which this is reproducible. We tried running a task that triggers an event for which k8s returns a None state, which happens in rare cases when the pod is deleted or killed, and we also deleted the task instance to make sure the database query returns None as well. However, we are not able to consistently reach the case that causes this.
Operating System
Ubuntu
Versions of Apache Airflow Providers
No response
Deployment
Virtualenv installation
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct