Mysterious hanging of the test_retry_handling_job for sqlite on self-hosted/local env #35204
Closed
Labels: kind:meta (High-level information important to the community)
Comments
potiuk added a commit to potiuk/airflow that referenced this issue on Oct 26, 2023:
This test started to fail ONLY on sqlite, and only on self-hosted runners and locally (not on public runners). We should uncomment it when we figure out what's going on. Related: apache#35204
potiuk added a commit that referenced this issue on Oct 26, 2023:
Same commit message as above, this time for the commit in apache/airflow (referencing #35204).
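The commit message suggests the test was simply commented out until the root cause is understood. An alternative way to park such a test - shown here only as a hedged sketch with a simplified signature, not the actual change - is pytest's skip marker pointing back at this issue:

```python
import pytest


# Sketch only: the real test is a method of the scheduler tests class and its
# signature is simplified here for illustration.
@pytest.mark.skip(reason="Hangs on sqlite on self-hosted/local envs - see #35204")
def test_retry_handling_job():
    ...
```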
potiuk added a commit to potiuk/airflow that referenced this issue on Oct 27, 2023:
Some of the scheduler tests tried to prevent the DAG processor from processing DAGs from the "tests/dags" directory by setting `processor_agent` to a Mock object:

```python
self.job_runner.processor_agent = mock.MagicMock()
```

This, in combination with the scheduler job tests cleaning all the tables and an approach similar to:

```python
dag = self.dagbag.get_dag("test_retry_handling_job")
dag_task1 = dag.get_task("test_retry_handling_op")
dag.clear()
dag.sync_to_db()
```

allowed the test to run in an isolated space where only one or a few DAGs were present in the DB.

This probably worked perfectly in the past, but after some changes in how the DAG file processor works it no longer prevented it from running: when the `_execute` method of `scheduler_job_runner` was executed and the standalone dag processor was not running, the `processor_agent` was overwritten by a new `DagFileProcessorAgent`:

```python
if not self._standalone_dag_processor:
    self.processor_agent = DagFileProcessorAgent(
        dag_directory=Path(self.subdir),
        max_runs=self.num_times_parse_dags,
        processor_timeout=processor_timeout,
        dag_ids=[],
        pickle_dags=pickle_dags,
        async_mode=async_mode,
    )
```

This led to a very subtle race condition that was more likely on machines with more cores and faster disks. For example it led to apache#35204, which appeared on self-hosted (8-core) runners and did not appear on public (2-core) runners, and it could appear on an 8-core ARM Mac but not on a 6-core Intel Mac (only on sqlite).

If the DAG file processor managed to start, spawn some parsing processes and grab DB write access for sqlite, and those processes managed to parse some of the DAG files from the tests/dags/ folder, those DAGs could have polluted the DAGs in the DB, leading to undesired effects - for example the test hanging while the scheduler job run attempted to process an unwanted subdag and got deadlocked, as in apache#35204.

The solution is to only set the `processor_agent` if it is not set already. This can only happen in unit tests, where the test sets `processor_agent` to a Mock object. In "production" the agent is only set once, in the `_execute` method, so there is no risk in checking whether it is already set.

Fixes: apache#35204
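The core of the fix described above is to guard that assignment so an agent already injected by a test (for example a `mock.MagicMock()`) is not silently replaced. A minimal, self-contained sketch of the pattern follows; the class below is a toy stand-in, not the real scheduler job runner:

```python
from unittest import mock


class SchedulerJobRunnerSketch:
    """Toy stand-in used only to illustrate the guarded assignment described above."""

    def __init__(self, standalone_dag_processor: bool = False):
        self._standalone_dag_processor = standalone_dag_processor
        self.processor_agent = None

    def _execute(self) -> None:
        # Only build a "real" agent when none was injected already, so a test's
        # MagicMock is no longer overwritten by a DagFileProcessorAgent.
        if not self._standalone_dag_processor and not self.processor_agent:
            self.processor_agent = object()  # stands in for DagFileProcessorAgent(...)


runner = SchedulerJobRunnerSketch()
runner.processor_agent = mock.MagicMock()  # what the scheduler tests do
runner._execute()
assert isinstance(runner.processor_agent, mock.MagicMock)  # the mock now survives
```

With such a guard in place, DAG files from tests/dags are no longer parsed behind the test's back, which is what was polluting the sqlite DB.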
potiuk added a commit that referenced this issue on Oct 27, 2023:
Same commit message as the one above, referencing #35204 directly. Fixes: #35204
ephraimbuddy pushed a commit that referenced this issue on Oct 29, 2023:
Same commit message as above. Fixes: #35204 (cherry picked from commit 6f3d294)
ephraimbuddy pushed a commit that referenced this issue on Oct 30, 2023:
Same commit message as above. Fixes: #35204 (cherry picked from commit 6f3d294)
Body
Problem
The test in question, `test_retry_handling_job`, started to time out - mysteriously - on October 18, 2023.
Successes in the recent past
The last time it is known to have succeeded was:
https://github.com/apache/airflow/actions/runs/6638965943/job/18039945807
In that run the test took just 2.77s.
Since then it has been consistently hanging in all runs on our self-hosted runners, while it consistently succeeds on public runners.
Reproducing locally
Reproducing is super easy with breeze:
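The exact Breeze command is not shown above. Purely as an illustrative sketch (the test module path below is an assumption), the test can also be triggered programmatically with pytest from inside a Breeze shell that uses the sqlite backend:

```python
# Illustrative only: run the (assumed) test module via pytest's Python API from
# inside a Breeze environment configured with the sqlite backend.
import pytest

exit_code = pytest.main(
    [
        "tests/jobs/test_scheduler_job.py",  # assumed location of test_retry_handling_job
        "-k",
        "test_retry_handling_job",
        "-v",
    ]
)
print("pytest exited with:", exit_code)
```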
Pressing Ctrl-C (i.e. sending SIGINT to all processes in the group) "unhangs" the test and it then succeeds quickly (????).
What's so strange
It is super-mysterious:
Do this (020691f is the image used in
It looks like there is something very strange going on with the environment of the test - something is apparently triggering a very nasty race condition (kernel version? That is the only idea I have) that is not yet present on public runners.