Mysterious hanging of the test_retry_handling_job for sqlite on self-hosted/local env #35204

Closed
1 task done
potiuk opened this issue Oct 26, 2023 · 0 comments · Fixed by #35221
Labels
kind:meta High-level information important to the community

Comments


potiuk commented Oct 26, 2023

Body

Problem

The test in question:

tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job

It started to time out - mysteriously - on October 18, 2023:

  • only on our self-hosted runner instances (not on public runners)
  • only for sqlite, not for Postgres / MySQL
  • for local execution on Linux it can be reproduced as well, only with sqlite
  • for local execution on MacOS it can be reproduced as well, only with sqlite

Successes in the recent past

The last time it is known to have succeeded was:
https://github.com/apache/airflow/actions/runs/6638965943/job/18039945807

That test took just 2.77s:

2.77s call     tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job

Since then it has been consistently hanging for all runs on our self-hosted runners, while it consistently succeeds on public runners.

Reproducing locally

Reproducing is super easy with breeze:

pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init

Pressing Ctrl-C (i.e. sending SIGINT to all processes in the group) "unhangs" the test and it then succeeds quickly (????)

What's so strange

It is super-mysterious:

  • There does not seem to be any significant difference in the dependencies. There are a few dependencies being upgraded in main - but going back to the versions they came from does not change anything:
--- /files/constraints-3.8/original-constraints-3.8.txt	2023-10-26 11:32:47.167610348 +0000
+++ /files/constraints-3.8/constraints-3.8.txt	2023-10-26 11:32:48.763610466 +0000
@@ -184 +184 @@
-asttokens==2.4.0
+asttokens==2.4.1
@@ -249 +249 @@
-confluent-kafka==2.2.0
+confluent-kafka==2.3.0
@@ -352 +352 @@
-greenlet==3.0.0
+greenlet==3.0.1
@@ -510 +510 @@
-pyOpenSSL==23.2.0
+pyOpenSSL==23.3.0
@@ -619 +619 @@
-spython==0.3.0
+spython==0.3.1
@@ -687 +687 @@
-yandexcloud==0.238.0
+yandexcloud==0.240.0
  • Even going back to the very same image that was used in the job that succeeded does not fix the problem. It still hangs.

To do this (020691f is the image used in the job that succeeded):

docker pull ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e
docker tag ghcr.io/apache/airflow/main/ci/python3.8:020691f5cc0935af91a09b052de6122073518b4e ghcr.io/apache/airflow/main/ci/python3.8:latest
breeze
pytest tests/jobs/test_scheduler_job.py::TestSchedulerJob::test_retry_handling_job -s --with-db-init

It looks like there is something very strange going on with the environment of the test - something is apparently triggering a very nasty race condition (kernel version? - this is the only idea I have) that is not yet present on public runners.

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@potiuk potiuk added the kind:meta High-level information important to the community label Oct 26, 2023
potiuk added a commit to potiuk/airflow that referenced this issue Oct 26, 2023
This test started to fail ONLY on sqlite, but only on self-hosted runners
and locally (not on public runners). We should uncomment it when we
figure out what's going on.

Related: apache#35204
potiuk added a commit that referenced this issue Oct 26, 2023
potiuk added a commit to potiuk/airflow that referenced this issue Oct 27, 2023
Some of the scheduler tests tried to prevent the DAG processor from processing DAGs
from the "tests/dags" directory by setting processor_agent to a Mock object:

```python
   self.job_runner.processor_agent = mock.MagicMock()
```

This, in connection with the scheduler job cleaning all the tables and
an approach similar to:

```python
        dag = self.dagbag.get_dag("test_retry_handling_job")
        dag_task1 = dag.get_task("test_retry_handling_op")
        dag.clear()
        dag.sync_to_db()
```

allowed the tests to run in an isolated space where only one or a few
DAGs were present in the DB.

This probably worked perfectly in the past, but after some changes
in how DAGFileProcessor works, this no longer prevented the DAGFileProcessor
from running: when the _execute method of scheduler_job_runner was
executed and the standalone dag processor was not running, the
processor_agent was overwritten by a new DagFileProcessorAgent
in the `_execute` method of scheduler_job_runner:

```python
        if not self._standalone_dag_processor:
            self.processor_agent = DagFileProcessorAgent(
                dag_directory=Path(self.subdir),
                max_runs=self.num_times_parse_dags,
                processor_timeout=processor_timeout,
                dag_ids=[],
                pickle_dags=pickle_dags,
                async_mode=async_mode,
            )
```

This led to a very subtle race condition which was more likely on
machines with more cores/a faster disk (so, for example, it
led to apache#35204, which appeared on self-hosted (8-core) runners and
did not appear on public (2-core) runners; or it could appear on
an 8-core ARM Mac but not on a 6-core Intel Mac - and only on
sqlite).

If the DAGFileProcessor managed to start, spawn some
parsing processes and grab the DB write access for sqlite, and those
processes managed to parse some of the DAG files from the tests/dags/
folder, those DAGs could have polluted the DAGs in the DB - leading
to undesired effects (for example the test hanging while the
scheduler job run attempted to process an unwanted subdag and
got deadlocked, in the case of apache#35204).

The solution to that is to only set the processor_agent if it is not
set already. This can only happen in unit tests, when the test sets
`processor_agent` to a Mock object. In "production" the
agent is only set once, in the `_execute` method, so there is no
risk involved in checking whether it is already set.

Fixes: apache#35204
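For illustration, here is a minimal sketch of the guard described above, written against the `_execute` snippet quoted earlier (all names come from that snippet; this is a sketch of the idea, not necessarily the exact final diff):

```python
# Only create a real DagFileProcessorAgent when none has been set yet,
# so a processor_agent already injected by a test (e.g. mock.MagicMock())
# is kept instead of being overwritten.
if not self._standalone_dag_processor and not self.processor_agent:
    self.processor_agent = DagFileProcessorAgent(
        dag_directory=Path(self.subdir),
        max_runs=self.num_times_parse_dags,
        processor_timeout=processor_timeout,
        dag_ids=[],
        pickle_dags=pickle_dags,
        async_mode=async_mode,
    )
```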
potiuk added a commit that referenced this issue Oct 27, 2023
ephraimbuddy pushed a commit that referenced this issue Oct 29, 2023
(cherry picked from commit 6f3d294)
ephraimbuddy pushed a commit that referenced this issue Oct 30, 2023
(cherry picked from commit 6f3d294)