Task processes killed with WARNING - Recorded pid does not match the current pid #17507
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
We added a fix in #16301 that will be released in 2.1.3. Can you check whether you can reproduce this with the current main branch?
We're using 2.1.3 and sometimes see exactly the same message in task runs. We're using the Celery executor.
@crazyproger do you have a DAG that reproduces this behaviour? Please share.
We hit this on different DAGs. Clearing the DAG run helps in all cases.
The function that creates the DAG is here: https://gist.github.com/crazyproger/a2a516f8e6b757b88d29f6fccca16990
Yes. I downgraded to 2.1.2 and the issue has gone away for now. See related #10026.
@noelmcloughlin In the issue you mention 2.1.2 -- did you mean 2.1.3?
Sorry, I was on 2.1.2 (okay), took 2.1.3 (issue seen), downgraded to 2.1.2 (okay again).
That's what I thought, just wanted to confirm.
Hi, could this be linked to PR #17333?
For info, I incorporated the fix proposed by @nmehraein here; unfortunately I am still getting the same error.
Also, as mentioned here, it seems that run_as_user is part of the issue: https://www.gitmemory.com/issue/apache/airflow/17394/892065842
As @potiuk mentioned, it is related, but setting airflow as the default impersonation user in the config did not help.
I was not using run_as_user (#10026 (comment)), so I think we need to collect more data. There is probably a common attribute, but I'm not sure run_as_user is important.
Is there any workaround, since my DAGs are no longer working properly? Thanks.
I tried disabling ...
For me, for the moment, the problem is with backfill mode. Tonight my processes will run on this new Airflow session, and I will see if I get the same errors. UPDATE: I have modified the param ...
In our case, with the Kubernetes executor, it definitely seems scheduler related. In a DAG with 55 tasks, around a third receive SIGTERM shortly after starting and then go into a retry loop with "Pid X does not match Pid Y". It was fixed after I reduced the pool size from 128 (all tasks queued at the same time) to 32, so 23 tasks were left in the scheduled state. After I reverted the pool change, the issue came back.
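(For reference, a pool's slot count can be changed from the UI or the Airflow CLI. The sketch below assumes the tasks run in default_pool; the pool name and slot values are placeholders, not details from the report above.)

```bash
# Hedged sketch: shrink a pool so fewer task instances are queued at once.
# "default_pool" and the slot count are placeholder values.
airflow pools set default_pool 32 "reduced to limit concurrently queued tasks"

# Inspect the current pools to confirm the change took effect.
airflow pools list
```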
I will also revert to the latest 2.0 version available; however, I do not know what impact it will have on the metadata db (it may need a downgrade or something like that). Does someone know if there is an official guide to downgrading? So what I did is:
With this setup my DAGs are working "properly" (apart from the bug below)! I no longer have the pid mismatch and no longer get random SIGTERMs on task executions with relatively large queries.
This problem is still there in Airflow 2.1.3!!
How did you increase the heartbeat signal? I am running Airflow on a Kubernetes cluster and run into this issue on 2.1.3, 2.1.2 and 2.0.2.
Hello @MarvinSchenkel, to increase the scheduler heartbeat interval you can either export it as an environment variable when installing Airflow or, if it is already installed, modify the parameter in your airflow.cfg file.
To make sure that the values are correctly set, you can check them, and then restart the scheduler and the webserver.
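(A minimal sketch of both approaches. The thread does not name the exact option, so scheduler_heartbeat_sec and job_heartbeat_sec in the [scheduler] section, and the value 30, are assumptions for illustration.)

```bash
# Option 1: environment variables, following the AIRFLOW__<SECTION>__<OPTION> pattern.
export AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=30
export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=30

# Option 2: edit airflow.cfg instead:
#   [scheduler]
#   scheduler_heartbeat_sec = 30
#   job_heartbeat_sec = 30

# Verify the effective values, then restart the scheduler and webserver.
airflow config get-value scheduler scheduler_heartbeat_sec
airflow config get-value scheduler job_heartbeat_sec
```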
In order to use run_as_user, you must follow this guide: http://airflow.apache.org/docs/apache-airflow/stable/security/workload.html#impersonation
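(For completeness, a minimal sketch of what that guide requires. The sudoers entry and the user names airflow and etl_user are placeholders/assumptions, not values from this thread.)

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Prerequisite from the impersonation guide: the unix account that runs the
# Airflow worker must be able to sudo without a password, e.g. a sudoers entry
# such as:  airflow ALL=(ALL) NOPASSWD: ALL
# "etl_user" below is a placeholder account name.
with DAG(
    dag_id="impersonation_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="impersonated_task",
        bash_command="whoami",      # prints the impersonated user at runtime
        run_as_user="etl_user",     # task is re-executed via sudo as this user
    )
```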
Hi all, I am experiencing this on 2.3.2 with LocalExecutor (4 schedulers), Postgres, and Ubuntu 22.04. This is, however, running a clone of our staging environment of dags that run fine on 2.1.4 and Ubuntu 16.04. I'm also running on a much smaller and less powerful instance, and so it may be exacerbating race conditions. I did some investigation into the process state, and when this error leads to a failure, this is what I see in process executions:
I came to wonder, since this error happens because (a) the final ... As I've investigated further, I've found that, on task failures for run_as_user tasks in which this fails, the ... I'm testing this right now and it seems to work, and if that holds up I'll put in a PR.
This does seem to be working consistently for LocalExecutor; I haven't checked Celery or Kubernetes. It will take me a little while to set up the dev environment and do the testing before submitting a PR, but feel free to give it a whirl. I have a tentative branch set up here: https://github.com/krcrouse/airflow/tree/fix-pid-check
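(For readers following along, here is a heavily simplified sketch of the comparison behind the "Recorded pid does not match the current pid" warning. It is not the actual Airflow source; it only illustrates the idea that, with run_as_user, an extra sudo wrapper process sits between the recorded pid and the supervised process, which is where a stale or mismatched pid can trip the check.)

```python
from typing import Optional

import psutil


def pid_matches(recorded_pid: Optional[int], current_pid: int, run_as_user: bool) -> bool:
    """Illustrative sketch only -- not the actual Airflow implementation.

    The task instance records the pid it believes it is running under; the
    heartbeat compares that against the process the job runner supervises.
    """
    if recorded_pid is None:
        # Nothing recorded yet, so there is nothing to compare.
        return True
    if run_as_user:
        # With impersonation the task runs under an extra sudo wrapper, so the
        # parent of the recorded pid is compared instead of the pid itself.
        recorded_pid = psutil.Process(recorded_pid).ppid()
    return recorded_pid == current_pid
```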
I see the same issue happening on 2.4.3, Python 3.10 (set up through miniconda). OS details: ... I have seen this in the task logs.
Airflow Logs
@uranusjr can you also add 2.4.3 as an affected version?
Could you tell us more about your case, @shubhampatel94? Did it happen once? Did it start to happen continuously? Did you (or the deployment that you are running) experience some kind of event (restart, being terminated, or similar)? I am just trying to see if there is someone who can explain the circumstances in which it happens. It does not seem a common occurrence; people are experiencing it occasionally, and I think it is caused by some race condition involved in starting and terminating processes quickly. If it happened once and it was accompanied by some deployment issue that caused termination of running processes, I would not be surprised to see a similar issue.
@shubhampatel94 and @potiuk - Note that since (my) patch that fixed the race condition was applied, I have occasionally seen this error when the process was killed for another reason - for example, we have OS monitors that will kill processes that are being bad citizens, and there are times when the task hit an unexpected exception and died by itself. I've verified that in these cases the task failed to complete and Airflow throws this error message, but Airflow is not the ultimate culprit terminating the tasks; this is the message produced when the containing process of the dead task is terminated. Is it possible that this is what is happening for you? At some point I was going to dig deeper to verify the situation and propose a better way to identify it and emit a better error message, but I haven't had the time.
Precisely. What you explain is what I suspected: a very rare event that is externally triggered. That's how it looks from the logs. It actually looks like something just killed a bunch of running tasks, but the original local task jobs were not killed, and they then complained about those "child" processes going missing. If that is happening only occasionally, as a result of some unbounded killing of processes, I would be for just closing this one. We are not able to handle all the scenarios in which something randomly kills some processes. Airflow is not a 99.999% available system that is supposed to handle absolutely all such situations; developing such systems is extremely costly, and there is little incentive to spend a lot of time perfecting this when there is a nice UI and monitoring that can warn in such situations and let a human fix it by re-running the tasks.
@potiuk @kcphila Thanks for looking into what I have pointed out.
What I have observed is that Airflow terminated all running tasks within a 6-minute window, and then it was business as usual.
So I would not really worry about it. I think we can close this one, and we might re-open it if we get more reproducible cases beyond occasional failures like that (which just might happen).
Thanks, @potiuk. I will raise this again if I observe the issue again.
Apache Airflow version: 2.1.3
Apache Airflow Provider versions (please include all providers that are relevant to your bug):
Environment:
I'm using the airflow-2.1.2 container from dockerhub.
Debian GNU/Linux 10 (buster)
uname -a: Linux fe52079d9ade 5.12.13-200.fc33.x86_64 #1 SMP Wed Jun 23 16:20:26 UTC 2021 x86_64 GNU/Linux
What happened:
When using the EMRStepSensor (set to reschedule mode) to monitor EMR steps, the task will sometimes fail even though the EMR step ran successfully. Most of the time the sensor works fine, but every so often this issue occurs (on the same DAG, without modifications).
EMRStepSensor task instance debug log
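(For context, a minimal sketch of the kind of sensor configuration described above. The import path is the one used by the 2.1-era Amazon provider; the cluster id, step id, connection id, and poke interval are placeholders, not values from this report.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor

with DAG(
    dag_id="emr_step_sensor_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    watch_step = EmrStepSensor(
        task_id="watch_emr_step",
        job_flow_id="j-XXXXXXXXXXXXX",   # placeholder EMR cluster id
        step_id="s-XXXXXXXXXXXXX",       # placeholder EMR step id
        aws_conn_id="aws_default",
        mode="reschedule",               # frees the worker slot between pokes
        poke_interval=60,                # seconds between pokes (illustrative)
    )
```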
What you expected to happen:
I'd expect the EMRStepSensor to run until the EMR step succeeded, and to report a successful run.
If my understanding is correct, these final lines in the log show the runner terminating the task process. If I'm reading the log correctly, 8339 is the correct pid for the task, and the recorded pid 7972 is the pid from a previous run. Could it be that this pid is not being updated correctly?
Anything else we need to know:
The symptoms look very similar to #17394, but I'm not using run_as_user, and the reported pids are not the same, so I'm not sure whether this is the same issue.