Tasks intermittently gets terminated with SIGTERM on kubernetes executor #18041
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! |
Please provide the full task logs and also the scheduler logs. What you added above is only the part where the task failed, but I believe something must have caused it to receive the SIGTERM. |
The tasks seem to fail while waiting or when they are long-running. Task logs:
Scheduler logs:
|
For the task to receive SIGTERM means something is killing your pods. The task runner receives SIGTERM when the pod is deleted. Can you check if something else is deleting your pods? |
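(For context, a minimal sketch, not the actual Airflow source, of the behaviour described above and matching the traceback later in this issue: the task process installs a SIGTERM handler that converts the signal into an exception, which is why a deleted pod surfaces as a failed task.)

```python
# Sketch only: turn SIGTERM into an exception, as the task runner does.
import signal
import time

class TaskTerminated(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""

def signal_handler(signum, frame):
    raise TaskTerminated("Task received SIGTERM signal")

signal.signal(signal.SIGTERM, signal_handler)

if __name__ == "__main__":
    # If Kubernetes deletes the pod, it sends SIGTERM and the handler aborts
    # whatever the task was doing (for example, a long sleep).
    time.sleep(60)
```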
True, when a pod gets deleted it receives SIGTERM. I have tried to find the cause of the pods getting deleted, but could not find any reason for it yet. It happens just randomly. |
Is your DAG paused? |
DAG is in ON state. |
@Nimesh-K-Makwana can you set |
Hello, I am facing the same issue: I have modified the variables killed_task_cleanup_time and schedule_after_task_execution to 100000 and False respectively. My tasks are constantly getting killed in backfill mode with the traceback:
Honestly, I am a bit discouraged at this point, could you help me please? Thanks. Task logs:
scheduler log:
I also face the issue described in #17507. Thanks, Pierre |
Can be related to: ed99eaa |
So I tested the code that generated the issue described above with the change from that commit, and it did not solve the issue: the first backfill worked, but all the following ones did not; I am still getting the same error. |
@laserpedro, I'm not able to reproduce your case with backfill. How long does your task run? I would also appreciate it if you could provide a DAG to reproduce this. |
Hello @ephraimbuddy, Thank you for your answer! I thought the same as you, so I did a full reinstall today (new machine, new database, new environment, new user).
First question: in terms of user setup, does that seem correct to you? I ported one of the DAGs that generated those errors in backfill mode and I am still getting them erratically: it is not the same tasks that fail, whatever the number of DAG runs launched in backfill (30, 16, 10). For a classic backfill (16 DAG runs in parallel for me), the time to finish them all is 7-10 minutes: nothing heavy is done except the insertion of 40k rows into a PostgreSQL database at the end (so fairly small actually). However, execution time does seem to be a factor here, since backfills with very few tasks work fine. For the code sample, I will provide you something ASAP to reproduce the problem (sent by mail :)). Thanks, Pierre |
@laserpedro Yes. that makes sense for a user setup. |
@laserpedro It will be very helpful if you can provide a simple dag to reproduce this behaviour. |
Hello @ephraimbuddy, Since I had to focus on getting my Airflow environment working again, I made the modifications below and it seems to be working properly now:
With this new setup my Airflow environment has been working correctly for 2 days now. |
Thanks @laserpedro. |
Hello @ephraimbuddy, We are facing the same issue... Here is an example DAG to reproduce it:

```python
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

dag = DAG('dag_test',
          description='test',
          schedule_interval=None,
          start_date=datetime(2021, 4, 1),
          max_active_runs=1,
          concurrency=40,
          catchup=False)

def my_sleeping_function(t):
    time.sleep(t)

tasks = []
for i in range(400):
    task = PythonOperator(task_id='sleep_for_' + str(i),
                          python_callable=my_sleeping_function,
                          op_kwargs={'t': 60},
                          dag=dag)
    tasks.append(task)
```

With AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=true

With AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=false

EDIT: it seems that the error message below appears when I relaunch the task after its failure. It may not be related to the SCHEDULE_AFTER_TASK_EXECUTION config.
Some tasks succeed and others are randomly killed. |
Thanks, @felipeangelimvieira for coming through with the dag! |
So I found out that my metadata database was using 100% CPU while running DAGs with many tasks in parallel, such as the example above. I'm using Azure PostgreSQL and the official Airflow Helm chart with PgBouncer enabled. Scaling up the database may solve the issue, but instead I increased AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC and the problem was solved (the default is 5 sec). While I'm not sure, it's possible that the heartbeat method of BaseJob is what is overloading the database. When the database is running out of CPU, the heartbeat takes longer than heartrate * 2.1 (2.1 is the default grace_multiplier in the is_alive method of BaseJob) and the scheduler kills the tasks. |
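(To illustrate the mechanism described above, here is a minimal sketch under the assumption that the liveness check works as described; it is not the actual Airflow source.)

```python
# Sketch of the liveness check: a job is considered alive only if its latest
# heartbeat landed within heartrate * grace_multiplier seconds. If the metadata
# DB is overloaded and the heartbeat update is delayed past that window, the
# scheduler treats the task runner as dead and kills it.
from datetime import datetime, timedelta

def is_alive(latest_heartbeat: datetime, heartrate: float, grace_multiplier: float = 2.1) -> bool:
    return datetime.utcnow() - latest_heartbeat < timedelta(seconds=heartrate * grace_multiplier)

# With the default 5s heartrate, a heartbeat delayed by ~12s already looks dead:
print(is_alive(datetime.utcnow() - timedelta(seconds=12), heartrate=5))   # False
# Raising AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC widens the window:
print(is_alive(datetime.utcnow() - timedelta(seconds=12), heartrate=30))  # True
```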
@felipeangelimvieira have you noticed any pattern? Like the error happening if your task execution time > heartbeat rate? |
@laserpedro unfortunately no patterns... it worked with the example dag once. I have no idea how to solve it, although the database running out of CPU seems to play a role. Could you verify if it is also the case for your database? I've tried different configurations (LocalExecutor, CeleryExecutor), and the problem keeps appearing randomly with those dags with many tasks in parallel. |
Same thing for me. I suspect my database might be getting slower also.
|
I see this issue too, and it totally correlates with CPU spikes on my PostgreSQL DB. |
Just updated to 2.1.4 and the error message has changed. I'm seeing fewer DAGs receiving the SIGTERM signal, although the error still appears. Now some tasks are being randomly killed and marked as failed before execution.
I've also seen some tasks that succeeded get marked as failed. In addition, I was able to reproduce the SIGTERM issue locally with docker-compose by limiting the CPU usage of the PostgreSQL container. Indeed, the metadata database may cause the SIGTERM error. However, I wasn't expecting high CPU usage in the PostgreSQL database while running with the official Helm chart, since it has a PgBouncer. I can share the docker-compose code if you find it helpful. |
The SIGTERM issue came back for me in 2.0.2, so yes, I really think it is backend related. From there, I don't know what the solution could be... some fine tuning of the PostgreSQL server maybe... I am currently using the standard PostgreSQL config. @felipeangelimvieira: interesting stuff found on Stack Overflow: https://stackoverflow.com/questions/42419834/airbnb-airflow-using-all-system-resources |
We started having this issue after we upgraded to v2.2.3. We did not experience this issue when we were at v2.0.2. Here is the sample dag that we used:
Error message:
Successful tasks were also intermittently flagged as failed:
Environment information:
|
@eduardchai, your case seems different. It seems that the tasks are taking a long time to start. Try whether setting |
@ephraimbuddy it does reduce the number of errors by a lot! Thank you! There are some errors where the jobs are stuck in |
If you are not on Kubernetes then I'm not sure how it worked for you (I think it shouldn't work). What formed my opinion was that your pods were taking time to start and queued tasks were being moved to scheduled. So I don't know how it worked in your case. Maybe @jedcunningham can explain better. Maybe you should increase |
I am seeing a similar issue in our Airflow with Kubernetes environments. Airflow version: 2.1.3. What happens: we have thousands of tasks and this has happened only for a couple of tasks so far. Task log: Task pod log: I have not set any of the configuration params; all have default values. I see a CPU spike but am unable to relate it. Thanks in advance. |
@bparhy have you checked kubelet logs to detect node pressure evictions or other related evictions? |
@morhook I checked with our K8s team and they didn't find anything unusual. I tried increasing the metadata DB size (Aurora) and that also did not help. Is there any solution in this direction, please? We are currently running Airflow 2.1.3 on K8s. Please let me know. |
We are also facing the same problem, with the below stack trace:
**ERROR - Received SIGTERM. Terminating subprocesses.
{process_utils.py:124} ERROR - Process psutil.Process(pid=54556, name='python3', status='zombie', started='20:09:10') (54556) could not be killed. Giving up.**
We also have thousands of tasks, and this happens to some of them intermittently. |
We got SIGTERM errors on about 250 DAGs and solved it with this link |
@GHGHGHKO thanks for the reply. We are seeing the issue on our task pods, which even after success are left in the failed state in K8s. So the outcome is Error pods in K8s. |
Possible fix
I was having the same problem after upgrading from Airflow
Apparently, the
The solution for me was to increase the
Note that this variable is effective only for scheduled tasks (in other words, with DAGs with a specified |
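(The exact setting name is elided in the comment above; assuming it refers to dagrun_timeout, which the follow-up reply mentions, here is a hypothetical sketch of raising it. The DAG id and task are made up for illustration.)

```python
# Assumed reconstruction: raise dagrun_timeout so long-running DAG runs are not
# killed prematurely; it only applies to scheduled runs (DAGs with a schedule_interval).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def noop():
    pass

with DAG(
    dag_id="example_dagrun_timeout",        # made-up dag id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",             # only scheduled DAGs honour dagrun_timeout
    dagrun_timeout=timedelta(hours=6),      # increase this if long runs are being killed
    catchup=False,
) as dag:
    PythonOperator(task_id="noop", python_callable=noop)
```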
I have the same problem. The driver is running, but the sensor displays the following logs until the number of retries exceeds the threshold:
|
Did you try the earlier suggestions with dagrun_timeout? Do you know what is sending SIGTERM to this task? |
Hi all, from the discussion over at issue #17507, I may have identified the issue where the SIGTERM is sent with the "Recorded pid <> does not match the current pid <>" error, but I'm running For me, I think this is happening when I don't know if this will address the issue with the Kubernetes or Celery executor, but it seems very likely to be the same issue. It will take me a little while to set up the dev environment and do the testing before submitting a PR, but if you want to try a local install, feel free to give it a whirl. I have a tentative branch set up here: https://github.com/krcrouse/airflow/tree/fix-pid-check |
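(For reference, a rough illustration of the kind of pid consistency check that produces this error; this is my own sketch, not the code from that branch.)

```python
# Illustrative sketch only: the supervising job keeps the pid recorded for the
# task instance and compares it with the pid of the process it is actually
# monitoring; a mismatch aborts the task, which is then terminated with SIGTERM.
from typing import Optional

class PidMismatch(Exception):
    pass

def check_recorded_pid(recorded_pid: Optional[int], current_pid: int) -> None:
    if recorded_pid is not None and recorded_pid != current_pid:
        raise PidMismatch(
            f"Recorded pid {recorded_pid} does not match the current pid {current_pid}"
        )

# run_as_user (or anything that re-parents the task process) can make the
# recorded pid diverge from the running pid and trip this check.
check_recorded_pid(recorded_pid=1234, current_pid=1234)    # fine
# check_recorded_pid(recorded_pid=1234, current_pid=5678)  # would raise PidMismatch
```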
thank you @potiuk
```yaml
airflow:
  config:
    # if using another namespace, you should configure a new service account
    AIRFLOW__KUBERNETES__NAMESPACE: "airflow"
    AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "false"
    AIRFLOW__WEBSERVER__LOG_FETCH_TIMEOUT_SEC: "15"
    AIRFLOW__LOGGING__LOGGING_LEVEL: "DEBUG"
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "s3://airflow-logs/"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "openaios_airflow_log"
    AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
    #AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC: 600
    #AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 200
    #AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: 600
    AIRFLOW__KUBERNETES__WORKER_PODS_QUEUED_CHECK_INTERVAL: "86400"
    AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
    AIRFLOW__CORE__HOSTNAME_CALLABLE: socket.gethostname
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: "30"
    AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
  ## a list of users to create
```
|
The change indicates a problem with the scheduler healthcheck - which I believe was already addressed in 2.3.* (currently we are voting on 2.3.3). I will close this provisionally. And I have a big request - can any of the people who had the problem migrate to 2.3.3 (or even try the 2.3.3rc1 which we are testing here: #24806) and reset the configuration to the defaults (@allenhaozi - maybe you can try it)? |
@potiuk I am on version 2.3.3 and am having the same issue described here. |
Then provide the information: logs, analysis, and a description of your circumstances in a separate issue. It does not bring anyone closer to a fix to state "I have the same issue" without providing any more details that can help with diagnosis of the problem you have. This might be a different issue manifesting similarly - but if you do not create a new issue with your symptoms and description, you pretty much remove the chance of anyone fixing your problem - because it might be a different one. So if you want to help with diagnosis of the problem - please do your part and report details that might help with the diagnosis. |
@potiuk I tried to update the following variables and I still have the issue:
and also tried with I don't understand what I'm doing wrong because other DAGs work fine 😢 Any clue 🙏? |
I suggest migrating to the latest version - 2.4 (or, in a few days, 2.5). There are hundreds of related fixes since then, and it is the easiest way to see whether things have improved. This is the most efficient way for everyone. |
@potiuk after migrating to 2.5.0 I still get the issue |
Can you please open a new issue with a description of the circumstances and logs describing when and how it happens? The ask from above does not change:
|
cc: @yannibenoit ^^ |
@potiuk Thank you for your help. I created an issue but I will resolve it haha 😂 -> Tasks intermittently gets terminated with SIGTERM on Celery Executor · Issue #27885 · apache/airflow. Found a fix after looking at a Stack Overflow post -> Celery Executor - Airflow Impersonation "run_as_user" Recorded pid xxx does not match the current pid - Stack Overflow. I was running my bash operator with a |
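(For illustration, a minimal hypothetical DAG of the kind described above; the DAG id, task id, and command are made up. The reported fix was commenting out run_as_user.)

```python
# Hypothetical example only: a BashOperator run with run_as_user, which was
# triggering the "Recorded pid ... does not match the current pid" SIGTERM
# failure; commenting the parameter out avoided the problem.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_run_as_user",         # made-up dag id
    start_date=datetime(2022, 11, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="print_date",             # made-up task id
        bash_command="date",
        # run_as_user="airflow",          # commenting this out avoided the pid mismatch
    )
```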
Ah. I would say that should have been fixed already. Is it possible, @yannibenoit, to make an issue and submit some logs from BEFORE the run_as_user was commented out? I guess this might be a problem others might also have, and run_as_user is kinda useful. |
Hello, we were experiencing a similar issue on v2.2.5 so we migrated to v2.4.3 but the problem still exists.
We're using a Postgres DB, and during DAG execution the CPU utilization of the DB spikes up to 100%. (we're using |
@shaurya-sood - can you please (asking it again) open a new issue with more details - what is your deployment, what you are doing, what you experience, more logs, what happens in the UI, whether you use run_as_user, is it happening always or only sometimes, when it happens, etc. It really does not help to add a comment on a closed issue that might just have a similar message but might not necessarily be the same issue. Thanks in advance. |
Opened a new issue #28201 |
Apache Airflow version
2.1.3 (latest released)
Operating System
Linux
Versions of Apache Airflow Providers
No response
Deployment
Other
Deployment details
Have tried the env variables given in GitHub issue issues/14672:
AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME: "604800"
AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: "False"
What happened
[2021-09-04 10:28:50,536] {local_task_job.py:80} ERROR - Received SIGTERM. Terminating subprocesses
[2021-09-04 10:28:50,536] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 33
[2021-09-04 10:28:50,537] {taskinstance.py:1235} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-09-04 10:28:52,568] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1307, in _execute_task
result = task_copy.execute(context=context)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 150, in execute
return_value = self.execute_callable()
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/operators/python.py", line 161, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/airflow/dags/repo/dags/elastit_schedular/waiting_task_processor.py", line 59, in trigger_task
time.sleep(1)
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1237, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
What you expected to happen
The DAG must get executed successfully without any SIGTERM signal.
How to reproduce
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct