
Workers silently crash after memory build up #16703

Closed
dan-origami opened this issue Jun 29, 2021 · 21 comments
Labels
affected_version:2.0 · area:Scheduler · kind:bug · pending-response · stale

Comments


dan-origami commented Jun 29, 2021

Apache Airflow version: 2.0.2

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.15

Environment:

  • Cloud provider or hardware configuration: AWS, ec2 servers deployed by kops
  • OS (e.g. from /etc/os-release): Ubuntu 20.04
  • Kernel (e.g. uname -a): Linux 5.4.0-1024-aws #24-Ubuntu
  • Install tools: Dockerfile
  • Others: Custom Dockerfile (not official airflow image from dockerhub)
    Celery Workers

What happened:

Memory usage builds up on our Celery worker pods until they silently crash. Resource usage flatlines and no logs are created by the worker. The process is still running, and Celery (verified via ping and Flower) thinks the workers are up and running.
No tasks are finished by Airflow; the schedulers are running fine and still logging appropriately, but the workers are doing nothing. Workers do not accept any tasks, and in-flight jobs hang.
They do not log an error message, and the pod is not restarted since the process hasn't crashed.
Our workers do not all crash at the same time; it happens over a couple of hours even if they were all restarted together, so it seems to be related to how many jobs the worker has done, log volume, or some other non-time-based factor.

I believe this is related to the logs generated by the workers; Airflow appears to be reading the existing log files into memory. Memory usage drops massively when the log files are deleted and then resumes building up again.

There doesn't appear to be a definite upper limit of memory that the pod hits when it crashes, but it's around the 8-10 GB mark (there is 14 GB available to the pods, but they don't hit that).

A worker pod with a larger log size on disk uses more memory than one with a smaller log size.

What you expected to happen:

If the worker has crashed/ceased functioning, it should either log an appropriate message (if the process is still up) or crash cleanly so it can be restarted.
Existing log files should not contribute to the memory usage of the Airflow process either.
Celery should also be able to detect that the worker is no longer functional.

How to reproduce it:

Run an Airflow cluster with 40+ DAGs and several hundred tasks in total, in an environment with observable metrics; we use Kubernetes with Prometheus.
We have 5 worker pods.
Monitor the memory usage of the worker containers/pods over time, as well as the size of the Airflow task logs. The trend should only increase.

Anything else we need to know:

This problem occurs constantly, after a clean deployment and in multiple environments.
The official Airflow Docker image contains a log cleaner, so it's possible this has been avoided there, but the 15-day default would in general be far too long; our workers crash after 2 or 3 days.
Resorting to an aggressive log-cleaning script has mitigated the problem for us, but without proper error logs or a reason for the crash it is hard to be certain that we are safe.
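
For reference, a rough sketch of the kind of aggressive log cleanup we resorted to (the path, pattern and retention here are illustrative, not our exact script):

```python
# Illustrative only: delete task log files older than MAX_AGE_DAYS under
# base_log_folder. Retention and pattern are placeholders, not our exact script.
import time
from pathlib import Path

BASE_LOG_FOLDER = Path("/usr/local/airflow/logs")  # matches base_log_folder below
MAX_AGE_DAYS = 1

cutoff = time.time() - MAX_AGE_DAYS * 86400
for log_file in BASE_LOG_FOLDER.rglob("*.log"):
    try:
        if log_file.stat().st_mtime < cutoff:
            log_file.unlink()
    except OSError:
        pass  # file may have been removed or rotated while we scan
```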

This is our airflow.cfg logging config; we aren't doing anything radical, just storing logs in a bucket.

[logging]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.

# remote_logging = $ENABLE_REMOTE_LOGGING
# remote_log_conn_id = s3conn
# remote_base_log_folder = $LOGS_S3_BUCKET
# encrypt_s3_logs = False

remote_logging = True
remote_log_conn_id = s3conn
remote_base_log_folder = $AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER
encrypt_s3_logs = False

# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s

# Logging level
logging_level = INFO

# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
logging_config_class =

# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /usr/local/airflow/logs

# Name of handler to read task instance logs.
# Default to use file task handler.
# task_log_reader = file.task
task_log_reader = task

Here is a memory usage graph of a crashed worker pod; the flat line is the period when it is in a crashed state, after which it was restarted. There is also a big cliff on the right of the graph, at about 09:00 on June 29th, where I manually cleaned the log files from the disk.

Crashed airflow worker

The last few log lines before it crashed:

Jun 25, 2021 @ 04:28:01.831 | [2021-06-25 03:28:01,830: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[5f802ffb-d5af-40ae-9e99-5e0501bf7d1c]

 Jun 25, 2021 @ 04:27:36.769 | [2021-06-25 03:27:36,769: INFO/MainProcess] Received task: airflow.executors.celery_executor.execute_command[737d4310-c6ae-450f-889a-ffee53e94d33]

  Jun 25, 2021 @ 04:27:25.565 | [2021-06-25 03:27:25,564: WARNING/ForkPoolWorker-13] Running <TaskInstance: a_task_name 2021-06-25T02:18:00+00:00 [queued]> on host airflow-worker-3.airflow-worker.airflow.svc.cluster.local

  Jun 25, 2021 @ 04:27:25.403 | [2021-06-25 03:27:25,402: INFO/ForkPoolWorker-13] Filling up the DagBag from /usr/local/airflow/dags/a_dag.py

  Jun 25, 2021 @ 04:27:25.337 | [2021-06-25 03:27:25,337: INFO/ForkPoolWorker-13] Executing command in Celery: ['airflow', 'tasks', 'run', 'task_name_redacted', 'task, '2021-06-25T02:18:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/usr/local/airflow/dags/a_dag.py']

  Jun 25, 2021 @ 04:27:25.327 | [2021-06-25 03:27:25,326: INFO/ForkPoolWorker-13] Task airflow.executors.celery_executor.execute_command[4d9ee684-4ae3-41d2-8a00-e8071179a1b1] succeeded in 5.212706514168531s: None

  Jun 25, 2021 @ 04:27:24.980 | [2021-06-25 03:27:24,979: INFO/ForkPoolWorker-13] role_arn is None

   Jun 25, 2021 @ 04:27:24.968 | [2021-06-25 03:27:24,968: INFO/ForkPoolWorker-13] No credentials retrieved from Connection

   Jun 25, 2021 @ 04:27:24.968 | [2021-06-25 03:27:24,968: INFO/ForkPoolWorker-13] Creating session with aws_access_key_id=None region_name=None

   Jun 25, 2021 @ 04:27:24.954 | [2021-06-25 03:27:24,953: INFO/ForkPoolWorker-13] Airflow Connection: aws_conn_id=s3conn

  Jun 25, 2021 @ 04:27:20.610 | [2021-06-25 03:27:20,610: WARNING/ForkPoolWorker-13] Running <TaskInstance: task_name_redacted 2021-06-25T03:10:00+00:00 [queued]> on host airflow-worker-3.airflow-worker.airflow.svc.cluster.local
@dan-origami dan-origami added the kind:bug This is a clearly a bug label Jun 29, 2021
boring-cyborg bot commented Jun 29, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

ashb (Member) commented Jun 29, 2021

@dan-origami Hmmm not good!

What specific metric is it that you are showing in your graph please?

ashb (Member) commented Jun 29, 2021

Could you also check the output of dmesg on the host (it likely won't have info from two days ago now, so you might need to trigger this again) at around the time the worker "stalls" -- specifically, I am interested to see if there is anything in the logs related to the Linux OOM killer being triggered.
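
Something like this would do for the check (a rough sketch; the exact kernel message wording varies between kernel versions, and reading dmesg may need elevated privileges):

```python
# Rough sketch: scan dmesg output for signs of the OOM killer.
import re
import subprocess

out = subprocess.run(["dmesg", "-T"], capture_output=True, text=True, check=False).stdout
for line in out.splitlines():
    if re.search(r"out of memory|oom-killer|killed process", line, re.IGNORECASE):
        print(line)
```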

dan-origami (Author) commented Jun 29, 2021

@ashb it's container_memory_usage_bytes, with the full query below (I had just the worker selected for that view, of course):

sum(container_memory_usage_bytes{container!="POD",pod=~"airflow.*",image!=""}) by (pod)

Nothing in dmesg at all. I checked, and due to our affinity/selector the pod was always on the same server:

[Tue Jun 22 13:05:51 2021] device veth7323bc8f entered promiscuous mode
[Tue Jun 22 13:05:51 2021] cni0: port 6(veth7323bc8f) entered blocking state
[Tue Jun 22 13:05:51 2021] cni0: port 6(veth7323bc8f) entered forwarding state
[Tue Jun 22 13:08:10 2021] cni0: port 2(veth742b6e37) entered disabled state
[Tue Jun 22 13:08:10 2021] device veth742b6e37 left promiscuous mode
[Tue Jun 22 13:08:10 2021] cni0: port 2(veth742b6e37) entered disabled state
[Tue Jun 22 13:08:19 2021] IPv6: ADDRCONF(NETDEV_CHANGE): vethab64c6d9: link becomes ready
[Tue Jun 22 13:08:19 2021] cni0: port 2(vethab64c6d9) entered blocking state
[Tue Jun 22 13:08:19 2021] cni0: port 2(vethab64c6d9) entered disabled state
[Tue Jun 22 13:08:19 2021] device vethab64c6d9 entered promiscuous mode
[Tue Jun 22 13:08:19 2021] cni0: port 2(vethab64c6d9) entered blocking state
[Tue Jun 22 13:08:19 2021] cni0: port 2(vethab64c6d9) entered forwarding state
[Tue Jun 22 14:57:36 2021] cni0: port 5(vethda1d06e5) entered disabled state
[Tue Jun 22 14:57:36 2021] device vethda1d06e5 left promiscuous mode
[Tue Jun 22 14:57:36 2021] cni0: port 5(vethda1d06e5) entered disabled state
[Fri Jun 25 09:38:38 2021] cni0: port 3(vethfa6a98a7) entered disabled state
[Fri Jun 25 09:38:38 2021] device vethfa6a98a7 left promiscuous mode
[Fri Jun 25 09:38:38 2021] cni0: port 3(vethfa6a98a7) entered disabled state
[Mon Jun 28 08:43:32 2021] IPv6: ADDRCONF(NETDEV_CHANGE): vethae505a5a: link becomes ready
[Mon Jun 28 08:43:32 2021] cni0: port 3(vethae505a5a) entered blocking state
[Mon Jun 28 08:43:32 2021] cni0: port 3(vethae505a5a) entered disabled state
[Mon Jun 28 08:43:32 2021] device vethae505a5a entered promiscuous mode
[Mon Jun 28 08:43:32 2021] cni0: port 3(vethae505a5a) entered blocking state
[Mon Jun 28 08:43:32 2021] cni0: port 3(vethae505a5a) entered forwarding state
[Mon Jun 28 08:43:40 2021] cni0: port 6(veth7323bc8f) entered disabled state
[Mon Jun 28 08:43:40 2021] device veth7323bc8f left promiscuous mode
[Mon Jun 28 08:43:40 2021] cni0: port 6(veth7323bc8f) entered disabled state

Verification of node staying the same

potiuk (Member) commented Jun 29, 2021

Nice analysis. Will help us to investigate. Thanks @dan-origami !

dan-origami (Author) commented:

No problem, happy to provide full Dockerfile, metrics or anything else you want as well.

ashb (Member) commented Jun 29, 2021

Oh one extra thing -- are you able to check the RSS of the processes in the container and see if they are all growing equally, if one is clearly using more than another, or if no process's RSS actually shows any growth?

(I've been trying to track down a different memory issue in the scheduler where working_set_bytes is growing but no process shows the growth in its RSS, and I would like to check whether they are different behaviours.)

dan-origami (Author) commented:

Using container_memory_working_set_bytes:

sum(container_memory_working_set_bytes{pod=~"airflow-worker-.*",pod!="",container!="POD",image!="",container!="statsd"}) by (pod)

The working set seems to grow, but roughly similarly across all workers.
(Bad graph title.)

Worker-3 (graph)

All workers (graph)

And now using container_memory_rss:

worker-3 (graph)

all workers (graph)

I've also checked the per-process files in /proc/<pid>/status and this seems to be reflected there, but of course I don't have the historical values for old PIDs now.
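
For reference, this is roughly how I'm reading the per-process RSS inside the container (nothing Airflow-specific, it just walks /proc):

```python
# Rough sketch: print the name and VmRSS of every process visible in the container.
from pathlib import Path

for status in Path("/proc").glob("[0-9]*/status"):
    try:
        fields = dict(
            line.split(":", 1) for line in status.read_text().splitlines() if ":" in line
        )
        print(status.parent.name, fields.get("Name", "").strip(), fields.get("VmRSS", "n/a").strip())
    except (OSError, ValueError):
        continue  # process exited while we were reading it
```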

@eladkal eladkal added area:Scheduler including HA (high availability) scheduler affected_version:2.0 Issues Reported for 2.0 labels Jul 4, 2021
@eladkal eladkal added this to the Airflow 2.1.2 milestone Jul 4, 2021
dan-origami (Author) commented Jul 5, 2021

FWIW, I've just noticed this happened over the weekend on a worker pod that was not experiencing abnormal memory usage or build-up.

It does still seem to log occasionally, with a Celery sync.

[2021-07-04 16:56:50,297: INFO/ForkPoolWorker-14] Executing command in Celery: ['airflow', 'tasks', 'run', 'a_task', 'api', '2021-07-04T16:22:00+00:00', '--local', '--pool', 'default_pool', '--subdir', '/usr/local/airflow/dags/dag.py']
[2021-07-04 16:56:50,415: INFO/ForkPoolWorker-14] Filling up the DagBag from /usr/local/airflow/dags/dag.py
[2021-07-04 16:56:50,691: WARNING/ForkPoolWorker-14] Running <TaskInstance: task task 2021-07-04T16:22:00+00:00 [queued]> on host airflow-worker-0.airflow-worker.airflow.svc.cluster.local
[2021-07-05 03:10:55,663: INFO/MainProcess] sync with celery@airflow-worker-1
[2021-07-05 08:13:04,899: INFO/MainProcess] sync with celery@airflow-worker-1

I will probably upgrade these Airflow instances to 2.1 to see if that makes any difference.

dan-origami (Author) commented:

Around the time it failed I see some tmp files in /tmp:

-rw-------  1 airflow   89 Jul  4 15:47 tmpai_jn6o_
-rw-------  1 airflow   89 Jul  4 16:47 tmpyqxc49z8
-rw-------  1 airflow   89 Jul  4 17:17 tmpuyxaiqwj
airflow@airflow-worker-0:/tmp$ cat tmpuyxaiqwj
��N�airflow.exceptions��AirflowException����Task received SIGTERM signal���R�.airflow@airflow-worker-0:/tmp$

There are a bunch of these appearing over time on the workers; again, hard to know if it's related.
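
The airflow.exceptions / AirflowException markers make these look like pickled exception payloads from the forked task processes; assuming that's what they are (unverified on my side), something like this should show what's inside:

```python
# Assumption: the tmp file is a pickled exception payload; if it isn't,
# pickle.load will simply fail. Requires airflow to be importable so the
# AirflowException class can be resolved.
import pickle
import sys

with open(sys.argv[1], "rb") as f:
    try:
        obj = pickle.load(f)
        print(type(obj), obj)
    except Exception as exc:  # not a plain pickle after all
        print("could not unpickle:", exc)
```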

@ashb ashb modified the milestones: Airflow 2.1.2, Airflow 2.1.3 Jul 7, 2021
kaxil (Member) commented Jul 27, 2021

@dan-origami Were you able to upgrade to the latest 2.1 release and see if this is still an issue?

dan-origami (Author) commented:

@kaxil we are on 2.1.1 at the moment and it seems better; there were some fixes listed in the release for the CeleryExecutor, so we went for it. I can't definitively say it's fixed, though, as we churn our Airflow deployments quite a lot at the moment, so they don't always get prolonged runtime without being redeployed.

I see this issue has been added to the 2.1.3 milestone; do you know if anything specific has been found around this?

dan-origami (Author) commented:

If you want us to try 2.1.2 as well we can definitely do that.

kaxil (Member) commented Jul 30, 2021

Yes please

ephraimbuddy (Contributor) commented:

Hi @dan-origami, have you tried Airflow 2.1.2?

dan-origami (Author) commented:

@ephraimbuddy I am trying it this week

@kaxil kaxil modified the milestones: Airflow 2.1.3, Airflow 2.2 Aug 20, 2021
dan-origami (Author) commented Aug 31, 2021

@kaxil @ephraimbuddy Just to give you a bit of an update: I think I have found the actual cause of this.

I noticed that we seem to hit a problem with the number of active tasks on a Celery worker (all our settings here are currently default, so a max of 16 per worker and 32 across the Airflow setup).

However, I noticed that when this problem manifests we don't schedule anything, so I started looking into our workers via Flower.

Screenshots are below, but basically we have these fairly big DAGs that run some Spark jobs on a Spark cluster in the same Kubernetes cluster (PySpark, so the driver lives inside the Airflow worker; we can give more details on Spark if you want, but it's a BashOperator and not the SparkOperator for a number of reasons).

Sometimes the tasks in these DAGs fail, which is picked up by Airflow and the task is marked as Failed. However, these tasks still sit on the Celery worker as active tasks and are not removed.

We can manually delete them and it works, so the Celery worker itself is still active and not crashed. It just does not seem to log anything when it is not picking up or running any new tasks. Active PIDs etc. as listed in Flower also seem to match up.

It's not clear why the task failed, but we have the logs of it being picked up by the worker (I've removed a few bits).

It also explains why I went down the memory/resource rabbit hole, as these tasks sit around on the worker(s).

There are some parameters we can tune, I think, to add timeouts on the tasks and on the Celery side; do you know of any known issue with this disconnect between a task being marked Failed in Airflow and it not being removed from the Celery worker?
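
By timeouts on the tasks I mean something like an execution_timeout on the spark-submit tasks, roughly like this (names and schedule here are made up, not our real DAG):

```python
# Rough sketch of a per-task timeout; dag_id, command and schedule are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_jobs_example",
    start_date=datetime(2021, 8, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/jobs/a_job.py",
        # Kill the task if it runs longer than this, so it cannot sit on the
        # Celery worker indefinitely.
        execution_timeout=timedelta(hours=1),
    )
```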

The worker was not rebooted/crashed at any point during this time.

Also, this investigation was carried out today (31st Aug), and the dates on the tasks stuck since the 28th are correct: they have been there for over 4 days.

[2021-08-28 06:54:43,865: INFO/ForkPoolWorker-8] Executing command in Celery: ['airflow', 'tasks', 'run', 'dagname', 'timeseries_api', '2021-08-28T06:18:00+00:00', '--local', '--pool', 'default_pool', '--subdir', 'dagfolder']
[2021-08-28 06:54:44,122: WARNING/ForkPoolWorker-8] Running <TaskInstance:dagname.timeseries_api 2021-08-28T06:18:00+00:00 [queued]> on host airflow-worker-2.airflow-worker.airflow.svc.cluster.local

Screenshot 2021-08-31 at 13 16 40

Screenshot 2021-08-31 at 13 17 18

Screenshot 2021-08-31 at 13 18 21

Screenshot 2021-08-31 at 13 18 03

@ashb ashb modified the milestones: Airflow 2.2.1, Airflow 2.2.2 Oct 14, 2021
@jedcunningham jedcunningham removed this from the Airflow 2.2.2 milestone Nov 5, 2021
@jedcunningham jedcunningham added this to the Airflow 2.2.3 milestone Nov 5, 2021
ashb (Member) commented Apr 22, 2022

@potiuk Did you have a fix related to memory usage of logging? @dan-origami Are you able to test with 2.2.x?

@ashb ashb removed this from the Airflow 2.3.0 milestone Apr 22, 2022
potiuk (Member) commented Apr 25, 2022

> @potiuk Did you have a fix related to memory usage of logging? @dan-origami Are you able to test with 2.2.x?

Yes. The graphs look very much like what I've fixed. But it was not (and could not be) the cause of a crash, as it was not real memory usage by the application but the kernel page cache for the log files (it would never be the cause of a crash; that memory is freed and reclaimed when needed). The observed memory behaviour matches what my change addressed (overall growth of the working-set memory while the container RSS stays flat).

If that was the case, then the crash could be caused by something else, and the observed memory build-up is unrelated.

The kernel advisory (to not cache the written log files) was released in 2.2.0 (https://airflow.apache.org/docs/apache-airflow/stable/changelog.html#airflow-2-2-0-2021-10-11) with this fix: #18054; the history of the issue is in #14924.
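
For context, the idea behind the fix is roughly the following (a simplified illustration, not the exact code from #18054):

```python
# Simplified illustration: after a task log file has been written and flushed,
# tell the kernel it does not need to keep those pages in the page cache.
import os

def drop_log_from_page_cache(path: str) -> None:
    """Advise the kernel that cached pages for this log file are not needed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Only clean (already-flushed) pages are dropped, so the writer should
        # have flushed the file first.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```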

github-actions bot commented:

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 26, 2022

github-actions bot commented Jun 2, 2022

This issue has been closed because it has not received response from the issue author.

@github-actions github-actions bot closed this as completed Jun 2, 2022