Workers silently crash after memory build-up #16703
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
@dan-origami Hmmm not good! What specific metric is it that you are showing in your graph please?
Could you also check the output of dmesg?
@ashb it's
Nothing in dmesg at all; I checked, and due to our affinity/selector the pod was always on the same server.
Nice analysis. Will help us to investigate. Thanks @dan-origami!
No problem, happy to provide the full Dockerfile, metrics, or anything else you want as well.
Oh one extra thing -- are you able to check the RSS of the processes in the container and see if they are all growing equally, if one is clearly using more than another, or if no process's RSS actually shows any growth? (I've been trying to track down a different memory issue in the scheduler where working_set_bytes is growing but no process's RSS shows the growth, and would like to check whether they are different behaviours.)
using
Working-set bytes seem to grow, roughly similarly across all workers. And now using
I've also checked the processes in /proc/<pid>/status and this seems to be reflected there, but of course I don't have the historical PIDs for now.
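For reference, a minimal sketch of how per-process RSS could be compared inside a worker container (assuming Linux procfs and Python 3; the script itself is illustrative and not from this thread):

```python
#!/usr/bin/env python3
"""Print the RSS (VmRSS) of every readable process, e.g. to compare worker
process memory over time. Illustrative sketch; assumes Linux procfs."""
import os
import re


def rss_kb(pid: str):
    """Return VmRSS in kB for a PID, or None if unavailable (e.g. kernel thread)."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(re.search(r"\d+", line).group())
    except OSError:
        return None
    return None


def cmdline(pid: str) -> str:
    """Return the process command line with NUL separators replaced by spaces."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read().replace(b"\0", b" ").decode(errors="replace").strip()
    except OSError:
        return "?"


if __name__ == "__main__":
    for pid in sorted((p for p in os.listdir("/proc") if p.isdigit()), key=int):
        rss = rss_kb(pid)
        if rss is not None:
            print(f"{pid:>7}  {rss:>10} kB  {cmdline(pid)[:80]}")
```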
FWIW, I've just noticed this happened on a worker pod over the weekend that was not experiencing abnormal memory usage or build-ups. It does seem to still log occasionally with a Celery sync.
I will probably upgrade these Airflows to 2.1 to see if that makes any difference.
Around the time it failed I see some tmp files in /tmp.
There are a bunch of these appearing over time on the workers; again, hard to know if it's related.
@dan-origami Were you able to upgrade to the latest 2.1 release and see if this is still an issue?
@kaxil we are on 2.1.1 at the moment and it seems better; there were some fixes listed in the release for CeleryExecutor, so we went for it. I can't definitively say that it's fixed, though, as we churn our Airflows quite a lot at the moment, so they don't always get prolonged runtime without being redeployed. I see this issue has been added to the 2.1.3 milestone; do you know if there is anything specific that has been found around this?
If you want us to try 2.1.2 as well, we can definitely do that.
Yes please
Hi @dan-origami, have you tried Airflow 2.1.2?
@ephraimbuddy I am trying it this week
@kaxil @ephraimbuddy Just to give you a bit of an update, I think I have found the actual cause of this. I noticed that we seem to hit a problem with the number of active tasks on a Celery worker (all our settings here are currently default, so a max of 16 per worker and 32 across the Airflow setup). I also noticed that when this problem manifests we don't schedule anything, so I started looking into our workers via Flower.

Screenshots are below, but basically we have these fairly big DAGs that run some Spark jobs on a Spark cluster in the same Kubernetes cluster (PySpark, so the driver exists as part of the Airflow worker; we can give more details on Spark if you want, but it's BashOperator and not SparkOperator for a number of reasons). Sometimes the tasks in these DAGs fail, which Airflow picks up and marks the task as Failed. However, these tasks still sit on the Celery worker as active tasks and are not removed. We can manually delete them and it works, so the Celery worker itself is still active and has not crashed. The workers just do not seem to log anything when they are not picking up or running any new tasks. Active PIDs etc. as listed in Flower also seem to match up.

It's not clear why the task failed, but we have the logs of it being picked up by the worker (I've removed a few bits). It also explains why I went down the memory/resource-issue rabbit hole, as these tasks sit around on the worker(s). There are some parameters we can tune, I think, to add timeouts on the tasks and related settings on the Celery side. Do you know of any known issue with this disconnect between a task failing in Airflow and it not being removed from the Celery worker? The worker was not rebooted and did not crash at any point during this time. Also, this investigation was carried out today (31st Aug), and the dates for the tasks stuck since the 28th are correct; they have been there for over 4 days.
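For illustration, a minimal sketch of how such stuck-but-active tasks could be listed and revoked from the Celery side (it assumes the Airflow 2.x Celery app is importable as shown; the script is a sketch, not a tested tool):

```python
"""List tasks that Celery still reports as active and optionally revoke them.
Sketch only: assumes the Airflow 2.x Celery app is importable as below; if it
is not, build a celery.Celery() app pointed at the same broker instead."""
from airflow.executors.celery_executor import app  # assumption: Airflow 2.x layout


def active_tasks():
    """Return {worker_name: [task dicts]} as reported by the running workers."""
    return app.control.inspect().active() or {}


def revoke(task_id: str) -> None:
    """Tell the workers to drop the task; terminate=True also kills the child process."""
    app.control.revoke(task_id, terminate=True)


if __name__ == "__main__":
    for worker, tasks in active_tasks().items():
        for task in tasks:
            print(worker, task.get("id"), task.get("name"), task.get("time_start"))
            # revoke(task["id"])  # uncomment only after confirming the task is truly stuck
```

On the tuning side, Celery's own task_time_limit and task_soft_time_limit settings can force long-running tasks to be killed, though whether that is safe depends on the Spark job runtimes.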
@potiuk Did you have a fix related to memory usage of logging? @dan-origami Are you able to test with 2.2.x?
Yes. The graphs look very much like what I've fixed. But it was not (and could not be) the cause of a crash, as it was not real memory usage of the application but the kernel cache for the log files; that memory is freed and reclaimed when needed, so it would never cause a crash. The observed memory behaviour matches what my change addressed (overall growth of the working-set memory while the container's RSS stays flat). If that is the case, then the crash could be caused by something else and the observed memory build-up is unrelated. The kernel advisory (to not cache the written logs) has been released in Airflow 2.2.0 (https://airflow.apache.org/docs/apache-airflow/stable/changelog.html#airflow-2-2-0-2021-10-11) with this fix: #18054, and the history of the issue is here: #14924.
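For context, the general technique the referenced fix relies on looks roughly like this: after writing log records, the handler advises the kernel that the written pages will not be re-read, so they can be evicted from the page cache instead of inflating the container's working set. This is a hedged sketch of that idea, not the exact Airflow patch:

```python
"""Illustration of the page-cache advisory technique (a sketch, not the exact
Airflow change): after each log write, hint to the kernel that the written
bytes will not be re-read, so they need not stay in the page cache. Unix only."""
import logging
import os


class NonCachingFileHandler(logging.FileHandler):
    """FileHandler that asks the kernel not to cache what it has just written."""

    def emit(self, record: logging.LogRecord) -> None:
        super().emit(record)
        try:
            self.flush()
            # Advise over the whole file (offset 0, length 0 == to EOF).
            os.posix_fadvise(self.stream.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        except (OSError, AttributeError, ValueError):
            # posix_fadvise may be unavailable (non-Unix) or the stream closed;
            # the hint is best-effort, so just skip it.
            pass
```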
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version: 2.0.2
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.15
Environment:
uname -a: Linux 5.4.0-1024-aws #24-Ubuntu
Celery Workers
What happened:
Memory usage builds up on our Celery worker pods until they silently crash. Resource usage flatlines and no logs are created by the worker. The process is still running, and Celery (verified via ping and Flower) thinks the workers are up and running.
No tasks are finished by Airflow; the schedulers are running fine and still logging appropriately, but the workers are doing nothing. Workers do not accept any tasks, and in-flight jobs hang.
They do not log an error message, and the pod is not restarted because the process hasn't actually crashed.
Our workers do not all crash at the same time; it happens over a couple of hours even if they were all restarted together, so it seems to be related to how many jobs the worker has done, its logs, or some other non-time-based event.
I believe this is related to the logs generated by the workers; Airflow appears to be reading the existing log files into memory. Memory usage drops massively when the log files are deleted and then resumes building up again.
There doesn't appear to be a definite upper limit of memory that the pod hits when it crashes, but it's around the 8 or 10 GB mark (there is 14 GB available to the pods, but they don't hit that).
A worker pod with a larger log size on disk uses more memory than one with a smaller log size on disk.
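One way to check whether that growth is page cache (cached log files) rather than process memory is to compare the container cgroup's cache and rss counters; a minimal sketch, assuming cgroup v1 is mounted at the usual path:

```python
"""Compare the container cgroup's 'cache' and 'rss' counters to see whether
working-set growth is page cache (e.g. log files) or real process memory.
Sketch assuming cgroup v1 mounted at /sys/fs/cgroup/memory."""


def read_memory_stat(path: str = "/sys/fs/cgroup/memory/memory.stat") -> dict:
    """Parse memory.stat into {counter_name: bytes}."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats


if __name__ == "__main__":
    stats = read_memory_stat()
    mib = 1024 * 1024
    print(f"rss:   {stats.get('rss', 0) / mib:8.1f} MiB  (anonymous process memory)")
    print(f"cache: {stats.get('cache', 0) / mib:8.1f} MiB  (page cache, including log files)")
```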
What you expected to happen:
If the worker has crashed or ceased functioning, it should either log an appropriate message (if the process is still up) or crash cleanly so it can be restarted.
Existing log files should not contribute to the memory usage of the Airflow process either.
Celery should also be able to detect that the worker is no longer functional.
How to reproduce it:
Run an Airflow cluster with 40+ DAGs and several hundred tasks in total in an environment with observable metrics; we use Kubernetes with Prometheus.
We have 5x worker pods.
Monitor the memory usage of the worker containers/pods over time, as well as the size of the Airflow task logs; the trend should only increase.
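As an illustration of that monitoring step, a minimal sketch of polling the working-set metric through the Prometheus HTTP API (the Prometheus URL and pod label selector are placeholders, not values from this issue):

```python
"""Poll container_memory_working_set_bytes for the worker pods through the
Prometheus HTTP API. The Prometheus URL and the pod label selector below are
placeholders for this particular setup."""
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # placeholder
QUERY = 'container_memory_working_set_bytes{pod=~"airflow-worker-.*", container!=""}'


def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result series."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]


if __name__ == "__main__":
    for series in instant_query(QUERY):
        pod = series["metric"].get("pod", "?")
        _, value = series["value"]
        print(f"{pod}: {float(value) / 1024 ** 2:.0f} MiB")
```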
Anything else we need to know:
This problem occurs consistently, after a clean deployment and in multiple environments.
The official Airflow Docker image contains a log cleaner, so it's possible this has been avoided there, but the default of 15 days would in any case be far too long for us: our workers crash after 2 or 3 days.
Resorting to an aggressive log-cleaning script has mitigated the problem for us, but without proper error logs or a reason for the crash it's hard to be certain that we are safe.
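For reference, a minimal sketch of the kind of aggressive log-cleaning job described here (the log directory and retention period are placeholders, not values from this issue):

```python
"""Aggressive task-log cleanup of the kind described above. The base log folder
and retention are placeholders; run it periodically (cron job or sidecar) in
each worker pod."""
import os
import time

BASE_LOG_FOLDER = "/opt/airflow/logs"  # placeholder: match [logging] base_log_folder
MAX_AGE_SECONDS = 24 * 3600            # placeholder: keep roughly one day of logs


def clean_old_logs(base: str = BASE_LOG_FOLDER, max_age: int = MAX_AGE_SECONDS) -> None:
    """Delete regular files under `base` whose mtime is older than the cutoff."""
    cutoff = time.time() - max_age
    for root, _dirs, files in os.walk(base):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
            except OSError:
                # File rotated or removed concurrently; skip it.
                continue


if __name__ == "__main__":
    clean_old_logs()
```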
This is our airflow.cfg logging config; we aren't doing anything radical, just storing logs in a bucket.
Here is a memory usage graph of a crashed worker pod; the flat line is when it is in a crashed state, before being restarted. There is also a big cliff on the right of the graph at about 09:00 on June 29th, where I manually cleaned the log files from the disk.
The last few log lines before it crashed: