
Are there any policies for yarn to clean the completed or failed job's data? #1760

Closed
ydye opened this issue Nov 27, 2018 · 5 comments

@ydye
Contributor

ydye commented Nov 27, 2018

  1. Clean up finished containers.

Recently we found that some jobs write large amounts of data into the container file system. Docker maps the container's file system to a path on the local host, so if a finished job's data isn't cleaned up, it can take up a lot of disk space.

  2. Clean up YARN local data under the path /tmp/pai-root/.

In our test bed we found a lot of historical data under this path, and some of it is very large. We believe we need a policy to remove this data, but we are not sure whether YARN or PAI has already implemented one. (A manual cleanup sketch follows the du output below.)

root@xxxxxxxxxxxxxx:/# du -h --max-depth=1 /tmp/pai-root/code | grep G
1.3G    /tmp/pai-root/code/application_1534124332808_1913
1.3G    /tmp/pai-root/code/application_1534124332808_1918
1.3G    /tmp/pai-root/code/application_1534124332808_1920
103G    /tmp/pai-root/code/application_1535954247490_2193
1.3G    /tmp/pai-root/code/application_1534124332808_1929
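For reference, a minimal manual-cleanup sketch for directories like these, assuming a per-application directory is safe to remove once its application has finished (the 30-day age cutoff is only an example, not something YARN or PAI guarantees):

# Show the largest per-application directories under the code path.
du -h --max-depth=1 /tmp/pai-root/code | sort -rh | head -20

# Remove application directories untouched for 30 days, on the assumption
# that anything that old belongs to a finished application.
find /tmp/pai-root/code -mindepth 1 -maxdepth 1 -type d \
     -name 'application_*' -mtime +30 -exec rm -rf {} +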
@ydye ydye changed the title Are there any polices for yarn to clean the completed or failed job's data? Are there any policies for yarn to clean the completed or failed job's data? Nov 27, 2018
@mzmssg
Member

mzmssg commented Nov 27, 2018

Is this a container path or a host path? We moved codeDir to the YARN local path in #1100; I suspect these dirs are residue from older jobs.

@ydye
Contributor Author

ydye commented Nov 27, 2018

It's a host path.

@mzmssg
Member

mzmssg commented Nov 27, 2018

For 1: We add --rm when launching the job, so shouldn't finished containers be cleaned up by the Docker daemon?
For 2: In the current release we download code to the YARN local directory, and YARN should clean it up regardless of the job's result. Unfortunately, under disk pressure Kubernetes might kill YARN before that cleanup happens, which is a serious conflict between the two systems. And since fancy retry is enabled by default, a job that writes large data might keep retrying on every node until all nodes are down. This needs a better design.
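
For reference, a quick way to check both points on a node (a sketch only; the paths are examples and may differ per deployment):

# For 1: containers launched with --rm should not linger after exit;
# anything listed here was not cleaned up by the docker daemon.
docker ps -a --filter status=exited --format '{{.ID}}\t{{.Image}}\t{{.Status}}'
docker container prune -f    # reclaim space from leftover stopped containers

# For 2: see how much space the per-application code directories still take.
du -sh /tmp/pai-root/code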

Back to this issue: these dirs belong to an old version of PAI, so I think we can delete them directly.

@scarlett2018
Member

@mzmssg @ydye - is this issue trying to automatically clean up failed jobs? If so, I have concerns: many customers ask us to keep those logs so they can dig out the root cause.

@mzmssg
Member

mzmssg commented Feb 21, 2019

@scarlett2018 Nope, this issue is about disk management.
We now have a cleaner for it.

Closing, as this was resolved by #2119.
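
Not the actual #2119 implementation, just an illustration of the threshold-based idea it refers to: watch disk usage and reclaim space once it crosses a limit (the threshold and path below are made-up examples):

# Illustration only, not the #2119 cleaner.
THRESHOLD=90   # percent of disk used that triggers cleanup
USED=$(df --output=pcent /tmp/pai-root | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    docker container prune -f    # remove leftover stopped containers
    docker image prune -af       # remove unused images
fi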

@mzmssg mzmssg closed this as completed Feb 21, 2019