
Are there any policies for yarn to clean the completed or failed job's data? #1760

Closed
ydye opened this issue Nov 27, 2018 · 5 comments

@ydye
Contributor

ydye commented Nov 27, 2018

  1. Clean up finished containers.

Recently we found that some jobs write large amounts of data into the container file system. Docker maps the container's file system to a path on the local host, so if a finished job's data isn't cleaned up, it can take up a lot of disk space.

  2. Clean up YARN local data under the path /tmp/pai-root/.

In our test bed we found a lot of historical data under this path, and some of it is very large. We believe we need a policy to remove this data, but we are not sure whether YARN or PAI has already implemented one. (A manual cleanup sketch follows the du output below.)

root@xxxxxxxxxxxxxx:/# du -h --max-depth=1 /tmp/pai-root/code | grep G
1.3G    /tmp/pai-root/code/application_1534124332808_1913
1.3G    /tmp/pai-root/code/application_1534124332808_1918
1.3G    /tmp/pai-root/code/application_1534124332808_1920
103G    /tmp/pai-root/code/application_1535954247490_2193
1.3G    /tmp/pai-root/code/application_1534124332808_1929
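For reference, a minimal manual-cleanup sketch for directories like these, assuming a per-application directory is safe to remove once its application has finished (the 30-day age cutoff is only an example, not something YARN or PAI guarantees):

# Show the largest per-application directories under the code path.
du -h --max-depth=1 /tmp/pai-root/code | sort -rh | head -20

# Remove application directories untouched for 30 days, on the assumption
# that anything that old belongs to a finished application.
find /tmp/pai-root/code -mindepth 1 -maxdepth 1 -type d \
     -name 'application_*' -mtime +30 -exec rm -rf {} +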
@ydye ydye changed the title Are there any polices for yarn to clean the completed or failed job's data? Are there any policies for yarn to clean the completed or failed job's data? Nov 27, 2018
@mzmssg
Member

mzmssg commented Nov 27, 2018

Is this a container path or a host path? We moved codeDir to the YARN local path in #1100; I suspect these dirs are residue from older jobs.

@ydye
Contributor Author

ydye commented Nov 27, 2018

It's a host path.

@mzmssg
Member

mzmssg commented Nov 27, 2018

For 1: We add --rm when launching the job, so shouldn't finished containers be cleaned up by the Docker daemon?
For 2: In the current release we download code to the YARN local directory, and YARN should clean it up regardless of the job's result. Unfortunately, under disk pressure Kubernetes might kill YARN before that cleanup happens, which is a serious conflict between the two systems. And since fancy retry is enabled by default, a job that writes large data might keep retrying on every node until all nodes are down. This needs a better design.
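
For reference, a quick way to check both points on a node (a sketch only; the paths are examples and may differ per deployment):

# For 1: containers launched with --rm should not linger after exit;
# anything listed here was not cleaned up by the docker daemon.
docker ps -a --filter status=exited --format '{{.ID}}\t{{.Image}}\t{{.Status}}'
docker container prune -f    # reclaim space from leftover stopped containers

# For 2: see how much space the per-application code directories still take.
du -sh /tmp/pai-root/code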

Back to this issue: these dirs belong to an old version of PAI, so I think we can delete them directly.

@scarlett2018
Member

@mzmssg @ydye - is this issue trying to automatically clean up failed jobs? If so, I have concerns: many customers ask us to keep those logs so they can dig out the root cause.

@mzmssg
Member

mzmssg commented Feb 21, 2019

@scarlett2018 Nope, this issue is about disk management.
We now have a cleaner for it.

Closing, as this was resolved by #2119.
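
Not the actual #2119 implementation, just an illustration of the threshold-based idea it refers to: watch disk usage and reclaim space once it crosses a limit (the threshold and path below are made-up examples):

# Illustration only, not the #2119 cleaner.
THRESHOLD=90   # percent of disk used that triggers cleanup
USED=$(df --output=pcent /tmp/pai-root | tail -1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    docker container prune -f    # remove leftover stopped containers
    docker image prune -af       # remove unused images
fi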

@mzmssg mzmssg closed this as completed Feb 21, 2019