gCNV WDLs should clean up intermediate files and directories. #5382

Closed
vruano opened this issue Oct 31, 2018 · 4 comments

Comments

@vruano
Contributor

vruano commented Oct 31, 2018

Because of the way the calling process is sharded upstream ... the current WDL spends 80-90% of its time copying files ... so for a job that takes around 1 h, only 10 minutes are spent running the GATK tool; the rest is used to stage the input files.

For example, if the genome intervals were sharded into ~600 batches, this results in ~3600 files transferred one by one, each with its own gsutil command. They are not batched in part because files share names, and multi-file gsutil cp provides no way to specify an independent destination name for each input file. A recursive copy of a parent directory would drag in the data for all samples, when each task deals with just one.
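
To make the name-collision problem concrete, here is a minimal Python sketch of planning a unique local destination for each input, so that files with the same base name in different shard directories no longer overwrite each other. The bucket layout, the file names, and the `plan_destinations` helper are all hypothetical; this is not what the WDL does, and it only builds the mapping without running any copy.

```python
from pathlib import PurePosixPath


def plan_destinations(uris, dest_root="staged"):
    """Map each source URI to a unique local path (hypothetical helper).

    Assumes the parent directory name (e.g. a shard directory) is what
    distinguishes files that otherwise share a base name.
    """
    plan = {}
    used = set()
    for uri in uris:
        # Drop the scheme ("gs://") and treat the rest as a POSIX-style path.
        path = PurePosixPath(uri.split("://", 1)[-1])
        # Prefix the file name with its parent directory name to disambiguate.
        dest = f"{dest_root}/{path.parent.name}-{path.name}"
        if dest in used:
            raise ValueError(f"destination collision for {uri}: {dest}")
        used.add(dest)
        plan[uri] = dest
    return plan


if __name__ == "__main__":
    # Made-up URIs: two shards that both contain a file named "calls.tar".
    example = [
        "gs://example-bucket/shard-0000/calls.tar",
        "gs://example-bucket/shard-0001/calls.tar",
    ]
    for src, dst in plan_destinations(example).items():
        print(src, "->", dst)
```

With a mapping like this, the transfers could in principle be grouped or parallelized by whatever copy mechanism is available, rather than issued as one command per file.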

@samuelklee
Contributor

samuelklee commented Nov 1, 2018

Yes, also recall that @asmirnov239 tried to address #4397, but that led to #5217, so we reverted. We can try to address all of these issues again correctly if it's low-hanging fruit (which it probably is) and if it'll bring the overall cost of the pipeline down significantly. However, for the most part, I think bringing down costs in the gCNV step will have more impact.

Thanks for diagnosing and pointing out these issues. You should feel free to open PRs against the gCNV code as well!

@samuelklee
Contributor

@vruano As pointed out by @sooheelee, the current WDL does not clean up the intermediate CALLS_* and MODEL_* directories. This is fine for running on the cloud, but we should clean them up when running locally. Can you take care of this as well?
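
As a rough illustration of the cleanup being asked for, the sketch below removes directories matching `CALLS_*` and `MODEL_*` from a working directory. It is written in Python only for illustration; the `clean_intermediates` helper and its arguments are hypothetical, and the real fix would live in the WDL task's command block.

```python
import shutil
from pathlib import Path


def clean_intermediates(workdir, patterns=("CALLS_*", "MODEL_*")):
    """Delete intermediate directories matching the given glob patterns.

    Hypothetical helper: only top-level directories of `workdir` are
    considered, and regular files are left alone even if they match.
    Returns the names of the directories that were removed.
    """
    removed = []
    for pattern in patterns:
        for path in sorted(Path(workdir).glob(pattern)):
            if path.is_dir():
                shutil.rmtree(path)
                removed.append(path.name)
    return removed
```

Calling `clean_intermediates(".")` at the end of a local run would remove the matching directories and leave everything else in place.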

@samuelklee
Contributor

Dupe of #4397, but changing the name to reflect the issue mentioned in the previous comment.

@samuelklee samuelklee changed the title from "PostprocessGermlineCNVCalls (the enclosing wdl): too slow in Firecloud/cromwell." to "gCNV WDLs should clean up intermediate files and directories." Jan 31, 2019
@samuelklee samuelklee assigned samuelklee and unassigned asmirnov239 Feb 1, 2019
@samuelklee
Contributor

samuelklee commented Feb 15, 2019

CALLS_* and MODEL_* directories are actually cleaned up in #5414, but there are a few places where contig-ploidy calls are not cleaned up.

We could also clean up the out directories generated by DetermineGermlineContigPloidy and GermlineCNVCaller, since the contents of these are sliced and tarred, but it's arguably nice to have all of the output for each shard in a single directory.

samuelklee added a commit that referenced this issue Mar 12, 2019
* Cleaned up intermediate files in gCNV WDL and fixed miscellaneous typos. (#5382)

* Added output of MAD values as floats in somatic CNV WDL. (#5591)

* Exposed boot disk space for Oncotator in somatic CNV WDL. (#3566)

* Added check to skip outlier truncation if number of matrix elements exceeds Integer.MAX_VALUE in CreateReadCountPanelOfNormals. (#4734)

* Miscellaneous boy scout activities.

* Fixed some issues concerning intervals in DetermineGermlineContigPloidy documentation.

* Fixed non-kebab-case argument in CollectAllelicCountsSpark and other minor issues.

* Improved consistency of style and input/output validation across CNV tools. (#4825)