Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring gCNV WDL #5176

Merged
merged 1 commit into from
Sep 24, 2018
Merged

Refactoring gCNV WDL #5176

merged 1 commit into from
Sep 24, 2018

Conversation

asmirnov239
Copy link
Collaborator

@asmirnov239 asmirnov239 commented Sep 11, 2018

Addresses #4397 and #5054.

Restructured gCNV WDLs to pass data more efficiently to postprocessing tasks, and added a wrapper workflow for gCNV case WDL that scatters samples in multiple blocks. Also cleaned up some of the unused cromwell travis tests.

@codecov-io
Copy link

codecov-io commented Sep 11, 2018

Codecov Report

Merging #5176 into master will increase coverage by 0.223%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##              master    #5176       +/-   ##
==============================================
+ Coverage     86.798%   87.02%   +0.223%     
- Complexity     29711    31801     +2090     
==============================================
  Files           1820     1834       +14     
  Lines         137406   143956     +6550     
  Branches       15160    16642     +1482     
==============================================
+ Hits          119265   125271     +6006     
- Misses         12637    13037      +400     
- Partials        5504     5648      +144
Impacted Files Coverage Δ Complexity Δ
...ead/markduplicates/sparkrecords/EmptyFragment.java 80% <0%> (-4.615%) 10% <0%> (+5%)
...stitute/hellbender/engine/ReferenceFileSource.java 60.87% <0%> (-2.767%) 7% <0%> (+3%)
...ils/read/markduplicates/sparkrecords/Fragment.java 90.909% <0%> (-1.948%) 12% <0%> (+6%)
...tute/hellbender/utils/GenomeLocParserUnitTest.java 86.582% <0%> (-1.853%) 52% <0%> (+6%)
...tools/walkers/haplotypecaller/HaplotypeCaller.java 98.182% <0%> (-1.818%) 37% <0%> (+13%)
...s/copynumber/AnnotateIntervalsIntegrationTest.java 91.026% <0%> (-1.78%) 14% <0%> (+3%)
...rg/broadinstitute/hellbender/utils/Nucleotide.java 92.614% <0%> (-1.46%) 75% <0%> (+21%)
...ols/copynumber/formats/CopyNumberFormatsUtils.java 63.636% <0%> (-1.364%) 5% <0%> (+1%)
...es/xsv/LocatableXsvFuncotationFactoryUnitTest.java 90.411% <0%> (-1.256%) 25% <0%> (+12%)
...der/engine/spark/ShuffleJoinReadsWithRefBases.java 95.745% <0%> (-1.225%) 14% <0%> (+4%)
... and 88 more

@asmirnov239
Copy link
Collaborator Author

@samuelklee Could you please review? Also @mwalker174 can you see if this conflicts with your changes?

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor typos and WDL style issues. Otherwise, looks good to me---thanks for making these changes! Perhaps @mwalker174 can take a run through?

- Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl``
- Calling a case sample using a previously built model for denoising: ``cnv_germline_case_workflow.wdl``
- Cohort WDL: Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl``
- Case WDL: calling case samples using a previously built model for denoising: ``cnv_germline_case_workflow.wdl``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling -> Calling

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

- Calling a case sample using a previously built model for denoising: ``cnv_germline_case_workflow.wdl``
- Cohort WDL: Calling a cohort of samples and building a model for denoising further case samples: ``cnv_germline_cohort_workflow.wdl``
- Case WDL: calling case samples using a previously built model for denoising: ``cnv_germline_case_workflow.wdl``
- Scattered case WDL(recommended): functionally equivalent to case WDL, written for reducing cloud compute cost (see below) and wall-clock time ``cnv_germline_case_scattered_workflow.wdl``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDL(recommended): functionally -> WDL (recommended): Functionally

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


- ``CNVGermlineCaseScatteredWorkflow.num_samples_per_scatter_block`` -- (recommended WES value=25) number of samples to process in a single block; blocks of this size will be sent to germline case workflow and processed in a batch;
- ``CNVGermlineCaseScatteredWorkflow.preemptible_attempts`` -- (recommended value=5) this allows to reduce cost by using preemptible instances
- ``CNVGermlineCaseScatteredWorkflow.mem_gb_for_determine_germline_contig_ploidy`` -- amount of memory alloted for ploidy determination tasks (the lower the cheaper)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alloted -> allotted (here and below)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


#### Required parameters in the scattered germline case workflow

Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters: -> Same required parameters as in the germline case workflow. However, in order to reduce wall-clock time and compute cost, it is recommended to optimize for the following parameters:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Same required parameters as in germline case workflow, however in order to reduce wall-clock time and compute cost it is recommended to optimize for the following parameters:

- ``CNVGermlineCaseScatteredWorkflow.num_samples_per_scatter_block`` -- (recommended WES value=25) number of samples to process in a single block; blocks of this size will be sent to germline case workflow and processed in a batch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to germline case -> to the germline case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

File ref_fasta_fai
File ref_fasta
String gatk_docker
Int num_samples_per_scatter_block = 25 #recommended value for WES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I've lost track of how we are supposed to specify default parameter values in WDLs. Having these default values here seems idiosyncratic compared to the other CNV WDLs. Can you not push this down to the task level? Perhaps @LeeTL1220 can comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems more consistent to keep num_samples_per_scatter_block as required with no default (since this parameter is pretty fundamental to this WDL), and specify the default for preemptible_attempts in the task using select_first as we do elsewhere. You've given the defaults for these parameters in the documentation, and we can eventually provide them in the default json for FireCloud, so I think we don't need to set them here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that's better. Left num_samples_per_scatter_block with no default and pushed down default for preemptible_attempts into select_first of each task

#### optional basic arguments ####
##################################
File? gatk4_jar_override
Int preemptible_attempts = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

File input_array_file = write_lines(input_array)

command <<<
python <<CODE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a comment here describing what the python code is doing would be helpful.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, done

@@ -1,4 +1,5 @@
# Workflow for running GATK GermlineCNVCaller on a single case sample. Supports both WGS and WES.
# Workflow for running GATK GermlineCNVCaller on a multiple case samples using a trained model (obtained from running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a multiple case samples -> on multiple case samples

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Int num_inputs_in_scatter_block
String gatk_docker

Int machine_mem_mb = 4000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I think it is probably fine to hardcode defaults for this simple task (rather than use select_first and expose these parameters at the workflow level), since it is unlikely that we'll need to adjust these for any given run.

}

output {
Array[Array[String]] splitArray = read_tsv("input_array_split.tsv")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitArray -> split_array

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

##############################################
#### optional arguments for CollectCounts ####
##############################################
String? collect_counts_format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason why you changed format -> collect_counts_format? should we make the same change over in the somatic WDLs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was confused by it's occurrence somewhere out of the context of CollectCounts task, so I decided to make it more specific. I can make the same change in the somatic WDLs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in the somatic workflows as well

-Added multi-sample functionality to gCNV case mode WDL, and added a wrapper for gCNV case mode WDL to help optimize cloud computation cost. Also optimized how data is sent to postprocessing task in gCNV WDLs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants