Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FuncotateSegments (minus WDL) #5941

Merged
merged 8 commits into from
May 24, 2019
Merged

FuncotateSegments (minus WDL) #5941

merged 8 commits into from
May 24, 2019

Conversation

LeeTL1220
Copy link
Contributor

Adds the FuncotateSegments tool.

FuncotateSegments does not support VCF input!

This tool will create two output files from a GATK seg file:

  • A simple TSV which has each segment of the input file funcotated with all the genes it overlaps and which gene/exon covers each breakpoint. The output format is meant to (closely) match Oncotator.
  • A gene list which has every gene, covered by a segment, listed with the segment that covers it. A gene can appear more than once if a segment breakpoint overlaps the gene (i.e. more than one segment overlaps the gene). The output format is meant to (closely) match Oncotator.
  • Output formats may change.
  • Input format is only seg files such as those generated from ModelSegments.

Dev and reviewer notes:

  • Includes refactoring to drive much of the GencodeFuncotation data solely from the transcript. As opposed to a mix of the transcript and gene. This does cause some changes to sorting of the GencodeFuncotations (easily seen in the other transcripts field). It turns out that the transcript type field has different values for each transcript. This causes many transcripts to no longer be categorized as protein coding. Therefore, the ground truth (mostly/totally in FuncotatorIntegrationTest) had to be modified. Please carefully review the ground truth changes.
  • Introduces the CompsiteOutputRenderer, which is composed of multiple output renderers. This is used when output type is SEG, so that it can write both output files simultaneously.
  • Introduces the GeneListOutputRenderer. This does not write anything to disk until the entire input file is processed. The actual writing happens during the close() command. This is necessary since it cannot actually render its output until all segments have been seen. This output renderer also relies heavily on specific funcotation fields being in the input FuncotationMap. Internally, the gene list output renderer uses the SimpleTsvOutputRenderer (see below) to do the actual writing.
  • Introduces the SimpleTsvOutputRenderer. This output renderer is very flexible and renders a tab-separated text file based on several output rules. Formats are driven through config files. And developers can limit the output columns to ignore extraneous funcotation fields. Note that excluded fields are honored, regardless. If a configuration + parameter combination would result in this class producing an empty file, an exception is thrown. More notes are in the javadocs of the class.
  • Currently, only the GencodeFuncotationFactory can actually funcotate segments.
  • Code base currently enforces only small mutations when running Funcotator (segs are funcotated as CANNOT_DETERMINE) and only segments when running FuncotateSegments (small mutations produce exception). This is enforced with flags in the code. The backend does not disallow a mixture for future use. This may prove important when funcotating CNVs from VCFs produced by tools other than ModelSegments.
  • Added copy creation method for FuncotationMap based on Kryo. Also, added the necessary Kryo registrations. This induced a new unit test to enforce any concrete implementations of Funcotation to be Kryo serializable. The unit test does a recursive search of the funcotator package. For all concrete implementations, it tracks whether this unit test tests the serialization. If not, it fails. Instructions for developers is present as comments in the code. This is a bit fragile, especially for developers that are using GATK as a library
  • See Add WDL support for FuncotateSegments #5921 for tracking FuncotateSegments WDL development.

Closes #4609

Output formats may change.

*`FuncotateSegments` does not support VCF input!*

This tool will create two output files from a GATK seg file:
- A simple TSV which has each segment of the input file funcotated with all the genes it overlaps and which gene/exon covers each breakpoint.  The output format is meant to (closely) match Oncotator.
- A gene list which has every gene, covered by a segment, listed with the segment that covers it.  A gene can appear more than once if a segment breakpoint overlaps the gene (i.e. more than one segment overlaps the gene).  The output format is meant to (closely) match Oncotator.
- Output formats may change.
- Input format is only seg files such as those generated from `ModelSegments`.

Dev and reviewer notes:
- Includes refactoring to drive much of the GencodeFuncotation data solely from the transcript.  As opposed to a mix of the transcript and gene.  This does cause some changes to sorting of the GencodeFuncotations (easily seen in the other transcripts field).  It turns out that the transcript type field has different values for each transcript.  This causes many transcripts to no longer be categorized as protein coding.  Therefore, the ground truth (mostly/totally in `FuncotatorIntegrationTest`) had to be modified.  *Please carefully review the ground truth changes*.
- Introduces the `CompsiteOutputRenderer`, which is composed of multiple output renderers.  This is used when output type is `SEG`, so that it can write both output files simultaneously.
- Introduces the `GeneListOutputRenderer`.  This does not write anything to disk until the entire input file is processed.  The actual writing happens during the `close()` command.  This is necessary since it cannot actually render its output until all segments have been seen.  This output renderer also relies heavily on specific funcotation fields being in the input `FuncotationMap`.  Internally, the gene list output renderer uses the `SimpleTsvOutputRenderer` (see below) to do the actual writing.
- Introduces the `SimpleTsvOutputRenderer`.  This output renderer is very flexible and renders a tab-separated text file based on several output rules.  Formats are driven through config files.  And developers can limit the output columns to ignore extraneous funcotation fields.  Note that excluded fields are honored, regardless.  If a configuration + parameter combination would result in this class producing an empty file, an exception is thrown.  More notes are in the javadocs of the class.
- Currently, only the `GencodeFuncotationFactory` can actually funcotate segments.
- Code base currently enforces only small mutations when running `Funcotator` (segs are funcotated as CANNOT_DETERMINE) and only segments when running `FuncotateSegments` (small mutations produce exception).  This is enforced with flags in the code.  The backend does not disallow a mixture for future use.  This may prove important when funcotating CNVs from VCFs produced by tools other than `ModelSegments`.
- Added copy creation method for FuncotationMap based on Kryo.  Also, added the necessary Kryo registrations.  This induced a new unit test to enforce any concrete implementations of `Funcotation` to be Kryo serializable.  The unit test does a recursive search of the funcotator package.  For all concrete implementations, it tracks whether this unit test tests the serialization.  If not, it fails.  Instructions for developers is present as comments in the code.  This is a bit fragile, especially for developers that are using GATK as a library
- See #5921 for tracking `FuncotateSegments` WDL development.

Closes #4609

Output formats may change.
@codecov
Copy link

codecov bot commented May 15, 2019

Codecov Report

Merging #5941 into master will increase coverage by 0.105%.
The diff coverage is 95.272%.

@@               Coverage Diff               @@
##              master     #5941       +/-   ##
===============================================
+ Coverage     86.823%   86.928%   +0.105%     
- Complexity     32344     32714      +370     
===============================================
  Files           1993      2013       +20     
  Lines         149460    151268     +1808     
  Branches       16502     16604      +102     
===============================================
+ Hits          129766    131495     +1729     
- Misses         13679     13718       +39     
- Partials        6015      6055       +40
Impacted Files Coverage Δ Complexity Δ
...decs/xsvLocatableTable/XsvLocatableTableCodec.java 81.675% <ø> (ø) 66 <0> (ø) ⬇️
...notatedinterval/SimpleAnnotatedIntervalWriter.java 77.778% <ø> (ø) 7 <0> (ø) ⬇️
...ncotator/mafOutput/MafOutputRendererConstants.java 99.029% <ø> (ø) 1 <0> (ø) ⬇️
...itute/hellbender/tools/funcotator/Funcotation.java 50% <ø> (ø) 3 <0> (ø) ⬇️
...llbender/tools/funcotator/FuncotatorConstants.java 75% <ø> (ø) 1 <0> (ø) ⬇️
...nterval/SimpleAnnotatedIntervalWriterUnitTest.java 98.148% <ø> (ø) 17 <0> (ø) ⬇️
...lbender/utils/variant/GATKVariantContextUtils.java 87.462% <ø> (+0.153%) 245 <0> (+1) ⬆️
...s/funcotator/BaseFuncotatorArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...es/xsv/LocatableXsvFuncotationFactoryUnitTest.java 96.815% <100%> (+0.083%) 16 <1> (+1) ⬆️
...uncotator/FuncotatorSegmentArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
... and 79 more

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few requested changes and questions. Might be a few design discussions in there, too.

I performed a spot check of the validation data. It seems to be different just in the order of other transcripts, except in the HG38 data set which now has actual information in it... It may be an artifact of when you created this branch.

One additional thought - any reason we don't create a SegmentVariantContext class that inherits from VariantContext for doing all this stuff? That way we can make it a little more specific.

build.gradle Show resolved Hide resolved
@@ -25,6 +40,41 @@

public GATKRegistrator() {}

/**
* Make sure that all FuncotationMap (which incl. all Funcotation concrete classes and members) classes are registered
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect to do this you'd need to use reflection to get at all concrete Funcotation classes available at runtime.

What is preventing someone from creating a new Funcotation that doesn't support kryo (and not updating this method)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also creates registration for dependencies such as Allele, etc. A new funcotation class (in theory) could need to add more to this list.

I could scan for all of the concrete classes, but then how would I create the function for a new instance?

The only thing preventing a developer from doing that is a failed test and/or a runtime error.

No action.

@@ -39,7 +40,6 @@
private static final Logger logger = LogManager.getLogger(AnnotatedIntervalCollection.class);

// Rename to annotated interval default.config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this comment now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import java.util.Arrays;
import java.util.List;

import static htsjdk.variant.variantcontext.Allele.UNSPECIFIED_ALTERNATE_ALLELE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this static import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private boolean isWriterInitialized;

private LinkedHashMap<String, String> columnNameToFuncotationFieldMap;
private Set<String> excludedOutputFields;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


private LinkedHashMap<String, String> columnNameToFuncotationFieldMap;
private Set<String> excludedOutputFields;
private Path outputFilePath;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

*
* When true, all funcotation fields (not used in an alias nor column name) will appear in the output file.
*/
private boolean isWriteFuncotationFieldsNotInConfig;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// Ensure that all transcript-allele combinations have the same fields inside the matching funcotations.
if (!txToFuncotationMap.doAllTxAlleleCombinationsHaveTheSameFields()) {
throw new GATKException.ShouldNeverReachHereException("The funcotation map cannot be written by this simple output renderer. This is almost certainly an issue for the GATK development team.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the error message include the fact that the fields differ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

@VisibleForTesting
static String[] splitAndTrim( String text, String separator ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input args can be final

@LeeTL1220
Copy link
Contributor Author

@droazen @jonn-smith I am rerunning failed travis checks, since the error was something about picard piping, which is quite unrelated to this branch, so I am betting a transient error.

@LeeTL1220
Copy link
Contributor Author

@jonn-smith Back to you...

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine now. I don't know why our codecov is failing, but I think you can go ahead and merge since the other tests pass.

@@ -96,6 +96,10 @@ public void traverse() {
});
}

protected <T extends Feature> SimpleInterval makeFeatureInterval(final T feature) {
return new SimpleInterval(feature);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of having this trivial method in the walker base class?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add javadoc explaining when someone might want to override this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I am still open to reverting it.

@@ -255,6 +255,8 @@ dependencies {
compile 'org.apache.commons:commons-math3:3.5'
compile 'org.apache.commons:commons-collections4:4.1'
compile 'org.apache.commons:commons-vfs2:2.0'
compile 'org.apache.commons:commons-configuration2:2.4'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you migrate to Owner instead of adding a new config library? I've suggested an approach that would allow you to use Owner in this ticket: #5963

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below.

No action.

@@ -255,6 +255,8 @@ dependencies {
compile 'org.apache.commons:commons-math3:3.5'
compile 'org.apache.commons:commons-collections4:4.1'
compile 'org.apache.commons:commons-vfs2:2.0'
compile 'org.apache.commons:commons-configuration2:2.4'
compile 'commons-beanutils:commons-beanutils:1.9.3'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you migrate to Owner instead of adding a new config library? I've suggested an approach that would allow you to use Owner in this ticket: #5963

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After offline discussions, we are not going to remove this dependency. It can be done in another PR.

No action.

@@ -325,6 +327,9 @@ dependencies {
// Required for SV Discovery machine learning
compile group: 'biz.k11i', name: 'xgboost-predictor', version: '0.3.0'

// natural sort
compile('net.grey-panther:natural-comparator:1.1')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's spend another 15 minutes or so trying to find a way to avoid adding this dependency...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, why can't we just use sequence dictionary order for the contigs? That seems safer, and we should have a sequence dictionary from the reference in this tool...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for gene names, not contigs. We have plenty of GATK code for sorting contigs.

No action.

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LeeTL1220 I've added a few comments of my own, mostly related to the new dependencies. I've proposed a possible way to use Owner for configuration in a comment above.

Copy link
Collaborator

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, provided we make an effort in the near future to address #5963 before we are locked in to the current format forever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Funcotator should support segment annotations.
3 participants