-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FuncotateSegments
(minus WDL)
#5941
Conversation
*`FuncotateSegments` does not support VCF input!* This tool will create two output files from a GATK seg file: - A simple TSV which has each segment of the input file funcotated with all the genes it overlaps and which gene/exon covers each breakpoint. The output format is meant to (closely) match Oncotator. - A gene list which has every gene, covered by a segment, listed with the segment that covers it. A gene can appear more than once if a segment breakpoint overlaps the gene (i.e. more than one segment overlaps the gene). The output format is meant to (closely) match Oncotator. - Output formats may change. - Input format is only seg files such as those generated from `ModelSegments`. Dev and reviewer notes: - Includes refactoring to drive much of the GencodeFuncotation data solely from the transcript. As opposed to a mix of the transcript and gene. This does cause some changes to sorting of the GencodeFuncotations (easily seen in the other transcripts field). It turns out that the transcript type field has different values for each transcript. This causes many transcripts to no longer be categorized as protein coding. Therefore, the ground truth (mostly/totally in `FuncotatorIntegrationTest`) had to be modified. *Please carefully review the ground truth changes*. - Introduces the `CompsiteOutputRenderer`, which is composed of multiple output renderers. This is used when output type is `SEG`, so that it can write both output files simultaneously. - Introduces the `GeneListOutputRenderer`. This does not write anything to disk until the entire input file is processed. The actual writing happens during the `close()` command. This is necessary since it cannot actually render its output until all segments have been seen. This output renderer also relies heavily on specific funcotation fields being in the input `FuncotationMap`. Internally, the gene list output renderer uses the `SimpleTsvOutputRenderer` (see below) to do the actual writing. - Introduces the `SimpleTsvOutputRenderer`. This output renderer is very flexible and renders a tab-separated text file based on several output rules. Formats are driven through config files. And developers can limit the output columns to ignore extraneous funcotation fields. Note that excluded fields are honored, regardless. If a configuration + parameter combination would result in this class producing an empty file, an exception is thrown. More notes are in the javadocs of the class. - Currently, only the `GencodeFuncotationFactory` can actually funcotate segments. - Code base currently enforces only small mutations when running `Funcotator` (segs are funcotated as CANNOT_DETERMINE) and only segments when running `FuncotateSegments` (small mutations produce exception). This is enforced with flags in the code. The backend does not disallow a mixture for future use. This may prove important when funcotating CNVs from VCFs produced by tools other than `ModelSegments`. - Added copy creation method for FuncotationMap based on Kryo. Also, added the necessary Kryo registrations. This induced a new unit test to enforce any concrete implementations of `Funcotation` to be Kryo serializable. The unit test does a recursive search of the funcotator package. For all concrete implementations, it tracks whether this unit test tests the serialization. If not, it fails. Instructions for developers is present as comments in the code. This is a bit fragile, especially for developers that are using GATK as a library - See #5921 for tracking `FuncotateSegments` WDL development. Closes #4609 Output formats may change.
Codecov Report
@@ Coverage Diff @@
## master #5941 +/- ##
===============================================
+ Coverage 86.823% 86.928% +0.105%
- Complexity 32344 32714 +370
===============================================
Files 1993 2013 +20
Lines 149460 151268 +1808
Branches 16502 16604 +102
===============================================
+ Hits 129766 131495 +1729
- Misses 13679 13718 +39
- Partials 6015 6055 +40
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few requested changes and questions. Might be a few design discussions in there, too.
I performed a spot check of the validation data. It seems to be different just in the order of other transcripts, except in the HG38 data set which now has actual information in it... It may be an artifact of when you created this branch.
One additional thought - any reason we don't create a SegmentVariantContext class that inherits from VariantContext for doing all this stuff? That way we can make it a little more specific.
src/main/java/org/broadinstitute/hellbender/engine/FeatureWalker.java
Outdated
Show resolved
Hide resolved
@@ -25,6 +40,41 @@ | |||
|
|||
public GATKRegistrator() {} | |||
|
|||
/** | |||
* Make sure that all FuncotationMap (which incl. all Funcotation concrete classes and members) classes are registered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect to do this you'd need to use reflection to get at all concrete Funcotation
classes available at runtime.
What is preventing someone from creating a new Funcotation
that doesn't support kryo (and not updating this method)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also creates registration for dependencies such as Allele
, etc. A new funcotation class (in theory) could need to add more to this list.
I could scan for all of the concrete classes, but then how would I create the function for a new instance?
The only thing preventing a developer from doing that is a failed test and/or a runtime error.
No action.
@@ -39,7 +40,6 @@ | |||
private static final Logger logger = LogManager.getLogger(AnnotatedIntervalCollection.class); | |||
|
|||
// Rename to annotated interval default.config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this comment now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import java.util.Arrays; | ||
import java.util.List; | ||
|
||
import static htsjdk.variant.variantcontext.Allele.UNSPECIFIED_ALTERNATE_ALLELE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this static import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
private boolean isWriterInitialized; | ||
|
||
private LinkedHashMap<String, String> columnNameToFuncotationFieldMap; | ||
private Set<String> excludedOutputFields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
private LinkedHashMap<String, String> columnNameToFuncotationFieldMap; | ||
private Set<String> excludedOutputFields; | ||
private Path outputFilePath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* | ||
* When true, all funcotation fields (not used in an alias nor column name) will appear in the output file. | ||
*/ | ||
private boolean isWriteFuncotationFieldsNotInConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// Ensure that all transcript-allele combinations have the same fields inside the matching funcotations. | ||
if (!txToFuncotationMap.doAllTxAlleleCombinationsHaveTheSameFields()) { | ||
throw new GATKException.ShouldNeverReachHereException("The funcotation map cannot be written by this simple output renderer. This is almost certainly an issue for the GATK development team."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make the error message include the fact that the fields differ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
@VisibleForTesting | ||
static String[] splitAndTrim( String text, String separator ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input args can be final
@droazen @jonn-smith I am rerunning failed travis checks, since the error was something about picard piping, which is quite unrelated to this branch, so I am betting a transient error. |
@jonn-smith Back to you... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine now. I don't know why our codecov is failing, but I think you can go ahead and merge since the other tests pass.
@@ -96,6 +96,10 @@ public void traverse() { | |||
}); | |||
} | |||
|
|||
protected <T extends Feature> SimpleInterval makeFeatureInterval(final T feature) { | |||
return new SimpleInterval(feature); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of having this trivial method in the walker base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add javadoc explaining when someone might want to override this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I am still open to reverting it.
@@ -255,6 +255,8 @@ dependencies { | |||
compile 'org.apache.commons:commons-math3:3.5' | |||
compile 'org.apache.commons:commons-collections4:4.1' | |||
compile 'org.apache.commons:commons-vfs2:2.0' | |||
compile 'org.apache.commons:commons-configuration2:2.4' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you migrate to Owner instead of adding a new config library? I've suggested an approach that would allow you to use Owner in this ticket: #5963
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
No action.
@@ -255,6 +255,8 @@ dependencies { | |||
compile 'org.apache.commons:commons-math3:3.5' | |||
compile 'org.apache.commons:commons-collections4:4.1' | |||
compile 'org.apache.commons:commons-vfs2:2.0' | |||
compile 'org.apache.commons:commons-configuration2:2.4' | |||
compile 'commons-beanutils:commons-beanutils:1.9.3' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you migrate to Owner instead of adding a new config library? I've suggested an approach that would allow you to use Owner in this ticket: #5963
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After offline discussions, we are not going to remove this dependency. It can be done in another PR.
No action.
@@ -325,6 +327,9 @@ dependencies { | |||
// Required for SV Discovery machine learning | |||
compile group: 'biz.k11i', name: 'xgboost-predictor', version: '0.3.0' | |||
|
|||
// natural sort | |||
compile('net.grey-panther:natural-comparator:1.1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's spend another 15 minutes or so trying to find a way to avoid adding this dependency...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why can't we just use sequence dictionary order for the contigs? That seems safer, and we should have a sequence dictionary from the reference in this tool...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for gene names, not contigs. We have plenty of GATK code for sorting contigs.
No action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LeeTL1220 I've added a few comments of my own, mostly related to the new dependencies. I've proposed a possible way to use Owner for configuration in a comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, provided we make an effort in the near future to address #5963 before we are locked in to the current format forever.
Adds the
FuncotateSegments
tool.FuncotateSegments
does not support VCF input!This tool will create two output files from a GATK seg file:
ModelSegments
.Dev and reviewer notes:
FuncotatorIntegrationTest
) had to be modified. Please carefully review the ground truth changes.CompsiteOutputRenderer
, which is composed of multiple output renderers. This is used when output type isSEG
, so that it can write both output files simultaneously.GeneListOutputRenderer
. This does not write anything to disk until the entire input file is processed. The actual writing happens during theclose()
command. This is necessary since it cannot actually render its output until all segments have been seen. This output renderer also relies heavily on specific funcotation fields being in the inputFuncotationMap
. Internally, the gene list output renderer uses theSimpleTsvOutputRenderer
(see below) to do the actual writing.SimpleTsvOutputRenderer
. This output renderer is very flexible and renders a tab-separated text file based on several output rules. Formats are driven through config files. And developers can limit the output columns to ignore extraneous funcotation fields. Note that excluded fields are honored, regardless. If a configuration + parameter combination would result in this class producing an empty file, an exception is thrown. More notes are in the javadocs of the class.GencodeFuncotationFactory
can actually funcotate segments.Funcotator
(segs are funcotated as CANNOT_DETERMINE) and only segments when runningFuncotateSegments
(small mutations produce exception). This is enforced with flags in the code. The backend does not disallow a mixture for future use. This may prove important when funcotating CNVs from VCFs produced by tools other thanModelSegments
.Funcotation
to be Kryo serializable. The unit test does a recursive search of the funcotator package. For all concrete implementations, it tracks whether this unit test tests the serialization. If not, it fails. Instructions for developers is present as comments in the code. This is a bit fragile, especially for developers that are using GATK as a libraryFuncotateSegments
WDL development.Closes #4609
Output formats may change.