Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More funcotator fixes #4783

Merged
merged 8 commits into from
May 24, 2018
Merged

More funcotator fixes #4783

merged 8 commits into from
May 24, 2018

Conversation

LeeTL1220
Copy link
Contributor

These fixes are addressing issues for annotating germline VCFs.

@codecov-io
Copy link

codecov-io commented May 18, 2018

Codecov Report

Merging #4783 into master will increase coverage by 0.275%.
The diff coverage is 68.889%.

@@               Coverage Diff               @@
##              master     #4783       +/-   ##
===============================================
+ Coverage     80.111%   80.385%   +0.275%     
- Complexity     17442     17860      +418     
===============================================
  Files           1083      1087        +4     
  Lines          63165     64360     +1195     
  Branches       10193     10540      +347     
===============================================
+ Hits           50602     51736     +1134     
- Misses          8574      8581        +7     
- Partials        3989      4043       +54
Impacted Files Coverage Δ Complexity Δ
...ools/funcotator/FuncotatorArgumentDefinitions.java 87.234% <ø> (+2.128%) 1 <0> (ø) ⬇️
...org/broadinstitute/hellbender/engine/GATKTool.java 91.509% <ø> (ø) 94 <0> (ø) ⬇️
...itute/hellbender/tools/funcotator/Funcotation.java 33.333% <ø> (ø) 3 <0> (ø) ⬇️
.../tools/funcotator/vcfOutput/VcfOutputRenderer.java 81.818% <100%> (+0.686%) 9 <3> (+1) ⬆️
...tools/funcotator/DataSourceFuncotationFactory.java 90% <100%> (ø) 17 <0> (ø) ⬇️
.../tools/funcotator/dataSources/DataSourceUtils.java 63.095% <100%> (+6.305%) 40 <3> (+5) ⬆️
...llbender/utils/codecs/gencode/GencodeGtfCodec.java 55.833% <8.333%> (+0.505%) 49 <2> (ø) ⬇️
...titute/hellbender/tools/funcotator/Funcotator.java 84.722% <84.211%> (+8.371%) 32 <7> (+2) ⬆️
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
.../utils/read/markduplicates/LibraryIdGenerator.java 91.228% <0%> (-3.216%) 27% <0%> (+13%)
... and 32 more

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments.

Most importantly, I don't think the new VCF file you added (0816201804HC0_R01C01.pik3ca.vcf) is publicly releasable. Louis said that it wasn't and that we'd need to sanitize it first.

gencodeFuncotationFactories.add( (GencodeFuncotationFactory) factory );
}
}
dataSourceFactories.sort(DataSourceUtils::datasourceComparator);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add to this comment that gencode factories will always be first as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -471,8 +470,13 @@ private void enqueueAndHandleVariant(final VariantContext variant, final Referen
for ( final FeatureInput<? extends Feature> featureInput : manualLocatableFeatureInputs ) {
@SuppressWarnings("unchecked")
final List<Feature> featureList = (List<Feature>)featureContext.getValues(featureInput, hg19Interval);

// TODO: This is a little sloppy, since it checks every datasource twice. Once for hg19 contig names and once for b37 contig names.
featureList.addAll(featureContext.getValues(featureInput));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not detect the reference and only add in the features with the correct names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only going to file an issue for this.

#4798

Done.

// Make sure we don't double up on the Gencodes:
if ( funcotationFactory.getType().equals(FuncotatorArgumentDefinitions.DataSourceType.GENCODE) ) {
continue;
// Perform the actual annotation. Note that we leverage the ordering of datasources here.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add another comment here explicitly calling out that the comparator guarantees that gencode data sources are all first in the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


private static String retrieveSanitizedFuncotation(final Funcotation funcotation, final String manualAnnotationSerializedString) {
final String initialString = funcotation.serializeToVcfString(manualAnnotationSerializedString);
return StringUtils.replaceEach(initialString, new String[]{",", ";", "=", "\t", "|"}, new String[]{"_%2C_", "_%3B_", "_%3D_", "_%09_", "_%7C_"});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure the HTML escape sequences are what we want to put in here? Do we need to preserve all the original values for the escape chars or can we replace them all with some other char and move on (e.g. mapping most restricted chars to _ and another to something else)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Previous experience has taught me that the mapping should be invertible. I'm open to suggestions for changing this...

No action (for now).

@@ -63,7 +63,7 @@
static final Logger logger = LogManager.getLogger(GencodeGtfCodec.class);

public static final int GENCODE_GTF_MIN_VERSION_NUM_INCLUSIVE = 19;
public static final int GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE = 27;
public static final int GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE = 28;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our conversation earlier today, I think it makes sense to remove the GENCODE_GTF_MAX_VERSION_NUM_INCLUSIVE field altogether and just attempt to use the specified gencode.

We can add in a warning / exception when an error occurs to investigate the Gencode version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

private static final boolean doDebugTests = false;
private static final String LARGE_DATASOURCES_FOLDER = "funcotator_dataSources_latest";

private static final String PIK3CA_VCF_HG19 = toolsTestDir + "funcotator/0816201804HC0_R01C01.pik3ca.vcf";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this file is publicly releasable - I believe it needs to be sanitized before inclusion in our tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is publicly releasable...

No action.

variantContexts.add(vc);
}

int NUM_VARIANTS = 21;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

variantContexts.add(vc);
}

int NUM_VARIANTS = 100;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor comments. Is it really necessary to add ~400mb of additional references for this? Can we work with the existing reference files that we have or shrink them somehow?

*
* @return The {@link FeatureInput} used as the key for this data source.
*/
protected FeatureInput<? extends Feature> addFeatureInputsAfterInitialization(final String filePath, final String name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unused, lets delete it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -63,6 +63,7 @@ default void setFieldSerializationOverrideValues(final Map<String,String> fieldS
}

/**
* TODO: This should not be specific to a VCF. That should be the job of the VCFOutputRenderer to sanitize any strings. A Funcotation should not care whether it is being rendered to a VCF or MAF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a todo for the future or meant to be addressed in this pr? If it's for the future could we create a ticket and refer to it in the comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just forgot to file the issue. Done...
#4797

)
protected boolean includeFilteredVariants = false;
protected boolean ignoreFilteredVariants = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would removeFilteredVariants be a better name? From this it's unclear if it means don't funcotate them or don't emit them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Done


private static String retrieveSanitizedFuncotation(final Funcotation funcotation, final String manualAnnotationSerializedString) {
final String initialString = funcotation.serializeToVcfString(manualAnnotationSerializedString);
return StringUtils.replaceEach(initialString, new String[]{",", ";", "=", "\t", "|"}, new String[]{"_%2C_", "_%3B_", "_%3D_", "_%09_", "_%7C_"});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't actually support percent encoding in vcf yet, it's a 4.3 feature and we're only up to 4.2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that this comment was for our information. I just wanted a reversible transform. We will probably have to adjust for VCF 4.3. @lbergelson Should I take any more steps?

No action (for now).

- Ordering specified in the header did not match the variants and hg19/b37 flexibility was being applied to VCF datasources and inducing a lot of missed annotations.
- Sanitizing funcotations in VCF.
- Implemented lookahead caching and merged with master.
- Added Funcotator test for clinvar and Gencode v28 in hg38
- Added funcotator test to make sure that mixed chr/no-chr GENCODE and datasources would still annotate properly for hg19.
- Eased restrictions so that Gencode v28 would be recognized as a valid gtf
- Added warning so that future versions of Gencode would not fail just based on the version number and warning would be emitted instead (@jonn-smith)
- Altered the lookahead default value to specific for backend.
@LeeTL1220
Copy link
Contributor Author

@jonn-smith @lbergelson Why wouldn't it be publicly releasable? It's NA12878. I also confirmed with one of the engineers that there is no issue.

@LeeTL1220
Copy link
Contributor Author

@lbergelson @jonn-smith Regarding references: I am open to suggestions. These made life much easier. We would have to do some work to get rid of them. Can we file an issue and address later? Since these are in the large file repo, we do not need to worry about history bloat, correct?

@LeeTL1220
Copy link
Contributor Author

@jonn-smith @lbergelson Back to you guys. I have a couple of questions for y'all.

@LeeTL1220
Copy link
Contributor Author

LeeTL1220 commented May 22, 2018

@lbergelson @jonn-smith As per offline discussion, I will targz the references and untar as needed.

And fix tests.

@LeeTL1220
Copy link
Contributor Author

@jonn-smith @lbergelson Back to you.

Copy link
Collaborator

@jonn-smith jonn-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had one quick comment, but I think after that's addressed you should be good to go.

ArchiveEntry entry = null;
while ((entry = i.getNextEntry()) != null) {
if (!i.canReadEntryData(entry)) {
// log something?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - we should log something in this case. The tests could be affected by this, but may not be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@LeeTL1220 LeeTL1220 merged commit c8f53cd into master May 24, 2018
@LeeTL1220 LeeTL1220 deleted the ll_rando_funcotator_fixes branch May 24, 2018 14:33
LeeTL1220 added a commit that referenced this pull request May 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants