Refining handling of transcripts with missing sequence info and other fixes. #4817

jonn-smith · 2018-05-25T20:01:01Z

Fixes #4739
Refactored UTR VariantClassification handling.
Added a warning when a transcript in the UTR has no sequence info (this now matches the behavior in protein-coding regions).
Added tests to prevent regression on data source date comparison bug.
Can now run on large data.
Fixed DNA Repair Genes getter script.
Fixed an issue in COSMIC to make it robust to bad COSMIC data.
Gencode no longer crashes when given an indel that starts just before an exon.
Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
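
To illustrate the delimiter fix listed above, here is a minimal sketch; it is not the actual SimpleKeyXsvFuncotationFactory code. It shows why a pipe needs quoting when passed to `String.split`, which treats its argument as a regular expression. The class and method names are hypothetical, and the sample row is made up; `Pattern.quote` is the standard-library call that makes any delimiter literal.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class DelimiterSplitSketch {

    /** Splits a line on a delimiter that is always treated as literal text, never as a regex. */
    static String[] splitOnLiteralDelimiter(final String line, final String delimiter) {
        // The -1 limit keeps trailing empty columns instead of silently dropping them.
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(final String[] args) {
        final String row = "GENE1|DNA repair|";

        // Unquoted, "|" is a regex alternation of two empty patterns, so it matches between every character.
        System.out.println(Arrays.toString(row.split("|")));

        // Quoted, the pipe is a plain column separator.
        System.out.println(Arrays.toString(splitOnLiteralDelimiter(row, "|")));
    }
}
```

Quoting the delimiter once, at the point where it is turned into a pattern, means callers never need to know which characters are special in regular expressions.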

codecov-io · 2018-05-25T21:52:44Z

Codecov Report

Merging #4817 into master will decrease coverage by 0.116%.
The diff coverage is 57.971%.

@@              Coverage Diff               @@
##             master     #4817       +/-   ##
==============================================
- Coverage     80.36%   80.244%   -0.116%     
- Complexity    17606     17611        +5     
==============================================
  Files          1087      1088        +1     
  Lines         63748     63849      +101     
  Branches      10262     10276       +14     
==============================================
+ Hits          51228     51235        +7     
- Misses         8528      8619       +91     
- Partials       3992      3995        +3

Impacted Files	Coverage Δ	Complexity Δ
...ataSources/xsv/SimpleKeyXsvFuncotationFactory.java	`87.097% <100%> (ø)`	`27 <0> (ø)`	⬇️
...r/dataSources/cosmic/CosmicFuncotationFactory.java	`64.167% <20.833%> (-12.071%)`	`21 <0> (ø)`
...dataSources/gencode/GencodeFuncotationFactory.java	`82.504% <74.286%> (-0.478%)`	`152 <3> (ø)`
...e/hellbender/tools/funcotator/FuncotatorUtils.java	`80.029% <88.889%> (-0.058%)`	`157 <2> (+3)`
...spark/sv/utils/SingleSequenceReferenceAligner.java	`0% <0%> (ø)`	`0% <0%> (?)`
...utils/smithwaterman/SmithWatermanIntelAligner.java	`80% <0%> (+30%)`	`3% <0%> (+2%)`	⬆️

LeeTL1220

@jonn-smith Minor stuff and a couple of questions

LeeTL1220 · 2018-05-29T13:58:59Z

scripts/funcotator/data_sources/getDnaRepairGenes.py

@@ -53,23 +53,35 @@
        writer = csv.writer(f, delimiter='|', lineterminator="\n")

        isFirstRow = True


Thanks for additional comments.

LeeTL1220 · 2018-05-29T13:59:41Z

src/main/java/org/broadinstitute/hellbender/tools/funcotator/FuncotatorUtils.java

@@ -327,15 +327,20 @@ public static int getAlignedEndPosition(final int alleleEndPosition) {
    /**
     * Gets the sequence aligned position (1-based, inclusive) for the given coding sequence position.
     * This will produce the next lowest position evenly divisible by 3, such that a codon starting at this returned
-     * position would include the given position.
-     * @param position A sequence starting coordinate for which to produce an coding-aligned position.  Must be > 0.
+     * position would include the given position.  This can be a negative number, in which case the codon would start


Can you add a comment regarding UTRs and Flanks?

Since this is a general utility method I don't think it makes sense to talk much about specific use cases here. I'll add an example for upstream UTRs / Flanks in the negative position section.
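
For reference, a minimal sketch of the arithmetic under discussion; it is not the code in this PR. It assumes codons start at positions 1, 4, 7, and so on, and that the same spacing continues through 0 into negative positions (..., -5, -2, 1, ...), which is one plausible convention for upstream UTRs and flanks. The class name is hypothetical. `Math.floorMod` is used because Java's `%` operator can return a negative remainder for negative operands.

```java
// Hypothetical illustration; the convention for non-positive positions is an assumption.
class AlignedPositionSketch {

    /**
     * Returns the start of the codon containing the given 1-based position,
     * assuming codons start at ..., -5, -2, 1, 4, 7, ...
     */
    static int getAlignedPosition(final int position) {
        // Offset of the position within its codon: 0, 1, or 2, even for negative input.
        final int offsetInCodon = Math.floorMod(position - 1, 3);
        return position - offsetInCodon;
    }

    public static void main(final String[] args) {
        for (final int position : new int[] {-3, -2, 0, 1, 3, 4}) {
            System.out.println(position + " -> " + getAlignedPosition(position));
        }
    }
}
```

Under these assumptions, a position in an upstream flank such as -3 aligns to -5, while positions 1 through 3 all align to 1.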

LeeTL1220 · 2018-05-29T14:00:49Z

src/main/java/org/broadinstitute/hellbender/tools/funcotator/FuncotatorUtils.java

@@ -346,10 +351,21 @@ public static int getAlignedPosition(final int position) {
     */
    public static boolean isInFrameWithEndOfRegion(final int startPosition, final int regionLength) {



The javadoc is no longer correct for the start position.

LeeTL1220 · 2018-05-29T14:01:37Z

.../broadinstitute/hellbender/tools/funcotator/dataSources/cosmic/CosmicFuncotationFactory.java

+                }
+                catch (final IllegalArgumentException ex) {
+                    // If we have poorly bounded genomic positions, we need to warn the user and move on.
+                    // These may occur occasionally in the data.


Do you see this warning come up much on real data?

Yes. I added this in because of an issue I found in the COSMIC data set. I logged this as #4812
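
The warn-and-move-on pattern described here might look roughly like the following. This is a sketch, not the CosmicFuncotationFactory code: the `Interval` class, the `contig:start-end` input strings, and `parseValidIntervals` are all hypothetical, and `java.util.logging` stands in for the project's logger.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins for illustration; none of these names come from GATK.
class BadPositionSketch {
    private static final Logger logger = Logger.getLogger(BadPositionSketch.class.getName());
    private static final Pattern POSITION = Pattern.compile("([^:]+):(\\d+)-(\\d+)");

    /** Minimal interval type that rejects poorly bounded positions. */
    static final class Interval {
        final String contig;
        final int start;
        final int end;

        Interval(final String contig, final int start, final int end) {
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Invalid bounds: " + contig + ":" + start + "-" + end);
            }
            this.contig = contig;
            this.start = start;
            this.end = end;
        }
    }

    /** Parses "contig:start-end" strings, skipping (with a warning) any record that is malformed or badly bounded. */
    static List<Interval> parseValidIntervals(final List<String> rawPositions) {
        final List<Interval> intervals = new ArrayList<>();
        for (final String raw : rawPositions) {
            try {
                final Matcher m = POSITION.matcher(raw);
                if (!m.matches()) {
                    throw new IllegalArgumentException("Malformed position string");
                }
                // Integer.parseInt throws NumberFormatException, a subclass of IllegalArgumentException.
                intervals.add(new Interval(m.group(1), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
            }
            catch (final IllegalArgumentException ex) {
                // Warn and move on rather than failing the whole run on one bad record.
                logger.warning("Skipping record with bad genomic position '" + raw + "': " + ex.getMessage());
            }
        }
        return intervals;
    }

    public static void main(final String[] args) {
        final List<Interval> kept = parseValidIntervals(Arrays.asList("1:100-200", "2:500-400", "3:7-7"));
        System.out.println("Kept " + kept.size() + " of 3 records");
    }
}
```

Catching only `IllegalArgumentException` keeps the handling narrow: unrelated failures still surface instead of being swallowed with the bad records.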

LeeTL1220 · 2018-05-29T14:02:34Z

.../broadinstitute/hellbender/tools/funcotator/dataSources/cosmic/CosmicFuncotationFactory.java

            }
        }
        return null;
    }

+    /**
+     * Print the given {@link ResultSet} to stdout.
+     * @param resultSet The {@link ResultSet} to print.


Why do we need this method? Typically, we should log it (as a debug message, if necessary).

This is a debug method that I created to help track down issue #4812.

I'd prefer to leave it in since I've already created it. It may be useful later to debug other issues.

LeeTL1220 · 2018-05-29T14:03:12Z

...roadinstitute/hellbender/tools/funcotator/dataSources/gencode/GencodeFuncotationFactory.java

@@ -597,16 +597,27 @@ private GencodeFuncotation createGencodeFuncotationOnTranscript(final VariantCon
        // Find the sub-feature of transcript that contains our variant:
        final GencodeGtfFeature containingSubfeature = getContainingGtfSubfeature(variant, transcript);

-        // Make sure the start and end of the variant are both in the transcript:
+        // Make sure the sub-regions in the transcript actually contain the variant:
+        // TODO: this is slow, and repeats work that is done later in the process (we call getSortedCdsAndStartStopPositions when creating the sequence comparison)


Can you file an issue for this TODO?

LeeTL1220 · 2018-05-29T14:04:10Z

...roadinstitute/hellbender/tools/funcotator/dataSources/gencode/GencodeFuncotationFactory.java

+            if ( startPosInTranscript == -1 ) {
+                // we overlap an exon but we don't start in one.  Right now this case cannot be handled.
+                // Bubble up an exception and let the caller handle this case.
+                // TODO: fix this case, issue #4804 (https://github.com/broadinstitute/gatk/issues/4804)


Glad that #4804 is a beta blocker. No further action needed for this PR.

LeeTL1220 · 2018-05-29T14:05:40Z

...roadinstitute/hellbender/tools/funcotator/dataSources/gencode/GencodeFuncotationFactory.java

            else {
-                gencodeFuncotationBuilder.setVariantClassification(GencodeFuncotation.VariantClassification.FIVE_PRIME_UTR);
+                logger.warn("Attempted to process transcript information for transcript WITHOUT sequence data.  Ignoring sequence information for Gencode Transcript ID: " + transcript.getTranscriptId());


When would this happen? Not sure if any action is needed.

This would happen when the Gencode data source does not have a sequence for a given transcript. The information in the resulting funcotation will be correct, but it will be a subset of what a user may expect. This does happen in practice, so I wanted to make sure it was logged as a warning.
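
A rough sketch of that behavior, with hypothetical names throughout (`annotate`, the map of transcript IDs to sequences, and the output fields are illustrative, not Funcotator's): when no sequence is available, the position-based fields are still produced, the sequence-derived field is left empty, and a warning is logged. `java.util.logging` again stands in for the project's logger.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical illustration only; field names and structure are assumptions.
class MissingSequenceSketch {
    private static final Logger logger = Logger.getLogger(MissingSequenceSketch.class.getName());

    /**
     * Builds a small annotation for a variant at a 1-based position within a transcript.
     * Assumes 1 <= positionInTranscript <= sequence length when a sequence is present.
     */
    static Map<String, String> annotate(final String transcriptId,
                                        final int positionInTranscript,
                                        final Map<String, String> transcriptSequences) {
        final Map<String, String> annotation = new LinkedHashMap<>();

        // Fields that do not depend on the sequence are always filled in.
        annotation.put("transcriptId", transcriptId);
        annotation.put("position", Integer.toString(positionInTranscript));

        final String sequence = transcriptSequences.get(transcriptId);
        if (sequence == null) {
            // The result is still correct, just a subset of what a user may expect.
            logger.warning("No sequence data for transcript " + transcriptId
                    + "; leaving sequence-derived fields empty.");
            annotation.put("referenceBase", "");
        } else {
            annotation.put("referenceBase", String.valueOf(sequence.charAt(positionInTranscript - 1)));
        }
        return annotation;
    }
}
```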

LeeTL1220 · 2018-05-29T14:06:09Z

...roadinstitute/hellbender/tools/funcotator/dataSources/gencode/GencodeFuncotationFactory.java

@@ -1441,8 +1456,9 @@ static SequenceComparison createSequenceComparison(final VariantContext variant,
        if ( processSequenceInformation ) {
            if ( transcriptIdMap.containsKey(transcript.getTranscriptId()) ) {

+                final String transcriptSequence;


Why can't these two lines be merged back into one?

jonn-smith added Funcotator FuncotatorBetaBlocker labels May 25, 2018

jonn-smith assigned LeeTL1220 May 25, 2018

jonn-smith requested a review from LeeTL1220 May 25, 2018 20:01

LeeTL1220 reviewed May 29, 2018

Addressing code review comments.

6898cd8

jonn-smith merged commit c77101a into master May 29, 2018

jonn-smith deleted the jts_missing_transcript_fix_4739 branch May 29, 2018 19:37
