SVAnnotate: Functional annotations for SVs called by GATK-SV #7431

epiercehoffman · 2021-08-20T19:06:26Z

Add SVAnnotate for functional annotation of SV VCFs from GATK-SV (after preprocessing to correct END keys, etc.). Posting as a draft for pre-review of work-in-progress code.

mwalker174

This is a great start! I have mostly stylistic comments at this point.

It does seem like there are parallels among the annotateX methods resulting in some repeated code. You could think about loading some of that logic into an abstract annotator class. For each SV type you could then have a subclass that implements more specific logic. That is, implement a common public String annotate() function that looks a lot like your annotateDUP() and annotateINV() methods but calls abstract functions getConsequenceSpanning(), getConsequenceSingleBreakend(), getConsequenceDoubleBreakendInExon(), getConsequenceDoubleBreakendInUtr(), etc. for each case. Not necessary, but something to think about as this becomes more complex.
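
To make the suggestion concrete, here is a minimal sketch of the abstract-annotator idea. It is only an illustration under stated assumptions, not code from this PR: `Span`, `AbstractTranscriptAnnotator`, and `DuplicationAnnotator` are hypothetical stand-ins, the consequence strings are placeholders rather than the tool's real mapping, and the "touches any exon, else UTR" split is a deliberate simplification. Only the hook names come from the comment above.

```java
import java.util.List;

/** Hypothetical stand-in for a closed interval on a single contig. */
final class Span {
    final int start;
    final int end;

    Span(final int start, final int end) {
        this.start = start;
        this.end = end;
    }

    boolean contains(final Span other) {
        return start <= other.start && end >= other.end;
    }

    boolean overlaps(final Span other) {
        return start <= other.end && other.start <= end;
    }

    boolean containsPosition(final int position) {
        return start <= position && position <= end;
    }
}

/** Shared control flow lives here; subclasses supply only the per-type consequences. */
abstract class AbstractTranscriptAnnotator {

    public final String annotate(final Span variant, final Span transcript, final List<Span> exons) {
        if (!variant.overlaps(transcript)) {
            return null; // nothing to annotate for this transcript
        }
        if (variant.contains(transcript)) {
            return getConsequenceSpanning();
        }
        // An overlapping, non-spanning variant has at least one breakend inside the transcript.
        final boolean startInside = transcript.containsPosition(variant.start);
        final boolean endInside = transcript.containsPosition(variant.end);
        if (startInside != endInside) {
            return getConsequenceSingleBreakend();
        }
        // Both breakends are inside: split on whether any exon is touched (simplified).
        final boolean touchesExon = exons.stream().anyMatch(variant::overlaps);
        return touchesExon ? getConsequenceDoubleBreakendInExon() : getConsequenceDoubleBreakendInUtr();
    }

    protected abstract String getConsequenceSpanning();
    protected abstract String getConsequenceSingleBreakend();
    protected abstract String getConsequenceDoubleBreakendInExon();
    protected abstract String getConsequenceDoubleBreakendInUtr();
}

/** Placeholder consequence values for illustration only. */
final class DuplicationAnnotator extends AbstractTranscriptAnnotator {
    @Override protected String getConsequenceSpanning() { return "COPY_GAIN"; }
    @Override protected String getConsequenceSingleBreakend() { return "DUP_PARTIAL"; }
    @Override protected String getConsequenceDoubleBreakendInExon() { return "INT_EXON_DUP"; }
    @Override protected String getConsequenceDoubleBreakendInUtr() { return "UTR"; }
}
```

With this shape, adding a new SV type means writing one small subclass, and the branching on spanning, single breakend, and double breakend is tested once.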

mwalker174 · 2021-08-25T15:38:50Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+
+    @Argument(
+            fullName="proteinCodingGTF",
+            shortName="P",


We typically don't use short names for non-standard arguments, just so the commands are easier to interpret at a glance.

mwalker174 · 2021-08-25T15:39:45Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+    private VariantContextWriter vcfWriter = null;
+
+    private OverlapDetector<GencodeGtfGeneFeature> gtfOverlapDetector;
+
+    private final Set<String> MSVExonOverlapClassifications = new HashSet<>(Arrays.asList(GATKSVVCFConstants.LOF, GATKSVVCFConstants.INT_EXON_DUP, GATKSVVCFConstants.DUP_PARTIAL, GATKSVVCFConstants.PARTIAL_EXON_DUP, GATKSVVCFConstants.COPY_GAIN));


Can delete the blank lines between these

Can use Sets.newHashSet()

mwalker174 · 2021-08-25T15:40:12Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+
+    @Override
+    public void onTraversalStart() {
+        FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile);


final … throughout

mwalker174 · 2021-08-25T15:45:10Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>();
+        proteinCodingGTFSource.forEach(gtfFeaturesList::add);  // TODO: faster method?


Since FeatureDataSource implements Iterable, you can do this in one line with Google's common Lists class like so:

List<GencodeGtfGeneFeature> gtfFeaturesList = Lists.newArrayList(proteinCodingGTFSource);

mwalker174 · 2021-08-25T15:46:49Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile);
+        List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>();
+        proteinCodingGTFSource.forEach(gtfFeaturesList::add);  // TODO: faster method?
+        gtfOverlapDetector = OverlapDetector.create(gtfFeaturesList);


I would not declare the gtfFeaturesList since it's only used once, and just put the newArrayList() function in this line, for the sake of code economy.

mwalker174 · 2021-08-25T16:22:25Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        if (!variantConsequenceDict.containsKey(key)) {
+            variantConsequenceDict.put(key, new HashSet<>());
+        }


Can replace with variantConsequenceDict.putIfAbsent()
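
A minimal sketch of what that could look like, using only `java.util`. The class and method names are hypothetical. The second variant uses `computeIfAbsent`, which is not what the comment asks for but is a common alternative: it allocates the set only when the key is missing and lets the add happen in the same statement.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ConsequenceDictSketch {

    /** Adds value under key, creating the set on first use via putIfAbsent. */
    static void addWithPutIfAbsent(final Map<String, Set<String>> variantConsequenceDict,
                                   final String key, final String value) {
        variantConsequenceDict.putIfAbsent(key, new HashSet<>());
        variantConsequenceDict.get(key).add(value);
    }

    /** Same effect in one statement; the new set is only created when the key is absent. */
    static void addWithComputeIfAbsent(final Map<String, Set<String>> variantConsequenceDict,
                                       final String key, final String value) {
        variantConsequenceDict.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }
}
```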

mwalker174 · 2021-08-25T16:26:30Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+    private String annotateINS(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) {
+        return annotateDEL(variantInterval, gtfTranscript);
+    }


I think it would be clearer to have a single function annotateDeletionOrInsertion

mwalker174 · 2021-08-25T16:27:21Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        variantConsequenceDict.get(key).add(value);
+    }
+
+    private String annotateDEL(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) {


I would spell out the type i.e. annotateDeletion

mwalker174 · 2021-08-25T16:31:23Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        if (!variantInterval.getContig().equals(featureInterval.getContig())) {
+            return false;
+        } else {
+            return variantInterval.getStart() <= featureInterval.getStart() && variantInterval.getEnd() >= featureInterval.getEnd();
+        }


Can just use Locatable's contains() method
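
For reference, the if/else above collapses to a single boolean expression. The sketch below uses a hypothetical `ContigInterval` stand-in so it runs without htsjdk; it mirrors my understanding of the containment check (same contig, start at or before, end at or after) and is not the library's source. In the tool itself the call would presumably be `variantInterval.contains(featureInterval)`.

```java
import java.util.Objects;

/** Hypothetical stand-in for a closed interval with a contig name. */
final class ContigInterval {
    final String contig;
    final int start;
    final int end;

    ContigInterval(final String contig, final int start, final int end) {
        this.contig = contig;
        this.start = start;
        this.end = end;
    }

    /** True if other lies entirely within this interval on the same contig. */
    boolean contains(final ContigInterval other) {
        return Objects.equals(contig, other.contig)
                && start <= other.start
                && end >= other.end;
    }
}
```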

mwalker174 · 2021-08-25T16:50:38Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        }
+    }
+
+    protected final static int countBreakpointsInsideFeature(SimpleInterval variantInterval, SimpleInterval featureInterval) {


I think you'll want countBreakendsInsideFeature - see https://bioconductor.org/packages/devel/bioc/vignettes/StructuralVariantAnnotation/inst/doc/vignettes.html#Representing_structural_variants_in_VCF

tedsharpe

Just a couple of trivial comments from me, but this looks fine. You can merge from my point of view. A couple of more general suggestions, though, which you can feel free to ignore:

Have you carefully considered the off-by-one errors that can occur because of the differing standards for representing intervals (0-based vs. 1-based coordinates and open ended vs. closed)? I believe that the IntervalTree class on which the OverlapDetector is based makes some assumptions about intervals being half-open in testing for overlapping intervals. Do all of your input files adhere to the same standard? (Apologies if you have unit tests for this--I didn't examine them carefully.)
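
A small sketch of the kind of off-by-one this refers to. VCF and GTF positions are 1-based and closed, while BED is 0-based and half-open. The class and method names below are hypothetical and the code is independent of IntervalTree and OverlapDetector; it only shows that the same two numbers give different overlap answers depending on which convention the test assumes.

```java
final class CoordinateConventionSketch {

    /** Overlap test for closed intervals [start, end], e.g. 1-based VCF or GTF coordinates. */
    static boolean overlapsClosed(final int start1, final int end1, final int start2, final int end2) {
        return start1 <= end2 && start2 <= end1;
    }

    /** Overlap test for half-open intervals [start, end), e.g. 0-based BED coordinates. */
    static boolean overlapsHalfOpen(final int start1, final int end1, final int start2, final int end2) {
        return start1 < end2 && start2 < end1;
    }

    /** Converts 1-based closed [start, end] to 0-based half-open [start - 1, end). */
    static int[] closedOneBasedToHalfOpenZeroBased(final int start, final int end) {
        return new int[] {start - 1, end};
    }
}
```

The closed intervals [100, 200] and [200, 300] share base 200, so they overlap. Feeding the same numbers to the half-open test without converting reports no overlap, while converting first ([99, 200) and [199, 300)) preserves the answer.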

The OverlapDetector class is super ugly, IMHO. (But kudos for finding existing code to reuse.) You could eliminate it, and save a ton of memory, by reorganizing this tool to walk through the various sources of annotations as you walk through the VCF, rather than reading them all into memory at the beginning of operations. Alternatively, a simpler to implement strategy would be to go chromosome by chromosome: Every time you stumble into a new contig in the VCF, you build IntervalTrees for just that contig for each of your annotation sources, discarding the IntervalTrees for the previous contig. Doesn't save quite as much memory, but it's simpler. We can discuss if you're interested and these brief couple of sentences don't make the idea clear.
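
In case a sketch helps, the chromosome-by-chromosome strategy could look roughly like this. Everything here is hypothetical: `PerContigIndexSketch` is not a GATK class, and the `indexBuilder` function stands in for whatever would build the IntervalTrees for one contig. It also assumes the VCF is sorted by contig, so each contig's index is built only once.

```java
import java.util.function.Function;

/** Holds the index for one contig at a time and rebuilds it when the contig changes. */
final class PerContigIndexSketch<I> {
    private final Function<String, I> indexBuilder; // hypothetical: builds the index for one contig
    private String currentContig = null;
    private I currentIndex = null;
    private int buildCount = 0;

    PerContigIndexSketch(final Function<String, I> indexBuilder) {
        this.indexBuilder = indexBuilder;
    }

    /** Returns the index for the given contig, discarding the previous contig's index if needed. */
    I indexFor(final String contig) {
        if (!contig.equals(currentContig)) {
            currentIndex = indexBuilder.apply(contig); // the old index becomes unreachable here
            currentContig = contig;
            buildCount++;
        }
        return currentIndex;
    }

    /** Number of times an index has been built, useful for checking the caching behavior. */
    int getBuildCount() {
        return buildCount;
    }
}
```

The walker would then call `indexFor(variant contig)` for each record, and peak memory is bounded by the largest single contig rather than the whole annotation set.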

tedsharpe · 2021-10-04T14:28:33Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        return consequence;
+    }
+
+    private void annotateTranscript(final SimpleInterval variantInterval, final StructuralVariantAnnotationType svType, final GencodeGtfTranscriptFeature transcript, final Map<String, Set<String>> variantConsequenceDict) {


Some of these lines are awfully long. The GATK coding standard says something about line lengths, but it's frequently ignored. Nonetheless, it would be easier to read if you wrapped some of these super-long lines.

tedsharpe · 2021-10-04T14:29:22Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+import com.google.common.collect.Lists;
+import htsjdk.samtools.SAMSequenceDictionary;
+import htsjdk.samtools.SAMSequenceRecord;
+import htsjdk.samtools.util.Locatable;


There are quite a few unused imports that could be cleaned up.

tedsharpe · 2021-10-04T14:33:03Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+    protected static void annotateNearestTranscriptionStartSite(final SimpleInterval variantInterval, final Map<String, Set<String>> variantConsequenceDict, SVIntervalTree<String> transcriptionStartSiteTree, int maxContigLength, int variantContigID) {
+        // TODO: keep all nearest TSS for dispersed CPX / CTX or choose closest?
+        // TODO: will start < end ever? Shouldn't at this point in the pipeline
+        SVIntervalTree.Entry<String> nearestBefore = transcriptionStartSiteTree.max(new SVInterval(variantContigID, variantInterval.getStart(), variantInterval.getEnd()));


It's not super important here--I don't think it will have much of an impact on performance--but just in general best not to repeat operations unnecessarily: Pull out the creation of the new SVInterval into a local variable so you don't repeat it.

Hmm. Now that I think it over, I believe you want to take the min on (contig, start, start) and the max on (contig, end, end) rather than on (contig, start, end) as you're currently doing.

I think the interval used - (start,end) or (start,start) & (end,end) - doesn't matter here because it's already been determined that the variant doesn't overlap any of the features, so the min on (start,start) should be the same as the min on (start,end), etc. Unless there's a performance advantage or a nuance in behavior that I'm missing re: using (start,start) and (end,end)?

No, I think you're right that if there are no overlappers you'll get what you expect.
The subtlety is that [1000, 1500) is less than [1000, 2000), so if you queried for max on [1000, 2000) and [1000, 1500) was in the tree, you'd get that as the max, which sometimes surprises people who think that the max will necessarily be disjoint from the query interval. But you don't have that problem.
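
The ordering subtlety can be reproduced with a plain `TreeSet`. This is a sketch under assumptions: `HalfOpen` is a hypothetical stand-in for SVInterval on a single contig, intervals are ordered by start and then end (consistent with [1000, 1500) sorting before [1000, 2000)), and `TreeSet.floor` stands in for a `max(query)` that returns the greatest stored interval not greater than the query. It does not use SVIntervalTree itself.

```java
import java.util.Comparator;
import java.util.TreeSet;

final class IntervalOrderingSketch {

    /** Hypothetical half-open interval [start, end) on one contig. */
    static final class HalfOpen {
        final int start;
        final int end;

        HalfOpen(final int start, final int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(final HalfOpen other) {
            return start < other.end && other.start < end;
        }
    }

    /** Orders by start, then by end, so [1000, 1500) sorts before [1000, 2000). */
    static final Comparator<HalfOpen> BY_START_THEN_END =
            Comparator.comparingInt((HalfOpen i) -> i.start).thenComparingInt(i -> i.end);

    /** Greatest stored interval that is <= query in the ordering; it may overlap the query. */
    static HalfOpen maxAtOrBelow(final TreeSet<HalfOpen> tree, final HalfOpen query) {
        return tree.floor(query);
    }
}
```

With [500, 600) and [1000, 1500) stored, querying [1000, 2000) returns [1000, 1500), which overlaps the query rather than sitting entirely before it. That is the surprise described above, and it cannot occur once overlappers have been ruled out.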

tedsharpe

Just a couple of trivial things that I had meant to mention but forgot.

tedsharpe · 2021-10-05T12:35:39Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        return count;
+    }
+
+    protected static boolean variantOverlapsFeature(final SimpleInterval variantInterval, final SimpleInterval featureInterval) {


Pretty sure this method is unnecessary (and is, in fact, causing you to create an extra SimpleInterval). I think you can just call variantInterval.overlaps(someFeature), rather than variantOverlapsFeature(variantInterval, new SimpleInterval(someFeature)).

tedsharpe · 2021-10-05T12:37:52Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        // TODO: return consequence instead?
+    }
+
+    private StructuralVariantAnnotationType getSVType(final VariantContext variant) {


Would it be simpler to use the SVTYPE info field rather than the ALT allele? Or does that not handle all the cases?

When using built-in methods to access SVTYPE from a VariantContext class, we ran into issues because StructuralVariantType as defined in htsjdk doesn't include CPX or CTX types. Rather than reimplementing a way to access SVTYPE while circumventing that check, it was suggested to use the ALT field because SVTYPE is going to be deprecated in VCF v4.4 anyway

Ah, yes. I remember that discussion now. But you could use getAttributeAsString(VCConstants.SVTYPE, null) rather than getStructuralVariantType() if it simplified your life. I'm just worried that when we get to INS:ME:ALU that things will break.
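
One possible hedge against the `INS:ME:ALU` concern, if the ALT allele stays the source of truth, is to reduce a symbolic ALT to its top-level type before matching it against the supported SV types. The helper below is hypothetical and not part of this PR or of htsjdk; it works on the raw ALT string and assumes subtypes are separated by colons inside the angle brackets.

```java
final class SymbolicAltSketch {

    /**
     * Returns the top-level type of a symbolic ALT such as "<INS:ME:ALU>" (here "INS"),
     * or null if the string is not a symbolic allele in angle brackets.
     */
    static String baseTypeFromSymbolicAlt(final String alt) {
        if (alt == null || alt.length() < 3 || !alt.startsWith("<") || !alt.endsWith(">")) {
            return null;
        }
        final String inner = alt.substring(1, alt.length() - 1);
        final int colon = inner.indexOf(':');
        return colon < 0 ? inner : inner.substring(0, colon);
    }
}
```

Breakend-style ALT strings are not symbolic in this sense, so the helper returns null for them and they would still need their own handling.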

mwalker174 · 2022-02-23T16:53:53Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotate.java

+        final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID );
+        final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID );
+        // if at least one result is valid, keep one with shorter distance
+        if (!(beforeInvalid && afterInvalid)) {
+            // if result is invalid, set distance to longest contig length so that other TSS will be kept
+            final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval);
+            final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval());


Found this a little easier to follow by avoiding double-negatives:

Suggested change

-        final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID );
-        final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID );
-        // if at least one result is valid, keep one with shorter distance
-        if (!(beforeInvalid && afterInvalid)) {
-            // if result is invalid, set distance to longest contig length so that other TSS will be kept
-            final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval);
-            final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval());
+        final boolean beforeValid = nearestBefore != null && nearestBefore.getInterval().getContig() == variantContigID;
+        final boolean afterValid = nearestAfter != null && nearestAfter.getInterval().getContig() == variantContigID;
+        // only update if at least one TSS is valid
+        if (beforeValid || afterValid) {
+            // set distance to closest valid TSS
+            final int distanceBefore = beforeValid ? nearestBefore.getInterval().gapLen(svInterval) : maxContigLength;
+            final int distanceAfter = afterValid ? svInterval.gapLen(nearestAfter.getInterval()) : maxContigLength;

Also got rid of maxContigLength and replaced with Integer.MAX_VALUE
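
A minimal sketch of the selection logic with the `Integer.MAX_VALUE` sentinel, written without the GATK types so it can be run on its own. The class, method, and parameter names are hypothetical, real gap lengths are assumed to be smaller than `Integer.MAX_VALUE`, and preferring the upstream TSS on a tie is an assumption rather than something settled in this thread.

```java
final class NearestTssSketch {

    /**
     * Picks the closer of two candidate TSS names.
     * Returns null when neither candidate is valid (missing or on another contig).
     */
    static String chooseNearest(final String before, final boolean beforeValid, final int gapBefore,
                                final String after, final boolean afterValid, final int gapAfter) {
        // only choose if at least one TSS is valid
        if (!(beforeValid || afterValid)) {
            return null;
        }
        // an invalid candidate gets the sentinel distance so the other one wins
        final int distanceBefore = beforeValid ? gapBefore : Integer.MAX_VALUE;
        final int distanceAfter = afterValid ? gapAfter : Integer.MAX_VALUE;
        return distanceBefore <= distanceAfter ? before : after;
    }
}
```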

src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/SVUtils.java

src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateUnitTest.java

src/test/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateIntegrationTest.java

mwalker174

Thanks @epiercehoffman ! I have a few finishing touches to suggest here:

Use of static methods - this is somewhat subjective. IMO, static methods are good for small "helper" functions, particularly when exposed (public), and functions that do not access class members. I've suggested a few functions that I think are more appropriately non-static. It shouldn't be too much trouble to modify your tests this way.
Using GTFIntervalsTreeContainer to hold the trees would also cut down on lines of code, which is always great. I would move the container class into the engine and make it public with proper getter methods. The trees don't seem to be accessed that often, so it shouldn't be too burdensome to call the getter methods.

src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/SVAnnotateEngine.java

* add TSS_DUP annotation category
* fix TSS and promoter inference and overlap determination
* handle 2nd breakpoint for BNDs in GATK-SV VCF format
* handle BND representation of CPX and CTX events for spec-compliant VCF format
* add option to annotate BNDs under a specified length as DEL or DUP if applicable
* style edits
* add unit tests for TSS, promoter, annotation of all SV types, and add toy GTF file for testing

* handle dDUP events: add INS interval
* set INS interval to CHR:POS-POS+1 and ignore END
* fixes to BND handling

…cleanup

* add tool docs
* rename command line args
* add unit tests for ClosedSVInterval
* add unit test for SVInterval.toSimpleInterval

epiercehoffman requested a review from mwalker174 August 20, 2021 19:06

mwalker174 reviewed Aug 25, 2021

tedsharpe approved these changes Oct 4, 2021

tedsharpe reviewed Oct 5, 2021

epiercehoffman force-pushed the eph_sv_annotate branch from c80b15a to 3c89b0c Compare December 2, 2021 23:00

mwalker174 requested changes Feb 25, 2022

mwalker174 approved these changes Mar 4, 2022

epiercehoffman marked this pull request as ready for review March 4, 2022 19:52

epiercehoffman added 22 commits March 8, 2022 16:02

first pass classifying DEL, DUP, INS, INV, and CPX events

f8ca69f

add variant feature interval comparison unit tests

2e5e5fb

review comments, BND, CTX, CNV, promoters, noncoding

72837b9

add nearest TSS annotation and unit test

d20f137

ClosedSVIntervals, optional ref inputs, and PR feedback

69b94ef

fix dup and TSS logic, infer promoters, del LOF if TSS overlap

2eb8c37

fix toy gtf gene name

6cd5f1b

resolve conflicts from rebase

dd190fb

unit test for getSVType and getSVSegments + refinements/fixes

c9d396a

put variant data for unit test for SV type / segments directly in test

f11b51e

clean up some TODOs

8297e8d

add unit test for annotateStructuralVariant incl noncoding/promoter; …

eef4ddc

…cleanup

add resources for integration test

9f77780

add tests for handling unexpected contigs in gtf and bed file

ab83265

add integration tests for different cmd line args

c22bd55

add comments to tests and methods

43fabb2

finishing touches

819c9c3

remove ClosedSVInterval, interconvert to SVInterval instead

70b4201

address comments part 1: mostly cosmetic

4344610

address comments part 2: style & substance

5cb9a1a

address comments part 3: style & substance

80559d6

epiercehoffman added 6 commits March 8, 2022 16:02

address comments part 4: sequence dictionary

dd9c67f

revert changes to gtf interval trees container

bf66907

address comments part 5: add doc comments for non-test methods

ece0b6b

address comments part 6: create engine class

1b9239d

add test case for promoter annotation of multi-segment SV

06b3a41

fewer static methods & pass around gtf trees container

e714c7e

epiercehoffman force-pushed the eph_sv_annotate branch from 3e0b041 to e714c7e Compare March 8, 2022 21:08

swap non-ascii prime character for ascii version

e535ccf

epiercehoffman merged commit 1c749b3 into master Mar 9, 2022

epiercehoffman deleted the eph_sv_annotate branch March 9, 2022 21:32
