Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVAnnotate: Functional annotations for SVs called by GATK-SV #7431

Merged
merged 29 commits into from
Mar 9, 2022

Conversation

epiercehoffman
Copy link
Contributor

Add SVAnnotate for functional annotation of SV VCFs from GATK-SV (after preprocessing to correct END keys, etc.). Posting as a draft for pre-review of work-in-progress code.

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! I have mostly stylistic comments at this point.

It does seem like there are parallels among the annotateX methods resulting in some repeated code. You could think about loading some of that logic into an abstract annotator class. For each SV type you could then have a subclass that implements more specific logic. That is, implement a common public String annotate() function that looks a lot like your annotateDUP() and annotateINV() methods but calls abstract functions getConsequenceSpanning(), getConsequenceSingleBreakend(), getConsequenceDoubleBreakendInExon(), getConsequenceDoubleBreakendInUtr(), etc. for each case. Not necessary, but something to think about as this becomes more complex.


@Argument(
fullName="proteinCodingGTF",
shortName="P",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically don't use short names for non-standard arguments, just so the commands are easier to interpret at a glance.

Comment on lines 53 to 57
private VariantContextWriter vcfWriter = null;

private OverlapDetector<GencodeGtfGeneFeature> gtfOverlapDetector;

private final Set<String> MSVExonOverlapClassifications = new HashSet<>(Arrays.asList(GATKSVVCFConstants.LOF, GATKSVVCFConstants.INT_EXON_DUP, GATKSVVCFConstants.DUP_PARTIAL, GATKSVVCFConstants.PARTIAL_EXON_DUP, GATKSVVCFConstants.COPY_GAIN));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can delete the blank lines between these

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use Sets.newHashSet()


@Override
public void onTraversalStart() {
FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final … throughout

Comment on lines 62 to 63
List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>();
proteinCodingGTFSource.forEach(gtfFeaturesList::add); // TODO: faster method?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FeatureDataSource implements Iterable, you can do this in one line with Google's common Lists class like so:

List<GencodeGtfGeneFeature> gtfFeaturesList = Lists.newArrayList(proteinCodingGTFSource);

FeatureDataSource<GencodeGtfGeneFeature> proteinCodingGTFSource = new FeatureDataSource<>(proteinCodingGTFFile);
List<GencodeGtfGeneFeature> gtfFeaturesList = new ArrayList<>();
proteinCodingGTFSource.forEach(gtfFeaturesList::add); // TODO: faster method?
gtfOverlapDetector = OverlapDetector.create(gtfFeaturesList);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not declare the gtfFeaturesList since it's only used once, and just put the newArrayList() function in this line, for the sake of code economy.

Comment on lines 116 to 470
if (!variantConsequenceDict.containsKey(key)) {
variantConsequenceDict.put(key, new HashSet<>());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can replace with variantConsequenceDict.putIfAbsent()

Comment on lines 140 to 310
private String annotateINS(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) {
return annotateDEL(variantInterval, gtfTranscript);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be clearer to have a single function annotateDeletionOrInsertion

variantConsequenceDict.get(key).add(value);
}

private String annotateDEL(SimpleInterval variantInterval, GencodeGtfTranscriptFeature gtfTranscript) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would spell out the type i.e. annotateDeletion

Comment on lines 91 to 226
if (!variantInterval.getContig().equals(featureInterval.getContig())) {
return false;
} else {
return variantInterval.getStart() <= featureInterval.getStart() && variantInterval.getEnd() >= featureInterval.getEnd();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just use Locatable's contains() method

}
}

protected final static int countBreakpointsInsideFeature(SimpleInterval variantInterval, SimpleInterval featureInterval) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of trivial comments from me, but this looks fine. You can merge from my point of view. A couple of more general suggestions, though, which you can feel free to ignore:

Have you carefully considered the off-by-one errors that can occur because of the differing standards for representing intervals (0-based vs. 1-based coordinates and open ended vs. closed)? I believe that the IntervalTree class on which the OverlapDetector is based makes some assumptions about intervals being half-open in testing for overlapping intervals. Do all of your input files adhere to the same standard? (Apologies if you have unit tests for this--I didn't examine them carefully.)

The OverlapDetector class is super ugly, IMHO. (But kudos for finding existing code to reuse.) You could eliminate it, and save a ton of memory, by reorganizing this tool to walk through the various sources of annotations as you walk through the VCF, rather than reading them all into memory at the beginning of operations. Alternatively, a simpler to implement strategy would be to go chromosome by chromosome: Every time you stumble into a new contig in the VCF, you build IntervalTrees for just that contig for each of your annotation sources, discarding the IntervalTrees for the previous contig. Doesn't save quite as much memory, but it's simpler. We can discuss if you're interested and these brief couple of sentences don't make the idea clear.

return consequence;
}

private void annotateTranscript(final SimpleInterval variantInterval, final StructuralVariantAnnotationType svType, final GencodeGtfTranscriptFeature transcript, final Map<String, Set<String>> variantConsequenceDict) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these lines are awfully long. The GATK coding standard says something about line lengths, but it's frequently ignored. Nonetheless, it would be easier to read if you wrapped some of these super-long lines.

import com.google.common.collect.Lists;
import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.SAMSequenceRecord;
import htsjdk.samtools.util.Locatable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few unused imports that could be cleaned up.

protected static void annotateNearestTranscriptionStartSite(final SimpleInterval variantInterval, final Map<String, Set<String>> variantConsequenceDict, SVIntervalTree<String> transcriptionStartSiteTree, int maxContigLength, int variantContigID) {
// TODO: keep all nearest TSS for dispersed CPX / CTX or choose closest?
// TODO: will start < end ever? Shouldn't at this point in the pipeline
SVIntervalTree.Entry<String> nearestBefore = transcriptionStartSiteTree.max(new SVInterval(variantContigID, variantInterval.getStart(), variantInterval.getEnd()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not super important here--I don't think it will have much of an impact on performance--but just in general best not to repeat operations unnecessarily: Pull out the creation of the new SVInterval into a local variable so you don't repeat it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Now that I think it over, I believe you want to take the min on (contig, start, start) and the max on (contig, end, end) rather than on (contig, start, end) as you're currently doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the interval used - (start,end) or (start,start) & (end,end) - doesn't matter here because it's already been determined that the variant doesn't overlap any of the features, so the min on (start,start) should be the same as the min on (start,end), etc. Unless there's a performance advantage or a nuance in behavior that I'm missing re: using (start,start) and (end,end)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think you're right that if there are no overlappers you'll get what you expect.
The subtlety is that [1000, 1500) is less than [1000, 2000), so if you queried for max on [1000, 2000) and [1000, 1500) was in the tree, you'd get that as the max, which sometimes surprises people who think that the max will necessarily be disjoint from the query interval. But you don't have that problem.

Copy link
Contributor

@tedsharpe tedsharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of trivial things that I had meant to mention but forgot.

return count;
}

protected static boolean variantOverlapsFeature(final SimpleInterval variantInterval, final SimpleInterval featureInterval) {
Copy link
Contributor

@tedsharpe tedsharpe Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this method is unnecessary (and is, in fact, causing you to create an extra SimpleInterval). I think you can just call variantInterval.overlaps(someFeature), rather than variantOverlapsFeature(variantInterval, new SimpleInterval(someFeature)).

// TODO: return consequence instead?
}

private StructuralVariantAnnotationType getSVType(final VariantContext variant) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to use the SVTYPE info field rather than the ALT allele? Or does that not handle all the cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using built-in methods to access SVTYPE from a VariantContext class, we ran into issues because StructuralVariantType as defined in htsjdk doesn't include CPX or CTX types. Rather than reimplementing a way to access SVTYPE while circumventing that check, it was suggested to use the ALT field because SVTYPE is going to be deprecated in VCF v4.4 anyway

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. I remember that discussion now. But you could use getAttributeAsString(VCConstants.SVTYPE, null) rather than getStructuralVariantType() if it simplified your life. I'm just worried that when we get to INS:ME:ALU that things will break.

Comment on lines 627 to 633
final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID );
final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID );
// if at least one result is valid, keep one with shorter distance
if (!(beforeInvalid && afterInvalid)) {
// if result is invalid, set distance to longest contig length so that other TSS will be kept
final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval);
final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found this a little easier to follow by avoiding double-negatives:

Suggested change
final boolean beforeInvalid = (isNull(nearestBefore) || nearestBefore.getInterval().getContig() != variantContigID );
final boolean afterInvalid = (isNull(nearestAfter) || nearestAfter.getInterval().getContig() != variantContigID );
// if at least one result is valid, keep one with shorter distance
if (!(beforeInvalid && afterInvalid)) {
// if result is invalid, set distance to longest contig length so that other TSS will be kept
final int distanceBefore = beforeInvalid ? maxContigLength : nearestBefore.getInterval().gapLen(svInterval);
final int distanceAfter = afterInvalid ? maxContigLength : svInterval.gapLen(nearestAfter.getInterval());
final boolean beforeValid = nearestBefore != null && nearestBefore.getInterval().getContig() == variantContigID;
final boolean afterValid = nearestAfter != null && nearestAfter.getInterval().getContig() == variantContigID;
// only update if at least one TSS is valid
if (beforeValid || afterValid) {
// set distance to closest valid TSS
final int distanceBefore = beforeValid ? nearestBefore.getInterval().gapLen(svInterval) : maxContigLength;
final int distanceAfter = afterValid ? svInterval.gapLen(nearestAfter.getInterval()) : maxContigLength;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also got rid of maxContigLength and replaced with Integer.MAX_VALUE

Copy link
Contributor

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @epiercehoffman ! I have a few finishing touches to suggest here:

  • Use of static methods - this is somewhat subjective. IMO, static methods are good for small "helper" functions, particularly when exposed (public), and functions that do not access class members. I've suggested a few functions that I think are more appropriately non-static. It shouldn't be too much trouble to modify your tests this way.
  • Using GTFIntervalsTreeContainer to hold the trees would also cut down on lines of code, which is always great. I would move the container class into the engine and make it public with proper getter methods. The trees don't seem to be accessed that often, it shouldn't be too burdensome to call the getter methods.

@epiercehoffman epiercehoffman marked this pull request as ready for review March 4, 2022 19:52
* add TSS_DUP annotation category
* fix TSS and promoter inference and overlap determination
* handle 2nd breakpoint for BNDs in GATK-SV VCF format
* handle BND representation of CPX and CTX events for spec-compliant VCF format
* add option to annotate BNDs under a specified length as DEL or DUP if applicable
* style edits
* add unit tests for TSS, promoter, annotation of all SV types, and add toy GTF file for testing
* handle dDUP events: add INS interval
* set INS interval to CHR:POS-POS+1 and ignore END
* fixes to BND handling
* add tool docs
* rename command line args
* add unit tests for ClosedSVInterval
* add unit test for SVInterval.toSimpleInterval
@epiercehoffman epiercehoffman merged commit 1c749b3 into master Mar 9, 2022
@epiercehoffman epiercehoffman deleted the eph_sv_annotate branch March 9, 2022 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants