Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update SV clustering classes #7243

Merged
merged 10 commits into from
Oct 6, 2021
Merged

Update SV clustering classes #7243

merged 10 commits into from
Oct 6, 2021

Conversation

mwalker174
Copy link
Contributor

Reworks classes used by JointGermlineCNVSegmentationIntegration for SV clustering and defragmentation. The design of SVClusterEngine has been overhauled to enable the implementation of CNVDefragmenter and BinnedCNVDefragmenter subclasses. Logic for producing representative records from a collection of clustered SVs has been separated into an SVCollapser class, which provides enhanced functionality for handling genotypes for SVs more generally.

A number of bugs, particularly with max-clique clustering, have been fixed, as well as a parameter swap bug in JointGermlineCNVSegmentationIntegration.

This is the first of a series of PRs for an experimental Java-based implementation of some modules in gatk-sv pipeline, including SV vcf merging, clustering, evidence aggregation, and genotyping.

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find the code to do the defragmentation in bin-space that I need for exomes.

I also had a question regarding what the spec says about determining the ref allele.

Other than that, lots of little clarifications and suggestions you can take or leave. I didn't go over the tests with a fine toothed comb, but I can chip in if Chris doesn't have the time for that.

protected Object collapseSingleVariantAttribute(final Set<Object> values) {
if (values.size() == 1) {
return values.iterator().next();
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to add up DP or anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above.


/**
* CNV defragmenter for when the intervals used for coverage collection are available. The intervals are used to adjust
* variant padding by rounding to the nearest bin boundary.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something missing from getPaddedRecordInterval() below?

call2End - call2Start + 1, Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM),
Lists.newArrayList(Allele.REF_N, Allele.SV_SIMPLE_DEL),
Collections.emptyList(), Collections.emptyMap());
Assert.assertTrue(defragmenter.clusterTogether(call1, call2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice improvement over the old checks.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great stuff! Looks really good. The tests are great, too. Feel free to take some of my comments as suggestions rather than requirements, or correct me if I'm misunderstood something as I'm commenting on it.


// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
public static final String NEUTRAL_COPY_NUMBER_KEY = "NCN";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this one for? Is it to tag copy number neutral events? Is it meant to hold the reference ploidy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for ploidy - future PR.

public static final String CLUSTER_MEMBER_IDS_KEY = "MEMBERS";

// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new constants from this point on don't seem to be used anywhere in the GATK code base that I can see. Am I missing some way in which they are used? If not, do we want to add them now or hold off until they are used for something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to cherry-pick all the changes for this file, but these are actually for a later PR. I'll remove them for now.

public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
public static final String NEUTRAL_COPY_NUMBER_KEY = "NCN";
public static final String DEPTH_OVERLAP_KEY = "RDOV";
public static final String DEPTH_P_HARDY_WEINBERG_LOSS_FIELD = "PHW_L";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure what some of these are. If you want to keep them, could you add some quick comments here or somewhere else to describe what they are?

@Override
public int getStart() {
return start;
public boolean isCNV() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the CPX types in our post 0506 VCFs are or contain CNVs. Maybe change this to something like isSimpleCNV?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* @param attributes attributes to apply to all new genotypes
* @return genotypes augmented with missing samples
*/
public static GenotypesContext fillMissingSamplesWithGenotypes(final GenotypesContext genotypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a suggestion, but I might change the name of this to populateGenotypesForMissingSamplesWithAlleles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

} else if (newType.equals(StructuralVariantType.INS)) {
// Median
final int[] lengths = items.stream().mapToInt(SVCallRecord::getLength).toArray();
final int midIndex = lengths.length / 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use MathUtils.median()? You might have to update it or create another flavor of the method that would let you use the averaging EstimationType, but that might be more generally useful for GATK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the implementation but I suppose I should use the existing utility. If it ends up being slow we can propose a change there.

}
}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A high-level description would be useful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

final List<Integer> startPositions = items.stream().map(SVCallRecord::getPositionA).sorted().collect(Collectors.toList());
final List<Integer> endPositions = items.stream().map(SVCallRecord::getPositionB).sorted().collect(Collectors.toList());
//use the mid value of the sorted list so the start and end represent real breakpoint observations
final int medianStart = startPositions.get(startPositions.size() / 2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use MathUtils.median

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also replaced mean calculations below with (int)Math.round(MathUtils.sum(startPositions) / (double) startPositions.length); since it's now an array type.

* @param items
* @return (key, value) entry of (start, end)
*/
protected Map.Entry<Integer,Integer> collapseInterval(final Collection<SVCallRecord> items) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be clearer to use a org.apache.commons.lang3.tuple.Pair rather than a Map.Entry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes that is much better!

{SVTestUtils.newCallRecordWithAlgorithms(Collections.emptyList()), false},
{SVTestUtils.newCallRecordWithAlgorithms(Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)), true},
{SVTestUtils.newCallRecordWithAlgorithms(Collections.singletonList("pesr")), false},
{SVTestUtils.newCallRecordWithAlgorithms(Lists.newArrayList(GATKSVVCFConstants.DEPTH_ALGORITHM, "pesr")), false}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MIght as well add "pesr" to GATKSVVCFConstants too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use the caller name eg "manta", "melt", etc. for pesr and should allow for any caller name (that is not "depth"), whereas "depth" has a special meaning. Here "pesr" is just for testing.

Copy link
Contributor Author

@mwalker174 mwalker174 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ldgauthier and @cwhelan for the reviews! I believe I've responded to all of them. There's been some major refactoring of the clustering classes as suggested by @cwhelan to make the clustering engine more flexible. I've also made some changes to the SVCallRecord class - in particular the ability to have null length/strand values.

public static final String CLUSTER_MEMBER_IDS_KEY = "MEMBERS";

// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to cherry-pick all the changes for this file, but these are actually for a later PR. I'll remove them for now.


// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
public static final String NEUTRAL_COPY_NUMBER_KEY = "NCN";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for ploidy - future PR.


/**
* CNV defragmenter for when the intervals used for coverage collection are available. The intervals are used to adjust
* variant padding by rounding to the nearest bin boundary.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something missing from getPaddedRecordInterval() below?

*/
public class BinnedCNVDefragmenter extends CNVDefragmenter {

protected final TreeMap<GenomeLoc, Integer> genomicToBinMap;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider using SVIntervalTree? It could be faster because it's contig-aware and was written by @tedsharpe . The downside is converting from GenomeLoc to SVInterval which involves converting contig IDs to ints.

I'm hesitant to start mixing in SVIntervals mainly because of the confusion it can create with differences in coordinates.

Also I might make the name of this map a little more explicit, ie genomicCoordinatesToBinIndexMap

Done

final int positionB,
final boolean strandB,
final StructuralVariantType type,
final int length,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made length and the strands into Integer and Boolean types, respectively, so that they can take on null values when appropriate (e.g. length for BNDs or strands for CNVs). This required quite a bit of refactoring and fixing of tests but I seem to have it working. There may be some performance hit from boxing/unboxing these values everywhere but I do agree that passing around -1 everywhere (and having non-sensical strand values) is problematic in some ways.

It is a little tricky to have constructors that allow for length/strand parameters that are potentially inconsistent, so I have checks to ensure the values are consistent and also sanitize attributes that could contain redundant information.

@Override
public int getStart() {
return start;
public boolean isCNV() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
Utils.validateArg(record.isIntrachromosomal(), "Inversion is not intrachromosomal");
final SVCallRecord positiveBreakend = new SVCallRecord(record.getId(), record.getContigA(),
record.getPositionA(), true, record.getContigB(), record.getPositionB(), true, StructuralVariantType.BND, -1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this - see comment in SVCallRecord.java.

Copy link
Member

@cwhelan cwhelan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now; sorry for the delay in re-review. I only have a couple of very minor stylistic comments; feel free to merge after deciding whether or not to make changes based on them.

return newAttributes;
}

private static Integer getLength(final VariantContext variant, final StructuralVariantType type) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a little confusing to have getLength and getStrands return null if the values are inferrable from the SV type -- in my view that'd be surprising behavior if I was reading through the SVCallRecord.create() method. I'd recommend putting the SVCallRecord can resolve these clauses of these if statements back into SVCallRecord.create() and having these methods just have the else clauses -- parsing the attribute values, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (i < tokens.length && subtype.equals(tokens[i])) {
alleleSize = i + 1;
} else {
break outerloop;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Labeled breaks always look a little weird to me (probably because I've worked on other codebases where they were explicitly against the style guide) and we don't have many of them in the GATK codebase. You could replace this with return Arrays.copyOf(firstTokens, alleleSize); since there's nothing else going on after the method, or refactor the loop labeled outerloop into a calculateAlleleSize method and just return from from it. But I guess they're not against current style guidelines and my aversion to them might be just a personal quirk, so feel free to keep this if you like it better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, just replaced with return Arrays.copyOf(firstTokens, alleleSize);

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have to go over the tests, but here are the code comments. Mostly requests to rename, but there was one concern I had about dupes with no-call GTs.

final int copyNumber = VariantContextGetters.getAttributeAsInt(genotype, COPY_NUMBER_FORMAT, 0);
if (altAllele.equals(Allele.SV_SIMPLE_DEL)) {
return copyNumber < ploidy;
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that there are some nice checks in the calling method, but can you put javadoc to be explicit about the parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is relevant, but I clearly documented the new isCarrier() method.

for (final Genotype g : variant.getGenotypes()) {
if (g.isHomRef() || (g.isNoCall() && !g.hasExtendedAttribute(GATKSVVCFConstants.COPY_NUMBER_FORMAT))) {
continue;
private static Pair<Boolean, Boolean> inferStrands(final StructuralVariantType type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let Chris double check this logic.

Assert.assertFalse(defragmenter.areClusterable(deletionRAR, deletionAAR));
Assert.assertFalse(defragmenter.areClusterable(deletionRAR, deletionAAA));

Assert.assertTrue(defragmenter.areClusterable(deletionARR1, deletionARR1Copy));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh good call. I have a branch where I ran JointGermlineCNVSegmentation on two multi-sample input VCFs, but never merged -- is that possible with this code now? If not then maybe I want to rebase before you merge...

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be missing a few test cases, but otherwise just tidying up that you can choose to do or not.

Collections.singletonList(Allele.REF_N)
),
Collections.singletonList(Allele.SV_SIMPLE_DEL),
Lists.newArrayList(Allele.REF_N, Allele.SV_SIMPLE_DEL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about preferring higher ploidy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Lists.newArrayList(Allele.SV_SIMPLE_DEL, Allele.SV_SIMPLE_DUP),
Lists.newArrayList(Allele.SV_SIMPLE_DUP, Allele.SV_SIMPLE_DEL)
},
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc for the method says "potentially augmented with additional ref alleles to match the ploidy" -- is there a test for that? Like DEL, REF/REF ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, added a test for that and additional CNV cases such as haploid DUPs where we can infer GT exactly.

{
new int[]{-1},
new String[]{"chr2"},
new StructuralVariantType[]{StructuralVariantType.BND},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add another BND here or another case with multiple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

public Object[][] collapseAlgorithmsTestData() {
return new Object[][]{
{Collections.singletonList(Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)), Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)},
{Collections.singletonList(Collections.singletonList("pesr")), Collections.singletonList("pesr")},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where events can be supported by depth and pesr evidence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added those cases now

@mwalker174
Copy link
Contributor Author

Back to you @ldgauthier

@ldgauthier
Copy link
Contributor

This must be the most discussed tool in the GATK. I am relieved to close the book on this PR. Very nice work @mwalker174 -- thanks for all the refactoring!

@mwalker174 mwalker174 merged commit 95852a1 into master Oct 6, 2021
@mwalker174 mwalker174 deleted the mw_sv_cluster_engine branch October 6, 2021 18:58
gileshall pushed a commit that referenced this pull request Oct 6, 2021
@mwalker174 mwalker174 mentioned this pull request Nov 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants