SVCluster tool #7541

Merged: 8 commits merged into master on Jan 26, 2022
Conversation

@mwalker174 (Contributor) commented Nov 1, 2021

Implements a tool for clustering SVs, built on top of the clustering engine code recently refined in #7243. In addition to a few bug fixes, updates include:

  • PloidyTable class, which ingests a TSV of per-sample contig ploidies and serves as a simple data class for it (see the sketch after this list). This was necessary for inferring genotypes when input VCFs contain non-matching sample and variant records.
  • Modified SVClusterEngine to emit sorted output.
  • Improved code for SV record collapsing (see the CanonicalSVCollapser), particularly for CNVs. Genotype collapsing now infers allele phasing in certain unambiguous cases, in particular for DUPs and multi-allelic CNVs. Testing has been cleaned up and augmented with further cases to validate this functionality.
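As a sketch of the PloidyTable idea (the actual file format and API may differ; the column layout and class shape here are assumptions), the TSV could look like:

  SAMPLE    chr1  chr2  ...  chrX  chrY
  sample_1  2     2     ...  2     0
  sample_2  2     2     ...  1     1

backed by a simple lookup class along these lines:

  import java.util.Map;

  // Hypothetical core of a PloidyTable: sample -> (contig -> ploidy)
  public final class PloidyTableSketch {
      private final Map<String, Map<String, Integer>> ploidies;

      public PloidyTableSketch(final Map<String, Map<String, Integer>> ploidies) {
          this.ploidies = ploidies;
      }

      public int get(final String sample, final String contig) {
          final Map<String, Integer> contigPloidies = ploidies.get(sample);
          if (contigPloidies == null || !contigPloidies.containsKey(contig)) {
              throw new IllegalArgumentException("No ploidy entry for " + sample + " on " + contig);
          }
          return contigPloidies.get(contig);
      }
  }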

@cwhelan (Member) left a comment:

@mwalker174 This is looking great, glad to see it so fleshed out.

newGenotypes.add(new GenotypeBuilder(g).alleles(genotypeAlleles).make());
}
builder.genotypes(newGenotypes);
if (record.getGenotypes().stream().anyMatch(g -> g.getAlleles().isEmpty())) {
cwhelan (Member):

It feels inefficient to stream over the genotypes twice (first to check if any are empty, then to sanitize), even if the anyMatch stream stops at the first match. Why not just always return the mapped stream below? Or do you think the overhead costs of mapping and re-collecting outweigh the benefits?

mwalker174 (Contributor, Author):

I think you're right, this was a premature optimization on my part. I was thinking it would be relatively rare to have empty alleles, and this would make it faster by not having to make a copy of the genotypes. On second thought, I'm not sure this would have a large effect, and it could lead to puzzling performance inconsistency across VCFs (e.g., a chrY VCF would usually run slower since it's likely to have many empty genotypes). I've changed it to simply call sanitize on every genotype.
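A minimal sketch of the revised logic (the empty-allele policy shown, replacing with a single no-call, is an assumption):

  import htsjdk.variant.variantcontext.Allele;
  import htsjdk.variant.variantcontext.Genotype;
  import htsjdk.variant.variantcontext.GenotypeBuilder;
  import java.util.Collections;
  import java.util.List;
  import java.util.stream.Collectors;

  // Sanitize unconditionally instead of first scanning with anyMatch; the
  // single pass costs a genotype copy but behaves consistently on all inputs.
  static List<Genotype> sanitizeAll(final List<Genotype> genotypes) {
      return genotypes.stream()
              .map(g -> g.getAlleles().isEmpty()
                      ? new GenotypeBuilder(g).alleles(Collections.singletonList(Allele.NO_CALL)).make()
                      : g)
              .collect(Collectors.toList());
  }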

for (final String sample : missingSamples) {
final GenotypeBuilder genotypeBuilder = new GenotypeBuilder(sample);
if (attributes != null) {
genotypeBuilder.attributes(attributes);
final int ploidy = ploidyTable.get(sample, contig);
cwhelan (Member):

I'd add some documentation in the method javadoc about how this logic works (and add @param docs for refAlleleDefault and ploidyTable).

mwalker174 (Contributor, Author):

Done
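One possible shape for that javadoc (the described behavior is inferred from the surrounding diff, so the wording is a sketch):

  /**
   * Creates genotypes for samples that are present in the header but missing
   * from the variant record. The number of alleles assigned to each new
   * genotype is determined by the sample's ploidy on the record's contig.
   *
   * @param refAlleleDefault if true, filled-in genotypes default to
   *                         homozygous-reference alleles; otherwise they
   *                         default to no-call alleles
   * @param ploidyTable      per-sample, per-contig ploidies used to decide
   *                         how many alleles each filled-in genotype carries
   */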

.sorted()
.collect(Collectors.toList());
// Alt alleles match already, or some alts vanished in genotype collapsing (and we keep them)
if (genotypeAltAlleles.equals(sortedAltAlleles) || sortedAltAlleles.containsAll(genotypeAltAlleles)) {
cwhelan (Member):

The first part of the || seems redundant with the second.

mwalker174 (Contributor, Author):

Probably another premature optimization - I was trying to avoid containsAll(), which is quadratic for two lists. These lists will likely be short. Again favoring performance consistency over optimizing the more common case, I've changed genotypeAltAlleles to a Set and now only check containsAll().
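For reference, a sketch of the linear-time check (names are illustrative, and note it is the superset that benefits from hashing, which may differ slightly from the actual change):

  import htsjdk.variant.variantcontext.Allele;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // List.containsAll over two lists is O(N*M); probing a HashSet instead makes
  // the subset check O(N + M) on average. The equals() branch is dropped since
  // equal collections trivially satisfy containsAll.
  static boolean altsSubsumed(final List<Allele> genotypeAltAlleles,
                              final List<Allele> siteAltAlleles) {
      final Set<Allele> siteAltSet = new HashSet<>(siteAltAlleles);
      return siteAltSet.containsAll(genotypeAltAlleles);
  }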

&& sampleAltAlleles.get(0).equals(Allele.SV_SIMPLE_DUP)) {
return Collections.nCopies(expectedCopyNumber, Allele.NO_CALL);
// Ploidy 0
if (expectedCopyNumber == 0) {
cwhelan (Member):

There are some places where you get, e.g., coverage on Y in females because of segmental duplications between Y and other chromosomes. Do you think we always want to throw away data from those places? (Maybe we do, I just wanted to raise it as a discussion point.)

mwalker174 (Contributor, Author):

This is a great point. On the one hand, I can see why it would be of interest and valid to preserve the calls there. On the other hand, it seems like a contradiction to have a duplication "on" a chromosome that doesn't exist. I think this highlights a weakness of the way we represent duplications.

For example, consider a seg dup present in 2 copies on the reference and a sample that matches reference (assume diploid for both). Then you technically have 4 copies of this particular sequence, but should that get called as a dup? Most would argue probably not, but in some sense it's not wrong. If you did decide to call it, would you call it at one or both? What if there were a duplication of one copy but you had insufficient information to determine which locus? Where do you put the call? Do you call it twice - once at each locus? Seems ambiguous.

I don't want to open that can of worms here. If we really wanted to, I think representing such events with an insertion might be practical. IMO, graph representations are the solution to these kinds of issues. I've added a TODO here.

cwhelan (Member):

Some of the issues you are talking about here in your second paragraph are behind some of the changes that are coming to the VCF spec in version 4.4. In particular, the SVCLAIM info field will allow you to specify whether your deletion and duplication calls are really claiming that there are breakpoints at the specified location, or if the claim is simply that the copy number of the segment on the reference differs from expectations based on read depth. My initial thought when reading this segment of code was that representing things in this style might allow us to preserve the calls we are throwing away here. What do you think of that approach?

Genome STRiP measures the copy number of repeated sequences like segmental duplications and represents it by adding a "duplicate segments" info field. So for example if there are two copies of a segmental duplication in the reference on the autosomes (so you expect a reference sample to be copy number 4), and the copy number is measured to be 5 but we don't know which side of the seg dup is duplicated, it writes a record at one locus with the duplicate segment info field pointing to the other segment and the sample's CN format field set to 5. In VCF 4.4 this sort of thing would still make sense using a depth SVCLAIM type representation.

mwalker174 (Contributor, Author):

It would be very realistic to adopt 4.4 in the output in the near future (or have a setting to select which spec to use), and it seems like it would be easy to set SVCLAIM at the end of the pipeline using the EVIDENCE field. Carrying it through certainly has its advantages, including resolving cases like the one you suggested above, but might be hard to implement in gatk-sv given the current scope of the pipeline.
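To make the SVCLAIM discussion above concrete, a hedged example of what two VCF 4.4-style DEL records might look like (header and columns abbreviated; per the 4.4 spec, "D" claims a read-depth/copy-number change and "J" claims a novel adjacency at the breakpoints):

  ##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the SV call">
  chr1  100000  del_depth_only  N  <DEL>  .  PASS  SVTYPE=DEL;END=150000;SVCLAIM=D
  chr1  100000  del_breakpoint  N  <DEL>  .  PASS  SVTYPE=DEL;END=150000;SVCLAIM=DJ

A depth-only claim ("D") could in principle preserve calls like the zero-ploidy seg-dup case above without asserting breakpoints on a missing chromosome.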

@@ -401,8 +622,10 @@ protected Genotype collapseSampleGenotypes(final Collection<Genotype> genotypes,
.collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
for (final Map.Entry<String, Set<Object>> entry : genotypeFields.entrySet()) {
if (entry.getKey().equals(GATKSVVCFConstants.COPY_NUMBER_FORMAT)) {
collapsedAttributes.put(GATKSVVCFConstants.COPY_NUMBER_FORMAT, collapseSampleCopyNumber(genotypes, expectedCopyNumber));
// Handled elsewhere
cwhelan (Member):

Can you make this comment more explicit?

mwalker174 (Contributor, Author):

Changed to:

  // Copy number attribute set later using special logic in collapseSampleCopyNumber
  // Unit testing is easier with this design

I made an attempt to move collapseSampleCopyNumber() here, but it would have required substantial changes to the unit tests. I would also have to pull out CN in the calling function, which effectively requires casting an Object to an Integer and introduces a bunch of extra code.

}

public List<SVCallRecord> flush() {
buffer.addAll(clusterEngine.getOutput());
cwhelan (Member):

This is a pretty unusual object access pattern -- I think it would be more intuitive if the buffer class held a reference to the clusterEngine passed in the constructor.

mwalker174 (Contributor, Author):

Done


public List<SVCallRecord> forceFlush() {
buffer.addAll(clusterEngine.forceFlushAndGetOutput());
final List<SVCallRecord> result = new ArrayList<>(buffer);
cwhelan (Member):

Why don't these records need to be sorted as in the flush() method?

mwalker174 (Contributor, Author):

Good catch, it should be sorted here as well.
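A sketch of the resulting buffer class under both review points above (the class name and the comparator are placeholders; engine method names follow the diff context):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  public final class ClusterOutputBuffer {
      private final SVClusterEngine clusterEngine;   // held directly, per review
      private final Comparator<SVCallRecord> recordComparator;
      private final List<SVCallRecord> buffer = new ArrayList<>();

      public ClusterOutputBuffer(final SVClusterEngine clusterEngine,
                                 final Comparator<SVCallRecord> recordComparator) {
          this.clusterEngine = clusterEngine;
          this.recordComparator = recordComparator;
      }

      public List<SVCallRecord> flush() {
          buffer.addAll(clusterEngine.getOutput());
          return drainSorted();
      }

      public List<SVCallRecord> forceFlush() {
          buffer.addAll(clusterEngine.forceFlushAndGetOutput());
          return drainSorted();   // now sorted here too, per the fix above
      }

      // Both flush paths share the same sort-and-clear logic
      private List<SVCallRecord> drainSorted() {
          final List<SVCallRecord> result = new ArrayList<>(buffer);
          result.sort(recordComparator);
          buffer.clear();
          return result;
      }
  }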

} else if (expectedCopyNumber == 0) {
return Collections.emptyList();
} else if (expectedCopyNumber == 1) {
// At this point we know it's a DUP, and since it's haploid the phasing is unambiguous
cwhelan (Member):

This seems like the right thing to do given how we are using the output VCF downstream, but it's unfortunate that the allele is named SV_SIMPLE_DUP which implies a single duplication of the segment. I'm not sure how best to make it less confusing, though.

mwalker174 (Contributor, Author):

Added a TODO here noting that we need a generic multi-copy DUP allele for this
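A sketch of the haploid branch being discussed (an illustration of the logic, not the exact implementation; Allele.SV_SIMPLE_DUP is the htsjdk symbolic allele):

  import htsjdk.variant.variantcontext.Allele;
  import java.util.Collections;
  import java.util.List;

  // With expected copy number 1, any copy gain must lie on the single
  // haplotype, so phasing is unambiguous. Copy numbers > 2 still collapse
  // onto one SV_SIMPLE_DUP allele, hence the TODO about a generic
  // multi-copy DUP allele.
  static List<Allele> haploidDupAlleles(final int copyNumber, final Allele refAllele) {
      if (copyNumber <= 1) {
          return Collections.singletonList(refAllele);  // no gain
      }
      return Collections.singletonList(Allele.SV_SIMPLE_DUP);
  }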

} else {
final String messageAlleles = String.join(", ",
siteAltAlleles.stream().map(Allele::getDisplayString).collect(Collectors.toList()));
throw new IllegalArgumentException("Unsupported CNV alt alleles: " + messageAlleles);
cwhelan (Member):

It might be good to throw in a unit test that hits this (unless you already did and I missed it).

mwalker174 (Contributor, Author):

Done, also added one for the exception below
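For illustration, a TestNG case along these lines could pin the exception behavior (the no-arg collapser constructor and the makeCnvRecordWithAlt helper are hypothetical):

  import htsjdk.variant.variantcontext.Allele;
  import org.testng.annotations.Test;

  @Test(expectedExceptions = IllegalArgumentException.class)
  public void testCollapseThrowsOnUnsupportedCnvAltAllele() {
      final CanonicalSVCollapser collapser = new CanonicalSVCollapser();  // assumed constructor
      // An insertion allele is not a valid CNV alt, so this should throw
      collapser.collapse(makeCnvRecordWithAlt(Allele.SV_SIMPLE_INS));
  }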

}

@DataProvider(name = "collapseSampleGenotypesTestData")
public Object[][] collapseSampleGenotypesTestData() {
cwhelan (Member):

These tests are very nice. I worry a little that they are somewhat hard to find initially for someone doing a bug fix -- say you wanted to fix a problem in the method CanonicalSVCollapser.getCNVGenotypeAllelesFromCopyNumber() -- you'd have to know to come to SVCollapserTest.collapseSampleGenotypesTestData() to find your test case. This class only seems to be testing the CanonicalSVCollapser -- what about renaming it to CanonicalSVCollapserUnitTest?

mwalker174 (Contributor, Author):

Good suggestion (I think I forgot to change this after refactoring)

@mwalker174 (Contributor, Author) left a comment:

Thanks @cwhelan, I think I've addressed all comments.

)
private double defragPaddingFraction = CNVLinkage.DEFAULT_PADDING_FRACTION;

@Argument(fullName = DEFRAG_SAMPLE_OVERLAP_LONG_NAME,
mwalker174 (Contributor, Author):

Yes, I struggled with this issue. I'd rather not overload the arguments from SVClusterEngineArgumentsCollection since I think it should have a different default value. One option is to create a separate tool for defragmentation, but that would require extra code. I've modified the docstring to be a little clearer:

Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode.

and also for padding:

Padding as a fraction of variant length for CNV defragmentation mode.
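Put together, the argument declarations might read as follows (constant and field names beyond those in the snippet above, and the sample-overlap default, are assumptions):

  import org.broadinstitute.barclay.argparser.Argument;

  @Argument(fullName = DEFRAG_PADDING_FRACTION_LONG_NAME,
          doc = "Padding as a fraction of variant length for CNV defragmentation mode.",
          optional = true)
  private double defragPaddingFraction = CNVLinkage.DEFAULT_PADDING_FRACTION;

  @Argument(fullName = DEFRAG_SAMPLE_OVERLAP_LONG_NAME,
          doc = "Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode.",
          optional = true)
  private double defragSampleOverlapFraction = CNVDefragmenter.DEFAULT_SAMPLE_OVERLAP;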

clusterParameterArgs.getDepthParameters(), clusterParameterArgs.getMixedParameters(),
clusterParameterArgs.getPESRParameters());
} else {
throw new UnsupportedOperationException("Unsupported algorithm: " + algorithm.name());
mwalker174 (Contributor, Author):

Done

@cwhelan (Member) left a comment:

Thanks for the changes, they look good. I left a couple of comments that relate to the VCF spec and its future. Feel free to respond to them and continue the conversation or just use them as food for thought for the future directions of the pipeline.

@ldgauthier (Contributor) left a comment:

Really just trivial changes to the gCNV bits, so those are fine as far as I'm concerned.

@mwalker174 mwalker174 merged commit 2c167ee into master Jan 26, 2022
@mwalker174 mwalker174 deleted the mw_svcluster branch January 26, 2022 15:43