SVCluster tool #7541

Merged: 8 commits merged into master on Jan 26, 2022
Conversation

@mwalker174 (Contributor) commented Nov 1, 2021

Implements a tool for clustering SVs, built on top of the clustering engine code recently refined in #7243. In addition to a few bug fixes, updates include:

  • PloidyTable class, which ingests a TSV of per-sample contig ploidies and serves as a simple data class for it (see the sketch after this list). This was necessary for inferring genotypes when input VCFs contain non-matching sample and variant records.
  • Modified SVClusterEngine to emit sorted output.
  • Improved code for SV record collapsing (see the CanonicalSVCollapser), particularly for CNVs. Genotype collapsing now infers allele phasing in certain unambiguous cases, in particular for DUPs and multi-allelic CNVs. Testing has been cleaned up and augmented with further cases to validate this functionality.
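As a sketch of the PloidyTable idea (the actual file format and API may differ; the column layout and class shape here are assumptions), the TSV could look like:

  SAMPLE    chr1  chr2  ...  chrX  chrY
  sample_1  2     2     ...  2     0
  sample_2  2     2     ...  1     1

backed by a simple lookup class along these lines:

  import java.util.Map;

  // Hypothetical core of a PloidyTable: sample -> (contig -> ploidy)
  public final class PloidyTableSketch {
      private final Map<String, Map<String, Integer>> ploidies;

      public PloidyTableSketch(final Map<String, Map<String, Integer>> ploidies) {
          this.ploidies = ploidies;
      }

      public int get(final String sample, final String contig) {
          final Map<String, Integer> contigPloidies = ploidies.get(sample);
          if (contigPloidies == null || !contigPloidies.containsKey(contig)) {
              throw new IllegalArgumentException("No ploidy entry for " + sample + " on " + contig);
          }
          return contigPloidies.get(contig);
      }
  }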

@cwhelan (Member) left a comment:

@mwalker174 This is looking great, glad to see it so fleshed out.

newGenotypes.add(new GenotypeBuilder(g).alleles(genotypeAlleles).make());
}
builder.genotypes(newGenotypes);
if (record.getGenotypes().stream().anyMatch(g -> g.getAlleles().isEmpty())) {
cwhelan (Member):

It feels inefficient to stream over the genotypes twice (first to check if any are empty, then to sanitize), even if the anyMatch stream stops at the first match. Why not just always return the mapped stream below? Or do you think the overhead costs of mapping and re-collecting outweigh the benefits?

mwalker174 (Contributor, Author):

I think you're right, this was a premature optimization on my part. I was thinking it would be relatively rare to have empty alleles, and this would make it faster by not having to make a copy of the genotypes. On second thought, I'm not sure this would have a large effect, and it could lead to puzzling performance inconsistency across VCFs (e.g., a chrY VCF would usually run slower since it's likely to have many empty genotypes). I've changed it to simply call sanitize on every genotype.
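A minimal sketch of the revised logic (the empty-allele policy shown, replacing with a single no-call, is an assumption):

  import htsjdk.variant.variantcontext.Allele;
  import htsjdk.variant.variantcontext.Genotype;
  import htsjdk.variant.variantcontext.GenotypeBuilder;
  import java.util.Collections;
  import java.util.List;
  import java.util.stream.Collectors;

  // Sanitize unconditionally instead of first scanning with anyMatch; the
  // single pass costs a genotype copy but behaves consistently on all inputs.
  static List<Genotype> sanitizeAll(final List<Genotype> genotypes) {
      return genotypes.stream()
              .map(g -> g.getAlleles().isEmpty()
                      ? new GenotypeBuilder(g).alleles(Collections.singletonList(Allele.NO_CALL)).make()
                      : g)
              .collect(Collectors.toList());
  }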

for (final String sample : missingSamples) {
final GenotypeBuilder genotypeBuilder = new GenotypeBuilder(sample);
if (attributes != null) {
genotypeBuilder.attributes(attributes);
final int ploidy = ploidyTable.get(sample, contig);
cwhelan (Member):

I'd add some documentation in the method javadoc about how this logic works (and add @param docs for refAlleleDefault and ploidyTable).

mwalker174 (Contributor, Author):

Done
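One possible shape for that javadoc (the described behavior is inferred from the surrounding diff, so the wording is a sketch):

  /**
   * Creates genotypes for samples that are present in the header but missing
   * from the variant record. The number of alleles assigned to each new
   * genotype is determined by the sample's ploidy on the record's contig.
   *
   * @param refAlleleDefault if true, filled-in genotypes default to
   *                         homozygous-reference alleles; otherwise they
   *                         default to no-call alleles
   * @param ploidyTable      per-sample, per-contig ploidies used to decide
   *                         how many alleles each filled-in genotype carries
   */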

.sorted()
.collect(Collectors.toList());
// Alt alleles match already, or some alts vanished in genotype collapsing (and we keep them)
if (genotypeAltAlleles.equals(sortedAltAlleles) || sortedAltAlleles.containsAll(genotypeAltAlleles)) {
cwhelan (Member):

The first part of the || seems redundant with the second.

mwalker174 (Contributor, Author):

Probably another premature optimization - I was trying to avoid containsAll(), which is quadratic for two lists. These lists will likely be short. Again favoring performance consistency over optimizing the more common case, I've changed genotypeAltAlleles to a Set and now only check containsAll().
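For reference, a sketch of the linear-time check (names are illustrative, and note it is the superset that benefits from hashing, which may differ slightly from the actual change):

  import htsjdk.variant.variantcontext.Allele;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // List.containsAll over two lists is O(N*M); probing a HashSet instead makes
  // the subset check O(N + M) on average. The equals() branch is dropped since
  // equal collections trivially satisfy containsAll.
  static boolean altsSubsumed(final List<Allele> genotypeAltAlleles,
                              final List<Allele> siteAltAlleles) {
      final Set<Allele> siteAltSet = new HashSet<>(siteAltAlleles);
      return siteAltSet.containsAll(genotypeAltAlleles);
  }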

&& sampleAltAlleles.get(0).equals(Allele.SV_SIMPLE_DUP)) {
return Collections.nCopies(expectedCopyNumber, Allele.NO_CALL);
// Ploidy 0
if (expectedCopyNumber == 0) {
cwhelan (Member):

There are some places where you get, e.g., coverage on Y in females because of segmental duplications between Y and other chromosomes. Do you think we always want to throw away data from those places? (Maybe we do, I just wanted to raise it as a discussion point.)

mwalker174 (Contributor, Author):

This is a great point. On the one hand, I can see why it would be of interest and valid to preserve the calls there. On the other hand, it seems like a contradiction to have a duplication "on" a chromosome that doesn't exist. I think this highlights a weakness of the way we represent duplications.

For example, consider a seg dup present in 2 copies on the reference and a sample that matches reference (assume diploid for both). Then you technically have 4 copies of this particular sequence, but should that get called as a dup? Most would argue probably not, but in some sense it's not wrong. If you did decide to call it, would you call it at one or both? What if there were a duplication of one copy but you had insufficient information to determine which locus? Where do you put the call? Do you call it twice - once at each locus? Seems ambiguous.

I don't want to open that can of worms here. If we really wanted to, I think representing such events with an insertion might be practical. IMO, graph representations are the solution to these kinds of issues. I've added a TODO here.

cwhelan (Member):

Some of the issues you are talking about here in your second paragraph are behind some of the changes that are coming to the VCF spec in version 4.4. In particular, the SVCLAIM info field will allow you to specify whether your deletion and duplication calls are really claiming that there are breakpoints at the specified location, or if the claim is simply that the copy number of the segment on the reference differs from expectations based on read depth. My initial thought when reading this segment of code was that representing things in this style might allow us to preserve the calls we are throwing away here. What do you think of that approach?

Genome STRiP measures the copy number of repeated sequences like segmental duplications and represents it by adding a "duplicate segments" info field. So for example if there are two copies of a segmental duplication in the reference on the autosomes (so you expect a reference sample to be copy number 4), and the copy number is measured to be 5 but we don't know which side of the seg dup is duplicated, it writes a record at one locus with the duplicate segment info field pointing to the other segment and the sample's CN format field set to 5. In VCF 4.4 this sort of thing would still make sense using a depth SVCLAIM type representation.

mwalker174 (Contributor, Author):

It would be very realistic to adopt 4.4 in the output in the near future (or have a setting to select which spec to use), and it seems like it would be easy to set SVCLAIM at the end of the pipeline using the EVIDENCE field. Carrying it through certainly has its advantages, including resolving cases like the one you suggested above, but might be hard to implement in gatk-sv given the current scope of the pipeline.
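To make the SVCLAIM discussion above concrete, a hedged example of what two VCF 4.4-style DEL records might look like (header and columns abbreviated; per the 4.4 spec, "D" claims a read-depth/copy-number change and "J" claims a novel adjacency at the breakpoints):

  ##INFO=<ID=SVCLAIM,Number=A,Type=String,Description="Claim made by the SV call">
  chr1  100000  del_depth_only  N  <DEL>  .  PASS  SVTYPE=DEL;END=150000;SVCLAIM=D
  chr1  100000  del_breakpoint  N  <DEL>  .  PASS  SVTYPE=DEL;END=150000;SVCLAIM=DJ

A depth-only claim ("D") could in principle preserve calls like the zero-ploidy seg-dup case above without asserting breakpoints on a missing chromosome.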

@@ -401,8 +622,10 @@ protected Genotype collapseSampleGenotypes(final Collection<Genotype> genotypes,
.collect(Collectors.groupingBy(Map.Entry::getKey, Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
for (final Map.Entry<String, Set<Object>> entry : genotypeFields.entrySet()) {
if (entry.getKey().equals(GATKSVVCFConstants.COPY_NUMBER_FORMAT)) {
collapsedAttributes.put(GATKSVVCFConstants.COPY_NUMBER_FORMAT, collapseSampleCopyNumber(genotypes, expectedCopyNumber));
// Handled elsewhere
cwhelan (Member):

Can you make this comment more explicit?

mwalker174 (Contributor, Author):

Changed to:

  // Copy number attribute set later using special logic in collapseSampleCopyNumber
  // Unit testing is easier with this design

I made an attempt to move collapseSampleCopyNumber() here, but it would have required substantial changes to the unit tests. I would also have to pull out CN in the calling function, which effectively requires casting an Object to an Integer and introduces a bunch of extra code.

}

public List<SVCallRecord> flush() {
buffer.addAll(clusterEngine.getOutput());
cwhelan (Member):

This is a pretty unusual object access pattern -- I think it would be more intuitive if the buffer class held a reference to the clusterEngine passed in the constructor.

mwalker174 (Contributor, Author):

Done


public List<SVCallRecord> forceFlush() {
buffer.addAll(clusterEngine.forceFlushAndGetOutput());
final List<SVCallRecord> result = new ArrayList<>(buffer);
cwhelan (Member):

Why don't these records need to be sorted as in the flush() method?

mwalker174 (Contributor, Author):

Good catch, it should be sorted here as well.
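A sketch of the resulting buffer class under both review points above (the class name and the comparator are placeholders; engine method names follow the diff context):

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.List;

  public final class ClusterOutputBuffer {
      private final SVClusterEngine clusterEngine;   // held directly, per review
      private final Comparator<SVCallRecord> recordComparator;
      private final List<SVCallRecord> buffer = new ArrayList<>();

      public ClusterOutputBuffer(final SVClusterEngine clusterEngine,
                                 final Comparator<SVCallRecord> recordComparator) {
          this.clusterEngine = clusterEngine;
          this.recordComparator = recordComparator;
      }

      public List<SVCallRecord> flush() {
          buffer.addAll(clusterEngine.getOutput());
          return drainSorted();
      }

      public List<SVCallRecord> forceFlush() {
          buffer.addAll(clusterEngine.forceFlushAndGetOutput());
          return drainSorted();   // now sorted here too, per the fix above
      }

      // Both flush paths share the same sort-and-clear logic
      private List<SVCallRecord> drainSorted() {
          final List<SVCallRecord> result = new ArrayList<>(buffer);
          result.sort(recordComparator);
          buffer.clear();
          return result;
      }
  }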

} else if (expectedCopyNumber == 0) {
return Collections.emptyList();
} else if (expectedCopyNumber == 1) {
// At this point we know it's a DUP, and since it's haploid the phasing is unambiguous
cwhelan (Member):

This seems like the right thing to do given how we are using the output VCF downstream, but it's unfortunate that the allele is named SV_SIMPLE_DUP which implies a single duplication of the segment. I'm not sure how best to make it less confusing, though.

mwalker174 (Contributor, Author):

Added a TODO here noting that we need a generic multi-copy DUP allele for this
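A sketch of the haploid branch being discussed (an illustration of the logic, not the exact implementation; Allele.SV_SIMPLE_DUP is the htsjdk symbolic allele):

  import htsjdk.variant.variantcontext.Allele;
  import java.util.Collections;
  import java.util.List;

  // With expected copy number 1, any copy gain must lie on the single
  // haplotype, so phasing is unambiguous. Copy numbers > 2 still collapse
  // onto one SV_SIMPLE_DUP allele, hence the TODO about a generic
  // multi-copy DUP allele.
  static List<Allele> haploidDupAlleles(final int copyNumber, final Allele refAllele) {
      if (copyNumber <= 1) {
          return Collections.singletonList(refAllele);  // no gain
      }
      return Collections.singletonList(Allele.SV_SIMPLE_DUP);
  }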

} else {
final String messageAlleles = String.join(", ",
siteAltAlleles.stream().map(Allele::getDisplayString).collect(Collectors.toList()));
throw new IllegalArgumentException("Unsupported CNV alt alleles: " + messageAlleles);
cwhelan (Member):

It might be good to throw in a unit test that hits this (unless you already did and I missed it).

mwalker174 (Contributor, Author):

Done, also added one for the exception below
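For illustration, a TestNG case along these lines could pin the exception behavior (the no-arg collapser constructor and the makeCnvRecordWithAlt helper are hypothetical):

  import htsjdk.variant.variantcontext.Allele;
  import org.testng.annotations.Test;

  @Test(expectedExceptions = IllegalArgumentException.class)
  public void testCollapseThrowsOnUnsupportedCnvAltAllele() {
      final CanonicalSVCollapser collapser = new CanonicalSVCollapser();  // assumed constructor
      // An insertion allele is not a valid CNV alt, so this should throw
      collapser.collapse(makeCnvRecordWithAlt(Allele.SV_SIMPLE_INS));
  }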

}

@DataProvider(name = "collapseSampleGenotypesTestData")
public Object[][] collapseSampleGenotypesTestData() {
cwhelan (Member):

These tests are very nice. I worry a little that they are somewhat hard to find initially for someone doing a bug fix -- say you wanted to fix a problem in the method CanonicalSVCollapser.getCNVGenotypeAllelesFromCopyNumber() -- you'd have to know to come to SVCollapserTest.collapseSampleGenotypesTestData() to find your test case. This class only seems to be testing the CanonicalSVCollapser -- what about renaming it to CanonicalSVCollapserUnitTest?

mwalker174 (Contributor, Author):

Good suggestion (I think I forgot to change this after refactoring)

@mwalker174 (Contributor, Author) left a comment:

Thanks @cwhelan, I think I've addressed all comments.

)
private double defragPaddingFraction = CNVLinkage.DEFAULT_PADDING_FRACTION;

@Argument(fullName = DEFRAG_SAMPLE_OVERLAP_LONG_NAME,
mwalker174 (Contributor, Author):

Yes, I struggled with this issue. I'd rather not overload the arguments from SVClusterEngineArgumentsCollection since I think it should have a different default value. One option is to create a separate tool for defragmentation, but that would require extra code. I've modified the docstring to be a little clearer:

Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode.

and also for padding:

Padding as a fraction of variant length for CNV defragmentation mode.
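Put together, the argument declarations might read as follows (constant and field names beyond those in the snippet above, and the sample-overlap default, are assumptions):

  import org.broadinstitute.barclay.argparser.Argument;

  @Argument(fullName = DEFRAG_PADDING_FRACTION_LONG_NAME,
          doc = "Padding as a fraction of variant length for CNV defragmentation mode.",
          optional = true)
  private double defragPaddingFraction = CNVLinkage.DEFAULT_PADDING_FRACTION;

  @Argument(fullName = DEFRAG_SAMPLE_OVERLAP_LONG_NAME,
          doc = "Minimum sample overlap fraction. Use instead of --depth-sample-overlap in CNV defragmentation mode.",
          optional = true)
  private double defragSampleOverlapFraction = CNVDefragmenter.DEFAULT_SAMPLE_OVERLAP;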

clusterParameterArgs.getDepthParameters(), clusterParameterArgs.getMixedParameters(),
clusterParameterArgs.getPESRParameters());
} else {
throw new UnsupportedOperationException("Unsupported algorithm: " + algorithm.name());
mwalker174 (Contributor, Author):

Done

@cwhelan (Member) left a comment:

Thanks for the changes, they look good. I left a couple of comments that relate to the VCF spec and its future. Feel free to respond to them and continue the conversation or just use them as food for thought for the future directions of the pipeline.

@ldgauthier (Contributor) left a comment:

Really just trivial changes to the gCNV bits, so those are fine as far as I'm concerned.

@mwalker174 mwalker174 merged commit 2c167ee into master Jan 26, 2022
@mwalker174 mwalker174 deleted the mw_svcluster branch January 26, 2022 15:43