Update SV clustering classes #7243
Conversation
src/main/java/org/broadinstitute/hellbender/tools/spark/sv/utils/GATKSVVCFConstants.java
I can't find the code to do the defragmentation in bin-space that I need for exomes.
I also had a question regarding what the spec says about determining the ref allele.
Other than that, lots of little clarifications and suggestions you can take or leave. I didn't go over the tests with a fine-toothed comb, but I can chip in if Chris doesn't have the time for that.
src/main/java/org/broadinstitute/hellbender/tools/sv/SVCallRecordUtils.java
```java
protected Object collapseSingleVariantAttribute(final Set<Object> values) {
    if (values.size() == 1) {
        return values.iterator().next();
    } else {
```
We don't want to add up DP or anything?
See comment above.
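For illustration, here is a stdlib-only sketch of the collapsing behavior under discussion, extended with a hypothetical additive-key path for depth-like fields. The key set and class name are invented for this sketch; the actual PR code operates on VCF attributes via htsjdk types.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class AttributeCollapser {
    // Hypothetical set of keys whose values add up across collapsed records
    private static final Set<String> ADDITIVE_KEYS =
            new HashSet<>(Collections.singleton("DP"));

    // Collapse a set of distinct attribute values into one representative value
    public static Object collapseSingleVariantAttribute(final String key, final Set<Object> values) {
        if (values.size() == 1) {
            return values.iterator().next();  // unanimous value wins
        }
        if (ADDITIVE_KEYS.contains(key)) {
            // Sum numeric values for additive keys such as DP
            int sum = 0;
            for (final Object v : values) {
                sum += ((Number) v).intValue();
            }
            return sum;
        }
        return null;  // ambiguous and non-additive: no representative value
    }
}
```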
src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/SVCollapser.java
src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CNVDefragmenter.java
```java
        call2End - call2Start + 1, Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM),
        Lists.newArrayList(Allele.REF_N, Allele.SV_SIMPLE_DEL),
        Collections.emptyList(), Collections.emptyMap());
Assert.assertTrue(defragmenter.clusterTogether(call1, call2));
```
This is a nice improvement over the old checks.
This is great stuff! Looks really good. The tests are great, too. Feel free to take some of my comments as suggestions rather than requirements, or correct me if I've misunderstood something as I'm commenting on it.
```java
// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
public static final String NEUTRAL_COPY_NUMBER_KEY = "NCN";
```
What's this one for? Is it to tag copy number neutral events? Is it meant to hold the reference ploidy?
It's for ploidy - future PR.
```java
public static final String CLUSTER_MEMBER_IDS_KEY = "MEMBERS";

// evidence metrics
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
```
The new constants from this point on don't seem to be used anywhere in the GATK code base that I can see. Am I missing some way in which they are used? If not, do we want to add them now or hold off until they are used for something?
I tried to cherry-pick all the changes for this file, but these are actually for a later PR. I'll remove them for now.
```java
public static final String COPY_NUMBER_LOG_POSTERIORS_KEY = "CNLP";
public static final String NEUTRAL_COPY_NUMBER_KEY = "NCN";
public static final String DEPTH_OVERLAP_KEY = "RDOV";
public static final String DEPTH_P_HARDY_WEINBERG_LOSS_FIELD = "PHW_L";
```
I'm also not sure what some of these are. If you want to keep them, could you add some quick comments here or somewhere else to describe what they are?
```java
@Override
public int getStart() {
    return start;
// ...
public boolean isCNV() {
```
Some of the CPX types in our post-0506 VCFs are or contain CNVs. Maybe change this to something like `isSimpleCNV`?
Done
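A minimal sketch of the rename, assuming a local enum standing in for htsjdk's `StructuralVariantType` (all names here are illustrative only):

```java
public class SVTypeUtils {
    // Local stand-in for htsjdk's StructuralVariantType, for illustration
    public enum SVType { DEL, DUP, CNV, INS, INV, BND, CPX }

    // "Simple" CNVs are pure DEL/DUP/CNV records; CPX events that merely
    // contain CNV components are deliberately excluded, which motivated
    // renaming isCNV() to isSimpleCNV().
    public static boolean isSimpleCNV(final SVType type) {
        return type == SVType.DEL || type == SVType.DUP || type == SVType.CNV;
    }
}
```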
```java
 * @param attributes attributes to apply to all new genotypes
 * @return genotypes augmented with missing samples
 */
public static GenotypesContext fillMissingSamplesWithGenotypes(final GenotypesContext genotypes,
```
Just a suggestion, but I might change the name of this to `populateGenotypesForMissingSamplesWithAlleles`.
Done
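The utility's behavior can be sketched with plain collections, with sample-to-allele maps standing in for htsjdk's `GenotypesContext` (the class and method names below are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GenotypeFiller {
    // Samples absent from the genotypes map receive the supplied alleles
    // (e.g. no-call "." alleles), mirroring the renamed
    // populateGenotypesForMissingSamplesWithAlleles utility.
    public static Map<String, List<String>> fillMissingSamples(
            final Map<String, List<String>> genotypes,
            final Set<String> allSamples,
            final List<String> missingAlleles) {
        final Map<String, List<String>> result = new LinkedHashMap<>(genotypes);
        for (final String sample : allSamples) {
            result.putIfAbsent(sample, missingAlleles);
        }
        return result;
    }
}
```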
```java
} else if (newType.equals(StructuralVariantType.INS)) {
    // Median
    final int[] lengths = items.stream().mapToInt(SVCallRecord::getLength).toArray();
    final int midIndex = lengths.length / 2;
```
Would it make sense to use `MathUtils.median()`? You might have to update it or create another flavor of the method that would let you use the averaging `EstimationType`, but that might be more generally useful for GATK.
I don't love the implementation but I suppose I should use the existing utility. If it ends up being slow we can propose a change there.
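The convention in the snippet (take the element at index n/2 of the sorted array, rather than averaging the two middle elements for even n) can be sketched as follows; this is a stdlib-only stand-in, not the `MathUtils.median()` implementation:

```java
import java.util.Arrays;

public class MedianUtils {
    // Median by the lengths[lengths.length / 2] convention from the PR:
    // for even counts this selects the upper of the two middle elements
    // instead of averaging them (which an averaging EstimationType would do).
    public static int medianLength(final int[] lengths) {
        final int[] sorted = lengths.clone();  // avoid mutating the input
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }
}
```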
```java
    }
}

/**
```
A high-level description would be useful here.
Done
```java
final List<Integer> startPositions = items.stream().map(SVCallRecord::getPositionA).sorted().collect(Collectors.toList());
final List<Integer> endPositions = items.stream().map(SVCallRecord::getPositionB).sorted().collect(Collectors.toList());
// use the mid value of the sorted list so the start and end represent real breakpoint observations
final int medianStart = startPositions.get(startPositions.size() / 2);
```
Could use `MathUtils.median` here as well.
Done. Also replaced mean calculations below with `(int) Math.round(MathUtils.sum(startPositions) / (double) startPositions.length);` since it's now an array type.
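The rounded-mean replacement described above amounts to the following stdlib-only sketch (`MathUtils.sum` is replaced by a plain loop here):

```java
public class PositionUtils {
    // Mean breakpoint position rounded to the nearest integer, mirroring
    // (int) Math.round(MathUtils.sum(positions) / (double) positions.length)
    public static int meanPosition(final int[] positions) {
        long sum = 0;  // long guards against overflow on large coordinates
        for (final int p : positions) {
            sum += p;
        }
        return (int) Math.round(sum / (double) positions.length);
    }
}
```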
```java
 * @param items
 * @return (key, value) entry of (start, end)
 */
protected Map.Entry<Integer,Integer> collapseInterval(final Collection<SVCallRecord> items) {
```
It might be clearer to use an `org.apache.commons.lang3.tuple.Pair` rather than a `Map.Entry`.
Ah yes that is much better!
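The readability difference is easy to see side by side. Below, a minimal immutable `Pair` (a hand-rolled stand-in for `org.apache.commons.lang3.tuple.Pair`, which is not on this sketch's classpath) replaces the `Map.Entry`; the `collapseInterval` simplification is hypothetical:

```java
public class IntervalPair {
    // Minimal stand-in for org.apache.commons.lang3.tuple.Pair: a (start, end)
    // pair reads as getLeft()/getRight() rather than getKey()/getValue(),
    // which wrongly suggests a key-value relationship.
    public static final class Pair<L, R> {
        private final L left;
        private final R right;
        private Pair(final L left, final R right) { this.left = left; this.right = right; }
        public static <L, R> Pair<L, R> of(final L left, final R right) { return new Pair<>(left, right); }
        public L getLeft() { return left; }
        public R getRight() { return right; }
    }

    // Hypothetical simplification of collapseInterval returning a Pair
    public static Pair<Integer, Integer> collapseInterval(final int start, final int end) {
        return Pair.of(start, end);
    }
}
```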
```java
{SVTestUtils.newCallRecordWithAlgorithms(Collections.emptyList()), false},
{SVTestUtils.newCallRecordWithAlgorithms(Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)), true},
{SVTestUtils.newCallRecordWithAlgorithms(Collections.singletonList("pesr")), false},
{SVTestUtils.newCallRecordWithAlgorithms(Lists.newArrayList(GATKSVVCFConstants.DEPTH_ALGORITHM, "pesr")), false}
```
Might as well add "pesr" to `GATKSVVCFConstants` too.
We usually use the caller name, e.g. "manta", "melt", etc., for PESR and should allow for any caller name (that is not "depth"), whereas "depth" has a special meaning. Here "pesr" is just for testing.
Force-pushed 75dd534 to fa907fd (compare)
Thanks @ldgauthier and @cwhelan for the reviews! I believe I've responded to all of them. There's been some major refactoring of the clustering classes as suggested by @cwhelan to make the clustering engine more flexible. I've also made some changes to the `SVCallRecord` class, in particular the ability to have null length/strand values.
```java
/**
 * CNV defragmenter for when the intervals used for coverage collection are available. The intervals are used to adjust
 * variant padding by rounding to the nearest bin boundary.
```
Is there something missing from `getPaddedRecordInterval()` below?
```java
 */
public class BinnedCNVDefragmenter extends CNVDefragmenter {

    protected final TreeMap<GenomeLoc, Integer> genomicToBinMap;
```
Did you consider using `SVIntervalTree`? It could be faster because it's contig-aware and was written by @tedsharpe. The downside is converting from `GenomeLoc` to `SVInterval`, which involves converting contig IDs to ints.
I'm hesitant to start mixing in `SVIntervals`, mainly because of the confusion it can create with differences in coordinates.
Also, I might make the name of this map a little more explicit, i.e. `genomicCoordinatesToBinIndexMap`.
Done
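The bin-boundary rounding that `BinnedCNVDefragmenter` performs can be sketched for a single contig with a plain `TreeMap` of bin start positions. The real map is keyed on `GenomeLoc` and is contig-aware; the names and single-contig simplification here are illustrative only:

```java
import java.util.TreeMap;

public class BinBoundaryUtils {
    // Round a padded interval edge to the nearest bin start using the
    // sorted map's floorKey/ceilingKey navigation. Assumes the map is
    // non-empty; single-contig simplification for illustration.
    public static int roundToNearestBinStart(final TreeMap<Integer, Integer> binStartToIndex,
                                             final int position) {
        final Integer floor = binStartToIndex.floorKey(position);   // nearest bin start <= position
        final Integer ceil = binStartToIndex.ceilingKey(position);  // nearest bin start >= position
        if (floor == null) {
            return ceil;
        }
        if (ceil == null) {
            return floor;
        }
        return (position - floor <= ceil - position) ? floor : ceil;
    }
}
```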
src/main/java/org/broadinstitute/hellbender/tools/sv/SVCallRecord.java
```java
final int positionB,
final boolean strandB,
final StructuralVariantType type,
final int length,
```
I've made length and the strands into `Integer` and `Boolean` types, respectively, so that they can take on null values when appropriate (e.g. length for BNDs or strands for CNVs). This required quite a bit of refactoring and fixing of tests, but I seem to have it working. There may be some performance hit from boxing/unboxing these values everywhere, but I do agree that passing around -1 everywhere (and having nonsensical strand values) is problematic in some ways.
It is a little tricky to have constructors that allow for length/strand parameters that are potentially inconsistent, so I have checks to ensure the values are consistent and also sanitize attributes that could contain redundant information.
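The consistency checks described above might look roughly like this (a hypothetical sketch with SV types reduced to strings; the real constructor validates against `StructuralVariantType`):

```java
public class SVRecordChecks {
    // Types that define a length (DEL, DUP, INS, ...) must carry a non-null
    // length; BND records, whose length is undefined, must carry null.
    public static Integer validateLength(final String type, final Integer length) {
        final boolean lengthRequired = !type.equals("BND");
        if (lengthRequired && length == null) {
            throw new IllegalArgumentException("Missing length for type " + type);
        }
        if (!lengthRequired && length != null) {
            throw new IllegalArgumentException("Unexpected length for type " + type);
        }
        return length;
    }
}
```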
```java
}
Utils.validateArg(record.isIntrachromosomal(), "Inversion is not intrachromosomal");
final SVCallRecord positiveBreakend = new SVCallRecord(record.getId(), record.getContigA(),
        record.getPositionA(), true, record.getContigB(), record.getPositionB(), true, StructuralVariantType.BND, -1,
```
Fixed this - see comment in SVCallRecord.java.
This looks good to me now; sorry for the delay in re-review. I only have a couple of very minor stylistic comments; feel free to merge after deciding whether or not to make changes based on them.
```java
    return newAttributes;
}

private static Integer getLength(final VariantContext variant, final StructuralVariantType type) {
```
I think it's a little confusing to have `getLength` and `getStrands` return null if the values are inferrable from the SV type -- in my view that'd be surprising behavior if I were reading through the `SVCallRecord.create()` method. I'd recommend putting the "SVCallRecord can resolve these" clauses of these if statements back into `SVCallRecord.create()` and having these methods just have the else clauses -- parsing the attribute values, etc.
Done
```java
if (i < tokens.length && subtype.equals(tokens[i])) {
    alleleSize = i + 1;
} else {
    break outerloop;
```
Labeled breaks always look a little weird to me (probably because I've worked on other codebases where they were explicitly against the style guide) and we don't have many of them in the GATK codebase. You could replace this with `return Arrays.copyOf(firstTokens, alleleSize);` since there's nothing else going on after the method, or refactor the loop labeled `outerloop` into a `calculateAlleleSize` method and just return from it. But I guess they're not against current style guidelines and my aversion to them might just be a personal quirk, so feel free to keep this if you like it better.
Done, just replaced with `return Arrays.copyOf(firstTokens, alleleSize);`.
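The refactor is a classic "replace labeled break with early return". A simplified, stand-alone version of the token-prefix computation (the method and parameter names here are invented for illustration) might read:

```java
import java.util.Arrays;

public class AlleleTokenUtils {
    // Compute the longest common prefix of the first token array with all
    // others, returning directly from the loop instead of using a labeled
    // break to jump to trailing code.
    public static String[] commonPrefix(final String[] first, final String[][] others) {
        int prefixSize = first.length;
        for (final String[] tokens : others) {
            int i = 0;
            while (i < prefixSize && i < tokens.length && first[i].equals(tokens[i])) {
                i++;
            }
            prefixSize = i;  // shrink prefix to what this array shares
        }
        return Arrays.copyOf(first, prefixSize);
    }
}
```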
Force-pushed fa907fd to a65420d (compare)
I still have to go over the tests, but here are the code comments. Mostly requests to rename, but there was one concern I had about dupes with no-call GTs.
src/main/java/org/broadinstitute/hellbender/tools/sv/SVCallRecord.java
```java
final int copyNumber = VariantContextGetters.getAttributeAsInt(genotype, COPY_NUMBER_FORMAT, 0);
if (altAllele.equals(Allele.SV_SIMPLE_DEL)) {
    return copyNumber < ploidy;
} else {
```
I see that there are some nice checks in the calling method, but can you put javadoc to be explicit about the parameters?
Not sure if this is relevant, but I clearly documented the new `isCarrier()` method.
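The carrier logic visible in the snippet (copy number below ploidy for a deletion, above for a duplication) reduces to a one-line predicate. This is a stripped-down sketch, not the documented `isCarrier()` signature:

```java
public class CarrierUtils {
    // A sample carries a DEL when its copy number is below the expected
    // ploidy, and a DUP when it is above; copy number equal to ploidy is
    // copy-neutral and therefore non-carrier for both.
    public static boolean isCarrier(final boolean isDeletion, final int copyNumber, final int ploidy) {
        return isDeletion ? copyNumber < ploidy : copyNumber > ploidy;
    }
}
```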
```java
for (final Genotype g : variant.getGenotypes()) {
    if (g.isHomRef() || (g.isNoCall() && !g.hasExtendedAttribute(GATKSVVCFConstants.COPY_NUMBER_FORMAT))) {
        continue;
// ...
private static Pair<Boolean, Boolean> inferStrands(final StructuralVariantType type,
```
I'll let Chris double check this logic.
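For reference while double-checking, this is the strand convention the inference presumably follows (DEL joins + to -, DUP joins - to +, and strandless types get nulls). This is a hedged sketch with strings standing in for `StructuralVariantType`, and the assumed convention should be verified against the actual implementation:

```java
public class StrandUtils {
    // Returns {strandA, strandB} with true = '+', false = '-', and null for
    // no meaningful strand (e.g. multi-allelic CNVs, per the Boolean change
    // in this PR). Convention assumed here: DEL is (+,-), DUP is (-,+).
    public static Boolean[] inferStrands(final String type) {
        switch (type) {
            case "DEL": return new Boolean[]{Boolean.TRUE, Boolean.FALSE};
            case "DUP": return new Boolean[]{Boolean.FALSE, Boolean.TRUE};
            default:    return new Boolean[]{null, null};
        }
    }
}
```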
src/main/java/org/broadinstitute/hellbender/tools/sv/SVCallRecord.java
src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/SVClusterEngineFactory.java
src/main/java/org/broadinstitute/hellbender/tools/sv/cluster/CanonicalSVCollapser.java
```java
Assert.assertFalse(defragmenter.areClusterable(deletionRAR, deletionAAR));
Assert.assertFalse(defragmenter.areClusterable(deletionRAR, deletionAAA));

Assert.assertTrue(defragmenter.areClusterable(deletionARR1, deletionARR1Copy));
```
Ooh good call. I have a branch where I ran JointGermlineCNVSegmentation on two multi-sample input VCFs, but never merged -- is that possible with this code now? If not then maybe I want to rebase before you merge...
I think you might be missing a few test cases, but otherwise just tidying up that you can choose to do or not.
src/test/java/org/broadinstitute/hellbender/tools/sv/SVCallRecordUtilsUnitTest.java
src/test/java/org/broadinstitute/hellbender/tools/sv/cluster/SVCollapserTest.java
```java
        Collections.singletonList(Allele.REF_N)
),
Collections.singletonList(Allele.SV_SIMPLE_DEL),
Lists.newArrayList(Allele.REF_N, Allele.SV_SIMPLE_DEL)
```
Can you add a comment about preferring higher ploidy?
Done
```java
        Lists.newArrayList(Allele.SV_SIMPLE_DEL, Allele.SV_SIMPLE_DUP),
        Lists.newArrayList(Allele.SV_SIMPLE_DUP, Allele.SV_SIMPLE_DEL)
    },
};
```
The javadoc for the method says "potentially augmented with additional ref alleles to match the ploidy" -- is there a test for that? Like DEL, REF/REF?
Good catch, added a test for that and additional CNV cases such as haploid DUPs where we can infer GT exactly.
```java
{
        new int[]{-1},
        new String[]{"chr2"},
        new StructuralVariantType[]{StructuralVariantType.BND},
```
Can you add another BND here or another case with multiple?
Done
```java
public Object[][] collapseAlgorithmsTestData() {
    return new Object[][]{
            {Collections.singletonList(Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)), Collections.singletonList(GATKSVVCFConstants.DEPTH_ALGORITHM)},
            {Collections.singletonList(Collections.singletonList("pesr")), Collections.singletonList("pesr")},
```
Are there cases where events can be supported by depth and pesr evidence?
Yes, added those cases now.
Back to you @ldgauthier
This must be the most discussed tool in the GATK. I am relieved to close the book on this PR. Very nice work @mwalker174 -- thanks for all the refactoring!
Reworks classes used by `JointGermlineCNVSegmentationIntegration` for SV clustering and defragmentation. The design of `SVClusterEngine` has been overhauled to enable the implementation of `CNVDefragmenter` and `BinnedCNVDefragmenter` subclasses. Logic for producing representative records from a collection of clustered SVs has been separated into an `SVCollapser` class, which provides enhanced functionality for handling genotypes for SVs more generally.

A number of bugs, particularly with max-clique clustering, have been fixed, as well as a parameter swap bug in `JointGermlineCNVSegmentationIntegration`.

This is the first of a series of PRs for an experimental Java-based implementation of some modules in the `gatk-sv` pipeline, including SV VCF merging, clustering, evidence aggregation, and genotyping.