Fixed ref conf calculation near indels #5172

ldgauthier · 2018-09-10T13:57:56Z

Previous behavior generated some PL=0,0,0 no-calls because CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel ref conf model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.

vruano

Just some code simplification ideas.

vruano · 2018-09-11T17:37:43Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

@@ -547,7 +558,7 @@ int sumMismatchingQualities(final byte[] readBases,
        for ( int i = 0; i < n; i++ ) {
            final byte readBase = readBases[readStart + i];
            final byte refBase  = refBases[refStart + i];
-            if ( readBase != refBase ) {
+            if ( readBase != refBase && readBase != 'N') {


You can use !Nucleotide.intersect(readBase, refBase) instead.

These are not totally equivalent but I guess it should work as intended (e.g. the Nucleotide. alternative would handle other ambiguity codes such as P R, W etc.)

vruano · 2018-09-11T17:57:36Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

+     * @return the new reference-aligned index/offset
+     */
+    @VisibleForTesting
+    protected int getCigarModifiedOffset (final PileupElement p){


It seems that this method intent boils down to calculate the number of "consumed reference bases" (+ softclips?) by the read alignment up to the location of the pileup in the read... CigarOperator have a consumesReferenceBases method that one can use to figure out this number without the need to make explicit reference to the operation types such INSERTION or DELETION or any other.

In other words I think there is a more straight forward solution that would count the consumed reference bases rather than "adjust" the consumed read bases to get to the same number....

perhaps would look something like this:

int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getIndexInCurrentElement() : 0; for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) { if (elem.getOperator().consumesReferenceBases() || elem.getOperator() == CigarOperator.S) { offset += elem.getLength(); } }

vruano · 2018-09-11T18:20:01Z

src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java

@@ -196,6 +196,56 @@ public static GATKRead createReadAlignedToRef(final GATKRead originalRead,
        return Arrays.copyOfRange(bases, basesStart, basesStop + 1);
    }

+    public static byte[] getBasesAlignedOneToOne(final GATKRead read) {
+        return getSequenceAlignedOneToOne(read, true, (byte)'N');


Perhaps use Nucleotide.N.encodeAsByte().

vruano · 2018-09-11T18:24:49Z

src/main/java/org/broadinstitute/hellbender/utils/read/CigarUtils.java

@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
        return result;
    }

+    public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){


Since we are in CigarUtils the "BasedOnCigar" part of the name is kind of redundant.

Looking at the code what I find peculiar is that "S" soft-clips are actually counted rather than "H" hard-clips are ignored. Both operations don't consume reference bases so I think that ignoring H is actually a natural assumption whereas counting S is not.

Should this be called countReferenceAndSoftClippedBases instead?

vruano · 2018-09-11T18:30:24Z

src/main/java/org/broadinstitute/hellbender/utils/read/CigarUtils.java

@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
        return result;
    }

+    public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){
+        if (read == null){
+            throw new IllegalArgumentException("null read");


You can be more succinct using Utils.nonNull(read, "null read") from 3 to only 1 line (will automatically improve test coverage as I presume you are not checking on the null read behavior (most people wouldn't bother).

vruano · 2018-09-11T18:35:37Z

src/main/java/org/broadinstitute/hellbender/utils/read/CigarUtils.java

+        int result = 0;
+        for(int i = cigarStartIndex; i < cigarEndIndex; i++){
+            final CigarElement cigarElement = elems.get(i);
+            switch (cigarElement.getOperator()) {


This switch could be rewritten as:

final CigarOperator operator = cigarElement.getOperator(); if (operator.consumesReferenceBases() || operator == CigarOperator.S) { result += cigarElement.getLength(); }

vruano · 2018-09-11T19:02:02Z

src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java

+        return getSequenceAlignedOneToOne(read, false, (byte)0);
+    }
+
+    public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) {


Using a boolean param to indicate whether this is an operation on quality scores or read bases smells a bit. What about provide a lambda to obtain the original-not-copied byte array given the read.

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, Function<GATKRead, byte[]> bytesProvider, int padWith) { ... }

vruano · 2018-09-11T19:07:46Z

src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java

+            int paddedPos = 0;
+            for ( int i = 0; i < cigar.numCigarElements(); i++ ) {
+                final CigarElement ce = cigar.getCigarElement(i);
+                switch ( ce.getOperator() ) {


This switch can be rewriten with a couple of ifs and using the consumesReadBases and consumesReferenceBases without the need to list operators explicitly.

vruano · 2018-09-11T20:02:59Z

Back to @ldgauthier.

vruano · 2018-09-12T18:03:57Z

Looking back into this PR... at some point you are using 'N' to pad what seem to be gaps on the read sequence. Although the end result would be the same perhaps is better to be more explicit and just use '-' instead. In that case my suggestion of using Nucleotide.intersect wouldn't cover for the '-' character so you need a explicity "&&" or "II".

When you compare the cost of each different alignment the gap-open and gap-ext are ignored (you only look at base call mismatches). I wonder whether it would be more correct to actually take them in consideration... so imagine that there is no gaps in the original alignment what-soever and that adding a 1bp gap decreases the number of mis-matches by just 1 which is typically Q30 increase in the Lk but the gap itself default penalty is Q45 so can one say that that read wouldn't still support the reference over a 1-bp gap alternative?

Example with a 2-bp gap making the trick:

Ref:  ....GCATGTGATATATATATATATATATATATACACACAC....
Read: ....GCATGTG--ATATATATATATATATATATAC <end-of-the-read>

That could happen in STRs with impurities... but if the original alignment did not added itself the gap to reduce the number of mismatches is because precisely due to the added cost of the gap-open and necessary extends that we would be ignoring here.

This is all hypothetical until some one quantifies how often this might occur ... so I'm happy to keep ignoring the gaps for now until we get a report on a real-live dataset that would benefit of such a change or some enthusiastic dsde-methods member investigates this.

ldgauthier · 2018-09-14T13:49:37Z

@vruano I addressed your comments. The code is a lot cleaner. Let me know if there's anything else. The Travis failures look unrelated so I'm rerunning them.

vruano · 2018-09-14T19:09:18Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

@@ -555,7 +555,8 @@ int sumMismatchingQualities(final byte[] readBases,
        for ( int i = 0; i < n; i++ ) {
            final byte readBase = readBases[readStart + i];
            final byte refBase  = refBases[refStart + i];
-            if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid()) {  //allow N read bases (and other ambiguous bases) to "match" reference
+            if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid() &&


Ah I see, I was assume that master had a change that I have on Nucleotides adding an new intersect(byte, byte) method:

/** * Checks whether two nucleotides intersect given their byte encodings. * @param a first nucleotide. * @param b second nucleotide. * @return {@code true} iff the input nucleotides intersect. */ public static boolean intersect(final byte a, final byte b) { return (baseToValue[0xFF & a].mask & baseToValue[0xFF & b].mask) != 0; }

Just add it in your branch's Nucleotides.java and the the if code would be:

if (!Nucleotide.intersect(readBase, refBase) && readBase != AlignmentUtils.PLACEHOLDER) { .... }

src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java

vruano · 2018-09-14T19:16:40Z

src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java

    }

-    public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) {
+    public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) {
        final Cigar cigar = read.getCigar();


Do null checking on read and baseProvider.

vruano · 2018-09-14T19:21:30Z

...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReferenceConfidenceModel.java

-            if (cigar.getCigarElement(i).getOperator().equals(CigarOperator.INSERTION)) {
-                offset -= cigar.getCigarElement(i).getLength();
+        int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getOffsetInCurrentCigar() : 0;
+        for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) {


Sorry, I didn't pay close attention that my suggested code was totally correct... the int offset = condition ... do you need to add || elem.getOperator() == CigarOperator.S)?

Make elem final.

vruano · 2018-09-14T19:39:59Z

src/main/java/org/broadinstitute/hellbender/utils/read/CigarUtils.java

@@ -106,7 +106,7 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
        for(int i = cigarStartIndex; i < cigarEndIndex; i++){
            final CigarElement cigarElement = elems.get(i);
            final CigarOperator operator = cigarElement.getOperator();
-            if (operator.consumesReferenceBases() || operator == CigarOperator.S) {
+            if (operator.consumesReferenceBases() || operator.isClipping()) {
                result += cigarElement.getLength();


I see that given the original code, this is the wright translation so no changes needed... but the documentation is rather unclear about what this method does. Perhaps a better name for it would be countReferenceBasesOnUnclippedAlignment.

My point here is that now we have a some methods that count soft and hard-clips in this kind of operations whereas some other methods only du soft... and I wonder whether there is a need for consolidate these (e.g. perhaps some changes in so calling code would make such a difference unnecessary).

In any case feel free to leave it for a future PR.

vruano · 2018-09-14T19:46:52Z

@ldgauthier I asked for further changes in my latest review (can you see it). By accident I approved rather than ask for changes so please take a look before merging. It should not take long to address them. Then you can merge.

Consider bases on either side of indels informative if local assembly has been performed

ldgauthier requested a review from vruano September 10, 2018 13:57

droazen assigned vruano Sep 10, 2018

vruano suggested changes Sep 11, 2018

View reviewed changes

vruano assigned ldgauthier and unassigned vruano Sep 11, 2018

ldgauthier assigned vruano and unassigned ldgauthier Sep 14, 2018

vruano suggested changes Sep 14, 2018

View reviewed changes

vruano approved these changes Sep 14, 2018

View reviewed changes

vruano assigned ldgauthier and unassigned vruano Sep 14, 2018

Fixed ref conf calculation upstream of indel

3a88104

Consider bases on either side of indels informative if local assembly has been performed

ldgauthier force-pushed the ldg_indelRefConfPatch branch from 736d552 to 3a88104 Compare September 14, 2018 20:26

ldgauthier merged commit 21f16d5 into master Sep 17, 2018

ldgauthier deleted the ldg_indelRefConfPatch branch September 17, 2018 13:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed ref conf calculation near indels #5172

Fixed ref conf calculation near indels #5172

ldgauthier commented Sep 10, 2018

vruano left a comment

vruano Sep 11, 2018

vruano Sep 11, 2018

vruano Sep 11, 2018 •

edited

Loading

vruano Sep 11, 2018

vruano Sep 11, 2018 •

edited

Loading

vruano Sep 11, 2018

vruano Sep 11, 2018

vruano Sep 11, 2018

vruano Sep 11, 2018

vruano Sep 11, 2018 •

edited

Loading

vruano commented Sep 11, 2018

vruano commented Sep 12, 2018

ldgauthier commented Sep 14, 2018

vruano Sep 14, 2018

vruano Sep 14, 2018

vruano Sep 14, 2018

vruano Sep 14, 2018

vruano Sep 14, 2018

vruano commented Sep 14, 2018

Fixed ref conf calculation near indels #5172

Fixed ref conf calculation near indels #5172

Conversation

ldgauthier commented Sep 10, 2018

vruano left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vruano Sep 11, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vruano Sep 11, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vruano Sep 11, 2018 • edited Loading

Choose a reason for hiding this comment

vruano commented Sep 11, 2018

vruano commented Sep 12, 2018

ldgauthier commented Sep 14, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vruano commented Sep 14, 2018

vruano Sep 11, 2018 •

edited

Loading

vruano Sep 11, 2018 •

edited

Loading

vruano Sep 11, 2018 •

edited

Loading