-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed ref conf calculation near indels #5172
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some code simplification ideas.
@@ -547,7 +558,7 @@ int sumMismatchingQualities(final byte[] readBases, | |||
for ( int i = 0; i < n; i++ ) { | |||
final byte readBase = readBases[readStart + i]; | |||
final byte refBase = refBases[refStart + i]; | |||
if ( readBase != refBase ) { | |||
if ( readBase != refBase && readBase != 'N') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use !Nucleotide.intersect(readBase, refBase)
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not totally equivalent but I guess it should work as intended (e.g. the Nucleotide. alternative would handle other ambiguity codes such as P R, W etc.)
* @return the new reference-aligned index/offset | ||
*/ | ||
@VisibleForTesting | ||
protected int getCigarModifiedOffset (final PileupElement p){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this method intent boils down to calculate the number of "consumed reference bases" (+ softclips?) by the read alignment up to the location of the pileup in the read... CigarOperator
have a consumesReferenceBases
method that one can use to figure out this number without the need to make explicit reference to the operation types such INSERTION
or DELETION
or any other.
In other words I think there is a more straight forward solution that would count the consumed reference bases rather than "adjust" the consumed read bases to get to the same number....
perhaps would look something like this:
int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getIndexInCurrentElement() : 0;
for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) {
if (elem.getOperator().consumesReferenceBases() || elem.getOperator() == CigarOperator.S) {
offset += elem.getLength();
}
}
@@ -196,6 +196,56 @@ public static GATKRead createReadAlignedToRef(final GATKRead originalRead, | |||
return Arrays.copyOfRange(bases, basesStart, basesStop + 1); | |||
} | |||
|
|||
public static byte[] getBasesAlignedOneToOne(final GATKRead read) { | |||
return getSequenceAlignedOneToOne(read, true, (byte)'N'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use Nucleotide.N.encodeAsByte()
.
@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar | |||
return result; | |||
} | |||
|
|||
public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are in CigarUtils
the "BasedOnCigar" part of the name is kind of redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code what I find peculiar is that "S" soft-clips are actually counted rather than "H" hard-clips are ignored. Both operations don't consume reference bases so I think that ignoring H is actually a natural assumption whereas counting S is not.
Should this be called countReferenceAndSoftClippedBases
instead?
@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar | |||
return result; | |||
} | |||
|
|||
public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){ | |||
if (read == null){ | |||
throw new IllegalArgumentException("null read"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can be more succinct using Utils.nonNull(read, "null read")
from 3 to only 1 line (will automatically improve test coverage as I presume you are not checking on the null read behavior (most people wouldn't bother).
int result = 0; | ||
for(int i = cigarStartIndex; i < cigarEndIndex; i++){ | ||
final CigarElement cigarElement = elems.get(i); | ||
switch (cigarElement.getOperator()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch could be rewritten as:
final CigarOperator operator = cigarElement.getOperator();
if (operator.consumesReferenceBases() || operator == CigarOperator.S) {
result += cigarElement.getLength();
}
return getSequenceAlignedOneToOne(read, false, (byte)0); | ||
} | ||
|
||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a boolean param to indicate whether this is an operation on quality scores or read bases smells a bit. What about provide a lambda to obtain the original-not-copied byte array given the read.
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, Function<GATKRead, byte[]> bytesProvider, int padWith) {
...
}
int paddedPos = 0; | ||
for ( int i = 0; i < cigar.numCigarElements(); i++ ) { | ||
final CigarElement ce = cigar.getCigarElement(i); | ||
switch ( ce.getOperator() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This switch can be rewriten with a couple of ifs and using the consumesReadBases
and consumesReferenceBases
without the need to list operators explicitly.
Back to @ldgauthier. |
Looking back into this PR... at some point you are using 'N' to pad what seem to be gaps on the read sequence. Although the end result would be the same perhaps is better to be more explicit and just use '-' instead. In that case my suggestion of using When you compare the cost of each different alignment the gap-open and gap-ext are ignored (you only look at base call mismatches). I wonder whether it would be more correct to actually take them in consideration... so imagine that there is no gaps in the original alignment what-soever and that adding a 1bp gap decreases the number of mis-matches by just 1 which is typically Q30 increase in the Lk but the gap itself default penalty is Q45 so can one say that that read wouldn't still support the reference over a 1-bp gap alternative? Example with a 2-bp gap making the trick:
That could happen in STRs with impurities... but if the original alignment did not added itself the gap to reduce the number of mismatches is because precisely due to the added cost of the gap-open and necessary extends that we would be ignoring here. This is all hypothetical until some one quantifies how often this might occur ... so I'm happy to keep ignoring the gaps for now until we get a report on a real-live dataset that would benefit of such a change or some enthusiastic dsde-methods member investigates this. |
@vruano I addressed your comments. The code is a lot cleaner. Let me know if there's anything else. The Travis failures look unrelated so I'm rerunning them. |
@@ -555,7 +555,8 @@ int sumMismatchingQualities(final byte[] readBases, | |||
for ( int i = 0; i < n; i++ ) { | |||
final byte readBase = readBases[readStart + i]; | |||
final byte refBase = refBases[refStart + i]; | |||
if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid()) { //allow N read bases (and other ambiguous bases) to "match" reference | |||
if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid() && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, I was assume that master had a change that I have on Nucleotides adding an new intersect(byte, byte) method:
/**
* Checks whether two nucleotides intersect given their byte encodings.
* @param a first nucleotide.
* @param b second nucleotide.
* @return {@code true} iff the input nucleotides intersect.
*/
public static boolean intersect(final byte a, final byte b) {
return (baseToValue[0xFF & a].mask & baseToValue[0xFF & b].mask) != 0;
}
Just add it in your branch's Nucleotides.java and the the if code would be:
if (!Nucleotide.intersect(readBase, refBase) && readBase != AlignmentUtils.PLACEHOLDER) {
....
}
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/utils/read/AlignmentUtils.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) { | ||
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) { | ||
final Cigar cigar = read.getCigar(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do null checking on read and baseProvider.
if (cigar.getCigarElement(i).getOperator().equals(CigarOperator.INSERTION)) { | ||
offset -= cigar.getCigarElement(i).getLength(); | ||
int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getOffsetInCurrentCigar() : 0; | ||
for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't pay close attention that my suggested code was totally correct... the int offset =
condition ... do you need to add || elem.getOperator() == CigarOperator.S)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make elem
final
.
@@ -106,7 +106,7 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar | |||
for(int i = cigarStartIndex; i < cigarEndIndex; i++){ | |||
final CigarElement cigarElement = elems.get(i); | |||
final CigarOperator operator = cigarElement.getOperator(); | |||
if (operator.consumesReferenceBases() || operator == CigarOperator.S) { | |||
if (operator.consumesReferenceBases() || operator.isClipping()) { | |||
result += cigarElement.getLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that given the original code, this is the wright translation so no changes needed... but the documentation is rather unclear about what this method does. Perhaps a better name for it would be countReferenceBasesOnUnclippedAlignment
.
My point here is that now we have a some methods that count soft and hard-clips in this kind of operations whereas some other methods only du soft... and I wonder whether there is a need for consolidate these (e.g. perhaps some changes in so calling code would make such a difference unnecessary).
In any case feel free to leave it for a future PR.
@ldgauthier I asked for further changes in my latest review (can you see it). By accident I approved rather than ask for changes so please take a look before merging. It should not take long to address them. Then you can merge. |
Consider bases on either side of indels informative if local assembly has been performed
736d552
to
3a88104
Compare
Previous behavior generated some PL=0,0,0 no-calls because CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel ref conf model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.