Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed ref conf calculation near indels #5172

Merged
merged 1 commit into from
Sep 17, 2018
Merged

Conversation

ldgauthier
Copy link
Contributor

Previous behavior generated some PL=0,0,0 no-calls because CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel ref conf model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.

Copy link
Contributor

@vruano vruano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some code simplification ideas.

@@ -547,7 +558,7 @@ int sumMismatchingQualities(final byte[] readBases,
for ( int i = 0; i < n; i++ ) {
final byte readBase = readBases[readStart + i];
final byte refBase = refBases[refStart + i];
if ( readBase != refBase ) {
if ( readBase != refBase && readBase != 'N') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use !Nucleotide.intersect(readBase, refBase) instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not totally equivalent but I guess it should work as intended (e.g. the Nucleotide. alternative would handle other ambiguity codes such as P R, W etc.)

* @return the new reference-aligned index/offset
*/
@VisibleForTesting
protected int getCigarModifiedOffset (final PileupElement p){
Copy link
Contributor

@vruano vruano Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this method intent boils down to calculate the number of "consumed reference bases" (+ softclips?) by the read alignment up to the location of the pileup in the read... CigarOperator have a consumesReferenceBases method that one can use to figure out this number without the need to make explicit reference to the operation types such INSERTION or DELETION or any other.

In other words I think there is a more straight forward solution that would count the consumed reference bases rather than "adjust" the consumed read bases to get to the same number....

perhaps would look something like this:

int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getIndexInCurrentElement() : 0;
for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) {
    if (elem.getOperator().consumesReferenceBases() || elem.getOperator() == CigarOperator.S) {
        offset += elem.getLength();
    }
} 

@@ -196,6 +196,56 @@ public static GATKRead createReadAlignedToRef(final GATKRead originalRead,
return Arrays.copyOfRange(bases, basesStart, basesStop + 1);
}

public static byte[] getBasesAlignedOneToOne(final GATKRead read) {
return getSequenceAlignedOneToOne(read, true, (byte)'N');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use Nucleotide.N.encodeAsByte().

@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
return result;
}

public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){
Copy link
Contributor

@vruano vruano Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are in CigarUtils the "BasedOnCigar" part of the name is kind of redundant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code what I find peculiar is that "S" soft-clips are actually counted rather than "H" hard-clips are ignored. Both operations don't consume reference bases so I think that ignoring H is actually a natural assumption whereas counting S is not.

Should this be called countReferenceAndSoftClippedBases instead?

@@ -125,6 +125,37 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
return result;
}

public static int countRefBasesBasedOnCigarIgnoringHardClips(final GATKRead read, final int cigarStartIndex, final int cigarEndIndex){
if (read == null){
throw new IllegalArgumentException("null read");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can be more succinct using Utils.nonNull(read, "null read") from 3 to only 1 line (will automatically improve test coverage as I presume you are not checking on the null read behavior (most people wouldn't bother).

int result = 0;
for(int i = cigarStartIndex; i < cigarEndIndex; i++){
final CigarElement cigarElement = elems.get(i);
switch (cigarElement.getOperator()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch could be rewritten as:

final CigarOperator operator = cigarElement.getOperator();
if (operator.consumesReferenceBases() || operator == CigarOperator.S) {
   result += cigarElement.getLength();
}

return getSequenceAlignedOneToOne(read, false, (byte)0);
}

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a boolean param to indicate whether this is an operation on quality scores or read bases smells a bit. What about provide a lambda to obtain the original-not-copied byte array given the read.

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, Function<GATKRead, byte[]> bytesProvider, int padWith) {
   ...
} 

int paddedPos = 0;
for ( int i = 0; i < cigar.numCigarElements(); i++ ) {
final CigarElement ce = cigar.getCigarElement(i);
switch ( ce.getOperator() ) {
Copy link
Contributor

@vruano vruano Sep 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch can be rewriten with a couple of ifs and using the consumesReadBases and consumesReferenceBases without the need to list operators explicitly.

@vruano
Copy link
Contributor

vruano commented Sep 11, 2018

Back to @ldgauthier.

@vruano vruano assigned ldgauthier and unassigned vruano Sep 11, 2018
@vruano
Copy link
Contributor

vruano commented Sep 12, 2018

Looking back into this PR... at some point you are using 'N' to pad what seem to be gaps on the read sequence. Although the end result would be the same perhaps is better to be more explicit and just use '-' instead. In that case my suggestion of using Nucleotide.intersect wouldn't cover for the '-' character so you need a explicity "&&" or "II".

When you compare the cost of each different alignment the gap-open and gap-ext are ignored (you only look at base call mismatches). I wonder whether it would be more correct to actually take them in consideration... so imagine that there is no gaps in the original alignment what-soever and that adding a 1bp gap decreases the number of mis-matches by just 1 which is typically Q30 increase in the Lk but the gap itself default penalty is Q45 so can one say that that read wouldn't still support the reference over a 1-bp gap alternative?

Example with a 2-bp gap making the trick:

Ref:  ....GCATGTGATATATATATATATATATATATACACACAC....
Read: ....GCATGTG--ATATATATATATATATATATAC <end-of-the-read>

That could happen in STRs with impurities... but if the original alignment did not added itself the gap to reduce the number of mismatches is because precisely due to the added cost of the gap-open and necessary extends that we would be ignoring here.

This is all hypothetical until some one quantifies how often this might occur ... so I'm happy to keep ignoring the gaps for now until we get a report on a real-live dataset that would benefit of such a change or some enthusiastic dsde-methods member investigates this.

@ldgauthier
Copy link
Contributor Author

@vruano I addressed your comments. The code is a lot cleaner. Let me know if there's anything else. The Travis failures look unrelated so I'm rerunning them.

@ldgauthier ldgauthier assigned vruano and unassigned ldgauthier Sep 14, 2018
@@ -555,7 +555,8 @@ int sumMismatchingQualities(final byte[] readBases,
for ( int i = 0; i < n; i++ ) {
final byte readBase = readBases[readStart + i];
final byte refBase = refBases[refStart + i];
if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid()) { //allow N read bases (and other ambiguous bases) to "match" reference
if ( !Nucleotide.decode(readBase).intersect(Nucleotide.decode(refBase)).isValid() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I was assume that master had a change that I have on Nucleotides adding an new intersect(byte, byte) method:

/**
 * Checks whether two nucleotides intersect given their byte encodings.
 * @param a first nucleotide.
 * @param b second nucleotide.
 * @return {@code true} iff the input nucleotides intersect.
 */
public static boolean intersect(final byte a, final byte b) {
  return (baseToValue[0xFF & a].mask & baseToValue[0xFF & b].mask) != 0;
}

Just add it in your branch's Nucleotides.java and the the if code would be:

if (!Nucleotide.intersect(readBase, refBase) && readBase != AlignmentUtils.PLACEHOLDER) {
   ....
}

}

public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final boolean forBases, final byte padWith) {
public static byte[] getSequenceAlignedOneToOne(final GATKRead read, final Function<GATKRead, byte[]> bytesProvider, final byte padWith) {
final Cigar cigar = read.getCigar();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do null checking on read and baseProvider.

if (cigar.getCigarElement(i).getOperator().equals(CigarOperator.INSERTION)) {
offset -= cigar.getCigarElement(i).getLength();
int offset = p.getCurrentCigarElement().getOperator().consumesReferenceBases() ? p.getOffsetInCurrentCigar() : 0;
for (CigarElement elem : read.getCigar().getCigarElements().subList(0, p.getCurrentCigarOffset())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't pay close attention that my suggested code was totally correct... the int offset = condition ... do you need to add || elem.getOperator() == CigarOperator.S)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make elem final.

@@ -106,7 +106,7 @@ public static int countRefBasesBasedOnCigar(final GATKRead read, final int cigar
for(int i = cigarStartIndex; i < cigarEndIndex; i++){
final CigarElement cigarElement = elems.get(i);
final CigarOperator operator = cigarElement.getOperator();
if (operator.consumesReferenceBases() || operator == CigarOperator.S) {
if (operator.consumesReferenceBases() || operator.isClipping()) {
result += cigarElement.getLength();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that given the original code, this is the wright translation so no changes needed... but the documentation is rather unclear about what this method does. Perhaps a better name for it would be countReferenceBasesOnUnclippedAlignment.

My point here is that now we have a some methods that count soft and hard-clips in this kind of operations whereas some other methods only du soft... and I wonder whether there is a need for consolidate these (e.g. perhaps some changes in so calling code would make such a difference unnecessary).

In any case feel free to leave it for a future PR.

@vruano
Copy link
Contributor

vruano commented Sep 14, 2018

@ldgauthier I asked for further changes in my latest review (can you see it). By accident I approved rather than ask for changes so please take a look before merging. It should not take long to address them. Then you can merge.

@vruano vruano assigned ldgauthier and unassigned vruano Sep 14, 2018
Consider bases on either side of indels informative if local assembly has been performed
@ldgauthier ldgauthier merged commit 21f16d5 into master Sep 17, 2018
@ldgauthier ldgauthier deleted the ldg_indelRefConfPatch branch September 17, 2018 13:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants