Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port LeftAlignAndTrimVariants from GATK3 #5144

Merged
merged 17 commits into from
Sep 14, 2018

Conversation

kachulis
Copy link
Contributor

Ports LeftAlignAndTrimVariants from GATK3 (requested in #3487). There is one behavior change compared to GATK3: this tool will not left align a variant to before the end of a previous variant, since doing so could end up representing a non-equivalent sequence.

@droazen
Copy link
Collaborator

droazen commented Aug 31, 2018

Assigning to @jamesemery and @cmnbroad for review

@droazen droazen requested a review from sooheelee August 31, 2018 14:35
@droazen
Copy link
Collaborator

droazen commented Aug 31, 2018

Also tagging @sooheelee to check if the improvements she suggested in the original ticket (#3487) have been implemented.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some relatively minor changes for gatk style purposes and a few problems I would like you to address/explain.

/**
* If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation.
*/
@Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arguments should be changed to match the new gatk style arguments (kebab casing) so this would be dontTrimAlleles -> `dont-trim-alleles'

* If this argument is set, split multiallelic records and left-align individual alleles.
* If this argument is not set, multiallelic records are not attempted to left-align and will be copied as is.
*/
@Argument(fullName = "splitMultiallelics", shortName = "split", doc = "Split multiallelic records and left-align individual alleles", optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kebab case

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this probably shouldn't have a short name

* subset. If this flag is enabled, the original values of those annotations will be stored in new annotations called
* AC_Orig, AF_Orig, and AN_Orig.
*/
@Argument(fullName = "keepOriginalAC", shortName = "keepOriginalAC", doc = "Store the original AC, AF, and AN values after subsetting", optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kebab case

/**
* Maximum indel size to realign. Indels larger than this will be left unadjusted.
*/
@Argument(fullName = "maxIndelSize", shortName = "maxIndelSize", doc = "Set maximum indel size to realign", optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kebab

/**
* If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation.
*/
@Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, it is generally good practice to extract out these argument names to constants either held in one of the gatk-wide classes (like gatkStandardArgumentDefinitions) or if they aren't likely relevant anywhere else, just put them into a static field on this class. As part of this you should ideally also replace the arguments with references to the static field in the tests as well.

* @param assignGenotypes assignment strategy for the (subsetted) PLs
* @return a new non-null GenotypesContext with subsetted alleles
*/
public static GenotypesContext subsetAlleles(final VariantContext vc,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method should be replaced by the existing codepath in AlleleSubsettingUtils

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.

int leftmostAllowedAlignment = furthestEndOfEarlierVariant - ref.getWindow().getStart() + 2;
// left align the CIGAR
Cigar newCigar = AlignmentUtils.leftAlignIndel(originalCigar, refSeq, originalIndel, 0, 0, leftmostAllowedAlignment, true);
if (newCigar.equals(originalCigar)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is this saying that the smallest repetitive unit we can left align is a unit that is less than 50 bases long? Because this returns if the newCigar hasn't moved at all, but looking at leftAlignIndel the code appears to check deletion strings for equality going back until it falls off the read it is provided, returning the original cigar if it is never true. Is that behavior expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. In the case of an indel with an str greater than 50 bases, LeftAlignIndel will push the indel to the front of the provided read, which then leads to the number of leading bases being doubled. The reason for this behavior is the same as the explanation to your next comment, so see there for detail.

return newVC;
}
// if we have left aligned to the beginning of the reference window, double leadingBases length and try again
leadingBases *= 2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me (first based on the actual tests that hit this line and the above code) that this line will only get hit in the case where the reference is a repetition of a single base, as that seems to be the only likely case where a deletion or insertion will "fall off" a read, and it seems to pick this up. But I see no tests of this auto-expansion behavior working for repetitive units like short tandem repeats, indeed it would seem that the number of units in a repetitive group would need to be divisible by 50 to ever reach this point in the code, otherwise the above left aligning code will just stop and place the indel somewhere short of the last base at the front of the read. Its a shame, because the underlying left alignment code is not designed for or even remotely efficient at 1000+ base searches so it must be avoided somehow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least I would like to see a test that demonstrates you can left align a 3 base str that extends over 50 bases on the reference properly with this approach, as it currently seems very few test cases actually hit this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I initially thought you were totally right about this. But it turns out this auto-expansion actually does work, although it took me a little while to figure out why when the tests I wrote to demonstrate the problem unexpectedly passed. Essentially, when an indel is not left aligned it can always be moved one base (and possibly more) to the left. When we are testing whether an indel can be moved one base to the left in AlignmentUtils.leftAlignSingleIndel, we are testing whether the sequence can be represented by an indel of the same length placed one base to the left, not whether the sequence can be represented by the same inserted/deleted bases placed one base to the left. So even when the str is not a factor of 50, the indel will still be initially aligned to the beginning of the given sequence, leading to the window being expanded. I do agree that there are no tests showing that this works correctly though, so I will add some.


@DataProvider(name = "LeftAlignDataProvider")
public Object[][] LeftAlignTestData() throws UnsupportedEncodingException {
List<Object[]> tests = new ArrayList<Object[]>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data provider needs to be better formatted with more comments to explain exactly what it is generating as test data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably create a second data provider for this test that test larger repeat units rather than repeating the same alt-base over and over again.

import java.util.*;


public class LeftAlignAndTrimVariantsIntegrationTest extends CommandLineProgramTest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might consider trying to find/construct some larger test data than these alleles. The depth of the tests in GATK3 for this tool appears to have been pretty light and we could use perhaps a much larger snippet of VCF or some more pathological sites to demonstrate that the code is behaving properly. I'm not sure who to ask for that though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I agree. I will hunt for some pathologically long repeats/strs in the reference and add some tests using them.

Copy link
Contributor

@sooheelee sooheelee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kachulis. Thanks for this port. I sussed out the changes and wrote out my observations in #3487.

As it stands, the functionality passes muster. Your port is an improvement to the GATK3 version in that the tool no longer silently drops variants with max indel size greater than that set by --maxIndelSize, which defaults to 200. Rather your branch keeps all records. This is great. 👏🏼

If you are feeling generous with your time, I would ask for two small additional things towards improving messaging.

  • [1] Stdout still says 0 variants left aligned. A count of records that were left-aligned-and-trimmed would be helpful. addition: Alternatively, this misleading message should be dropped.
  • [2] Stdout still only lists the large indels and does not give a summary largest indel size, which would be helpful in setting --maxIndelSize to ensure all records are left-aligned and trimmed. This message could be next to that of [1], at the very end of the stdout. addition: It may be useful to structure the message something along the lines of 'X number of records had max-indel-size above the set threshold of Y bases. These remain in their original state.'

These small changes are not required to merge from me. I thought I'd ask you given your current familiarity with the code. Thank you for this port and improvement to the tool. 😀

@sooheelee
Copy link
Contributor

Again, my testing is outlined in #3487.

@kachulis
Copy link
Contributor Author

kachulis commented Sep 6, 2018

thanks @sooheelee. The improved messaging you requested isn't difficult to implement, so I have added it, if you'd like to try it out.

@sooheelee
Copy link
Contributor

Thank you for the much-improved stdout messaging @kachulis. I tested it out and noted the changes again in the issue ticket. 👍🏽

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments

* -R reference.fasta \
* -V input.vcf \
* -O output.vcf \
* --splitMultiallelics \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment argument names here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done



/**
* @param vc
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comments for this method to be more instructive?

* @param assignGenotypes assignment strategy for the (subsetted) PLs
* @return a new non-null GenotypesContext with subsetted alleles
*/
public static GenotypesContext subsetAlleles(final VariantContext vc,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.

final int readLength = referenceLength - (indelOp == 'D' ? indelSize : -indelSize);

// create the original CIGAR string
final GATKRead read = ArtificialReadUtils.createArtificialRead(header, "myRead", 0, 1, readLength);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good thorough test that helps ensure that this method is ding what it purports to. I do still think there is a case that i'm not strictly convinced this method is handling correctly (or at least expectedly) according to your usage of it in the tool. All of these test cases involve a read spanning the entirety of the contig, which means when we left align the indel we end up with the indel sitting right in the middle of the read (though surely correctly for various types of repetitions). The issue is that I don't see any test for the behavior of this method if we expect the indel to fall off of the edge of the read. I would like to see some assertion that if we start the read in the middle of say a 3base str that it will necessarily fall off the left end of the read if we run it, regardless of where the read/indel starts modulo 3 relative to the start of the indel. If we are sure according to a test that the indel will fall off of the read then we can continue to use your auto-expansion code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a test of this to AlignmentUtilsUnitTest.

}

@Test(dataProvider = "LeftAlignIndelWithLimitDataProvider")
public void testLeftAlignIndelWithLimit(final Cigar originalCigar, final Cigar expectedCigar, final byte[] reference, final byte[] read, final int repeatLength, final int leftmostAllowedAlignment) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a short comment here summarizing what the dataset you generated with the data provider looks like when we run this test.

@kachulis
Copy link
Contributor Author

@jamesemery I consolidated the additions to GATKVariantContextUtils to use the methods in AlleleSubsettingUtils as much as possible, and also added more tests of pathological case (long alleles in long repeats, etc.) so now back to you.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits to pick but once you resolve the failing tests/conflicts I would say this can probably be merged

final List<Allele> allelesToUse) {
final List<Allele> allelesToUse,
final List<Allele> originalGT) {
if(originalGT == null && assignmentMethod == GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL) throw new IllegalArgumentException("origianlGT cannot be null if assignmentMethod is BEST_MATCH_TO_ORIGINAL");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we don't use one line if statements for clarity in the gakt. Just add brackets to encase this and indent

@@ -1290,7 +1300,7 @@ public static VariantContext makeFromAlleles(final String name, final String con
genotypeAssignmentMethodUsed != GenotypeAssignmentMethod.SET_TO_NO_CALL)
addInfoFieldAnnotations(vc, builder, keepOriginalChrCounts);

builder.genotypes(subsetAlleles(vc, alleles, genotypeAssignmentMethodUsed));
builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, it looks like this did away with a bunch of duplicated code

}

@DataProvider(name = "LeftAlignIndelStartOfRead")
public Object[][] makeLeftAlignIndelStartOfReadDataProvider() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, i'm convinced by this test. It looks like it is indeed sliding deletions off the edge of the read and is sliding insertions onto the edge properly.

oneLineSummary = "Left align and trim vairants",
programGroup = VariantManipulationProgramGroup.class
)
@DocumentedFeature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the ultimate use case, you might consider marking the tool as beta for now as it is a new port of this tool that might be prone to further changes/feature requests.

@jamesemery
Copy link
Collaborator

@kachulis looks like you need to rebase onto master one more time

@kachulis kachulis force-pushed the ck_3487_port_LeftAlignAndTrimVariants branch from 2b4d940 to 8a29c83 Compare September 14, 2018 18:18
@jamesemery jamesemery merged commit 60f383d into master Sep 14, 2018
@jamesemery
Copy link
Collaborator

Thank you, @kachulis

@kachulis
Copy link
Contributor Author

Thanks @jamesemery for the review

@kachulis kachulis deleted the ck_3487_port_LeftAlignAndTrimVariants branch September 14, 2018 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants