Port LeftAlignAndTrimVariants from GATK3 #5144

kachulis · 2018-08-29T14:32:46Z

Ports LeftAlignAndTrimVariants from GATK3 (requested in #3487). There is one behavior change compared to GATK3: this tool will not left align a variant to before the end of a previous variant, since doing so could end up representing a non-equivalent sequence.

droazen · 2018-08-31T14:31:33Z

Assigning to @jamesemery and @cmnbroad for review

droazen · 2018-08-31T14:36:55Z

Also tagging @sooheelee to check if the improvements she suggested in the original ticket (#3487) have been implemented.

jamesemery

Some relatively minor changes for gatk style purposes and a few problems I would like you to address/explain.

jamesemery · 2018-09-05T15:12:15Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+    /**
+     * If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation.
+     */
+    @Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true)


These arguments should be changed to match the new gatk style arguments (kebab casing) so this would be dontTrimAlleles -> `dont-trim-alleles'

jamesemery · 2018-09-05T15:12:31Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+     * If this argument is set, split multiallelic records and left-align individual alleles.
+     * If this argument is not set, multiallelic records are not attempted to left-align and will be copied as is.
+     */
+    @Argument(fullName = "splitMultiallelics", shortName = "split", doc = "Split multiallelic records and left-align individual alleles", optional = true)


Also this probably shouldn't have a short name

jamesemery · 2018-09-05T15:12:37Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+     * subset. If this flag is enabled, the original values of those annotations will be stored in new annotations called
+     * AC_Orig, AF_Orig, and AN_Orig.
+     */
+    @Argument(fullName = "keepOriginalAC", shortName = "keepOriginalAC", doc = "Store the original AC, AF, and AN values after subsetting", optional = true)


jamesemery · 2018-09-05T15:12:45Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+    /**
+     * Maximum indel size to realign.  Indels larger than this will be left unadjusted.
+     */
+    @Argument(fullName = "maxIndelSize", shortName = "maxIndelSize", doc = "Set maximum indel size to realign", optional = true)


jamesemery · 2018-09-05T15:15:05Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+    /**
+     * If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation.
+     */
+    @Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true)


Additionally, it is generally good practice to extract out these argument names to constants either held in one of the gatk-wide classes (like gatkStandardArgumentDefinitions) or if they aren't likely relevant anywhere else, just put them into a static field on this class. As part of this you should ideally also replace the arguments with references to the static field in the tests as well.

jamesemery · 2018-09-05T17:04:30Z

src/main/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtils.java

+     * @param assignGenotypes assignment strategy for the (subsetted) PLs
+     * @return a new non-null GenotypesContext with subsetted alleles
+     */
+    public static GenotypesContext subsetAlleles(final VariantContext vc,


This method should be replaced by the existing codepath in AlleleSubsettingUtils

I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.

jamesemery · 2018-09-05T17:22:03Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+            int leftmostAllowedAlignment = furthestEndOfEarlierVariant - ref.getWindow().getStart() + 2;
+            // left align the CIGAR
+            Cigar newCigar = AlignmentUtils.leftAlignIndel(originalCigar, refSeq, originalIndel, 0, 0, leftmostAllowedAlignment, true);
+            if (newCigar.equals(originalCigar)) {


So is this saying that the smallest repetitive unit we can left align is a unit that is less than 50 bases long? Because this returns if the newCigar hasn't moved at all, but looking at leftAlignIndel the code appears to check deletion strings for equality going back until it falls off the read it is provided, returning the original cigar if it is never true. Is that behavior expected?

Not quite. In the case of an indel with an str greater than 50 bases, LeftAlignIndel will push the indel to the front of the provided read, which then leads to the number of leading bases being doubled. The reason for this behavior is the same as the explanation to your next comment, so see there for detail.

jamesemery · 2018-09-05T17:33:04Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+                return newVC;
+            }
+            // if we have left aligned to the beginning of the reference window, double leadingBases length and try again
+            leadingBases *= 2;


It seems to me (first based on the actual tests that hit this line and the above code) that this line will only get hit in the case where the reference is a repetition of a single base, as that seems to be the only likely case where a deletion or insertion will "fall off" a read, and it seems to pick this up. But I see no tests of this auto-expansion behavior working for repetitive units like short tandem repeats, indeed it would seem that the number of units in a repetitive group would need to be divisible by 50 to ever reach this point in the code, otherwise the above left aligning code will just stop and place the indel somewhere short of the last base at the front of the read. Its a shame, because the underlying left alignment code is not designed for or even remotely efficient at 1000+ base searches so it must be avoided somehow.

At the very least I would like to see a test that demonstrates you can left align a 3 base str that extends over 50 bases on the reference properly with this approach, as it currently seems very few test cases actually hit this line.

So I initially thought you were totally right about this. But it turns out this auto-expansion actually does work, although it took me a little while to figure out why when the tests I wrote to demonstrate the problem unexpectedly passed. Essentially, when an indel is not left aligned it can always be moved one base (and possibly more) to the left. When we are testing whether an indel can be moved one base to the left in AlignmentUtils.leftAlignSingleIndel, we are testing whether the sequence can be represented by an indel of the same length placed one base to the left, not whether the sequence can be represented by the same inserted/deleted bases placed one base to the left. So even when the str is not a factor of 50, the indel will still be initially aligned to the beginning of the given sequence, leading to the window being expanded. I do agree that there are no tests showing that this works correctly though, so I will add some.

jamesemery · 2018-09-05T17:38:38Z

...g/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariantsUnitTest.java

+
+    @DataProvider(name = "LeftAlignDataProvider")
+    public Object[][] LeftAlignTestData() throws UnsupportedEncodingException {
+        List<Object[]> tests = new ArrayList<Object[]>();


This data provider needs to be better formatted with more comments to explain exactly what it is generating as test data.

You should probably create a second data provider for this test that test larger repeat units rather than repeating the same alt-base over and over again.

jamesemery · 2018-09-05T17:44:29Z

...institute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariantsIntegrationTest.java

+import java.util.*;
+
+
+public class LeftAlignAndTrimVariantsIntegrationTest extends CommandLineProgramTest {


I might consider trying to find/construct some larger test data than these alleles. The depth of the tests in GATK3 for this tool appears to have been pretty light and we could use perhaps a much larger snippet of VCF or some more pathological sites to demonstrate that the code is behaving properly. I'm not sure who to ask for that though.

Ya I agree. I will hunt for some pathologically long repeats/strs in the reference and add some tests using them.

sooheelee

Hi @kachulis. Thanks for this port. I sussed out the changes and wrote out my observations in #3487.

As it stands, the functionality passes muster. Your port is an improvement to the GATK3 version in that the tool no longer silently drops variants with max indel size greater than that set by --maxIndelSize, which defaults to 200. Rather your branch keeps all records. This is great. 👏🏼

If you are feeling generous with your time, I would ask for two small additional things towards improving messaging.

[1] Stdout still says 0 variants left aligned. A count of records that were left-aligned-and-trimmed would be helpful. addition: Alternatively, this misleading message should be dropped.
[2] Stdout still only lists the large indels and does not give a summary largest indel size, which would be helpful in setting --maxIndelSize to ensure all records are left-aligned and trimmed. This message could be next to that of [1], at the very end of the stdout. addition: It may be useful to structure the message something along the lines of 'X number of records had max-indel-size above the set threshold of Y bases. These remain in their original state.'

These small changes are not required to merge from me. I thought I'd ask you given your current familiarity with the code. Thank you for this port and improvement to the tool. 😀

sooheelee · 2018-09-05T22:07:32Z

Again, my testing is outlined in #3487.

kachulis · 2018-09-06T16:22:39Z

thanks @sooheelee. The improved messaging you requested isn't difficult to implement, so I have added it, if you'd like to try it out.

sooheelee · 2018-09-06T19:55:03Z

Thank you for the much-improved stdout messaging @kachulis. I tested it out and noted the changes again in the issue ticket. 👍🏽

jamesemery

Some more comments

jamesemery · 2018-09-07T16:23:48Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+ *   -R reference.fasta \
+ *   -V input.vcf \
+ *   -O output.vcf \
+ *   --splitMultiallelics \


Please update the comment argument names here

jamesemery · 2018-09-07T18:39:06Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+
+
+    /**
+     * @param vc


Can you update the comments for this method to be more instructive?

jamesemery · 2018-09-07T18:45:12Z

src/main/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtils.java

+     * @param assignGenotypes assignment strategy for the (subsetted) PLs
+     * @return a new non-null GenotypesContext with subsetted alleles
+     */
+    public static GenotypesContext subsetAlleles(final VariantContext vc,


I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.

jamesemery · 2018-09-07T19:54:37Z

src/test/java/org/broadinstitute/hellbender/utils/read/AlignmentUtilsUnitTest.java

+                        final int readLength = referenceLength - (indelOp == 'D' ? indelSize : -indelSize);
+
+                        // create the original CIGAR string
+                        final GATKRead read = ArtificialReadUtils.createArtificialRead(header, "myRead", 0, 1, readLength);


I think this is a good thorough test that helps ensure that this method is ding what it purports to. I do still think there is a case that i'm not strictly convinced this method is handling correctly (or at least expectedly) according to your usage of it in the tool. All of these test cases involve a read spanning the entirety of the contig, which means when we left align the indel we end up with the indel sitting right in the middle of the read (though surely correctly for various types of repetitions). The issue is that I don't see any test for the behavior of this method if we expect the indel to fall off of the edge of the read. I would like to see some assertion that if we start the read in the middle of say a 3base str that it will necessarily fall off the left end of the read if we run it, regardless of where the read/indel starts modulo 3 relative to the start of the indel. If we are sure according to a test that the indel will fall off of the read then we can continue to use your auto-expansion code.

I have added a test of this to AlignmentUtilsUnitTest.

jamesemery · 2018-09-07T19:57:06Z

src/test/java/org/broadinstitute/hellbender/utils/read/AlignmentUtilsUnitTest.java

+    }
+
+    @Test(dataProvider = "LeftAlignIndelWithLimitDataProvider")
+    public void testLeftAlignIndelWithLimit(final Cigar originalCigar, final Cigar expectedCigar, final byte[] reference, final byte[] read, final int repeatLength, final int leftmostAllowedAlignment) {


I would add a short comment here summarizing what the dataset you generated with the data provider looks like when we run this test.

src/test/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtilsUnitTest.java

kachulis · 2018-09-11T19:01:11Z

@jamesemery I consolidated the additions to GATKVariantContextUtils to use the methods in AlleleSubsettingUtils as much as possible, and also added more tests of pathological case (long alleles in long repeats, etc.) so now back to you.

jamesemery

Just a few nits to pick but once you resolve the failing tests/conflicts I would say this can probably be merged

jamesemery · 2018-09-12T14:41:35Z

src/main/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtils.java

-                                        final List<Allele> allelesToUse) {
+                                        final List<Allele> allelesToUse,
+                                        final List<Allele> originalGT) {
+        if(originalGT == null && assignmentMethod == GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL) throw new IllegalArgumentException("origianlGT cannot be null if assignmentMethod is BEST_MATCH_TO_ORIGINAL");


Generally we don't use one line if statements for clarity in the gakt. Just add brackets to encase this and indent

jamesemery · 2018-09-12T14:43:10Z

src/main/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtils.java

@@ -1290,7 +1300,7 @@ public static VariantContext makeFromAlleles(final String name, final String con
                        genotypeAssignmentMethodUsed != GenotypeAssignmentMethod.SET_TO_NO_CALL)
                    addInfoFieldAnnotations(vc, builder, keepOriginalChrCounts);

-                builder.genotypes(subsetAlleles(vc, alleles, genotypeAssignmentMethodUsed));
+                builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0)));


Thank you, it looks like this did away with a bunch of duplicated code

jamesemery · 2018-09-12T14:59:00Z

src/test/java/org/broadinstitute/hellbender/utils/read/AlignmentUtilsUnitTest.java

+    }
+
+    @DataProvider(name = "LeftAlignIndelStartOfRead")
+    public Object[][] makeLeftAlignIndelStartOfReadDataProvider() {


Alright, i'm convinced by this test. It looks like it is indeed sliding deletions off the edge of the read and is sliding insertions onto the edge properly.

jamesemery · 2018-09-12T15:37:41Z

.../java/org/broadinstitute/hellbender/tools/walkers/variantutils/LeftAlignAndTrimVariants.java

+        oneLineSummary = "Left align and trim vairants",
+        programGroup = VariantManipulationProgramGroup.class
+)
+@DocumentedFeature


Depending on the ultimate use case, you might consider marking the tool as beta for now as it is a new port of this tool that might be prone to further changes/feature requests.

jamesemery · 2018-09-14T16:46:52Z

@kachulis looks like you need to rebase onto master one more time

…ust don't try to left align

jamesemery · 2018-09-14T20:10:11Z

Thank you, @kachulis

kachulis · 2018-09-14T20:17:20Z

Thanks @jamesemery for the review

droazen assigned jamesemery Aug 31, 2018

droazen requested review from jamesemery and cmnbroad August 31, 2018 14:31

droazen assigned cmnbroad Aug 31, 2018

droazen requested a review from sooheelee August 31, 2018 14:35

jamesemery requested changes Sep 5, 2018

View reviewed changes

sooheelee approved these changes Sep 5, 2018

View reviewed changes

sooheelee mentioned this pull request Sep 6, 2018

Port LeftAlignAndTrimVariants and improve stdout messaging... #3487

Closed

jamesemery reviewed Sep 7, 2018

View reviewed changes

jamesemery approved these changes Sep 12, 2018

View reviewed changes

kachulis added 13 commits September 14, 2018 13:21

LeftAlignAndTrimVariants added

425c0ce

tests for LeftAlignAndTrimVariants

38d0b33

working on tests

c771063

LeftAlignAndTrimVariants tests

1a3b617

tests for LeftAlignAndTrimVariants

fc93ab4

adding check to never left align to before the end of a previous variant

00adfe4

fix LeftAlignAndTrimVariantsIntegrationTest after rebase

eafe6e9

aesthetic changes

1b3c123

if refLength longer than max_indel_length, still write out variant, j…

9e7a21d

…ust don't try to left align

additional test case

9e5d364

soohee functionality requests

f23f1b9

consolidating subsetAlleles code as requested by James

11ce9a0

changes and additional tests requested by James

00292ff

kachulis added 4 commits September 14, 2018 13:52

fixing example usage

3f0b970

fixing test failure in GATKVariantContextUtilsUnitTest

cb89970

small aesthetic change

a1576b1

fixing merge conflict and removing unintended intelliJ reformatting

8a29c83

kachulis force-pushed the ck_3487_port_LeftAlignAndTrimVariants branch from 2b4d940 to 8a29c83 Compare September 14, 2018 18:18

jamesemery merged commit 60f383d into master Sep 14, 2018

kachulis deleted the ck_3487_port_LeftAlignAndTrimVariants branch September 14, 2018 20:17

		import java.util.*;


		public class LeftAlignAndTrimVariantsIntegrationTest extends CommandLineProgramTest {

Port LeftAlignAndTrimVariants from GATK3 #5144

Port LeftAlignAndTrimVariants from GATK3 #5144

Conversation

kachulis commented Aug 29, 2018

droazen commented Aug 31, 2018

droazen commented Aug 31, 2018

jamesemery left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sooheelee left a comment • edited Loading

Choose a reason for hiding this comment

sooheelee commented Sep 5, 2018

kachulis commented Sep 6, 2018

sooheelee commented Sep 6, 2018

jamesemery left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kachulis commented Sep 11, 2018

jamesemery left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamesemery commented Sep 14, 2018

jamesemery commented Sep 14, 2018

kachulis commented Sep 14, 2018

sooheelee left a comment •

edited

Loading