-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port LeftAlignAndTrimVariants from GATK3 #5144
Conversation
Assigning to @jamesemery and @cmnbroad for review |
Also tagging @sooheelee to check if the improvements she suggested in the original ticket (#3487) have been implemented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some relatively minor changes for gatk style purposes and a few problems I would like you to address/explain.
/** | ||
* If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation. | ||
*/ | ||
@Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These arguments should be changed to match the new gatk style arguments (kebab casing) so this would be dontTrimAlleles
-> `dont-trim-alleles'
* If this argument is set, split multiallelic records and left-align individual alleles. | ||
* If this argument is not set, multiallelic records are not attempted to left-align and will be copied as is. | ||
*/ | ||
@Argument(fullName = "splitMultiallelics", shortName = "split", doc = "Split multiallelic records and left-align individual alleles", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kebab case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this probably shouldn't have a short name
* subset. If this flag is enabled, the original values of those annotations will be stored in new annotations called | ||
* AC_Orig, AF_Orig, and AN_Orig. | ||
*/ | ||
@Argument(fullName = "keepOriginalAC", shortName = "keepOriginalAC", doc = "Store the original AC, AF, and AN values after subsetting", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kebab case
/** | ||
* Maximum indel size to realign. Indels larger than this will be left unadjusted. | ||
*/ | ||
@Argument(fullName = "maxIndelSize", shortName = "maxIndelSize", doc = "Set maximum indel size to realign", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kebab
/** | ||
* If this argument is set, bases common to all alleles will not be removed and will not leave their minimal representation. | ||
*/ | ||
@Argument(fullName = "dontTrimAlleles", shortName = "notrim", doc = "Do not Trim alleles to remove bases common to all of them", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, it is generally good practice to extract out these argument names to constants either held in one of the gatk-wide classes (like gatkStandardArgumentDefinitions) or if they aren't likely relevant anywhere else, just put them into a static field on this class. As part of this you should ideally also replace the arguments with references to the static field in the tests as well.
* @param assignGenotypes assignment strategy for the (subsetted) PLs | ||
* @return a new non-null GenotypesContext with subsetted alleles | ||
*/ | ||
public static GenotypesContext subsetAlleles(final VariantContext vc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should be replaced by the existing codepath in AlleleSubsettingUtils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils
, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.
int leftmostAllowedAlignment = furthestEndOfEarlierVariant - ref.getWindow().getStart() + 2; | ||
// left align the CIGAR | ||
Cigar newCigar = AlignmentUtils.leftAlignIndel(originalCigar, refSeq, originalIndel, 0, 0, leftmostAllowedAlignment, true); | ||
if (newCigar.equals(originalCigar)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So is this saying that the smallest repetitive unit we can left align is a unit that is less than 50 bases long? Because this returns if the newCigar hasn't moved at all, but looking at leftAlignIndel the code appears to check deletion strings for equality going back until it falls off the read it is provided, returning the original cigar if it is never true. Is that behavior expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite. In the case of an indel with an str greater than 50 bases, LeftAlignIndel will push the indel to the front of the provided read, which then leads to the number of leading bases being doubled. The reason for this behavior is the same as the explanation to your next comment, so see there for detail.
return newVC; | ||
} | ||
// if we have left aligned to the beginning of the reference window, double leadingBases length and try again | ||
leadingBases *= 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me (first based on the actual tests that hit this line and the above code) that this line will only get hit in the case where the reference is a repetition of a single base, as that seems to be the only likely case where a deletion or insertion will "fall off" a read, and it seems to pick this up. But I see no tests of this auto-expansion behavior working for repetitive units like short tandem repeats, indeed it would seem that the number of units in a repetitive group would need to be divisible by 50 to ever reach this point in the code, otherwise the above left aligning code will just stop and place the indel somewhere short of the last base at the front of the read. Its a shame, because the underlying left alignment code is not designed for or even remotely efficient at 1000+ base searches so it must be avoided somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least I would like to see a test that demonstrates you can left align a 3 base str that extends over 50 bases on the reference properly with this approach, as it currently seems very few test cases actually hit this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I initially thought you were totally right about this. But it turns out this auto-expansion actually does work, although it took me a little while to figure out why when the tests I wrote to demonstrate the problem unexpectedly passed. Essentially, when an indel is not left aligned it can always be moved one base (and possibly more) to the left. When we are testing whether an indel can be moved one base to the left in AlignmentUtils.leftAlignSingleIndel, we are testing whether the sequence can be represented by an indel of the same length placed one base to the left, not whether the sequence can be represented by the same inserted/deleted bases placed one base to the left. So even when the str is not a factor of 50, the indel will still be initially aligned to the beginning of the given sequence, leading to the window being expanded. I do agree that there are no tests showing that this works correctly though, so I will add some.
|
||
@DataProvider(name = "LeftAlignDataProvider") | ||
public Object[][] LeftAlignTestData() throws UnsupportedEncodingException { | ||
List<Object[]> tests = new ArrayList<Object[]>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This data provider needs to be better formatted with more comments to explain exactly what it is generating as test data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably create a second data provider for this test that test larger repeat units rather than repeating the same alt-base over and over again.
import java.util.*; | ||
|
||
|
||
public class LeftAlignAndTrimVariantsIntegrationTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might consider trying to find/construct some larger test data than these alleles. The depth of the tests in GATK3 for this tool appears to have been pretty light and we could use perhaps a much larger snippet of VCF or some more pathological sites to demonstrate that the code is behaving properly. I'm not sure who to ask for that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya I agree. I will hunt for some pathologically long repeats/strs in the reference and add some tests using them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kachulis. Thanks for this port. I sussed out the changes and wrote out my observations in #3487.
As it stands, the functionality passes muster. Your port is an improvement to the GATK3 version in that the tool no longer silently drops variants with max indel size greater than that set by --maxIndelSize
, which defaults to 200. Rather your branch keeps all records. This is great. 👏🏼
If you are feeling generous with your time, I would ask for two small additional things towards improving messaging.
- [1] Stdout still says
0 variants left aligned
. A count of records that were left-aligned-and-trimmed would be helpful.addition:
Alternatively, this misleading message should be dropped. - [2] Stdout still only lists the large indels and does not give a summary largest indel size, which would be helpful in setting
--maxIndelSize
to ensure all records are left-aligned and trimmed. This message could be next to that of [1], at the very end of the stdout.addition:
It may be useful to structure the message something along the lines of 'X number of records had max-indel-size above the set threshold of Y bases. These remain in their original state.'
These small changes are not required to merge from me. I thought I'd ask you given your current familiarity with the code. Thank you for this port and improvement to the tool. 😀
Again, my testing is outlined in #3487. |
thanks @sooheelee. The improved messaging you requested isn't difficult to implement, so I have added it, if you'd like to try it out. |
Thank you for the much-improved stdout messaging @kachulis. I tested it out and noted the changes again in the issue ticket. 👍🏽 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments
* -R reference.fasta \ | ||
* -V input.vcf \ | ||
* -O output.vcf \ | ||
* --splitMultiallelics \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the comment argument names here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
|
||
/** | ||
* @param vc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the comments for this method to be more instructive?
* @param assignGenotypes assignment strategy for the (subsetted) PLs | ||
* @return a new non-null GenotypesContext with subsetted alleles | ||
*/ | ||
public static GenotypesContext subsetAlleles(final VariantContext vc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe we can avoid duplicating a lot of code by replacing this part of the method path with AlleleSubsettingUtils
, the functionality of the methods are very close to identical and allele subsetting utils will perform the tasks you want. It would also be constructive to push the code you ported concerning recomputing the genotype there so it can be shared with tools that already have to subset Alleles.
final int readLength = referenceLength - (indelOp == 'D' ? indelSize : -indelSize); | ||
|
||
// create the original CIGAR string | ||
final GATKRead read = ArtificialReadUtils.createArtificialRead(header, "myRead", 0, 1, readLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good thorough test that helps ensure that this method is ding what it purports to. I do still think there is a case that i'm not strictly convinced this method is handling correctly (or at least expectedly) according to your usage of it in the tool. All of these test cases involve a read spanning the entirety of the contig, which means when we left align the indel we end up with the indel sitting right in the middle of the read (though surely correctly for various types of repetitions). The issue is that I don't see any test for the behavior of this method if we expect the indel to fall off of the edge of the read. I would like to see some assertion that if we start the read in the middle of say a 3base str that it will necessarily fall off the left end of the read if we run it, regardless of where the read/indel starts modulo 3 relative to the start of the indel. If we are sure according to a test that the indel will fall off of the read then we can continue to use your auto-expansion code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a test of this to AlignmentUtilsUnitTest
.
} | ||
|
||
@Test(dataProvider = "LeftAlignIndelWithLimitDataProvider") | ||
public void testLeftAlignIndelWithLimit(final Cigar originalCigar, final Cigar expectedCigar, final byte[] reference, final byte[] read, final int repeatLength, final int leftmostAllowedAlignment) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a short comment here summarizing what the dataset you generated with the data provider looks like when we run this test.
src/test/java/org/broadinstitute/hellbender/utils/variant/GATKVariantContextUtilsUnitTest.java
Show resolved
Hide resolved
@jamesemery I consolidated the additions to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits to pick but once you resolve the failing tests/conflicts I would say this can probably be merged
final List<Allele> allelesToUse) { | ||
final List<Allele> allelesToUse, | ||
final List<Allele> originalGT) { | ||
if(originalGT == null && assignmentMethod == GenotypeAssignmentMethod.BEST_MATCH_TO_ORIGINAL) throw new IllegalArgumentException("origianlGT cannot be null if assignmentMethod is BEST_MATCH_TO_ORIGINAL"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we don't use one line if statements for clarity in the gakt. Just add brackets to encase this and indent
@@ -1290,7 +1300,7 @@ public static VariantContext makeFromAlleles(final String name, final String con | |||
genotypeAssignmentMethodUsed != GenotypeAssignmentMethod.SET_TO_NO_CALL) | |||
addInfoFieldAnnotations(vc, builder, keepOriginalChrCounts); | |||
|
|||
builder.genotypes(subsetAlleles(vc, alleles, genotypeAssignmentMethodUsed)); | |||
builder.genotypes(AlleleSubsettingUtils.subsetAlleles(vc.getGenotypes(),2,vc.getAlleles(), alleles, genotypeAssignmentMethodUsed,vc.getAttributeAsInt("DP",0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, it looks like this did away with a bunch of duplicated code
} | ||
|
||
@DataProvider(name = "LeftAlignIndelStartOfRead") | ||
public Object[][] makeLeftAlignIndelStartOfReadDataProvider() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, i'm convinced by this test. It looks like it is indeed sliding deletions off the edge of the read and is sliding insertions onto the edge properly.
oneLineSummary = "Left align and trim vairants", | ||
programGroup = VariantManipulationProgramGroup.class | ||
) | ||
@DocumentedFeature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the ultimate use case, you might consider marking the tool as beta for now as it is a new port of this tool that might be prone to further changes/feature requests.
@kachulis looks like you need to rebase onto master one more time |
…ust don't try to left align
2b4d940
to
8a29c83
Compare
Thank you, @kachulis |
Thanks @jamesemery for the review |
Ports LeftAlignAndTrimVariants from GATK3 (requested in #3487). There is one behavior change compared to GATK3: this tool will not left align a variant to before the end of a previous variant, since doing so could end up representing a non-equivalent sequence.