Adding base level comparison modes for CompareReferences tool #7987

Merged: 13 commits merged into master from oc_comparereferenceswithaligner on Aug 27, 2022


orlicohen
Contributor

No description provided.

gatk-bot commented Aug 11, 2022

Github actions tests reported job failures from actions build 2841114156
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| unit | 11 | 2841114156.13 | logs |
| integration | 11 | 2841114156.12 | logs |
| unit | 8 | 2841114156.1 | logs |
| integration | 8 | 2841114156.0 | logs |

@orlicohen force-pushed the oc_comparereferenceswithaligner branch from ff3ef6b to 2038fd1 on August 11, 2022 16:28
gatk-bot commented Aug 11, 2022

Github actions tests reported job failures from actions build 2841313433
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| unit | 11 | 2841313433.13 | logs |
| integration | 11 | 2841313433.12 | logs |
| unit | 8 | 2841313433.1 | logs |
| integration | 8 | 2841313433.0 | logs |

@orlicohen force-pushed the oc_comparereferenceswithaligner branch from 2038fd1 to 164bad3 on August 11, 2022 19:34
codecov bot commented Aug 11, 2022

Codecov Report

Merging #7987 (b8e168e) into master (c22972a) will decrease coverage by 31.524%.
The diff coverage is 3.205%.

❗ Current head b8e168e differs from pull request most recent head b4ffbc2. Consider uploading reports for the commit b4ffbc2 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##              master     #7987        +/-   ##
================================================
- Coverage     52.260%   20.735%   -31.524%     
+ Complexity     29146     12332     -16814     
================================================
  Files           2310      2331        +21     
  Lines         180344    181988      +1644     
  Branches       19840     19985       +145     
================================================
- Hits           94247     37736     -56511     
- Misses         80124    140700     +60576     
+ Partials        5973      3552      -2421     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...bender/tools/reference/ReferenceSequenceTable.java | 79.570% <ø> (-7.527%) ⬇️ |
| ...ute/hellbender/utils/alignment/MummerExecutor.java | 0.000% <0.000%> (ø) |
| ...ls/reference/CompareReferencesIntegrationTest.java | 0.694% <0.000%> (-76.770%) ⬇️ |
| .../hellbender/tools/reference/CompareReferences.java | 21.344% <3.429%> (-45.323%) ⬇️ |
| ...bender/utils/alignment/MummerExecutorUnitTest.java | 8.696% <8.696%> (ø) |
| ...der/tools/reference/CompareReferencesUnitTest.java | 12.500% <12.500%> (ø) |
| ...tute/hellbender/tools/reference/ReferencePair.java | 79.487% <50.000%> (-10.256%) ⬇️ |
| ...rg/broadinstitute/hellbender/tools/CountBases.java | 0.000% <0.000%> (-100.000%) ⬇️ |
| ...rg/broadinstitute/hellbender/tools/CountReads.java | 0.000% <0.000%> (-100.000%) ⬇️ |
| ...g/broadinstitute/hellbender/utils/mcmc/Decile.java | 0.000% <0.000%> (-100.000%) ⬇️ |

... and 1329 more

@orlicohen force-pushed the oc_comparereferenceswithaligner branch from 164bad3 to 40c723f on August 11, 2022 19:36
gatk-bot commented Aug 11, 2022

Github actions tests reported job failures from actions build 2842345810
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| integration | 11 | 2842345810.12 | logs |
| integration | 8 | 2842345810.0 | logs |

gatk-bot commented Aug 11, 2022

Github actions tests reported job failures from actions build 2842334324
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| integration | 11 | 2842334324.12 | logs |
| integration | 8 | 2842334324.0 | logs |

@droazen changed the title from "* DO NOT MERGE * adding base level comparison modes for CompareReferences tool" to "Adding base level comparison modes for CompareReferences tool" on Aug 12, 2022
Collaborator

@jamesemery left a comment

Checkpoint review of everything except for the MummerExecutor.

@@ -141,6 +171,28 @@ public void traverse(){
for(ReferencePair pair : referencePairs){
System.out.println(pair);
}

if(referencePairs.size() != 1 && baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){
throw new UserException.BadInput("");
Collaborator

finish this warning message

}
if(baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){
if(baseComparisonOutputDirectory == null) {
throw new UserException.CouldNotCreateOutputFile(baseComparisonOutputDirectory, "Output directory not provided but required in base comparison mode.");
Collaborator

in "--argument name BASE_COMPARISON" mode

@@ -141,6 +171,28 @@ public void traverse(){
for(ReferencePair pair : referencePairs){
System.out.println(pair);
}

if(referencePairs.size() != 1 && baseComparisonMode != BaseComparisonMode.NO_BASE_COMPARISON){
Collaborator

Move these sanity checks to before the tool starts doing any of the work (perhaps in "custom commandline validation").
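
As a rough illustration of this suggestion, the sketch below checks the argument combination up front and reports problems before any work starts. It is a standalone stand-in rather than GATK code: `ArgumentChecks`, `validate`, and the local `Mode` enum are hypothetical names, and the message wording is placeholder text.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tool's up-front argument validation; not GATK code.
class ArgumentChecks {
    // Mirrors the three comparison modes discussed in this PR.
    enum Mode { NO_BASE_COMPARISON, FULL_ALIGNMENT, FIND_SNPS_ONLY }

    // Returns a list of problems; an empty list means the arguments are usable.
    static List<String> validate(int referencePairCount, Mode mode, File outputDirectory) {
        List<String> errors = new ArrayList<>();
        if (mode == Mode.NO_BASE_COMPARISON) {
            return errors; // nothing extra to check when base comparison is off
        }
        if (referencePairCount != 1) {
            errors.add("Base comparison requires exactly one pair of references, but found "
                    + referencePairCount + ".");
        }
        if (outputDirectory == null) {
            errors.add("An output directory is required when base comparison mode is " + mode + ".");
        }
        return errors;
    }
}
```

Collecting every problem in one pass lets the user fix all of their arguments at once instead of hitting the errors one run at a time.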

// pass fastas into mummer, get back a vcf and add to list
MummerExecutor executor = new MummerExecutor();
logger.info("Running mummer alignment on sequence " + sequenceName);
File tempSnpsDirectory = IOUtils.createTempDir("tempsnps");
Collaborator

So you are making a whole new temp dir within this loop for every mismatching contig. This can potentially add up to a lot, and these are all Java temp dirs that are only cleaned up on JVM exit (with approximately two uncompressed references' worth of inputs in them).
I think this method should probably manage the deletion of the temp dir itself to cut down on disk usage. @droazen thoughts?
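
One way to scope a temp directory to a single loop iteration is sketched below, using only JDK calls. `ScopedTempDir` and its methods are hypothetical helpers, not part of GATK or of this PR; the idea is simply that the directory is deleted as soon as the work that needs it returns, rather than at JVM exit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical helper; not part of GATK.
class ScopedTempDir {
    // Creates a temp directory, hands it to the caller's work, and deletes it
    // (with everything inside) before returning, even if the work throws.
    static <T> T withTempDir(String prefix, Function<Path, T> work) {
        final Path dir;
        try {
            dir = Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            return work.apply(dir);
        } finally {
            deleteRecursively(dir);
        }
    }

    // Deletes children before parents by visiting paths in reverse sorted order.
    static void deleteRecursively(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, any result that must outlive the iteration (for example a merged snps file) has to be copied out of the directory inside the callback.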

}
// merge individual snps files
File snps = IOUtils.createTempFile(String.format("%s_%s", refPair.getRef1AsString(), refPair.getRef2AsString()), ".snps");
/*new File(baseComparisonOutputDirectory.toPath().toString(), String.format("%s_%s.snps", refPair.getRef1AsString(), refPair.getRef2AsString()));*/
Collaborator

Remove the commented-out code.

SNPRecord record = new SNPRecord(sequenceName, position, new String(new byte[]{ref1Allele}), new String(new byte[]{ref2Allele}), refPair.getRef1AsString(), refPair.getRef2AsString());
writer.writeRecord(record);
}
}
Collaborator

Add a check at the end that both iterators are finished when you exit the loop, and throw an exception otherwise.

Contributor Author

Do I need this in addition to the check earlier in the loop that the sequence lengths for the current sequence are equal? If I want one specifically on these iterators, would that be a user exception? It should be the same sequence length as is being used in the first check.

Collaborator

@orlicohen This is an error that should never happen, since you've already checked that the sequence lengths are the same. If it does happen that one iterator finishes before the other, something has gone horribly wrong internally. You'd probably want to throw a ShouldNeverReachHereException in that case, since we never expect it to happen in practice.
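
The check discussed in this thread can be sketched as follows. This is a simplified stand-in, not the PR's code: `LockstepCompare` is a hypothetical name, the bases are modelled as `Iterator<Byte>`, positions are assumed to be 1-based, and `IllegalStateException` stands in for the internal-error exception mentioned above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the base-by-base comparison loop.
class LockstepCompare {
    // Returns the positions (1-based, an assumption of this sketch) where the bases differ.
    static List<Integer> mismatchPositions(Iterator<Byte> ref1, Iterator<Byte> ref2) {
        List<Integer> positions = new ArrayList<>();
        int position = 0;
        while (ref1.hasNext() && ref2.hasNext()) {
            position++;
            if (!ref1.next().equals(ref2.next())) {
                positions.add(position);
            }
        }
        // Lengths were verified earlier, so a leftover base on either side means an internal bug.
        if (ref1.hasNext() || ref2.hasNext()) {
            throw new IllegalStateException(
                    "One sequence iterator finished before the other after position " + position);
        }
        return positions;
    }
}
```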

int sequenceLength = source.getSequenceDictionary().getSequence(sequenceName).getSequenceLength();
FastaReferenceWriter writer = new FastaReferenceWriterBuilder()
.setFastaFile(output.toPath())
.setBasesPerLine(80)
Collaborator

Magic ints should be variables somewhere.

@@ -35,7 +35,7 @@
* between command/script/module execution. Using -i doesn't buy you anything (for this version of the executor, at
* least) since the process is terminated after each command completes.
*/
public class PythonScriptExecutor extends PythonExecutorBase {
public class PythonScriptExecutor extends PythonExecutorBase {
Collaborator

accidental, please remove

@@ -16,7 +16,8 @@
* {@link #getApproximateCommandLine}
* {@link #getScriptException}
*/
public abstract class ScriptExecutor {
public abstract class
Collaborator

accidental, please remove


// need intel build for MUMmer
@Test(enabled = false)
public void testExecuteMummer() throws IOException {
Collaborator

We are going to need a test that asserts we get a nice user exception in the event that the correct installation of MUMmer doesn't exist for the user's machine.

Collaborator

@droazen left a comment

@orlicohen Here's part 1 of my review -- still need to review the tests and MummerExecutor

@@ -0,0 +1,42 @@
package org.broadinstitute.hellbender.tools.reference;
Collaborator

This unit test class should be in src/test/java/org/broadinstitute/hellbender/utils/alignment/

import java.io.File;
import java.io.IOException;

public class MummerExecutorUnitTest extends CommandLineProgramTest {
Collaborator

Unit test classes should extend GATKBaseTest, not CommandLineProgramTest

@@ -93,6 +105,15 @@ public class CompareReferences extends GATKTool {
@Argument(fullName = "display-only-differing-sequences", doc = "If provided, only display sequence names that differ in their actual sequence.", optional = true)
private boolean onlyDisplayDifferingSequences = false;

@Argument(fullName = "base-comparison", doc = "If provided, any mismatching, same-length sequences will be aligned for a base-comparison.", optional = true)
Collaborator

Argument description is incorrect, since full alignment mode can handle sequences of different lengths. Should be something like: Mode for base-level comparisons. Off by default, but can do either full alignment of mismatching sequences to find both SNPs and indels (FULL_ALIGNMENT), or can find SNPs only in mismatching sequences of the same length (FIND_SNPS_ONLY)

public enum BaseComparisonMode{
// no base comparison
NO_BASE_COMPARISON,
// run the mummer pipeline to generate a snps file containing SNPs and INDELs for any mismatching sequences of same name
Collaborator

"to generate a snps file" -> "to generate a VCF"

NO_BASE_COMPARISON,
// run the mummer pipeline to generate a snps file containing SNPs and INDELs for any mismatching sequences of same name
FULL_ALIGNMENT,
// do a base-by-base comparison of any mistmatching sequences of same name to output a table containing each base mismatch
Collaborator

mistmatching -> mismatching
"of same name" -> "of same name and length"

// find the mismatch sequence
for (String sequenceName : table.getAllSequenceNames()) {
Set<ReferenceSequenceTable.TableRow> rows = table.queryBySequenceName(sequenceName);
if (rows.size() == 2) {
Collaborator

Add comment explaining significance of two rows here

// if the lengths of the 2 sequences aren't equal, error - can't compare different sequence lengths, probably indel need alignment
ReferenceSequenceTable.TableRow[] rowArray = rows.toArray(new ReferenceSequenceTable.TableRow[0]);
if (rowArray[0].getLength() != rowArray[1].getLength()) {
logger.warn("Sequence lengths are not equal and can't be compared. Consider running in FULL_ALIGNMENT mode.");
Collaborator

Include the sequence name in this logger message

}
}

private static class MummerIndel{
Collaborator

Add a comment explaining what this class is

@@ -16,7 +16,8 @@
* {@link #getApproximateCommandLine}
* {@link #getScriptException}
*/
public abstract class ScriptExecutor {
public abstract class
ScriptExecutor {
Collaborator

Revert accidental whitespace change

@@ -35,7 +35,7 @@
* between command/script/module execution. Using -i doesn't buy you anything (for this version of the executor, at
* least) since the process is terminated after each command completes.
*/
public class PythonScriptExecutor extends PythonExecutorBase {
public class PythonScriptExecutor extends PythonExecutorBase {
Collaborator

Revert accidental whitespace change

Collaborator

@droazen left a comment

@orlicohen Here's part 2 of my review

import java.util.*;

/**
* Class for executing MUMmer pipeline.
Collaborator

Add a little more detail here about what the pipeline does and which tools it runs

* @param printStdout boolean to trigger displaying to stdout
* @return the ProcessOutput of the run
*/
public static ProcessOutput runPythonCommand(String script, List<String> scriptArguments, Map<String, String> additionalEnvironmentVars, File stdoutCaptureFile, boolean printStdout){
Collaborator

I believe runPythonCommand() is now unused and can be deleted

}

// FULL_ALIGNMENT tests:
Collaborator

These FULL_ALIGNMENT mode tests could be unified using a DataProvider, since the actual test code is the same in every case. The DataProvider would list the two fasta inputs, plus the expected output for each test case.
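
A plain-Java sketch of the table-driven shape suggested here is shown below. In a TestNG test class, `cases()` would be the method annotated as the DataProvider and the shared body would be the single test method; everything named here (`FullAlignmentCases`, `cases`, `actualOutputName`, `outputNamesForAllCases`) is hypothetical. The output naming pattern is inferred from file names quoted elsewhere in this review, and the expected-output column is left out because the sketch does not know those file names.

```java
// Plain-Java stand-in for a TestNG DataProvider; class and method names are hypothetical.
class FullAlignmentCases {
    // Each row holds { first reference, second reference }.
    // A real provider would add a third column naming the expected output file.
    static Object[][] cases() {
        return new Object[][] {
            // Same two files in both orders: the order decides insertion vs. deletion.
            { "hg19mini_chr1indel.fasta", "hg19mini.fasta" },
            { "hg19mini.fasta", "hg19mini_chr1indel.fasta" },
            { "hg19mini.fasta", "hg19mini_snpandindel.fasta" },
        };
    }

    // Output naming pattern inferred from the file names quoted in this review.
    static String actualOutputName(String ref1, String ref2) {
        return ref1 + "_" + ref2 + ".vcf";
    }

    // The single shared body: here it only derives the output name for each row.
    static String[] outputNamesForAllCases() {
        Object[][] rows = cases();
        String[] names = new String[rows.length];
        for (int i = 0; i < rows.length; i++) {
            names[i] = actualOutputName((String) rows[i][0], (String) rows[i][1]);
        }
        return names;
    }
}
```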

@Test
public void testFullAlignmentModeInsertion() throws IOException{
final File ref1 = new File(getToolTestDataDir() + "hg19mini_chr1indel.fasta");
final File ref2 = new File(getToolTestDataDir() + "hg19mini.fasta");
Collaborator

Add a comment here documenting that the order in which these are specified determines whether it's an insertion or deletion

@Test
public void testFullAlignmentModeDeletion() throws IOException{
final File ref1 = new File(getToolTestDataDir() + "hg19mini.fasta");
final File ref2 = new File(getToolTestDataDir() + "hg19mini_chr1indel.fasta");
Collaborator

Add a comment here documenting that the order in which these are specified determines whether it's an insertion or deletion

public void testPrepareMUMmerExecutionDirectory(){
MummerExecutor exec = new MummerExecutor();
File executableDirectory = exec.getMummerExecutableDirectory();
Assert.assertEquals(executableDirectory.listFiles().length, 4);
Collaborator

Also check for the presence of the individual expected files (nucmer, delta-filter, etc.), by name
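
A by-name check along these lines might look like the sketch below. `ExecutableDirCheck` and `missingFiles` are hypothetical names, and the list of expected file names is passed in by the caller, since only `nucmer` and `delta-filter` are named in this thread.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for asserting on directory contents by file name.
class ExecutableDirCheck {
    // Returns the expected file names that are absent from the directory, sorted.
    static Set<String> missingFiles(File directory, List<String> expectedNames) {
        String[] present = directory.list(); // null if the path is not a readable directory
        Set<String> missing = new TreeSet<>(expectedNames);
        if (present != null) {
            missing.removeAll(Arrays.asList(present));
        }
        return missing;
    }
}
```

A test could then assert that the returned set is empty, which names the missing executable in the failure output instead of only reporting a wrong count.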

@@ -0,0 +1,5 @@

Collaborator

Is there supposed to be a blank line at the start of this expected output file?

@@ -0,0 +1,3099 @@
MD5 Length hg38_better_alt_masked.fa Homo_sapiens_assembly38_masked.fasta
Collaborator

Is this DRAGEN expected output used in an actual test?

final File expectedOutput = new File(getToolTestDataDir(), "expected.SNPandINDEL.hg19mini.fasta_hg19mini_snpandindel.fasta.vcf");
final File actualOutput = new File(output, "hg19mini.fasta_hg19mini_snpandindel.fasta.vcf");
IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput);
}
}
Collaborator

We ideally want a FULL_ALIGNMENT mode test case that comprehensively exercises all of the state transitions in the VCF creation code between SNPs, insertions, and deletions. E.g., something like this:

SNP
SNP
insertion
SNP
deletion
SNP
insertion
insertion
deletion
deletion
insertion
SNP

Is it possible to create a test case with all of the above on a single contig (and in that order)?

final File expectedOutput = new File(getToolTestDataDir(), "expected.hg19mini.fasta_hg19mini_chr2iupacsnps.fasta_snps.tsv");
final File actualOutput = new File(output, "hg19mini.fasta_hg19mini_chr2iupacsnps.fasta_snps.tsv");

IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput);
}
Collaborator

Add an additional FIND_SNPS_ONLY mode test case that runs on a sequence with an indel, just to show that it doesn't explode in that case.

@droazen changed the title from "Adding base level comparison modes for CompareReferences tool" to "(Do not merge) Adding base level comparison modes for CompareReferences tool" on Aug 12, 2022
gatk-bot commented Aug 12, 2022

Github actions tests reported job failures from actions build 2849712241
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| unit | 11 | 2849712241.13 | logs |
| integration | 11 | 2849712241.12 | logs |
| unit | 8 | 2849712241.1 | logs |
| integration | 8 | 2849712241.0 | logs |

Contributor Author

orlicohen commented Aug 12, 2022

CompareReferences FullAlignment integration tests & ExecuteMummer unit test failing currently due to MUMmer build

@droazen changed the title from "(Do not merge) Adding base level comparison modes for CompareReferences tool" to "Adding base level comparison modes for CompareReferences tool" on Aug 27, 2022
Collaborator

@droazen left a comment

Pushed a patch to add Intel x86_64 builds of MUMmer for Mac and Linux, with code to select the right distribution for the user's system. Branch looks good now, and all comments have been addressed -- will merge once tests are green!
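
For readers curious what selecting a distribution by platform can look like, here is a minimal sketch based on the standard `os.name` and `os.arch` JVM system properties. It is not the code from the patch: `MummerDistribution`, `select`, and the returned directory names are hypothetical, and `UnsupportedOperationException` stands in for whatever user-facing exception the tool actually throws.

```java
import java.util.Locale;

// Hypothetical sketch of platform-based selection; not the patch's actual code.
class MummerDistribution {
    // Maps JVM system property values to a distribution directory name.
    // The names returned here are placeholders, not the PR's real layout.
    static String select(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);
        // 64-bit Intel is reported as "x86_64" or "amd64" depending on the platform.
        boolean intel64 = arch.equals("x86_64") || arch.equals("amd64");
        if (!intel64) {
            throw new UnsupportedOperationException("No bundled MUMmer build for architecture " + osArch);
        }
        if (os.contains("mac")) {
            return "macosx";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        throw new UnsupportedOperationException("No bundled MUMmer build for operating system " + osName);
    }

    // Convenience wrapper that reads the properties of the running JVM.
    static String selectForCurrentSystem() {
        return select(System.getProperty("os.name"), System.getProperty("os.arch"));
    }
}
```

Keeping the property values as parameters makes the unsupported-platform path testable on any machine.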

@droazen dismissed jamesemery’s stale review on August 27, 2022 16:30: All comments resolved

@droazen merged commit 993c7f1 into master on Aug 27, 2022
@droazen deleted the oc_comparereferenceswithaligner branch on August 27, 2022 16:47