CNNVariant Update models, validate scores, cleanup training #5175

lucidtronix · 2018-09-10T18:17:32Z

Code cleanup and technical debt payback, plus updated models.

cmnbroad

Sorry it took a while to get to this. I did a first pass on everything except the CNNVariantTrain changes, but I have some questions, and the concordance test fails (not sure why that doesn't show up as a red x on the PR...). I think it's just that the test file name is misspelled in the test.

cmnbroad · 2018-09-19T19:26:28Z

scripts/gatkcondaenv.yml.template

@@ -43,9 +43,8 @@ dependencies:
  - scikit-learn==0.19.1
  - scipy==1.0.0
  - six==1.11.0
-  - $tensorFlowDependency
-  - tensorflow-tensorboard==0.4.0rc3
+  - tensorflow==1.9.0


As long as we still have the build script generating two conda yml files, we should keep the $tensorFlowDependency reference in here. Also, the version referenced in the yml doesn't match the version in build.gradle.

cmnbroad · 2018-09-19T19:28:32Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/InfoConcordanceRecord.java

+import org.broadinstitute.hellbender.utils.tsv.TableWriter;
+import org.broadinstitute.hellbender.utils.tsv.TableReader;
+
+public class InfoConcordanceRecord {


Whole class needs javadoc.

cmnbroad · 2018-09-19T19:29:55Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/InfoConcordanceRecord.java

+            return writer;
+        }
+        catch (IOException e) {
+            throw new UserException(String.format("Encountered an IO exception while reading from %s.", outputTable), e);


This is writing, not reading.

cmnbroad · 2018-09-19T19:37:14Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/VcfInfoConcordance.java

+        programGroup=VariantEvaluationProgramGroup.class)
+@DocumentedFeature
+@BetaFeature
+public class VcfInfoConcordance extends AbstractConcordanceWalker {


Is this sufficiently different from the Concordance tool to be a separate tool?

Tool name should probably have a verb. Maybe EvaluateInfoFieldConcordance or EvaluateSiteConcordance?

Needs javadoc.

Fixed, and yes, I think this is different enough from Concordance, which evaluates a caller, because this tool is intended for filter scores or annotations.

cmnbroad · 2018-09-19T19:45:42Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/VcfInfoConcordance.java

+    @Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
+    protected File summary;
+    @Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)
+    protected String evalInfoKey = "CNN_2D";


Use the existing constant in GATKVCFConstants.

cmnbroad · 2018-09-19T21:07:19Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

+            scoreScan.useDelimiter("\\n");
+            writeVCFHeader(vcfWriter);
+        } catch (IOException e) {
+            throw new GATKException("Error when trying to write annotated VCF.", e);


It could be the score file - might be worth having a nested try/catch for that.

Also, the Scanner is never closed anywhere.

It seems createVCFWriter won't throw IOExceptions, so I just fixed the error message. The Scanner is now closed.
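
The two points in this thread (attribute a failure to the score file rather than the VCF writer, and make sure the Scanner is closed) can be sketched in plain Java. This is a minimal illustration, not the code from the PR: `ScoreFileSketch` and `readScores` are hypothetical names, and it assumes the Scanner is used within a single method. When the Scanner is a long-lived field, as in the tool, the equivalent is a null check followed by `close()` at shutdown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical helper (not part of GATK): reads newline-delimited scores and always closes the Scanner. */
final class ScoreFileSketch {
    static List<String> readScores(Path scoreFile) {
        // try-with-resources closes the Scanner, and the file behind it, on every exit path.
        try (Scanner scoreScan = new Scanner(scoreFile)) {
            scoreScan.useDelimiter("\\n");
            List<String> scores = new ArrayList<>();
            while (scoreScan.hasNext()) {
                scores.add(scoreScan.next());
            }
            return scores;
        } catch (IOException e) {
            // The message names the score file, so the failure is not mistaken for a VCF writing error.
            throw new UncheckedIOException("Error when trying to read score file: " + scoreFile, e);
        }
    }
}
```

Keeping the score-file handling in its own try block is what lets the error message be specific about which file failed.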

cmnbroad · 2018-09-19T21:09:41Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

+        logger.info("Done scoring variants with CNN.");
+        if ( vcfWriter != null ) {
+            vcfWriter.close();
+        }


Should test the Scanner for null and also close that if not.

cmnbroad · 2018-09-19T21:13:45Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

@@ -486,5 +495,20 @@ private void setArchitectureAndWeightsFromResources() {
        }
    }

+    private void startPythonSession(){


Can you rename this to something more specific - maybe one suggestion would be initializePythonArgsAndModel.

I think it would be a lot simpler if all of the code that figures out the values for weights and architecture was in one place, i.e., all of the if blocks in setArchitectureAndWeightsFromResources and startPythonSession together in one place. The case where you use Python None would require special treatment, but it would be much easier to read if the logic were consolidated.

Renamed and consolidated the functions.

cmnbroad · 2018-09-19T21:16:16Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

            pythonExecutor.sendSynchronousCommand(String.format("tempFile = open('%s', 'w+')" + NL, scoreFile.getAbsolutePath()));
            pythonExecutor.sendSynchronousCommand("import vqsr_cnn" + NL);
-


Can you move the above import vqsr_cnn into the python code like you did with the keras import?

Not easily, because executePythonCommand calls vqsr_cnn.score_and_write_batch, so this python session needs the vqsr_cnn module loaded.

cmnbroad · 2018-09-19T21:22:48Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

@@ -92,15 +92,15 @@
 *   -weights path/to/my_weights.hd5
 * </pre>
 */
+@BetaFeature


I've done only a superficial review of the python code behind this. Has anyone else ever reviewed it? We should talk about how to get that done.

@mbabadi did a review too, but I would be happy for someone to take a closer look if there's a volunteer.

lucidtronix · 2018-09-21T21:02:49Z

@cmnbroad thanks for the review, back to you!

codecov-io · 2018-09-22T16:33:53Z

Codecov Report

Merging #5175 into master will decrease coverage by 0.01%.
The diff coverage is 74.89%.

@@             Coverage Diff              @@
##             master    #5175      +/-   ##
============================================
- Coverage     86.77%   86.75%   -0.02%     
- Complexity    29910    29935      +25     
============================================
  Files          1835     1838       +3     
  Lines        138574   138742     +168     
  Branches      15255    15276      +21     
============================================
+ Hits         120250   120369     +119     
- Misses        12772    12808      +36     
- Partials       5552     5565      +13

Impacted Files	Coverage Δ	Complexity Δ
...nder/tools/walkers/vqsr/FilterVariantTranches.java	`92.24% <ø> (ø)`	`42 <0> (ø)`	⬇️
...der/tools/walkers/vqsr/CNNVariantWriteTensors.java	`85.71% <100%> (+2.38%)`	`4 <0> (ø)`	⬇️
...hellbender/tools/walkers/vqsr/CNNVariantTrain.java	`60% <46.66%> (-20.65%)`	`4 <0> (ø)`
...lkers/validation/EvaluateInfoFieldConcordance.java	`72.58% <72.58%> (ø)`	`14 <14> (?)`
...ellbender/tools/walkers/vqsr/CNNScoreVariants.java	`73.68% <77.14%> (-1.32%)`	`41 <17> (+1)`
...ools/walkers/validation/InfoConcordanceRecord.java	`93.93% <93.93%> (ø)`	`8 <8> (?)`
...n/EvaluateInfoFieldConcordanceIntegrationTest.java	`96% <96%> (ø)`	`3 <3> (?)`
...utils/smithwaterman/SmithWatermanIntelAligner.java	`50% <0%> (-30%)`	`1% <0%> (-2%)`
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java	`60% <0%> (ø)`	`2% <0%> (ø)`	⬇️
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ce669d1...65d0edd.

cmnbroad

Round 2.

cmnbroad · 2018-09-26T13:24:16Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+import org.broadinstitute.hellbender.utils.variant.GATKVCFConstants;
+import picard.cmdline.programgroups.VariantEvaluationProgramGroup;
+
+@CommandLineProgramProperties(


This tool needs javadoc, including a usage example.

The doc here, and the summary lines below, should reflect/mention that this handles one key at a time (currently, the doc uses plural "keys" so it sounds like it can evaluate more than one at a time).

Doc added. We can use two different keys from one VCF, or one or two keys for two different VCFs. Clarified in the doc.

cmnbroad · 2018-09-26T13:24:37Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+    static final String USAGE_SUMMARY = "This tool evaluates info fields from an input VCF against a VCF that has been validated and is considered to represent ground truth.\n";
+    public static final String SUMMARY_LONG_NAME = "summary";
+    public static final String SUMMARY_SHORT_NAME = "S";
+    @Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)


fullName can use the constant above.

cmnbroad · 2018-09-26T13:27:19Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+    public static final String SUMMARY_SHORT_NAME = "S";
+    @Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
+    protected String summary;
+    @Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)


Another nit - can you put a blank line between each command line arg definition.

cmnbroad · 2018-09-26T14:00:03Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+    @Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
+    protected String summary;
+    @Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)
+    protected String evalInfoKey = GATKVCFConstants.CNN_2D_KEY;


If these are going to be optional, and have this default, the doc should specify that it uses this key, and where it comes from.

Made them required, I think that is less error-prone.

cmnbroad · 2018-09-26T14:02:36Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+
+    private void infoDifference(VariantContext eval, VariantContext truth) {
+        double evalVal = Double.valueOf((String)eval.getAttribute(this.evalInfoKey));
+        double truthVal = Double.valueOf((String)truth.getAttribute(this.truthInfoKey));


These should handle the case where the key isn't present, otherwise it will throw a null pointer exception.

This tool should inspect both truth and eval headers and throw, or at least warn/log, if the header doesn't contain an info header line for the respective keys.
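
A rough sketch of the defensive lookup requested here, written in plain Java so it stands alone. A `Map` plays the role of a variant's INFO attributes, and `InfoValueSketch`, `infoAsDouble`, and `infoDifference` are hypothetical names rather than anything in the PR. The sign convention (eval minus truth) is also an assumption. The header check mentioned above is not covered.

```java
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical stand-in: a plain Map plays the role of a variant's INFO attributes. */
final class InfoValueSketch {
    /** Returns the value for key as a double, or empty if the key is absent or not numeric. */
    static OptionalDouble infoAsDouble(Map<String, Object> attributes, String key) {
        Object raw = attributes.get(key);
        if (raw == null) {
            // Absent key: report "no value" instead of letting a null reach Double.valueOf.
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(raw.toString()));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    /** Returns eval minus truth (assumed convention), or empty if either value is unavailable. */
    static OptionalDouble infoDifference(Map<String, Object> eval, String evalKey,
                                         Map<String, Object> truth, String truthKey) {
        OptionalDouble evalVal = infoAsDouble(eval, evalKey);
        OptionalDouble truthVal = infoAsDouble(truth, truthKey);
        if (!evalVal.isPresent() || !truthVal.isPresent()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(evalVal.getAsDouble() - truthVal.getAsDouble());
    }
}
```

The caller can then decide whether an empty result should be skipped, logged, or turned into a user-facing error.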

cmnbroad · 2018-09-26T18:14:25Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/InfoConcordanceRecord.java

+
+import htsjdk.variant.variantcontext.VariantContext;
+
+import java.io.File;


Unused import.

cmnbroad · 2018-09-26T18:25:27Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+        if (Math.abs(delta) > this.epsilon) {
+            this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString()));
+            this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString()));
+            this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString()));


Extra space before " Eval".

cmnbroad · 2018-09-26T18:34:30Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/InfoConcordanceRecord.java

+    final double mean;
+    final double std;
+
+    public InfoConcordanceRecord(VariantContext.Type type, String evalKey, String trueKey, double mean, double std) {


Public methods and classes need javadoc - it should be trivial to add.

cmnbroad · 2018-09-26T18:39:31Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

+        }
+
+        String getArgsAndModel;
+        if (weights != null && architecture != null) {


In my comment in the last review round, I was thinking more along the lines of collapsing the logic to make it easier to see what's going on, i.e.:

    if (!(tensorType.equals(TensorType.read_tensor) || tensorType.equals(TensorType.reference))) {
        throw new GATKException("No default architecture for tensor type:" + tensorType.name());
    }
    if (architecture == null) {
        architecture = tensorType.equals(TensorType.read_tensor) ?
                IOUtils.writeTempResourceFromPath(resourcePathReadTensor, null).getAbsolutePath() :
                IOUtils.writeTempResourceFromPath(resourcePathReferenceTensor, null).getAbsolutePath();
    }
    if (weights == null) {
        weights = tensorType.equals(TensorType.read_tensor) ?
                IOUtils.writeTempResourceFromPath(
                        resourcePathReadTensor.replace(".json", ".hd5"), null).getAbsolutePath() :
                IOUtils.writeTempResourceFromPath(
                        resourcePathReferenceTensor.replace(".json", ".hd5"), null).getAbsolutePath();
    }
    // single logger call...
    // single start_session_get_args_and_model call...

There would always be some initial value for weights, even if the user provides an architecture but not weights, so there could be a single call to the python code with everything, instead of varying the number of args. It would be much cleaner.

Also, what is the behavior if the user provides arch/weights that are out of sync with each other? I still think it might make more sense to have the input just be a folder name, so it matches the output from CNNVariantTrain, maybe with optional per-file arch/weight overrides.

I agree, but can we save this for the PEP8 PR that is on-deck? It will require several changes to the python code as well as updates to the WDLs, etc., so I would rather only refactor that stuff once.

cmnbroad · 2018-09-26T21:05:28Z

src/main/python/org/broadinstitute/hellbender/vqsr_cnn/vqsr_cnn/defines.py

@@ -5,6 +5,7 @@

 TENSOR_MAPS_2D = ['read_tensor']
 TENSOR_MAPS_1D = ['reference']
+# noinspection PyInterpreter


Curious about why this is necessary, or if it's accidental? This applies to the next line, right?

Accident, I don't know.

cmnbroad · 2018-09-26T21:24:21Z

...stitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordanceIntegrationTest.java

+        final Path summary = createTempPath("summary", ".txt");
+        final ArgumentsBuilder argsBuilder = new ArgumentsBuilder();
+        argsBuilder.addArgument(AbstractConcordanceWalker.EVAL_VARIANTS_SHORT_NAME, inputVcf)
+                .addArgument(AbstractConcordanceWalker.TRUTH_VARIANTS_LONG_NAME, inputVcf)


Also meant to suggest adding at least one test that uses a different file for truth than eval.

lucidtronix · 2018-10-02T17:45:29Z

@cmnbroad thanks for the review. I think I addressed all comments except the arguments/weights simplification, which I would prefer to save for the PEP8 refactor we discussed. Back to you!

cmnbroad

Almost there - a few remaining cleanup requests and a bug.

cmnbroad · 2018-10-03T19:12:46Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+        switch (concordanceState) {
+            case TRUE_POSITIVE: {
+                snpCount++;
+                indelCount++;


This is counting every variant as a snp and an indel.

oh man that's bad. Thanks for the catch, fixed.
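
For illustration, the fix amounts to incrementing only the counter that matches the variant's type. The sketch below uses a hypothetical `VariantKind` enum in place of the real variant type, and `ConcordanceCountSketch` and `countTruePositive` are made-up names. It shows the shape of the correction, not the code that was merged.

```java
/** Hypothetical sketch: count each true positive under exactly one variant type. */
final class ConcordanceCountSketch {
    /** Stand-in for the real variant type; an assumption for this example. */
    enum VariantKind { SNP, INDEL, OTHER }

    int snpCount = 0;
    int indelCount = 0;

    void countTruePositive(VariantKind kind) {
        // Branch on the type so a SNP never bumps indelCount, and vice versa.
        switch (kind) {
            case SNP:
                snpCount++;
                break;
            case INDEL:
                indelCount++;
                break;
            default:
                // Other types are not counted in either bucket.
                break;
        }
    }
}
```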

cmnbroad · 2018-10-03T19:13:25Z

...ava/org/broadinstitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordance.java

+            this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString()));
+            this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString()));
+            this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString()));
+        }


We generally don't leave debugging code enabled. If it's just for (your) debugging, I'd either remove it, or else make it conditional on an @Advanced or @Hidden arg (and probably make epsilon @Advanced or @Hidden as well).

cmnbroad · 2018-10-03T19:15:18Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/validation/InfoConcordanceRecord.java

+        this.std = std;
+    }
+
+    public VariantContext.Type getVariantType() {


All of the public methods (and embedded classes) still need javadoc.

cmnbroad · 2018-10-03T19:15:57Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/vqsr/CNNScoreVariants.java

+        }
+
+        String getArgsAndModel;
+        if (weights != null && architecture != null) {


cmnbroad · 2018-10-03T19:23:08Z

...stitute/hellbender/tools/walkers/validation/EvaluateInfoFieldConcordanceIntegrationTest.java

+    }
+
+    @Test
+    public void test2Vcfs() throws Exception {


These two test methods are identical except for the test params. It would be preferable to use a DataProvider with truth/eval files, keys, and snp/indel mean/stdev and remove the redundancy.

Yup that is much nicer. Thanks!

lucidtronix · 2018-10-04T18:13:48Z

@cmnbroad back to you. The build.gradle conflict is from the genomicsdb and tensorflow version updates. I can resolve and rebase if it looks good to you.

cmnbroad · 2018-10-10T21:35:02Z

@lucidtronix Looks good now. Let's squash and rebase/resolve on master and run tests again, and we should be good to go.

lucidtronix · 2018-10-11T14:26:42Z

Squashed, rebased, and checks pass. @cmnbroad good to merge?

cmnbroad

Thanks @lucidtronix!

lucidtronix requested a review from cmnbroad September 10, 2018 18:24

droazen assigned cmnbroad Sep 10, 2018

cmnbroad requested changes Sep 19, 2018

cmnbroad requested changes Sep 26, 2018

cmnbroad reviewed Sep 26, 2018

cmnbroad requested changes Oct 3, 2018

cmnbroad mentioned this pull request Oct 5, 2018

Support Java/Python bidirectional data streaming #4316

Closed

cnn variant update models validate scores cleanup training

65d0edd

lucidtronix force-pushed the sf_validate branch from fb661e4 to 65d0edd Compare October 11, 2018 13:23

cmnbroad approved these changes Oct 11, 2018

cmnbroad merged commit 50dcd18 into master Oct 11, 2018

cmnbroad deleted the sf_validate branch October 12, 2018 14:32

EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018

CNN variant update models validate scores cleanup training (broadinst…

cffa438

…itute#5175)
