Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNNVariant Update models, validate scores, cleanup training #5175

Merged
merged 1 commit into from
Oct 11, 2018

Conversation

lucidtronix
Copy link
Contributor

Code cleanup and technical debt payback, plus updated models.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took a while to get to this - did a first pass on everything except the CNNVariantTrain changes, but I had some questions, and the concordance test fails (not sure why that doesn't show up as a red x on the PR...), I think its just that the test file name is misspelled in the test.

@@ -43,9 +43,8 @@ dependencies:
- scikit-learn==0.19.1
- scipy==1.0.0
- six==1.11.0
- $tensorFlowDependency
- tensorflow-tensorboard==0.4.0rc3
- tensorflow==1.9.0
Copy link
Collaborator

@cmnbroad cmnbroad Sep 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as we still have the build script generating two conda yml files, we should keep the $tensorFlowDependency reference in here. Also, the version referenced in the yml doesn't match the version in build.gradle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

import org.broadinstitute.hellbender.utils.tsv.TableWriter;
import org.broadinstitute.hellbender.utils.tsv.TableReader;

public class InfoConcordanceRecord {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whole class needs javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

return writer;
}
catch (IOException e) {
throw new UserException(String.format("Encountered an IO exception while reading from %s.", outputTable), e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is writing, not reading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

programGroup=VariantEvaluationProgramGroup.class)
@DocumentedFeature
@BetaFeature
public class VcfInfoConcordance extends AbstractConcordanceWalker {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sufficiently different from the Concordance tool to be a separate tool ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tool name should probably have a verb. Maybe EvaluateInfoFieldConcordance or EvaluateSiteConcordance ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, and yes I think this is different enough from concordance which evaluates a caller, because this tool is intended for filter scores or annotations.

@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
protected File summary;
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)
protected String evalInfoKey = "CNN_2D";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the existing GATKVCFConstant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

scoreScan.useDelimiter("\\n");
writeVCFHeader(vcfWriter);
} catch (IOException e) {
throw new GATKException("Error when trying to write annotated VCF.", e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be the score file - might be worth having a nested try/catch for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the Scanner is never closed anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems createVCFWriter wont throw IOExceptions, so I just fixed the error message. Scanner is now closed.

logger.info("Done scoring variants with CNN.");
if ( vcfWriter != null ) {
vcfWriter.close();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should test the Scanner for null and also close that if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -486,5 +495,20 @@ private void setArchitectureAndWeightsFromResources() {
}
}

private void startPythonSession(){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to something more specific - maybe one suggestion would be initializePythonArgsAndModel.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a lot simple if all of the code that figures out the values for weights and architecture was in one place, i.e., all of the if blocks in setArchitectureAndWeightsFromResources and startPythonSession together in one place. The case where you use Python None would require special treatment, but it would be much easier to read if the logic were consolidated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed and consolidated the fxns.

pythonExecutor.sendSynchronousCommand(String.format("tempFile = open('%s', 'w+')" + NL, scoreFile.getAbsolutePath()));
pythonExecutor.sendSynchronousCommand("import vqsr_cnn" + NL);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the above import vqsr_cnn into the python code like you did with the keras import.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not easily, because the executePythonCommand calls vqsr_cnn.score_and_write_batch so this python session needs the vqsr_cnn module loaded

@@ -92,15 +92,15 @@
* -weights path/to/my_weights.hd5
* </pre>
*/
@BetaFeature
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done only superficial review of the python code behind this. Has anyone else ever reviewed it ? We should talk about how to get that done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbabadi did a review too, but I would be happy for someone to take closer look if there's a volunteer.

@lucidtronix
Copy link
Contributor Author

@cmnbroad thanks for the review back to you!

@codecov-io
Copy link

codecov-io commented Sep 22, 2018

Codecov Report

Merging #5175 into master will decrease coverage by 0.01%.
The diff coverage is 74.89%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #5175      +/-   ##
============================================
- Coverage     86.77%   86.75%   -0.02%     
- Complexity    29910    29935      +25     
============================================
  Files          1835     1838       +3     
  Lines        138574   138742     +168     
  Branches      15255    15276      +21     
============================================
+ Hits         120250   120369     +119     
- Misses        12772    12808      +36     
- Partials       5552     5565      +13
Impacted Files Coverage Δ Complexity Δ
...nder/tools/walkers/vqsr/FilterVariantTranches.java 92.24% <ø> (ø) 42 <0> (ø) ⬇️
...der/tools/walkers/vqsr/CNNVariantWriteTensors.java 85.71% <100%> (+2.38%) 4 <0> (ø) ⬇️
...hellbender/tools/walkers/vqsr/CNNVariantTrain.java 60% <46.66%> (-20.65%) 4 <0> (ø)
...lkers/validation/EvaluateInfoFieldConcordance.java 72.58% <72.58%> (ø) 14 <14> (?)
...ellbender/tools/walkers/vqsr/CNNScoreVariants.java 73.68% <77.14%> (-1.32%) 41 <17> (+1)
...ools/walkers/validation/InfoConcordanceRecord.java 93.93% <93.93%> (ø) 8 <8> (?)
...n/EvaluateInfoFieldConcordanceIntegrationTest.java 96% <96%> (ø) 3 <3> (?)
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce669d1...65d0edd. Read the comment docs.

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2.

import org.broadinstitute.hellbender.utils.variant.GATKVCFConstants;
import picard.cmdline.programgroups.VariantEvaluationProgramGroup;

@CommandLineProgramProperties(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool needs javadoc, including a usage example.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc here, and the summary lines below, should reflect/mention that this handles one key at a time (currently, the doc uses plural "keys" so it sounds like it can evaluate more than one at a time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc added, we can use two different keys from 1 VCF or 1 or 2 keys for 2 different vcfs. Clarified in the doc.

static final String USAGE_SUMMARY = "This tool evaluates info fields from an input VCF against a VCF that has been validated and is considered to represent ground truth.\n";
public static final String SUMMARY_LONG_NAME = "summary";
public static final String SUMMARY_SHORT_NAME = "S";
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fullName can use the constant above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

public static final String SUMMARY_SHORT_NAME = "S";
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
protected String summary;
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nit - can you put a blank line between each command line arg definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME)
protected String summary;
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true)
protected String evalInfoKey = GATKVCFConstants.CNN_2D_KEY;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are going to be optional, and have this default, the doc should specify that that it uses this key, and where it comes from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made them required, I think that is less error-prone.


private void infoDifference(VariantContext eval, VariantContext truth) {
double evalVal = Double.valueOf((String)eval.getAttribute(this.evalInfoKey));
double truthVal = Double.valueOf((String)truth.getAttribute(this.truthInfoKey));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should handle the case where the key isn't present, otherwise it will throw a null pointer exception.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool should inspect both truth and eval headers and throw, or at least warn/log, if the header doesn't contain an info header line for the respective keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


import htsjdk.variant.variantcontext.VariantContext;

import java.io.File;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

if (Math.abs(delta) > this.epsilon) {
this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString()));
this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString()));
this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space before " Eval".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

final double mean;
final double std;

public InfoConcordanceRecord(VariantContext.Type type, String evalKey, String trueKey, double mean, double std) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public methods and classes need javadoc - it should be trivial to add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

}

String getArgsAndModel;
if (weights != null && architecture != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my comment in the last review round, I was thinking more along the lines of collapsing the logic to make it easier to see whats going on, ie.

        if (!(tensorType.equals(TensorType.read_tensor) || tensorType.equals(TensorType.reference))) {
            throw new GATKException("No default architecture for tensor type:" + tensorType.name());
        }
        if (architecture == null) {
            architecture = tensorType.equals(TensorType.read_tensor) ?
                    IOUtils.writeTempResourceFromPath(resourcePathReadTensor, null).getAbsolutePath() :
                    IOUtils.writeTempResourceFromPath(resourcePathReferenceTensor, null).getAbsolutePath();
        }
        if (weights == null) {
            weights = tensorType.equals(TensorType.read_tensor) ?
                    IOUtils.writeTempResourceFromPath(
                            resourcePathReadTensor.replace(".json", ".hd5"),
                            null).getAbsolutePath() :
                    IOUtils.writeTempResourceFromPath(
                            resourcePathReferenceTensor.replace(".json", ".hd5"), null).getAbsolutePath();
        }
    // single logger call...
    // single start_session_get_args_and_model call...

There would always be some initial value for weights, even if the user provides an architecture but not weights, so there could be a single call to the python code with everything, instead of varying the number of args. It would be much cleaner.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what is the behavior if the user provides arch/weights that are out of sync with each other ? I still think it might make more sense to have the input just be a folder name, so it matches the output from CNNVariantTrain, maybe with optional per-file arch/weight overrides.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but can we save this for the PEP8 PR that is on-deck? It will require several python changes to the python code as well as updates to the WDLs, etc so I would rather only refactor that stuff once.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -5,6 +5,7 @@

TENSOR_MAPS_2D = ['read_tensor']
TENSOR_MAPS_1D = ['reference']
# noinspection PyInterpreter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about why this is necessary, or if its accidental ? this applies to the next line right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accident, I don't know.

final Path summary = createTempPath("summary", ".txt");
final ArgumentsBuilder argsBuilder = new ArgumentsBuilder();
argsBuilder.addArgument(AbstractConcordanceWalker.EVAL_VARIANTS_SHORT_NAME, inputVcf)
.addArgument(AbstractConcordanceWalker.TRUTH_VARIANTS_LONG_NAME, inputVcf)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also meant to suggest adding at least one test that uses a different file for truth than eval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@lucidtronix
Copy link
Contributor Author

@cmnbroad thanks for the review. I think addressed all comments except the arguments/weights simplification, which I would prefer to save for the PEP8 refactor we discussed. Back to you!

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there - a few remaining cleanup requests and a bug.

switch (concordanceState) {
case TRUE_POSITIVE: {
snpCount++;
indelCount++;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is counting every variant as a snp and an indel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man that's bad. Thanks for the catch, fixed.

this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString()));
this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString()));
this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString()));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally don't leave debugging code enabled. If its just for (your) debugging, I'd either remove it, or else make it conditional on an @Advanced or @Hidden arg (and probably make epsilon @Advanced or @Hidden as well).

this.std = std;
}

public VariantContext.Type getVariantType() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the public methods and (embedded classes) still need javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

}

String getArgsAndModel;
if (weights != null && architecture != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

}

@Test
public void test2Vcfs() throws Exception {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two test methods are identical except for the test params. It would be preferable to use a DataProvider with truth/eval files, keys, and snp/indel mean/stdev and remove the redundancy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup that is much nicer. Thanks!

@lucidtronix
Copy link
Contributor Author

@cmnbroad back to you. The build.gradle conflict is from the genomicsdb and tensorflow version updates. I can resolve and rebase if it looks good to you.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Oct 10, 2018

@lucidtronix Looks good now - lets squash and rebase/resolve on master and run tests again and should be good to go.

@lucidtronix
Copy link
Contributor Author

Squashed rebased and checks pass, @cmnbroad good to merge?

Copy link
Collaborator

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lucidtronix!

@cmnbroad cmnbroad merged commit 50dcd18 into master Oct 11, 2018
@cmnbroad cmnbroad deleted the sf_validate branch October 12, 2018 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants