fix 4649 #4677

Merged: 1 commit merged into master from sh-sv-issues-4649 on Apr 26, 2018
Conversation

@SHuang-Broad (Contributor)
fix #4649

The cause of the exception is an edge case that was not anticipated when the reference-region segmenting logic was initially written.
It is now covered, with updated tests.

codecov-io commented on Apr 20, 2018

Codecov Report

Merging #4677 into master will increase coverage by 0.03%.
The diff coverage is 90.476%.

@@              Coverage Diff               @@
##              master     #4677      +/-   ##
==============================================
+ Coverage     79.894%   79.924%   +0.03%     
+ Complexity     17388     17348      -40     
==============================================
  Files           1080      1074       -6     
  Lines          63091     62942     -149     
  Branches       10180     10186       +6     
==============================================
- Hits           50406     50306     -100     
+ Misses          8701      8652      -49     
  Partials        3984      3984
Impacted Files Coverage Δ Complexity Δ
...ry/inference/CpxVariantInducingAssemblyContig.java 84.141% <100%> (+0.963%) 27 <3> (+3) ⬆️
...y/inference/CpxVariantCanonicalRepresentation.java 78.992% <71.429%> (+0.022%) 52 <0> (+2) ⬆️
...transforms/markduplicates/MarkDuplicatesSpark.java 90.909% <0%> (-4.213%) 9% <0%> (-6%)
...hellbender/tools/walkers/mutect/Mutect2Engine.java 87.654% <0%> (-3.39%) 50% <0%> (+1%)
...tools/spark/validation/CompareDuplicatesSpark.java 82.927% <0%> (-1.518%) 24% <0%> (ø)
...forms/markduplicates/MarkDuplicatesSparkUtils.java 89.5% <0%> (-1.083%) 58% <0%> (-9%)
...ellbender/tools/walkers/vqsr/CNNScoreVariants.java 74.057% <0%> (-0.829%) 40% <0%> (-1%)
...r/engine/filters/ReadGroupBlackListReadFilter.java 83.333% <0%> (-0.303%) 17% <0%> (ø)
...lkers/ReferenceConfidenceVariantContextMerger.java 94.979% <0%> (-0.278%) 69% <0%> (-2%)
...stitute/hellbender/tools/walkers/CombineGVCFs.java 93.75% <0%> (-0.25%) 60% <0%> (-2%)
... and 35 more

final SimpleInterval twoBaseSegment = new SimpleInterval(eventPrimaryChromosome, leftBoundary.getEnd(), rightBoundary.getStart());
if ( twoBaseBoundaries.contains(twoBaseSegment) ) {
    segments.add(twoBaseSegment);
}
@TedBrookings (Contributor) commented on Apr 25, 2018
Assuming I'm correct about the above being a typo, you might reduce the scope for this kind of error by refactoring thusly:

final int start = leftBoundary.getEnd();
final int end = rightBoundary.getStart();
final int segmentLength = end - start;
if (segmentLength >= 1) {
    final SimpleInterval newSegment = new SimpleInterval(eventPrimaryChromosome, start, end);
    // a 2-bp segment (segmentLength == 1) is only added when it was explicitly requested
    if (segmentLength > 1 || twoBaseBoundaries.contains(newSegment)) {
        segments.add(newSegment);
    }
}

@SHuang-Broad (Contributor, Author)
I see what you mean, but as I replied in the other thread, it's a "point" location, so it won't matter in this case.
And the suggestion you made is good, except that end is always >= start, because it is from the "left" (I know, so many mind-numbing details...)

@@ -168,8 +171,14 @@ static SimpleInterval getAffectedReferenceRegion(final List<SimpleInterval> even
// there shouldn't be a segment constructed if two segmenting locations are adjacent to each other on the reference
// this could happen when (in the simplest case), two alignments are separated by a mapped insertion (hence 3 total alignments),
// and the two alignments' ref spans are connected
// more generally: only segment when the two segmenting locations are boundaries of alignments that overlap each other (ref. span)
if (rightBoundary.getStart() - leftBoundary.getEnd() > 1) {
segments.add(new SimpleInterval(eventPrimaryChromosome, leftBoundary.getStart(), rightBoundary.getStart()));
@TedBrookings (Contributor)
Should the new SimpleInterval use leftBoundary.getEnd() ?

@SHuang-Broad (Contributor, Author)

It actually doesn't matter, because the boundaries are 1bp "point" locations.
I'll make it clear in the docs.

@SHuang-Broad (Contributor, Author)

also changed to getEnd()


final String chr = basicInfo.eventPrimaryChromosome;

final Set<SimpleInterval> result = new HashSet<>(contigWithFineTunedAlignments.getAlignments().size());
@TedBrookings (Contributor)

Should be 2 * contigWithFineTunedAlignments.getAlignments().size(), since you get up to two SimpleIntervals per alignment.

@SHuang-Broad (Contributor, Author)

Yep, done as suggested.
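
(For reference, the updated initialization presumably ends up looking something like the line below; this is inferred from the suggestion above rather than copied from the committed diff.)

// sized for up to two SimpleIntervals per alignment, per the review comment above
final Set<SimpleInterval> result = new HashSet<>(2 * contigWithFineTunedAlignments.getAlignments().size());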

@TedBrookings (Contributor) left a comment

I found one typo that must be fixed; however, it is quite simple to fix, so I'll just approve now.

The test code is nigh impenetrable (to me). Partly that's because I'm just looking at this part of the pipeline for the first time, partly it's because you're addressing an inherently complex problem, and partly it's all those "manually curated values". I have no idea if it's feasible or totally crazy, but it might be good to have an adversarial function that generates problems from known solutions rather than doing it by hand.

(SV) new cpx variant scenario commit:
  * optionally add reference segments when the would-be segment is 2-bp long (before, all such segments were skipped)
  * better exception message and fix bug in existing cpx variant test data
@SHuang-Broad (Contributor, Author)
It's always difficult to come up with test data for these complex events.

What I usually do is create test data based on what I know should work, including imaginable edge cases, and then accumulate more edge cases as we run into them.

Can you expand a little on what you mean by

to have an adversarial function that generates problems from known solutions rather than doing it by hand

@TedBrookings (Contributor)

Like I said, I don't know enough to say if this approach is feasible, but sometimes it's possible to test an algorithm with a problem-generating function. This works well when it's easy to generate self-consistent solutions and test data consistent with a given solution. I was able to do this for some of the interval overlap functions on the python side of my code. Other methods were too complicated (for me to figure out anyway).

Let's say you have an algorithm function f_algo that you want to test. You design a function test_data_factory that takes as input either a known solution, or a set of parameters from which it can easily generate a known solution (by a route that is logically distinct from the way f_algo works). Then test_data_factory generates and returns test_data. The flow looks like:

// get either random test solutions or some distinct set that tests particular edge cases
test_params = get_test_params();
test_solution = easy_solution_from_params(test_params);

test_data = test_data_factory(test_params, test_solution);
assert f_algo(test_data) == test_solution;

In some sense you've already done that, but you've personally stepped in as test_data_factory(), and I'm wondering if it would be possible to automate that to get rid of all those numbers in the test code.
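
A minimal Java sketch of this flow (Java 16+ for records), using a deliberately toy problem, merging overlapping intervals; every name here (Interval, randomDisjointIntervals, testDataFromSolution, mergeIntervals) is hypothetical and for illustration only, not GATK code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy illustration of the "generate test data from a known solution" flow sketched above.
// Everything here is hypothetical; none of these types or methods exist in GATK.
public class GeneratedSolutionTestSketch {

    // a 1-based closed interval, standing in for a SimpleInterval-like type
    record Interval(int start, int end) {}

    // easy_solution_from_params: a known answer, i.e. sorted, disjoint, non-adjacent intervals
    static List<Interval> randomDisjointIntervals(final Random rng) {
        final List<Interval> solution = new ArrayList<>();
        int position = 1;
        final int count = 1 + rng.nextInt(5);
        for (int i = 0; i < count; i++) {
            position += 2 + rng.nextInt(20);           // gap of at least 2 bases from the previous interval
            final int start = position;
            position += rng.nextInt(20);
            solution.add(new Interval(start, position));
        }
        return solution;
    }

    // test_data_factory: split each solution interval into two overlapping pieces
    static List<Interval> testDataFromSolution(final List<Interval> solution, final Random rng) {
        final List<Interval> data = new ArrayList<>();
        for (final Interval interval : solution) {
            final int cut = interval.start() + rng.nextInt(interval.end() - interval.start() + 1);
            data.add(new Interval(interval.start(), cut));  // the two pieces share the cut point,
            data.add(new Interval(cut, interval.end()));    // so they must merge back to the original
        }
        return data;
    }

    // f_algo: the (toy) algorithm under test, merging overlapping intervals
    static List<Interval> mergeIntervals(final List<Interval> input) {
        final List<Interval> sorted = new ArrayList<>(input);
        sorted.sort((a, b) -> Integer.compare(a.start(), b.start()));
        final List<Interval> merged = new ArrayList<>();
        for (final Interval interval : sorted) {
            if (!merged.isEmpty() && interval.start() <= merged.get(merged.size() - 1).end()) {
                final Interval last = merged.remove(merged.size() - 1);
                merged.add(new Interval(last.start(), Math.max(last.end(), interval.end())));
            } else {
                merged.add(interval);
            }
        }
        return merged;
    }

    public static void main(final String[] args) {
        final Random rng = new Random(42);
        for (int trial = 0; trial < 1000; trial++) {
            final List<Interval> solution = randomDisjointIntervals(rng);    // known answer
            final List<Interval> data = testDataFromSolution(solution, rng); // data derived from it
            if (!mergeIntervals(data).equals(solution)) {
                throw new AssertionError("mismatch for solution " + solution);
            }
        }
        System.out.println("all generated cases passed");
    }
}

The point is only the shape of the flow: the known solution is produced by a route that is logically distinct from the algorithm under test, and the factory turns it into input data. Whether something analogous is practical for the complex-variant segmenting tests is exactly the open question in this thread.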

@SHuang-Broad (Contributor, Author)

Hmm, I see what you mean.
It’s an interesting idea, and I think Valentin does that for some CIGAR-related tests, but I think it is difficult to apply in our case without having a “simulator”. I’ll keep this in mind.

@SHuang-Broad SHuang-Broad merged commit 9f6838b into master Apr 26, 2018
@SHuang-Broad SHuang-Broad deleted the sh-sv-issues-4649 branch April 26, 2018 22:34
cwhelan pushed a commit to cwhelan/gatk-linked-reads that referenced this pull request May 25, 2018
Successfully merging this pull request may close these issues.

Error in StructuralVariationDiscoveryPipelineSpark: UnhandledCaseSeen in CpxVariantInterpreter