Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced IntervalsSkipList with OverlapDetector #4154

Merged
merged 2 commits into from
Feb 8, 2019
Merged

Conversation

davidbenjamin
Copy link
Contributor

Closes #3541. Closes #3608.

@lbergelson May I assign you by virtue of your github activity on this and related issues?

Note that OverlapDetector::getOverlaps returns an unsorted Set so I sorted its output in a few places. I don't know if this was the right solution.

@droazen
Copy link
Collaborator

droazen commented Jan 16, 2018

@davidbenjamin Have you actually profiled this change on a Spark/dataproc cluster?

@lbergelson
Copy link
Member

@droazen Adam did profiling that showed that overlaps detector was strictly better. We also had some suspicion that there was some bug lurking in the skip list implementation because of weird non-deterministic results of something that relied on it.

@droazen
Copy link
Collaborator

droazen commented Jan 16, 2018

@lbergelson Adam did the profiling? That must have been years ago. Someone definitely needs to profile this change on a Spark cluster using the latest master to see if his ancient findings still hold up.

@davidbenjamin
Copy link
Contributor Author

@lbergelson @droazen I can do it but I would need a lot of hand-holding for both the profiling and the Spark.

@droazen
Copy link
Collaborator

droazen commented Jan 16, 2018

We're happy to help -- I think it's worth doing even if this is a no-brainer substitution, as it's easy to break Spark functionality on a cluster by silly things like introducing a non-serializable class in the wrong place.

@davidbenjamin
Copy link
Contributor Author

@droazen When we left off you showed me how to use JProfiler (which I now use regularly) and had pointed me to instructions for running GATK on a Spark cluster. I realize I'm going to need another round of hand-holding, because I'm not sure which Spark cluster to test on, and I'm not sure which command to test. And beyond measuring total runtime and looking for any new hotspots in JProfiler I don't know what else to measure. Could I get some more help from someone on the engine team?

@droazen droazen assigned jamesemery and unassigned lbergelson Jan 11, 2019
@droazen droazen removed the request for review from lbergelson January 11, 2019 17:06
@droazen
Copy link
Collaborator

droazen commented Jan 11, 2019

@jamesemery Can you make a decision on what to do with this one after we merge #5292 ?

@codecov-io
Copy link

codecov-io commented Jan 14, 2019

Codecov Report

Merging #4154 into master will increase coverage by 0.006%.
The diff coverage is n/a.

@@               Coverage Diff               @@
##              master     #4154       +/-   ##
===============================================
+ Coverage     87.037%   87.043%   +0.006%     
+ Complexity     31728     31682       -46     
===============================================
  Files           1943      1938        -5     
  Lines         146193    146014      -179     
  Branches       16141     16117       -24     
===============================================
- Hits          127242    127095      -147     
+ Misses         13064     13035       -29     
+ Partials        5887      5884        -3
Impacted Files Coverage Δ Complexity Δ
...der/engine/spark/datasources/ReadsSparkSource.java 61.333% <ø> (ø) 17 <0> (ø) ⬇️
...nder/utils/runtime/StreamingProcessController.java 67.773% <0%> (+0.474%) 33% <0%> (ø) ⬇️

@lbergelson
Copy link
Member

This is easily resolvable without performance profiling since the only remaining use of the IntervalSkipList was in ContextShard which is now also unused / untested. Removing context shard and rebasing.

@lbergelson
Copy link
Member

@davidbenjamin I've stolen this branch from you. I hope you don't mind.

@droazen droazen assigned lbergelson and unassigned jamesemery Feb 8, 2019
@davidbenjamin
Copy link
Contributor Author

@lbergelson No objection. It's in better hands than mine now.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve this PR...

@lbergelson lbergelson merged commit a84f466 into master Feb 8, 2019
@lbergelson lbergelson deleted the db_issue_3608 branch February 8, 2019 23:06
@lbergelson
Copy link
Member

@davidbenjamin Thank you for this one!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants