
Replace Hadoop-BAM with Disq #5138

Merged: 4 commits merged into master from tw_disq on Dec 4, 2018
Conversation

@tomwhite (Contributor)

Preview of the changes needed to use Disq.

@tomwhite (Contributor, Author)

There are currently two failing tests, both of which need fixes in htsjdk.

@jamesemery (Collaborator)

@tomwhite After spending some time hunting for this option for my own testing, I think it would be helpful to expose the NIO adapter toggle directly on the command line in this branch.
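
For illustration, a minimal sketch of what such a toggle might look like as a Barclay argument on a Spark tool; the flag name and field below are hypothetical, not something present in this branch:

```java
import org.broadinstitute.barclay.argparser.Argument;

public final class DisqAdapterArgs {
    // Hypothetical command-line toggle: switches Disq between the Hadoop filesystem
    // adapter and the java.nio.file-based adapter when reading inputs.
    @Argument(fullName = "use-nio",
              doc = "If true, use the NIO file adapter instead of the Hadoop filesystem adapter.",
              optional = true)
    public boolean useNio = false;
}
```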

@tomwhite force-pushed the tw_disq branch 2 times, most recently from 0be0976 to e7afd79, on September 10, 2018 10:50
@tomwhite changed the title from "Disq (DO NOT MERGE)" to "Replace Hadoop-BAM with Disq" on Sep 18, 2018
@tomwhite (Contributor, Author)

Htsjdk and the Disq snapshot have been updated and now the previously failing tests are passing.

@codecov-io commented Oct 8, 2018

Codecov Report

Merging #5138 into master will decrease coverage by 13.875%.
The diff coverage is 77.992%.

@@               Coverage Diff                @@
##              master     #5138        +/-   ##
================================================
- Coverage     86.988%   73.113%   -13.875%     
+ Complexity     31224     24503      -6721     
================================================
  Files           1914      1813       -101     
  Lines         144264    134638      -9626     
  Branches       15956     14915      -1041     
================================================
- Hits          125492     98438     -27054     
- Misses         13003     31489     +18486     
+ Partials        5769      4711      -1058
Impacted Files Coverage Δ Complexity Δ
...der/tools/walkers/CombineGVCFsIntegrationTest.java 0.873% <ø> (-86.574%) 2 <0> (-22)
...er/tools/walkers/GenotypeGVCFsIntegrationTest.java 4.878% <ø> (-76.014%) 2 <0> (-23)
...ers/annotator/VariantAnnotatorIntegrationTest.java 2.484% <ø> (-95.031%) 2 <0> (-53)
.../hellbender/utils/variant/writers/HomRefBlock.java 88.158% <0%> (-5.085%) 35 <0> (-1)
...ellbender/utils/iterators/PushPullTransformer.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...dinstitute/hellbender/tools/spark/PileupSpark.java 0% <0%> (-98.148%) 0 <0> (-15)
...der/tools/spark/ApplyBQSRSparkIntegrationTest.java 8.451% <0%> (-66.197%) 2 <0> (-7)
...der/tools/spark/CreateHadoopBamSplittingIndex.java 0% <0%> (-88.462%) 0 <0> (-14)
...bender/tools/spark/PileupSparkIntegrationTest.java 2.041% <0%> (-97.959%) 2 <0> (-13)
.../CreateHadoopBamSplittingIndexIntegrationTest.java 4.688% <0%> (-84.375%) 1 <0> (-12)
... and 875 more

@jamesemery (Collaborator) left a comment

Generally speaking, the changes are good and clear up some of the clutter in our Spark methods. Unfortunately there are some unrelated changes bundled into this branch that make it difficult to evaluate properly. I will take a closer look at the changes made in the GVCF code soon.

build.gradle (outdated review thread, resolved)

// The underlying reads are required to be in SAMRecord format in order to be
// written out, so we convert them to SAMRecord explicitly here. If they're already
// SAMRecords, this will effectively be a no-op. The SAMRecords will be headerless
// for efficient serialization.
// TODO: add header here
Collaborator:

TODO

Contributor (Author):

Done
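
For context on the exchange above, here is a rough sketch of the conversion the quoted comment describes, assuming GATKRead#convertToSAMRecord and a subsequent header strip for serialization; the wrapper class and method are illustrative, not the PR's actual code:

```java
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMRecord;
import org.apache.spark.api.java.JavaRDD;
import org.broadinstitute.hellbender.utils.read.GATKRead;

final class HeaderlessSamRecordSketch {
    // Convert GATKReads to SAMRecords (attaching the header so coordinates resolve),
    // then drop the per-record header reference so Spark does not serialize a copy of
    // the header with every read.
    static JavaRDD<SAMRecord> toHeaderlessSamRecords(final JavaRDD<GATKRead> reads,
                                                     final SAMFileHeader header) {
        return reads.map(read -> {
            final SAMRecord samRecord = read.convertToSAMRecord(header);
            samRecord.setHeader(null); // headerless for efficient serialization
            return samRecord;
        });
    }
}
```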

final JavaSparkContext ctx, final String outputFile, final String referenceFile, final JavaRDD<SAMRecord> reads,
final SAMFileHeader header, final int numReducers, final WriteOption... writeOptions) throws IOException {

final JavaRDD<SAMRecord> sortedReads = sortSamRecordsToMatchHeader(reads, header, numReducers);
Collaborator:

Did this not break tests? There are tests in the codebase right now that should be, and are, broken by forcing a sort on every sharded output; #4874 ran into this problem.

Collaborator:

Additionally, something should be written to stdout before and after this stage: when we added the sort we found that this step produced an opaque final Spark stage that appeared to emit no output.

A line like logger.info("Finished sorting the bam file and dumping read shards to disk, proceeding to merge the shards into a single file using the master thread"); from the old incarnation of this method should be preserved in the appropriate case.

Though on second thought this may or may not be necessary, given that we stay within Spark for the next several stages...

Contributor (Author):

No tests have been removed, and they all pass. Do you think there is another case that needs covering?

We can add logging to Disq for some of these operations.
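
To make the logging suggestion concrete, here is a small sketch of the kind of bracketing the reviewer describes, assuming a log4j Logger as used elsewhere in GATK; because Spark transformations are lazy, the useful place for the second message is after the action that actually triggers the sort-and-write job. The helper below is illustrative only, not the PR's code:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

final class WriteLoggingSketch {
    private static final Logger logger = LogManager.getLogger(WriteLoggingSketch.class);

    // Run the sort-and-write step between two log messages so the final Spark stage
    // is no longer opaque in the driver log.
    static void runWithLogging(final Runnable sortAndWrite) {
        logger.info("Sorting reads to match the header and writing the output file");
        sortAndWrite.run();
        logger.info("Finished sorting and writing the output file");
    }
}
```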

}
else {
if (null == referenceName) {
static String checkCramReference(final JavaSparkContext ctx, final String filePath, final String referencePath) {
Collaborator:

This needs a more informative javadoc that enumerates which cases are accepted and which aren't.

Contributor (Author):

Done
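
As an illustration of the kind of javadoc being asked for; the wording below is hypothetical, and the accepted/rejected cases are placeholders rather than a description of the committed behavior (beyond the general fact that CRAM inputs need a reference):

```java
/**
 * Checks that a usable reference has been provided when the input is a CRAM file.
 *
 * Accepted: non-CRAM inputs (the reference, if any, is not checked here), and CRAM
 * inputs accompanied by a reference that the Spark executors can actually read.
 * Rejected: CRAM inputs with no reference, or with a reference in a form that cannot
 * be used for CRAM decoding.
 *
 * @param ctx           the Spark context
 * @param filePath      path of the reads file being read
 * @param referencePath path of the reference; may be null for non-CRAM inputs
 * @return the reference path to use (assumed; the real method's contract may differ)
 */
```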

if (outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION) || outputFile.endsWith(IOUtil.BCF_FILE_EXTENSION + ".gz")) {
throw new UserException.UnimplementedFeature("It is currently not possible to write a BCF file on spark. See https://github.com/broadinstitute/gatk/issues/4303 for more details .");
}

if (outputFile.endsWith(BGZFCodec.DEFAULT_EXTENSION) || outputFile.endsWith(".gz")) {
Collaborator:

Are we no longer handling vcf.gz extension files?

@@ -65,16 +59,13 @@ public VariantsSparkSource(JavaSparkContext ctx) {
* @return JavaRDD<VariantContext> of variants from all files.
*/
public JavaRDD<VariantContext> getParallelVariantContexts(final String vcf, final List<SimpleInterval> intervals) {
Configuration conf = new Configuration();
conf.setStrings("io.compression.codecs", BGZFEnhancedGzipCodec.class.getCanonicalName(),
Collaborator:

These codecs? Where did they go?

Contributor (Author):

Handled by Disq.
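
For context, a rough sketch of what reading variants through Disq might look like, assuming Disq's HtsjdkVariantsRddStorage API; class and method names here reflect my understanding of the Disq project and should be treated as an assumption rather than the PR's exact code:

```java
import htsjdk.variant.variantcontext.VariantContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.disq_bio.disq.HtsjdkVariantsRddStorage;

final class DisqVariantsReadSketch {
    // Disq detects BGZF/gzip compression from the files themselves, which is why the
    // explicit io.compression.codecs Hadoop configuration is no longer needed here.
    static JavaRDD<VariantContext> readVariants(final JavaSparkContext ctx, final String vcf)
            throws java.io.IOException {
        return HtsjdkVariantsRddStorage.makeDefault(ctx)
                .read(vcf)
                .getVariants();
    }
}
```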

@@ -152,7 +153,7 @@ private void assertSingleShardedWritingWorks(String inputBam, String referenceFi

// check that a splitting bai file is created
if (IOUtils.isBamFileName(outputPath)) {
Assert.assertTrue(Files.exists(IOUtils.getPath(outputPath + SplittingBAMIndexer.OUTPUT_FILE_EXTENSION)));
//Assert.assertTrue(Files.exists(IOUtils.getPath(outputPath + SBIIndex.FILE_EXTENSION)));
Collaborator:

Commented-out code should probably be removed.

Contributor (Author):

I opened disq-bio/disq#45 to do this. It shouldn't block this being merged though.


import java.util.*;

public class GVCFBlockMergingIterator extends PushToPullIterator<VariantContext> {
Collaborator:

I would add a comment here making explicit that you are reusing PushToPullIterator, whose comments talk about downsampling and which indeed appears to have that hardwired into its behavior. Alternatively, make the commenting on PushToPullIterator more generic (referring to downsampling only as an example) to avoid confusion.

Contributor (Author):

I've reworked the javadoc in #5311 to address this.
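
For illustration, a more generic class-level comment along the lines the reviewer suggests might read something like the following (hypothetical wording; the actual javadoc was reworked in #5311):

```java
/**
 * An iterator that adapts a push-style transformer to a pull-style interface: items from
 * an underlying iterator are pushed into the transformer, and finalized items are then
 * pulled out on demand. Downsampling is one example of such a transformer; GVCF block
 * merging is another. The transformer is not assumed to be a downsampler.
 */
```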

@jamesemery (Collaborator) left a comment

We should pull the GVCFBlockMergingIterator refactor out into another PR so it can be given its own review.

import static htsjdk.variant.vcf.VCFConstants.MAX_GENOTYPE_QUAL;
import static org.broadinstitute.hellbender.utils.variant.writers.GVCFWriter.GVCF_BLOCK;

public final class GVCFBlockCombiner implements PushPullTransformer<VariantContext> {
Collaborator:

Could you actually spin all of the push/pull iterator code into a separate branch? It deserves a separate PR from the rest of the code in this branch.

Collaborator:

Agreed - the push/pull stuff needs to be reviewed separately.

@tomwhite (Contributor, Author)

Opened #5311 for the push/pull part.

@jamesemery (Collaborator) left a comment

These changes look good to me.

…e-sorted, but it was not.

Used 'sort -k1,1 -s' to sort by QNAME field.
@lbergelson (Member)

Yay!

@lbergelson deleted the tw_disq branch on December 4, 2018 15:45