
Bypass FeatureReader for GenomicsDBImport #7393

Merged · 4 commits · Oct 27, 2021

Conversation

@mlathara (Contributor) commented Aug 3, 2021

This PR adds an option to bypass the feature reader for GenomicsDBImport. In our testing this yields roughly a 10-15% speedup and uses about an order of magnitude less memory when the VCFs and GenomicsDB workspaces are both on local disk. We don't have extensive benchmarks of how this affects GenomicsDBImport in the cloud, but we'd be interested in exploring that (in conjunction with some of the recent changes for native cloud support).

cc: @droazen @lbergelson @ldgauthier

@droazen droazen self-requested a review August 16, 2021 16:38
@droazen droazen self-assigned this Aug 16, 2021
@droazen (Collaborator) left a comment

@mlathara A couple issues in the test code that need to be addressed before this can be merged -- back to you.

@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatches(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize);
}

@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatchesNativeReader(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize, true);
}
@droazen (Collaborator) commented:

testGenomicsDBImporterWithBatchSize() does not propagate the useNativeReader boolean correctly into writeToGenomicsDB()

@mlathara (Contributor, author) replied:

Fixed.
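For context on the fix: a flag like this is usually threaded through a delegating overload, so the old entry point keeps its default behavior while the new one forwards the boolean. The sketch below is a simplified, hypothetical stand-in for the GATK test helpers discussed above, not the actual PR code.

```java
// Simplified sketch of propagating a boolean through test-helper overloads.
// All names here are illustrative stand-ins, not GATK's real methods.
public class ReaderFlagPropagationSketch {
    static boolean lastUseNativeReader; // records what reached the "writer"

    // Stand-in for writeToGenomicsDB(...)
    static void writeToGenomicsDB(final int batchSize, final boolean useNativeReader) {
        lastUseNativeReader = useNativeReader;
    }

    // Old entry point keeps its behavior: FeatureReader path by default.
    static void testImporterWithBatchSize(final int batchSize) {
        testImporterWithBatchSize(batchSize, false);
    }

    // New overload must forward the flag, or the bug noted above reappears.
    static void testImporterWithBatchSize(final int batchSize, final boolean useNativeReader) {
        writeToGenomicsDB(batchSize, useNativeReader);
    }

    public static void main(String[] args) {
        testImporterWithBatchSize(4, true);
        System.out.println("native reader flag reached writer: " + lastUseNativeReader);
    }
}
```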

@Test
public void testGenomicsDBIncrementalAndBatchSize1WithNonAdjacentIntervalsNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, MULTIPLE_NON_ADJACENT_INTERVALS_THAT_WORK_WITH_COMBINE_GVCFS, workspace, 1, false, true, "", 0, true);
}
@droazen (Collaborator) commented:

testIncrementalImport() does not use the native reader for the first batch (i == 0) -- why is that?

@mlathara (Contributor, author) replied:

Ah, I'm not entirely sure anymore; I think I wanted to check that a given workspace could be imported into using both the feature reader and htslib. I refactored a bit to make that clearer, and added a test that does an all-htslib/native incremental import.

@@ -512,6 +530,9 @@ private void initializeHeaderAndSampleMappings() {
final List<VCFHeader> headers = new ArrayList<>(variantPaths.size());
for (final String variantPathString : variantPaths) {
    final Path variantPath = IOUtils.getPath(variantPathString);
    if (bypassFeatureReader) {
        assertVariantFileIsCompressedAndIndexed(variantPath);
@droazen (Collaborator) commented:

In your testing, did you find that these extra checks for whether the inputs are block-compressed and indexed added significantly to the runtime when dealing with remote files?

@mlathara (Contributor, author) replied:

We haven't done much remote testing -- just sanity tests to ensure that it works. In the small remote cases we've tried, the native reader is actually slower, but I haven't dug in to see where the bottleneck is (potentially tweaking buffer sizes, etc.). As mentioned in the PR description, that's something we were hoping to explore with the Broad.
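As background on what a check like assertVariantFileIsCompressedAndIndexed must establish: BGZF files are gzip streams whose FEXTRA field carries a 'BC' subfield, so block compression can be detected from the first few header bytes without decompressing. Below is a minimal, self-contained sketch of that magic-byte test; the helper name is hypothetical and the real GATK/htsjdk check also verifies that an index exists.

```java
public class BgzfHeaderSketch {
    /**
     * Returns true if the bytes look like the start of a BGZF block:
     * gzip magic, deflate method, FEXTRA flag set, and the 'BC' subfield
     * at the fixed offset where bgzip/htslib write it. (Illustrative sketch;
     * a spec-complete check would scan all extra subfields.)
     */
    public static boolean looksLikeBgzf(final byte[] h) {
        return h.length >= 14
                && (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b // gzip magic
                && (h[2] & 0xff) == 0x08                          // deflate method
                && (h[3] & 0x04) != 0                             // FEXTRA flag present
                && h[12] == 'B' && h[13] == 'C';                  // BGZF subfield id
    }

    public static void main(String[] args) {
        // First 16 bytes of a standard BGZF block header, as written by bgzip.
        final byte[] bgzf = {0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0,
                (byte) 0xff, 0x06, 0, 'B', 'C', 2, 0};
        // Plain gzip header: no FEXTRA, so not seekable block compression.
        final byte[] plainGzip = {0x1f, (byte) 0x8b, 0x08, 0x00};
        System.out.println(looksLikeBgzf(bgzf));      // true
        System.out.println(looksLikeBgzf(plainGzip)); // false
    }
}
```

This distinction matters here because a plain-gzipped VCF is not randomly accessible, so it cannot be indexed or consumed by the native reader path.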

@@ -348,6 +350,13 @@
optional = true)
private boolean sharedPosixFSOptimizations = false;

@Argument(fullName = BYPASS_FEATURE_READER,
doc = "Used htslib to read input VCFs instead of FeatureReader. This will reduce memory usage and potentially speed up " +
@droazen (Collaborator) commented:

Used -> Use
FeatureReader -> GATK's FeatureReader

@mlathara (Contributor, author) replied:

Done.

@mlathara (Contributor, author) commented:

@droazen Made some changes; I think the PR build failure is unrelated...?

@mlathara mlathara requested a review from droazen October 22, 2021 21:29
@droazen (Collaborator) left a comment

@mlathara Back to you with a few lingering issues in the test code

for (int i = 0; i < LOCAL_GVCFS.size(); i += stepSize) {
    int upper = Math.min(i + stepSize, LOCAL_GVCFS.size());
    writeToGenomicsDB(LOCAL_GVCFS.subList(i, upper), intervals, workspace, batchSize, false, 0, 1, false, false, i != 0,
-           chrsToPartitions, i != 0 && useNativeReader);
+           chrsToPartitions, useNativeReaderInitial && useNativeReader);
@droazen (Collaborator) commented:

Here useNativeReaderInitial and useNativeReader are doing the same thing -- the native reader will only be used if both are true, regardless of whether we're on the first batch or a later batch. I think the intent was for useNativeReaderInitial to control whether the native reader should be used for batch 0? In that case, we'd want something like:

(i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader)
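The suggested condition can be factored out as a standalone helper, sketched below for clarity (the helper name is illustrative, not code from the PR):

```java
public class NativeReaderPolicySketch {
    /**
     * Whether batch i of an incremental import should use the native (htslib)
     * reader: batch 0 is governed by useNativeReaderInitial, and every later
     * batch by useNativeReader. Mirrors the reviewer's suggested predicate.
     */
    public static boolean useNativeForBatch(final int i,
                                            final boolean useNativeReaderInitial,
                                            final boolean useNativeReader) {
        return (i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader);
    }

    public static void main(String[] args) {
        System.out.println(useNativeForBatch(0, true, false));  // true: initial flag governs batch 0
        System.out.println(useNativeForBatch(1, false, true));  // true: later batches use useNativeReader
        System.out.println(useNativeForBatch(0, false, true));  // false: batch 0 ignores useNativeReader
    }
}
```

Separating the two booleans this way lets tests exercise workspaces whose first batch was imported with one reader and later batches with the other.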

@mlathara (Contributor, author) replied:

Done.

@Test
public void testGenomicsDBBasicIncrementalAllNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, INTERVAL, workspace, 0, true, true, COMBINED_WITH_GENOTYPES, 0, false, true);
}
@droazen (Collaborator) commented:

Both the useNativeReader and useNativeReaderInitial booleans should be true here if this is testing the "all native reader" case

@mlathara (Contributor, author) replied:

Done.

@mlathara (Contributor, author) commented:

Done - sorry for the 😶‍🌫️ 🤦‍♂️

@mlathara mlathara requested a review from droazen October 26, 2021 18:23
@droazen (Collaborator) commented Oct 27, 2021:

Looks like the integration tests failed with an unrelated error -- I'll try re-running them.

@lbergelson (Member) commented:

rebase them first

@droazen (Collaborator) commented Oct 27, 2021:

The test failures in the branch build are clearly related to the recent Travis key migration. The PR build (which is the one we care about) passes, so this should be safe to merge.

@droazen droazen merged commit 00a4280 into master Oct 27, 2021
@droazen droazen deleted the ml_bypass_featurereader branch October 27, 2021 20:23