
Bypass FeatureReader for GenomicsDBImport #7393

Merged · 4 commits · Oct 27, 2021

Conversation

@mlathara (Contributor) commented Aug 3, 2021

This PR adds an option to bypass the feature reader for GenomicsDBImport. In our testing this yields roughly a 10-15% speedup and uses about an order of magnitude less memory when the VCFs and GenomicsDB workspaces are both on local disk. We don't have extensive benchmarks of how this affects GenomicsDBImport in the cloud, but we'd be interested in exploring that (in conjunction with some of the recent changes for native cloud support).

cc: @droazen @lbergelson @ldgauthier

@droazen droazen self-requested a review August 16, 2021 16:38
@droazen droazen self-assigned this Aug 16, 2021
@droazen (Collaborator) left a comment

@mlathara A couple issues in the test code that need to be addressed before this can be merged -- back to you.

@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatches(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize);
}

@Test(groups = {"bucket"}, dataProvider = "batchSizes")
public void testGenomicsDBImportGCSInputsInBatchesNativeReader(final int batchSize) throws IOException {
    testGenomicsDBImporterWithBatchSize(resolveLargeFilesAsCloudURIs(LOCAL_GVCFS), INTERVAL, COMBINED, batchSize, true);
}
@droazen (Collaborator) commented:

testGenomicsDBImporterWithBatchSize() does not propagate the useNativeReader boolean correctly into writeToGenomicsDB()

@mlathara (Contributor, author) replied:

Fixed.
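For context on the fix: a flag like this is usually threaded through a delegating overload, so the old entry point keeps its default behavior while the new one forwards the boolean. The sketch below is a simplified, hypothetical stand-in for the GATK test helpers discussed above, not the actual PR code.

```java
// Simplified sketch of propagating a boolean through test-helper overloads.
// All names here are illustrative stand-ins, not GATK's real methods.
public class ReaderFlagPropagationSketch {
    static boolean lastUseNativeReader; // records what reached the "writer"

    // Stand-in for writeToGenomicsDB(...)
    static void writeToGenomicsDB(final int batchSize, final boolean useNativeReader) {
        lastUseNativeReader = useNativeReader;
    }

    // Old entry point keeps its behavior: FeatureReader path by default.
    static void testImporterWithBatchSize(final int batchSize) {
        testImporterWithBatchSize(batchSize, false);
    }

    // New overload must forward the flag, or the bug noted above reappears.
    static void testImporterWithBatchSize(final int batchSize, final boolean useNativeReader) {
        writeToGenomicsDB(batchSize, useNativeReader);
    }

    public static void main(String[] args) {
        testImporterWithBatchSize(4, true);
        System.out.println("native reader flag reached writer: " + lastUseNativeReader);
    }
}
```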

@Test
public void testGenomicsDBIncrementalAndBatchSize1WithNonAdjacentIntervalsNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, MULTIPLE_NON_ADJACENT_INTERVALS_THAT_WORK_WITH_COMBINE_GVCFS, workspace, 1, false, true, "", 0, true);
}
@droazen (Collaborator) commented:

testIncrementalImport() does not use the native reader for the first batch (i == 0) -- why is that?

@mlathara (Contributor, author) replied:

Ah, I'm not entirely sure anymore; I think I wanted to check that a given workspace could be imported into using both the feature reader and htslib. I refactored a bit to make that clearer, and added a test that does an all-htslib/native incremental import.

@@ -512,6 +530,9 @@ private void initializeHeaderAndSampleMappings() {
final List<VCFHeader> headers = new ArrayList<>(variantPaths.size());
for (final String variantPathString : variantPaths) {
    final Path variantPath = IOUtils.getPath(variantPathString);
    if (bypassFeatureReader) {
        assertVariantFileIsCompressedAndIndexed(variantPath);
@droazen (Collaborator) commented:

In your testing, did you find that these extra checks for whether the inputs are block-compressed and indexed added significantly to the runtime when dealing with remote files?

@mlathara (Contributor, author) replied:

We haven't done much remote testing -- just sanity tests to ensure that it works. In the small remote cases we've tried, the native reader is actually slower, but I haven't dug in to see where the bottleneck is (potentially tweaking buffer sizes, etc.). As mentioned in the PR description, that's something we were hoping to explore with the Broad.
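As background on what a check like assertVariantFileIsCompressedAndIndexed must establish: BGZF files are gzip streams whose FEXTRA field carries a 'BC' subfield, so block compression can be detected from the first few header bytes without decompressing. Below is a minimal, self-contained sketch of that magic-byte test; the helper name is hypothetical and the real GATK/htsjdk check also verifies that an index exists.

```java
public class BgzfHeaderSketch {
    /**
     * Returns true if the bytes look like the start of a BGZF block:
     * gzip magic, deflate method, FEXTRA flag set, and the 'BC' subfield
     * at the fixed offset where bgzip/htslib write it. (Illustrative sketch;
     * a spec-complete check would scan all extra subfields.)
     */
    public static boolean looksLikeBgzf(final byte[] h) {
        return h.length >= 14
                && (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b // gzip magic
                && (h[2] & 0xff) == 0x08                          // deflate method
                && (h[3] & 0x04) != 0                             // FEXTRA flag present
                && h[12] == 'B' && h[13] == 'C';                  // BGZF subfield id
    }

    public static void main(String[] args) {
        // First 16 bytes of a standard BGZF block header, as written by bgzip.
        final byte[] bgzf = {0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0,
                (byte) 0xff, 0x06, 0, 'B', 'C', 2, 0};
        // Plain gzip header: no FEXTRA, so not seekable block compression.
        final byte[] plainGzip = {0x1f, (byte) 0x8b, 0x08, 0x00};
        System.out.println(looksLikeBgzf(bgzf));      // true
        System.out.println(looksLikeBgzf(plainGzip)); // false
    }
}
```

This distinction matters here because a plain-gzipped VCF is not randomly accessible, so it cannot be indexed or consumed by the native reader path.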

@@ -348,6 +350,13 @@
optional = true)
private boolean sharedPosixFSOptimizations = false;

@Argument(fullName = BYPASS_FEATURE_READER,
doc = "Used htslib to read input VCFs instead of FeatureReader. This will reduce memory usage and potentially speed up " +
@droazen (Collaborator) commented:

Used -> Use
FeatureReader -> GATK's FeatureReader

@mlathara (Contributor, author) replied:

Done.

@mlathara (Contributor, author) commented:

@droazen Made some changes; I think the PR build failure is unrelated...?

@mlathara mlathara requested a review from droazen October 22, 2021 21:29
@droazen (Collaborator) left a comment

@mlathara Back to you with a few lingering issues in the test code

for (int i = 0; i < LOCAL_GVCFS.size(); i += stepSize) {
    int upper = Math.min(i + stepSize, LOCAL_GVCFS.size());
    writeToGenomicsDB(LOCAL_GVCFS.subList(i, upper), intervals, workspace, batchSize, false, 0, 1, false, false, i != 0,
-           chrsToPartitions, i != 0 && useNativeReader);
+           chrsToPartitions, useNativeReaderInitial && useNativeReader);
@droazen (Collaborator) commented:

Here useNativeReaderInitial and useNativeReader are doing the same thing -- the native reader will only be used if both are true, regardless of whether we're on the first batch or a later batch. I think the intent was for useNativeReaderInitial to control whether the native reader should be used for batch 0? In that case, we'd want something like:

(i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader)
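The suggested condition can be factored out as a standalone helper, sketched below for clarity (the helper name is illustrative, not code from the PR):

```java
public class NativeReaderPolicySketch {
    /**
     * Whether batch i of an incremental import should use the native (htslib)
     * reader: batch 0 is governed by useNativeReaderInitial, and every later
     * batch by useNativeReader. Mirrors the reviewer's suggested predicate.
     */
    public static boolean useNativeForBatch(final int i,
                                            final boolean useNativeReaderInitial,
                                            final boolean useNativeReader) {
        return (i == 0 && useNativeReaderInitial) || (i > 0 && useNativeReader);
    }

    public static void main(String[] args) {
        System.out.println(useNativeForBatch(0, true, false));  // true: initial flag governs batch 0
        System.out.println(useNativeForBatch(1, false, true));  // true: later batches use useNativeReader
        System.out.println(useNativeForBatch(0, false, true));  // false: batch 0 ignores useNativeReader
    }
}
```

Separating the two booleans this way lets tests exercise workspaces whose first batch was imported with one reader and later batches with the other.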

@mlathara (Contributor, author) replied:

Done.

@Test
public void testGenomicsDBBasicIncrementalAllNativeReader() throws IOException {
    final String workspace = createTempDir("genomicsdb-incremental-tests").getAbsolutePath() + "/workspace";
    testIncrementalImport(2, INTERVAL, workspace, 0, true, true, COMBINED_WITH_GENOTYPES, 0, false, true);
}
@droazen (Collaborator) commented:

Both the useNativeReader and useNativeReaderInitial booleans should be true here if this is testing the "all native reader" case

@mlathara (Contributor, author) replied:

Done.

@mlathara (Contributor, author) commented:

Done - sorry for the 😶‍🌫️ 🤦‍♂️

@mlathara mlathara requested a review from droazen October 26, 2021 18:23
@droazen (Collaborator) commented Oct 27, 2021:

Looks like the integration tests failed with an unrelated error -- I'll try re-running them.

@lbergelson (Member) commented:

rebase them first

@droazen (Collaborator) commented Oct 27, 2021:

The test failures in the branch build are clearly related to the recent Travis key migration. The PR build (which is the one we care about) passes, so this should be safe to merge.

@droazen droazen merged commit 00a4280 into master Oct 27, 2021
@droazen droazen deleted the ml_bypass_featurereader branch October 27, 2021 20:23