Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for hdfs and gcs URI's to be passed to GenomicsDB #5197

Merged
merged 30 commits into from
Oct 17, 2018

Conversation

droazen
Copy link
Collaborator

@droazen droazen commented Sep 18, 2018

This PR is a finalized version of #5017. I've copied the branch into our repo so that travis will run cloud tests on it.

Currently, only Posix filesystem paths can be passed as workspaces and arrays to GenomicsDB via GenomicsDBImport and SelectVariants. This PR will allow for hdfs and gcs (and emrfs/s3) URIs to be supported as well.
Examples

./gatk GenomicsDBImport -V /vcfs/sample.vcf.gz --genomicsdb-workspace-path hdfs://master:9000/gdb_ws  -L 1:500-10000
./gatk GenomicsDBImport -V /vcfs/sample.vcf.gz --genomicsdb-workspace-path gs://my_bucket/gdb_ws  -L 1:500-10000
  ./gatk SelectVariants -V gendb.hdfs://master:9000/gdb_ws -R hs37d5.fa -O out.vcf
 ./gatk SelectVariants -V gendb.gs://my_bucket/gdb_ws -R hs37d5.fa -O out.vcf

GenomicsDB supports GCS via the Cloud Storage Connector. Set environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the GCS Service Account json file.

@droazen
Copy link
Collaborator Author

droazen commented Sep 19, 2018

@nalinigans There are some test failures in the GenomicsDB cloud tests. See: https://storage.googleapis.com/hellbender-test-logs/build_reports/master_22153.1/tests/test/index.html

We'll have to fix these before we can merge.

@nalinigans
Copy link
Contributor

@droazen, will put some debug print statements in the two tests that are failing while authenticating with GCS and issue another pull request to nalinigans_genomicsdb_uri_support branch. Hope that is OK. Thanks.

@nalinigans
Copy link
Contributor

nalinigans commented Sep 19, 2018

@droazen, we have a free GCS account, so it is possible that Hadoop requires extra configuration for authenticating/connecting with the HELLBENDER travis service account. Can anyone help here? This the code we have for connecting to GCS via Hadoop.

hdfsFS gcs_connect(struct hdfsBuilder *builder, const std::string& working_dir) {
  char *gcs_creds = getenv("GOOGLE_APPLICATION_CREDENTIALS");
  if (gcs_creds) {
    value = parse_json(gcs_creds, "project_id"); // free value after hdfsBuilderConnect as it is shallow copied.
    if (value) {
      hdfsBuilderConfSetStr(builder, "google.cloud.auth.service.account.enable", "true");
      hdfsBuilderConfSetStr(builder, "google.cloud.auth.service.account.json.keyfile", gcs_creds);
      hdfsBuilderConfSetStr(builder, "fs.gs.project.id", value);
    }
  }

  if (working_dir.empty()) {
    hdfsBuilderConfSetStr(builder, "fs.gs.working.dir", "/");
  } else {
    hdfsBuilderConfSetStr(builder, "fs.gs.working.dir", working_dir.c_str());
  }

  // Default buffer sizes are huge in the GCS connector. GenomicsDB reads/writes in smaller chunks,
  // so the buffer size can be made a little smaller.
  hdfsBuilderConfSetStr(builder, "fs.gs.io.buffersize.write", "262144");

  hdfsFS hdfs_handle = hdfsBuilderConnect(builder);
  free(value);
  return hdfs_handle;
}

This is the error from Travis logs-

Running Test: Test method testWriteToAndQueryFromGCS(org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImportIntegrationTest)
hdfsBuilderConnect(forceNewInstance=1, nn=gs://hellbender-test-logs, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token
        at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:210)
        at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:75)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1826)
  Caused by: com.google.api.client.http.HttpResponseException: 404 Not Found
{"error":"invalid_request","error_description":"Service account not enabled on this instance"}
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1072)
        at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:159)
        at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:493)
        at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:208)
        ... 77 more
[GenomicsDB::StorageManagerConfig] Error: Error getting hdfs connection: Connection refused.

@codecov-io
Copy link

codecov-io commented Sep 24, 2018

Codecov Report

Merging #5197 into master will decrease coverage by 0.288%.
The diff coverage is 80.838%.

@@               Coverage Diff               @@
##              master     #5197       +/-   ##
===============================================
- Coverage     86.785%   86.497%   -0.288%     
- Complexity     30025     30053       +28     
===============================================
  Files           1838      1841        +3     
  Lines         139112    139748      +636     
  Branches       15340     15476      +136     
===============================================
+ Hits          120729    120878      +149     
- Misses         12806     13279      +473     
- Partials        5577      5591       +14
Impacted Files Coverage Δ Complexity Δ
...titute/hellbender/engine/FeatureInputUnitTest.java 90.526% <ø> (ø) 23 <0> (ø) ⬇️
...e/hellbender/engine/FeatureDataSourceUnitTest.java 91.022% <ø> (ø) 41 <0> (ø) ⬇️
...e/hellbender/tools/genomicsdb/GenomicsDBUtils.java 86.047% <100%> (+1.956%) 8 <0> (-2) ⬇️
...broadinstitute/hellbender/engine/FeatureInput.java 94.366% <100%> (ø) 18 <3> (ø) ⬇️
...institute/hellbender/utils/io/IOUtilsUnitTest.java 81.731% <100%> (+2.92%) 45 <3> (+3) ⬆️
...nstitute/hellbender/utils/gcs/BucketUtilsTest.java 56.303% <38.889%> (-3.103%) 13 <1> (+1)
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 86.682% <47.826%> (-2.388%) 77 <3> (ø)
...institute/hellbender/engine/FeatureDataSource.java 77.444% <71.429%> (+1.015%) 42 <3> (ø) ⬇️
.../hellbender/tools/genomicsdb/GenomicsDBImport.java 79.087% <82.353%> (+2.421%) 52 <3> (ø) ⬇️
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 80.488% <90.909%> (-0.032%) 42 <2> (+2)
... and 16 more

@nalinigans
Copy link
Contributor

nalinigans commented Oct 5, 2018

@droazen , I got a genomicsdb.jar from @kgururaj and just tried out the GenomicsDB cloud tests. The call stack that I got from my test run in nalini_new_genomicsdb_jar branch mentions that we do need the fs.gs.project.id hadoop configuration set. The google service json I use for our internal testing has this key, but the Hellbender service json does not.

Any ideas on how to get this key for the tests? Would this value be HELLBENDER_TEST_PROJECT? How is it being made available to the spark cloud tests for example? I do see it being configured in src/main/java/org/broadinstitute/hellbender/engine/spark/SparkContextFactory.java.

hdfsBuilderConnect(forceNewInstance=1, nn=gs://hellbender-test-logs, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: Must supply a value for configuration setting: **fs.gs.project.id**
	at com.google.cloud.hadoop.util.ConfigurationUtil.getMandatoryConfig(ConfigurationUtil.java:39)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createOptionsBuilderFromConfig(GoogleHadoopFileSystemBase.java:2185)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1832)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1013)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:976)

…tute/hellbender/engine/FeatureDataSource.java, src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBUtils.java and src/test/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImportIntegrationTest.java
@nalinigans nalinigans self-assigned this Oct 14, 2018
@nalinigans nalinigans assigned droazen and unassigned droazen Oct 14, 2018
@nalinigans
Copy link
Contributor

PR is ready to be merged. The 2 GCS tests in GenomicsDBImportIntegrationTest.java are commented out, but they have been tested with the HELLBENDER test project and GOOGLE_APPLICATION_CREDENTIALS in the nalini_new_genomicsdb_jar branch and will be uncommented as soon as a new GenomicsDB jar is released.

@lbergelson
Copy link
Member

#5305

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exciting! This should simplify our workflows and hopefully speed things up a bit too.

Two minor comments.

@droazen
Copy link
Collaborator Author

droazen commented Oct 15, 2018

@lbergelson is going to try regenerating our service account key and see if that allows us to uncomment the GCS tests in this branch.

@nalinigans
Copy link
Contributor

nalinigans commented Oct 17, 2018

@lbergelson, thanks for regenerating the service account key. All tests are enabled and pass now.
@droazen, please feel free to merge with master.

Copy link
Collaborator Author

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- I'll merge once Travis finishes running on the final version.

@droazen
Copy link
Collaborator Author

droazen commented Oct 17, 2018

@nalinigans Tests passed, but it's worth pointing out that the new testWriteToAndQueryFromGCS() test took 9 minutes to complete, which seems very slow. Possible performance issue?

@droazen droazen merged commit 63d8640 into master Oct 17, 2018
@droazen droazen deleted the nalinigans_genomicsdb_uri_support branch October 17, 2018 19:37
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…te#5197)

* Allow for hdfs and gcs URI's to be passed to GenomicsDB

* Push the URI processing for GenomicsDB to IOUtils
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants