Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor code around cloud support #1975

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Refactor code around cloud support #1975

wants to merge 2 commits into from

Conversation

takutosato
Copy link
Contributor

Description

As requested by @cmnbroad. The changes include:

  • Replace the String input type with IOPath type where appropriate (Rule: if a the String variable is a legal path to a file/directory, convert to PicardHTSPath as soon as possible. And the method argument for such an object should be IOPath.)
  • Remove all the legacy methods copied from GATK in PicardBucketUtils that takes String where it should take IOPath.
  • Replace PicardHtsPath type with the interface IOPath where appropriate (makes clear which methods can and should be moved to htsjdk).
  • Remove Hadoop support, which is a remnant of code ported from GATK.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@takutosato
Copy link
Contributor Author

@cmnbroad will you review?

Copy link
Contributor

@cmnbroad cmnbroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tsato The PicardHtsPath to IOPath changes seem to be mostly ok (although there are other places where the change can be made), but there are a few other problems that I didn't even notice before, the biggest one being that we need to reconcile GCloudTestUtils with PicardBucketUtils. (I didn't realize, for instance, that GCloudTestUtils has support for a custom staging area specified via the environment, which PicardBucketUtils completely ignores.) I know thats probably a bigger issue than you were trying to solve here - we should address that in a separate PR (again, I'm happy to do that if you want), but I really think we need to resolve these issues before we do any more cloudification.

@@ -182,7 +182,7 @@ public static PicardHtsPath replaceExtension(final IOPath path, final String new
/**
* Wrapper for Path.resolve()
*/
public static PicardHtsPath resolve(final PicardHtsPath absPath, final String relativePath){
public static PicardHtsPath resolve(final IOPath absPath, final String relativePath){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return value for this should also be IOPath, unless there is some reason that would cause problems. (Ideally this would be a method on IOPath, but since it has to return a new object of the subclass type, it can't live there, unless we add some kind of abstract "toSelf" method that each subclass has to implement. For another day).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and I see your point. Another day sounds good.

return scheme != null
&& (scheme.equals(GOOGLE_CLOUD_STORAGE_FILESYSTEM_SCHEME)
|| scheme.equals(HTTP_FILESYSTEM_PROVIDER_SCHEME)
|| scheme.equals(HTTPS_FILESYSTEM_PROVIDER_SCHEME));
}

public static boolean isLocalPath(final IOPath path){
return path.getScheme().equals(FILE_SCHEME);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should really be an instance method on PicardHtsPath for now (ultimately, really it should be on IOPath - I may create a PR to add it to the IOPath interface in htsjdk).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to happen in IOPath, because all the objects that would call the instance method isLocalPath have type IOPath, not PicardHtsPath. If you could add it to htsjdk, that would be great.

PicardIOUtils.deleteOnExit(PicardHtsPath.replaceExtension(path, ".md5", true).toPath());
return path;
} else {
throw new PicardException("Unsupported cloud filesystem: " + directory.getURIString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isSupportedCloudFileSystem currently throws (I think it should not - see my comments there). If you keep this code, I suggest it to IllegalArgumentException (I'm still not sure what the intended use of PicardException is in this codebase, but IllegalArgumentException would seem more correct to me), and changing isSupportedFileSystem to just return a boolean and never throw.

@@ -61,7 +62,7 @@ default Path getReferencePath(){
*
* @return The reference provided by the user, if any, or the default, if any, as a PicardHtsPath. May be null.
*/
default PicardHtsPath getHtsPath(){
default IOPath getHtsPath(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this method should probably change to reflect the new return type , i.e., getIOPath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private static boolean isHadoopUrl(String path) {
return path.startsWith(HDFS_SCHEME + "://");
public static boolean isSupportedCloudFilesystem(final IOPath path){
ValidationUtils.validateArg(! isLocalPath(path), "isSupportedCloudFilesystem should be called on a cloud path but was given: " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually think this method should be retained (or adds much value), for several reasons.

First, enumerating all of the supported protocol schemes may become difficult as we add additional cloud providers (I think azure, for example, uses multiple schemes). Also, what schemes will work may depend on what additional providers the user has on their classpath.

Having said that, if you're going to keep it, it should be moved to PicardHtsPath, and a test added, with test cases at least for file, http, hdfs, s3, and gcs schemes.

Finally, for a predicate method like this, where you're just testing the protocol scheme, it should be pure and not have a side effect like throwing an exception (with the possible exception of requiring that the input is non-null, since the code can't recover from that). I don't see any reason to require the caller to first test if the path has a cloud scheme to prevent it from throwing. It should just return a boolean if the scheme is a supported cloud scheme, and let the caller decide when to throw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing sounds good

public static final String GCLOUD_PICARD_STAGING_DIRECTORY = "gs://hellbender-test-logs/staging/picard/";
public static final String GCLOUD_PICARD_STAGING_DIRECTORY_STR = "gs://hellbender-test-logs/staging/picard/";
public static final PicardHtsPath GCLOUD_PICARD_STAGING_DIRECTORY = new PicardHtsPath(GCLOUD_PICARD_STAGING_DIRECTORY_STR);


// slashes omitted since hdfs paths seem to only have 1 slash which would be weirder to include than no slashes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git blame seems to implicate me on this comment, but I have no clue what it means ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a remnant from when we copied over code from GATK. Removed.

@@ -289,7 +290,7 @@ final Path makeTempDictionary(final Path inputFasta, final String dictNamePrefix
final PicardHtsPath HG19_MINI = PicardHtsPath.resolve(GCloudTestUtils.TEST_INPUTS_DEFAULT_GCLOUD, "picard/references/hg19mini.fasta");
final PicardHtsPath HG19_MINI_LOCAL = new PicardHtsPath("testdata/picard/reference/hg19mini.fasta");

final PicardHtsPath CLOUD_OUTPUT_DIR = PicardHtsPath.resolve(GCloudTestUtils.TEST_STAGING_DEFAULT_GCLOUD, "picard/");
final PicardHtsPath CLOUD_OUTPUT_DIR = PicardHtsPath.resolve(GCloudTestUtils.TEST_STAGING_DEFAULT_GCLOUD, "picard/CreateSequenceDictionary/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be consistent with the types of these test constants. Elsewhere in this PR, you changed some of them to have the IOPath type, which is good. But most of these remaining PicardHtsPath constants can also be changed (of course there may be exceptions, such as if you need to subsequently call a method that is only on to PicardHtsPath, but we should try to minimize the number of those).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I changed them to IOPath

return new PicardHtsPath(PicardIOUtils.createTempFileInDirectory(prefix, extension, new File(directory)));
return new PicardHtsPath(PicardIOUtils.createTempFileInDirectory(prefix, extension, directory.toPath().toFile()));
} else {
if (isSupportedCloudFilesystem(directory)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right way to branch here is to invert the logic and test for a file protocol scheme. See my comments on isSupportedCloudFileSystem.

ValidationUtils.validateArg(relativePath.endsWith("/"), "relativePath must end in backslash '/': " + relativePath);

return PicardHtsPath.fromPath(PicardBucketUtils.randomRemotePath(GCLOUD_PICARD_STAGING_DIRECTORY + relativePath, "", "/"));
return PicardBucketUtils.randomRemotePath(PicardHtsPath.resolve(GCLOUD_PICARD_STAGING_DIRECTORY, relativePath), "", "/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every direct reference to GCLOUD_PICARD_STAGING_DIRECTORY like this is bypassing the GCloudUtils methods such as getTestStaging that check the environment to determine if the user has specified an override staging area to use.

public static final String FILE_SCHEME = "file";

// This Picard test staging bucket has a TTL of 180 days (DeleteAction with Age = 180)
public static final String GCLOUD_PICARD_STAGING_DIRECTORY = "gs://hellbender-test-logs/staging/picard/";
public static final String GCLOUD_PICARD_STAGING_DIRECTORY_STR = "gs://hellbender-test-logs/staging/picard/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These really need to be reconciled with the ones in GCloudTestUtils, where there is a redundant definition of the staging area root gs://hellbender-test-logs/staging/. There really should be only one definition, and all direct usages of this constant should be replaced with calls to the methods in GCloudTestUtils that interrogate where the user wants the staging area to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I think it should be handled in a separate PR, if that's alright with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants