Command-line interface (CLI) with plugins #53

Closed
hammer opened this issue Jul 20, 2020 · 6 comments
Labels
CLI Issues related to the command-line interface IO Issues related to reading and writing common third-party file formats multi-repo Issues related to having multiple repos

Comments

@hammer
Contributor

hammer commented Jul 20, 2020

The discussion in sgkit-dev/sgkit-plink#8 made it clear it would be nice to have a CLI with plugins to support IO and other operations.

@hammer hammer added the IO Issues related to reading and writing common third-party file formats label Jul 20, 2020
@hammer
Contributor Author

hammer commented Jul 20, 2020

click-contrib/click-plugins looks to be one way to accomplish this task.
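
click-plugins discovers commands through setuptools entry points. The sketch below illustrates the same discovery idea using only the standard library (`importlib.metadata` and `argparse`) rather than click itself. The group name `sgkit.cli_plugins` and the convention that each plugin is a callable configuring its own subparser are assumptions made for illustration, not anything sgkit defines.

```python
import argparse
from importlib.metadata import entry_points

# Hypothetical entry-point group name; sgkit does not define this.
PLUGIN_GROUP = "sgkit.cli_plugins"


def discover_plugins(group=PLUGIN_GROUP):
    """Return {name: loaded object} for every entry point in the group."""
    eps = entry_points()
    # Newer Python versions expose .select(); older ones return a dict of lists.
    if hasattr(eps, "select"):
        found = eps.select(group=group)
    else:
        found = eps.get(group, [])
    return {ep.name: ep.load() for ep in found}


def build_parser(plugins):
    """Build a CLI where each plugin registers its own subcommand."""
    parser = argparse.ArgumentParser(prog="sgkit")
    subparsers = parser.add_subparsers(dest="command")
    for name, register in plugins.items():
        # Assumption: each plugin is a callable that configures a subparser.
        register(subparsers.add_parser(name))
    return parser
```

A separately installed package (for example a PLINK reader) would then advertise its callable under that group in its packaging metadata, and `build_parser(discover_plugins())` would pick it up without the core CLI importing it directly.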

@hammer hammer added the multi-repo Issues related to having multiple repos label Jul 20, 2020
@hammer
Contributor Author

hammer commented Jul 20, 2020

Also, although it's not necessary, if we are going to invest in the CLI as an interface, we may want to use a library like willmcgugan/rich to make effective use of the capabilities of the shell/terminal/console.

@jeromekelleher
Collaborator

jeromekelleher commented Jul 20, 2020

Re plugins, I think we should make the interface more tightly defined than plugging in some script to the CLI. Each format plugin should provide:

  1. a function to convert an input path in that format into an sgkit dataset
  2. a function to convert an sgkit dataset into that format and write to a path
  3. a function that "sniffs" a path to see if it's a valid file in that format.
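
One way to picture the three-function interface listed above is a small record type plus a registry. This is only a sketch: `FormatPlugin`, `REGISTRY`, `register`, and `detect_format` are hypothetical names, not part of sgkit. The sniffer example checks the two leading magic bytes of a PLINK `.bed` file.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class FormatPlugin:
    """Hypothetical container for the three functions a format provides."""
    name: str
    read: Callable[..., Any]      # (path, **kwargs) -> dataset
    write: Callable[..., None]    # (dataset, path, **kwargs) -> None
    sniff: Callable[[str], bool]  # path -> True if the file looks valid


REGISTRY: Dict[str, FormatPlugin] = {}


def register(plugin: FormatPlugin) -> None:
    """Make a format available to the CLI under its name."""
    REGISTRY[plugin.name] = plugin


def detect_format(path: str) -> Optional[FormatPlugin]:
    """Return the first registered plugin whose sniffer accepts the path."""
    for plugin in REGISTRY.values():
        if plugin.sniff(path):
            return plugin
    return None


def sniff_plink_bed(path: str) -> bool:
    """Check for the two magic bytes that start a PLINK .bed file."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x6c\x1b"
```

With something like this in place, a command such as `sgkit import` could call `detect_format(path)` when the user does not name the format explicitly.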

The only extra thing we'd need apart from these functions in order to make a functioning CLI like sgkit import would be some way of pulling out the documentation for the extra kwargs on the functions (1) and (2), so that we can push this into the CLI help text.
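
As a rough sketch of pulling kwarg documentation into the CLI help text, the standard `inspect` module can read a function's signature and docstring. The `name: description` docstring convention, the helper `add_kwargs_as_options`, and the reader `read_example` are all assumptions made for illustration. A real implementation would more likely parse NumPy-style docstrings and would need special handling for boolean flags.

```python
import argparse
import inspect


def add_kwargs_as_options(parser, func, skip=("path",)):
    """Expose keyword arguments with defaults as --options on the parser."""
    doc = inspect.getdoc(func) or ""
    for name, param in inspect.signature(func).parameters.items():
        if name in skip or param.default is inspect.Parameter.empty:
            continue
        # Use the first docstring line of the form "name: ..." as help text.
        help_text = next(
            (line.strip() for line in doc.splitlines()
             if line.strip().startswith(name + ":")),
            None,
        )
        parser.add_argument(
            "--" + name.replace("_", "-"),
            default=param.default,
            type=type(param.default) if param.default is not None else str,
            help=help_text,
        )


def read_example(path, sep="\t", chunk_size=1000):
    """Hypothetical reader used only to illustrate the idea.

    sep: field separator in the input file
    chunk_size: number of variants per chunk
    """
    return {"path": path, "sep": sep, "chunk_size": chunk_size}
```

Calling `add_kwargs_as_options(parser, read_example)` adds `--sep` and `--chunk-size` to the parser, each with the matching docstring line as its help text.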

@hammer hammer added the CLI Issues related to the command-line interface label Jul 21, 2020
@eric-czech
Collaborator

re PLINK conversion CLI:

While these are out of scope for what a simple CLI would likely want to support, I wanted to share the edge cases that came up in importing UKB PLINK for future reference. This could be good "first issue" work in future CLI enhancements:

  • Many of the fam fields are irrelevant, so options to choose a projection for conversion would be helpful
  • Sample IDs happen to be ints rather than the usual strings, so it may also be helpful to pair projected fields with a preferred dtype
  • The files are split by contig and there is a little bit of extra merging logic necessary to make that work
    • Namely, our contig index will be 0 for all files so that has to be overwritten
  • It would be helpful to choose remote destinations for output Zarr (GCS, S3, etc.)
  • Compression and filter options would also be good parameters, though hopefully we can standardize those enough that it may not be necessary
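
To make the projection and dtype points concrete, here is a minimal standard-library sketch that parses `.fam` lines, keeps only the requested fields, and casts each to a chosen type. The column order follows the PLINK `.fam` layout. The field names and the `read_fam` helper are hypothetical and do not reflect how sgkit-plink reads these files.

```python
# Column order of a PLINK .fam file; the field names are illustrative labels.
FAM_FIELDS = (
    "family_id", "sample_id", "paternal_id", "maternal_id", "sex", "phenotype",
)


def read_fam(lines, projection):
    """Parse .fam lines, keeping only projected fields cast to their dtypes.

    projection maps a field name to a callable such as int or str.
    """
    index = {name: i for i, name in enumerate(FAM_FIELDS)}
    unknown = set(projection) - set(index)
    if unknown:
        raise ValueError(f"Unknown .fam fields: {sorted(unknown)}")
    columns = {name: [] for name in projection}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        for name, cast in projection.items():
            columns[name].append(cast(parts[index[name]]))
    return columns
```

For integer sample IDs, a call such as `read_fam(handle, {"sample_id": int, "sex": int})` would drop the other four columns and store the IDs as ints.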

My preamble before Zarr writes looks like this:

import os.path as osp
import dask
import gcsfs
import zarr
from dask.diagnostics import ProgressBar

# ds, data_dir, contig, contig_name, gcs and logger are assumed to be defined earlier
path = osp.join(data_dir, f'ukb_chr{contig_name}.zarr')
store = gcsfs.GCSMap(path, gcs=gcs, check=False, create=True)
compressor = zarr.Blosc(cname='zstd', clevel=3, shuffle=2)
encoding = {v: {'compressor': compressor} for v in ds}
logger.info(f'Writing dataset for contig {contig} to {path}')
with dask.config.set(scheduler='processes'), ProgressBar():
    ds.to_zarr(store=store, mode='w', consolidated=True, encoding=encoding)

which could be fairly easy to parameterize.
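
As a sketch of that parameterization, the hard-coded values in the preamble (codec name, compression level, shuffle mode, scheduler, output location) could become command-line options. The command name `convert`, the option names, and the helper `compressor_kwargs` are assumptions for illustration. Only the default values come from the snippet above.

```python
import argparse


def build_write_parser():
    """Expose the Zarr write settings from the preamble as CLI options."""
    parser = argparse.ArgumentParser(prog="convert")  # hypothetical command name
    parser.add_argument("output", help="destination path or remote URL for the Zarr store")
    parser.add_argument("--cname", default="zstd", help="Blosc codec name")
    parser.add_argument("--clevel", type=int, default=3, help="Blosc compression level")
    parser.add_argument(
        "--shuffle", type=int, default=2, choices=[0, 1, 2],
        help="Blosc shuffle mode: 0 none, 1 byte shuffle, 2 bit shuffle",
    )
    parser.add_argument(
        "--scheduler", default="processes",
        choices=["threads", "processes", "synchronous"],
        help="dask scheduler used for the write",
    )
    return parser


def compressor_kwargs(args):
    """Collect the keyword arguments intended for the Blosc compressor."""
    return {"cname": args.cname, "clevel": args.clevel, "shuffle": args.shuffle}
```

The preamble could then build its compressor with `zarr.Blosc(**compressor_kwargs(args))` and pass `args.scheduler` to `dask.config.set`.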

@jeromekelleher
Collaborator

Resurrecting this, I think a CLI to do basic conversion and dataset inspection/summary would be really handy.

Is anyone dead against this idea, or should I create a batch of issues to track it?

@jeromekelleher
Collaborator

We discussed this on the dev call today, and the consensus was that a basic CLI to do inspection and data format conversion would be useful. Things have moved on since this issue, so I'm going to close this and open some new ones.

3 participants