Skip to content
This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

Initial cli #8

Closed
wants to merge 2 commits into from
Closed

Conversation

jeromekelleher
Copy link
Contributor

Closes #7

This is an initial outline of the conversion CLI.

I haven't written any testing code yet, as I wanted to see if people agreed with the basic organisation first (and ran out of time!).

This adds a plink2sg CLI that can be run either via python -m sgkit_plink plink2sg or via plink2sg, if the package has been installed and the user's PATH set up accordingly. There's a bit of indirection needed to do this, but I think it's worth it as it's always useful to be able to use the python -m version. I haven't made python -m sgkit_plink call plink2sg directly because I think we'll inevitably need to have another method, sg2plink, which does the conversion in the other direction. There's some extra stuff in there that will facilitate testing, also, and having it broken up into little pieces makes things easier to mock out.

What do we think of this organisation? Do we like the names?

@jeromekelleher jeromekelleher marked this pull request as draft July 19, 2020 20:06
@tomwhite
Copy link
Collaborator

The naming looks fine to me, although plink2sgkit might be slightly preferable.

@jeromekelleher
Copy link
Contributor Author

The naming looks fine to me, although plink2sgkit might be slightly preferable.

Sure, I can see the advantage there. I'm happy to follow the consensus on this one.

# format used. We probably want to have the option of
# writing to a ZipFile anyway, so that we don't have
# gazillions of files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, here's an example that sets custom numcodec filters for Zarr: https://github.com/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/lib/io/core.py#L60. I can see that becoming another thing that we do often between read_* and ds.to_zarr in addition to using options for zipping files and maybe rechunking.



def get_sgkit_plink_parser():
top_parser = argparse.ArgumentParser(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any strong opinions on using argparse vs Click or Fire?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, haven't got strong opinions. I haven't tried either of these. How about I get something basic up and running with argparse, and we can weigh up the pros and cons of external packages based on this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also tiangolo/typer which is not likely to be our solution but is worth knowing about. I like it because it's from the same author as FastAPI so you get a consistent experience across CLI and web Interfaces.

@eric-czech
Copy link
Collaborator

Looks great to me @jeromekelleher. I think this will be a good place to build up some of the necessary options or hard-coded behaviors that should live between calls to create the readers (like read_plink) and .to_zarr.

@hammer
Copy link

hammer commented Jul 20, 2020

Thanks for the quick work on this @jeromekelleher!

Before we land a CLI tool in this repo, I just wanted to step back and think about the CLI interaction we want for sgkit users.

I tend to prefer tools that have a single point of entry for CLI interactions such as git or conda. Is there a way we could have a single sgkit CLI tool that could have subcommands for each file format?

@jeromekelleher
Copy link
Contributor Author

I tend to prefer tools that have a single point of entry for CLI interactions such as git or conda. Is there a way we could have a single sgkit CLI tool that could have subcommands for each file format?

Yes, good call, let's think this through. For the core sgkit command line ideally I'd like to see something like

sgkit info <sgk dataset>        - high level details about a sgkit zarr dataset. Size, compression etc.
sgkit ls <sgk_dataset>          - List the arrays in the dataset (with -l, show more details)
sgkit rechunk  <sgk_dataset>  - recompute chunk sizes
sgkit import-plink  <input data> <output sgk dataset>  - Convert input data into sgkit format. 
sgkit export-plink <sgk_dataset> <output_file>
sgkit import-vcf  <input data> <output sgk dataset>  - Convert input data into sgkit format. 
sgkit export-vcf <sgk_dataset> <output_file>

So, how could we do sgkit import-x format? I guess if each of the known file format packages registered themselves with sgkit (kinda like plugins), and registration requires a import function and an export function. We could even try having a sgkit import command which tried each of the known formats using a sniffer function defined by the format plugin. The tricky bit, I guess, would be defining the UIs for the different file formats, as we'll need format specific options for each one. I'm sure this could be done, though.

I think I like this approach better, good idea @hammer! It also makes the requirements for one of the file format plugins quite clear, with a well defined interface. Hopefully that keeps the format plugins relatively simple, and concerned only with reading and writing the format in question.

What do others think?

I think I should probably finish the initial version of this CLI just to get something working, and so we have a good idea of what the format specific arguments are going to look like. It'll be easy enough to pull it out afterwards.

@hammer
Copy link

hammer commented Jul 20, 2020

We could even try having a sgkit import command which tried each of the known formats

Yes! I love IO libraries that infer file formats and their properties for me. For example, inferring the presence of a header and delimiter used in a CSV file.

I can also imagine this interface is where most users will discover they need a separate package to read/write a particular file format, and we could guide them through the installation here.

@jeromekelleher
Copy link
Contributor Author

Yes, I like this plugin idea a lot actually, it seems to be a good compromise between supporting lots of formats but not letting the main sgkit package get all sprawly and horrible.

Any thoughts on the CLI structure here @tomwhite, @eric-czech, @ravwojdyla?

@hammer
Copy link

hammer commented Jul 20, 2020

It's a minor comment, but from a plugin perspective, I think it's going to be easier for the commands to be something like sgkit vcf import/sgkit vcf export versus sgkit import-vcf/sgkit export-vcf, but that's to be determined.

I've filed https://github.com/pystatgen/sgkit/issues/53 to track the design and implementation of a unified CLI that can accommodate plugins.

@tomwhite
Copy link
Collaborator

Any thoughts on the CLI structure here?

I like the subcommand structure, implemented using interfaces and plugins.

@jeromekelleher
Copy link
Contributor Author

OK, thanks. Rather than spend more time here then, I might just make a start on what this would look like in the main repo. There'll be some rough edges, but it's good to make a start. I'll leave this one open for now anyway.

@jeromekelleher
Copy link
Contributor Author

Closing in favour of pystatgen/sgkit#57

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add conversion CLI
4 participants