
Monorepo vs. multi-repo #65

Closed
ravwojdyla opened this issue Jul 24, 2020 · 20 comments
Labels
multi-repo (Issues related to having multiple repos), process + tools

Comments

@ravwojdyla
Collaborator

ravwojdyla commented Jul 24, 2020

We have a solid style-check setup in the main repo, but the IO repos have a slightly older setup, and it might be hard to keep them in sync; we also have repos in RS with similar setups. I wonder if we should explore some kind of centralisation of this style-check setup (akin to a common plugin in sbt or a root parent POM in Maven). The best I could find for pre-commit is this; we could give it a try. One downside I can see is that it might not work well with the mypy check (but that already doesn't work well even in a solo repo; related: https://github.com/pystatgen/sgkit/issues/39)

An alternative solution to this problem would be to have a single repo with proper separation of modules to prevent dependency hell.

@ravwojdyla ravwojdyla added the process + tools and multi-repo (Issues related to having multiple repos) labels Jul 24, 2020
@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 30, 2020

It came up during our weekly sync, so I wanted to document it: we could use extras_require to separate dependencies, and even provide a convenient way to install a complete sgkit, e.g. sgkit[complete].
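As a rough sketch of how that could look (the extra names and the dependency lists below are illustrative assumptions, not a decided layout), `setup.py` could declare per-feature dependency lists in `extras_require`, plus a `complete` extra that is their union:

```python
# Hypothetical extras layout for sgkit's setup.py.
# The extra names ("plink", "bgen") and the packages listed under
# them are assumptions for illustration, not the agreed design.
EXTRAS = {
    "plink": ["pysnptools"],
    "bgen": ["bgen-reader"],
}

# "complete" is the union of every optional dependency, so users
# can opt into everything at once.
EXTRAS["complete"] = sorted({dep for deps in EXTRAS.values() for dep in deps})

# Passing extras_require=EXTRAS to setuptools.setup(...) would then
# allow `pip install sgkit[plink]` or `pip install sgkit[complete]`.
```

With this, `pip install sgkit` would install only the core dependencies, while `pip install 'sgkit[complete]'` would pull in every optional IO dependency as well.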

@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 31, 2020

@jeromekelleher @alimanfoo @tomwhite @eric-czech @hammer any thoughts/objections/-1/+1 on merging the IO repos into the main repo and using extras_require to separate dependencies? If you prefer to discuss this during our ~weekly meetings, this can obviously wait until the next one.

@eric-czech
Collaborator

eric-czech commented Jul 31, 2020

I'm a +1 to that, as long as it's always easy to install the IO-related packages in a separate environment from all the other sgkit packages (without needing to share common dependencies). Something like this seems reasonable to me:

conda activate base
conda install sgkit[core]
conda activate io
conda install sgkit[plink,bgen]

I think making CI run everything in a common environment will work for now, but I don't think that would be a good long-term solution. How hard is it to split the build up like that?

@ravwojdyla
Collaborator Author

@eric-czech I believe (we should double-check) that installing any extra from extras_require will always install the root package(s). I'm not sure right now how idiomatic it is to not install the root package for an extra, and I'm not sure that's a good idea at the moment, BUT if we do want that, I guess a workaround could be to have a dummy root package sgkit_parent, with everything else in extra packages like core, plink, bgen. Regarding splitting CI, I agree we should keep a single env for now, but later we can split things up (it shouldn't be too difficult).

@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 31, 2020

And to provide some further reasoning for this:

  • it would lower the overhead of our style-check, doc, test setup, etc.
  • it could increase the discoverability of features, since the full codebase would be in a single repo and would be installed together

Regarding the 2nd point, to be clear: we could likely have a "core" package that brings in all packages, some of which would require extra dependencies; lack of those would trigger an ImportError with an informative message saying you need to install the extra package (e.g. sgkit[blah] for a specific blah feature), akin to what dask does:

> pip install dask # but NOT dask[bag]
> python
# dask.bag code/package is present (discoverable), but its dependencies are not installed
>>> import dask.bag
ImportError: No module named cloudpickle

Dask bag requirements are not installed.

Please either conda or pip install as follows:

  conda install dask               # either conda install
  pip install dask[bag] --upgrade  # or pip install
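A minimal version of that guard (the helper name `import_optional` and the `sgkit[plink]` extra below are hypothetical, not an agreed API) could look like:

```python
import importlib


def import_optional(module, extra):
    """Import `module`, or raise an ImportError telling the user
    which sgkit extra to install (dask-style friendly message)."""
    try:
        return importlib.import_module(module)
    except ImportError as err:
        raise ImportError(
            f"{module} requirements are not installed.\n"
            f"Please pip install as follows:\n"
            f"  pip install 'sgkit[{extra}]' --upgrade"
        ) from err


# e.g. at the top of a hypothetical sgkit.io.plink module:
# pysnptools = import_optional("pysnptools", "plink")
```

This keeps the optional code importable and discoverable while deferring the dependency error to first use, with a message that names the exact extra to install.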

@eric-czech
Collaborator

There's a long past discussion on this, btw, starting around here: https://github.com/pystatgen/sgkit/issues/2#issuecomment-646488636.

I definitely agree with @jeromekelleher here, in that I plan on using our IO tools once for most analyses (or twice, to export as VCF/PLINK), and that some of those tools won't ever work well in the same environment. On the other hand, packages like pysnptools align with our stack (at the moment), and I agree there are a bunch of advantages to having them colocated in the same environment. I'm not sure which way to lean with other IO tools, but I think it will be important to understand sooner rather than later how packages that can't share environments will be integrated.

For some more context on the current IO packages: pysnptools (requirements) depends on bgen-reader (requirements), and cumulatively they have pretty much the same dependencies as us, plus a few things like the native bgen C library. At least those two would be manageable for now, and arguably we should pick up the maintenance if the two current maintainers ever drop them.

@hammer
Contributor

hammer commented Jul 31, 2020

Thanks for moving this discussion forward, @ravwojdyla! It's a big decision so we will definitely not decide one way or the other without input from @jeromekelleher and @tomwhite when they return. And hopefully we will get an email response from @CarlKCarlK and @horta so that we can more closely coordinate with their work on PLINK and BGEN IO.

@horta
Collaborator

horta commented Jul 31, 2020

Hi guys,

I will respond to you later tonight, sorry about that. Also, I'll soon have a week to work on those projects again!

@tomwhite
Collaborator

tomwhite commented Aug 3, 2020

I would be on board with moving sgkit-plink and sgkit-bgen into the sgkit repo and using extras_require to separate dependencies. It would make our lives easier, and users' lives easier, since they could do pip install sgkit[all].

For VCF we currently depend on scikit-allel to convert to Zarr as an intermediate format (see https://github.com/pystatgen/sgkit/pull/40). In the future we may want a tool that goes directly from VCF to sgkit format, and that might be complex enough to warrant a separate package (let's see). But the sgkit-specific code for plink and bgen has turned out to be pretty minimal, so I think it could happily live in sgkit.

@jeromekelleher
Collaborator

I do see the arguments for merging everything together, @ravwojdyla, and I agree it is a burden on us to keep the repos separate. However, I do think it would be a fundamental mistake to merge everything together and put the burden of working out dependencies on the user. It is a lot easier for a user to understand that they need to install sgkit-plink, rather than pysnptools, if they want to process plink data.

Suppose I'm a user who wants to convert and process their VCF data. I'm lazy, so I don't read all the documentation saying which packages I need to install for what, and so I run pip install sgkit[all] (I don't really understand what this means either; I'm just copying and pasting commands). Installation then fails on some library I've never heard of, and so I give up. By putting the dependencies for the third-party libs into sgkit, we're making it the user's responsibility to understand which dependencies they need in order to process particular file formats.

Under the current scheme, there are two ways we could recommend users approach this:

  1. Encourage them to install sgkit on its own, and then run sgkit import-vcf (see Initial proposal for CLI plugin system. #57 for the proposal). If the sgkit-vcf package isn't installed, we'll show a nice friendly message saying "make sure you install sgkit-vcf".
  2. We could have simple guidance like "if you're working with X data format, then install sgkit-X, which will also pull in sgkit as a dependency".

The point is, it's much easier for inexperienced users to understand that they need sgkit-x installed to work with format x than some third-party package they've never heard of. We have to assume that most users of sgkit will not be experienced Python users, and everything we can do to simplify their lives will be a big help.

So, I agree it's a burden on us to maintain n different sgkit import/export libraries, but that is much better than putting the burden of understanding the dependency network for n different formats on users. OK, it's a PITA now to sync up n different repos in terms of coding style and so on, but this is just the initial phase while we're getting things up and running. Once the basic import/export functions are done, I'd bet we'll rarely touch the format repos. All of the real development will happen in the sgkit repo, and the import/export repos will basically be in maintenance mode, updating every now and again to deal with upstream changes.

@ravwojdyla
Collaborator Author

ravwojdyla commented Aug 3, 2020

@jeromekelleher good points. So maybe a good middle ground would be to not have pip install sgkit[all], and instead have the code fail with a nice friendly message saying "make sure you install sgkit[vcf]"? Less burden on us, and the same level of experience for the users.

@jeromekelleher
Collaborator

Yeah, this is a good idea @ravwojdyla, but I still don't think we should merge the repos! Fundamentally, I think it's short-sighted, but it's probably best to talk this through during a call. If everyone else agrees that we should merge the repos, then I'll go with the flow (and only complain and grumble occasionally 😉). Fair enough?

@jeromekelleher
Collaborator

FWIW, here's the list of packages we install in a fresh venv for sgkit:

numpy-1.19.1 pandas-1.1.0 python-dateutil-2.8.1 pytz-2020.1 sgkit-0.1.dev51+g18ed63b six-1.15.0 xarray-0.16.0

and this is what we have for sgkit-plink:

appdirs-1.4.4 attrs-19.3.0 backports.tempfile-1.0 backports.weakref-1.0.post1 bgen-reader-4.0.4 cachetools-4.1.1 certifi-2020.6.20 cffi-1.14.1 chardet-3.0.4 cloudpickle-1.5.0 dask-2.22.0 dill-0.3.2 fsspec-0.8.0 h5py-2.10.0 idna-2.10 iniconfig-1.0.1 locket-0.2.0 more-itertools-8.4.0 numpy-1.19.1 packaging-20.4 pandas-1.1.0 partd-1.1.0 pluggy-0.13.1 psutil-5.7.2 py-1.9.0 pycparser-2.20 pyparsing-2.4.7 pysnptools-0.4.20 pytest-6.0.1 pytest-runner-5.2 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 requests-2.24.0 scipy-1.5.2 sgkit-plink-0.1.dev21+ge829857 six-1.15.0 texttable-1.6.2 toml-0.10.1 toolz-0.10.0 tqdm-4.48.2 urllib3-1.25.10 wheel-0.34.2 xarray-0.16.0

The first looks nice and light to me; the second... doesn't.

@CarlKCarlK
Collaborator

CarlKCarlK commented Aug 3, 2020 via email

@eric-czech
Collaborator

Welcome @CarlKCarlK!

I'm excited to see how we can best integrate your work. Thanks for being willing to collaborate on it.

@ravwojdyla
Collaborator Author

ravwojdyla commented Aug 6, 2020

For our next meeting tomorrow, I thought it would be good to document the extras_require approach, with its benefits and drawbacks.

But first I will describe the potential setup:

  1. since all the IOs right now are relatively lightweight, we would merge all of them into this repo as separate modules
  2. we won't have an sgkit[all] package, but instead at least {core, io_bgen, io_plink} (the names are arbitrary and totally up for discussion)
  3. use of code that requires extra dependencies would trigger a friendly error, with a message saying that users must install sgkit[foo] for the foo feature (very much like dask does, for example)

Now I will dive into the consequences, using {benefit, drawback, same} to mark the points:

  • benefit/same: installing the root package, e.g. pip install sgkit, would install the full codebase but not all the dependencies, specifically not the dependencies of the extra IO modules
  • same: ^ this means that on a simple/root install you would end up with a clean environment, as in https://github.com/pystatgen/sgkit/issues/65#issuecomment-668130037
  • same: each IO module would have its own list of dependencies that get installed iff requested by the user, in which case you end up with more dependencies in your environment, exactly as if the IOs were separate projects
  • benefit: since all the code is in one place, we don't have the extra overhead of CI/CD, doc setup, inter-package dependencies, PRs, etc.; see Mypy integration sgkit-plink#25 for an example
  • same: we won't have an sgkit[all] package, to mitigate potential issues with heavy IO dependencies; see: https://github.com/pystatgen/sgkit/issues/65#issuecomment-668002409
  • same: to @eric-czech's requirement/point: if you want to install just an IO package, that would still trigger an install of the whole codebase (afaiu), but again, not all the dependencies. We could set this up so that it installs the root dependencies or not; it depends how important that is for us and whether the root dependencies are already part of the IO dependencies, so we would need to discuss this further and make a decision.
  • drawback/same: iff one of the IO modules grows so much that it actually justifies being a separate package, that would (afaiu) be relatively doable; nevertheless I marked it as a drawback. We would set up a new repo with the IO's codebase, and still keep the extra package (without the IO's codebase, obviously) in the main repo, which would then depend on the new separate package. Likely we could use a namespace package.
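To sketch that last point (the extra name io_plink and the standalone distribution name sgkit-plink here are hypothetical, chosen just for illustration), splitting a module out would mostly mean repointing its extra from an inline dependency list to the new external package:

```python
# Before the split: the extra carries the IO module's own
# third-party dependencies directly (names are illustrative).
EXTRAS_BEFORE = {"io_plink": ["pysnptools"]}

# After the split: the IO code moves to a hypothetical standalone
# `sgkit-plink` distribution, and the main repo's extra simply
# depends on it, so `pip install sgkit[io_plink]` keeps working
# unchanged from the user's point of view.
EXTRAS_AFTER = {"io_plink": ["sgkit-plink"]}
```

The user-facing install command stays the same either way; only what the extra resolves to changes.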

This is not an exhaustive list; nevertheless, I see more benefits in merging the repos (though I might be missing something). Feel free to comment here, and let's also discuss this further during our weekly tomorrow.

@jeromekelleher
Collaborator

Thanks for the great summary @ravwojdyla, this is very helpful!

@hammer hammer changed the title Enforce style across all pystatgen repos Monorepo vs. multi-repo Aug 12, 2020
@hammer
Contributor

hammer commented Aug 12, 2020

I've changed the title of this issue since the conversation has mostly centered upon the question posed by @ravwojdyla in the final sentence of the original issue comment:

An alternative solution to this problem would be to have a single repo with proper separation of modules to prevent dependency hell.

@ravwojdyla
Collaborator Author

For posterity: we have decided to consolidate the IO repos into the main repo and to try the design from https://github.com/pystatgen/sgkit/issues/65#issuecomment-670049733

@hammer
Contributor

hammer commented Oct 15, 2020

Closing this out now that I've archived the IO-specific repositories.

@hammer hammer closed this as completed Oct 15, 2020