
Monorepo vs. multi-repo #65

Closed
ravwojdyla opened this issue Jul 24, 2020 · 20 comments
Labels
multi-repo (Issues related to having multiple repos), process + tools

Comments

@ravwojdyla
Collaborator

ravwojdyla commented Jul 24, 2020

We have a solid style-check setup in the main repo, but the IO repos have a slightly older setup, and it might be hard to keep them in sync; we also have repos in RS with similar setups. I wonder if we should explore some kind of centralisation of this style-check setup (akin to a common plugin in sbt or a root parent POM in Maven). The best I could find for pre-commit is this; we could give it a try. One downside I can see is that it might not work well with the mypy check (but that already doesn't work well even in a solo repo; related: https://github.com/pystatgen/sgkit/issues/39)

An alternative solution to this problem would be to have a single repo with proper separation of modules to prevent dependency hell.

@ravwojdyla ravwojdyla added the process + tools and multi-repo (Issues related to having multiple repos) labels Jul 24, 2020
@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 30, 2020

It came up during our weekly sync, so I wanted to document it: we could use extras_require to separate dependencies, and even provide a convenient way to install a complete sgkit, e.g. sgkit[complete].
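As a rough sketch of how that could look (the extra names and the dependency lists below are illustrative assumptions, not a decided layout), `setup.py` could declare per-feature dependency lists in `extras_require`, plus a `complete` extra that is their union:

```python
# Hypothetical extras layout for sgkit's setup.py.
# The extra names ("plink", "bgen") and the packages listed under
# them are assumptions for illustration, not the agreed design.
EXTRAS = {
    "plink": ["pysnptools"],
    "bgen": ["bgen-reader"],
}

# "complete" is the union of every optional dependency, so users
# can opt into everything at once.
EXTRAS["complete"] = sorted({dep for deps in EXTRAS.values() for dep in deps})

# Passing extras_require=EXTRAS to setuptools.setup(...) would then
# allow `pip install sgkit[plink]` or `pip install sgkit[complete]`.
```

With this, `pip install sgkit` would install only the core dependencies, while `pip install 'sgkit[complete]'` would pull in every optional IO dependency as well.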

@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 31, 2020

@jeromekelleher @alimanfoo @tomwhite @eric-czech @hammer any thoughts/objections/-1/+1 on merging the IO repos into the main repo and using extras_require to separate dependencies? If you prefer to discuss this during our ~weekly meetings, this can obviously wait until the next one.

@eric-czech
Collaborator

eric-czech commented Jul 31, 2020

I'm a +1 to that, as long as it's always easy to install the IO-related packages in a separate environment from all the other sgkit packages (without needing to share common dependencies). Something like this seems reasonable to me:

conda activate base
conda install sgkit[core]
conda activate io
conda install sgkit[plink,bgen]

I think making CI run everything in a common environment will work for now, but I don't think that would be a good long-term solution. How hard is it to split the build up like that?

@ravwojdyla
Collaborator Author

@eric-czech I believe (we should double-check) that installing any extra from extras_require will always install the root package(s). I'm not sure right now how idiomatic it is to not install the root package for an extra, and I'm not sure that's a good idea at the moment, BUT if we do want that, I guess a workaround could be to have a dummy root package sgkit_parent, with everything else in extra packages like core, plink, bgen. Regarding splitting CI, I agree we should keep a single env for now, but later we can split things up (it shouldn't be too difficult).

@ravwojdyla
Collaborator Author

ravwojdyla commented Jul 31, 2020

And to provide some further reasoning for this:

  • it would lower the overhead of our style-check, doc, test setup, etc.
  • it could increase the discoverability of features, since the full codebase would be in a single repo and would be installed together

Regarding the 2nd point, to be clear: we could likely have a "core" package that brings in all packages, some of which would require extra dependencies; lack of those would trigger an ImportError with an informative message saying you need to install the extra package (e.g. sgkit[blah] for a specific blah feature), akin to what dask does:

> pip install dask # but NOT dask[bag]
> python
# dask.bag code/package is present (discoverable), but its dependencies are not installed
>>> import dask.bag
ImportError: No module named cloudpickle

Dask bag requirements are not installed.

Please either conda or pip install as follows:

  conda install dask               # either conda install
  pip install dask[bag] --upgrade  # or pip install
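A minimal version of that guard (the helper name `import_optional` and the `sgkit[plink]` extra below are hypothetical, not an agreed API) could look like:

```python
import importlib


def import_optional(module, extra):
    """Import `module`, or raise an ImportError telling the user
    which sgkit extra to install (dask-style friendly message)."""
    try:
        return importlib.import_module(module)
    except ImportError as err:
        raise ImportError(
            f"{module} requirements are not installed.\n"
            f"Please pip install as follows:\n"
            f"  pip install 'sgkit[{extra}]' --upgrade"
        ) from err


# e.g. at the top of a hypothetical sgkit.io.plink module:
# pysnptools = import_optional("pysnptools", "plink")
```

This keeps the optional code importable and discoverable while deferring the dependency error to first use, with a message that names the exact extra to install.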

@eric-czech
Collaborator

There's a long past discussion on this, btw, starting around here: https://github.com/pystatgen/sgkit/issues/2#issuecomment-646488636.

I definitely agree with @jeromekelleher here, in that I plan on using our IO tools once for most analyses (or twice, to export as VCF/PLINK), and that some of those tools won't ever work well in the same environment. On the other hand, packages like pysnptools align with our stack (at the moment), and I agree there are a bunch of advantages to having them colocated in the same environment. I'm not sure which way to lean with other IO tools, but I think it will be important to understand sooner rather than later how packages that can't share environments will be integrated.

For some more context on the current IO packages: pysnptools (requirements) depends on bgen-reader (requirements), and cumulatively they have pretty much the same dependencies as us, plus a few things like the native bgen C library. At least those two would be manageable for now, and arguably we should pick up the maintenance if the two current maintainers ever drop them.

@hammer
Contributor

hammer commented Jul 31, 2020

Thanks for moving this discussion forward, @ravwojdyla! It's a big decision so we will definitely not decide one way or the other without input from @jeromekelleher and @tomwhite when they return. And hopefully we will get an email response from @CarlKCarlK and @horta so that we can more closely coordinate with their work on PLINK and BGEN IO.

@horta
Collaborator

horta commented Jul 31, 2020

Hi guys,

I will respond to you later tonight, sorry about that. Also, I'll soon have a week to work on those projects again!

@tomwhite
Collaborator

tomwhite commented Aug 3, 2020

I would be on board with moving sgkit-plink and sgkit-bgen into the sgkit repo and using extras_require to separate dependencies. It would make our lives easier, and users' lives easier, since they could do pip install sgkit[all].

For VCF we currently depend on scikit-allel to convert to Zarr as an intermediate format (see https://github.com/pystatgen/sgkit/pull/40). In the future we may want a tool that goes directly from VCF to sgkit format, and that might be complex enough to warrant a separate package (let's see). But the sgkit-specific code for plink and bgen has turned out to be pretty minimal, so I think it could happily live in sgkit.

@jeromekelleher
Collaborator

I do see the arguments for merging everything together, @ravwojdyla, and I agree it is a burden on us to keep the repos separate. However, I do think it would be a fundamental mistake to merge everything together and put the burden of working out dependencies on the user. It is a lot easier for a user to understand that they need to install sgkit-plink, rather than pysnptools, if they want to process plink data.

Suppose I'm a user who wants to convert and process their VCF data. I'm lazy, so I don't read all the documentation saying which packages I need to install for what, and so I run pip install sgkit[all] (I don't really understand what this means either; I'm just copying and pasting commands). Installation then fails on some library I've never heard of, and so I give up. By putting the dependencies for the third-party libs into sgkit, we're making it the user's responsibility to understand which dependencies they need in order to process particular file formats.

Under the current scheme, there are two ways we could recommend users approach this:

  1. Encourage them to install sgkit on its own, and then run sgkit import-vcf (see Initial proposal for CLI plugin system. #57 for the proposal). If the sgkit-vcf package isn't installed, we'll show a nice friendly message saying "make sure you install sgkit-vcf".
  2. We could have simple guidance like "if you're working with X data format, then install sgkit-X, which will also pull in sgkit as a dependency".

The point is, it's much easier for inexperienced users to understand that they need sgkit-x installed to work with format x than some third-party package they've never heard of. We have to assume that most users of sgkit will not be experienced Python users, and everything we can do to simplify their lives will be a big help.

So, I agree it's a burden on us to maintain n different sgkit import/export libraries, but that is much better than putting the burden of understanding the dependency network for n different formats on users. OK, it's a PITA now to sync up n different repos in terms of coding style and so on, but this is just the initial phase while we're getting things up and running. Once the basic import/export functions are done, I'd bet we'll rarely touch the format repos. All of the real development will happen in the sgkit repo, and the import/export repos will basically be in maintenance mode, updating every now and again to deal with upstream changes.

@ravwojdyla
Collaborator Author

ravwojdyla commented Aug 3, 2020

@jeromekelleher good points. So maybe a good middle ground would be to not have pip install sgkit[all], and instead have the code fail with a nice friendly message saying "make sure you install sgkit[vcf]"? Less burden on us, and the same level of experience for the users.

@jeromekelleher
Collaborator

Yeah, this is a good idea @ravwojdyla, but I still don't think we should merge the repos! Fundamentally, I think it's short-sighted, but it's probably best to talk this through during a call. If everyone else agrees that we should merge the repos, then I'll go with the flow (and only complain and grumble occasionally 😉). Fair enough?

@jeromekelleher
Collaborator

FWIW, here's the list of packages we install in a fresh venv for sgkit:

numpy-1.19.1 pandas-1.1.0 python-dateutil-2.8.1 pytz-2020.1 sgkit-0.1.dev51+g18ed63b six-1.15.0 xarray-0.16.0

and this is what we have for sgkit-plink:

appdirs-1.4.4 attrs-19.3.0 backports.tempfile-1.0 backports.weakref-1.0.post1 bgen-reader-4.0.4 cachetools-4.1.1 certifi-2020.6.20 cffi-1.14.1 chardet-3.0.4 cloudpickle-1.5.0 dask-2.22.0 dill-0.3.2 fsspec-0.8.0 h5py-2.10.0 idna-2.10 iniconfig-1.0.1 locket-0.2.0 more-itertools-8.4.0 numpy-1.19.1 packaging-20.4 pandas-1.1.0 partd-1.1.0 pluggy-0.13.1 psutil-5.7.2 py-1.9.0 pycparser-2.20 pyparsing-2.4.7 pysnptools-0.4.20 pytest-6.0.1 pytest-runner-5.2 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 requests-2.24.0 scipy-1.5.2 sgkit-plink-0.1.dev21+ge829857 six-1.15.0 texttable-1.6.2 toml-0.10.1 toolz-0.10.0 tqdm-4.48.2 urllib3-1.25.10 wheel-0.34.2 xarray-0.16.0

The first looks nice and light to me; the second... doesn't.

@CarlKCarlK
Collaborator

CarlKCarlK commented Aug 3, 2020 via email

@eric-czech
Collaborator

Welcome @CarlKCarlK!

I'm excited to see how we can best integrate your work. Thanks for being willing to collaborate on it.

@ravwojdyla
Collaborator Author

ravwojdyla commented Aug 6, 2020

For our next meeting tomorrow, I thought it would be good to document the extras_require approach, with its benefits and drawbacks.

But first I will describe the potential setup:

  1. since all the IOs right now are relatively lightweight, we would merge all of them into this repo as separate modules
  2. we won't have an sgkit[all] package, but instead at least {core, io_bgen, io_plink} (the names are arbitrary and totally up for discussion)
  3. use of code that requires extra dependencies would trigger a friendly error, with a message saying that users must install sgkit[foo] for the foo feature (very much like dask does, for example)

Now I will dive into the consequences, using {benefit, drawback, same} to mark the points:

  • benefit/same: installing the root package, e.g. pip install sgkit, would install the full codebase but not all the dependencies, specifically not the dependencies of the extra IO modules
  • same: ^ this means that on a simple/root install you would end up with a clean environment, as in https://github.com/pystatgen/sgkit/issues/65#issuecomment-668130037
  • same: each IO module would have its own list of dependencies that get installed iff requested by the user, in which case you end up with more dependencies in your environment, exactly as if the IOs were separate projects
  • benefit: since all the code is in one place, we don't have the extra overhead of CI/CD, doc setup, inter-package dependencies, PRs, etc.; see Mypy integration sgkit-plink#25 for an example
  • same: we won't have an sgkit[all] package, to mitigate potential issues with heavy IO dependencies; see: https://github.com/pystatgen/sgkit/issues/65#issuecomment-668002409
  • same: to @eric-czech's requirement/point: if you want to install just an IO package, that would still trigger an install of the whole codebase (afaiu), but again, not all the dependencies. We could set this up so that it installs the root dependencies or not; it depends how important that is for us and whether the root dependencies are already part of the IO dependencies, so we would need to discuss this further and make a decision.
  • drawback/same: iff one of the IO modules grows so much that it actually justifies being a separate package, that would (afaiu) be relatively doable; nevertheless I marked it as a drawback. We would set up a new repo with the IO's codebase, and still keep the extra package (without the IO's codebase, obviously) in the main repo, which would then depend on the new separate package. Likely we could use a namespace package.
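To sketch that last point (the extra name io_plink and the standalone distribution name sgkit-plink here are hypothetical, chosen just for illustration), splitting a module out would mostly mean repointing its extra from an inline dependency list to the new external package:

```python
# Before the split: the extra carries the IO module's own
# third-party dependencies directly (names are illustrative).
EXTRAS_BEFORE = {"io_plink": ["pysnptools"]}

# After the split: the IO code moves to a hypothetical standalone
# `sgkit-plink` distribution, and the main repo's extra simply
# depends on it, so `pip install sgkit[io_plink]` keeps working
# unchanged from the user's point of view.
EXTRAS_AFTER = {"io_plink": ["sgkit-plink"]}
```

The user-facing install command stays the same either way; only what the extra resolves to changes.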

This is not an exhaustive list; nevertheless, I see more benefits in merging the repos (though I might be missing something). Feel free to comment here, and let's also discuss this further during our weekly tomorrow.

@jeromekelleher
Collaborator

Thanks for the great summary @ravwojdyla, this is very helpful!

@hammer hammer changed the title Enforce style across all pystatgen repos Monorepo vs. multi-repo Aug 12, 2020
@hammer
Contributor

hammer commented Aug 12, 2020

I've changed the title of this issue since the conversation has mostly centered upon the question posed by @ravwojdyla in the final sentence of the original issue comment:

An alternative solution to this problem would be to have a single repo with proper separation of modules to prevent dependency hell.

@ravwojdyla
Collaborator Author

For posterity: we have decided to consolidate the IO repos into the main repo and to try the design from https://github.com/pystatgen/sgkit/issues/65#issuecomment-670049733

@hammer
Contributor

hammer commented Oct 15, 2020

Closing this out now that I've archived the IO-specific repositories.

@hammer hammer closed this as completed Oct 15, 2020