sgkit-plink IO merger #277

ravwojdyla · 2020-09-23T17:34:14Z

~~Depends on: #274, #271~~

UX test:

# ------------------------------------------------------------------
# create a new clean py env
# ------------------------------------------------------------------
(base) ➜  projects conda create -n test_sgkit_blah python=3.7
(base) ➜  projects conda activate test_sgkit_blah

# ------------------------------------------------------------------
# clean install of sgkit (need to install from source since we don't
# publish yet); towards the end of the output you can see the packages
# installed, and they do NOT include the sgkit-plink dependencies
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ pip install projects/sgkit
Processing /Users/rav/projects/sgkit
Collecting numpy
...
Installing collected packages: numpy, six, python-dateutil, pytz, pandas, xarray, pyyaml, toolz, dask, scipy, asciitree, numcodecs, monotonic, fasteners, zarr, llvmlite, numba, typing-extensions, sgkit
Successfully installed asciitree-0.3.3 dask-2.27.0 fasteners-0.15 llvmlite-0.34.0 monotonic-1.5 numba-0.51.2 numcodecs-0.7.2 numpy-1.19.2 pandas-1.1.2 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 scipy-1.5.2 sgkit-0.1.dev172+g725c774 six-1.15.0 toolz-0.10.0 typing-extensions-3.7.4.3 xarray-0.16.1 zarr-2.4.0

# ------------------------------------------------------------------
# let's test if sgkit works as expected
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ echo $PYTHONPATH

(test_sgkit_blah) ➜  ~ python3.7
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> simulate_genotype_call_dataset(10, 10)
<xarray.Dataset>
Dimensions:             (alleles: 2, ploidy: 2, samples: 10, variants: 10)
...

# ------------------------------------------------------------------
# looks like the sgkit core works fine, now let's try to use
# sgkit-plink IO:
# ------------------------------------------------------------------
>>> from sgkit.io.plink import read_plink
...
Please install them via pip :

  pip install --upgrade 'sgkit[plink]'

# ------------------------------------------------------------------
# we've got an error there that tells us we need to install missing
# dependencies, and it shows us the command; again, since we don't
# publish sgkit yet, I install from source:
# ------------------------------------------------------------------

(test_sgkit_blah) ➜  ~ pip install --upgrade 'projects/sgkit[plink]'
...
Successfully installed appdirs-1.4.4 bed-reader-0.1.1 chardet-3.0.4 fsspec-0.8.2 idna-2.10 locket-0.2.0 packaging-20.4 partd-1.1.0 pooch-1.2.0 pyparsing-2.4.7 requests-2.24.0 sgkit-0.1.dev172+gff4fc2c urllib3-1.25.10

# ------------------------------------------------------------------
# You can see some extra packages got installed, let's try to use plink IO again:
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ python3.7
>>> from sgkit.io.plink import read_plink
>>> read_plink(path="/tmp/plink_sim_10s_100v_10pmiss")
# ------------------------------------------------------------------
# all works now
# ------------------------------------------------------------------
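
The error shown in the transcript above comes from an import guard. A minimal sketch of such a guard, assuming the check happens when the module is imported; `import_or_hint` and its arguments are hypothetical names used for illustration, not sgkit's actual code:

```python
import importlib
from types import ModuleType


def import_or_hint(module_name: str, extra: str) -> ModuleType:
    """Import `module_name`, or raise an ImportError naming the pip extra to install.

    Hypothetical helper: the name and signature are illustrative only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        hint = (
            f"Requirements for the '{extra}' extra are not installed.\n\n"
            "Please install them via pip :\n\n"
            f"  pip install --upgrade 'sgkit[{extra}]'"
        )
        # Keep the original error text and chain the exception for debugging.
        raise ImportError(f"{e}\n\n{hint}") from e


# A module that is installed imports normally.
json_module = import_or_hint("json", extra="plink")

# A missing module (the name below is deliberately made up) fails with the hint.
try:
    import_or_hint("module_that_is_not_installed_xyz", extra="plink")
except ImportError as err:
    print(err)
```

Placing a check like this in a subpackage's `__init__.py` means the core package stays importable without the optional dependencies, and the failure only appears when the optional IO module is imported.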

tomwhite

This looks great @ravwojdyla.

We should rename pysnptools.py since we no longer use that library. Maybe call it plink_reader.py?

I was wondering if it would be possible to have the read_plink function at the top-level, but that's not possible due to the guard (right?). I think from sgkit.io.plink import read_plink is logical however.

Also, we'll need some documentation about how to install the library, but did you think that would be a separate PR?

tomwhite · 2020-09-24T14:17:35Z

setup.cfg

+plink =
+    partd
+    fsspec
+    bed-reader


I can't see where dask[dataframe] actually gets installed - it looks like only its dependencies get installed here?

@tomwhite see the comment above (https://github.com/pystatgen/sgkit/pull/277/files#diff-380c6a8ebbbce17d55d50ef17d3cf906R41-R51) for the context. Does that make sense?

It makes sense, but I'm still missing something: where is the dask[dataframe] dependency declared? It's used by pysnptools.py, but how does it get installed?

dask[dataframe] is not an "actual" dependency; it's an extra in dask. We already install dask via dask[array] in install_requires, and dask[dataframe] would be an extra inside our own extras (which this bug in pip affects), so we can't really list dask[dataframe] as a dependency unless we use the 2020 pip resolver. That leaves the options outlined in the comment above. Instead of forcing the extra flag on users, I opted for listing the missing dependencies that come from dask[dataframe]. The initial version of this PR actually used the 2020 resolver, but I reverted to listing the dask[dataframe] deps directly so that we can use plain pip.
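
To see which packages an extra such as dask[dataframe] pulls in, one can filter a distribution's declared requirements by their `extra == "..."` marker. A rough sketch using only the standard library (Python 3.8+ for `importlib.metadata`); `requirements_for_extra` and `installed_requirements_for_extra` are hypothetical helpers, and the sample strings are illustrative rather than dask's real metadata:

```python
import re
from importlib import metadata
from typing import List, Optional

# Matches environment markers such as: extra == "dataframe"
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def requirements_for_extra(requires: Optional[List[str]], extra: str) -> List[str]:
    """Return the requirement specs that are only needed for `extra`."""
    selected = []
    for line in requires or []:
        spec, _, marker = line.partition(";")
        match = _EXTRA_MARKER.search(marker)
        if match and match.group(1) == extra:
            selected.append(spec.strip())
    return selected


def installed_requirements_for_extra(dist_name: str, extra: str) -> List[str]:
    """Same as above, reading the metadata of an installed distribution."""
    try:
        return requirements_for_extra(metadata.requires(dist_name), extra)
    except metadata.PackageNotFoundError:
        return []


# Illustrative metadata lines (NOT copied from dask's real metadata).
sample = [
    "pyyaml",
    'numpy; extra == "array"',
    'partd; extra == "dataframe"',
    'fsspec; extra == "dataframe"',
]
print(requirements_for_extra(sample, "dataframe"))  # ['partd', 'fsspec']
```

With dask installed, `installed_requirements_for_extra("dask", "dataframe")` would list what the extra adds on top of the base package, which is the set that has to be spelled out by hand in the plink extra.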

Ah, got it. I was missing the bit about dask[dataframe] not actually providing any Dask code.

When the pip bug is fixed do you think we should switch to just adding the dask[dataframe] dependency here?

@tomwhite quick answer: sure. That bug will be fixed by the 2020 resolver, though. Right now the 2020 resolver is available via a feature flag and is scheduled to become the default in October this year, so we can wait until then plus a couple of months and we should be good to switch over.

ravwojdyla · 2020-09-24T14:29:52Z

@tomwhite

We should rename pysnptools.py since we no longer use that library. Maybe call it plink_reader.py?

Sure - will push a fixup.

I was wondering if it would be possible to have the read_plink function at the top-level, but that's not possible due to the guard (right?). I think from sgkit.io.plink import read_plink is logical however.

Exactly.

Also, we'll need some documentation about how to install the library, but did you think that would be a separate PR?

I think Eric added a bit of that in #278 (plus this PR adds an informative error message if plink deps are missing), would you suggest anything more?

tomwhite · 2020-09-24T14:36:02Z

I think Eric added a bit of that in #278 (plus this PR adds an informative error message if plink deps are missing), would you suggest anything more?

Yes, I think we should add a bit saying what you need to do to read PLINK/BGEN/VCF. I've added something in #235, but that will need changing, so I'm happy to do it after this (and the BGEN, VCF equivalents) have gone in.

mergify · 2020-09-24T15:28:57Z

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

tomwhite

Please also rename test_pysnptools.py to test_plink_reader.py.

jeromekelleher

LGTM, the import guard should work well I think.

Just so I'm clear, the intended usage is

# my user script
from sgkit.io.plink import read_plink  # Fails here if we can't import bed-reader

df = read_plink("myplinkfile")

So, this will fail at import time if bed-reader isn't installed. I guess this will work well once we're good about library hygiene and make sure we don't ever refer to sgkit.io.plink elsewhere in sgkit.

I guess the other option would be to fail at run time when someone calls read_plink. This would mean we don't have to be quite as strict about how we use read_plink within the library (see #57, eg). I'm sure there are pros and cons either way, just bringing up the idea for discussion.

Anyway, this is easy to tweak afterwards. Very happy to merge this as is.
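
The run-time alternative mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `requires_modules` and `read_plink_stub` are hypothetical names, and the stub does not read any PLINK data; it only shows where the failure would surface.

```python
import functools
import importlib
from typing import Any, Callable


def requires_modules(*module_names: str, extra: str) -> Callable:
    """Decorator that checks optional modules when the function is CALLED.

    Hypothetical helper for illustration; not part of sgkit.
    """

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for name in module_names:
                try:
                    importlib.import_module(name)
                except ImportError as e:
                    raise ImportError(
                        f"{func.__name__} needs '{name}'; install it with: "
                        f"pip install --upgrade 'sgkit[{extra}]'"
                    ) from e
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Defining (or importing) the function succeeds even if the module is missing ...
@requires_modules("module_that_is_not_installed_xyz", extra="plink")
def read_plink_stub(path: str) -> str:
    return f"would read {path}"


# ... and the failure is deferred until the call.
try:
    read_plink_stub("myplinkfile")
except ImportError as err:
    print(err)
```

The trade-off is the one raised in the comment: a call-time check lets other modules import the reader freely, while an import-time check reports the missing dependency earlier.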

jeromekelleher · 2020-09-25T08:52:17Z

sgkit/io/plink/tests/test_pysnptools.py

@@ -0,0 +1,126 @@
+import numpy as np


should probably rename this file I guess (did @tomwhite point this out too though?)

ravwojdyla · 2020-09-25T10:40:38Z

@tomwhite @jeromekelleher thanks for the reviews, last update:

- finished up renaming of the files
- decided to move plink tests to the main test directory (no value in having them separate)
- added API doc and installation doc (added two sections to the docs: IO/imports to API and top level IOs)

PTAL

@jeromekelleher

So, this will fail at import time if bed-reader isn't installed. I guess this will work well once we're good about library hygiene and make sure we don't ever refer to sgkit.io.plink elsewhere in sgkit.

Correct, and thanks for raising this. As you point out, we can easily change that in the future if we like (though I prefer it the way it is right now). Regarding the CLI, there are many ways we can solve that with either import-time or run-time guards, and we can wait until we have a CLI to decide(?).

codecov-commenter · 2020-09-25T10:48:07Z

Codecov Report

Merging #277 into master will decrease coverage by 0.18%.
The diff coverage is 97.08%.

@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
- Coverage   98.86%   98.68%   -0.19%     
==========================================
  Files          14       16       +2     
  Lines         884      987     +103     
==========================================
+ Hits          874      974     +100     
- Misses         10       13       +3

Impacted Files	Coverage Δ
sgkit/io/plink/__init__.py	`50.00% <50.00%> (ø)`
sgkit/io/plink/plink_reader.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5604787...8858371. Read the comment docs.
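
The percentages in the report can be re-derived from the hit and line counts in the table. A small sketch of that arithmetic; `coverage_pct` is a hypothetical helper, and nothing here reflects how Codecov itself computes or rounds its figures:

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Counts taken from the Coverage Diff table above.
before = coverage_pct(874, 884)            # master
after = coverage_pct(974, 987)             # this PR
diff_only = coverage_pct(974 - 874, 987 - 884)  # lines added by the PR

print(f"master:        {before:.4f}%")
print(f"PR:            {after:.4f}%")
print(f"change:        {after - before:+.4f} points")
print(f"diff coverage: {diff_only:.4f}%")
```

The change works out to roughly -0.186 points, which sits between the 0.18% quoted in the summary line and the -0.19% in the table; presumably one figure is truncated and the other rounded, but that is a guess.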

jeromekelleher · 2020-09-25T11:25:13Z

Sure @ravwojdyla, let's leave things as they are for now and think about the runtime/import time thing later.

Pinging @eric-czech for a review here too, since this is a big change.

tomwhite · 2020-09-28T11:00:38Z

decided to move plink tests to the main test directory (no value in having them separate)

I'm looking at the VCF tests, and they are spread across several files, so in that case I think it is worth putting them in a separate directory (tests/io/vcf). So perhaps we should do the same for plink, for consistency.

ravwojdyla · 2020-09-28T11:10:57Z

@tomwhite done.

eric-czech

Thanks @jeromekelleher.

LGTM. One very minor concern I have is testing that the core functionality works without the extras installed (as they would be in the build env now). I suppose that only becomes more of a risk when we have more functionality like vcfzarr_reader though, or anything that is IO-related without needing the underlying reader libraries.

ravwojdyla · 2020-09-28T13:12:51Z

@eric-czech thanks for the review! Regarding your concern about testing IOs, in those cases we can always leverage pytest markers, and for example by default skip tests that we expect are hard to test locally (and hide them behind a marker flag), kinda like what we do internally with datastore emulated tests.
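
A minimal `conftest.py` sketch of the marker approach described above, following the standard pytest pattern of skipping marked tests unless a command-line flag is given. The marker name `io_extras` and the flag `--run-io-extras` are hypothetical choices for illustration:

```python
import pytest


def pytest_addoption(parser):
    # Hypothetical opt-in flag; tests marked `io_extras` are skipped without it.
    parser.addoption(
        "--run-io-extras",
        action="store_true",
        default=False,
        help="run tests that need optional IO dependencies",
    )


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "io_extras: test needs optional IO dependencies"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-io-extras"):
        return  # flag given: run everything
    skip_io = pytest.mark.skip(reason="needs --run-io-extras to run")
    for item in items:
        if "io_extras" in item.keywords:
            item.add_marker(skip_io)
```

A test would then opt in with `@pytest.mark.io_extras`, and a CI job with the extras installed would pass `--run-io-extras`.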

Btw, about auto-merge: this PR has a change in workflows, so you need to merge it manually.

eric-czech · 2020-09-28T13:12:55Z

Hm I tried adding auto-merge but no dice. @jeromekelleher could you take a look? Was that supposed to be fixed now or did I attach the wrong label?

jeromekelleher · 2020-09-28T13:23:21Z

Hm I tried adding auto-merge but no dice. @jeromekelleher could you take a look? Was that supposed to be fixed now or did I attach the wrong label?

As @ravwojdyla pointed out, mergify doesn't do auto merge on things that modify workflows. I've hit the manual merge.

ravwojdyla changed the title ~~POC sgkit-plink merger~~ sgkit-plink IO merger Sep 23, 2020

ravwojdyla force-pushed the rav/plink_poc branch 2 times, most recently from ff4fc2c to 3a8ff89 Compare September 23, 2020 19:00

ravwojdyla linked an issue Sep 23, 2020 that may be closed by this pull request

Move sgkit-plink to main sgkit repo #257

Closed

ravwojdyla force-pushed the rav/plink_poc branch from 3a8ff89 to 3094fb4 Compare September 24, 2020 10:53

tomwhite reviewed Sep 24, 2020

mergify bot added the conflict PR conflict label Sep 24, 2020

ravwojdyla force-pushed the rav/plink_poc branch from 3094fb4 to 36666f8 Compare September 24, 2020 15:43

mergify bot removed the conflict PR conflict label Sep 24, 2020

tomwhite approved these changes Sep 25, 2020

jeromekelleher approved these changes Sep 25, 2020

ravwojdyla force-pushed the rav/plink_poc branch from 36666f8 to 64918a2 Compare September 25, 2020 10:31

ravwojdyla added 2 commits September 25, 2020 12:44

sgkit-plink IO merger

67a0864

Add io/plink docs

8858371

ravwojdyla force-pushed the rav/plink_poc branch from 64918a2 to 8858371 Compare September 25, 2020 10:45

jeromekelleher requested a review from eric-czech September 25, 2020 11:25

Move plink tests to separate module in prep for vcf

e25d1ef

eric-czech approved these changes Sep 28, 2020

eric-czech added the auto-merge Auto merge label for mergify test flight label Sep 28, 2020

jeromekelleher merged commit cc7aa7f into sgkit-dev:master Sep 28, 2020

ravwojdyla deleted the rav/plink_poc branch September 28, 2020 14:15

ravwojdyla mentioned this pull request Oct 1, 2020

Move sgkit-bgen to main sgkit repo #256

Closed

tomwhite mentioned this pull request Oct 1, 2020

sgkit-vcf merger #289

Merged
