sgkit-plink IO merger #277

Merged: 3 commits, Sep 28, 2020

Collaborator

@ravwojdyla ravwojdyla commented Sep 23, 2020

Re: #65 , #257

Depends on: #274, #271

UX test:

# ------------------------------------------------------------------
# create a new clean py env
# ------------------------------------------------------------------
(base) ➜  projects conda create -n test_sgkit_blah python=3.7
(base) ➜  projects conda activate test_sgkit_blah

# ------------------------------------------------------------------
# clean install of sgkit (installed from source since we don't
# publish yet); towards the end of the output you can see the
# installed packages, which do NOT include sgkit-plink dependencies
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ pip install projects/sgkit
Processing /Users/rav/projects/sgkit
Collecting numpy
...
Installing collected packages: numpy, six, python-dateutil, pytz, pandas, xarray, pyyaml, toolz, dask, scipy, asciitree, numcodecs, monotonic, fasteners, zarr, llvmlite, numba, typing-extensions, sgkit
Successfully installed asciitree-0.3.3 dask-2.27.0 fasteners-0.15 llvmlite-0.34.0 monotonic-1.5 numba-0.51.2 numcodecs-0.7.2 numpy-1.19.2 pandas-1.1.2 python-dateutil-2.8.1 pytz-2020.1 pyyaml-5.3.1 scipy-1.5.2 sgkit-0.1.dev172+g725c774 six-1.15.0 toolz-0.10.0 typing-extensions-3.7.4.3 xarray-0.16.1 zarr-2.4.0

# ------------------------------------------------------------------
# let's test if sgkit works as expected
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ echo $PYTHONPATH

(test_sgkit_blah) ➜  ~ python3.7
>>> from sgkit.testing import simulate_genotype_call_dataset
>>> simulate_genotype_call_dataset(10, 10)
<xarray.Dataset>
Dimensions:             (alleles: 2, ploidy: 2, samples: 10, variants: 10)
...

# ------------------------------------------------------------------
# looks like the sgkit core works fine, now let's try to use
# sgkit-plink IO:
# ------------------------------------------------------------------
>>> from sgkit.io.plink import read_plink
...
Please install them via pip :

  pip install --upgrade 'sgkit[plink]'

# ------------------------------------------------------------------
# we've got an error that tells us we need to install the missing
# dependencies and shows us the command; again, since we don't
# publish sgkit yet, I install from source:
# ------------------------------------------------------------------

(test_sgkit_blah) ➜  ~ pip install --upgrade 'projects/sgkit[plink]'
...
Successfully installed appdirs-1.4.4 bed-reader-0.1.1 chardet-3.0.4 fsspec-0.8.2 idna-2.10 locket-0.2.0 packaging-20.4 partd-1.1.0 pooch-1.2.0 pyparsing-2.4.7 requests-2.24.0 sgkit-0.1.dev172+gff4fc2c urllib3-1.25.10

# ------------------------------------------------------------------
# You can see some extra packages got installed, let's try to use plink IO again:
# ------------------------------------------------------------------
(test_sgkit_blah) ➜  ~ python3.7
>>> from sgkit.io.plink import read_plink
>>> read_plink(path="/tmp/plink_sim_10s_100v_10pmiss")
# ------------------------------------------------------------------
# all works now
# ------------------------------------------------------------------
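
The behaviour in this transcript, where importing the submodule fails with an install hint until the extras are present, could be produced by a guard along the following lines. This is only a minimal sketch, not the code in this PR: the helper name `require_modules` and the exact message layout are hypothetical, and `bed_reader` is assumed to be the import name of the bed-reader package.

```python
import importlib


def require_modules(modules, extra):
    """Raise ImportError with an install hint if any module cannot be imported.

    Hypothetical helper for illustration; not part of sgkit's API.
    """
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    if missing:
        raise ImportError(
            f"Missing optional dependencies: {', '.join(missing)}.\n"
            "Please install them via pip :\n\n"
            f"  pip install --upgrade 'sgkit[{extra}]'"
        )


# In a package's __init__.py the guard would run before the real imports, e.g.:
#   require_modules(["bed_reader"], extra="plink")
#   from .plink_reader import read_plink
```

Because the check runs when the subpackage is imported, the core package stays importable without the optional dependencies.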

@ravwojdyla ravwojdyla changed the title POC sgkit-plink merger sgkit-plink IO merger Sep 23, 2020
@ravwojdyla ravwojdyla force-pushed the rav/plink_poc branch 2 times, most recently from ff4fc2c to 3a8ff89 Compare September 23, 2020 19:00
@ravwojdyla ravwojdyla linked an issue Sep 23, 2020 that may be closed by this pull request
Collaborator

@tomwhite tomwhite left a comment

This looks great @ravwojdyla.

We should rename pysnptools.py since we no longer use that library. Maybe call it plink_reader.py?

I was wondering if it would be possible to have the read_plink function at the top-level, but that's not possible due to the guard (right?). I think from sgkit.io.plink import read_plink is logical however.

Also, we'll need some documentation about how to install the library, but did you think that would be a separate PR?

plink =
    partd
    fsspec
    bed-reader

Collaborator

I can't see where dask[dataframe] actually gets installed - it looks like only its dependencies get installed here?

Collaborator Author

Collaborator

It makes sense, but I'm still missing something: where is the dask[dataframe] dependency declared? It's used by pysnptools.py, but how does it get installed?

Collaborator Author

@ravwojdyla ravwojdyla Sep 24, 2020

dask[dataframe] is not an "actual" dependency; it's an extra in dask. Since we already install dask via dask[array] in install_requires, and since it would be an extra within our own extras (which this bug in pip affects), we can't really list dask[dataframe] as a dependency unless we use the 2020 pip resolver. That leaves the options outlined in the comment above. Instead of forcing the extra flag on users, I opted for listing the missing dependencies that come from dask[dataframe]. The initial version of this PR actually used the 2020 resolver, but I reverted to listing the dask[dataframe] deps directly so that we can use plain pip.
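
As a rough illustration of the workaround described here, the dependency lists might look like the following when written as plain Python data (the project may well declare them in a different file or format). The `install_requires` entries are a hypothetical subset of the core requirements; the `plink` extra mirrors the three packages in the snippet under review.

```python
# Sketch only: a hypothetical subset of the core requirements.
install_requires = ["numpy", "xarray", "dask[array]"]

# What one would like to write, but which (per the comment above) is affected
# by the pip bug because dask is already required with a different extra:
#   extras_require = {"plink": ["dask[dataframe]", "bed-reader"]}

# Workaround: spell out the packages that dask[dataframe] would add on top of
# what the core install already provides.
extras_require = {
    "plink": ["partd", "fsspec", "bed-reader"],
}

# Both mappings would be handed to setuptools, e.g.
#   setup(install_requires=install_requires, extras_require=extras_require)
```

The trade-off is that the explicit list has to be kept in sync by hand if the contents of the dask extra change.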

Collaborator

Ah, got it. I was missing the bit about dask[dataframe] not actually providing any Dask code.

When the pip bug is fixed, do you think we should switch to just adding the dask[dataframe] dependency here?

Collaborator Author

@tomwhite quick answer: sure. That bug will be fixed by the 2020 resolver. Right now the 2020 resolver is available via a feature flag and is scheduled to become the default in October this year, so we can wait until then, plus a couple of months, and we should be good to switch over.

@ravwojdyla
Collaborator Author

@tomwhite

> We should rename pysnptools.py since we no longer use that library. Maybe call it plink_reader.py?

Sure - will push a fixup.

> I was wondering if it would be possible to have the read_plink function at the top-level, but that's not possible due to the guard (right?). I think from sgkit.io.plink import read_plink is logical however.

Exactly.

> Also, we'll need some documentation about how to install the library, but did you think that would be a separate PR?

I think Eric added a bit of that in #278 (plus this PR adds an informative error message if the plink deps are missing). Would you suggest anything more?

@tomwhite
Collaborator

> I think Eric added a bit of that in #278 (plus this PR adds an informative error message if the plink deps are missing). Would you suggest anything more?

Yes, I think we should add a bit saying what you need to do to read PLINK/BGEN/VCF. I've added something in #235, but that will need changing, so I'm happy to do it after this (and the BGEN, VCF equivalents) have gone in.

@mergify
Contributor

mergify bot commented Sep 24, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 24, 2020
@mergify mergify bot removed the conflict PR conflict label Sep 24, 2020
Collaborator

@tomwhite tomwhite left a comment

Please also rename test_pysnptools.py to test_plink_reader.py.

Collaborator

@jeromekelleher jeromekelleher left a comment

LGTM, the import guard should work well I think.

Just so I'm clear, the intended usage is

# my user script
from sgkit.io.plink import read_plink  # Fails here if bed-reader can't be imported

ds = read_plink(path="myplinkfile")  # path= keyword, as in the UX test above

So, this will fail at import time if bed-reader isn't installed. I guess this will work well once we're good about library hygiene and make sure we don't ever refer to sgkit.io.plink elsewhere in sgkit.

I guess the other option would be to fail at run time when someone calls read_plink. This would mean we don't have to be quite as strict about how we use read_plink within the library (see #57, for example). I'm sure there are pros and cons either way, just bringing up the idea for discussion.

Anyway, this is easy to tweak afterwards. Very happy to merge this as is.
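
To make the run-time alternative concrete: such a guard would defer the optional import until the function is called, so importing the module itself always succeeds. The following is a minimal sketch under that assumption. `optional_backend` is a made-up module name standing in for the real reader library, and the function body is a placeholder rather than sgkit's implementation.

```python
import importlib


def read_with_optional_backend(path, backend="optional_backend"):
    """Load `path` using an optional backend that is imported lazily.

    `optional_backend` is a made-up module name used only for illustration.
    """
    try:
        module = importlib.import_module(backend)
    except ImportError as exc:
        # The failure surfaces at call time rather than at import time.
        raise ImportError(
            f"Reading {path!r} requires the optional module {backend!r}"
        ) from exc
    # Placeholder: a real reader would parse the file with `module` here.
    return module, path
```

With this style other modules could reference the function freely, at the cost of the missing-dependency error appearing later, at the first call.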

@@ -0,0 +1,126 @@
import numpy as np
Collaborator

should probably rename this file I guess (did @tomwhite point this out too though?)

@ravwojdyla
Collaborator Author

ravwojdyla commented Sep 25, 2020

@tomwhite @jeromekelleher thanks for the reviews, last update:

  • finished up renaming of the files
  • decided to move the plink tests to the main test directory (no value in having them separate)
  • added API doc and installation doc (added two sections to the docs: IO/imports to API and top level IOs) PTAL

@jeromekelleher

> So, this will fail at import time if bed-reader isn't installed. I guess this will work well once we're good about library hygiene and make sure we don't ever refer to sgkit.io.plink elsewhere in sgkit.

Correct, and thanks for raising this. As you point out, we can easily change that in the future if we like (though I prefer it the way it is right now). Regarding the CLI, there are many ways we can solve that with either import-time or run-time guards, and we can wait until we have a CLI to decide(?).

@codecov-commenter

codecov-commenter commented Sep 25, 2020

Codecov Report

Merging #277 into master will decrease coverage by 0.18%.
The diff coverage is 97.08%.

@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
- Coverage   98.86%   98.68%   -0.19%     
==========================================
  Files          14       16       +2     
  Lines         884      987     +103     
==========================================
+ Hits          874      974     +100     
- Misses         10       13       +3     

Impacted Files                    Coverage Δ
sgkit/io/plink/__init__.py        50.00% <50.00%> (ø)
sgkit/io/plink/plink_reader.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5604787...8858371.

@jeromekelleher
Collaborator

Sure @ravwojdyla, let's leave things as they are for now and think about the runtime/import time thing later.

Pinging @eric-czech for a review here too, since this is a big change.

@tomwhite
Collaborator

> decided to move the plink tests to the main test directory (no value in having them separate)

I'm looking at the VCF tests, and they are spread across several files, so in that case I think it is worth putting them in a separate directory (tests/io/vcf). So perhaps we should do the same for plink, for consistency.

@ravwojdyla
Collaborator Author

@tomwhite done.

Collaborator

@eric-czech eric-czech left a comment

Thanks @jeromekelleher.

LGTM. One very minor concern I have is testing that the core functionality works without the extras installed (as they would be in the build env now). I suppose that only becomes more of a risk when we have more functionality like vcfzarr_reader though, or anything that is IO-related without needing the underlying reader libraries.

@eric-czech eric-czech added the auto-merge Auto merge label for mergify test flight label Sep 28, 2020
@ravwojdyla
Collaborator Author

@eric-czech thanks for the review! Regarding your concern about testing IOs, in those cases we can always leverage pytest markers, and for example by default skip tests that we expect are hard to test locally (and hide them behind a marker flag), kinda like what we do internally with datastore emulated tests.
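
The marker approach mentioned here follows a common pytest pattern: a custom marker plus a command-line flag, wired up in `conftest.py`. The sketch below assumes hypothetical names for both (`io_extras` and `--run-io-extras`); neither is taken from this PR.

```python
import pytest


def pytest_addoption(parser):
    # Hypothetical flag: tests needing optional IO extras run only when it is given.
    parser.addoption(
        "--run-io-extras",
        action="store_true",
        default=False,
        help="run tests that need optional IO dependencies",
    )


def pytest_configure(config):
    # Register the (hypothetical) marker so pytest does not warn about it.
    config.addinivalue_line(
        "markers", "io_extras: test needs optional IO dependencies"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-io-extras"):
        return  # flag given: run everything
    skip_io = pytest.mark.skip(reason="needs --run-io-extras to run")
    for item in items:
        if "io_extras" in item.keywords:
            item.add_marker(skip_io)
```

A test would opt in with `@pytest.mark.io_extras`, and running `pytest --run-io-extras` would include it; a plain `pytest` run would skip it.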

Btw, about auto-merge: this PR has a change in workflows, so you need to merge it manually.

@eric-czech
Collaborator

Hm I tried adding auto-merge but no dice. @jeromekelleher could you take a look? Was that supposed to be fixed now or did I attach the wrong label?

@jeromekelleher jeromekelleher merged commit cc7aa7f into sgkit-dev:master Sep 28, 2020
@jeromekelleher
Collaborator

> Hm I tried adding auto-merge but no dice. @jeromekelleher could you take a look? Was that supposed to be fixed now or did I attach the wrong label?

as @ravwojdyla pointed out, mergify doesn't do auto merge on things that modify workflows. I've hit the manual merge.

@ravwojdyla ravwojdyla deleted the rav/plink_poc branch September 28, 2020 14:15
@tomwhite tomwhite mentioned this pull request Oct 1, 2020
Labels
auto-merge Auto merge label for mergify test flight
Successfully merging this pull request may close these issues.

Move sgkit-plink to main sgkit repo
5 participants