-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sgkit-plink IO merger #277
Conversation
ff4fc2c
to
3a8ff89
Compare
3a8ff89
to
3094fb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @ravwojdyla.
We should rename pysnptools.py
since we no longer use that library. Maybe call it plink_reader.py
?
I was wondering if it would be possible to have the read_plink
function at the top-level, but that's not possible due to the guard (right?). I think from sgkit.io.plink import read_plink
is logical however.
Also, we'll need some documentation about how to install the library, but did you think that would be a separate PR?
plink = | ||
partd | ||
fsspec | ||
bed-reader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see where dask[dataframe] actually gets installed - it looks like only its dependencies get installed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite see the comment above (https://github.com/pystatgen/sgkit/pull/277/files#diff-380c6a8ebbbce17d55d50ef17d3cf906R41-R51) for the context. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, but I'm still missing something: where is the dask[dataframe]
dependency declared? It's used by pysnptools.py
, but how does it get installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dask[dataframe]
is not an "actual" dependency, it's an extra in dask
, and since we already install dask
via dask[array]
in install_requires
and since it's an extra in our extras (which this bug in pip affects) we can't really list dask[dataframe]
as a dependency (unless we use the 2020 pip resolver), so we have the options outlined in the comment above, so instead of forcing the extra flag on users, I opted for listing the missing dependencies that come from dask[dataframe]
. Initial version of this PR actually used the 2020 resolver, but I reverted to listing dask[dataframe]
deps directly so that we can use plain pip
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. I was missing the bit about dask[dataframe]
not actually providing any Dask code.
When the pip bug is fixed do you think we should switch to just add the dask[dataframe]
dependency here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomwhite quick answer - sure. But that bug will be fixed by the 2020 resolver, right now 2020 resolver is available via feature flag, it is scheduled to be default in October this year, so we can wait until then + a couple of months and we should be good to switch over.
Sure - will push a fixup.
Exactly.
I think Eric added a bit of that in #278 (plus this PR adds informative error message if plink deps are missing), would you suggest anything more? |
Yes, I think we should add a bit saying what you need to do to read PLINK/BGEN/VCF. I've added something in #235, but that will need changing, so I'm happy to do it after this (and the BGEN, VCF equivalents) have gone in. |
This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏 |
3094fb4
to
36666f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also rename test_pysnptools.py
to test_plink_reader.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, the import guard should work well I think.
Just so I'm clear, the intended usage is
# my user script
from sgkit.io.plink import read_plink # Fails here if we don't can't import bedreader
df = read_plink("myplinkfile")
So, this will fail at import time if bedtools doesn't exist. I guess this will work well once we're good about library hygiene and make sure we don't ever refer to sgkit.io.plink
elsewhere in sgkit.
I guess the other option would be to fail at run time when someone calls read_plink
. This would mean we don't have to be quite as strict about how we use read_plink
within the library (see #57, eg). I'm sure there are pros and cons either way, just bringing up the idea for discussion.
Anyway, this is easy to tweak afterwards. Very happy to merge this as is.
@@ -0,0 +1,126 @@ | |||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably rename this file I guess (did @tomwhite point this out too though?)
36666f8
to
64918a2
Compare
@tomwhite @jeromekelleher thanks for the reviews, last update:
Correct, and thanks for raising this. As you point out, we can easily change that in the future if we like (though I prefer it the way it is right now). Regarding CLI, there is many ways we can solve that with either import/run-time guards, and we can wait until we have CLI to decide(?). |
64918a2
to
8858371
Compare
Codecov Report
@@ Coverage Diff @@
## master #277 +/- ##
==========================================
- Coverage 98.86% 98.68% -0.19%
==========================================
Files 14 16 +2
Lines 884 987 +103
==========================================
+ Hits 874 974 +100
- Misses 10 13 +3
Continue to review full report at Codecov.
|
Sure @ravwojdyla, let's leave things as they are for now and think about the runtime/import time thing later. Pinging @eric-czech for a review here too, since this is a big change. |
I'm looking at the VCF tests, and they are spread across several files, so in that case I think it is worth putting them in a separate directory ( |
@tomwhite done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jeromekelleher.
LGTM. One very minor concern I have is testing that the core functionality works without the extras installed (as they would be in the build env now). I suppose that only becomes more of a risk when we have more functionality like vcfzarr_reader
though, or anything that is IO-related without needing the underlying reader libraries.
@eric-czech thanks for the review! Regarding your concern about testing IOs, in those cases we can always leverage pytest markers, and for example by default skip tests that we expect are hard to test locally (and hide them behind a marker flag), kinda like what we do internally with datastore emulated tests. Btw - about |
Hm I tried adding auto-merge but no dice. @jeromekelleher could you take a look? Was that supposed to be fixed now or did I attach the wrong label? |
as @ravwojdyla pointed out, mergify doesn't do auto merge on things that modify workflows. I've hit the manual merge. |
Re: #65 , #257
Depends on: #274, #271UX test: