sgkit-plink IO merger #277

Merged · 3 commits · Sep 28, 2020
4 changes: 3 additions & 1 deletion .github/workflows/windows.yml
@@ -21,8 +21,10 @@ jobs:
- name: Install dependencies
# activate conda
shell: bash -l {0}
# conda can't install all dev tools, so we need to split it between conda and pip
run: |
conda install --file requirements.txt --file requirements-dev.txt msprime
conda install --file requirements.txt msprime
pip install -r requirements-dev.txt
- name: Test with pytest and coverage
# activate conda
shell: bash -l {0}
20 changes: 17 additions & 3 deletions docs/api.rst
@@ -1,11 +1,26 @@
.. currentmodule:: sgkit

#############
API reference
#############

This page provides an auto-generated summary of sgkit's API.

IO/imports
==========

.. currentmodule:: sgkit.io.plink
.. autosummary::
:toctree: generated/

read_plink

.. currentmodule:: sgkit
.. autosummary::
:toctree: generated/

read_vcfzarr

.. currentmodule:: sgkit

Creating a dataset
==================

@@ -14,7 +29,6 @@ Creating a dataset

create_genotype_call_dataset
create_genotype_dosage_dataset
read_vcfzarr

Methods
=======
11 changes: 6 additions & 5 deletions docs/index.rst
@@ -2,12 +2,13 @@ sgkit: Statistical genetics toolkit in Python
=============================================

.. toctree::
:maxdepth: 2
:caption: Contents:
:maxdepth: 2
:caption: Contents:

api
usage
contributing
api
usage
io
contributing


Indices and tables
16 changes: 16 additions & 0 deletions docs/io.rst
@@ -0,0 +1,16 @@
.. _io:

IO
==

PLINK
-----

The :func:`sgkit.io.plink.read_plink` function loads a single PLINK dataset
from bed, bim, and fam files as Dask arrays within an `xr.Dataset`.

PLINK IO support is an "extra" feature within sgkit and requires additional
dependencies. To install sgkit with PLINK support using pip::

$ pip install git+https://github.com/pystatgen/sgkit#egg=sgkit[plink]

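For orientation, a minimal usage sketch of the new reader follows. The file prefix data/example is hypothetical, and the path keyword is assumed to point at a prefix shared by the .bed/.bim/.fam files:

    import sgkit.io.plink as plink

    # Hypothetical prefix: data/example.bed, data/example.bim and
    # data/example.fam are assumed to exist on disk.
    ds = plink.read_plink(path="data/example")

    # The result is an xarray.Dataset whose variables are Dask arrays.
    print(ds)
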
3 changes: 3 additions & 0 deletions requirements-dev.txt
@@ -8,3 +8,6 @@ statsmodels
zarr
msprime
scikit-learn
partd
fsspec
bed-reader
23 changes: 21 additions & 2 deletions setup.cfg
@@ -36,7 +36,24 @@ install_requires =
setup_requires =
setuptools >= 41.2
setuptools_scm


[options.extras_require]
# For plink we need dask[dataframe]. We already have
# dask[array] in install_requires, and because of
# https://github.com/pypa/pip/issues/4957 pip will
# essentially ignore dask[dataframe] when it appears in
# an extra. We could work around this by adding the pip
# flag --use-feature 2020-resolver, by moving
# dask[dataframe] into install_requires, or by listing
# the 2 missing dependencies from dataframe directly, as
# we do here. Once pip ships its new resolver this won't
# be a problem; listing the 2 dependencies is the least
# invasive option for users.
plink =
partd
fsspec
bed-reader
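As a side note, the effect of this extras declaration can be inspected at runtime. A small sketch, assuming Python 3.8+ and an installed sgkit distribution built from this setup.cfg:

    from importlib.metadata import metadata, requires

    # Declared extras, e.g. ['plink', ...]
    print(metadata("sgkit").get_all("Provides-Extra"))

    # Requirements gated on the "plink" extra (partd, fsspec, bed-reader).
    print([r for r in (requires("sgkit") or []) if "plink" in r])
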
Collaborator:
I can't see where dask[dataframe] actually gets installed - it looks like only its dependencies get installed here?

Collaborator:
It makes sense, but I'm still missing something: where is the dask[dataframe] dependency declared? It's used by pysnptools.py, but how does it get installed?

Collaborator Author (@ravwojdyla), Sep 24, 2020:
dask[dataframe] is not an "actual" dependency; it's an extra of dask. We already install dask via dask[array] in install_requires, and because it would be an extra inside our own extra (which this pip bug affects), we can't really list dask[dataframe] as a dependency unless we use the 2020 pip resolver. That leaves the options outlined in the comment above, and rather than forcing the resolver flag on users I opted to list the missing dependencies that come from dask[dataframe]. The initial version of this PR actually used the 2020 resolver, but I reverted to listing the dask[dataframe] deps directly so that we can use plain pip. (A short illustration follows this thread.)

Collaborator:
Ah, got it. I was missing the bit about dask[dataframe] not actually providing any Dask code.

When the pip bug is fixed do you think we should switch to just add the dask[dataframe] dependency here?

Collaborator Author:
@tomwhite quick answer: sure. But that bug will be fixed by the 2020 resolver. Right now the 2020 resolver is available behind a feature flag and is scheduled to become the default in October this year, so we can wait until then plus a couple of months and then switch over.

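To make the point in this thread concrete: dask[dataframe] is only a dependency shortcut, since the dask.dataframe module itself ships with the base dask package, so listing partd and fsspec in the extra is enough for it to work. A rough sketch, assuming pandas is already available through the core requirements:

    import dask.dataframe as dd
    import pandas as pd

    # dask.dataframe imports from the base dask install; there is no
    # separate "dask[dataframe]" package, only its extra dependencies.
    pdf = pd.DataFrame({"x": [1, 2, 3]})
    ddf = dd.from_pandas(pdf, npartitions=1)
    print(ddf.x.sum().compute())  # 6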

[coverage:report]
fail_under = 100

@@ -92,9 +109,11 @@ ignore_missing_imports = True
ignore_missing_imports = True
[mypy-sklearn.*]
ignore_missing_imports = True
[mypy-bed_reader.*]
ignore_missing_imports = True
[mypy-sgkit.*]
allow_redefinition = True
[mypy-sgkit.tests.*]
[mypy-sgkit.*.tests.*]
disallow_untyped_defs = False
disallow_untyped_decorators = False
[mypy-validation.*]
11 changes: 11 additions & 0 deletions sgkit/io/plink/__init__.py
@@ -0,0 +1,11 @@
try:
    from .plink_reader import read_plink  # noqa: F401

    __all__ = ["read_plink"]
except ImportError as e:
    msg = (
        "sgkit-plink requirements are not installed.\n\n"
        "Please install them via pip:\n\n"
        "  pip install 'git+https://github.com/pystatgen/sgkit#egg=sgkit[plink]'"
    )
    raise ImportError(str(e) + "\n\n" + msg) from e
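For reference, the guard above means that importing the subpackage without the plink extra installed fails with one actionable message rather than an opaque missing-module error. A small sketch of how this surfaces for a user (the missing-dependency scenario is assumed):

    try:
        from sgkit.io.plink import read_plink
    except ImportError as err:
        # Without the "plink" extra, the message ends with the pip command
        # suggested in sgkit/io/plink/__init__.py above.
        print(err)
    else:
        print(read_plink)  # with the extra installed, the reader is importable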