-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add usage and design documentation #278
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent @eric-czech, thanks very much for taking the time to lay this all out. I've found it very helpful.
My main feedback is that it currently assumes a fairly advanced knowledge of the pydata ecosystem, and a few sentences here and there explaining high-level things and providing links to upstream docs would be a big help for many people (me included!)
|
||
Overview | ||
-------- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worthwhile giving a high-level overview of what sgkit is first, something like
Sgkit is a general purpose toolkit for statistical and population genomics. The primary goal of sgkit is to take advantage of powerful tools in the PyData ecosystem [link] to facilitate interactive analysis of large-scale genomic datasets. The main libraries we use are
- xarray to provides labelled numerical datasets
- dask [something about parallelism and distribution
- cupy [somehting about GPUs]
- [? more]?
Basically, we shouldn't assume that someone "getting started" knows what all these upstream libraries are and how cool they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it should repeat what's on the index page? Wasn't sure if you saw that, but it seems to be what you're describing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see the index page all right, but I still think it'd be worth giving a little context here. We should assume new users who have just clicked a link are coming in fresh here, so just a small bit of background would help get them started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/getting_started.rst
Outdated
|
||
The presence of a single-nucleotide variant (SNV) is indicated above by the ``call_genotype`` variable, which contains | ||
an integer value corresponding to the index of the associated allele present (i.e. index into the ``variant_allele`` variable) | ||
for a sample at a given locus and chromosome. Every sgkit variable has a set of fixed semantics like this. For more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"locus" might confuse people here. "genome coordinate" or "location" perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
This example shows how either can be used, though users should prefer the mask array where possible since | ||
its on-disk representation is typically far smaller after compression is applied. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets hard-core pretty quickly. I wonder if we should put in an example where missing data is handled automatically by the library first? Users might get the impression they need to know what numba and jitting are to use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Compute Fst between the groups | ||
# TODO: Refactor based on https://github.com/pystatgen/sgkit/pull/260 | ||
.pipe(lambda ds: sg.Fst(*(g[1] for g in ds.groupby('sample_cohort')))) | ||
# Extract the single Fst value from the resulting array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change once Fst returns the full array of per-variant values by default, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
docs/getting_started.rst
Outdated
Chaining operations | ||
~~~~~~~~~~~~~~~~~~~ | ||
|
||
This example shows to chain multiple sgkit, xarray, and pandas operations into a single pipeline: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some upstream documentation/tutorial explaining what an xarray/pandas pipeline is? I think this would be helpful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/getting_started.rst
Outdated
~~~~~~~~~~~~~~ | ||
|
||
Chunked arrays, via Dask, operate very similarly to in-memory arrays within Xarray. Because of this, few affordances | ||
in sgkit are provided to treat them differently. They can generally be used in whatever context in-memory arrays are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the second sentence in this para adds much, probably best deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
Chunked arrays | ||
~~~~~~~~~~~~~~ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick intro on what chunked arrays are with links to upstream docs would be super helpful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes now also include:
Could you take another look when you get a chance @alimanfoo / @jeromekelleher (https://eric-czech.github.io/sgkit)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @eric-czech - thank you for writing it!
docs/index.rst
Outdated
down to smaller datasets and access simpler functionality for those that may be new to Python (though there is still | ||
a good bit of work to done on this front). See :ref:`getting_started` for more details. | ||
|
||
Sgkit is inspired heavily by `scikit-allel <https://scikit-allel.readthedocs.io/en/stable/>`_ and `hail <https://hail.is/docs/0.2/index.html>`_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "Hail" is capitalized (see e.g. https://hail.is/docs/0.2/tutorials-landing.html).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
docs/getting_started.rst
Outdated
|
||
- `Xarray <http://xarray.pydata.org/en/stable/>`_: N-D labeling for arrays and datasets | ||
- `Dask <https://docs.dask.org/en/latest/>`_: Parallel computing on chunked arrays | ||
- `Zarr <https://zarr.readthedocs.io/en/stable/>`_: Serialization for chunked arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Storage for chunked arrays"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/getting_started.rst
Outdated
|
||
There are currently no data models in the library that attempt to capture the complexity of many (or even common) | ||
analyses and the data structures that would support them -- operations are applied primarily to Xarray | ||
`Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_ objects instead. Users are free to manipulate data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reword in a more positive way?
"Sgkit uses Xarray Dataset objects as the data structure for representing biological data. Users are free..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This a little better? pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R39-R42
`PLINK <https://www.cog-genomics.org/plink2>`_ or `BGEN <https://www.well.ox.ac.uk/~gav/bgen_format/>`_. | ||
This is a guideline however, and a ``Dataset`` seen in practice might include many more or fewer variables and dimensions. | ||
|
||
.. image:: _static/data-structures-xarray.jpg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you create this diagram? It would be good to include the original in source control so it can be edited.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good call, I made it with google drawings so I made this read-only to anyone with the link so it could be duplicated+edited: pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R46-R48
|
||
This example shows how either can be used, though users should prefer the mask array where possible since | ||
its on-disk representation is typically far smaller after compression is applied. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.
# If an existing variable would be re-defined, a warning is thrown | ||
import warnings | ||
ds = sg.count_variant_alleles(ds) | ||
with warnings.catch_warnings(record=True) as w: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it look like users have to catch warnings, but you're having to that here to avoid ipython failing (I think). Not sure if there's a better way of doing it - one option would be to have the commented out code that would fail. Or add a comment being explaining that you don't need to explicitly catch warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to avoid the failure but more so I wanted a way to have it actually render somewhere. I couldn't find a way to have the directive do that. What's the problem with expecting users to catch warnings if they want to rather than having them print somewhere? I'm struggling to word a comment for it. Mind wording one for me and I'll drop it in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI haven't forgotten about this and I opened https://github.com/pystatgen/sgkit/issues/288 to track it. I plan on working on https://github.com/pystatgen/sgkit/issues/287 soon so it may naturally go away.
docs/getting_started.rst
Outdated
`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python | ||
data tools that improves code readability and reduces the probability of introducing accidental namespace collisions. | ||
Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with | ||
Xarray and Xandas operations in a single pipeline: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool new project? 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have a "Xandas" typo here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦 ty. I read @tomwhite's comment twice and still couldn't see the problem for some reason. Xandas does sound cool though.
docs/getting_started.rst
Outdated
# Show statistics for one of the arrays to be used as a filter | ||
ds_qc.variant_call_rate.to_series().describe() | ||
|
||
# Build a pipeline that filters on call rate and computes Fst between two populations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add "(or cohorts)" to echo the usage below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/getting_started.rst
Outdated
- `Dataset.compute <http://xarray.pydata.org/en/stable/generated/xarray.Dataset.compute.html>`_ is called | ||
- `DataArray.compute <http://xarray.pydata.org/en/stable/generated/xarray.DataArray.compute.html>`_ is called | ||
- The ``DataArray.values`` attribute is referenced | ||
- Individual dask arrays are retrieved through the ``DataArray.data`` attribute and forced to evaluate via `Client.compute <https://distributed.dask.org/en/latest/api.html#distributed.Client.compute>`_, `dask.array.Array.compute <https://tutorial.dask.org/03_array.html#Example>`_ or by coercing them to another array type (e.g. using np.asarray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quote np.asarray
and possibly link to its doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/getting_started.rst
Outdated
sg.count_variant_alleles(ds).variant_allele_count | ||
|
||
|
||
A primary design goal in sgkit is to facilitate ad-hoc analysis. There are many useful functions in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "ad hoc"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
This PR has conflicts, @eric-czech please rebase and push updated version 🙏 |
Did you consider jupyter-sphinx for rendering the code chunks @eric-czech? Inspired by you, I've been trying to use the |
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Ok @tomwhite, I refactored it to this pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R209 and moved the numba example to a separate section in the user guide at pystatgen/sgkit@f7f5a03#diff-0fcaa96196be65c5d918fd5099ea19a8R50. Sound good? p.s. I don't know why git won't let me comment on the original thread inline even though it doesn't say it's outdated or otherwise inaccessible. |
Ah nice @jeromekelleher, that does look better! I opened https://github.com/pystatgen/sgkit/issues/287 to track the likely replacement. |
@jeromekelleher/@tomwhite do you want to take another look at this? I think I got all of your suggestions, but let me know if there's anything else before I try to merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I spotted one unfixed typo, but merge away whenever.
docs/getting_started.rst
Outdated
`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python | ||
data tools that improves code readability and reduces the probability of introducing accidental namespace collisions. | ||
Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with | ||
Xarray and Xandas operations in a single pipeline: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have a "Xandas" typo here.
Merged manually because this included a small change in the docs workflow. |
https://github.com/pystatgen/sgkit/issues/87
This adds documentation on data structures, API usage, missing data, pipelining and a few other topics.
The changes here are mirrored at https://eric-czech.github.io/sgkit/ so that they're a little easier to review.
I added an "Overview" section within "Getting Started" that introduces a bunch of these ideas without much depth. I was imagining that this plus as separate "User Guide" section that explains some of these topics, plus others that are more specific to a certain type of genetic analysis, in greater detail and with realistic data would be a good start. I went that direction mostly because it's how the Xarray documentation is structured.
@tomwhite I'm happy to move this over to the "Usage" section instead and omit any kind of less-detailed overview. It may become a pain to maintain separate sections with some overlap. I do have a slight preference for the name "User Guide" over "Usage" though -- let me know if you disagree.
@hammer my thinking with this is that the current design is more or less implicit in the user docs, but to capture some of the history on design decisions too I added a section in Contributing called "Design Decisions". I put a few major threads I could think of in there.
FYI this also adds the IPython sphinx extension and ipython/matplotlib as doc build dependencies.