Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add usage and design documentation #278

Merged
merged 6 commits into from
Sep 30, 2020
Merged

Conversation

eric-czech
Copy link
Collaborator

https://github.com/pystatgen/sgkit/issues/87

This adds documentation on data structures, API usage, missing data, pipelining and a few other topics.

The changes here are mirrored at https://eric-czech.github.io/sgkit/ so that they're a little easier to review.

I added an "Overview" section within "Getting Started" that introduces a bunch of these ideas without much depth. I was imagining that this plus as separate "User Guide" section that explains some of these topics, plus others that are more specific to a certain type of genetic analysis, in greater detail and with realistic data would be a good start. I went that direction mostly because it's how the Xarray documentation is structured.

@tomwhite I'm happy to move this over to the "Usage" section instead and omit any kind of less-detailed overview. It may become a pain to maintain separate sections with some overlap. I do have a slight preference for the name "User Guide" over "Usage" though -- let me know if you disagree.

@hammer my thinking with this is that the current design is more or less implicit in the user docs, but to capture some of the history on design decisions too I added a section in Contributing called "Design Decisions". I put a few major threads I could think of in there.

FYI this also adds the IPython sphinx extension and ipython/matplotlib as doc build dependencies.

docs/index.rst Outdated Show resolved Hide resolved
Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent @eric-czech, thanks very much for taking the time to lay this all out. I've found it very helpful.

My main feedback is that it currently assumes a fairly advanced knowledge of the pydata ecosystem, and a few sentences here and there explaining high-level things and providing links to upstream docs would be a big help for many people (me included!)


Overview
--------

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worthwhile giving a high-level overview of what sgkit is first, something like

Sgkit is a general purpose toolkit for statistical and population genomics. The primary goal of sgkit is to take advantage of powerful tools in the PyData ecosystem [link] to facilitate interactive analysis of large-scale genomic datasets. The main libraries we use are

  • xarray to provides labelled numerical datasets
  • dask [something about parallelism and distribution
  • cupy [somehting about GPUs]
  • [? more]?

Basically, we shouldn't assume that someone "getting started" knows what all these upstream libraries are and how cool they are.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it should repeat what's on the index page? Wasn't sure if you saw that, but it seems to be what you're describing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did see the index page all right, but I still think it'd be worth giving a little context here. We should assume new users who have just clicked a link are coming in fresh here, so just a small bit of background would help get them started.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The presence of a single-nucleotide variant (SNV) is indicated above by the ``call_genotype`` variable, which contains
an integer value corresponding to the index of the associated allele present (i.e. index into the ``variant_allele`` variable)
for a sample at a given locus and chromosome. Every sgkit variable has a set of fixed semantics like this. For more
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"locus" might confuse people here. "genome coordinate" or "location" perhaps?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


This example shows how either can be used, though users should prefer the mask array where possible since
its on-disk representation is typically far smaller after compression is applied.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets hard-core pretty quickly. I wonder if we should put in an example where missing data is handled automatically by the library first? Users might get the impression they need to know what numba and jitting are to use this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Compute Fst between the groups
# TODO: Refactor based on https://github.com/pystatgen/sgkit/pull/260
.pipe(lambda ds: sg.Fst(*(g[1] for g in ds.groupby('sample_cohort'))))
# Extract the single Fst value from the resulting array
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change once Fst returns the full array of per-variant values by default, too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Chaining operations
~~~~~~~~~~~~~~~~~~~

This example shows to chain multiple sgkit, xarray, and pandas operations into a single pipeline:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some upstream documentation/tutorial explaining what an xarray/pandas pipeline is? I think this would be helpful here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~~~~~~~~~~~~~~

Chunked arrays, via Dask, operate very similarly to in-memory arrays within Xarray. Because of this, few affordances
in sgkit are provided to treat them differently. They can generally be used in whatever context in-memory arrays are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the second sentence in this para adds much, probably best deleted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Chunked arrays
~~~~~~~~~~~~~~

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick intro on what chunked arrays are with links to upstream docs would be super helpful here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eric-czech eric-czech removed the request for review from hammer September 25, 2020 13:46
@eric-czech
Copy link
Collaborator Author

eric-czech commented Sep 25, 2020

These changes now also include:

  • Moving simulate_genotype_call_dataset into the top-level API (https://github.com/pystatgen/sgkit/issues/252)
  • Changing the default seed in simulate_genotype_call_dataset to be fixed -- I don't see any reason for it to not be reproducible by default (Hail does that with all simulation functions and its a nice convenience).
  • Forcing the push to gh-pages in the docs workflow
  • Moving the dataset merge docs to Overview and creating a User Guide section with some stubs for sections I thought would be appropriate there: https://eric-czech.github.io/sgkit/user_guide.html

Could you take another look when you get a chance @alimanfoo / @jeromekelleher (https://eric-czech.github.io/sgkit)?

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @eric-czech - thank you for writing it!

docs/index.rst Outdated
down to smaller datasets and access simpler functionality for those that may be new to Python (though there is still
a good bit of work to done on this front). See :ref:`getting_started` for more details.

Sgkit is inspired heavily by `scikit-allel <https://scikit-allel.readthedocs.io/en/stable/>`_ and `hail <https://hail.is/docs/0.2/index.html>`_,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "Hail" is capitalized (see e.g. https://hail.is/docs/0.2/tutorials-landing.html).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


- `Xarray <http://xarray.pydata.org/en/stable/>`_: N-D labeling for arrays and datasets
- `Dask <https://docs.dask.org/en/latest/>`_: Parallel computing on chunked arrays
- `Zarr <https://zarr.readthedocs.io/en/stable/>`_: Serialization for chunked arrays
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Storage for chunked arrays"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


There are currently no data models in the library that attempt to capture the complexity of many (or even common)
analyses and the data structures that would support them -- operations are applied primarily to Xarray
`Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_ objects instead. Users are free to manipulate data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reword in a more positive way?

"Sgkit uses Xarray Dataset objects as the data structure for representing biological data. Users are free..."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`PLINK <https://www.cog-genomics.org/plink2>`_ or `BGEN <https://www.well.ox.ac.uk/~gav/bgen_format/>`_.
This is a guideline however, and a ``Dataset`` seen in practice might include many more or fewer variables and dimensions.

.. image:: _static/data-structures-xarray.jpg
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you create this diagram? It would be good to include the original in source control so it can be edited.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good call, I made it with google drawings so I made this read-only to anyone with the link so it could be duplicated+edited: pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R46-R48


This example shows how either can be used, though users should prefer the mask array where possible since
its on-disk representation is typically far smaller after compression is applied.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

# If an existing variable would be re-defined, a warning is thrown
import warnings
ds = sg.count_variant_alleles(ds)
with warnings.catch_warnings(record=True) as w:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it look like users have to catch warnings, but you're having to that here to avoid ipython failing (I think). Not sure if there's a better way of doing it - one option would be to have the commented out code that would fail. Or add a comment being explaining that you don't need to explicitly catch warnings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid the failure but more so I wanted a way to have it actually render somewhere. I couldn't find a way to have the directive do that. What's the problem with expecting users to catch warnings if they want to rather than having them print somewhere? I'm struggling to word a comment for it. Mind wording one for me and I'll drop it in?

Copy link
Collaborator Author

@eric-czech eric-czech Sep 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI haven't forgotten about this and I opened https://github.com/pystatgen/sgkit/issues/288 to track it. I plan on working on https://github.com/pystatgen/sgkit/issues/287 soon so it may naturally go away.

`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python
data tools that improves code readability and reduces the probability of introducing accidental namespace collisions.
Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with
Xarray and Xandas operations in a single pipeline:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool new project? 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have a "Xandas" typo here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 ty. I read @tomwhite's comment twice and still couldn't see the problem for some reason. Xandas does sound cool though.

pystatgen/sgkit@dcb78db

# Show statistics for one of the arrays to be used as a filter
ds_qc.variant_call_rate.to_series().describe()

# Build a pipeline that filters on call rate and computes Fst between two populations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add "(or cohorts)" to echo the usage below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- `Dataset.compute <http://xarray.pydata.org/en/stable/generated/xarray.Dataset.compute.html>`_ is called
- `DataArray.compute <http://xarray.pydata.org/en/stable/generated/xarray.DataArray.compute.html>`_ is called
- The ``DataArray.values`` attribute is referenced
- Individual dask arrays are retrieved through the ``DataArray.data`` attribute and forced to evaluate via `Client.compute <https://distributed.dask.org/en/latest/api.html#distributed.Client.compute>`_, `dask.array.Array.compute <https://tutorial.dask.org/03_array.html#Example>`_ or by coercing them to another array type (e.g. using np.asarray)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quote np.asarray and possibly link to its doc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg.count_variant_alleles(ds).variant_allele_count


A primary design goal in sgkit is to facilitate ad-hoc analysis. There are many useful functions in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "ad hoc"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@mergify
Copy link
Contributor

mergify bot commented Sep 28, 2020

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 28, 2020
@jeromekelleher
Copy link
Collaborator

Did you consider jupyter-sphinx for rendering the code chunks @eric-czech? Inspired by you, I've been trying to use the ipython-directive in another project and quickly hit against it's clunkiness. Jupyter-sphinx looks like it's much more active. Did you take a look at it?

eric-czech and others added 4 commits September 29, 2020 16:57
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
@eric-czech
Copy link
Collaborator Author

eric-czech commented Sep 29, 2020

Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

Ok @tomwhite, I refactored it to this pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R209 and moved the numba example to a separate section in the user guide at pystatgen/sgkit@f7f5a03#diff-0fcaa96196be65c5d918fd5099ea19a8R50. Sound good?

p.s. I don't know why git won't let me comment on the original thread inline even though it doesn't say it's outdated or otherwise inaccessible.

@eric-czech
Copy link
Collaborator Author

eric-czech commented Sep 29, 2020

Did you consider jupyter-sphinx for rendering the code chunks @eric-czech? Inspired by you, I've been trying to use the ipython-directive in another project and quickly hit against it's clunkiness. Jupyter-sphinx looks like it's much more active. Did you take a look at it?

Ah nice @jeromekelleher, that does look better! I opened https://github.com/pystatgen/sgkit/issues/287 to track the likely replacement.

@eric-czech
Copy link
Collaborator Author

eric-czech commented Sep 29, 2020

@jeromekelleher/@tomwhite do you want to take another look at this? I think I got all of your suggestions, but let me know if there's anything else before I try to merge it.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I spotted one unfixed typo, but merge away whenever.

`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python
data tools that improves code readability and reduces the probability of introducing accidental namespace collisions.
Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with
Xarray and Xandas operations in a single pipeline:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still have a "Xandas" typo here.

@eric-czech eric-czech added the auto-merge Auto merge label for mergify test flight label Sep 30, 2020
@eric-czech eric-czech merged commit d588980 into sgkit-dev:master Sep 30, 2020
@eric-czech
Copy link
Collaborator Author

Merged manually because this included a small change in the docs workflow.

@eric-czech eric-czech deleted the docs branch September 30, 2020 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants