Add usage and design documentation #278

eric-czech · 2020-09-24T13:35:42Z

https://github.com/pystatgen/sgkit/issues/87

This adds documentation on data structures, API usage, missing data, pipelining and a few other topics.

The changes here are mirrored at https://eric-czech.github.io/sgkit/ so that they're a little easier to review.

I added an "Overview" section within "Getting Started" that introduces a bunch of these ideas without much depth. I was imagining that this, plus a separate "User Guide" section that explains some of these topics (and others that are more specific to a certain type of genetic analysis) in greater detail and with realistic data, would be a good start. I went in that direction mostly because it's how the Xarray documentation is structured.

@tomwhite I'm happy to move this over to the "Usage" section instead and omit any kind of less-detailed overview. It may become a pain to maintain separate sections with some overlap. I do have a slight preference for the name "User Guide" over "Usage" though -- let me know if you disagree.

@hammer my thinking with this is that the current design is more or less implicit in the user docs, but to capture some of the history on design decisions too I added a section in Contributing called "Design Decisions". I put a few major threads I could think of in there.

FYI this also adds the IPython sphinx extension and ipython/matplotlib as doc build dependencies.

docs/index.rst

jeromekelleher

This is excellent @eric-czech, thanks very much for taking the time to lay this all out. I've found it very helpful.

My main feedback is that it currently assumes a fairly advanced knowledge of the pydata ecosystem, and a few sentences here and there explaining high-level things and providing links to upstream docs would be a big help for many people (me included!)

jeromekelleher · 2020-09-25T09:50:55Z

docs/getting_started.rst

+
+Overview
+--------
+


Might be worthwhile giving a high-level overview of what sgkit is first, something like

Sgkit is a general purpose toolkit for statistical and population genomics. The primary goal of sgkit is to take advantage of powerful tools in the PyData ecosystem [link] to facilitate interactive analysis of large-scale genomic datasets. The main libraries we use are

xarray to provide labelled numerical datasets

dask [something about parallelism and distribution]

cupy [something about GPUs]

[? more]?

Basically, we shouldn't assume that someone "getting started" knows what all these upstream libraries are and how cool they are.

Think it should repeat what's on the index page? Wasn't sure if you saw that, but it seems to be what you're describing.

I did see the index page all right, but I still think it'd be worth giving a little context here. We should assume new users who have just clicked a link are coming in fresh here, so just a small bit of background would help get them started.

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R25

jeromekelleher · 2020-09-25T09:52:46Z

docs/getting_started.rst

+
+The presence of a single-nucleotide variant (SNV) is indicated above by the ``call_genotype`` variable, which contains
+an integer value corresponding to the index of the associated allele present (i.e. index into the ``variant_allele`` variable)
+for a sample at a given locus and chromosome. Every sgkit variable has a set of fixed semantics like this. For more


"locus" might confuse people here. "genome coordinate" or "location" perhaps?

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R63
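
For anyone following this thread, the indexing relationship described in the quoted paragraph can be sketched with plain NumPy. This is only an illustration under stated assumptions: the toy arrays, the `decode_calls` helper and the use of `-1` for a missing call are made up for this sketch and are not taken from the sgkit docs under review.

```python
import numpy as np

# Hypothetical toy data: 2 variants, 3 samples, ploidy 2.
# variant_allele[v] lists the alleles observed at variant v.
variant_allele = np.array([["A", "C"], ["G", "T"]])

# call_genotype[v, s, p] is an index into variant_allele[v].
# Assumption for this sketch: -1 marks a missing call.
call_genotype = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],
        [[0, 1], [-1, -1], [1, 0]],
    ]
)


def decode_calls(call_genotype, variant_allele, missing="."):
    """Translate allele indexes into allele strings, one variant at a time."""
    decoded = np.full(call_genotype.shape, missing, dtype=object)
    for v in range(call_genotype.shape[0]):
        present = call_genotype[v] >= 0
        # Look up each non-missing index in that variant's allele list.
        decoded[v][present] = variant_allele[v][call_genotype[v][present]]
    return decoded


decoded = decode_calls(call_genotype, variant_allele)
```

Here `decoded[0, 1]` holds the alleles `A` and `C` for the second sample at the first variant, while the missing call at `decoded[1, 1]` stays as the placeholder.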

jeromekelleher · 2020-09-25T10:04:40Z

docs/getting_started.rst

+
+This example shows how either can be used, though users should prefer the mask array where possible since
+its on-disk representation is typically far smaller after compression is applied.
+


This gets hard-core pretty quickly. I wonder if we should put in an example where missing data is handled automatically by the library first? Users might get the impression they need to know what numba and jitting are to use this.

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R196

Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

FYI: https://github.com/pystatgen/sgkit/pull/278#issuecomment-700997663
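
As a gentler illustration of the two representations discussed in this thread (a sentinel value versus a boolean mask), here is a minimal sketch that needs neither numba nor jitting. The toy array, the `-1` sentinel and the call-rate calculation are assumptions chosen for the example, not the code from the docs.

```python
import numpy as np

# Hypothetical toy calls: 2 variants, 3 samples, ploidy 2; -1 is assumed to mean "missing".
call_genotype = np.array(
    [
        [[0, 0], [0, 1], [-1, -1]],
        [[0, 1], [-1, -1], [-1, -1]],
    ]
)

# The mask carries the same information as the sentinel, as a boolean array.
call_genotype_mask = call_genotype < 0

# A sample counts as "called" at a variant when none of its alleles are missing.
called_via_sentinel = (call_genotype >= 0).all(axis=-1)
called_via_mask = ~call_genotype_mask.any(axis=-1)

# Per-variant call rate: fraction of samples that were called.
call_rate = called_via_mask.mean(axis=-1)
```

Both routes give the same answer; the mask version simply avoids hard-coding the sentinel value at every call site.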

jeromekelleher · 2020-09-25T10:06:52Z

docs/getting_started.rst

+        # Compute Fst between the groups
+        # TODO: Refactor based on https://github.com/pystatgen/sgkit/pull/260
+        .pipe(lambda ds: sg.Fst(*(g[1] for g in ds.groupby('sample_cohort'))))
+        # Extract the single Fst value from the resulting array


This will change once Fst returns the full array of per-variant values by default, too.

jeromekelleher · 2020-09-25T10:07:54Z

docs/getting_started.rst

+Chaining operations
+~~~~~~~~~~~~~~~~~~~
+
+This example shows to chain multiple sgkit, xarray, and pandas operations into a single pipeline:


Is there some upstream documentation/tutorial explaining what an xarray/pandas pipeline is? I think this would be helpful here.

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R265
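
To make the idea of a pipeline concrete for readers new to it, here is a small sketch using `DataFrame.pipe` from pandas (xarray's `Dataset.pipe` follows the same pattern). The column names and the two helper functions are hypothetical and exist only for this illustration.

```python
import pandas as pd


def filter_call_rate(df, threshold):
    """Keep only rows whose call rate meets the threshold."""
    return df[df["call_rate"] >= threshold]


def add_minor_allele_frequency(df):
    """Add a 'maf' column: the smaller of the alt frequency and its complement."""
    alt = df["alt_frequency"]
    return df.assign(maf=alt.where(alt <= 0.5, 1 - alt))


# Hypothetical per-variant summary table.
variants = pd.DataFrame(
    {"call_rate": [0.99, 0.80, 0.95], "alt_frequency": [0.1, 0.4, 0.7]}
)

# Each step receives the result of the previous one, so no intermediate
# variables are needed and the steps read top to bottom.
result = (
    variants
    .pipe(filter_call_rate, threshold=0.9)
    .pipe(add_minor_allele_frequency)
    .reset_index(drop=True)
)
```

The chained form is equivalent to `add_minor_allele_frequency(filter_call_rate(variants, threshold=0.9))`, just easier to read as the number of steps grows.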

jeromekelleher · 2020-09-25T10:09:01Z

docs/getting_started.rst

+~~~~~~~~~~~~~~
+
+Chunked arrays, via Dask, operate very similarly to in-memory arrays within Xarray. Because of this, few affordances
+in sgkit are provided to treat them differently. They can generally be used in whatever context in-memory arrays are


I don't think the second sentence in this para adds much, probably best deleted.

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R305

jeromekelleher · 2020-09-25T10:10:21Z

docs/getting_started.rst

+
+Chunked arrays
+~~~~~~~~~~~~~~
+


A quick intro on what chunked arrays are with links to upstream docs would be super helpful here.

pystatgen/sgkit@ae349ee#diff-9bc8bf6e8d9db020cefce1fbc0426795R305
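
As a rough picture of what "chunked" means: a large array is split into blocks, each block is processed on its own, and the partial results are combined. The sketch below imitates that with plain NumPy. It is a stand-in for the concept only and does not use the Dask API; `chunked_sum` is a hypothetical helper.

```python
import numpy as np


def chunked_sum(array, chunk_size):
    """Sum a 1-D array by reducing fixed-size blocks and combining the results."""
    partial_sums = [
        array[start:start + chunk_size].sum()
        for start in range(0, array.shape[0], chunk_size)
    ]
    # Each partial sum could have been computed independently (or in parallel).
    return sum(partial_sums)


data = np.arange(10)
total = chunked_sum(data, chunk_size=4)  # blocks: [0..3], [4..7], [8..9]
```

A chunked array library automates this splitting and combining, and defers the work until a result is actually requested.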

eric-czech · 2020-09-25T14:02:08Z

These changes now also include:

Moving simulate_genotype_call_dataset into the top-level API (https://github.com/pystatgen/sgkit/issues/252)
Changing the default seed in simulate_genotype_call_dataset to be fixed -- I don't see any reason for it to not be reproducible by default (Hail does that with all simulation functions and it's a nice convenience).
Forcing the push to gh-pages in the docs workflow
Moving the dataset merge docs to Overview and creating a User Guide section with some stubs for sections I thought would be appropriate there: https://eric-czech.github.io/sgkit/user_guide.html

Could you take another look when you get a chance @alimanfoo / @jeromekelleher (https://eric-czech.github.io/sgkit)?
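
To illustrate the fixed-seed default mentioned above, here is a minimal sketch. `simulate_genotype_calls` is a hypothetical stand-in written for this note; it is not the signature or implementation of `simulate_genotype_call_dataset`.

```python
import numpy as np


def simulate_genotype_calls(n_variant, n_sample, n_ploidy=2, n_allele=2, seed=0):
    """Hypothetical simulator: random allele indexes with a fixed default seed."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_allele, size=(n_variant, n_sample, n_ploidy))


# Two calls with the default seed produce identical data.
a = simulate_genotype_calls(5, 3)
b = simulate_genotype_calls(5, 3)

# Passing seed=None opts back in to fresh randomness on each call.
c = simulate_genotype_calls(5, 3, seed=None)
```

With this design, examples in the docs render the same values on every build, and users who want different draws can still ask for them explicitly.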

tomwhite

Great work @eric-czech - thank you for writing it!

tomwhite · 2020-09-28T09:31:16Z

docs/index.rst

+down to smaller datasets and access simpler functionality for those that may be new to Python (though there is still
+a good bit of work to done on this front). See :ref:`getting_started` for more details.
+
+Sgkit is inspired heavily by `scikit-allel <https://scikit-allel.readthedocs.io/en/stable/>`_ and `hail <https://hail.is/docs/0.2/index.html>`_,


I think "Hail" is capitalized (see e.g. https://hail.is/docs/0.2/tutorials-landing.html).

tomwhite · 2020-09-28T09:35:01Z

docs/getting_started.rst

+
+- `Xarray <http://xarray.pydata.org/en/stable/>`_: N-D labeling for arrays and datasets
+- `Dask <https://docs.dask.org/en/latest/>`_: Parallel computing on chunked arrays
+- `Zarr <https://zarr.readthedocs.io/en/stable/>`_: Serialization for chunked arrays


"Storage for chunked arrays"

pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R31

tomwhite · 2020-09-28T09:38:50Z

docs/getting_started.rst

+
+There are currently no data models in the library that attempt to capture the complexity of many (or even common)
+analyses and the data structures that would support them -- operations are applied primarily to Xarray
+`Dataset <http://xarray.pydata.org/en/stable/data-structures.html#dataset>`_ objects instead. Users are free to manipulate data


Is it possible to reword in a more positive way?

"Sgkit uses Xarray Dataset objects as the data structure for representing biological data. Users are free..."

This a little better? pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R39-R42

tomwhite · 2020-09-28T09:40:39Z

docs/getting_started.rst

+`PLINK <https://www.cog-genomics.org/plink2>`_ or `BGEN <https://www.well.ox.ac.uk/~gav/bgen_format/>`_.
+This is a guideline however, and a ``Dataset`` seen in practice might include many more or fewer variables and dimensions.
+
+.. image:: _static/data-structures-xarray.jpg


How did you create this diagram? It would be good to include the original in source control so it can be edited.

Ah good call, I made it with Google Drawings, so I made it read-only to anyone with the link so that it can be duplicated and edited: pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R46-R48

tomwhite · 2020-09-28T09:53:12Z

docs/getting_started.rst

+
+This example shows how either can be used, though users should prefer the mask array where possible since
+its on-disk representation is typically far smaller after compression is applied.
+


Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

tomwhite · 2020-09-28T10:00:16Z

docs/getting_started.rst

+    # If an existing variable would be re-defined, a warning is thrown
+    import warnings
+    ds = sg.count_variant_alleles(ds)
+    with warnings.catch_warnings(record=True) as w:


This makes it look like users have to catch warnings, but you're having to do that here to avoid ipython failing (I think). Not sure if there's a better way of doing it - one option would be to have the commented-out code that would fail. Or add a comment explaining that you don't need to explicitly catch warnings.

I was trying to avoid the failure but more so I wanted a way to have it actually render somewhere. I couldn't find a way to have the directive do that. What's the problem with expecting users to catch warnings if they want to rather than having them print somewhere? I'm struggling to word a comment for it. Mind wording one for me and I'll drop it in?

FYI haven't forgotten about this and I opened https://github.com/pystatgen/sgkit/issues/288 to track it. I plan on working on https://github.com/pystatgen/sgkit/issues/287 soon so it may naturally go away.
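
One possible shape for such an example, sketched with the standard library only. `define_variable` is a hypothetical helper that mimics "warn when a variable would be re-defined"; it is not an sgkit function, and the comment wording is just a suggestion.

```python
import warnings


def define_variable(dataset, name, value):
    """Hypothetical helper: warn if `name` already exists, then (re)define it."""
    if name in dataset:
        warnings.warn(f"Variable {name!r} is being overwritten", UserWarning)
    return {**dataset, name: value}


# First definition: nothing exists yet, so no warning is issued.
ds = define_variable({}, "variant_allele_count", [1, 2])

# NOTE: catching the warning is optional. Without this context manager the
# call still succeeds and the warning is simply printed to stderr. It is
# captured here only so that the message can be displayed in the rendered docs.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ds = define_variable(ds, "variant_allele_count", [3, 4])

messages = [str(w.message) for w in caught]
```

The comment inside the block is the part aimed at readers: it says up front that `catch_warnings` is a rendering convenience rather than something users are expected to write.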

tomwhite · 2020-09-28T10:01:20Z

docs/getting_started.rst

+`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python
+data tools that improves code readability and reduces the probability of introducing accidental namespace collisions.
+Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with
+Xarray and Xandas operations in a single pipeline:


Cool new project? 😄

Still have a "Xandas" typo here.

🤦 ty. I read @tomwhite's comment twice and still couldn't see the problem for some reason. Xandas does sound cool though.

pystatgen/sgkit@dcb78db

tomwhite · 2020-09-28T10:03:49Z

docs/getting_started.rst

+    # Show statistics for one of the arrays to be used as a filter
+    ds_qc.variant_call_rate.to_series().describe()
+
+    # Build a pipeline that filters on call rate and computes Fst between two populations


Maybe add "(or cohorts)" to echo the usage below.

pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R268

tomwhite · 2020-09-28T10:05:33Z

docs/getting_started.rst

+- `Dataset.compute <http://xarray.pydata.org/en/stable/generated/xarray.Dataset.compute.html>`_ is called
+- `DataArray.compute <http://xarray.pydata.org/en/stable/generated/xarray.DataArray.compute.html>`_ is called
+- The ``DataArray.values`` attribute is referenced
+- Individual dask arrays are retrieved through the ``DataArray.data`` attribute and forced to evaluate via `Client.compute <https://distributed.dask.org/en/latest/api.html#distributed.Client.compute>`_, `dask.array.Array.compute <https://tutorial.dask.org/03_array.html#Example>`_ or by coercing them to another array type (e.g. using np.asarray)


Quote np.asarray and possibly link to its doc.

pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R303

tomwhite · 2020-09-28T10:08:36Z

docs/getting_started.rst

+    sg.count_variant_alleles(ds).variant_allele_count
+
+
+A primary design goal in sgkit is to facilitate ad-hoc analysis. There are many useful functions in


Nit: "ad hoc"

mergify · 2020-09-28T13:24:26Z

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

jeromekelleher · 2020-09-29T08:50:57Z

Did you consider jupyter-sphinx for rendering the code chunks @eric-czech? Inspired by you, I've been trying to use the ipython-directive in another project and quickly hit against its clunkiness. Jupyter-sphinx looks like it's much more active. Did you take a look at it?

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

eric-czech · 2020-09-29T21:21:13Z

Maybe move the jit example to a section in the User Guide? It's not something most users need to get started, and could be quite intimidating.

Ok @tomwhite, I refactored it to this pystatgen/sgkit@f7f5a03#diff-9bc8bf6e8d9db020cefce1fbc0426795R209 and moved the numba example to a separate section in the user guide at pystatgen/sgkit@f7f5a03#diff-0fcaa96196be65c5d918fd5099ea19a8R50. Sound good?

p.s. I don't know why git won't let me comment on the original thread inline even though it doesn't say it's outdated or otherwise inaccessible.

eric-czech · 2020-09-29T21:27:19Z

Did you consider jupyter-sphinx for rendering the code chunks @eric-czech? Inspired by you, I've been trying to use the ipython-directive in another project and quickly hit against its clunkiness. Jupyter-sphinx looks like it's much more active. Did you take a look at it?

Ah nice @jeromekelleher, that does look better! I opened https://github.com/pystatgen/sgkit/issues/287 to track the likely replacement.

eric-czech · 2020-09-29T21:31:11Z

@jeromekelleher/@tomwhite do you want to take another look at this? I think I got all of your suggestions, but let me know if there's anything else before I try to merge it.

jeromekelleher

Looks great! I spotted one unfixed typo, but merge away whenever.

jeromekelleher · 2020-09-30T07:25:25Z

docs/getting_started.rst

+`Method chaining <https://tomaugspurger.github.io/method-chaining.html>`_ is a common practice with Python
+data tools that improves code readability and reduces the probability of introducing accidental namespace collisions.
+Sgkit functions are compatible with this idiom by default and this example shows to use it in conjunction with
+Xarray and Xandas operations in a single pipeline:


Still have a "Xandas" typo here.

eric-czech · 2020-09-30T09:24:36Z

Merged manually because this included a small change in the docs workflow.

eric-czech force-pushed the docs branch from 0650c61 to 8261456 Compare September 24, 2020 13:43

eric-czech requested a review from hammer September 24, 2020 13:52

ravwojdyla mentioned this pull request Sep 24, 2020

sgkit-plink IO merger #277

Merged

alimanfoo reviewed Sep 24, 2020

jeromekelleher reviewed Sep 25, 2020

eric-czech removed the request for review from hammer September 25, 2020 13:46

tomwhite approved these changes Sep 28, 2020

mergify bot added the conflict PR conflict label Sep 28, 2020

eric-czech mentioned this pull request Sep 29, 2020

Add PCA usage to user guide #285

Open

eric-czech and others added 4 commits September 29, 2020 16:57

Add usage and design documentation sgkit-dev#87

1d1f055

Update docs/index.rst

d5a0542

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

Suggested changes

bf07905

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

Force push gh-pages branch in gh action

a38f8b7

eric-czech force-pushed the docs branch from 343c25f to f7f5a03 Compare September 29, 2020 20:59

mergify bot removed the conflict PR conflict label Sep 29, 2020

Suggested changes

d314bed

eric-czech force-pushed the docs branch from f7f5a03 to d314bed Compare September 29, 2020 21:18

eric-czech mentioned this pull request Sep 29, 2020

Investigate jupyter-sphinx and potentially replace ipython-directive in docs #287

Open

jeromekelleher approved these changes Sep 30, 2020

Fix typo

dcb78db

eric-czech added the auto-merge Auto merge label for mergify test flight label Sep 30, 2020

eric-czech merged commit d588980 into sgkit-dev:master Sep 30, 2020

eric-czech deleted the docs branch September 30, 2020 09:24

eric-czech mentioned this pull request Sep 30, 2020

Improve warnings usage in docs examples #288

Open

This was referenced Oct 1, 2020

Document our approach to missing values #33

Closed

Document data structures and design philosophy #87

Closed

eric-czech mentioned this pull request Oct 1, 2020

Remove create_*_dataset methods from public API #252

Closed

