PCA implementation #262

eric-czech · 2020-09-16T12:47:01Z

Implementation for https://github.com/pystatgen/sgkit/issues/95.

Notes:

Upstream Fixes

Some upstream fixes necessary for this were:

- Dask SVD broken for short-fat arrays (Add SVD support for short-and-fat arrays, dask/dask#6591)
- Dask SVD had poor performance on certain in-memory arrays (Improve SVD consistency and small array handling, dask/dask#6616)
- The svd_flip function in dask-ml (used by PCA/TruncatedSVD) would have forced in-memory evaluation, but it doesn't anymore (Add svd_flip (#6599), dask/dask#6613)
  - The fix for this also makes da.linalg.svd results consistent across platforms and input array chunkings
- Dask-ml PCA classes force eager evaluation (Allow lazily evaluated TruncateSVD results, dask/dask-ml#740)
- Randomized SVD broken for small arrays (Set lower bound on compression level for svd_compressed using rows and cols, dask/dask#6622)
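
The sign ambiguity that svd_flip addresses can be illustrated without dask. In a decomposition A = U S V^T, negating column j of U together with row j of V^T leaves the product unchanged, so different platforms or chunkings can return different signs for the same input. The following is a minimal pure-Python sketch of one common convention (make the largest-magnitude entry of each U column positive). The name `flip_signs` is hypothetical and this is not dask's implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def flip_signs(u: Matrix, vt: Matrix) -> Tuple[Matrix, Matrix]:
    """Normalize signs so each column of `u` has a positive largest-magnitude entry.

    `u` has shape (m, k) and `vt` has shape (k, n). Column j of `u` and
    row j of `vt` are flipped together, so u @ diag(s) @ vt is unchanged.
    """
    k = len(vt)
    signs = []
    for j in range(k):
        column = [row[j] for row in u]
        # The entry with the largest absolute value decides the sign of component j.
        pivot = max(column, key=abs)
        signs.append(-1.0 if pivot < 0 else 1.0)
    u_flipped = [[row[j] * signs[j] for j in range(k)] for row in u]
    vt_flipped = [[value * signs[j] for value in vt[j]] for j in range(k)]
    return u_flipped, vt_flipped


# Two equivalent factorizations that differ only by the sign of component 0.
u_a, vt_a = [[0.6, 0.8], [0.8, -0.6]], [[1.0, 0.0], [0.0, 1.0]]
u_b, vt_b = [[-0.6, 0.8], [-0.8, -0.6]], [[-1.0, 0.0], [0.0, 1.0]]
result_a = flip_signs(u_a, vt_a)
result_b = flip_signs(u_b, vt_b)
```

After normalization both inputs map to the same pair of factors, which is the property that makes results comparable across runs.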

Discussion

- I added https://github.com/pystatgen/sgkit/issues/285 to track adding more extensive examples
- This contains a temporary workaround for https://github.com/pystatgen/sgkit/issues/282 that should eventually be removed
- Dependencies added:
  - matplotlib and scikit-allel as dev dependencies
  - dask-ml as a core dependency
- I opened https://github.com/pystatgen/sgkit/issues/286 to discuss how to handle required variables in a calculation like this (I don't particularly like how it works in this PR at the moment)

TODO

eric-czech · 2020-09-29T19:02:11Z

fyi @jeromekelleher I didn't use msprime for anything here because the features mentioned in https://github.com/pystatgen/sgkit/issues/23#issue-653340614 still aren't released afaik. Looks like it's been quite a while since the last one (Dec 2019). Is that likely to change anytime soon?

jeromekelleher

LGTM @eric-czech, just a minor point on my aversion to hard-coding default values in the method signature. I know ≈ nothing about PCA though, so pinging @alimanfoo for a review.

jeromekelleher · 2020-09-30T07:46:46Z

sgkit/stats/pca.py

+    n_components: int = 10,
+    *,
+    ploidy: Optional[int] = None,
+    scaler: Union[BaseEstimator, str] = "patterson",


I tend to prefer having "None" as the default in the signature as it gives a bit more flexibility, both in the context of other parameters and over time as the API evolves. Are you sure that "patterson" will always be the right default value, in every context?

Same for the other default value here.

Hm I can't think of a scenario where the default scaler becomes a choice made based on other inputs, but making it more flexible in case one comes up sounds good to me. It definitely makes sense for the algorithm parameter. I changed both in pystatgen/sgkit@28a9b49.
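
The pattern discussed in this thread can be sketched as follows: the signature advertises `None`, and the concrete default is resolved inside the function, so it can later depend on other arguments without changing the public signature. The function `resolve_pca_options` is hypothetical and is not the code in sgkit/stats/pca.py; only the "patterson" and "tsqr" values come from this PR.

```python
from typing import Optional, Tuple


def resolve_pca_options(
    scaler: Optional[str] = None,
    algorithm: Optional[str] = None,
) -> Tuple[str, str]:
    """Resolve None placeholders to concrete defaults at call time."""
    if scaler is None:
        scaler = "patterson"
    if algorithm is None:
        algorithm = "tsqr"
    return scaler, algorithm


defaults = resolve_pca_options()
explicit = resolve_pca_options(algorithm="randomized")
```

Callers that pass nothing get the current defaults, and callers that pass a value are unaffected if the defaults change later.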

jeromekelleher · 2020-09-30T07:52:59Z

> fyi @jeromekelleher I didn't use msprime for anything here because the features mentioned in #23 (comment) still aren't released afaik. Looks like it's been quite a while since the last one (Dec 2019). Is that likely to change anytime soon?

There'll be a beta in the coming weeks, but hard to know when we'll ship a final release. I think you made the right choice, we can follow up with more detailed tests later.

mergify · 2020-09-30T09:24:44Z

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

tomwhite

This looks great @eric-czech. No real substantive comments from me. It's good to see lots of very comprehensive tests here, and also that most of the PCA implementation details are upstream.

tomwhite · 2020-10-01T12:11:27Z

sgkit/stats/pca.py

+    >>> import xarray as xr
+    >>> import numpy as np
+    >>> import sgkit as sg
+    >>> from sgkit.testing import simulate_genotype_call_dataset


This function is in the top-level now.

pystatgen/sgkit@237bc96

tomwhite · 2020-10-01T12:27:12Z

requirements.txt

@@ -1,6 +1,7 @@
 numpy
 xarray
 dask[array]
+dask-ml


This pulls in scikit-learn, Dask distributed and some other dependencies. That's probably OK, but I'm wondering whether there's any way to minimise transitive dependencies here.
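
One way to see what a requirement pulls in is to read an installed distribution's declared requirements with the standard-library `importlib.metadata` module. The helper below is a sketch with a hypothetical name. It lists only the direct requirements of one package and returns an empty list when the package is not installed, so it does not resolve the full transitive closure by itself.

```python
from importlib import metadata
from typing import List


def declared_requirements(package: str) -> List[str]:
    """Return the requirement strings declared by an installed distribution."""
    try:
        requirements = metadata.requires(package)
    except metadata.PackageNotFoundError:
        return []
    # `requires` returns None when the distribution declares no requirements.
    return sorted(requirements or [])


for requirement in declared_requirements("dask-ml"):
    print(requirement)
```

Running the same helper on each listed requirement in turn gives a rough picture of how far the dependency tree extends.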

eric-czech · 2020-10-01T19:34:48Z

@jeromekelleher should I wait for @alimanfoo to review or merge? Should have asked on the call.

codecov-commenter · 2020-10-01T19:37:29Z

Codecov Report

Merging #262 into master will increase coverage by 0.25%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   97.12%   97.37%   +0.25%     
==========================================
  Files          16       18       +2     
  Lines        1042     1144     +102     
==========================================
+ Hits         1012     1114     +102     
  Misses         30       30

Impacted Files	Coverage Δ
sgkit/__init__.py	`100.00% <100.00%> (ø)`
sgkit/stats/pca.py	`100.00% <100.00%> (ø)`
sgkit/stats/preprocessing.py	`100.00% <100.00%> (ø)`
sgkit/typing.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9516604...237bc96.

mergify · 2020-10-02T00:58:55Z

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

jeromekelleher · 2020-10-02T05:05:13Z

> @jeromekelleher should I wait for @alimanfoo to review or merge? Should have asked on the call.

No, if we haven't heard from him in a few days don't block (same goes for me, over the next few weeks). Merge away here I'd say.

ravwojdyla · 2020-10-02T11:52:38Z

sgkit/stats/pca.py

+                "`ploidy` must be specified explicitly when not present in dataset dimensions"
+            )
+        ploidy = ds.dims["ploidy"]
+    if scaler is None:


nit: you could also say: scaler = scaler or "patterson", same for algorithm
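
The two spellings are equivalent only when the argument is never falsy. A small sketch of the difference, using hypothetical helper names: `x or default` replaces every falsy value (such as an empty string or an object whose `__len__` returns 0), while an explicit `is None` check replaces only `None`. This matters here because `scaler` may be an estimator object as well as a string.

```python
from typing import Any


def default_with_or(scaler: Any) -> Any:
    # Falls back for None and for any other falsy value.
    return scaler or "patterson"


def default_with_is_none(scaler: Any) -> Any:
    # Falls back only when the caller passed None.
    return "patterson" if scaler is None else scaler


class EmptyPipeline:
    """Stand-in for an estimator-like object that is falsy because it is empty."""

    def __len__(self) -> int:
        return 0


custom = EmptyPipeline()
kept = default_with_is_none(custom)  # the caller's object is preserved
replaced = default_with_or(custom)   # silently swapped for the default
```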

ravwojdyla · 2020-10-02T11:53:05Z

sgkit/stats/pca.py

+            )
+    if algorithm is None:
+        algorithm = "tsqr"
+    if algorithm not in ["tsqr", "randomized"]:


nit: ["tsqr", "randomized"] should be a set {"tsqr", "randomized"}
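
A sketch of the suggested form, with a hypothetical constant and function name: a module-level frozenset states that order and duplicates are irrelevant and gives hash-based membership tests, although with only two entries the difference from a list is stylistic rather than a measurable speedup.

```python
from typing import Optional

# Hypothetical constant; the two names are the algorithms checked in this PR.
VALID_ALGORITHMS = frozenset({"tsqr", "randomized"})


def check_algorithm(algorithm: Optional[str] = None) -> str:
    """Return a validated algorithm name, defaulting to "tsqr"."""
    if algorithm is None:
        algorithm = "tsqr"
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(
            f"`algorithm` must be one of {sorted(VALID_ALGORITHMS)}, got {algorithm!r}"
        )
    return algorithm
```

Sorting the set in the error message keeps the text stable, since set iteration order is not guaranteed.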

ravwojdyla · 2020-10-02T12:00:51Z

sgkit/stats/pca.py

+    return conditional_merge_datasets(ds, new_ds, merge)
+
+
+def pca(


Should this function use sgkit.variables? That would mean:

- using variable references instead of strings
- validating input
- validating output
- updating the docs

Added the variables (pystatgen/sgkit@05fa77d#diff-cd77642bae81c13ab776800efe4ba498900a284bc2e4741d943b727f60b8b7e1R250-R284) and updated references (pystatgen/sgkit@05fa77d#diff-e66844198337edbcfa8f27fbcbcc58ef011272669fe9295119b0847c1b7cdf69R134). Squashed one too many commits before rebasing on the current master so I lost a clean delta, but that was basically all I changed.
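
The idea behind the question can be sketched independently of sgkit: a variable is declared once as an object carrying its name and expected properties, functions refer to that object rather than a bare string, and the same declaration drives input and output validation. Everything below (`VariableSpec`, `validate`, the dict-of-dims dataset stand-in, and the variable name) is hypothetical and is not the `sgkit.variables` API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class VariableSpec:
    """Hypothetical declaration of a dataset variable."""

    name: str
    ndim: int


# Declared once and referenced by functions instead of a string literal.
call_alternate_allele_count = VariableSpec("call_alternate_allele_count", ndim=2)


def validate(dataset: Dict[str, Tuple[str, ...]], spec: VariableSpec) -> None:
    """Check that `dataset` holds `spec.name` with the declared number of dims.

    `dataset` maps variable names to their dimension names, standing in for
    a real labelled dataset.
    """
    if spec.name not in dataset:
        raise ValueError(f"{spec.name} not present in dataset")
    ndim = len(dataset[spec.name])
    if ndim != spec.ndim:
        raise ValueError(f"{spec.name} must have {spec.ndim} dims, got {ndim}")


ds = {"call_alternate_allele_count": ("variants", "samples")}
validate(ds, call_alternate_allele_count)  # passes silently
```

Because the name lives in one declaration, a rename or a changed shape requirement is made in a single place rather than in every function that touches the variable.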

ravwojdyla · 2020-10-02T12:10:48Z

sgkit/tests/test_preprocessing.py

+    if use_nan and ac.dtype.kind != "f":
+        return
+    if use_nan:
+        # Test that nan and negative sentinel values


there is something wrong with this comment (?)

good call, fixed

ravwojdyla · 2020-10-02T12:33:11Z

sgkit/stats/pca.py

+
+
+def pca_stats(ds: Dataset, est: BaseEstimator, *, merge: bool = True) -> Dataset:
+    """ Extract attributes from PCA estimator """


nit: when you have a single line docstring, you add a space prefix/suffix, why? Is this documented somewhere?

https://github.com/pystatgen/sgkit/issues/325

mergify · 2020-10-14T13:07:41Z

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

codecov-io · 2020-10-14T13:18:54Z

Codecov Report

Merging #262 into master will increase coverage by 0.19%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   96.35%   96.55%   +0.19%     
==========================================
  Files          26       28       +2     
  Lines        1866     1972     +106     
==========================================
+ Hits         1798     1904     +106     
  Misses         68       68

Impacted Files	Coverage Δ
sgkit/__init__.py	`100.00% <100.00%> (ø)`
sgkit/stats/pca.py	`100.00% <100.00%> (ø)`
sgkit/stats/preprocessing.py	`100.00% <100.00%> (ø)`
sgkit/typing.py	`100.00% <100.00%> (ø)`
sgkit/variables.py	`96.29% <100.00%> (+0.17%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b83ca1b...05fa77d.

eric-czech force-pushed the pca branch 2 times, most recently from 6d198c6 to 865c4e8 Compare September 16, 2020 12:58

hammer mentioned this pull request Sep 17, 2020

PCA #227

Closed

eric-czech force-pushed the pca branch 8 times, most recently from aaf48d1 to 40e9612 Compare September 29, 2020 16:03

This was referenced Sep 29, 2020

Add count_call_alternate_alleles function #282

Open

Add PCA usage to user guide #285

Open

eric-czech marked this pull request as ready for review September 29, 2020 16:49

eric-czech requested review from jeromekelleher and tomwhite September 29, 2020 16:49

eric-czech force-pushed the pca branch from 40e9612 to fde5a42 Compare September 29, 2020 16:56

jeromekelleher reviewed Sep 30, 2020

jeromekelleher requested a review from alimanfoo September 30, 2020 07:52

mergify bot added the conflict PR conflict label Sep 30, 2020

eric-czech force-pushed the pca branch from fde5a42 to 6dee80e Compare September 30, 2020 09:44

mergify bot removed the conflict PR conflict label Sep 30, 2020

tomwhite approved these changes Oct 1, 2020

eric-czech force-pushed the pca branch from 28a9b49 to 237bc96 Compare October 1, 2020 19:32

mergify bot added the conflict PR conflict label Oct 2, 2020

ravwojdyla requested changes Oct 2, 2020

eric-czech force-pushed the pca branch from 237bc96 to 4ef0cbb Compare October 14, 2020 12:38

mergify bot removed the conflict PR conflict label Oct 14, 2020

eric-czech force-pushed the pca branch from b27bab2 to a224270 Compare October 14, 2020 13:04

mergify bot added the conflict PR conflict label Oct 14, 2020

PCA implementation sgkit-dev#95

05fa77d

eric-czech force-pushed the pca branch from 64c9ff4 to 05fa77d Compare October 14, 2020 13:09

mergify bot removed the conflict PR conflict label Oct 14, 2020

eric-czech mentioned this pull request Oct 14, 2020

Document conventions that fall outside of code formatting #325

Open

eric-czech added the auto-merge Auto merge label for mergify test flight label Oct 14, 2020

mergify bot merged commit 162447e into sgkit-dev:master Oct 14, 2020

eric-czech deleted the pca branch October 20, 2020 11:27

eric-czech mentioned this pull request Apr 20, 2021

PyData prototype genetics method implementations related-sciences/gwas-analysis#30

Open

9 tasks

tomwhite mentioned this pull request Jan 4, 2023

PCA User Story #95

Closed

hammer mentioned this pull request Jan 9, 2024

QC section sgkit-dev/sgkit-publication#89

Open
