Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCA implementation #262

Merged
merged 1 commit into from
Oct 14, 2020
Merged

PCA implementation #262

merged 1 commit into from
Oct 14, 2020

Conversation

eric-czech
Copy link
Collaborator

@eric-czech eric-czech commented Sep 16, 2020

Implementation for https://github.com/pystatgen/sgkit/issues/95.

Notes:

Upstream Fixes

Some upstream fixes necessary for this were:

Discussion

TODO

  • More tests
    • f4/f8 dtype preservation
    • More complex chunking tests (this is the source of many problems)
    • Add randomized/deterministic comparison
    • Ensure lazy evaluation
    • Validate sign stability across random states and chunkings
    • Add missing data warning
    • Test for inputs on different array backends
    • Patterson scaler
      • test vs skallel
      • test with missing data
      • test dtype preservation
  • Add return_estimator option (or save scaler properties in dataset)
  • Throw informative errors when chunking not compatible with algorithm
  • Decide whether or not randomized pca should be a separate function since it is the only option for fully-chunked arrays
  • Lift count_call_alternate_alleles into issue
  • Improve test speed
  • Add docs

@eric-czech
Copy link
Collaborator Author

eric-czech commented Sep 29, 2020

fyi @jeromekelleher I didn't use msprime for anything here because the features mentioned in https://github.com/pystatgen/sgkit/issues/23#issue-653340614 still aren't released afaik. Looks like it's been quite a while since the last one (Dec 2019). Is that likely to change anytime soon?

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @eric-czech, just a minor point on my aversion to hard-coding default values in the method signature. I know \approx nothing about PCA though, so pinging @alimanfoo for a review.

n_components: int = 10,
*,
ploidy: Optional[int] = None,
scaler: Union[BaseEstimator, str] = "patterson",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to prefer having "None" as the default in the signature as it gives a bit more flexibility, both in the context of other parameters and over time as the API evolves. Are you sure that "patterson" will always be the right default value, in every context?

Same for other value here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I can't think of a scenario where the default scaler becomes a choice made based on other inputs, but making it more flexible in case one comes up sounds good to me. It definitely makes sense for the algorithm parameter. I changed both in pystatgen/sgkit@28a9b49.

@jeromekelleher
Copy link
Collaborator

fyi @jeromekelleher I didn't use msprime for anything here because the features mentioned in #23 (comment) still aren't released afaik. Looks like it's been quite a while since the last one (Dec 2019). Is that likely to change anytime soon?

There'll be a beta in the coming weeks, but hard to know when we'll ship a final release. I think you made the right choice, we can follow up with more detailed tests later.

@mergify
Copy link
Contributor

mergify bot commented Sep 30, 2020

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 30, 2020
@mergify mergify bot removed the conflict PR conflict label Sep 30, 2020
Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @eric-czech. No real substantive comments from me. It's good to see lots of very comprehensive tests here, and also that most of the PCA implementation details are upstream.

>>> import xarray as xr
>>> import numpy as np
>>> import sgkit as sg
>>> from sgkit.testing import simulate_genotype_call_dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is in the top-level now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1,6 +1,7 @@
numpy
xarray
dask[array]
dask-ml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pulls in scikit-learn, Dask distributed and some other dependencies. That's probably OK, but thinking about if there's any way to minimise transitive dependencies here.

@eric-czech
Copy link
Collaborator Author

@jeromekelleher should I wait for @alimanfoo to review or merge? Should have asked on the call.

@codecov-commenter
Copy link

Codecov Report

Merging #262 into master will increase coverage by 0.25%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   97.12%   97.37%   +0.25%     
==========================================
  Files          16       18       +2     
  Lines        1042     1144     +102     
==========================================
+ Hits         1012     1114     +102     
  Misses         30       30              
Impacted Files Coverage Δ
sgkit/__init__.py 100.00% <100.00%> (ø)
sgkit/stats/pca.py 100.00% <100.00%> (ø)
sgkit/stats/preprocessing.py 100.00% <100.00%> (ø)
sgkit/typing.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9516604...237bc96. Read the comment docs.

@mergify
Copy link
Contributor

mergify bot commented Oct 2, 2020

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Oct 2, 2020
@jeromekelleher
Copy link
Collaborator

@jeromekelleher should I wait for @alimanfoo to review or merge? Should have asked on the call.

No, if we haven't heard from him in a few days don't block (same goes for me, over the next few weeks). Merge away here I'd say.

"`ploidy` must be specified explicitly when not present in dataset dimensions"
)
ploidy = ds.dims["ploidy"]
if scaler is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could also say: scaler = scaler or "patterson", same for algorithm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

)
if algorithm is None:
algorithm = "tsqr"
if algorithm not in ["tsqr", "randomized"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ["tsqr", "randomized"] should be a set {"tsqr", "randomized"}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

return conditional_merge_datasets(ds, new_ds, merge)


def pca(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function use sgkit.variables, which would mean:

  • using variable references instead of strings
  • validate input
  • validate output
  • update the docs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the variables (pystatgen/sgkit@05fa77d#diff-cd77642bae81c13ab776800efe4ba498900a284bc2e4741d943b727f60b8b7e1R250-R284) and updated references (pystatgen/sgkit@05fa77d#diff-e66844198337edbcfa8f27fbcbcc58ef011272669fe9295119b0847c1b7cdf69R134). Squashed one too many commits before rebasing on the current master so I lost a clean delta, but that was basically all I changed.

if use_nan and ac.dtype.kind != "f":
return
if use_nan:
# Test that nan and negative sentinel values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is something wrong with this comment (?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, fixed



def pca_stats(ds: Dataset, est: BaseEstimator, *, merge: bool = True) -> Dataset:
""" Extract attributes from PCA estimator """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: when you have a single line docstring, you add a space prefix/suffix, why? Is this documented somewhere?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mergify
Copy link
Contributor

mergify bot commented Oct 14, 2020

This PR has conflicts, @eric-czech please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Oct 14, 2020
@codecov-io
Copy link

codecov-io commented Oct 14, 2020

Codecov Report

Merging #262 into master will increase coverage by 0.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
+ Coverage   96.35%   96.55%   +0.19%     
==========================================
  Files          26       28       +2     
  Lines        1866     1972     +106     
==========================================
+ Hits         1798     1904     +106     
  Misses         68       68              
Impacted Files Coverage Δ
sgkit/__init__.py 100.00% <100.00%> (ø)
sgkit/stats/pca.py 100.00% <100.00%> (ø)
sgkit/stats/preprocessing.py 100.00% <100.00%> (ø)
sgkit/typing.py 100.00% <100.00%> (ø)
sgkit/variables.py 96.29% <100.00%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b83ca1b...05fa77d. Read the comment docs.

@eric-czech eric-czech added the auto-merge Auto merge label for mergify test flight label Oct 14, 2020
@mergify mergify bot merged commit 162447e into sgkit-dev:master Oct 14, 2020
@eric-czech eric-czech deleted the pca branch October 20, 2020 11:27
@tomwhite tomwhite mentioned this pull request Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants