Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary stats #102

Merged
merged 8 commits into from
Sep 14, 2020
Merged

Summary stats #102

merged 8 commits into from
Sep 14, 2020

Conversation

tomwhite
Copy link
Collaborator

This is a draft implementation of #29 (based on @eric-czech's code there) for discussion. I'm quite happy for someone else to take if over if interested (e.g. @daletovar, @jerowe).

API docs need to be added still. Also sample summary stats.

A couple of points for discussion:

  1. Method naming. We currently have count_alleles (verb), but some of the new methods don't easily fit into that pattern: e.g. call_rate, variant_stats. Perhaps insisting the function name for each method is a verb is going too far, and we should just allow the name that we think sounds best (this seems to be the approach taken by Hail and Glow which both have a mixture of both styles).
  2. Conditionally computing variables, and merging. The allele_frequency function calls count_alleles, but perhaps it should check first if it has already been computed (I think @eric-czech does something similar in LD prune, for example). In that case, should it return the variant_allele_count variable in the returned dataset? I'm wondering whether we should actually merge the new variables with the original dataset as a general rule. The original dataset would be unchanged, and this would mean the user no longer had to call merge themselves, which might work better in pipelines, for example.

@jeromekelleher
Copy link
Collaborator

@daletovar, @jerowe I think this would be a great place to start getting some real code merged. Would either of you like to work with @tomwhite on getting this fleshed out?

@eric-czech
Copy link
Collaborator

eric-czech commented Aug 10, 2020

Method naming. We currently have count_alleles (verb), but some of the new methods don't easily fit into that pattern

I've actually been struggling with that a good bit @tomwhite. I keep trying to think of a good verb for things like regenie, linear_regression, or ld_prune but the fact that they're generally going to correspond to lazily defined arrays seems to put them between two stools for verbs like get and run. I had been (and still am) generally picking nouns for method names but only because I can't think of a good alternative. The nouns come easily though and given that there are unlikely to be many things in the API other than functions, perhaps verbs aren't as useful as a distinguishing modifier?

@ericdatakelly
Copy link

@daletovar, @jerowe I think this would be a great place to start getting some real code merged. Would either of you like to work with @tomwhite on getting this fleshed out?

@jeromekelleher: @daletovar is out of town until Thursday but I think he'd be interested in this.

@eric-czech
Copy link
Collaborator

To more of @tomwhite's original questions:

In that case, should it return the variant_allele_count variable in the returned dataset?

From an efficiency perspective, I think it would be ok to define upstream variables if not provided and then not merge/return them. I had thought Dask would be smart enough to know when the same computation is defined twice and in some experiments I haven't been able to find a counterexample, meaning that if we were to throw away the variant_allele_count created internally within an allele_frequency right before a user calls count_alleles, the Dask graph would be the same.

I'm wondering whether we should actually merge the new variables with the original dataset as a general rule. The original dataset would be unchanged, and this would mean the user no longer had to call merge themselves, which might work better in pipelines, for example.

I opened https://github.com/pystatgen/sgkit/issues/103 with some thoughts on that so we can keep that discussion going.

@daletovar
Copy link
Collaborator

@tomwhite, I'm happy to help with this. Would you prefer we keep this PR open and I fork your repo to collaborate?

@tomwhite
Copy link
Collaborator Author

@daletovar thanks for picking this up. I don't have a strong opinion on whether you fork this, or start from scratch.

)


def genotype_count(ds: Dataset, dim: Dimension) -> Dataset:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed.

@daletovar
Copy link
Collaborator

@tomwhite, I see. Well, do we know what all summary stats we want to have?

@tomwhite
Copy link
Collaborator Author

After discussion with @daletovar I've picked this one up again.

I've rebased on top of #217 (which is still a draft) so that the allele counting functions return datasets, and to use the code to merge datasets. I've also added docs, and a test for the case where variant_allele_count already exists in the dataset.

@tomwhite tomwhite force-pushed the summary-stats branch 2 times, most recently from 6d62c8a to 5e9abae Compare September 2, 2020 08:19
@tomwhite tomwhite marked this pull request as ready for review September 2, 2020 08:25
Copy link
Collaborator

@eric-czech eric-czech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took so long to approve this, but LGTM.

@jeromekelleher jeromekelleher added the auto-merge Auto merge label for mergify test flight label Sep 14, 2020
@jeromekelleher
Copy link
Collaborator

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Sep 14, 2020

Command rebase: success

Branch has been successfully rebased

@tomwhite
Copy link
Collaborator Author

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Sep 14, 2020

Command update: success

Branch has been successfully updated

@jeromekelleher
Copy link
Collaborator

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Sep 14, 2020

Command update: success

Branch has been successfully updated

@jeromekelleher
Copy link
Collaborator

The updates weren't working as we hadn't merged #244. Hopefully this will sort it out now.

@mergify mergify bot merged commit 36cd66e into sgkit-dev:master Sep 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants