Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gaussian_mixture_model_karyotype_assignment function to assign sex karyotype using Gaussian mixture models #478

Merged
merged 21 commits into from
Sep 16, 2022

Conversation

jkgoodrich
Copy link
Contributor

Modify get_ploidy_cutoffs to accept an expression for grouping samples into 'XX' and 'XY' groupings.
Modify annotate_sex to allow for sex karyotype imputation to use a gaussian mixture model to determine grouping of samples into 'XX' and 'XY' for use in get_ploidy_cutoffs.
Modify annotate_sex to make computing f-stat and inferring sex karyotype optional.

@jkgoodrich
Copy link
Contributor Author

jkgoodrich commented Sep 14, 2022

If helpful for testing:

import hail as hl
from gnomad_qc.v4.resources.basics import (
    calling_intervals,
    get_gnomad_v4_vds,
)
from bokeh.plotting import output_notebook, show

hl.init(default_reference='GRCh38')
output_notebook()

vds = get_gnomad_v4_vds(
    remove_hard_filtered_samples=False,
    remove_hard_filtered_samples_no_sex=True,
    test=True,
)
mt = hl.vds.to_merged_sparse_mt(vds)
calling_intervals=calling_intervals(interval_name='intersection', calling_interval_padding=50).ht()
freq_ht = hl.read_table("gs://gnomad/v4.0/sample_qc/exomes/gnomad.exomes.v4.0.f_stat_sites.ht")
sex_ht = annotate_sex(vds, included_intervals=calling_intervals, sites_ht=freq_ht, f_stat_cutoff=-1, aaf_expr="AF", gt_expr="LGT", use_gaussian_mixture_model=True)
sex_ht2 = annotate_sex(mt, included_intervals=calling_intervals, sites_ht=freq_ht, f_stat_cutoff=-1, aaf_expr="AF", gt_expr="LGT", use_gaussian_mixture_model=True)
sex_ht3 = annotate_sex(vds, included_intervals=calling_intervals, sites_ht=freq_ht, f_stat_cutoff=-1, aaf_expr="AF", gt_expr="LGT", use_gaussian_mixture_model=True, variants_only_x_ploidy=True)
sex_ht4 = annotate_sex(mt, included_intervals=calling_intervals, sites_ht=freq_ht, f_stat_cutoff=-1, aaf_expr="AF", gt_expr="LGT", use_gaussian_mixture_model=True, variants_only_x_ploidy=True)

show(hl.plot.scatter(sex_ht.chrX_ploidy, sex_ht.chrY_ploidy, label=sex_ht.sex_karyotype))
show(hl.plot.scatter(sex_ht2.chrX_ploidy, sex_ht2.chrY_ploidy, label=sex_ht2.sex_karyotype))
show(hl.plot.scatter(sex_ht3.chrX_ploidy, sex_ht3.chrY_ploidy, label=sex_ht3.sex_karyotype))
show(hl.plot.scatter(sex_ht4.chrX_ploidy, sex_ht4.chrY_ploidy, label=sex_ht4.sex_karyotype))

Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. Mostly just small suggestions on docs and a couple errors to raise. However, I hit a bug in the old code when testing your example 2 and 4 above, that I cannot comment on in the PR but hit when testing your example 2 and 4 above -- using a MT in the annotate_sex function. The call to impute_sex_ploidy returns a {contig}_mean_dp field and not an autosomal_mean_dp

f"{chrom}_mean_dp": hl.agg.filter(
and
f"{chrom}_mean_dp": hl.agg.sum(
. In L455 the function attempts to rename autosomal_mean_dp if variants only is true but that field doesnt exist, and if it is false, it attempts to rename to {contig}_mean_dp which already exists. Unless there is a subsequent PR that will update this and it was accidentally split apart from this PR, well need to fix it. Do you want me to make a ticket for this or should this PR cover it?

gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/sex.py Outdated Show resolved Hide resolved
gnomad/sample_qc/sex.py Show resolved Hide resolved
gnomad/sample_qc/sex.py Outdated Show resolved Hide resolved
gnomad/sample_qc/sex.py Show resolved Hide resolved
gnomad/sample_qc/sex.py Show resolved Hide resolved
jkgoodrich and others added 3 commits September 15, 2022 14:59
Co-authored-by: Mike Wilson <mwilson@broadinstitute.org>
…nstitute/gnomad_methods into jg/add_gaussian_karyotype

� Conflicts:
�	gnomad/sample_qc/pipeline.py
@jkgoodrich
Copy link
Contributor Author

Yes, the bug you found does get fixed in the next PR, but I might as well fix it here too

Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
gnomad/sample_qc/pipeline.py Outdated Show resolved Hide resolved
Co-authored-by: Mike Wilson <mwilson@broadinstitute.org>
@jkgoodrich jkgoodrich merged commit f83c326 into main Sep 16, 2022
@jkgoodrich jkgoodrich deleted the jg/add_gaussian_karyotype branch September 16, 2022 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants