Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional_samples_to_drop option to run_pca_with_relateds #489

Merged
merged 5 commits into from
Oct 17, 2022

Conversation

klaricch
Copy link
Contributor

@klaricch klaricch commented Oct 5, 2022

-Resolves issue where the autosomes_only filter was ignored if related_samples_to_drop was provided to run_pca_with_relateds
-Adds an additional_samples_to_drop option to run_pca_with_relateds, which will remove additional provided samples when generating the PCs and then project them in the PC space

@klaricch klaricch requested a review from jkgoodrich October 5, 2022 15:54
@jkgoodrich jkgoodrich changed the title add outlier_samples_to_drop option to run_pca_with_relateds Add outlier_samples_to_drop option to run_pca_with_relateds Oct 5, 2022
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly one comment that will change a few names and comments/docstrings if you agree.

@@ -259,52 +259,60 @@ def assign_population_pcs(
def run_pca_with_relateds(
qc_mt: hl.MatrixTable,
related_samples_to_drop: Optional[hl.Table],
outlier_samples_to_drop: Optional[hl.Table],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought is to modify this to additional samples to drop or something like that. Right now for removing unreleasable samples we just combine them with the related sample HT that we pass to this function, but we could just pass that table here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like additional samples to drop. Don't we drop the unreleasable samples and non-high-quality samples completely before running this step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope https://github.com/broadinstitute/gnomad_qc/blob/a828b1529a439fc41ecc73647a442c7e63897701/gnomad_qc/v3/sample_qc/sample_qc.py#L457

    if not include_unreleasable_samples:
        logger.info("Excluding unreleasable samples for PCA.")
        samples_to_drop = samples_to_drop.union(
            qc_mt.filter_cols(
                ~project_meta.ht()[qc_mt.col_key].releasable
                | project_meta.ht()[qc_mt.col_key].exclude
            )
            .cols()
            .select()
        )
    else:
        logger.info("Including unreleasable samples for PCA")

    return run_pca_with_relateds(qc_mt, samples_to_drop, n_pcs=n_pcs)

We do drop samples that fail hard filtering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok, for the subpop analysis the unreleasable samples are dropped

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's fine

gnomad/sample_qc/ancestry.py Outdated Show resolved Hide resolved
@klaricch klaricch requested a review from jkgoodrich October 5, 2022 17:27
@klaricch klaricch changed the title Add outlier_samples_to_drop option to run_pca_with_relateds Add additional_samples_to_drop option to run_pca_with_relateds Oct 5, 2022
Copy link
Contributor

@jkgoodrich jkgoodrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, only thing is that we might want to add a breaking change label because if anyone used run_pca_with_relateds(qc_mt, related_samples_to_drop_table, n_pcs) it won't work

gnomad/sample_qc/ancestry.py Outdated Show resolved Hide resolved
Co-authored-by: jkgoodrich <33063077+jkgoodrich@users.noreply.github.com>
@klaricch klaricch merged commit 229e8c5 into main Oct 17, 2022
@jkgoodrich jkgoodrich deleted the kl/subpop_outliers branch April 5, 2023 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants