Add n_partitions option to get_qc_mt before LD pruning #472

Merged: 8 commits, Aug 16, 2022
Showing changes from 1 commit
14 changes: 12 additions & 2 deletions gnomad/sample_qc/pipeline.py
@@ -122,6 +122,7 @@ def get_qc_mt(
     filter_exome_low_coverage_regions: bool = False,
     high_conf_regions: Optional[List[str]] = None,
     checkpoint_path: Optional[str] = None,
+    partitions: Optional[int] = None,
 ) -> hl.MatrixTable:
     """
     Create a QC-ready MT.
@@ -149,6 +150,7 @@ def get_qc_mt(
     :param filter_exome_low_coverage_regions: If set, only high coverage exome regions (computed from gnomAD) are kept
     :param high_conf_regions: If given, the data will be filtered to only include variants in those regions
     :param checkpoint_path: If given, the QC MT will be checkpointed to the specified path before running LD pruning. If not specified, persist will be used instead.
+    :param partitions: If given, the QC MT will be repartitioned to the specified number of partitions before running LD pruning. 'checkpoint_path' must also be specified, as the MT will first be written to the checkpoint_path before being reread with the new number of partitions.
     :return: Filtered MT
     """
     logger.info("Creating QC MatrixTable")
@@ -157,6 +159,9 @@ def get_qc_mt(
             "The LD-prune step of this function requires non-preemptible workers only!"
         )

+    if partitions and not checkpoint_path:
+        raise ValueError("checkpoint_path must be supplied if repartitioning!")
+
     qc_mt = filter_low_conf_regions(
         mt,
         filter_lcr=filter_lcr,
@@ -182,8 +187,13 @@ def get_qc_mt(

     if ld_r2 is not None:
         if checkpoint_path:
-            logger.info("Checkpointing the MT and LD pruning")
-            qc_mt = qc_mt.checkpoint(checkpoint_path, overwrite=True)
+            if partitions:
+                logger.info("Repartitioning MT and LD pruning")
+                qc_mt.write(checkpoint_path, overwrite=True)
+                qc_mt = hl.read_matrix_table(checkpoint_path, _n_partitions=partitions)
+            else:
+                logger.info("Checkpointing the MT and LD pruning")
+                qc_mt = qc_mt.checkpoint(checkpoint_path, overwrite=True)
         else:
             logger.info("Persisting the MT and LD pruning")
             qc_mt = qc_mt.persist()
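
The branching this commit adds before LD pruning can be sketched without Hail. The helper below is hypothetical and not part of gnomad or this PR: `choose_materialization` and its return labels are illustrative names only. It is a minimal sketch, assuming the same truthiness checks on `partitions` and `checkpoint_path` as the diff, of which step would be taken for a given argument combination.

```python
from typing import Optional


def choose_materialization(
    checkpoint_path: Optional[str] = None,
    partitions: Optional[int] = None,
) -> str:
    """Hypothetical helper: name the step taken before LD pruning."""
    # Same guard as the diff: repartitioning needs a path to write the MT to.
    if partitions and not checkpoint_path:
        raise ValueError("checkpoint_path must be supplied if repartitioning!")
    if checkpoint_path:
        # With partitions: write, then reread with the requested partition count.
        # Without partitions: a plain checkpoint.
        return "write_and_reread" if partitions else "checkpoint"
    # No checkpoint path at all: fall back to persist.
    return "persist"


if __name__ == "__main__":
    # The bucket path is a placeholder, not a real location.
    print(choose_materialization())
    print(choose_materialization("gs://example-bucket/qc.mt"))
    print(choose_materialization("gs://example-bucket/qc.mt", partitions=500))
```

Because the checks rely on truthiness, `partitions=0` behaves the same as `partitions=None` in this sketch, which matches how the condition in the diff is written.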