Vcf export #241

Merged
merged 62 commits into from
Aug 4, 2020
60b8640
Added constants, make_label_combos, generic_field_check, and make_fil…
ch-kr Jul 21, 2020
3decb5f
Added AS_FIELDS, SITE_FIELDS constants and added function add_as_info…
ch-kr Jul 21, 2020
c4ad6b2
Updated make_as_info_dict and added more constants (RF_FIELDS, VQSR_F…
ch-kr Jul 21, 2020
6c9b95f
Added constants for entries to select during export
ch-kr Jul 21, 2020
a9c0354
Added make_info_dict to vcf.py
ch-kr Jul 21, 2020
20f5d52
Removed unnecessary pop constants (can be imported from ancestry.py)
ch-kr Jul 22, 2020
2117d82
Added make hist bin edges expr function to vcf.py, also removed types…
ch-kr Jul 22, 2020
78084a7
Added make hist dict
ch-kr Jul 22, 2020
5c698b7
Updated changelog
ch-kr Jul 22, 2020
e30d67b
Added SORT_ORDER and sample_sum_check to vcf.py
ch-kr Jul 22, 2020
706864d
Moved make combo header text to vcf.py
ch-kr Jul 22, 2020
d058743
Fixed imports and reformatted with black
ch-kr Jul 22, 2020
d1bbacd
Changed docstring for make_hist_bin_edges_expr
ch-kr Jul 22, 2020
424fcc1
Added set female metrics to NA
ch-kr Jul 22, 2020
d397b87
Updated docstring in set female metrics to na
ch-kr Jul 22, 2020
ec0ddd6
Removed transmitted singleton and sibling singleton from SITE_FIELDS
ch-kr Jul 28, 2020
647d8e6
Updated make_label_combos docstring and fixed values for some constan…
ch-kr Jul 28, 2020
973c5ef
Updated docstring for generic_field_check
ch-kr Jul 28, 2020
ae3fcf4
Updated docstring for make_filters_sanity_check_expr
ch-kr Jul 28, 2020
d3967aa
Updated make_filters_sanity_check_expr
ch-kr Jul 28, 2020
5313ea1
Updated docstring for make_combo_header_text
ch-kr Jul 28, 2020
b556441
Updated docstring for make_info_dict
ch-kr Jul 28, 2020
13ac89e
Updated make_combo_header_text to take dict as input
ch-kr Jul 28, 2020
1202e2e
Updated make_info_dict to pass in sort order
ch-kr Jul 28, 2020
759cfac
Addressed rest of review comments
ch-kr Jul 29, 2020
fb14af8
Add BaseQRankSum to SITE_FIELDS const
ch-kr Jul 30, 2020
8d1f38c
Addressed review comments
ch-kr Jul 30, 2020
92dbedf
Update cache and setup-python Actions (#244)
nawatts Jul 31, 2020
d18e2bb
Added constants, make_label_combos, generic_field_check, and make_fil…
ch-kr Jul 21, 2020
26956b9
Added AS_FIELDS, SITE_FIELDS constants and added function add_as_info…
ch-kr Jul 21, 2020
02f4f56
Updated make_as_info_dict and added more constants (RF_FIELDS, VQSR_F…
ch-kr Jul 21, 2020
e7edd87
Added constants for entries to select during export
ch-kr Jul 21, 2020
922a674
Added make_info_dict to vcf.py
ch-kr Jul 21, 2020
9ff11ca
Removed unnecessary pop constants (can be imported from ancestry.py)
ch-kr Jul 22, 2020
1bb7162
Added make hist bin edges expr function to vcf.py, also removed types…
ch-kr Jul 22, 2020
48bf3bb
Added make hist dict
ch-kr Jul 22, 2020
e736da0
Updated changelog
ch-kr Jul 22, 2020
ffaa908
Added SORT_ORDER and sample_sum_check to vcf.py
ch-kr Jul 22, 2020
00b92ec
Moved make combo header text to vcf.py
ch-kr Jul 22, 2020
6ed92b0
Fixed imports and reformatted with black
ch-kr Jul 22, 2020
191b073
Changed docstring for make_hist_bin_edges_expr
ch-kr Jul 22, 2020
d8a3e7a
Added set female metrics to NA
ch-kr Jul 22, 2020
51a5a73
Updated docstring in set female metrics to na
ch-kr Jul 22, 2020
6b1969d
Removed transmitted singleton and sibling singleton from SITE_FIELDS
ch-kr Jul 28, 2020
2af3587
Updated make_label_combos docstring and fixed values for some constan…
ch-kr Jul 28, 2020
ca57708
Updated docstring for generic_field_check
ch-kr Jul 28, 2020
4428b0c
Updated docstring for make_filters_sanity_check_expr
ch-kr Jul 28, 2020
75e997d
Updated make_filters_sanity_check_expr
ch-kr Jul 28, 2020
fdfed1a
Updated docstring for make_combo_header_text
ch-kr Jul 28, 2020
2c47c63
Updated docstring for make_info_dict
ch-kr Jul 28, 2020
41fb66d
Updated make_combo_header_text to take dict as input
ch-kr Jul 28, 2020
2ddd8f1
Updated make_info_dict to pass in sort order
ch-kr Jul 28, 2020
3a80a41
Addressed rest of review comments
ch-kr Jul 29, 2020
1e48d12
Add BaseQRankSum to SITE_FIELDS const
ch-kr Jul 30, 2020
ff61379
Addressed review comments
ch-kr Jul 30, 2020
ca49ac5
Updated docstring for sample_sum_check
ch-kr Jul 31, 2020
5a27577
Rebasing branch
ch-kr Jul 31, 2020
261c38d
Updated some docstrings addressing review comments
ch-kr Aug 3, 2020
d176f68
Created assessment folder and moved sample_sum_check, generic_field_c…
ch-kr Aug 3, 2020
1640b3f
Forgot to commit changes in vcf.py when moving to assessment
ch-kr Aug 3, 2020
6e67800
Updated generic field check docstring
ch-kr Aug 3, 2020
f021255
Added option to check for additional filters to make_filters_sanity_c…
ch-kr Aug 4, 2020
6 changes: 4 additions & 2 deletions .github/workflows/ci.yml
@@ -12,18 +12,19 @@ jobs:
       - name: Checkout
         uses: actions/checkout@v2
       - name: Setup Python
-        uses: actions/setup-python@v1
+        uses: actions/setup-python@v2
         with:
           python-version: 3.7
       - name: Use pip cache
-        uses: actions/cache@v1
+        uses: actions/cache@v2
         with:
           path: ~/.cache/pip
           key: pip-${{ hashFiles('**/requirements*.txt') }}
           restore-keys: |
             pip-
       - name: Install dependencies
         run: |
+          pip install wheel
           pip install -r requirements.txt
           pip install -r requirements-dev.txt
       - name: Run Pylint
@@ -49,6 +50,7 @@ jobs:
             pip-
       - name: Install dependencies
         run: |
+          pip install wheel
           pip install -r requirements.txt
           pip install -r docs/requirements.txt
       - name: Build docs
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@
* Fix for error in `compute_quantile_bin` that caused incorrect binning when a single score overlapped multiple bins. [(#238)](https://github.com/broadinstitute/gnomad_methods/pull/238)
* Removed assumption of `snv` annotation from `compute_quantile_bin`. [(#238)](https://github.com/broadinstitute/gnomad_methods/pull/238)
* Fixed `create_binned_ht` because it produced a "Cannot combine expressions from different source objects" error. [(#238)](https://github.com/broadinstitute/gnomad_methods/pull/238)
* Added constants and functions relevant to VCF export. [(#241)](https://github.com/broadinstitute/gnomad_methods/pull/241)

## Version 0.4.0 - July 9th, 2020

1 change: 1 addition & 0 deletions gnomad/assessment/__init__.py
@@ -0,0 +1 @@
from gnomad.assessment import sanity_checks
166 changes: 166 additions & 0 deletions gnomad/assessment/sanity_checks.py
@@ -0,0 +1,166 @@
import logging
from typing import Dict, List, Optional

import hail as hl

from gnomad.utils.vcf import make_label_combos, SORT_ORDER


logging.basicConfig(format="%(levelname)s (%(name)s %(lineno)s): %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def generic_field_check(
    ht: hl.Table,
    cond_expr: hl.expr.BooleanExpression,
    check_description: str,
    display_fields: List[str],
    verbose: bool,
) -> None:
    """
    Check a generic logical condition involving annotations in a Hail Table and print the results to terminal.

    Displays the number of rows in the Table that match `cond_expr`, i.e. rows that fail the desired condition (`check_description`).
    If no rows match `cond_expr`, the Table passes the check; otherwise, it fails.

    .. note::
Review comment (Contributor): Thank you for adding this note -- it's probably something easily missed when setting up the generic field check, so it's great to have this additional reminder.

        `cond_expr` and `check_description` are opposites and should never be the same.
        E.g., if `cond_expr` filters for instances where the raw AC is less than the adj AC,
        then it selects sites that fail the desired condition (`check_description`)
        of having a raw AC greater than or equal to the adj AC.

    :param ht: Table containing annotations to be checked.
    :param cond_expr: Logical expression referring to annotations in ht to be checked.
    :param check_description: String describing the condition being checked; is displayed in terminal summary message.
    :param display_fields: List of names of ht annotations to be displayed in case of failure (for troubleshooting purposes);
        these fields are also displayed if verbose is True.
    :param verbose: If True, show top values of annotations being checked, including checks that pass; if False,
        show only top values of annotations that fail checks.
    """
    # Keep a handle on the unfiltered Table so passing checks can still be displayed.
    ht_orig = ht
    ht = ht.filter(cond_expr)
    n_fail = ht.count()
    if n_fail > 0:
        logger.info(f"Found {n_fail} sites that fail {check_description} check:")
        ht = ht.flatten()
        ht.select("locus", "alleles", *display_fields).show()
    else:
        logger.info(f"PASSED {check_description} check")
        if verbose:
            ht_orig = ht_orig.flatten()
            ht_orig.select(*display_fields).show()
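
The relationship between `cond_expr` and `check_description` can be illustrated without Hail. The following is a minimal plain-Python sketch and is not part of this PR: rows are plain dicts, `check_rows` is a hypothetical stand-in for `generic_field_check` that returns the failing rows instead of only logging them, and the toy data is made up for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def check_rows(
    rows: List[Row],
    failure_cond: Callable[[Row], bool],
    check_description: str,
) -> List[Row]:
    """Return rows matching failure_cond, i.e. rows that violate check_description."""
    failing = [row for row in rows if failure_cond(row)]
    if failing:
        print(f"Found {len(failing)} sites that fail {check_description} check")
    else:
        print(f"PASSED {check_description} check")
    return failing


# Toy data (hypothetical): the raw AC should never be below the adj AC.
rows = [
    {"AC_raw": 10, "AC_adj": 8},
    {"AC_raw": 3, "AC_adj": 5},  # violates the desired condition
]

# The filter selects violations (raw < adj), while the description states
# the desired condition (raw >= adj).
failing = check_rows(rows, lambda r: r["AC_raw"] < r["AC_adj"], "AC_raw >= AC_adj")
```

The condition passed in is the negation of the description, which is the point the docstring note makes.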


def make_filters_sanity_check_expr(
    ht: hl.Table, extra_filter_checks: Optional[Dict[str, hl.expr.Expression]] = None
) -> Dict[str, hl.expr.Expression]:
    """
    Make Hail expressions to measure % variants filtered under varying conditions of interest.

    Checks for:
        - Total number of variants
        - Fraction of variants removed due to:
            - Any filter
            - Inbreeding coefficient filter, alone or in combination with any other filter
            - AC0 filter, alone or in combination with any other filter
            - Random forest filtering, alone or in combination with any other filter
            - Only inbreeding coefficient filter
            - Only AC0 filter
            - Only random forest filtering

    :param ht: Table containing 'filters' annotation to be examined.
    :param extra_filter_checks: Optional dictionary containing filter condition name (key) and extra filter expressions (value) to be examined.
    :return: Dictionary containing Hail aggregation expressions to examine filter flags.
    """
    filters_dict = {
        "n": hl.agg.count(),
        "frac_any_filter": hl.agg.fraction(hl.len(ht.filters) != 0),
        "frac_inbreed_coeff": hl.agg.fraction(ht.filters.contains("InbreedingCoeff")),
        "frac_ac0": hl.agg.fraction(ht.filters.contains("AC0")),
        "frac_rf": hl.agg.fraction(ht.filters.contains("RF")),
        # The "*_only" fractions require the named filter to be the sole filter set.
        "frac_inbreed_coeff_only": hl.agg.fraction(
            ht.filters.contains("InbreedingCoeff") & (ht.filters.length() == 1)
        ),
        "frac_ac0_only": hl.agg.fraction(
            ht.filters.contains("AC0") & (ht.filters.length() == 1)
        ),
        "frac_rf_only": hl.agg.fraction(
            ht.filters.contains("RF") & (ht.filters.length() == 1)
        ),
    }
    if extra_filter_checks:
        filters_dict.update(extra_filter_checks)

    return filters_dict
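
To make the meaning of each key concrete, here is a plain-Python sketch of what the aggregations above would compute on a small, made-up list of per-variant filter sets. It is an illustration only and not part of this PR: `filter_fractions` is a hypothetical helper, and it assumes a non-empty input. In Hail, the returned dictionary would presumably be evaluated with something like `ht.aggregate(hl.struct(**filters_dict))`.

```python
from typing import Callable, Dict, List, Set


def filter_fractions(filters_per_variant: List[Set[str]]) -> Dict[str, float]:
    """Plain-Python analogue of the filter aggregations (input must be non-empty)."""
    n = len(filters_per_variant)

    def frac(pred: Callable[[Set[str]], bool]) -> float:
        # Fraction of variants whose filter set satisfies pred.
        return sum(1 for f in filters_per_variant if pred(f)) / n

    result: Dict[str, float] = {
        "n": n,
        "frac_any_filter": frac(lambda f: len(f) != 0),
    }
    for key, name in [("inbreed_coeff", "InbreedingCoeff"), ("ac0", "AC0"), ("rf", "RF")]:
        # Filter present, alone or together with others.
        result[f"frac_{key}"] = frac(lambda f: name in f)
        # Filter present and the only one set.
        result[f"frac_{key}_only"] = frac(lambda f: name in f and len(f) == 1)
    return result


# Toy data (hypothetical): four variants with different filter sets.
variants = [set(), {"RF"}, {"AC0", "RF"}, {"InbreedingCoeff"}]
fractions = filter_fractions(variants)
```

On this toy input, `frac_rf` counts both `{"RF"}` and `{"AC0", "RF"}`, while `frac_rf_only` counts only `{"RF"}`.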


def sample_sum_check(
    ht: hl.Table,
    prefix: str,
    label_groups: Dict[str, List[str]],
    verbose: bool,
    subpop: Optional[str] = None,
    sort_order: List[str] = SORT_ORDER,
) -> None:
    """
    Compute afresh the sum of annotations for a specified group of annotations, and compare to the annotated version;
    display results from checking the sum of the specified annotations in the terminal.

    :param ht: Table containing annotations to be summed.
    :param prefix: String indicating sample subset.
    :param label_groups: Dictionary containing an entry for each label group, where key is the name of the grouping,
        e.g. "sex" or "pop", and value is a list of all possible values for that grouping (e.g. ["male", "female"] or ["afr", "nfe", "amr"]).
        Must include a "group" key, whose first value is used as the group label.
    :param verbose: If True, show top values of annotations being checked, including checks that pass; if False,
        show only top values of annotations that fail checks.
    :param subpop: Subpop abbreviation, supplied only if subpopulations are included in the annotation groups being checked.
    :param sort_order: List containing order to sort label group combinations. Default is SORT_ORDER.
    :return: None
    """
    label_combos = make_label_combos(label_groups)
    combo_AC = [ht.info[f"{prefix}AC_{x}"] for x in label_combos]
    combo_AN = [ht.info[f"{prefix}AN_{x}"] for x in label_combos]
    combo_nhomalt = [ht.info[f"{prefix}nhomalt_{x}"] for x in label_combos]

    # Note: pop() removes "group" from the caller's `label_groups` dict in place.
    group = label_groups.pop("group")[0]
    alt_groups = "_".join(
        sorted(label_groups.keys(), key=lambda x: sort_order.index(x))
    )

    annot_dict = {
        f"sum_AC_{group}_{alt_groups}": hl.sum(combo_AC),
        f"sum_AN_{group}_{alt_groups}": hl.sum(combo_AN),
        f"sum_nhomalt_{group}_{alt_groups}": hl.sum(combo_nhomalt),
    }

    ht = ht.annotate(**annot_dict)

    for subfield in ["AC", "AN", "nhomalt"]:
        if not subpop:
            generic_field_check(
                ht,
                (
                    ht.info[f"{prefix}{subfield}_{group}"]
                    != ht[f"sum_{subfield}_{group}_{alt_groups}"]
                ),
                f"{prefix}{subfield}_{group} = sum({subfield}_{group}_{alt_groups})",
                [
                    f"info.{prefix}{subfield}_{group}",
                    f"sum_{subfield}_{group}_{alt_groups}",
                ],
                verbose,
            )
        else:
            generic_field_check(
                ht,
                (
                    ht.info[f"{prefix}{subfield}_{group}_{subpop}"]
                    != ht[f"sum_{subfield}_{group}_{alt_groups}"]
                ),
                f"{prefix}{subfield}_{group}_{subpop} = sum({subfield}_{group}_{alt_groups})",
                [
                    f"info.{prefix}{subfield}_{group}_{subpop}",
                    f"sum_{subfield}_{group}_{alt_groups}",
                ],
                verbose,
            )