Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update vcf.py to work on joint freq release Table #688

Merged
merged 10 commits into from
Apr 3, 2024

Conversation

KoalaQin
Copy link
Contributor

@KoalaQin KoalaQin commented Mar 29, 2024

Changes:

  1. Add an option in make_hist_bin_edges_expr so it can access histograms as a row or nested under a row. Why? In exomes or genomes release HT, histograms is a row, but in the joint releast HT, it's under the row of exomes, genomes, or joint.
  2. Add prefix, suffix and description_text options in make_hist_dict. Also in order to work on joint Table.
  3. Add REGION_FLAGS_JOINT_INFO_DICT for joint-specific region_flags;
  4. Update make_info_dict to include CTT & CMH annotations;

Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and suggestions

@@ -1158,6 +1158,7 @@ def make_vcf_filter_dict(
def make_hist_bin_edges_expr(
ht: hl.Table,
hists: List[str] = HISTS,
row_name: str = "",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more descriptive thanrow_name? Maybe ann_w_hists or something? Could you also make this Optional? And whats the benefit of defaulting to an empty string instead of None?

"""
# Add underscore to prefix if it isn't empty
if prefix != "":
prefix += label_delimiter

edges_dict = {}
if row_name and hasattr(ht[row_name], "histograms"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this answers my question above, does ht[None] fail? Though my gut is if None was passed this would short circuit and hasattr(ht[None],...) wouldnt be evaluated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use ht[None], it won't work. You meant something like this?

    row_name: Optional[str] = None,
...
                        ht.head(1)[row_name].histograms.age_hists[f"age_hist_{call_type}"]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, can you explain how it didnt work? I'd expect the if row_name to evaluate to False and hasattr(.. wouldnt even be evaluated so your second line would never be an issue with None. But perhaps hail is getting angry about it when its creating the DAG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error:
TypeError: Table.__getitem__: invalid index argument(s) Usage 1: field selection: ht['field'] Usage 2: Left distinct join: ht[ht2.key] or ht[ht2.field1, ht2.field2]

@KoalaQin KoalaQin requested a review from mike-w-wilson March 29, 2024 15:56
@KoalaQin KoalaQin changed the title Add row_name option in case histograms is nested under a row Update hist functions in vcf.py for it to work on joint release table struct Mar 29, 2024
@KoalaQin KoalaQin changed the title Update hist functions in vcf.py for it to work on joint release table struct Update vcf.py to work on joint freq release Table Mar 30, 2024
Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things but this is looking good.

Comment on lines 1030 to 1031
"CTT_odds_ratio",
"CTT_p_value",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be appended to the metric list when freq_ctt is passed, not a default. These values do not exist on the exome or genome HTs by themselves which use this code so well have non-existent fields in the dictionary when we run this code for exomes or genomes. I know this follows the behavior of callstats/faf and a design choice we made previously where we include all metrics in the list even if only one is called but that wasnt the best choice and since these stats are specific to the joint HT, lets pull them out into a conditional append.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for catching this!

gnomad/utils/vcf.py Show resolved Hide resolved
if include_age_hists:
edges_dict.update(
{
f"{prefix}{call_type}": "|".join(
map(
lambda x: f"{x:.1f}",
ht.head(1)
.histograms.age_hists[f"age_hist_{call_type}"]
ht_row.histograms.age_hists[f"age_hist_{call_type}"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if the datatype has missing values in the first row since you cannot index into a NoneType. We need to add logic to handle the missing values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

head(1) worked very fast, my current change is very slow to run, do you have a faster suggestion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I add more secondary workers...

@@ -1203,7 +1287,7 @@ def make_hist_bin_edges_expr(
edges_dict[hist_name] = "|".join(
map(
lambda x: f"{x:.2f}" if "ab" in hist else str(int(x)),
ht.head(1).histograms[hist_type][hist].collect()[0].bin_edges,
ht_row.histograms[hist_type][hist].collect()[0].bin_edges,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

gnomad/utils/vcf.py Outdated Show resolved Hide resolved
@KoalaQin KoalaQin requested a review from mike-w-wilson April 1, 2024 16:52
@KoalaQin
Copy link
Contributor Author

KoalaQin commented Apr 3, 2024

I restructured make_hist_bin_edges_expr to implement our thinking, I wanted to simplify it more but it didn't work, let me know if this version can still be improved.

Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach, its straight forward and readable. One thing, let's move the

REGION_FLAG_FIELDS = [
    "fail_interval_qc",
    "outside_broad_capture_region",
    "outside_ukb_capture_region",
    "outside_broad_calling_region",
    "outside_ukb_calling_region",
    "not_called_in_exomes",
    "not_called_in_genomes",
]

Into this PR from the gnomad_qc 591 PR. They should live in the same place as the other flags, maybe just call them JOINT_REGION_FLAG_FIELDS

gnomad/utils/vcf.py Outdated Show resolved Hide resolved
gnomad/utils/vcf.py Show resolved Hide resolved
gnomad/utils/vcf.py Outdated Show resolved Hide resolved
Copy link
Contributor

@mike-w-wilson mike-w-wilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@KoalaQin KoalaQin merged commit f4d8155 into main Apr 3, 2024
3 checks passed
@KoalaQin KoalaQin deleted the qh/hist_bin_edges_expr branch April 3, 2024 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants