-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update vcf.py to work on joint freq release Table #688
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and suggestions
gnomad/utils/vcf.py
Outdated
@@ -1158,6 +1158,7 @@ def make_vcf_filter_dict( | |||
def make_hist_bin_edges_expr( | |||
ht: hl.Table, | |||
hists: List[str] = HISTS, | |||
row_name: str = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we be more descriptive thanrow_name
? Maybe ann_w_hists
or something? Could you also make this Optional? And whats the benefit of defaulting to an empty string instead of None?
gnomad/utils/vcf.py
Outdated
""" | ||
# Add underscore to prefix if it isn't empty | ||
if prefix != "": | ||
prefix += label_delimiter | ||
|
||
edges_dict = {} | ||
if row_name and hasattr(ht[row_name], "histograms"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this answers my question above, does ht[None] fail? Though my gut is if None was passed this would short circuit and hasattr(ht[None],...)
wouldnt be evaluated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to use ht[None], it won't work. You meant something like this?
row_name: Optional[str] = None,
...
ht.head(1)[row_name].histograms.age_hists[f"age_hist_{call_type}"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, can you explain how it didnt work? I'd expect the if row_name
to evaluate to False and hasattr(..
wouldnt even be evaluated so your second line would never be an issue with None. But perhaps hail is getting angry about it when its creating the DAG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error:
TypeError: Table.__getitem__: invalid index argument(s) Usage 1: field selection: ht['field'] Usage 2: Left distinct join: ht[ht2.key] or ht[ht2.field1, ht2.field2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple things but this is looking good.
gnomad/utils/vcf.py
Outdated
"CTT_odds_ratio", | ||
"CTT_p_value", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be appended to the metric list when freq_ctt is passed, not a default. These values do not exist on the exome or genome HTs by themselves which use this code so well have non-existent fields in the dictionary when we run this code for exomes or genomes. I know this follows the behavior of callstats/faf and a design choice we made previously where we include all metrics in the list even if only one is called but that wasnt the best choice and since these stats are specific to the joint HT, lets pull them out into a conditional append.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks for catching this!
gnomad/utils/vcf.py
Outdated
if include_age_hists: | ||
edges_dict.update( | ||
{ | ||
f"{prefix}{call_type}": "|".join( | ||
map( | ||
lambda x: f"{x:.1f}", | ||
ht.head(1) | ||
.histograms.age_hists[f"age_hist_{call_type}"] | ||
ht_row.histograms.age_hists[f"age_hist_{call_type}"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail if the datatype has missing values in the first row since you cannot index into a NoneType. We need to add logic to handle the missing values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
head(1) worked very fast, my current change is very slow to run, do you have a faster suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I add more secondary workers...
gnomad/utils/vcf.py
Outdated
@@ -1203,7 +1287,7 @@ def make_hist_bin_edges_expr( | |||
edges_dict[hist_name] = "|".join( | |||
map( | |||
lambda x: f"{x:.2f}" if "ab" in hist else str(int(x)), | |||
ht.head(1).histograms[hist_type][hist].collect()[0].bin_edges, | |||
ht_row.histograms[hist_type][hist].collect()[0].bin_edges, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above.
I restructured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach, its straight forward and readable. One thing, let's move the
REGION_FLAG_FIELDS = [
"fail_interval_qc",
"outside_broad_capture_region",
"outside_ukb_capture_region",
"outside_broad_calling_region",
"outside_ukb_calling_region",
"not_called_in_exomes",
"not_called_in_genomes",
]
Into this PR from the gnomad_qc 591 PR. They should live in the same place as the other flags, maybe just call them JOINT_REGION_FLAG_FIELDS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes:
make_hist_bin_edges_expr
so it can access histograms as a row or nested under a row. Why? In exomes or genomes release HT,histograms
is a row, but in the joint releast HT, it's under the row ofexomes
,genomes
, orjoint
.make_hist_dict
. Also in order to work on joint Table.REGION_FLAGS_JOINT_INFO_DICT
for joint-specific region_flags;make_info_dict
to include CTT & CMH annotations;