Annotate CNV table with mutation frequencies #52

ewafula · 2021-07-14T08:34:59Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Uses the consensus_seg_annotated_cn_autosomes.tsv and consensus_seg_annotated_cn_x_and_y.tsv consensus CNV calls and variant types (amplification, deep deletion, gain, loss, and neutral) to determine Ensembl gene-level mutation frequencies for each cancer type in the overall cohort dataset and in the independent primary/relapse cohort subsets of the data.

What was your approach?

The code is adapted from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/f70645b6c7e4eb15ea29e45e9ebf0adeb5798b9b/analyses/snv-frequencies by @logstar and https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/kgaonkar6/fusion_freq/analyses/fusion-frequencies by @kgaonkar6

Given a CNV consensus table with Kids_First_Biospecimen_ID and Variant_Type, the Python script 01-cnv-frequencies.py computes gene-level mutation frequencies per cancer_group within each cohort and adds annotations.
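
A minimal sketch of the counting step described above, not the code in 01-cnv-frequencies.py. It assumes each input row is a dict carrying a cohort, a cancer_group, an Ensembl gene ID, a Variant_Type, and a Kids_First_Biospecimen_ID; the `ensembl`, `cohort`, and `cancer_group` keys and the function name are hypothetical. It also assumes every sample in a group appears in at least one row (for example with a neutral call), so the denominator can be derived from the rows themselves.

```python
from collections import defaultdict


def cnv_frequencies(rows):
    """Count distinct altered samples per (cohort, cancer_group, gene, variant type).

    `rows` is an iterable of dicts; the key names are illustrative only.
    Returns {(cohort, cancer_group, gene, variant_type): (altered, total, "xx.xx%")}.
    """
    altered = defaultdict(set)  # samples carrying a given gene-level call
    total = defaultdict(set)    # all samples seen in a cohort/cancer_group
    for row in rows:
        group = (row["cohort"], row["cancer_group"])
        sample = row["Kids_First_Biospecimen_ID"]
        total[group].add(sample)
        altered[group + (row["ensembl"], row["Variant_Type"])].add(sample)

    result = {}
    for key, samples in altered.items():
        n_total = len(total[key[:2]])
        n_altered = len(samples)
        percent = "{:.2f}%".format(100.0 * n_altered / n_total)
        result[key] = (n_altered, n_total, percent)
    return result
```

Counting set members rather than rows means a sample with several segments over the same gene is counted once for that gene and variant type.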

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#66
d3b-center/ticket-tracker-OPC#68

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether mutation frequencies should be restricted to Ensembl gene identifiers only, without consideration of variant types (amplification, deep deletion, gain, loss, and neutral). Currently, variant types are combined with Ensembl gene identifiers to count mutations.

Is there anything that you want to discuss further?

Additional information is still required to update the table with variant categories such as focal, segmental, and chromosomal, and with Oncogene/TSG categories from OncoKB.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

JSONL and TSV tables

What is your summary of the results?

The CNV consensus frequency results currently cover only the PBTA and GMKF cohorts.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

logstar

Thank you for creating the cnv-frequencies module @ewafula !

I wonder if you could use only tabs or only spaces for indentation in 01-cnv-frequencies.py. Mixing tabs and spaces for indentation may cause hard-to-detect errors in future updates if the script is edited in certain text editors.
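
One way to check a script for this problem is sketched below. The function name is hypothetical and the check is deliberately simple: it only inspects leading whitespace and does not parse Python.

```python
def find_mixed_indentation(lines):
    """Return (mixed_lines, file_uses_both).

    mixed_lines: 1-based numbers of lines whose leading whitespace contains
    both tabs and spaces.
    file_uses_both: True if tabs and spaces both occur in indentation
    anywhere in the file, even on different lines.
    """
    mixed_lines = []
    uses_tabs = uses_spaces = False
    for number, line in enumerate(lines, start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if not indent:
            continue
        has_tab, has_space = "\t" in indent, " " in indent
        uses_tabs = uses_tabs or has_tab
        uses_spaces = uses_spaces or has_space
        if has_tab and has_space:
            mixed_lines.append(number)
    return mixed_lines, uses_tabs and uses_spaces
```

It could be called as `find_mixed_indentation(open(path).readlines())`. The standard library's tabnanny module (`python3 -m tabnanny <file>`) performs a related check for ambiguous indentation.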

To make sure that other people could reproduce your results identically, could you rerun your analysis module in the Docker image? You can add RUN pip3 install mygene in your local Dockerfile to install the mygene package. I assume you did not use the Docker image, because you used python > 3.5 syntax, and the Docker image only has python == 3.5.

Following are specific suggestions and comments.

analyses/cnv-frequencies/01-cnv-frequencies.py

analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh

logstar · 2021-07-14T16:35:59Z

Whether mutation frequencies should be restricted to Ensembl gene identifiers only, without consideration of variant types (amplification, deep deletion, gain, loss, and neutral). Currently, variant types are combined with Ensembl gene identifiers to count mutations.

@jharenza Are (gene, variant type)-level CNV mutation frequencies expected results?

update local cnv-frequencies with remote OpenPedCan

ewafula · 2021-07-16T15:16:48Z

@logstar, @jharenza, I made all the changes you recommended and regenerated the results using a Docker image built from the OpenPedCan Dockerfile. I also included the OncoKB categories in case the results need to be used for mock data. I'll need to amend the function that adds annotations after the annotation module is ready. Currently it takes ~6-7 hrs to retrieve annotations for all ~25k genes from MyGene on my old Mac mini, which is time-consuming if I need to amend code and rerun. jq, the JSONL converter @logstar is using, doesn't install properly on my machine. As a result, I have left the Python code for converting TSV to JSONL using csv.DictWriter. It works fine in Python > v3.6, but was experimental in earlier versions, including the Python v3.5 in the project Docker image. The conversion is unstable in Python v3.5 and sometimes does not maintain the order of the columns in the table when dumped to JSON. I am exploring whether I can implement it using OrderedDict from the Python collections module. I did not commit the Dockerfile (with the mygene module) because we will not be retrieving annotations using the MyGene API going forward.
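
A minimal sketch of an order-preserving TSV to JSONL conversion, assuming the goal is simply one JSON object per table row with keys in header order. It uses csv.reader plus OrderedDict rather than the DictWriter approach mentioned above, and the function name is hypothetical.

```python
import csv
import json
from collections import OrderedDict


def tsv_to_jsonl(tsv_handle, jsonl_handle):
    """Write one JSON object per TSV row, keeping the header's column order."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)
    for values in reader:
        # OrderedDict preserves insertion order on every Python 3 version,
        # including 3.5, and json.dumps emits keys in that order.
        record = OrderedDict(zip(header, values))
        jsonl_handle.write(json.dumps(record) + "\n")
```

Both arguments are open text file handles, so the same function works on files opened with `open(..., newline="")` or on in-memory buffers.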

ewafula · 2021-07-16T17:06:41Z

Ok will. Thanks!

…

On Fri, Jul 16, 2021 at 1:04 PM Yuanchao Zhang wrote:

> In analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh:
>
> +# Independent primary tumor samples file path
> +primary_tumors=analyses/independent-samples/results/independent-specimens.wgs.primary.tsv
> +
> +# Independent relapse tumor samples file path
> +relapse_tumors=analyses/independent-samples/results/independent-specimens.wgs.relapse.tsv
>
> rerunning now with each cohort
>
> No problem. The differences are documented by @runjin326 at README.md. Could you rerun without the mygene part, so you would not need to run it for a couple of hours? Or you could use this table for ENSG -> gene full name mapping: https://github.com/logstar/OpenPedCan-analysis/blob/lft-utils-ann-data-download/analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv

ewafula · 2021-07-16T17:38:06Z

Sorry, must have misunderstood that. I’ll make the change. So, only the relapse and primary independent samples use the number of samples instead of the number of patients? I have not nailed down the reasoning for it in my head yet! A patient might have more than one sample?

…

On Fri, Jul 16, 2021 at 1:12 PM Yuanchao Zhang wrote:

> In analyses/cnv-frequencies/01-cnv-frequencies.py:
>
> + tumor_dfs["relapse_tumors"] = relapse_tumors_df
> +
> + # compute variant frequencies for each cancer group per cohort and cancer group in cohorts
> + # for the overal dataset (all tumor samples) and independent primary/replase tumor samples
> + def func(x):
> +     d = {}
> +     d["Gene_Symbol"] = ",".join(x["gene_symbol"].unique())
> +     d["total_alterations"] = x["Kids_First_Participant_ID"].nunique()
> +     return pd.Series(d, index=["Gene_Symbol", "total_alterations"])
> + all_tumors_frequecy_dfs = []
> + primary_tumors_frequecy_dfs = []
> + relapse_tumors_frequecy_dfs = []
> + for index, row in cancer_group_cohort_df.iterrows():
> +     if row["num_samples"] > 5:
> +         for df_name, tumor_df in tumor_dfs.items():
> +             df = tumor_df[(tumor_df["cancer_group"] == row["cancer_group"]) & (tumor_df["cohort"] == row["cohort"])]
>
> cohort == 'all_cohorts' cases are handled in https://github.com/ewafula/OpenPedCan-analysis/blob/994a4178e3fd68fd2a342d1f5ffd5e6bd7030c6f/analyses/cnv-frequencies/01-cnv-frequencies.py#L116-L119 .

logstar · 2021-07-16T17:47:13Z

No problem!

You are right. Only the relapse and primary independent samples use the number of samples instead of the number of patients. A patient ID could have more than one independent sample ID if we use the each-cohort independent sample list.
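
The difference between the two denominators can be illustrated with a small sketch. The participant and biospecimen IDs below are made up, and the function name is hypothetical.

```python
def count_patients_and_samples(records):
    """records: list of (Kids_First_Participant_ID, Kids_First_Biospecimen_ID) pairs.

    Returns (number of distinct patients, number of distinct samples).
    """
    patients = {participant for participant, _ in records}
    samples = {biospecimen for _, biospecimen in records}
    return len(patients), len(samples)


# Hypothetical each-cohort independent list: PT_1 contributes two samples.
records = [("PT_1", "BS_A"), ("PT_1", "BS_B"), ("PT_2", "BS_C")]
n_patients, n_samples = count_patients_and_samples(records)
```

With this input the patient count is 2 and the sample count is 3, so a frequency computed per sample uses a larger denominator than one computed per patient.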

ewafula · 2021-07-16T17:52:56Z

Got it! My understanding of independent samples was off. Thanks!

logstar · 2021-07-16T17:54:26Z

No problem at all. Let me know if you have any questions.

…, and JSONL conversion

ewafula · 2021-07-17T21:40:23Z

@logstar, @jharenza, all changes done:

amended the primary/relapse tumor frequency calculation: it now uses sample counts instead of patient counts
amended the annotation function: it no longer retrieves full gene names using the MyGene API, and now uses the ENSG -> gene full name mapping table from the long-format-table-utils module (https://github.com/logstar/OpenPedCan-analysis/blob/lft-utils-ann-data-download/analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv)
reimplemented the TSV/CSV to JSONL conversion using the DictWriter and OrderedDict methods from the Python (< v3.6) csv and collections modules, respectively
now using the independent-specimens.wgs.primary.eachcohort.tsv and independent-specimens.wgs.relapse.eachcohort.tsv files from the independent-samples module for the primary/relapse tumor frequency calculations
regenerated results in the results directory of the module
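
A minimal sketch of annotating rows from a local ENSG -> gene full name table instead of a remote API. The column names of the mapping table are not given in this thread, so they are passed in as parameters; the `Gene_Ensembl_ID` and `Gene_full_name` keys and both function names are hypothetical.

```python
import csv


def load_gene_full_names(tsv_handle, id_column, name_column):
    """Build {Ensembl gene ID: gene full name} from a tab-separated table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return {row[id_column]: row[name_column] for row in reader}


def annotate(rows, full_names, id_key="Gene_Ensembl_ID"):
    """Add a Gene_full_name field to each row dict.

    IDs missing from the mapping get an empty string rather than raising.
    """
    for row in rows:
        row["Gene_full_name"] = full_names.get(row[id_key], "")
    return rows
```

Loading the table once into a dict makes each lookup a constant-time operation, which avoids the hours-long per-gene retrieval described earlier in the thread.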

logstar

Thank you for the updates @ewafula !

The revised module looks good to me. All issues are resolved. The x_and_y command runs well, with TSV and JSONL results identical to the uploaded ones. I was not able to test-run the autosomes command, because it had not completed after more than 30 minutes.

I wonder how long the autosomes command takes on your computer. If it takes too long, we could discuss whether it is necessary to create a new ticket to optimize this module, since you are currently familiar with the code. If we optimize at a later point, it might take more of your time and effort.

Also to note here that the annotation module is still under development. Once it is completed and merged, I will create a new ticket for updating this module to use the annotation module for annotation.

cc @jharenza

ewafula · 2021-07-19T22:15:33Z

@logstar
Just did timing. Autosomes takes 2 hr 9 min. I'll work on optimizing to reduce the run time.

logstar · 2021-07-19T22:52:04Z

@ewafula Thank you for looking into the run time issue.

As the code looks good and the results look correct and reproducible, I think we could merge this PR soon, so this PR will not become too long to review.

@jharenza I have not evaluated the results with CNV specific knowledge, so I will leave this PR open for now.

@ewafula I wonder if you could create a short ticket/issue at https://github.com/PediatricOpenTargets/ticket-tracker for optimizing the cnv-frequencies module and cc @jharenza and me, if you think it would be necessary to reduce the run time, and add a comment here to link that ticket/issue. This way, we could continue our discussion on the optimization in another ticket/issue. Maybe > 2hr run time is still acceptable, or the optimization task has low priority.

I will create a ticket/issue for adapting the annotation module CLI, when it is available.

ewafula · 2021-07-20T02:40:12Z

@logstar, @jharenza, I have opened a ticket/issue to work on optimizing run times for the cnv-frequencies module:
d3b-center/ticket-tracker-OPC#120

logstar · 2021-07-20T02:45:30Z

Thank you @ewafula !

update from upstream

logstar · 2021-07-21T20:25:14Z

@logstar, the annotation function is going to be replaced anyway, so there is no need to optimize it further. All the changes were in the function.
…
On Wed, Jul 21, 2021 at 4:12 PM Yuanchao Zhang wrote:

> Thank you for fixing the errors @ewafula! The results are identical to the previously uploaded ones now. However, the run time is now about 108 minutes.
>
> $ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
> real 108m58.665s
> user 108m55.404s
> sys 0m4.508s
>
> @jharenza I wonder if the frequencies and other parts of the results look good. I think this PR is ready for merging. Regarding the run time, we could discuss further at PediatricOpenTargets/ticket-tracker#120.

I agree. The annotation function needs no further optimization, as it will be replaced by the upcoming annotation module. All other parts should run within 40 minutes, so they are also good. I will close the optimization ticket d3b-center/ticket-tracker-OPC#120.

jharenza

Hi @ewafula! thanks for working on this!

I have a few comments and requested changes:

I spot checked GMKF cohort for MYCN:

> cnv %>%
+   filter(Dataset == "GMKF" & Disease == "Neuroblastoma" & Gene_Symbol == "MYCN") %>%
+   select(Gene_Symbol, Variant_Type, `Total_Primary_Tumors_Altered/Primary_Tumors_in_Dataset`, Frequency_in_Primary_Tumors)
# A tibble: 3 x 4
  Gene_Symbol Variant_Type  `Total_Primary_Tumors_Altered/Primary… Frequency_in_Primary…
  <chr>       <chr>         <chr>                                  <chr>                
1 MYCN        amplification 25/200                                 12.50%               
2 MYCN        gain          39/200                                 19.50%               
3 MYCN        loss          1/200                                  0.50%                
> v6 %>%
+   filter(cohort == "GMKF" & experimental_strategy == "WGS" & tumor_descriptor == "Primary Tumor") %>%
+   select(molecular_subtype) %>%
+   table()
.
    MYCN amp MYCN non-amp      Unknown 
          47          288            1 
> 47/(47+288)
[1] 0.1402985

I think this is close-ish right now - we are making updates to the CNV module within OpenPBTA (AlexsLemonade#1113) and we also have an open ticket to assess CNV thresholding (d3b-center/ticket-tracker-OPC#113), so this will definitely change.

Can you combine the autosomes and xy chromosome files to one file?
Will you also name these files more distinctly - they are no longer the seg files, but the cnv frequency files.
I noticed you have only the cohort level analysis here, i.e. cohort+cancer_group, but do not have the all_cohorts analysis. Each of these uses distinct independent specimen files as well. Can you add this?

Thanks!

ewafula · 2021-07-21T22:03:51Z

@jharenza, the code accounts for all_cohorts. We just don’t have any cancer_group that overlaps the two cohorts (PBTA and GMKF). Therefore, all_cohorts results are not present with the current input. @logstar, is there some logic that I am missing? I will combine the input consensus files and rename the output files.

jharenza · 2021-07-21T22:07:34Z

Ah perhaps the overlap is in RNA only right now - GMKF and PBTA have Neuroblastoma in common, but only few in GMKF. I hadn't checked experimental strategy, but good you have it in there - thanks!

logstar · 2021-07-21T22:36:03Z

Thank you for checking. The all_cohorts case is correctly handled at:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/abb449d57ebef086e6f827854b05a17f27e6957f/analyses/cnv-frequencies/01-cnv-frequencies.py#L72-L81

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/abb449d57ebef086e6f827854b05a17f27e6957f/analyses/cnv-frequencies/01-cnv-frequencies.py#L118-L121
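
The behavior discussed in this thread can be sketched as follows; this is not the code at the links above. It assumes, based on the comments here, that an all_cohorts group is produced only for a cancer_group present in more than one cohort, and it reuses the `> 5` sample threshold from the reviewed snippet. The function name and the tuple layout of the input are hypothetical.

```python
from collections import defaultdict


def frequency_groups(sample_rows, min_samples=5):
    """sample_rows: list of (sample_id, cohort, cancer_group) tuples.

    Returns {(cohort_label, cancer_group): set of sample IDs}, where
    cohort_label is either a cohort name or "all_cohorts".
    """
    per_cohort = defaultdict(set)
    cohorts_of_group = defaultdict(set)
    for sample_id, cohort, cancer_group in sample_rows:
        per_cohort[(cohort, cancer_group)].add(sample_id)
        cohorts_of_group[cancer_group].add(cohort)

    # Keep per-cohort groups with more than min_samples samples.
    groups = {k: ids for k, ids in per_cohort.items() if len(ids) > min_samples}

    # Assumption: pool across cohorts only when a cancer group spans several.
    for cancer_group, cohorts in cohorts_of_group.items():
        if len(cohorts) > 1:
            pooled = set()
            for cohort in cohorts:
                pooled |= per_cohort[(cohort, cancer_group)]
            if len(pooled) > min_samples:
                groups[("all_cohorts", cancer_group)] = pooled
    return groups
```

Under these assumptions, an input in which no cancer_group is shared between cohorts yields no all_cohorts entries, which matches the absence of all_cohorts rows reported for the PBTA and GMKF input.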

ewafula · 2021-07-27T16:51:06Z

@logstar, @jharenza, update for CNV frequencies module PR

the annotator from the long-format-table-utils module is now integrated in the cnv-frequencies module
results regenerated using v7 data release
autosome and x_and_y results merged
JSONL and TSV result files named appropriately (similar to snv-frequencies module)

jharenza

Hi @ewafula - thanks for the updates.

In your table, I noticed that frequencies are missing:

> cnv %>%
+   filter(Gene_symbol == "MYCN" & Disease == "Neuroblastoma") %>%
+   select(Gene_symbol, Dataset, Disease, Frequency_in_overall_dataset, `Total_primary_tumors_altered/Primary_tumors_in_dataset`, Frequency_in_primary_tumors)
# A tibble: 11 x 6
   Gene_symbol Dataset    Disease     Frequency_in_overall_da… `Total_primary_tumors_altered/Primary_tum… Frequency_in_primary_t…
   <chr>       <chr>      <chr>       <chr>                    <chr>                                      <lgl>                  
 1 MYCN        all_cohor… Neuroblast… 20.77%                   25/200                                     NA                     
 2 MYCN        all_cohor… Neuroblast… 16.43%                   39/200                                     NA                     
 3 MYCN        all_cohor… Neuroblast… 14.49%                   1/200                                      NA                     
 4 MYCN        all_cohor… Neuroblast… 17.87%                   0/0                                        NA                     
 5 MYCN        GMKF       Neuroblast… 12.50%                   25/200                                     NA                     
 6 MYCN        GMKF       Neuroblast… 19.50%                   39/200                                     NA                     
 7 MYCN        GMKF       Neuroblast… 0.50%                    1/200                                      NA                     
 8 MYCN        TARGET     Neuroblast… 27.93%                   0/0                                        NA                     
 9 MYCN        TARGET     Neuroblast… 13.51%                   0/0                                        NA                     
10 MYCN        TARGET     Neuroblast… 26.58%                   0/0                                        NA                     
11 MYCN        TARGET     Neuroblast… 33.33%                   0/0                                        NA

Also, for the above samples, most in TARGET are primary tumors, so it seems the total calculations are being missed here:

v7 %>%
  filter(cohort == "TARGET" & experimental_strategy != "RNA-Seq" & cancer_group == "Neuroblastoma") %>%
  select(tumor_descriptor) %>%
  table()

I think this has to do with the independent sample files you are using - see my comment.

analyses/cnv-frequencies/01-cnv-frequencies.py

logstar · 2021-07-27T21:43:04Z

@jharenza Thank you for checking the results! I think the NAs might be caused by the read_tsv default na = c("", "NA"), which reads empty strings as NA. Calling read_tsv with na = c("NA") retains the empty strings.

I did a quick check on the TSV file and found no NA.

$ grep -P '\tNA\t' results/gene-level-cnv-consensus-annotated-mut-freq.tsv | wc -l
0

jharenza · 2021-07-27T22:32:56Z

Oh yes, I need to read them in as blanks, but what I was saying is that they should not be blank. There are values in the numerator and denominator but no values in the frequency column. Can you check?

logstar · 2021-07-27T22:38:06Z

Sorry that I missed those. Thank you for pointing those out @jharenza !

They are probably caused by the read_tsv guess_max value being too small, so the frequency column is read as logical. I found no NA in the result TSV by grep.

$ grep -P 'MYCN\t.*Neuroblastoma' results/gene-level-cnv-consensus-annotated-mut-freq.tsv 
MYCN    ENSG00000134323 amplification           all_cohorts     Neuroblastoma   86/414  20.77%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            all_cohorts     Neuroblastoma   68/414  16.43%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            all_cohorts     Neuroblastoma   60/414  14.49%  1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         all_cohorts     Neuroblastoma   74/414  17.87%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           GMKF    Neuroblastoma   25/200  12.50%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            GMKF    Neuroblastoma   39/200  19.50%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            GMKF    Neuroblastoma   1/200   0.50%   1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           TARGET  Neuroblastoma   62/222  27.93%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            TARGET  Neuroblastoma   30/222  13.51%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            TARGET  Neuroblastoma   59/222  26.58%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         TARGET  Neuroblastoma   74/222  33.33%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
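
The rows above suggest a simple rule: a frequency is blank only when its #/# denominator is 0. A minimal Python sketch of a check for that rule follows. It is not part of the module; the helper name `inconsistent_rows` and the two-row sample are made up for illustration, and only the column names come from this thread.

```python
import csv
import io

# Column names as used in this thread; everything else here is illustrative.
COUNT_COL = "Total_primary_tumors_altered/Primary_tumors_in_dataset"
FREQ_COL = "Frequency_in_primary_tumors"


def inconsistent_rows(rows, count_col, freq_col):
    """Return rows whose blank/non-blank frequency disagrees with the denominator."""
    bad = []
    for row in rows:
        denominator = int(row[count_col].split("/")[1])
        freq_blank = row[freq_col].strip() == ""
        # A blank frequency is expected only when the denominator is 0.
        if freq_blank != (denominator == 0):
            bad.append(row)
    return bad


# Made-up sample that mirrors the shape of the MYCN rows above.
sample = (
    f"Gene_symbol\tDataset\t{COUNT_COL}\t{FREQ_COL}\n"
    "MYCN\tGMKF\t25/200\t12.50%\n"
    "MYCN\tTARGET\t0/0\t\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
print(len(inconsistent_rows(rows, COUNT_COL, FREQ_COL)))
```

Reading with the `csv` module keeps blank fields as empty strings, so no type guessing is involved.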

logstar · 2021-07-27T22:41:49Z

@jharenza I am not sure why I am seeing different numbers from yours. The all_cohorts have ###/414 on my end. I will double-check.

logstar

Thank you for the updates @ewafula .

The module runs well on the Docker image/container and completes in 18 minutes! All results are also identically reproduced.

$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.

real    17m37.967s
user    17m28.653s
sys     0m11.996s

The code updates in commit ce0aa12 also look good to me.

Following are some minor suggestions.

analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh

jharenza · 2021-07-27T23:07:55Z

Ok great- thanks for checking!

ewafula · 2021-07-28T00:53:00Z

@jharenza, @logstar, PR update. Made all the changes discussed above, including using independent specimen files which contain WXS and panel data - analyses/independent-samples/results/independent-specimens.wgswxspanel.primary.eachcohort.tsv and analyses/independent-samples/results/independent-specimens.wgswxspanel.relapse.eachcohort.tsv

logstar

Thank you for the updates @ewafula !

The module runs well in the Docker image/container and completes in 23 minutes. The results are reproduced identically.

$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.

real    22m43.293s
user    22m31.474s
sys     0m14.797s

The code updates in commit f18ca24 look good to me.

The TSV file has no NA in it.

$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P '\tNA\t|\tNaN\t|\tNULL\t' | wc -l
0

The primary #/# and frequencies are no longer 0/0 and blank. The relapse #/# and frequencies are still 0/0 and blank, and not only for the TARGET cohort, so I wonder if this is expected @jharenza .

$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | head -1
Gene_symbol     Gene_Ensembl_ID Variant_type    Variant_category        Dataset Disease Total_alterations/Patients_in_dataset   Frequency_in_overall_dataset    Total_primary_tumors_altered/Primary_tumors_in_dataset      Frequency_in_primary_tumors     Total_relapse_tumors_altered/Relapse_tumors_in_dataset  Frequency_in_relapse_tumors     Gene_full_name  RMTL    OncoKB_cancer_gene  OncoKB_oncogene_TSG     EFO     MONDO
$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P 'MYCN\t.*\tNeuroblastoma\t'
MYCN    ENSG00000134323 amplification           all_cohorts     Neuroblastoma   86/414  20.77%  81/410  19.76%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            all_cohorts     Neuroblastoma   68/414  16.43%  65/410  15.85%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            all_cohorts     Neuroblastoma   60/414  14.49%  59/410  14.39%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         all_cohorts     Neuroblastoma   74/414  17.87%  72/410  17.56%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           GMKF    Neuroblastoma   25/200  12.50%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            GMKF    Neuroblastoma   39/200  19.50%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            GMKF    Neuroblastoma   1/200   0.50%   1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           TARGET  Neuroblastoma   62/222  27.93%  56/210  26.67%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            TARGET  Neuroblastoma   30/222  13.51%  26/210  12.38%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            TARGET  Neuroblastoma   59/222  26.58%  58/210  27.62%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         TARGET  Neuroblastoma   74/222  33.33%  72/210  34.29%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
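
For reference, the frequency columns can be reproduced from the #/# columns. The sketch below assumes the frequency is the altered count divided by the total, shown as a percentage with two decimals and left blank when the total is 0. The function name `format_frequency` is hypothetical, and this may not match how 01-cnv-frequencies.py computes it.

```python
def format_frequency(count_str):
    """Turn 'altered/total' into a percentage string, or '' when total is 0."""
    altered, total = (int(part) for part in count_str.split("/"))
    if total == 0:
        # No tumors in this subset, so no frequency can be reported.
        return ""
    return f"{altered / total:.2%}"


# Counts copied from the MYCN amplification row for all_cohorts above.
for counts in ("86/414", "81/410", "0/0"):
    print(counts, repr(format_frequency(counts)))
```

Under that assumption, 86/414 and 81/410 give 20.77% and 19.76%, matching the overall and primary columns in the first row.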

jharenza · 2021-07-28T02:27:53Z

The relapse #/# and frequencies are still 0/0 and blank, and not only for the TARGET cohort, so I wonder if this is expected @jharenza .

Yes, this is expected if you look at the broad_tumor_descriptor field we added in v7. Thanks!

ewafula · 2021-07-28T02:50:41Z

@logstar, looking at independent-samples/results/independent-specimens.wgswxspanel.relapse.eachcohort.tsv, there are no Neuroblastoma TARGET samples (TARGET-30-*) listed. There are also no relapse samples for GMKF in the broad_tumor_descriptor column of histologies.tsv. I looked in the histologies file because I can't differentiate the Kids_First_Biospecimen_IDs between GMKF and PBTA in the relapse independent specimens file.
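
The lookup described here can be sketched in Python: map each biospecimen ID to its cohort through the histologies table, then count the relapse independent specimens per cohort for one cancer group. The column names `Kids_First_Biospecimen_ID`, `cohort`, and `cancer_group` appear in this thread; the function name `count_by_cohort` and all sample IDs are made up.

```python
from collections import Counter


def count_by_cohort(specimen_ids, histology_rows, cancer_group):
    """Count the listed specimens per cohort for a single cancer group."""
    cohort_of = {
        row["Kids_First_Biospecimen_ID"]: row["cohort"]
        for row in histology_rows
        if row["cancer_group"] == cancer_group
    }
    # Specimens outside the cancer group are skipped.
    return Counter(cohort_of[s] for s in specimen_ids if s in cohort_of)


# Made-up histology rows and relapse specimen list, for illustration only.
histology_rows = [
    {"Kids_First_Biospecimen_ID": "SAMPLE_1", "cohort": "GMKF", "cancer_group": "Neuroblastoma"},
    {"Kids_First_Biospecimen_ID": "SAMPLE_2", "cohort": "TARGET", "cancer_group": "Neuroblastoma"},
    {"Kids_First_Biospecimen_ID": "SAMPLE_3", "cohort": "PBTA", "cancer_group": "Ependymoma"},
]
relapse_ids = ["SAMPLE_3"]

print(count_by_cohort(relapse_ids, histology_rows, "Neuroblastoma"))
```

An empty counter for a cancer group corresponds to the 0/0 relapse entries with blank frequencies.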

logstar

@jharenza @ewafula Thank you for confirming that the aforementioned relapse #/# and frequencies are expected.

This PR looks good to me now.

jharenza

This looks good to me now as well. Thank you!

intial cnv_frequecies module

7c299e1

ewafula requested review from jharenza, logstar, afarrel and kgaonkar6 July 14, 2021 08:34

jharenza removed request for afarrel and kgaonkar6 July 14, 2021 10:33

logstar suggested changes Jul 14, 2021

jharenza requested review from jharenza and removed request for jharenza July 14, 2021 18:02

logstar mentioned this pull request Jul 15, 2021

Proposed Analysis: Create an API for annotating long-format tables generated by analysis modules d3b-center/ticket-tracker-OPC#112

Closed

ewafula added 2 commits July 15, 2021 23:57

Merge remote-tracking branch 'upstream/dev' into cnv-frequencies

8b827bc

update local cnv-frequencies with remote OpenPedCan

update for docker compatibility initial cnv-frequecies module

994a417

amended primary/relapse tumor frequecies calculation, annotation func…

118e8da

…, and JSONL conversion

logstar approved these changes Jul 19, 2021

logstar mentioned this pull request Jul 20, 2021

Updated analysis: use annotator CLI in the cnv-frequencies module d3b-center/ticket-tracker-OPC#124

Closed

1 task

ewafula added 2 commits July 20, 2021 14:09

Merge remote-tracking branch 'upstream/dev' into cnv-frequencies

2b7fefa

update from upstream

run time optimization

8ba3d79

logstar mentioned this pull request Jul 21, 2021

Updated analysis: optimize run times of the cnv-frequencies module d3b-center/ticket-tracker-OPC#120

Closed

jharenza requested changes Jul 21, 2021

ewafula added 3 commits July 26, 2021 10:01

Merge remote-tracking branch 'upstream/dev' into cnv-frequencies

d0df165

Merge remote-tracking branch 'upstream/dev' into cnv-frequencies

f9637d1

integrated annotator for v7 from long-format-table-utils module

ce0aa12

jharenza requested changes Jul 27, 2021

analyses/cnv-frequencies/01-cnv-frequencies.py Outdated

analyses/cnv-frequencies/01-cnv-frequencies.py Outdated

jharenza added the work in progress label Jul 27, 2021

logstar suggested changes Jul 27, 2021

analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh Outdated

analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh

ewafula added 2 commits July 27, 2021 19:18

Merge remote-tracking branch 'upstream/dev' into cnv-frequencies

d6c7c3a

changed to independent specimen files which contain wxs and panel data

f18ca24

logstar reviewed Jul 28, 2021

logstar approved these changes Jul 28, 2021

jharenza approved these changes Jul 28, 2021

jharenza merged commit 16981db into d3b-center:dev Jul 28, 2021

ewafula deleted the cnv-frequencies branch August 16, 2021 18:43

This was referenced Sep 16, 2021

Proposed Analysis: Create CNV frequencies file d3b-center/ticket-tracker-OPC#66

Closed

Proposed Analysis: Create JSON files for CNV tables d3b-center/ticket-tracker-OPC#68

Closed
