Annotate CNV table with mutation frequencies #52

Merged
merged 13 commits into from
Jul 28, 2021

Conversation

@ewafula ewafula commented Jul 14, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Uses the consensus CNV calls in consensus_seg_annotated_cn_autosomes.tsv and consensus_seg_annotated_cn_x_and_y.tsv, together with variant types (amplification, deep deletion, gain, loss, and neutral), to determine Ensembl gene-level mutation frequencies for each cancer type in the overall cohort dataset and in the independent primary/relapse cohort subsets of the data.

What was your approach?

The code is adapted from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/f70645b6c7e4eb15ea29e45e9ebf0adeb5798b9b/analyses/snv-frequencies by @logstar and https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/kgaonkar6/fusion_freq/analyses/fusion-frequencies by @kgaonkar6

Given a CNV consensus table with Kids_First_Biospecimen_ID and Variant_Type, the Python script 01-cnv-frequencies.py computes gene-level mutation frequencies per cancer_group within each cohort and adds annotations.
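As a rough illustration of the counting described above, the sketch below tallies, for each (gene, variant type) within a cohort and cancer_group, how many distinct patients carry the alteration and divides by the number of distinct patients in that group. It uses only the standard library. The function name `gene_level_frequencies`, the `ensembl` column name, and the toy records are hypothetical and are not taken from 01-cnv-frequencies.py.

```python
from collections import defaultdict


def gene_level_frequencies(records):
    """Map (cohort, cancer_group, ensembl, variant_type) -> (altered, total, frequency).

    Simplifying assumption: the denominator only counts patients that appear
    in the records themselves, not patients from a separate histology table.
    """
    patients_in_group = defaultdict(set)
    patients_altered = defaultdict(set)
    for rec in records:
        group = (rec["cohort"], rec["cancer_group"])
        patient = rec["Kids_First_Participant_ID"]
        patients_in_group[group].add(patient)
        patients_altered[group + (rec["ensembl"], rec["Variant_Type"])].add(patient)

    result = {}
    for key, altered in patients_altered.items():
        total = len(patients_in_group[key[:2]])
        result[key] = (len(altered), total, len(altered) / total)
    return result


def make_record(patient, variant_type):
    # Hypothetical toy row; real input has many more columns.
    return {
        "cohort": "GMKF",
        "cancer_group": "Neuroblastoma",
        "Kids_First_Participant_ID": patient,
        "ensembl": "ENSG00000134323",
        "Variant_Type": variant_type,
    }


toy_records = [
    make_record("PT_1", "amplification"),
    make_record("PT_1", "amplification"),  # second specimen, same patient
    make_record("PT_2", "gain"),
    make_record("PT_3", "neutral"),
    make_record("PT_4", "amplification"),
]
frequencies = gene_level_frequencies(toy_records)
```

Because patients are collected in sets, the duplicated PT_1 row is counted once, so amplification is counted for 2 of 4 patients in this toy input.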

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#66
d3b-center/ticket-tracker-OPC#68

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

If mutation frequencies should be restricted to only Ensembl gene identifiers without consideration of variant types (amplification, deep deletion, gain, loss, and neutral). Currently, variant types are included in combination with Ensembl gene identifiers to count mutations.

Is there anything that you want to discuss further?

Still requires additional information to update the table with variant categories such as focal, segmental, chromosomal, etc., and Oncogene/TSG categories from OncoKB.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

JSONL and TSV tables

What is your summary of the results?

The CNV consensus frequency results currently cover only the PBTA and GMKF cohorts.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@logstar logstar left a comment

Thank you for creating the cnv-frequencies module @ewafula !

I wonder if you could only use tab or space for indentation in 01-cnv-frequencies.py. Mixing tabs and spaces for indentation may cause hard-to-detect errors in future updates, if the script is edited by certain text editors.
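One way to spot the issue described above is to classify the leading whitespace of every indented line. The sketch below is a minimal standard-library check; the function name `indentation_styles` and the example source string are hypothetical. Python also ships a `tabnanny` module that can flag ambiguous indentation.

```python
def indentation_styles(source):
    """Group 1-based line numbers by indentation style: spaces, tabs, or mixed."""
    styles = {}
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        if not indent or not stripped:
            continue  # skip unindented and whitespace-only lines
        if set(indent) == {" "}:
            style = "spaces"
        elif set(indent) == {"\t"}:
            style = "tabs"
        else:
            style = "mixed"
        styles.setdefault(style, []).append(number)
    return styles


# Hypothetical snippet: line 3 is indented with tabs, the rest with spaces.
example_source = "def f(x):\n    if x:\n\t\treturn 1\n    return 0\n"
styles_found = indentation_styles(example_source)
```

A file indented consistently yields a single key; more than one key means tabs and spaces are both in use.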

To make sure that other people could reproduce your results identically, could you rerun your analysis module in the Docker image? You can add RUN pip3 install mygene in your local Dockerfile to install the mygene package. I assume you did not use the Docker image, because you used python > 3.5 syntax, and the Docker image only has python == 3.5.

Following are specific suggestions and comments.

logstar commented Jul 14, 2021

If mutation frequencies should be restricted to only Ensembl gene identifiers without consideration of variant types (amplification, deep deletion, gain, loss, and neutral). Currently, variant types are included in combination with Ensembl gene identifiers to count mutations.

@jharenza Are (gene, variant type)-level CNV mutation frequencies expected results?

Author

ewafula commented Jul 16, 2021

@logstar, @jharenza, I made all the changes you recommended and regenerated the results using a Docker image built from the OpenPedCan Dockerfile. I also included the OncoKB categories in case the results need to be used for mock data. I'll need to amend the function that adds annotations after the annotation module is ready. Currently it takes ~6-7 hrs to retrieve annotations for all ~25k genes from MyGene on my old Mac mini, which is time consuming if I need to amend code and rerun. jq, the JSONL converter that @logstar is using, doesn't install properly on my machine. As a result, I have left in the Python code for converting TSV to JSONL using csv.DictWriter. It works OK in Python > v3.6, but was experimental in earlier versions, including Python v3.5 in the project Docker image. The conversion is unstable in Python v3.5 and sometimes does not maintain the order of the columns in the table when dumped to JSON. I am exploring whether I can implement it using OrderedDict from the Python collections module. I did not commit the Dockerfile (with the mygene module) because we will not be retrieving annotations using the MyGene API going forward.
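A minimal sketch of the OrderedDict idea mentioned above, assuming a plain tab-separated input with a header row. Plain dicts do not guarantee insertion order in Python 3.5, so each record is built explicitly as an OrderedDict before being dumped. The function name `tsv_to_jsonl` and the sample text are hypothetical and are not the module's actual converter.

```python
import csv
import io
import json
from collections import OrderedDict


def tsv_to_jsonl(tsv_handle, jsonl_handle):
    """Write one JSON object per TSV row, keeping the header's column order."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader)
    for row in reader:
        # OrderedDict keeps keys in header order, including on Python 3.5.
        record = OrderedDict(zip(header, row))
        jsonl_handle.write(json.dumps(record) + "\n")


# Hypothetical three-column input for illustration.
tsv_text = "Gene_symbol\tVariant_type\tDataset\nMYCN\tamplification\tGMKF\n"
output = io.StringIO()
tsv_to_jsonl(io.StringIO(tsv_text), output)
jsonl_text = output.getvalue()
```

All values stay strings here; converting numeric columns would need an extra, explicit step.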

Author

ewafula commented Jul 16, 2021 via email

Author

ewafula commented Jul 16, 2021 via email

logstar commented Jul 16, 2021

Sorry, I must have misunderstood that. I’ll make the change. So, only the relapse and primary independent samples use the number of samples instead of the number of patients? I have not nailed down the reasoning for it in my head yet! A patient might have more than one sample?

On Fri, Jul 16, 2021 at 1:12 PM Yuanchao Zhang wrote:

In analyses/cnv-frequencies/01-cnv-frequencies.py:

    +    tumor_dfs["relapse_tumors"] = relapse_tumors_df
    +
    +    # compute variant frequencies for each cancer group per cohort and cancer group in cohorts
    +    # for the overal dataset (all tumor samples) and independent primary/replase tumor samples
    +    def func(x):
    +        d = {}
    +        d["Gene_Symbol"] = ",".join(x["gene_symbol"].unique())
    +        d["total_alterations"] = x["Kids_First_Participant_ID"].nunique()
    +        return pd.Series(d, index=["Gene_Symbol", "total_alterations"])
    +    all_tumors_frequecy_dfs = []
    +    primary_tumors_frequecy_dfs = []
    +    relapse_tumors_frequecy_dfs = []
    +    for index, row in cancer_group_cohort_df.iterrows():
    +        if row["num_samples"] > 5:
    +            for df_name, tumor_df in tumor_dfs.items():
    +                df = tumor_df[(tumor_df["cancer_group"] == row["cancer_group"]) & (tumor_df["cohort"] == row["cohort"])]

cohort == 'all_cohorts' cases are handled in https://github.com/ewafula/OpenPedCan-analysis/blob/994a4178e3fd68fd2a342d1f5ffd5e6bd7030c6f/analyses/cnv-frequencies/01-cnv-frequencies.py#L116-L119.

No problem!

You are right. Only the relapse and primary independent samples use the number of samples instead of the number of patients. A patient ID could have more than one independent sample ID if we use the each-cohort independent sample list.
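To make the distinction concrete, the sketch below counts the same alteration rows once by patient and once by sample. The helper name `count_distinct` and the identifiers in the toy rows are hypothetical; only the two column names come from the discussion.

```python
def count_distinct(rows, id_column):
    """Count distinct identifiers found in id_column across the given rows."""
    return len({row[id_column] for row in rows})


# Hypothetical rows: patient PT_A contributes two independent samples.
altered_rows = [
    {"Kids_First_Participant_ID": "PT_A", "Kids_First_Biospecimen_ID": "BS_1"},
    {"Kids_First_Participant_ID": "PT_A", "Kids_First_Biospecimen_ID": "BS_2"},
    {"Kids_First_Participant_ID": "PT_B", "Kids_First_Biospecimen_ID": "BS_3"},
]

# Overall dataset: count patients. Primary/relapse subsets: count samples.
altered_patients = count_distinct(altered_rows, "Kids_First_Participant_ID")
altered_samples = count_distinct(altered_rows, "Kids_First_Biospecimen_ID")
```

With these rows the patient count is 2 while the sample count is 3, which is why the two denominators can differ for the same cancer group.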

Author

ewafula commented Jul 16, 2021 via email

logstar commented Jul 16, 2021 via email

Author

ewafula commented Jul 17, 2021

@logstar, @jharenza, all changes done:

@logstar logstar left a comment

Thank you for the updates @ewafula !

The revised module looks good to me. All issues are resolved. The x_and_y command runs well with TSV and JSONL results identical to the uploaded ones. I was not able to test-run the autosomes command, because it was not completed after > 30 minutes.

I wonder how long the autosomes command takes on your computer. If it takes too long, we could discuss whether it is necessary to create a new ticket to optimize this module, since you are currently familiar with the code. If we optimize at a later point, it might take more of your time and effort.

Also to note here that the annotation module is still under development. Once it is completed and merged, I will create a new ticket for updating this module to use the annotation module for annotation.

cc @jharenza

Author

ewafula commented Jul 19, 2021

@logstar
Just did timing. Autosomes takes 2 hr 9 min. I'll work on optimizing to reduce the run time.

logstar commented Jul 19, 2021

@ewafula Thank you for looking into the run time issue.

As the code looks good and the results look correct and reproducible, I think we could merge this PR soon, so this PR will not become too long to review.

@jharenza I have not evaluated the results with CNV specific knowledge, so I will leave this PR open for now.

@ewafula I wonder if you could create a short ticket/issue at https://github.com/PediatricOpenTargets/ticket-tracker for optimizing the cnv-frequencies module and cc @jharenza and me, if you think it would be necessary to reduce the run time, and add a comment here to link that ticket/issue. This way, we could continue our discussion on the optimization in another ticket/issue. Maybe > 2hr run time is still acceptable, or the optimization task has low priority.

I will create a ticket/issue for adapting the annotation module CLI, when it is available.

Author

ewafula commented Jul 20, 2021

@logstar, @jharenza, I have opened a ticket/issue to work on optimizing run times for the cnv-frequencies module:
d3b-center/ticket-tracker-OPC#120

logstar commented Jul 20, 2021

Thank you @ewafula !

logstar commented Jul 21, 2021

@logstar, the annotation function is going to be replaced anyway, so there is no need to optimize it further. All the changes were in the function.

On Wed, Jul 21, 2021 at 4:12 PM Yuanchao Zhang wrote:

Thank you for fixing the errors @ewafula! The results are identical to the previously uploaded ones now. However, the run time is now about 108 minutes.

$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
real    108m58.665s
user    108m55.404s
sys     0m4.508s

@jharenza I wonder if the frequencies and other parts of the results look good. I think this PR is ready for merging. Regarding the run time, we could discuss further at PediatricOpenTargets/ticket-tracker#120.

I agree. The annotation function needs no further optimization, as it will be replaced by the upcoming annotation module. All other parts should run within 40 minutes, so they are also good. I will close the optimization ticket d3b-center/ticket-tracker-OPC#120.

Member

@jharenza jharenza left a comment

Hi @ewafula! thanks for working on this!

I have a few comments and requested changes:

  1. I spot checked GMKF cohort for MYCN:
> cnv %>%
+   filter(Dataset == "GMKF" & Disease == "Neuroblastoma" & Gene_Symbol == "MYCN") %>%
+   select(Gene_Symbol, Variant_Type, `Total_Primary_Tumors_Altered/Primary_Tumors_in_Dataset`, Frequency_in_Primary_Tumors)
# A tibble: 3 x 4
  Gene_Symbol Variant_Type  `Total_Primary_Tumors_Altered/Primary… Frequency_in_Primary…
  <chr>       <chr>         <chr>                                  <chr>                
1 MYCN        amplification 25/200                                 12.50%               
2 MYCN        gain          39/200                                 19.50%               
3 MYCN        loss          1/200                                  0.50%                
> v6 %>%
+   filter(cohort == "GMKF" & experimental_strategy == "WGS" & tumor_descriptor == "Primary Tumor") %>%
+   select(molecular_subtype) %>%
+   table()
.
    MYCN amp MYCN non-amp      Unknown 
          47          288            1 
> 47/(47+288)
[1] 0.1402985

I think this is close-ish right now - we are making updates to the CNV module within OpenPBTA (AlexsLemonade#1113) and we also have an open ticket to assess CNV thresholding (PediatricOpenTargets/ticket-tracker#113), so this will definitely change.
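The two figures compared in this spot check (25/200 from the frequency table versus 47/(47+288) from the molecular subtypes) can be reproduced with a small helper. This is only a sketch: the function name `format_frequency` is hypothetical, and leaving the percentage blank when the denominator is zero is an assumption about the table format.

```python
def format_frequency(altered, total):
    """Return ("altered/total", "xx.xx%"); the percentage is blank when total is 0."""
    ratio = "{}/{}".format(altered, total)
    if total == 0:
        return ratio, ""  # assumption: no percentage is reported for 0/0
    return ratio, "{:.2f}%".format(100.0 * altered / total)


table_value = format_frequency(25, 200)           # MYCN amplification, GMKF primary tumors
subtype_value = format_frequency(47, 47 + 288)    # MYCN amp vs. non-amp subtypes
empty_value = format_frequency(0, 0)
```

The table value formats as 12.50% and the subtype-based estimate as 14.03%, matching the numbers quoted in the spot check.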

  1. Can you combine the autosomes and xy chromosome files to one file?
  2. Will you also name these files more distinctly - they are no longer the seg files, but the cnv frequency files.
  3. I noticed you have only the cohort level analysis here, i.e. cohort+cancer_group, but do not have the all_cohorts analysis. Each of these uses distinct independent specimen files as well. Can you add this?

Thanks!

Author

ewafula commented Jul 21, 2021 via email

@jharenza
Member

Ah perhaps the overlap is in RNA only right now - GMKF and PBTA have Neuroblastoma in common, but only few in GMKF. I hadn't checked experimental strategy, but good you have it in there - thanks!

logstar commented Jul 21, 2021

  • @jharenza, the code accounts for all_cohorts. We just don’t have any cancer_group that overlaps the two cohorts (PBTA and GMKF). Therefore, all_cohorts results are not present with the current input. @logstar, is there some logic that I am missing?
  • I will combine the input consensus files and rename the output files.


Thank you for checking. The all_cohorts is correctly handled at

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/abb449d57ebef086e6f827854b05a17f27e6957f/analyses/cnv-frequencies/01-cnv-frequencies.py#L72-L81

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/abb449d57ebef086e6f827854b05a17f27e6957f/analyses/cnv-frequencies/01-cnv-frequencies.py#L118-L121
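The linked lines are not reproduced here, so the following is only a sketch of the condition discussed in this thread: assuming all_cohorts rows are produced only for a cancer_group that appears in more than one cohort, no such rows are expected when the cohorts share no cancer group. The function name `cancer_groups_in_multiple_cohorts` and the toy sample lists are hypothetical.

```python
from collections import defaultdict


def cancer_groups_in_multiple_cohorts(samples):
    """Return the sorted cancer groups that are present in two or more cohorts."""
    cohorts_by_group = defaultdict(set)
    for sample in samples:
        cohorts_by_group[sample["cancer_group"]].add(sample["cohort"])
    return sorted(
        group for group, cohorts in cohorts_by_group.items() if len(cohorts) > 1
    )


# Hypothetical inputs: no overlap between PBTA and GMKF ...
samples_pbta_gmkf = [
    {"cohort": "PBTA", "cancer_group": "Medulloblastoma"},
    {"cohort": "GMKF", "cancer_group": "Neuroblastoma"},
]
# ... and an overlap once a second cohort contains Neuroblastoma.
samples_with_target = samples_pbta_gmkf + [
    {"cohort": "TARGET", "cancer_group": "Neuroblastoma"},
]

overlap_before = cancer_groups_in_multiple_cohorts(samples_pbta_gmkf)
overlap_after = cancer_groups_in_multiple_cohorts(samples_with_target)
```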

Author

ewafula commented Jul 27, 2021

@logstar, @jharenza, update for CNV frequencies module PR

  • the annotator from the long-format-table-utils module integrated in the cnv-frequencies module
  • results regenerated using v7 data release
  • autosome and x_and_y results merged
  • JSONL and TSV result files named appropriately (similar to snv-frequencies module)

Member

@jharenza jharenza left a comment

Hi @ewafula - thanks for the updates.

In your table, I noticed that some frequencies are missing:

> cnv %>%
+   filter(Gene_symbol == "MYCN" & Disease == "Neuroblastoma") %>%
+   select(Gene_symbol, Dataset, Disease, Frequency_in_overall_dataset, `Total_primary_tumors_altered/Primary_tumors_in_dataset`, Frequency_in_primary_tumors)
# A tibble: 11 x 6
   Gene_symbol Dataset    Disease     Frequency_in_overall_da… `Total_primary_tumors_altered/Primary_tum… Frequency_in_primary_t…
   <chr>       <chr>      <chr>       <chr>                    <chr>                                      <lgl>                  
 1 MYCN        all_cohor… Neuroblast… 20.77%                   25/200                                     NA                     
 2 MYCN        all_cohor… Neuroblast… 16.43%                   39/200                                     NA                     
 3 MYCN        all_cohor… Neuroblast… 14.49%                   1/200                                      NA                     
 4 MYCN        all_cohor… Neuroblast… 17.87%                   0/0                                        NA                     
 5 MYCN        GMKF       Neuroblast… 12.50%                   25/200                                     NA                     
 6 MYCN        GMKF       Neuroblast… 19.50%                   39/200                                     NA                     
 7 MYCN        GMKF       Neuroblast… 0.50%                    1/200                                      NA                     
 8 MYCN        TARGET     Neuroblast… 27.93%                   0/0                                        NA                     
 9 MYCN        TARGET     Neuroblast… 13.51%                   0/0                                        NA                     
10 MYCN        TARGET     Neuroblast… 26.58%                   0/0                                        NA                     
11 MYCN        TARGET     Neuroblast… 33.33%                   0/0                                        NA 

Also, for the above samples, most in TARGET are primary tumors, so it seems the total calculations are being missed here:

v7 %>%
  filter(cohort == "TARGET" & experimental_strategy != "RNA-Seq" & cancer_group == "Neuroblastoma") %>%
  select(tumor_descriptor) %>%
  table()

I think this has to do with the independent sample files you are using - see my comment.

logstar commented Jul 27, 2021

@jharenza Thank you for checking the results! I think the NAs might be caused by the read_tsv default na = c("", "NA"). read_tsv with na = c("NA") retains the empty strings.

I did a quick check on the TSV file and found no NA.

$ grep -P '\tNA\t' results/gene-level-cnv-consensus-annotated-mut-freq.tsv | wc -l
0
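A Python counterpart to the grep check above, as a sketch only. It is slightly broader than the grep pattern: it inspects every field, including the first and last columns and the header row. The function name `count_rows_with_na`, the token set, and the sample text are assumptions for illustration.

```python
import csv
import io

# Assumed set of strings that should never appear as a field value.
NA_TOKENS = frozenset(["NA", "NaN", "NULL"])


def count_rows_with_na(tsv_handle, na_tokens=NA_TOKENS):
    """Count rows in which any tab-separated field equals one of na_tokens."""
    # QUOTE_NONE treats quote characters as ordinary text, like grep would.
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sum(1 for row in reader if any(field in na_tokens for field in row))


# Hypothetical sample: one blank frequency and one literal NA.
sample_tsv = (
    "Gene_symbol\tDataset\tFrequency_in_primary_tumors\n"
    "MYCN\tGMKF\t12.50%\n"
    "MYCN\tTARGET\t\n"
    "MYCN\tall_cohorts\tNA\n"
)
rows_with_na = count_rows_with_na(io.StringIO(sample_tsv))
```

A blank field is not counted as NA here, which mirrors the distinction between empty strings and literal NA values raised in this comment.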

@jharenza
Member

Oh yes I need to read in as blanks but they should not be blank is what I was saying. There are values in the numerator and denominator but then no values in frequency. Can you check?

logstar commented Jul 27, 2021

Sorry that I missed those. Thank you for pointing those out @jharenza !

They are probably caused by read_tsv guess_max number being too small, and the frequency column is read as logical. I found no NA in the result TSV by grep.

$ grep -P 'MYCN\t.*Neuroblastoma' results/gene-level-cnv-consensus-annotated-mut-freq.tsv 
MYCN    ENSG00000134323 amplification           all_cohorts     Neuroblastoma   86/414  20.77%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            all_cohorts     Neuroblastoma   68/414  16.43%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            all_cohorts     Neuroblastoma   60/414  14.49%  1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         all_cohorts     Neuroblastoma   74/414  17.87%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           GMKF    Neuroblastoma   25/200  12.50%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            GMKF    Neuroblastoma   39/200  19.50%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            GMKF    Neuroblastoma   1/200   0.50%   1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           TARGET  Neuroblastoma   62/222  27.93%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            TARGET  Neuroblastoma   30/222  13.51%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            TARGET  Neuroblastoma   59/222  26.58%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         TARGET  Neuroblastoma   74/222  33.33%  0/0             0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
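The concern raised in this thread, a count pair with a nonzero denominator but no frequency next to it, can be checked mechanically. The sketch below assumes each row is a dict with an "altered/total" string column and a matching frequency column; the function name `rows_missing_frequency` and the toy rows are hypothetical.

```python
def rows_missing_frequency(rows, ratio_column, frequency_column):
    """Return indices of rows whose ratio has a nonzero total but a blank frequency."""
    flagged = []
    for index, row in enumerate(rows):
        _altered, total = row[ratio_column].split("/")
        if int(total) > 0 and not row[frequency_column].strip():
            flagged.append(index)
    return flagged


RATIO = "Total_primary_tumors_altered/Primary_tumors_in_dataset"
FREQ = "Frequency_in_primary_tumors"

# Hypothetical rows: the last one has counts but no frequency.
toy_rows = [
    {RATIO: "25/200", FREQ: "12.50%"},
    {RATIO: "0/0", FREQ: ""},        # blank is acceptable when the total is 0
    {RATIO: "39/200", FREQ: ""},
]
missing = rows_missing_frequency(toy_rows, RATIO, FREQ)
```

Rows reporting 0/0 are deliberately not flagged, so only genuinely missing frequencies are listed.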

logstar commented Jul 27, 2021

@jharenza I am not sure why I am seeing different numbers from yours. The all_cohorts have ###/414 on my end. I will double-check.

@logstar logstar left a comment

Thank you for the updates @ewafula .

The module runs well on the Docker image/container and completes in 18 minutes! All results are also identically reproduced.

$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.

real    17m37.967s
user    17m28.653s
sys     0m11.996s

The code updates in commit ce0aa12 also look good to me.

Following are some minor suggestions.

@jharenza
Member

Ok great- thanks for checking!

Author

ewafula commented Jul 28, 2021

@jharenza, @logstar, PR update. Made all the changes discussed above, including using independent specimen files which contain WXS and panel data - analyses/independent-samples/results/independent-specimens.wgswxspanel.primary.eachcohort.tsv and analyses/independent-samples/results/independent-specimens.wgswxspanel.relapse.eachcohort.tsv

@logstar logstar left a comment

Thank you for the updates @ewafula !

The module runs well in the Docker image/container and completes in 23 minutes. The results are reproduced identically.

$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.

real    22m43.293s
user    22m31.474s
sys     0m14.797s

The code updates in commit f18ca24 look good to me.

The TSV file has no NA in it.

$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P '\tNA\t|\tNaN\t|\tNULL\t' | wc -l
0
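
Note that the grep pattern above only matches placeholders with a tab on both sides, so it would not catch one in the first or last column. A minimal Python sketch of a stricter check on the same kind of gzipped TSV follows. The function name `count_missing_fields` and the set of placeholder strings are assumptions made for illustration and are not part of the module.

```python
import csv
import gzip

# Placeholder strings treated as missing values (an assumption for this sketch).
MISSING = {"NA", "NaN", "NULL"}


def count_missing_fields(path):
    """Count fields in a gzipped TSV that are exactly a missing-value placeholder."""
    n_missing = 0
    with gzip.open(path, "rt", newline="") as handle:
        # QUOTE_NONE keeps quote characters literal, as in a plain TSV.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Exact comparison, so a value such as "NAME" is not counted.
            n_missing += sum(field in MISSING for field in row)
    return n_missing
```

Calling `count_missing_fields("results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz")` should return 0 if no field in any column is one of the placeholders.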

The primary #/# and frequencies are no longer 0/0 and blank. The relapse #/# and frequencies are still 0/0 and blank, and not only for the TARGET cohort, so I wonder whether this is expected, @jharenza.

$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | head -1
Gene_symbol     Gene_Ensembl_ID Variant_type    Variant_category        Dataset Disease Total_alterations/Patients_in_dataset   Frequency_in_overall_dataset    Total_primary_tumors_altered/Primary_tumors_in_dataset      Frequency_in_primary_tumors     Total_relapse_tumors_altered/Relapse_tumors_in_dataset  Frequency_in_relapse_tumors     Gene_full_name  RMTL    OncoKB_cancer_gene  OncoKB_oncogene_TSG     EFO     MONDO
$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P 'MYCN\t.*\tNeuroblastoma\t'
MYCN    ENSG00000134323 amplification           all_cohorts     Neuroblastoma   86/414  20.77%  81/410  19.76%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            all_cohorts     Neuroblastoma   68/414  16.43%  65/410  15.85%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            all_cohorts     Neuroblastoma   60/414  14.49%  59/410  14.39%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         all_cohorts     Neuroblastoma   74/414  17.87%  72/410  17.56%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           GMKF    Neuroblastoma   25/200  12.50%  25/200  12.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            GMKF    Neuroblastoma   39/200  19.50%  39/200  19.50%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            GMKF    Neuroblastoma   1/200   0.50%   1/200   0.50%   0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 amplification           TARGET  Neuroblastoma   62/222  27.93%  56/210  26.67%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 gain            TARGET  Neuroblastoma   30/222  13.51%  26/210  12.38%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 loss            TARGET  Neuroblastoma   59/222  26.58%  58/210  27.62%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
MYCN    ENSG00000134323 neutral         TARGET  Neuroblastoma   74/222  33.33%  72/210  34.29%  0/0             MYCN proto-oncogene, bHLH transcription factor  Relevant Molecular Target (RMTL version 1.0)        Y       Oncogene        EFO_0000621     MONDO_0005072
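
The #/# and frequency columns in the rows above are consistent with frequency = altered / total, written as a percentage with two decimals and left blank when the denominator is 0. A minimal sketch of that relationship follows. The helper names `parse_counts` and `format_frequency` are hypothetical, and this is not the code in 01-cnv-frequencies.py.

```python
def parse_counts(cell):
    """Split a '#/#' cell such as '86/414' into (altered, total) integers."""
    altered, total = cell.split("/")
    return int(altered), int(total)


def format_frequency(altered, total):
    """Return altered/total as a two-decimal percentage, or '' when total is 0."""
    if total == 0:
        # Matches the blank frequency shown next to 0/0 in the table.
        return ""
    return f"{altered / total * 100:.2f}%"


# Values taken from the all_cohorts amplification row and a relapse column.
print(format_frequency(*parse_counts("86/414")))  # 20.77%
print(repr(format_frequency(*parse_counts("0/0"))))  # ''
```

A check like this could be run over every row to confirm that each frequency column agrees with its neighbouring count column.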

@jharenza
Member

The relapse #/# and frequencies are still 0/0 and blank for not only the TARGET cohort, so I wonder if this is expected @jharenza .

Yes, this is expected if you look at the broad_tumor_descriptor field we added in v7. Thanks!

@ewafula
Author

ewafula commented Jul 28, 2021

@logstar, looking at independent-samples/results/independent-specimens.wgswxspanel.relapse.eachcohort.tsv, there are no Neuroblastoma TARGET samples (TARGET-30-*) listed. There are also no relapse samples for GMKF in the broad_tumor_descriptor column of histologies.tsv. I looked in the histologies file because I can't differentiate the Kids_First_Biospecimen_IDs of GMKF from those of PBTA in the relapse independent specimens file.
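
For anyone who wants to repeat this check, here is a minimal sketch that joins the relapse independent specimens file to the histologies file on Kids_First_Biospecimen_ID and counts specimens per cohort and cancer group. It assumes both files are tab-separated and that the histologies file has `cohort` and `cancer_group` columns. The function name `relapse_counts_by_cohort` is hypothetical.

```python
import csv
from collections import Counter


def relapse_counts_by_cohort(independent_path, histologies_path):
    """Count independent specimens per (cohort, cancer_group) via the histologies file."""
    # Collect the biospecimen IDs listed in the independent specimens file.
    with open(independent_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        ids = {row["Kids_First_Biospecimen_ID"] for row in reader}

    # Look up each listed ID in the histologies file (column names are assumed).
    counts = Counter()
    with open(histologies_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["Kids_First_Biospecimen_ID"] in ids:
                counts[(row["cohort"], row["cancer_group"])] += 1
    return counts
```

A cohort and cancer group pair that is missing from the returned counter, for example `("TARGET", "Neuroblastoma")`, has no specimens in the independent list, which would explain a 0/0 relapse column.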

@logstar logstar left a comment

@jharenza @ewafula Thank you for confirming that the aforementioned relapse #/# and frequencies are expected.

This PR looks good to me now.

Member

@jharenza jharenza left a comment

This looks good to me now as well. Thank you!
