Annotate CNV table with mutation frequencies #52
Thank you for creating the cnv-frequencies module @ewafula!
I wonder if you could use only tabs or only spaces for indentation in 01-cnv-frequencies.py. Mixing tabs and spaces for indentation may cause hard-to-detect errors in future updates if the script is edited in certain text editors.
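As an illustration of the indentation concern, here is a minimal sketch of a checker that reports which lines of a script are indented with tabs, with spaces, or with both. The function names `indentation_report` and `is_consistent` are hypothetical and not part of the module. The check is purely textual, so indented lines inside multi-line strings are counted as well.

```python
def indentation_report(source):
    """Group 1-based line numbers by the kind of leading whitespace they use."""
    report = {"tab": [], "space": [], "mixed": []}
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # blank or whitespace-only line
        indent = line[: len(line) - len(stripped)]
        if not indent:
            continue  # line is not indented
        if " " in indent and "\t" in indent:
            report["mixed"].append(lineno)
        elif "\t" in indent:
            report["tab"].append(lineno)
        else:
            report["space"].append(lineno)
    return report


def is_consistent(report):
    """True when the source uses only tabs or only spaces for indentation."""
    return not report["mixed"] and not (report["tab"] and report["space"])


# Hypothetical snippet: lines 2 and 4 use a tab, line 3 uses eight spaces.
sample = "def f(x):\n\tif x:\n        return 1\n\treturn 0\n"
report = indentation_report(sample)
print(report, is_consistent(report))
```

A file that fails `is_consistent` can still look fine in an editor that renders tabs as eight columns, which is why this kind of mixing is hard to spot by eye.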
To make sure that other people can reproduce your results identically, could you rerun your analysis module in the Docker image? You can add RUN pip3 install mygene to your local Dockerfile to install the mygene package. I assume you did not use the Docker image, because you used python > 3.5 syntax, and the Docker image only has python == 3.5.
Following are specific suggestions and comments.
@jharenza Are |
update local cnv-frequencies with remote OpenPedCan
@logstar, @jharenza, I made all the changes you recommended and regenerated the results using a Docker image built from the OpenPedCan Dockerfile. I also include the |
Ok will. Thanks!
On Fri, Jul 16, 2021 at 1:04 PM Yuanchao Zhang wrote:
In analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh:
> +# Independent primary tumor samples file path
+primary_tumors=analyses/independent-samples/results/independent-specimens.wgs.primary.tsv
+
+# Independent relapse tumor samples file path
+relapse_tumors=analyses/independent-samples/results/independent-specimens.wgs.relapse.tsv
rerunning now with each cohort
No problem. The differences are documented by @runjin326
<https://github.com/runjin326> at README.md.
Could you rerun without the mygene part? So you would not need to run it
for a couple of hours? Or you could use this table for ENSG -> gene full
name mapping,
https://github.com/logstar/OpenPedCan-analysis/blob/lft-utils-ann-data-download/analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
.
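A minimal sketch of the table-based alternative to querying mygene: load an ENSG to gene full name table once and annotate by dictionary lookup. The column names `Gene_Ensembl_ID` and `Gene_full_name` and the helper names are assumptions for illustration; the actual header of the linked TSV may differ. The example row reuses the MYCN identifier and full name that appear in the results later in this thread.

```python
import csv
import io


def load_full_name_map(handle, id_col="Gene_Ensembl_ID", name_col="Gene_full_name"):
    """Read a tab-separated table and return {Ensembl ID: gene full name}.

    The default column names are assumed, not taken from the real file.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: row[name_col] for row in reader}


def annotate(ensembl_ids, mapping, missing=""):
    """Look up each ID, using an empty string when no full name is known."""
    return [mapping.get(ensembl_id, missing) for ensembl_id in ensembl_ids]


# In-memory stand-in for the mapping file.
example_tsv = (
    "Gene_Ensembl_ID\tGene_full_name\n"
    "ENSG00000134323\tMYCN proto-oncogene, bHLH transcription factor\n"
)
full_names = load_full_name_map(io.StringIO(example_tsv))
print(annotate(["ENSG00000134323", "ENSG_NOT_IN_TABLE"], full_names))
```

A local lookup like this avoids the network round trips that made the mygene step run for a couple of hours.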
Sorry, I must have misunderstood that. I'll make the change. So, only the
relapse and primary independent samples use the number of samples instead of
the number of patients? I have not nailed down the reasoning for it in my head yet!
A patient might have more than one sample?
On Fri, Jul 16, 2021 at 1:12 PM Yuanchao Zhang wrote:
In analyses/cnv-frequencies/01-cnv-frequencies.py:
> + tumor_dfs["relapse_tumors"] = relapse_tumors_df
+
+ # compute variant frequencies for each cancer group per cohort and cancer group in cohorts
+ # for the overal dataset (all tumor samples) and independent primary/replase tumor samples
+ def func(x):
+ d = {}
+ d["Gene_Symbol"] = ",".join(x["gene_symbol"].unique())
+ d["total_alterations"] = x["Kids_First_Participant_ID"].nunique()
+ return pd.Series(d, index=["Gene_Symbol", "total_alterations"])
+ all_tumors_frequecy_dfs = []
+ primary_tumors_frequecy_dfs = []
+ relapse_tumors_frequecy_dfs = []
+ for index, row in cancer_group_cohort_df.iterrows():
+ if row["num_samples"] > 5:
+ for df_name, tumor_df in tumor_dfs.items():
+ df = tumor_df[(tumor_df["cancer_group"] == row["cancer_group"]) & (tumor_df["cohort"] == row["cohort"])]
cohort == 'all_cohorts' cases are handled in
https://github.com/ewafula/OpenPedCan-analysis/blob/994a4178e3fd68fd2a342d1f5ffd5e6bd7030c6f/analyses/cnv-frequencies/01-cnv-frequencies.py#L116-L119
.
No problem! You are right. Only the relapse and primary independent samples use the number of samples instead of the number of patients. A patient ID could have more than one independent sample ID if we use the each-cohort independent sample list.
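To make the distinction concrete, here is a small sketch with made-up records. It counts distinct patients for the overall dataset and distinct samples for an independent sample subset, as described above. The IDs and the helper `count_distinct` are hypothetical; only the column names `Kids_First_Participant_ID` and `Kids_First_Biospecimen_ID` come from this PR.

```python
# Hypothetical altered-gene records: one patient (PT_1) contributes two samples.
records = [
    {"Kids_First_Participant_ID": "PT_1", "Kids_First_Biospecimen_ID": "BS_1"},
    {"Kids_First_Participant_ID": "PT_1", "Kids_First_Biospecimen_ID": "BS_2"},
    {"Kids_First_Participant_ID": "PT_2", "Kids_First_Biospecimen_ID": "BS_3"},
]


def count_distinct(rows, column):
    """Number of distinct values of `column` across the rows."""
    return len({row[column] for row in rows})


# Overall dataset: count patients.
patients_altered = count_distinct(records, "Kids_First_Participant_ID")
# Independent primary/relapse subsets: count samples.
samples_altered = count_distinct(records, "Kids_First_Biospecimen_ID")
print(patients_altered, samples_altered)
```

The two counts differ whenever a patient has more than one independent sample, so the numerator and denominator of a frequency should use the same unit.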
Got it! My understanding of independent sample was off. Thanks!
No problem at all. Let me know if you have any questions.
…, and JSONL conversion
@logstar, @jharenza, all changes done:
|
Thank you for the updates @ewafula!
The revised module looks good to me. All issues are resolved. The x_and_y command runs well, with TSV and JSONL results identical to the uploaded ones. I was not able to test-run the autosomes command, because it had not completed after > 30 minutes.
I wonder how long the autosomes command takes on your computer. If it takes too long, we could discuss whether it is necessary to create a new ticket to optimize this module, since you are currently familiar with the code. If we optimize at a later point, it might take more of your time and effort.
Also note that the annotation module is still under development. Once it is completed and merged, I will create a new ticket for updating this module to use the annotation module for annotation.
cc @jharenza
@logstar |
@ewafula Thank you for looking into the run time issue. As the code looks good and the results look correct and reproducible, I think we could merge this PR soon, so this PR will not become too long to review. @jharenza I have not evaluated the results with CNV specific knowledge, so I will leave this PR open for now. @ewafula I wonder if you could create a short ticket/issue at https://github.com/PediatricOpenTargets/ticket-tracker for optimizing the I will create a ticket/issue for adapting the annotation module CLI, when it is available. |
@logstar, @jharenza, I have opened a ticket/issue to work on optimizing run times for cnv-module |
Thank you @ewafula ! |
I agree. The annotation function needs no further optimization, as it will be replaced by the upcoming annotation module. All other parts should run within 40 minutes, so they are also good. I will close the optimization ticket d3b-center/ticket-tracker-OPC#120. |
Hi @ewafula! thanks for working on this!
I have a few comments and requested changes:
- I spot checked GMKF cohort for MYCN:
> cnv %>%
+ filter(Dataset == "GMKF" & Disease == "Neuroblastoma" & Gene_Symbol == "MYCN") %>%
+ select(Gene_Symbol, Variant_Type, `Total_Primary_Tumors_Altered/Primary_Tumors_in_Dataset`, Frequency_in_Primary_Tumors)
# A tibble: 3 x 4
Gene_Symbol Variant_Type `Total_Primary_Tumors_Altered/Primary… Frequency_in_Primary…
<chr> <chr> <chr> <chr>
1 MYCN amplification 25/200 12.50%
2 MYCN gain 39/200 19.50%
3 MYCN loss 1/200 0.50%
> v6 %>%
+ filter(cohort == "GMKF" & experimental_strategy == "WGS" & tumor_descriptor == "Primary Tumor") %>%
+ select(molecular_subtype) %>%
+ table()
.
MYCN amp MYCN non-amp Unknown
47 288 1
> 47/(47+288)
[1] 0.1402985
I think this is close-ish right now. We are making updates to the CNV module within OpenPBTA (AlexsLemonade#1113) and we also have an open ticket to assess CNV thresholding (d3b-center/ticket-tracker-OPC#113), so this will definitely change.
- Can you combine the autosomes and xy chromosome files to one file?
- Will you also name these files more distinctly - they are no longer the seg files, but the cnv frequency files.
- I noticed you have only the cohort level analysis here, i.e. cohort+cancer_group, but do not have the all_cohorts analysis. Each of these uses distinct independent specimen files as well. Can you add this?
Thanks!
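The arithmetic behind this spot check can be written out as a short sketch. The helper `frequency` is hypothetical and simply formats altered/total as a percentage with two decimals, matching the style of the table; `str.format` is used so the snippet also runs on Python 3.5.

```python
def frequency(altered, total):
    """Format altered/total as a percentage string with two decimals."""
    return "{:.2f}%".format(100 * altered / total)


# Module output for GMKF neuroblastoma MYCN amplification: 25 of 200 primary tumors.
module_freq = frequency(25, 200)
# Clinical annotation: 47 MYCN amp out of 47 + 288 samples with a known status.
clinical_freq = frequency(47, 47 + 288)
print(module_freq, clinical_freq)
```

The two values, 12.50% and 14.03%, are the "close-ish" numbers being compared here.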
@jharenza, the code accounts for all_cohorts. We just don't have any
cancer_group that overlaps the two cohorts (PBTA and GMKF). Therefore,
all_cohorts results are not present with the current input. @logstar, is
there some logic that I am missing?
I will combine the input consensus files and rename the output files.
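One way to combine two tab-separated tables that share a header, such as the autosomes and x_and_y files, is sketched below. This is only an assumption about how the merge could be done, not the change that was made in the PR; `combine_tsv_lines` and the example rows are hypothetical.

```python
def combine_tsv_lines(tables):
    """Concatenate tables given as iterables of lines, keeping a single header.

    Raises ValueError if a later table has a different header line.
    """
    combined = []
    header = None
    for lines in tables:
        lines = iter(lines)
        file_header = next(lines).rstrip("\n")
        if header is None:
            header = file_header
            combined.append(header)
        elif file_header != header:
            raise ValueError("tables do not share the same header")
        # Keep data rows, skipping empty lines.
        combined.extend(line.rstrip("\n") for line in lines if line.strip())
    return combined


# Placeholder rows standing in for the two input files.
autosomes = ["Gene_symbol\tVariant_type\n", "GENE_A\tgain\n"]
x_and_y = ["Gene_symbol\tVariant_type\n", "GENE_X\tloss\n"]
combined = combine_tsv_lines([autosomes, x_and_y])
print("\n".join(combined))
```

Open file handles can be passed in place of the lists, since both are iterables of lines.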
Ah perhaps the overlap is in RNA only right now - GMKF and PBTA have Neuroblastoma in common, but only few in GMKF. I hadn't checked experimental strategy, but good you have it in there - thanks! |
Thank you for checking. The |
@logstar, @jharenza, update for CNV frequencies module PR
|
Hi @ewafula - thanks for the updates.
In your table, I noticed that the primary tumor frequencies are missing:
> cnv %>%
+ filter(Gene_symbol == "MYCN" & Disease == "Neuroblastoma") %>%
+ select(Gene_symbol, Dataset, Disease, Frequency_in_overall_dataset, `Total_primary_tumors_altered/Primary_tumors_in_dataset`, Frequency_in_primary_tumors)
# A tibble: 11 x 6
Gene_symbol Dataset Disease Frequency_in_overall_da… `Total_primary_tumors_altered/Primary_tum… Frequency_in_primary_t…
<chr> <chr> <chr> <chr> <chr> <lgl>
1 MYCN all_cohor… Neuroblast… 20.77% 25/200 NA
2 MYCN all_cohor… Neuroblast… 16.43% 39/200 NA
3 MYCN all_cohor… Neuroblast… 14.49% 1/200 NA
4 MYCN all_cohor… Neuroblast… 17.87% 0/0 NA
5 MYCN GMKF Neuroblast… 12.50% 25/200 NA
6 MYCN GMKF Neuroblast… 19.50% 39/200 NA
7 MYCN GMKF Neuroblast… 0.50% 1/200 NA
8 MYCN TARGET Neuroblast… 27.93% 0/0 NA
9 MYCN TARGET Neuroblast… 13.51% 0/0 NA
10 MYCN TARGET Neuroblast… 26.58% 0/0 NA
11 MYCN TARGET Neuroblast… 33.33% 0/0 NA
Also, for the above samples, most in TARGET are primary tumors, so it seems the total calculations are being missed here:
v7 %>%
filter(cohort == "TARGET" & experimental_strategy != "RNA-Seq" & cancer_group == "Neuroblastoma") %>%
select(tumor_descriptor) %>%
table()
I think this has to do with the independent sample files you are using - see my comment.
@jharenza Thank you for checking the results! I think the NAs might be caused by I did a quick check on the TSV file and found no NA.
|
Oh yes I need to read in as blanks but they should not be blank is what I was saying. There are values in the numerator and denominator but then no values in frequency. Can you check? |
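The check being asked for can be sketched as a small consistency test between a count field such as `25/200` and its frequency field. The helper `check_row` is hypothetical, and the sketch assumes that a blank frequency is the intended output only when the denominator is zero.

```python
def check_row(count_field, freq_field):
    """Compare a 'numerator/denominator' field with its frequency field.

    Returns (is_consistent, expected_frequency). A blank frequency is
    treated as correct only for a zero denominator (an assumption).
    """
    altered, total = (int(part) for part in count_field.split("/"))
    if total == 0:
        expected = ""
    else:
        expected = "{:.2f}%".format(100 * altered / total)
    return expected == freq_field, expected


# Values with a nonzero denominator but a blank frequency are flagged.
print(check_row("25/200", ""))
# A 0/0 count with a blank frequency is accepted.
print(check_row("0/0", ""))
```

Running such a check over every row would separate the expected blanks (0/0) from frequencies that were dropped by mistake.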
Sorry that I missed those. Thank you for pointing those out @jharenza ! They are probably caused by
|
@jharenza I am not sure why I am seeing different numbers from yours. The all_cohorts have |
Thank you for the updates @ewafula .
The module runs well on the Docker image/container and completes in 18 minutes! All results are also identically reproduced.
$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.
real 17m37.967s
user 17m28.653s
sys 0m11.996s
The code updates in commit ce0aa12 also look good to me.
Following are some minor suggestions.
Ok great- thanks for checking! |
@jharenza, @logstar, PR update. Made all the changes discussed above, including using independent specimen files which contain WXS and panel data - |
Thank you for the updates @ewafula !
The module runs well in the Docker image/container and completes in 23 minutes. The results are reproduced identically.
$ time bash analyses/cnv-frequencies/run-cnv-frequencies-analysis.sh
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_autosomes_annot_freq.tsv...
Done.
Read analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Annotate analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_freq.tsv...
Output analyses/cnv-frequencies/results/consensus_wgs_plus_cnvkit_wxs_x_and_y_annot_freq.tsv...
Done.
real 22m43.293s
user 22m31.474s
sys 0m14.797s
The code updates in commit f18ca24 look good to me.
The TSV file has no NA in it.
$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P '\tNA\t|\tNaN\t|\tNULL\t' | wc -l
0
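For readers without GNU grep, a rough Python equivalent of this check is sketched below. It counts lines of a gzip-compressed TSV that contain a field equal to NA, NaN, or NULL. Unlike the grep pattern, it also inspects the first and last fields, so it is slightly stricter. The function name and the in-memory example data are hypothetical.

```python
import gzip
import io

MISSING_TOKENS = {"NA", "NaN", "NULL"}


def count_lines_with_missing(handle):
    """Count lines in which any tab-separated field is a missing-value token."""
    count = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if any(field in MISSING_TOKENS for field in fields):
            count += 1
    return count


# In-memory gzip data standing in for the results file.
raw = (
    "Gene_symbol\tDataset\tFrequency_in_primary_tumors\n"
    "MYCN\tGMKF\t12.50%\n"
    "MYCN\tTARGET\tNA\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))
with gzip.open(buffer, "rt", encoding="utf-8") as handle:
    n_missing = count_lines_with_missing(handle)
print(n_missing)
```

For the real file, `gzip.open` can be given the path to the `.tsv.gz` file instead of the in-memory buffer.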
The primary #/# and frequencies are no longer 0/0 and blank. The relapse #/# and frequencies are still 0/0 and blank, and not only for the TARGET cohort, so I wonder if this is expected @jharenza.
$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | head -1
Gene_symbol Gene_Ensembl_ID Variant_type Variant_category Dataset Disease Total_alterations/Patients_in_dataset Frequency_in_overall_dataset Total_primary_tumors_altered/Primary_tumors_in_dataset Frequency_in_primary_tumors Total_relapse_tumors_altered/Relapse_tumors_in_dataset Frequency_in_relapse_tumors Gene_full_name RMTL OncoKB_cancer_gene OncoKB_oncogene_TSG EFO MONDO
$ gunzip -c results/gene-level-cnv-consensus-annotated-mut-freq.tsv.gz | grep -P 'MYCN\t.*\tNeuroblastoma\t'
MYCN ENSG00000134323 amplification all_cohorts Neuroblastoma 86/414 20.77% 81/410 19.76% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 gain all_cohorts Neuroblastoma 68/414 16.43% 65/410 15.85% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 loss all_cohorts Neuroblastoma 60/414 14.49% 59/410 14.39% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 neutral all_cohorts Neuroblastoma 74/414 17.87% 72/410 17.56% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 amplification GMKF Neuroblastoma 25/200 12.50% 25/200 12.50% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 gain GMKF Neuroblastoma 39/200 19.50% 39/200 19.50% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 loss GMKF Neuroblastoma 1/200 0.50% 1/200 0.50% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 amplification TARGET Neuroblastoma 62/222 27.93% 56/210 26.67% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 gain TARGET Neuroblastoma 30/222 13.51% 26/210 12.38% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 loss TARGET Neuroblastoma 59/222 26.58% 58/210 27.62% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
MYCN ENSG00000134323 neutral TARGET Neuroblastoma 74/222 33.33% 72/210 34.29% 0/0 MYCN proto-oncogene, bHLH transcription factor Relevant Molecular Target (RMTL version 1.0) Y Oncogene EFO_0000621 MONDO_0005072
Yes, this is expected if you look at the broad_tumor_descriptor field we added in v7. Thanks! |
@logstar, looking the |
This looks good to me now as well. Thank you!
Purpose/implementation Section
What scientific question is your analysis addressing?
Uses consensus_seg_annotated_cn_autosomes.tsv and consensus_seg_annotated_cn_x_and_y.tsv consensus CNV calls and variant types (amplification, deep deletion, gain, loss, and neutral) to determine Ensembl gene-level mutation frequencies for each cancer type in the overall cohort dataset and in the independent primary/relapse cohort subsets of the data.
What was your approach?
The code is adapted from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/f70645b6c7e4eb15ea29e45e9ebf0adeb5798b9b/analyses/snv-frequencies by @logstar and https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/kgaonkar6/fusion_freq/analyses/fusion-frequencies by @kgaonkar6.
Given a CNV consensus table with Kids_First_Biospecimen_ID and Variant_Type, the Python script 01-cnv-frequencies.py computes gene-level mutation frequencies per cancer_group within each cohort and adds annotations.
What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#66
d3b-center/ticket-tracker-OPC#68
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Whether mutation frequencies should be restricted to only Ensembl gene identifiers without consideration of variant types (amplification, deep deletion, gain, loss, and neutral). Currently variant types are included in combination with Ensembl gene identifiers to count mutations.
Is there anything that you want to discuss further?
Still requires additional information to update the table with variant categories such as focal, segmental, chromosomal, etc., and Oncogene/TSG categories from OncoKB.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
JSONL and TSV tables
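As a rough sketch of how a TSV table maps to JSONL, each data row can be turned into one JSON object keyed by the header. This is an assumption about the general technique, not the module's actual conversion code; `tsv_to_jsonl` is a hypothetical helper and all values are kept as strings.

```python
import json


def tsv_to_jsonl(lines):
    """Yield one JSON object string per data row, keyed by the header fields."""
    lines = iter(lines)
    header = next(lines).rstrip("\n").split("\t")
    for line in lines:
        values = line.rstrip("\n").split("\t")
        yield json.dumps(dict(zip(header, values)))


# Example row using values reported in this thread.
tsv_lines = [
    "Gene_symbol\tDataset\tFrequency_in_overall_dataset\n",
    "MYCN\tGMKF\t12.50%\n",
]
jsonl_rows = list(tsv_to_jsonl(tsv_lines))
print(jsonl_rows[0])
```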
What is your summary of the results?
The CNV consensus frequency results are currently available only for the PBTA and GMKF cohorts.
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.