This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Proposed Analysis: Create an API for annotating long-format tables generated by analysis modules #112

Closed
logstar opened this issue Jul 15, 2021 · 3 comments

logstar commented Jul 15, 2021

What are the scientific goals of the analysis?

Create an R function and an R script API for adding gene and cancer_group annotations to the long-format tables generated by analysis modules. Module developers could incorporate the function or the script in their modules via either source('path/to/the/function') or Rscript --vanilla path/to/the/script long.tsv long_annotated.tsv.
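A minimal sketch of how the two entry points could fit together, assuming the script takes exactly two positional arguments as in the command above. The names annotate_long_format_table, parse_io_args, and run_annotation_cli are hypothetical and not part of any existing module; the annotation step is left as a placeholder.

```r
# Hypothetical sketch; function names are illustrative, not an existing API.

# Entry point for source(): takes a long-format data frame and returns it.
# Placeholder body: a real implementation would add the annotation columns.
annotate_long_format_table <- function(long_format_table) {
  stopifnot(is.data.frame(long_format_table))
  long_format_table
}

# Parse the two positional arguments: input TSV path and output TSV path.
parse_io_args <- function(args) {
  if (length(args) != 2) {
    stop("Usage: Rscript --vanilla path/to/the/script long.tsv long_annotated.tsv")
  }
  list(input = args[1], output = args[2])
}

# Entry point for Rscript: read the input TSV, annotate, write the output TSV.
run_annotation_cli <- function(args = commandArgs(trailingOnly = TRUE)) {
  io <- parse_io_args(args)
  long_tbl <- read.delim(io$input, stringsAsFactors = FALSE,
                         check.names = FALSE)
  annotated <- annotate_long_format_table(long_tbl)
  write.table(annotated, io$output, sep = "\t", quote = FALSE,
              row.names = FALSE)
  invisible(io$output)
}

# In the script file, the last line would call: run_annotation_cli()
```

Keeping the annotation logic in a plain function and the argument handling in a thin wrapper would let the same code serve both source() users and Rscript users.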

This module was suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, to relieve analysis module developers of the burden of adding annotations and keeping track of which annotations need to be added. The module could also handle large file storage issues at a later point, since GitHub limits file size to 100 MB.

The gene annotations to be added:

| Annotation column name | Source data |
| --- | --- |
| RMTL | data/ensg-hugo-rmtl-v1-mapping.tsv |
| Gene_type | analyses/fusion_filtering/references/genelistreference.txt |
| OncoKB_cancer_gene | analyses/snv-frequencies/input/oncokb_cancer_gene_list.tsv |
| OncoKB_oncogene_TSG | analyses/snv-frequencies/input/oncokb_cancer_gene_list.tsv |
| Gene_full_name | TBD. Download from https://mygene.info/ to this module. |
| Protein_RefSeq_ID | TBD. Download from https://mygene.info/ to this module. |

Update Fri Jul 16 2021 by @logstar: Added Protein_RefSeq_ID to the gene annotations.

Note: only add Gene_type to gene-level tables.
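The gene annotation step, including the gene-level-only rule for Gene_type, could be sketched as a lookup on the gene identifier columns. This is an illustration under assumptions: the mapping column names (ensg_id, rmtl, Gene_Symbol, type), the is_gene_level flag, and all annotation values below are hypothetical placeholders, not the actual contents of the source files.

```r
# Hypothetical sketch; mapping column names and values are assumptions.
add_gene_annotations <- function(long_tbl, rmtl_map, gene_type_map = NULL,
                                 is_gene_level = FALSE) {
  stopifnot(all(c("Gene_symbol", "Gene_Ensembl_ID") %in% colnames(long_tbl)))
  # match() keeps the row order and row count of long_tbl. Genes absent from
  # the mapping get NA; if a key is duplicated, the first entry is used.
  rmtl_idx <- match(long_tbl$Gene_Ensembl_ID, rmtl_map$ensg_id)
  long_tbl$RMTL <- rmtl_map$rmtl[rmtl_idx]
  # Gene_type is only added to gene-level tables.
  if (is_gene_level && !is.null(gene_type_map)) {
    type_idx <- match(long_tbl$Gene_symbol, gene_type_map$Gene_Symbol)
    long_tbl$Gene_type <- gene_type_map$type[type_idx]
  }
  long_tbl
}

# Toy inputs with placeholder annotation values.
long_tbl <- data.frame(
  Gene_symbol = c("PHLPP1", "TM6SF1"),
  Gene_Ensembl_ID = c("ENSG00000039139", "ENSG00000111261"),
  stringsAsFactors = FALSE
)
rmtl_map <- data.frame(ensg_id = "ENSG00000039139",
                       rmtl = "placeholder_rmtl",
                       stringsAsFactors = FALSE)
gene_type_map <- data.frame(Gene_Symbol = "TM6SF1",
                            type = "placeholder_type",
                            stringsAsFactors = FALSE)

gene_level <- add_gene_annotations(long_tbl, rmtl_map, gene_type_map,
                                   is_gene_level = TRUE)
var_level <- add_gene_annotations(long_tbl, rmtl_map, gene_type_map)
```

Here gene_level gains both RMTL and Gene_type, while var_level gains only RMTL.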

The disease annotations to be added:

| Annotation column name | Source data |
| --- | --- |
| EFO | data/efo-mondo-map.tsv |
| MONDO | data/efo-mondo-map.tsv |
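The disease annotations could be added the same way, keyed on the Disease column. In this sketch the mapping column names (cancer_group, efo_code, mondo_code) are assumptions about data/efo-mondo-map.tsv, and the codes are placeholders rather than real EFO or MONDO identifiers.

```r
# Hypothetical sketch; mapping column names and codes are assumptions.
add_disease_annotations <- function(long_tbl, efo_mondo_map) {
  stopifnot("Disease" %in% colnames(long_tbl))
  idx <- match(long_tbl$Disease, efo_mondo_map$cancer_group)
  long_tbl$EFO <- efo_mondo_map$efo_code[idx]
  long_tbl$MONDO <- efo_mondo_map$mondo_code[idx]
  # Report diseases without a mapping instead of silently leaving NA.
  unmatched <- unique(long_tbl$Disease[is.na(idx)])
  if (length(unmatched) > 0) {
    warning("No EFO/MONDO mapping for: ", paste(unmatched, collapse = "; "))
  }
  long_tbl
}

# Toy inputs with placeholder codes.
disease_tbl <- data.frame(
  Disease = c("Atypical Teratoid Rhabdoid Tumor",
              "Adamantinomatous Craniopharyngioma"),
  stringsAsFactors = FALSE
)
efo_mondo_map <- data.frame(
  cancer_group = c("Adamantinomatous Craniopharyngioma",
                   "Atypical Teratoid Rhabdoid Tumor"),
  efo_code = c("EFO_PLACEHOLDER_1", "EFO_PLACEHOLDER_2"),
  mondo_code = c("MONDO_PLACEHOLDER_1", "MONDO_PLACEHOLDER_2"),
  stringsAsFactors = FALSE
)

disease_annotated <- add_disease_annotations(disease_tbl, efo_mondo_map)
```

Warning on unmatched cancer groups is a design choice in this sketch; a stricter module might stop instead.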

The tables that could be annotated by the script:

| Analysis module | Table path | GitHub PR |
| --- | --- | --- |
| snv-frequencies | analyses/snv-frequencies/results/gene-level-snv-consensus-annotated-mut-freq.tsv<br>analyses/snv-frequencies/results/var-level-snv-consensus-annotated-mut-freq.tsv | Open PR: d3b-center/OpenPedCan-analysis#45 |
| rna-seq-expression-summary-stats | analyses/rna-seq-expression-summary-stats/results/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz<br>analyses/rna-seq-expression-summary-stats/results/long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz | Merged PR: d3b-center/OpenPedCan-analysis#27 |
| fusion-frequencies | analyses/fusion-frequencies/results/putative-oncogene-fusion-freq.tsv<br>analyses/fusion-frequencies/results/putative-oncogene-fused-gene-freq.tsv | Open PR: d3b-center/OpenPedCan-analysis#49 |
| cnv-frequencies | analyses/cnv-frequencies/results/cnv-consensus-annotated-autosomes-frequencies.tsv.gz<br>analyses/cnv-frequencies/results/cnv-consensus-annotated-frequencies-x_and_y.tsv.gz | Open PR: d3b-center/OpenPedCan-analysis#52 |
| tumor-gtex-plots | TBD (?) | Open PR: d3b-center/OpenPedCan-analysis#29 |
| DESeq_analysis | TBD (?) | Open PR: d3b-center/OpenPedCan-analysis#28 |

What methods do you plan to use to accomplish the scientific goals?

Check that all long-format TSV tables have the following columns:

| Column name | Description |
| --- | --- |
| Gene_symbol | HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5. |
| Gene_Ensembl_ID | Ensembl ENSG IDs without .# versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710. |
| Disease | The cancer_group in the histologies.tsv, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma. |
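The column check could look roughly like the sketch below. The function names are hypothetical. Treating a versioned Ensembl ID as an error is an assumption based on the "without .# versions" requirement, and strip_ensembl_version is an optional helper a module might use before calling the check.

```r
# Hypothetical sketch; function names are illustrative.
required_cols <- c("Gene_symbol", "Gene_Ensembl_ID", "Disease")

check_long_format_columns <- function(long_tbl) {
  missing_cols <- setdiff(required_cols, colnames(long_tbl))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  # Ensembl IDs must not carry a ".#" version suffix.
  versioned <- grepl("\\.[0-9]+$", long_tbl$Gene_Ensembl_ID)
  if (any(versioned)) {
    stop("Gene_Ensembl_ID contains versioned IDs, e.g. ",
         long_tbl$Gene_Ensembl_ID[which(versioned)[1]])
  }
  invisible(TRUE)
}

# Remove a trailing ".#" version suffix from Ensembl IDs.
strip_ensembl_version <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

valid_tbl <- data.frame(
  Gene_symbol = "DNAH5",
  Gene_Ensembl_ID = "ENSG00000039139",
  Disease = "Low-grade glioma/astrocytoma",
  stringsAsFactors = FALSE
)
check_long_format_columns(valid_tbl)
```

Failing early on a missing column or a versioned ID would keep the later lookups from silently producing all-NA annotation columns.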

Add the annotation columns listed above to each table from their corresponding source data.

What input data are required for this analysis?

  • data/ensg-hugo-rmtl-v1-mapping.tsv
  • analyses/fusion_filtering/references/genelistreference.txt
  • analyses/snv-frequencies/input/oncokb_cancer_gene_list.tsv
  • data/efo-mondo-map.tsv

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

2-4 days.

Who will complete the analysis (please add a GitHub handle here if relevant)?

@logstar

What relevant scientific literature relates to this analysis?

OncoKB papers: https://ascopubs.org/doi/full/10.1200/PO.17.00011 and https://science.sciencemag.org/content/339/6127/1546.full.

Gene_type sources: https://github.com/d3b-center/annoFuse#prerequisites-for-cohort-level-analysis.

Notes

@kgaonkar6 @ewafula @sangeetashukla @komalsrathi @jharenza Let me know if you have any suggestions on this module.

@logstar logstar self-assigned this Jul 15, 2021
@jharenza
Member

Great description @logstar! We will also want to add Uberon codes for GTEx tissues, which will be released in v7, per issue #85.

@logstar
Author

logstar commented Jul 15, 2021

> Great description @logstar! We will also want to add Uberon codes for GTEx tissues, which will be released in v7, per issue #85.

Thank you for the note!

@logstar
Author

logstar commented Jul 26, 2021

All relevant PRs merged.

@logstar logstar closed this as completed Jul 26, 2021