[Long-format table annotation Part 1] download gene names and protein RefSeq IDs #55

Merged 23 commits on Jul 21, 2021
1 change: 1 addition & 0 deletions analyses/README.md
@@ -28,6 +28,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
| [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgs.primary-plus.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
| [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
| [`long-format-table-utils`](https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/long-format-table-utils) | `data/ensg-hugo-rmtl-v1-mapping.tsv` <br> `analyses/fusion_filtering/references/genelistreference.txt` <br> `data/efo-mondo-map.tsv` | Functions and scripts for handling long-format tables | `annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` <br> `annotator/annotation-data/oncokb-cancer-gene-list.tsv`
| [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-snv-consensus-mutation-tmb-all.tsv` <br> `pbta-cnv-consensus-gistic.zip`| Summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244); this analysis did not work | N/A
| [`molecular-subtyping-CRANIO`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-CRANIO) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Molecular subtyping of craniopharyngiomas samples [#810](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/810) | `results/CRANIO_molecular_subtype.tsv`
| [`molecular-subtyping-EPN`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EPN) | `pbta-histologies-base.tsv` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv` <br> `analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv` <br> `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` | *In progress*; molecular subtyping of ependymoma tumors | `results/EPN_all_data_withsubgroup.tsv`
86 changes: 86 additions & 0 deletions analyses/long-format-table-utils/README.md
@@ -0,0 +1,86 @@
## Long-format table utils

**Module author:** Yuanchao Zhang ([@logstar](https://github.com/logstar))

- [Long-format table utils](#long-format-table-utils)
  - [Purpose](#purpose)
  - [Methods](#methods)
    - [Update downloaded data that are used in this module](#update-downloaded-data-that-are-used-in-this-module)
    - [Add gene and `cancer_group` annotations](#add-gene-and-cancer_group-annotations)
      - [Implementation of long-format table annotator](#implementation-of-long-format-table-annotator)
      - [API usage of long-format table annotator](#api-usage-of-long-format-table-annotator)
      - [CLI usage of long-format table annotator](#cli-usage-of-long-format-table-annotator)

### Purpose

This module provides an application programming interface (API) and a command line interface (CLI) for handling [long-format tables](https://en.wikipedia.org/wiki/Wide_and_narrow_data#Narrow) that are generated by analysis modules. The API provides analysis module developers with functions that can be imported into their own scripts, via `source('path/to/the/function/file.R')` in R or `import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py` in Python. The CLI provides analysis module developers with scripts that can be executed in their own run-module shell script, with either `Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv` or `python3 path/to/the/script.py arg long.tsv long_edited.tsv`.
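
The Python import pattern described above can be sketched as follows. This is only an illustration: the file name `table_utils_demo.py` and the function `add_suffix` are hypothetical and are written to a temporary directory so that the example is self-contained. They are not part of this module.

```python
import os
import sys
import tempfile

# Hypothetical function file, written to a temporary directory so that the
# sketch is self-contained. In practice this would be an existing file.
function_dir = tempfile.mkdtemp()
with open(os.path.join(function_dir, "table_utils_demo.py"), "w") as handle:
    handle.write("def add_suffix(value):\n    return value + '_edited'\n")

# Make the directory importable, then import the file by its name
# without the ".py" extension.
sys.path.append(os.path.abspath(function_dir))
import table_utils_demo

edited_name = table_utils_demo.add_suffix("long")
```

Appending an absolute path keeps the import working regardless of the directory from which the calling script is run.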

This module was suggested by @jharenza and @kgaonkar6 in Slack at <https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98>, to relieve analysis module developers of the burden of adding annotations and keeping track of which annotations need to be added. This module could also handle large file storage issues at a later point, since GitHub limits file size to 100 MB.

| Sub-module name | Implemented function | Available interface(s) |
|---------------------------------------------------------------|-----------------------------------------|--------------------------------------------------------------------------------------------------------|
| [`annotator`](#implementation-of-long-format-table-annotator) | Add gene and `cancer_group` annotations | [R API](#api-usage-of-long-format-table-annotator), [R CLI](#cli-usage-of-long-format-table-annotator) |

API and CLI usage is described in the Methods section.

### Methods

#### Update downloaded data that are used in this module

Run the following command to update downloaded data that are used in this module.

```bash
bash run-update-long-format-table-utils.sh
```

The `run-update-long-format-table-utils.sh` script runs the data-downloading scripts in sub-modules, e.g. `annotator/run-download-annotation-data.sh`. The data-downloading scripts use `git diff --stat` to check for data file changes.

Users can also run `git diff --stat` themselves to check for data file changes.
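
As a rough sketch of this check, the following Python helpers run `git diff --stat` on a path and report whether it printed anything. The function names `git_diff_stat` and `has_changes` are hypothetical and are not part of this module. The sketch assumes that `git` is on the `PATH` and that the code is run inside the repository.

```python
import subprocess


def git_diff_stat(path):
    """Return the output of `git diff --stat -- <path>` as text."""
    result = subprocess.run(
        ["git", "diff", "--stat", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_changes(stat_output):
    """Report whether `git diff --stat` printed anything.

    The command prints nothing when the tracked files under the path
    have no unstaged changes.
    """
    return bool(stat_output.strip())
```

For example, `has_changes(git_diff_stat("annotator/annotation-data"))` would be true after an update that modified a tracked data file. Newly created, untracked files do not appear in `git diff` output.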

The following table lists the data files that the maintainer of this module needs to keep up to date.

| Data file | Date of the last update | Update method | Data version(s) |
|--------------------------------------------------------------------|-------------------------|-------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| `annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | 07/20/2021 | `bash run-update-long-format-table-utils.sh` | [MyGene](https://docs.mygene.info/en/latest/doc/release_changes.html): 20210627, NCBI snapshot: 20210625, Ensembl release: 104. |
| `annotator/annotation-data/oncokb-cancer-gene-list.tsv` | 07/16/2021 | Manually download from <https://www.oncokb.org/cancerGenes> | 06/16/2021 |

Note on MyGene version: [MyGene releases](https://docs.mygene.info/en/latest/doc/release_changes.html) are built regularly using data from various sources, e.g. Ensembl, NCBI, and UCSC. Each release note lists the updated data sources; for example, the Ensembl gene source was updated from 103 to 104 in build version 20210510. The `annotator/download-annotation-data.R` script uses the [R mygene package 1.22.0](https://bioconductor.org/packages/3.10/bioc/html/mygene.html) to query `Gene_Ensembl_ID` values and retrieve `Gene_full_name` and `Protein_RefSeq_ID` values. The R mygene package 1.22.0 uses the [MyGene v3 API](https://mygene.info/v3/api), with `default_url <- "http://mygene.info/v3"` specified at line 8 of `mygene.R`, even though the package documentation says that the v2 API is used.

Note on using MyGene instead of biomaRt: MyGene provides `Gene_full_name` or `Protein_RefSeq_ID` values for more `Gene_Ensembl_ID`s than biomaRt does; details are discussed at <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/55#discussion_r673485741>. However, biomaRt allows users to specify Ensembl (GENCODE) versions with `biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = 90)`. If biomaRt is preferred at a later point, use the code at <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/55#discussion_r673491208> as a starting point for updating `annotator/download-annotation-data.R`.
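
To illustrate how a retrieved gene record could be turned into the two annotation values, here is a minimal Python sketch. It is not the logic of `annotator/download-annotation-data.R`. It assumes that a record is a dictionary with a `name` field and a `refseq` dictionary whose `protein` field is either a single string or a list of strings. The function name `summarize_gene_record` and the example record are hypothetical.

```python
def summarize_gene_record(record):
    """Turn one gene record into Gene_full_name and Protein_RefSeq_ID values.

    Assumed record shape (not taken from the module's code):
    {"name": str, "refseq": {"protein": str or list of str}}
    """
    refseq = record.get("refseq") or {}
    proteins = refseq.get("protein") or []
    if isinstance(proteins, str):
        # A single ID is assumed to arrive as a plain string.
        proteins = [proteins]
    return {
        "Gene_full_name": record.get("name", ""),
        # Sorted, de-duplicated, comma-separated list of protein RefSeq IDs.
        "Protein_RefSeq_ID": ",".join(sorted(set(proteins))),
    }


# Made-up record for illustration; the gene name is a placeholder.
example_record = {
    "name": "placeholder gene full name",
    "refseq": {"protein": ["NP_001027466.1", "NP_000053.2"]},
}
summary = summarize_gene_record(example_record)
```

Sorting before joining keeps the output stable between updates, so that `git diff --stat` reports only real changes in the data file.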

#### Add gene and `cancer_group` annotations

##### Implementation of long-format table annotator

Check that input long-format tables have the following columns:

| Column name | Description |
|-------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Gene_symbol` | HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5. |
| `Gene_Ensembl_ID` | Ensembl ENSG IDs without `.#` versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710 |
| `Disease` | The `cancer_group` in the `histologies.tsv`, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma |
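
The column check can be sketched in Python as below. The helper name `missing_required_columns` and the example header are hypothetical; the actual check is implemented by the annotator in this module.

```python
import csv
import io

REQUIRED_COLUMNS = ("Gene_symbol", "Gene_Ensembl_ID", "Disease")


def missing_required_columns(header):
    """Return the required column names that are absent from a table header."""
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Made-up tab-separated header line; the "Disease" column is missing.
example_tsv = "Gene_symbol\tGene_Ensembl_ID\tsome_value\n"
header = next(csv.reader(io.StringIO(example_tsv), delimiter="\t"))
missing = missing_required_columns(header)
```

Reporting all missing columns at once, rather than stopping at the first one, lets a developer fix the input table in a single pass.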

Add one or more of the following gene and disease (`cancer_group`) annotations by specifying the `columns_to_add` parameter of the `annotate_long_format_table` function, or by specifying the `-c`/`--columns-to-add` option when running `annotator-cli.R`.

| Annotation column name | `join_by` column name | Non-missing value | Annotation data file | Source |
|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `RMTL` | `Gene_Ensembl_ID` | `Relevant Molecular Target (RMTL version 1)` or `Non-Relevant Molecular Target (RMTL version 1)` | `data/ensg-hugo-rmtl-v1-mapping.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
| `Gene_type` | `Gene_symbol` | A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`. | `analyses/fusion_filtering/references/genelistreference.txt` | Described at [https://github.com/d3b-center/annoFuse](https://github.com/d3b-center/annoFuse/blob/92cdd6975d6db84a692ad1bd631fa7db9834003d/README.md#prerequisites-for-cohort-level-analysis) |
| `OncoKB_cancer_gene` | `Gene_symbol` | `Y` or `N` | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv` | Downloaded from <https://www.oncokb.org/cancerGenes> |
| `OncoKB_oncogene_TSG` | `Gene_symbol` | `Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene` | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv` | Downloaded from <https://www.oncokb.org/cancerGenes> |
| `Gene_full_name` | `Gene_Ensembl_ID` | A single string of gene full name, e.g. `cytochrome c oxidase subunit III` and `ATP synthase F0 subunit 6` | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | [MyGene.info v3 API](https://mygene.info/v3/api) |
| `Protein_RefSeq_ID` | `Gene_Ensembl_ID` | A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1` | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | [MyGene.info v3 API](https://mygene.info/v3/api) |
| `EFO` | `Disease` | A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177` | `data/efo-mondo-map.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |
| `MONDO` | `Disease` | A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837` | `data/efo-mondo-map.tsv` | PediatricOpenTargets/OpenPedCan-analysis data release |

Note: add `Gene_type` only to gene-level tables. For other tables, leave `"Gene_type"` out of the `columns_to_add` parameter of the `annotate_long_format_table` function in `annotator-api.R`, or out of the `-c`/`--columns-to-add` option when running `annotator-cli.R`.
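
The annotation step in the table above amounts to a left join on the `join_by` column, restricted to the requested columns. The Python sketch below illustrates the idea only and is not the R implementation. The function name `annotate_rows`, the placeholder gene identifiers, and the use of an empty string for missing values are all assumptions.

```python
def annotate_rows(rows, annotation_rows, join_by, columns_to_add):
    """Left-join annotation columns onto long-format rows.

    Every input row is kept. Rows without a match in the annotation data
    get an empty string (an assumed missing-value marker) in each added column.
    """
    lookup = {}
    for annotation in annotation_rows:
        # Keep the first annotation row seen for each join key.
        lookup.setdefault(annotation[join_by], annotation)

    annotated = []
    for row in rows:
        match = lookup.get(row[join_by], {})
        new_row = dict(row)
        for column in columns_to_add:
            new_row[column] = match.get(column, "")
        annotated.append(new_row)
    return annotated


# Made-up rows with placeholder identifiers, for illustration only.
long_rows = [
    {"Gene_symbol": "GENE_A", "Gene_Ensembl_ID": "ENSG_PLACEHOLDER_1"},
    {"Gene_symbol": "GENE_B", "Gene_Ensembl_ID": "ENSG_PLACEHOLDER_2"},
]
rmtl_rows = [
    {
        "Gene_Ensembl_ID": "ENSG_PLACEHOLDER_1",
        "RMTL": "Relevant Molecular Target (RMTL version 1)",
    },
]
annotated = annotate_rows(long_rows, rmtl_rows, "Gene_Ensembl_ID", ["RMTL"])
```

A column that is left out of `columns_to_add`, such as `Gene_type` for tables that are not gene-level, is simply never added.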

The version of the PediatricOpenTargets/OpenPedCan-analysis data release is determined by the `download-data.sh` script under the `OpenPedCan-analysis` directory.

The version of `analyses/fusion_filtering/references/genelistreference.txt` is tracked by GitHub commits, and the GitHub permalink to the currently used file is <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/7fb11a020a92d06c8685736546e860bfe23da7e2/analyses/fusion_filtering/references/genelistreference.txt>.

The versions of other sources are listed in the [Update downloaded data that are used in this module](#update-downloaded-data-that-are-used-in-this-module) section.

##### API usage of long-format table annotator

##### CLI usage of long-format table annotator