Merge branch 'lft-utils-ann-data-download' into lft-utils-ann-r-api

Merge changes in the data downloading PR d3b-center#55. Rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh. Specify annotation data versions in README.md. Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.
logstar · Jul 21, 2021 · 64fb672
2 parents 9387a22 + 5716523
commit 64fb672
Showing 3 changed files with 30 additions and 28 deletions.
diff --git a/analyses/long-format-table-utils/README.md b/analyses/long-format-table-utils/README.md
@@ -30,14 +30,23 @@ API and CLI usages and descriptions are in the Methods section.
 Run the following command to update downloaded data that are used in this module.
 
 ```bash
-bash update-long-format-table-utils.sh
+bash run-update-long-format-table-utils.sh
 ```
 
-The `update-long-format-table-utils.sh` runs data downloading scripts in sub-modules, e.g. `annotator/run-download-annotation-data.sh`. The data downloading scripts use `git diff --stat` to check for data file changes.
+The `run-update-long-format-table-utils.sh` runs data downloading scripts in sub-modules, e.g. `annotator/run-download-annotation-data.sh`. The data downloading scripts use `git diff --stat` to check for data file changes.
 
 Users could also use `git diff --stat` to check for data file changes.
 
-Note: To update `annotator/annotation-data/oncokb-cancer-gene-list.tsv` (last updated on 06/16/2021), re-download the updated table from <https://www.oncokb.org/cancerGenes>. The website does not provide any URL for downloading the table, so the maintainer of this module has to manually update the table.
+The following table lists the data files that the maintainer of this module needs to update.
+
+| Data file                                                          | Date of the last update | Update method                                               | Data version(s)                                                                                                                 |
+|--------------------------------------------------------------------|-------------------------|-------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
+| `annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | 07/20/2021              | `bash run-update-long-format-table-utils.sh`                | [MyGene](https://docs.mygene.info/en/latest/doc/release_changes.html): 20210627, NCBI snapshot: 20210625, Ensembl release: 104. |
+| `annotator/annotation-data/oncokb-cancer-gene-list.tsv`            | 07/16/2021              | Manually download from <https://www.oncokb.org/cancerGenes> | 06/16/2021                                                                                                                      |
+
+Note on MyGene version: [MyGene releases](https://docs.mygene.info/en/latest/doc/release_changes.html) are built regularly using data from various sources, e.g. Ensembl, NCBI, and UCSC. In each release note, updated data sources are listed, e.g. Ensembl gene is updated from 103 to 104 in Build version 20210510. The `annotator/download-annotation-data.R` script uses the [R mygene package 1.22.0](https://bioconductor.org/packages/3.10/bioc/html/mygene.html) to query `Gene_Ensembl_ID` values to retrieve `Gene_full_name` and `Protein_RefSeq_ID` values. The R mygene package 1.22.0 uses [MyGene v3 API](https://mygene.info/v3/api) with `default_url <- "http://mygene.info/v3"` specified at line 8 of `mygene.R`, even though the R mygene package 1.22.0 documentation says that v2 API is used.
+
+Note on using MyGene instead of biomaRt: MyGene has more `Gene_Ensembl_ID`s that have `Gene_full_name` or `Protein_RefSeq_ID` available than biomaRt, and more details are discussed at <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/55#discussion_r673485741>. However, biomaRt allows users to specify Ensembl (GENCODE) versions with `biomaRt::useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = 90)`. If biomaRt is preferred at a later point, use the code at <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/55#discussion_r673491208> as a starting point for updating `annotator/download-annotation-data.R`.
 
 #### Add gene and `cancer_group` annotations
 
@@ -51,33 +60,26 @@ Check input long-format tables have the following columns:
 | `Gene_Ensembl_ID` | Ensembl ENSG IDs without `.#` versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710                                                       |
 | `Disease`         | The `cancer_group` in the `histologies.tsv`, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma |
 
-Add one or more of the following gene annotations:
-
-| Annotation column name | Non-missing value                                                                                                                                                                                                                                                      | Source data                                                                                         |
-|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
-| `RMTL`                 | `Relevant Molecular Target (RMTL version 1)` or `Non-Relevant Molecular Target (RMTL version 1)`                                                                                                                                                                       | `data/ensg-hugo-rmtl-v1-mapping.tsv`                                                                |
-| `Gene_type`            | A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`. | `analyses/fusion_filtering/references/genelistreference.txt`                                        |
-| `OncoKB_cancer_gene`   | `Y` or `N`                                                                                                                                                                                                                                                             | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`            |
-| `OncoKB_oncogene_TSG`  | `Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene`                                                                                                                                                                                                | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`            |
-| `Gene_full_name`       | A single string of gene full name, e.g. `cytochrome c oxidase subunit III` and `ATP synthase F0 subunit 6`                                                                                                                                                             | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` |
-| `Protein_RefSeq_ID`    | A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1`                                                                                                     | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` |
-
-The `RMTL` information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.
-
-Note: only add `Gene_type` to gene-level tables, which can be implemented by leaving `"Gene_type"` out of the `columns_to_add` parameter of the `annotate_long_format_table` function in `annotator-api.R`. The `Gene_type` information is obtained from `../fusion_filtering/references/genelistreference.txt`, and its sources are described at <https://github.com/d3b-center/annoFuse#prerequisites-for-cohort-level-analysis>.
+Add one or more of the following gene and disease (`cancer_group`) annotations by specifying the `columns_to_add` parameter of the `annotate_long_format_table` function, or the `-c`/`--columns-to-add` option when running `annotator-cli.R`.
 
-The `OncoKB_cancer_gene` and `OncoKB_oncogene_TSG` information is listed in `annotator/annotation-data/oncokb-cancer-gene-list.tsv`, which is downloaded from <https://www.oncokb.org/cancerGenes>. The last update of the table is on 06/16/2021. To update the table, re-download the updated table from <https://www.oncokb.org/cancerGenes>.
+| Annotation column name | `join_by` column name | Non-missing value                                                                                                                                                                                                                                                       | Annotation data file                                                                                | Source                                                                                                                                                                                        |
+|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `RMTL`                 | `Gene_Ensembl_ID`     | `Relevant Molecular Target (RMTL version 1)` or `Non-Relevant Molecular Target (RMTL version 1)`                                                                                                                                                                        | `data/ensg-hugo-rmtl-v1-mapping.tsv`                                                                | PediatricOpenTargets/OpenPedCan-analysis data release                                                                                                                                         |
+| `Gene_type`            | `Gene_symbol`         | A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`. | `analyses/fusion_filtering/references/genelistreference.txt`                                        | Described at [https://github.com/d3b-center/annoFuse](https://github.com/d3b-center/annoFuse/blob/92cdd6975d6db84a692ad1bd631fa7db9834003d/README.md#prerequisites-for-cohort-level-analysis) |
+| `OncoKB_cancer_gene`   | `Gene_symbol`         | `Y` or `N`                                                                                                                                                                                                                                                              | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`            | Downloaded from <https://www.oncokb.org/cancerGenes>                                                                                                                                          |
+| `OncoKB_oncogene_TSG`  | `Gene_symbol`         | `Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene`                                                                                                                                                                                                 | `analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`            | Downloaded from <https://www.oncokb.org/cancerGenes>                                                                                                                                          |
+| `Gene_full_name`       | `Gene_Ensembl_ID`     | A single string of gene full name, e.g. `cytochrome c oxidase subunit III` and `ATP synthase F0 subunit 6`                                                                                                                                                              | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | [MyGene.info v3 API](https://mygene.info/v3/api)                                                                                                                                              |
+| `Protein_RefSeq_ID`    | `Gene_Ensembl_ID`     | A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1`                                                                                                      | `analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` | [MyGene.info v3 API](https://mygene.info/v3/api)                                                                                                                                              |
+| `EFO`                  | `Disease`             | A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177`                                                                                                                                                                                       | `data/efo-mondo-map.tsv`                                                                            | PediatricOpenTargets/OpenPedCan-analysis data release                                                                                                                                         |
+| `MONDO`                | `Disease`             | A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837`                                                                                                                                                                               | `data/efo-mondo-map.tsv`                                                                            | PediatricOpenTargets/OpenPedCan-analysis data release                                                                                                                                         |
 
-The `Gene_full_name` and `Protein_RefSeq_ID` information is downloaded from <https://mygene.info/> using the [mygene package](https://bioconductor.org/packages/release/bioc/html/mygene.html).
+Note: add `Gene_type` only to gene-level tables. For other tables, leave `"Gene_type"` out of the `columns_to_add` parameter of the `annotate_long_format_table` function in `annotator-api.R`, or out of the `-c`/`--columns-to-add` option when running `annotator-cli.R`.
 
-Add the following disease(/`cancer_group`) annotations:
+The version of the PediatricOpenTargets/OpenPedCan-analysis data release is determined by `download-data.sh` in the `OpenPedCan-analysis` directory.
 
-| Annotation column name | Non-missing value                                                                         | Source data              |
-|------------------------|-------------------------------------------------------------------------------------------|--------------------------|
-| `EFO`                  | A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177`         | `data/efo-mondo-map.tsv` |
-| `MONDO`                | A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837` | `data/efo-mondo-map.tsv` |
+The version of `analyses/fusion_filtering/references/genelistreference.txt` is tracked by GitHub commits, and the GitHub permalink to the currently used file is <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/7fb11a020a92d06c8685736546e860bfe23da7e2/analyses/fusion_filtering/references/genelistreference.txt>.
 
-The `EFO` and `MONDO` information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.
+The versions of other sources are listed in the [Update downloaded data that are used in this module](#update-downloaded-data-that-are-used-in-this-module) section.
 
 ##### R API usage of long-format table annotator
 
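The README text in this diff states that the data downloading scripts, and users, can rely on `git diff --stat` to check for data file changes. The following is a minimal sketch of such a check, not the repository's actual script. The function name `report_data_changes` and the messages it prints are hypothetical, and the sketch assumes it is run inside a git work tree where the data files are already tracked.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the repository's scripts.
# Prints the `git diff --stat` summary for a directory of tracked data files,
# or a short message when the working tree matches the index.
# Must be run from inside a git work tree.
report_data_changes() {
  local data_dir="$1"
  local stat
  # Limit the comparison to the given directory.
  stat="$(git diff --stat -- "$data_dir")"
  if [ -n "$stat" ]; then
    printf 'Data files changed:\n%s\n' "$stat"
  else
    printf 'No data file changes in %s\n' "$data_dir"
  fi
}
```

A maintainer could call `report_data_changes annotator/annotation-data` after running the update script to decide whether the refreshed files need to be committed.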

diff --git a/analyses/long-format-table-utils/annotator/download-annotation-data.R b/analyses/long-format-table-utils/annotator/download-annotation-data.R
@@ -153,21 +153,21 @@ ensg_hugo_rmtl_df <- dplyr::distinct(
 if (!identical(sum(is.na(ensg_hugo_rmtl_df$ensg_id)), as.integer(0))) {
   stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv ensg_id.\n",
               "Check if PedOT release data are downloaded properly.\n",
-              "If data is downloaded properly, submit a GitHub data question."))
+              "If data is downloaded properly, submit a GitHub data issue."))
 }
 
 if (!identical(sum(is.na(ensg_hugo_rmtl_df$gene_symbol)), as.integer(0))) {
   stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv gene_symbol.\n",
               "Check if PedOT release data are downloaded properly.\n",
-              "If data is downloaded properly, submit a GitHub data question."))
+              "If data is downloaded properly, submit a GitHub data issue."))
 }
 
 # assert all ensg_id are unique
 if (!identical(length(unique(ensg_hugo_rmtl_df$ensg_id)),
                nrow(ensg_hugo_rmtl_df))) {
   stop(paste0("Found duplicated ensg_id in ensg-hugo-rmtl-v1-mapping.tsv.\n",
               "Check if PedOT release data are downloaded properly.\n",
-              "If data is downloaded properly, submit a GitHub data question."))
+              "If data is downloaded properly, submit a GitHub data issue."))
 }
 
 # Download data from https://mygene.info/ --------------------------------------
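The hunk above ends where the script starts downloading from MyGene, which the README describes as querying `Gene_Ensembl_ID` values through the MyGene v3 API to retrieve `Gene_full_name` and `Protein_RefSeq_ID`. The R script uses the mygene package for this. As a rough illustration only, the sketch below builds the equivalent single-gene request URL by hand. The helper name `mygene_gene_url` is hypothetical, and the choice of the `name` and `refseq.protein` fields is an assumption about which MyGene fields back the two annotation columns; the R script may request different fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the repository.
# Builds a MyGene.info v3 gene-annotation URL for one Ensembl gene ID.
# Assumption: `name` and `refseq.protein` are the fields that correspond to
# Gene_full_name and Protein_RefSeq_ID.
mygene_gene_url() {
  local ensg_id="$1"
  printf 'https://mygene.info/v3/gene/%s?fields=name,refseq.protein\n' "$ensg_id"
}

# Example request (requires network access):
#   curl -s "$(mygene_gene_url ENSG00000039139)"
```

Querying one gene at a time like this is only practical for spot checks. A batch client such as the R mygene package is a better fit for a full table of IDs.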

diff --git a/...e-utils/update-long-format-table-utils.sh b/...ils/run-update-long-format-table-utils.sh