d3b-center · logstar · Jul 26, 2021 · Jul 15, 2021
@@ -28,6 +28,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
 | [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgs.primary-plus.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
 | [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
+| [`long-format-table-utils`](https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/long-format-table-utils) | `data/ensg-hugo-rmtl-v1-mapping.tsv` <br> `analyses/fusion_filtering/references/genelistreference.txt` <br> `data/efo-mondo-map.tsv` | Functions and scripts for handling long-format tables | `annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` <br> `annotator/annotation-data/oncokb-cancer-gene-list.tsv`
 | [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-snv-consensus-mutation-tmb-all.tsv`  <br>  `pbta-cnv-consensus-gistic.zip`| Summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244); this analysis did not work | N/A
 | [`molecular-subtyping-CRANIO`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-CRANIO) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Molecular subtyping of craniopharyngiomas samples [#810](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/810) | `results/CRANIO_molecular_subtype.tsv`
 | [`molecular-subtyping-EPN`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EPN) | `pbta-histologies-base.tsv` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv` <br> `analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv` <br> `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` | *In progress*; molecular subtyping of ependymoma tumors | `results/EPN_all_data_withsubgroup.tsv`

@@ -0,0 +1,163 @@
+# This script adds gene and cancer_group annotations to an input long-format
+# table TSV file and outputs an annotated long-format table TSV file
+#
+# This script parses arguments and calls the annotate_long_format_table function
+# in the annotator/annotator-api.R file to add required annotation columns
+#
+# EXAMPLE USAGES:
+#
+# - Print help message:
+#
+# Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R \
+#   -h
+#
+# - Add RMTL, EFO, and MONDO columns
+#   - The `-r` option replaces NAs with empty strings in **ALL** columns of the
+#     output table, including columns carried over from the input table
+#   - The `-v` option prints extra messages on progress
+#
+# Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R \
+#   -r -v -c RMTL,EFO,MONDO \
+#   -i long_n_tpm_mean_sd_quantile_group_gene_wise_zscore.tsv \
+#   -o long_n_tpm_mean_sd_quantile_group_gene_wise_zscore_annotated.tsv
+
+
+
+# source annotator-api.R to get the annotate_long_format_table function --------
+# Detect the ".git" folder -- this will be in the project root directory. Use
+# this as the root directory to ensure proper execution, no matter where it is
+# called from.
+#
+# This only works if the working directory is OpenPedCan-analysis or a
+# subdirectory of OpenPedCan-analysis
+#
+# root_dir is the absolute path of OpenPedCan-analysis
+#
+# Adapted from the oncoprint-landscape module
+#
+# rprojroot::has_file(".git/index") returns a rprojroot::root_criterion, and
+# main git working tree, created by git clone and git init, has the .git/index
+# file
+#
+# rprojroot::has_file(".git") returns a rprojroot::root_criterion, and linked
+# git working tree, created by git worktree add, has the .git file
+#
+# "Root criteria can be combined with the | operator. The result is a
+# composite root criterion that requires either of the original criteria to
+# match." -- help("root_criterion", "rprojroot") rprojroot_1.3-2
+tryCatch(
+  {
+    root_dir <- rprojroot::find_root(
+      rprojroot::has_file(".git/index") | rprojroot::has_file(".git"))
+  },
+  error = function(err_cond) {
+    # adapted from http://adv-r.had.co.nz/Exceptions-Debugging.html
+    err_cond$message <- paste0(
+      err_cond$message,
+      "\nTry re-running this script with working directory as ",
+      "OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis.\n")
+    stop(err_cond)
+  }
+)
+# Get the annotate_long_format_table function from annotator-api.R
+source(file.path(
+  root_dir, "analyses", "long-format-table-utils", "annotator",
+  "annotator-api.R"))
+
+
+
+# Parse arguments --------------------------------------------------------------
+option_list <- list(
+  optparse::make_option(
+    c("-r", "--replace-na-with-empty-string"), action = "store_true",
+    default = FALSE,
+    help = paste0(
+      "Replace NAs with empty strings for **ALL COLUMNS THAT HAVE NA** ",
+      "in the output table")),
+  optparse::make_option(
+    c("-c", "--columns-to-add"), type = "character",
+    default = paste0(
+      "RMTL,Gene_type,OncoKB_cancer_gene,OncoKB_oncogene_TSG,",
+      "Gene_full_name,Protein_RefSeq_ID,EFO,MONDO"),
+    help = paste0(
+      "A comma-separated list of unique annotation columns to be added, ",
+      "e.g. \"EFO,MONDO\" and \"RMTL,Gene_type,OncoKB_cancer_gene\". ",
+      "Available columns are: RMTL, Gene_type, OncoKB_cancer_gene, ",
+      "OncoKB_oncogene_TSG, Gene_full_name, Protein_RefSeq_ID, EFO, MONDO. ",
+      "[Default value is \"%default\", which is to add all available ",
+      "annotation columns]")),
+  optparse::make_option(
+    c("-i", "--input-long-format-table-tsv"), type = "character",
+    help = "Path to the input long-format table TSV file to be annotated"),
+  optparse::make_option(
+    c("-o", "--output-long-format-table-tsv"), type = "character",
+    help = "Path to output the annotated long-format table TSV file"),
+  optparse::make_option(
+    c("-v", "--verbose"), action = "store_true",
+    default = FALSE,
+    help = "Print extra messages on progress")
+)
+
+option_parser <- optparse::OptionParser(
+  option_list = option_list,
+  epilogue = paste0(
+    "**NOTE** on the --input-long-format-table-tsv file: 1) the TSV file ",
+    "should use double quotes for field values that need escape, e.g. \"NA\"",
+    " for string literal \"NA\" and \"\\t\" for tab; ",
+    "2) only unquoted NA field values are treated as missing values ",
+    "internally; 3) leading and trailing white spaces in field values are ",
+    "**NOT** trimmed before parsing."))
+
+parsed_opts <- optparse::parse_args(option_parser)
+
+
+
+# Read, annotate, and output ---------------------------------------------------
+columns_to_add <- stringr::str_split(parsed_opts$`columns-to-add`, ",")[[1]]
+if (identical(columns_to_add, "")) {
+  # If no annotation column to add, run the following code with columns_to_add =
+  # character(0)
+  columns_to_add <- character(0)
+}
+
+if (parsed_opts$verbose) {
+  message(paste0("Read ", parsed_opts$`input-long-format-table-tsv`, "..."))
+}
+
+# Non-default parameter values are used to preserve the TSV content
+#
+# - quote = "\"": same as default value, but specify it here to note that double
+#   quotes are used to quote values that need escapes, e.g. "NA", "\t", and "\n"
+# - quoted_na = FALSE: "NA" in TSV content will be treated as a string "NA"
+#   value in the returned tibble
+# - na = c("NA"): Only a plain NA in the TSV content will be treated as a
+#   missing value NA in the returned tibble
+# - trim_ws = FALSE: white-space characters are not trimmed before parsing, e.g.
+#   "\t" will be preserved in the returned tibble
+# - .default = readr::col_character(): read all columns as character, in order
+#   to avoid the types of columns being guessed as logical if too many NAs are
+#   at the beginning of the file
+input_df <- readr::read_tsv(
+  parsed_opts$`input-long-format-table-tsv`,
+  quote = "\"", quoted_na = FALSE, na = c("NA"), trim_ws = FALSE,
+  col_types = readr::cols(.default = readr::col_character()))
+
+if (parsed_opts$verbose) {
+  message(paste0("Annotate ", parsed_opts$`input-long-format-table-tsv`, "..."))
+}
+ann_df <- annotate_long_format_table(
+  long_format_table = input_df, columns_to_add = columns_to_add,
+  replace_na_with_empty_string = parsed_opts$`replace-na-with-empty-string`)
+
+if (parsed_opts$verbose) {
+  message(paste0("Output ", parsed_opts$`output-long-format-table-tsv`, "..."))
+}
+# quote_escape = "double": same as default value, but specify it here to note
+# that double quotes are used to quote values that need escapes, e.g. "NA",
+# "\t", and "\n"
+readr::write_tsv(
+  ann_df, parsed_opts$`output-long-format-table-tsv`, quote_escape = "double")
+
+if (parsed_opts$verbose) {
+  message("Done.")
+}
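The root-directory detection used by the R scripts in this module (accept a directory if `.git/index` is a file, as in a main working tree, or if `.git` itself is a file, as in a linked working tree) can be sketched in shell for wrapper scripts. This is only an illustration under those two criteria; `find_git_root` is a hypothetical helper name and is not part of the module.

```bash
# Hypothetical helper, not part of the module: print the nearest ancestor
# directory (starting from $1, default ".") that looks like a git working tree.
find_git_root() {
  local dir
  # Resolve the starting directory to an absolute physical path.
  # CDPATH is cleared so that cd never prints a path of its own.
  dir="$(CDPATH= cd -- "${1:-.}" && pwd -P)" || return 1
  while true; do
    # Main working tree: .git/index is a regular file.
    # Linked working tree (git worktree add): .git itself is a regular file.
    if [ -f "$dir/.git/index" ] || [ -f "$dir/.git" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Reached the filesystem root without a match.
    if [ "$dir" = "/" ]; then
      echo "No git working tree found above ${1:-.}" >&2
      return 1
    fi
    dir="$(dirname "$dir")"
  done
}

# Example: root_dir="$(find_git_root)" || exit 1
```

As with the R version, the lookup fails with a message when the working directory is outside the repository, so callers should check the exit status.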
@@ -0,0 +1,198 @@
+# Download gene Ensembl ENSG ID to gene full name and protein RefSeq IDs mapping
+# file `annotation-data/ensg-gene-full-name-refseq-protein.tsv` from
+# https://mygene.info/
+
+# Import functions -------------------------------------------------------------
+# Get %>% without loading the whole library
+`%>%` <- dplyr::`%>%`
+
+
+
+# Define functions -------------------------------------------------------------
+# Collapse a list of refseq.protein character vectors from mygene.info query
+# results
+#
+# Args:
+# - rp_vec_list: list of refseq.protein character vectors from mygene.info query
+#   results
+#
+# Returns a single character value of a comma-separated refseq.protein value or
+# NA
+collapse_rp_lists <- function(rp_vec_list) {
+  # remove list elements that are NULL
+  rm_null_rp_vec_list <- purrr::discard(rp_vec_list, is.null)
+  # assert non-null list elements are all characters
+  lapply(rm_null_rp_vec_list, function(x) {
+    if (!is.character(x)) {
+      stop(paste0("mygene query returns non-character refseq.protein.\n",
+                  "Check query results. Revise download-annotation-data.R ",
+                  "to handle non-character values."))
+    }
+  })
+
+  # combined vector of unique refseq.protein values
+  uniq_c_rm_null_rp_vec <- sort(unique(
+    purrr::reduce(rm_null_rp_vec_list, c, .init = character(0))))
+  # remove NA
+  rm_na_uniq_c_rm_null_rp_vec <- purrr::discard(
+    uniq_c_rm_null_rp_vec, is.na)
+  # keep only NP_### refseq.protein values
+  np_rp_vec <- purrr::keep(
+    rm_na_uniq_c_rm_null_rp_vec, function(x) stringr::str_detect(x, "^NP_"))
+
+  clp_np_rp_str <- paste(np_rp_vec, collapse = ",")
+
+  if (identical(clp_np_rp_str, "")) {
+    # no NP_### value for the gene query
+    #
+    # dplyr::summarise needs function return value to be character for a
+    # character column. NA_character_ is still NA in character class rather than
+    # "NA".
+    return(NA_character_)
+  } else {
+    return(clp_np_rp_str)
+  }
+}
+
+
+
+# Collapse a name character vector from mygene.info query results
+#
+# Args:
+# - gn_vec: character vector of name values from mygene.info query results
+#
+# Returns a single character value of a comma-separated name value or NA
+collapse_name_vec <- function(gn_vec) {
+  # assert gn_vec is a character vector
+  if (!is.character(gn_vec)) {
+    stop(paste0("mygene query returns non-character name.\n",
+                "Check query results. Revise download-annotation-data.R ",
+                "to handle non-character values."))
+  }
+  # remove NAs
+  uniq_rm_na_gn_vec <- sort(unique(purrr::discard(gn_vec, is.na)))
+
+  clp_gn_str <- paste(uniq_rm_na_gn_vec, collapse = ",")
+  if (identical(clp_gn_str, "")) {
+    # no non-NA name for the gene query
+    #
+    # dplyr::summarise needs function return value to be character for a
+    # character column. NA_character_ is still NA in character class rather than
+    # "NA".
+    return(NA_character_)
+  } else {
+    return(clp_gn_str)
+  }
+}
+
+
+
+# Set up directory paths -------------------------------------------------------
+# Detect the ".git" folder -- this will be in the project root directory. Use
+# this as the root directory to ensure proper execution, no matter where it is
+# called from.
+#
+# This only works if the working directory is OpenPedCan-analysis or a
+# subdirectory of OpenPedCan-analysis
+#
+# root_dir is the absolute path of OpenPedCan-analysis
+#
+# Adapted from the oncoprint-landscape module
+#
+# rprojroot::has_file(".git/index") returns a rprojroot::root_criterion, and
+# main git working tree, created by git clone and git init, has the .git/index
+# file
+#
+# rprojroot::has_file(".git") returns a rprojroot::root_criterion, and linked
+# git working tree, created by git worktree add, has the .git file
+#
+# "Root criteria can be combined with the | operator. The result is a
+# composite root criterion that requires either of the original criteria to
+# match." -- help("root_criterion", "rprojroot") rprojroot_1.3-2
+tryCatch(
+  {
+    root_dir <- rprojroot::find_root(
+      rprojroot::has_file(".git/index") | rprojroot::has_file(".git"))
+  },
+  error = function(err_cond) {
+    # adapted from http://adv-r.had.co.nz/Exceptions-Debugging.html
+    err_cond$message <- paste0(
+      err_cond$message,
+      "\nTry re-running this script with working directory as ",
+      "OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis.\n")
+    stop(err_cond)
+  }
+)
+
+input_data_dir <- file.path(root_dir, "data")
+
+output_data_dir <- file.path(
+  root_dir, "analyses", "long-format-table-utils", "annotator",
+  "annotation-data")
+
+if (!dir.exists(output_data_dir)) {
+  dir.create(output_data_dir)
+}
+
+
+
+# Read input data --------------------------------------------------------------
+# ensg hugo rmtl mappings
+ensg_hugo_rmtl_df <- dplyr::distinct(
+  readr::read_tsv(file.path(input_data_dir, "ensg-hugo-rmtl-v1-mapping.tsv"),
+                  col_types = readr::cols(.default = readr::col_guess())))
+
+# assert all ensg_ids and gene_symbols are not NA
+if (!identical(sum(is.na(ensg_hugo_rmtl_df$ensg_id)), as.integer(0))) {
+  stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv ensg_id.\n",
+              "Check if PedOT release data are downloaded properly.\n",
+              "If data is downloaded properly, submit a GitHub data issue."))
+}
+
+if (!identical(sum(is.na(ensg_hugo_rmtl_df$gene_symbol)), as.integer(0))) {
+  stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv gene_symbol.\n",
+              "Check if PedOT release data are downloaded properly.\n",
+              "If data is downloaded properly, submit a GitHub data issue."))
+}
+
+# assert all ensg_id are unique
+if (!identical(length(unique(ensg_hugo_rmtl_df$ensg_id)),
+               nrow(ensg_hugo_rmtl_df))) {
+  stop(paste0("Found duplicated ensg_id in ensg-hugo-rmtl-v1-mapping.tsv.\n",
+              "Check if PedOT release data are downloaded properly.\n",
+              "If data is downloaded properly, submit a GitHub data issue."))
+}
+
+# Download data from https://mygene.info/ --------------------------------------
+message("Retrieve Gene_full_name and Protein_RefSeq_ID from mygene.info...")
+ens_gids <- ensg_hugo_rmtl_df$ensg_id
+
+mg_qres_list <- mygene::queryMany(
+  ens_gids, scopes = "ensembl.gene", fields = c("refseq", "name"),
+  species = "human", returnall = TRUE, return.as = "DataFrame")
+
+found_mg_qres_df <- tibble::as_tibble(
+  mg_qres_list$response[, c("query", "notfound", "name", "refseq.protein")]) %>%
+  tidyr::replace_na(list(notfound = FALSE)) %>%
+  dplyr::filter(!notfound)
+
+# remove rows that have both NA name and NULL refseq.protein
+rm_bnn_found_mg_qres_df <- found_mg_qres_df %>%
+  dplyr::filter(!(is.na(name) & purrr::map_lgl(refseq.protein, is.null)))
+
+# collapse name and refseq.protein values per query for output
+out_rm_bnn_found_mg_qres_df <- rm_bnn_found_mg_qres_df %>%
+  dplyr::group_by(query) %>%
+  dplyr::summarise(name = collapse_name_vec(name),
+                   refseq_protein = collapse_rp_lists(refseq.protein)) %>%
+  dplyr::rename(Gene_Ensembl_ID = query, Gene_full_name = name,
+                Protein_RefSeq_ID = refseq_protein)
+
+
+
+# Output TSV file --------------------------------------------------------------
+readr::write_tsv(
+  out_rm_bnn_found_mg_qres_df,
+  file.path(output_data_dir, "ensg-gene-full-name-refseq-protein.tsv"))
+
+message("Done running download-annotation-data.R")
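The collapsing rule in `collapse_rp_lists` (keep only `NP_` accessions, de-duplicate, sort, join with commas, and fall back to NA when nothing remains) can be illustrated with a small shell pipeline. This is a sketch of the rule only, not the module's implementation: `collapse_np_ids` is a hypothetical name, the IDs in the example are made up, and the sort order here is byte-wise (`LC_ALL=C`), which may differ from R's locale-dependent `sort`.

```bash
# Hypothetical helper, not part of the module: read one RefSeq protein ID per
# line on stdin and print a sorted, de-duplicated, comma-separated list of the
# NP_ IDs, or NA if no NP_ ID is present.
collapse_np_ids() {
  local joined
  # awk keeps only lines starting with NP_ (this also drops NA and empty lines)
  joined="$(awk '/^NP_/' | LC_ALL=C sort -u | paste -sd, -)"
  if [ -z "$joined" ]; then
    echo "NA"
  else
    echo "$joined"
  fi
}

# Example with made-up IDs:
# printf 'NP_000002\nXP_000003\nNP_000001\nNP_000002\n' | collapse_np_ids
# prints NP_000001,NP_000002
```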
@@ -0,0 +1,20 @@
+#!/bin/bash
+# PediatricOpenTargets 2021
+# Yuanchao Zhang
+set -e
+set -o pipefail
+
+# This script should always run as if it were being called from
+# the directory it lives in.
+# copied from the run_in_ci.sh file at
+# <https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/scripts/>
+script_directory="$(perl -e 'use File::Basename;
+ use Cwd "abs_path";
+ print dirname(abs_path($ARGV[0]));' -- "$0")"
+cd "$script_directory" || exit
+
+mkdir -p 'annotation-data'
+
+Rscript --vanilla 'download-annotation-data.R'
+
+echo 'Done running run-download-annotation-data.sh'
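For environments without Perl, the "run from the directory the script lives in" step could be approximated with plain shell. The sketch below assumes the script is not invoked through a symlink: unlike Perl's `abs_path`, it resolves symlinks in the directory part only, not in the script file name itself. `dir_of_path` is a hypothetical helper name, not something the module defines.

```bash
# Hypothetical helper, not part of the module: print the absolute, physical
# path of the directory that contains the file path given as $1.
dir_of_path() {
  # The subshell keeps the caller's working directory unchanged.
  # CDPATH is cleared so that cd never prints a path of its own.
  (CDPATH= cd -- "$(dirname -- "$1")" && pwd -P)
}

# Example inside a script:
# script_directory="$(dir_of_path "$0")"
# cd "$script_directory" || exit
```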