[Long-format table annotation Part 3] annotator R CLI #59

Merged 101 commits on Jul 26, 2021

Commits
0b316f4
Add long-format-table-utils module
logstar Jul 15, 2021
75769b5
Add oncokb-cancer-gene-list.tsv
logstar Jul 15, 2021
b750ed5
Update README.md
logstar Jul 15, 2021
18ea05a
Download annotation data
logstar Jul 16, 2021
f40e997
Clean up ensg-gene-full-name-refseq-protein.tsv
logstar Jul 16, 2021
f941405
Add echo commands in shell scripts
logstar Jul 16, 2021
96e8885
Sort mygene returned character values
logstar Jul 16, 2021
80b794e
Update README.md
logstar Jul 16, 2021
9784aae
Update README.md
logstar Jul 16, 2021
0a80994
Add annotator API function signature
logstar Jul 16, 2021
3adf7de
Find OpenPedcan-analysis dir path in annotator API
logstar Jul 16, 2021
4e5cdde
Check input long_format_table is valid
logstar Jul 16, 2021
bfc8470
Add replace_na_with_empty_string parameter
logstar Jul 17, 2021
9817edf
Update README.md
logstar Jul 17, 2021
e402fb9
Merge branch 'lft-utils-ann-data-download' into lft-utils
logstar Jul 17, 2021
2202441
Read annotation data
logstar Jul 17, 2021
b1da9d0
Check no NA or duplicate in annotation key columns
logstar Jul 17, 2021
5686fc1
Rename ensg_hgsb_rmtl_df to ensg_rmtl_df
logstar Jul 17, 2021
ebda13b
Process annotation data for joining input table
logstar Jul 17, 2021
2c044ed
Require that input long_format_table is tibble
logstar Jul 17, 2021
9d0061f
Use double quotes for strings
logstar Jul 18, 2021
c6b4553
Annotate input long_format_table
logstar Jul 18, 2021
e168a0d
Change is_gene_level_table param to add_Gene_type
logstar Jul 18, 2021
1eb2da3
Add add_OncoKB_columns parameter
logstar Jul 18, 2021
d1a10ec
Change default to add all annotation columns
logstar Jul 18, 2021
92bd0a8
Add a note about the order of added columns
logstar Jul 19, 2021
51cbe37
Order the columns of the annotated table
logstar Jul 19, 2021
492a30d
Correct variable name
logstar Jul 19, 2021
4983aff
Correct replace_na call with tidyr:: prefix
logstar Jul 19, 2021
79bc8ed
Refactor annotation part
logstar Jul 19, 2021
1eec9bd
Refactor annotation data processing with %>%
logstar Jul 19, 2021
5a2811e
Assert columns_to_add are not in the input table
logstar Jul 19, 2021
fac1a84
Update README.md
logstar Jul 19, 2021
e64f82f
Update README.md
logstar Jul 19, 2021
66e8a7c
Add documentation coment on the required columns
logstar Jul 19, 2021
c735ca8
Update README.md
logstar Jul 19, 2021
c409e1f
Update README.md
logstar Jul 19, 2021
7e5a331
Update README.md
logstar Jul 19, 2021
c2fa1d3
Update error message
logstar Jul 20, 2021
eac83e8
Add annotator-cli.R
logstar Jul 20, 2021
6820050
Add CLI options
logstar Jul 20, 2021
7344502
Implement CLI read, annotate, and output procedure
logstar Jul 20, 2021
d49344f
Add CLI example usages
logstar Jul 20, 2021
2f4eeed
Update README.md
logstar Jul 20, 2021
9387a22
Update README.md
logstar Jul 20, 2021
5b76fb0
Merge branch 'lft-utils-ann-r-api' into lft-utils
logstar Jul 20, 2021
2a36e00
Update README.md
logstar Jul 20, 2021
721c043
Update README.md
logstar Jul 20, 2021
029b238
Update README.md
logstar Jul 20, 2021
909277d
Update error message in download-annotation-data.R
logstar Jul 20, 2021
80b6af7
Update error message in download-annotation-data.R
logstar Jul 20, 2021
a1399bc
Update error message in download-annotation-data.R
logstar Jul 20, 2021
4ef7685
Specify input TSV format
logstar Jul 20, 2021
16d23b4
Update README.md
logstar Jul 20, 2021
090f080
Rename update-long-format-table-utils.sh to run-update-long-format-ta…
logstar Jul 20, 2021
a183470
Specify annotation data versions in README.md
logstar Jul 21, 2021
5716523
Update README.md
logstar Jul 21, 2021
64fb672
Merge branch 'lft-utils-ann-data-download' into lft-utils-ann-r-api
logstar Jul 21, 2021
d61e21b
Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli
logstar Jul 21, 2021
03516c9
Replace as.character(NA) with NA_character_
logstar Jul 21, 2021
6f76fae
Remove test cases in download-annotation-data.R
logstar Jul 21, 2021
48a131d
Add unit testing using the testthat package
logstar Jul 21, 2021
0773a1e
Fix typos in download-annotation-data.R
logstar Jul 21, 2021
a08cf5f
Update README.md
logstar Jul 21, 2021
796a26c
Update README.md
logstar Jul 21, 2021
6d6838c
Merge branch 'lft-utils-ann-data-download' into lft-utils-ann-r-api
logstar Jul 21, 2021
5029d4b
Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli
logstar Jul 21, 2021
29325c7
Fix a typo in helper_import_function.R
logstar Jul 21, 2021
3fc592e
Merge branch 'lft-utils-ann-data-download' into lft-utils-ann-r-api
logstar Jul 21, 2021
76dea21
Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli
logstar Jul 21, 2021
cb683a3
Add note on requiring both Gene_Ensembl_ID and Gene_symbol
logstar Jul 21, 2021
76b1322
Update README.md
logstar Jul 21, 2021
24ba5d2
Update README.md
logstar Jul 21, 2021
e5a653c
Update README.md
logstar Jul 21, 2021
fd99539
Change root criteria in rprojroot::find_root
logstar Jul 22, 2021
a85a97d
Remove git diff in run-download-annotation-data.sh
logstar Jul 22, 2021
2cdadf4
Remove outdated comments in annotator-api.R
logstar Jul 22, 2021
0f3d912
Add comments on how to add new annotation columns
logstar Jul 22, 2021
3292b7f
Refactor annotator-api.R using functional paradigm
logstar Jul 22, 2021
0cd9a8c
Add comments in test R files
logstar Jul 22, 2021
81a041e
Fix a bug that can add duplicated annotation cols
logstar Jul 22, 2021
7e38c19
Add tests to annotate_long_format_table
logstar Jul 22, 2021
53f27fd
Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli
logstar Jul 22, 2021
e337abd
Fix package version compatibility issues for tests
logstar Jul 22, 2021
f9f6d17
Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli
logstar Jul 22, 2021
eca90e5
Change root criteria in rprojroot::find_root
logstar Jul 22, 2021
ebcc6ac
Properly handle `-c ''` CLI calls
logstar Jul 22, 2021
52ff63e
Add tests for annotator-cli.R
logstar Jul 22, 2021
a725257
Update README.md
logstar Jul 22, 2021
9c4e47f
Remove input file after running certain CLI tests
logstar Jul 23, 2021
3e0ef2f
Test CLI calls with unavailable input/output paths
logstar Jul 23, 2021
2455609
Add comments for import_function usage
logstar Jul 23, 2021
365fcf9
Add tests for importing nested functions
logstar Jul 23, 2021
0224e3b
Add tests on importing functions defined multiple times
logstar Jul 23, 2021
ea90284
Test comments and line breaks for import_function
logstar Jul 23, 2021
be21860
Change test file context
logstar Jul 23, 2021
85aa52f
Put imported function in the importing environment
logstar Jul 23, 2021
2047329
Specify envir = parent.frame() in eval call
logstar Jul 24, 2021
0545a41
Only replace NA with empty string in columns that have NA
logstar Jul 25, 2021
f87a4c3
Test annotator API replace_na_with_empty_string
logstar Jul 25, 2021
dd72ce6
Test annotator CLI --replace-na-with-empty-string
logstar Jul 25, 2021
1 change: 1 addition & 0 deletions analyses/README.md
@@ -28,6 +28,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
| [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgs.primary-plus.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary.tsv` (included in data download) <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
| [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
| [`long-format-table-utils`](https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/long-format-table-utils) | `data/ensg-hugo-rmtl-v1-mapping.tsv` <br> `analyses/fusion_filtering/references/genelistreference.txt` <br> `data/efo-mondo-map.tsv` | Functions and scripts for handling long-format tables | `annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv` <br> `annotator/annotation-data/oncokb-cancer-gene-list.tsv`
| [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-snv-consensus-mutation-tmb-all.tsv` <br> `pbta-cnv-consensus-gistic.zip`| Summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244); this analysis did not work | N/A
| [`molecular-subtyping-CRANIO`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-CRANIO) | `pbta-histologies-base.tsv` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | Molecular subtyping of craniopharyngiomas samples [#810](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/810) | `results/CRANIO_molecular_subtype.tsv`
| [`molecular-subtyping-EPN`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EPN) | `pbta-histologies-base.tsv` <br> `analyses/collapse-rnaseq/results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-cnv-consensus-gistic.zip` <br> `analyses/chromosomal-instability/breakpoint-data/union_of_breaks_densities.tsv` <br> `analyses/fusion-summary/results/fusion_summary_ependymoma_foi.tsv` <br> `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv` | *In progress*; molecular subtyping of ependymoma tumors | `results/EPN_all_data_withsubgroup.tsv`
238 changes: 238 additions & 0 deletions analyses/long-format-table-utils/README.md

Large diffs are not rendered by default.

435 changes: 435 additions & 0 deletions analyses/long-format-table-utils/annotator/annotator-api.R

Large diffs are not rendered by default.

163 changes: 163 additions & 0 deletions analyses/long-format-table-utils/annotator/annotator-cli.R
@@ -0,0 +1,163 @@
# This script adds gene and cancer_group annotations to an input long-format
# table TSV file and outputs an annotated long-format table TSV file
#
# This script parses arguments and calls the annotate_long_format_table function
# in the annotator/annotator-api.R file to add required annotation columns
#
# EXAMPLE USAGES:
#
# - Print help message:
#
# Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R \
# -h
#
# - Add RMTL, EFO, and MONDO columns
# - The `-r` option replaces NAs with empty strings for **ALL COLUMNS THAT HAVE
#   NA** in the output table
# - The `-v` option prints extra messages on progress
#
# Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R \
# -r -v -c RMTL,EFO,MONDO \
# -i long_n_tpm_mean_sd_quantile_group_gene_wise_zscore.tsv \
# -o long_n_tpm_mean_sd_quantile_group_gene_wise_zscore_annotated.tsv



# source annotator-api.R to get the annotate_long_format_table function --------
# Detect the ".git" folder -- this will be in the project root directory. Use
# this as the root directory to ensure proper execution, no matter where it is
# called from.
#
# This only works if the working directory is OpenPedCan-analysis or a
# subdirectory of OpenPedCan-analysis
#
# root_dir is the absolute path of OpenPedCan-analysis
#
# Adapted from the oncoprint-landscape module
#
# rprojroot::has_file(".git/index") returns a rprojroot::root_criterion, and
# main git working tree, created by git clone and git init, has the .git/index
# file
#
# rprojroot::has_file(".git") returns a rprojroot::root_criterion, and linked
# git working tree, created by git worktree add, has the .git file
#
# "Root criteria can be combined with the | operator. The result is a
# composite root criterion that requires either of the original criteria to
# match." -- help("root_criterion", "rprojroot") rprojroot_1.3-2
tryCatch(
  {
    root_dir <- rprojroot::find_root(
      rprojroot::has_file(".git/index") | rprojroot::has_file(".git"))
  },
  error = function(err_cond) {
    # adapted from http://adv-r.had.co.nz/Exceptions-Debugging.html
    err_cond$message <- paste0(
      err_cond$message,
      "\nTry re-running this script with working directory as ",
      "OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis.\n")
    stop(err_cond)
  }
)
# Get the annotate_long_format_table function from annotator-api.R
source(file.path(
  root_dir, "analyses", "long-format-table-utils", "annotator",
  "annotator-api.R"))



# Parse arguments --------------------------------------------------------------
option_list <- list(
  optparse::make_option(
    c("-r", "--replace-na-with-empty-string"), action = "store_true",
    default = FALSE,
    help = paste0(
      "Replace NAs with empty strings for **ALL COLUMNS THAT HAVE NA** ",
      "in the output table")),
  optparse::make_option(
    c("-c", "--columns-to-add"), type = "character",
    default = paste0(
      "RMTL,Gene_type,OncoKB_cancer_gene,OncoKB_oncogene_TSG,",
      "Gene_full_name,Protein_RefSeq_ID,EFO,MONDO"),
    help = paste0(
      "A comma-separated list of unique annotation columns to be added, ",
      "e.g. \"EFO,MONDO\" and \"RMTL,Gene_type,OncoKB_cancer_gene\". ",
      "Available columns are: RMTL, Gene_type, OncoKB_cancer_gene, ",
      "OncoKB_oncogene_TSG, Gene_full_name, Protein_RefSeq_ID, EFO, MONDO. ",
      "[Default value is \"%default\", which is to add all available ",
      "annotation columns]")),
  optparse::make_option(
    c("-i", "--input-long-format-table-tsv"), type = "character",
    help = "Path to the input long-format table TSV file to be annotated"),
  optparse::make_option(
    c("-o", "--output-long-format-table-tsv"), type = "character",
    help = "Path to output the annotated long-format table TSV file"),
  optparse::make_option(
    c("-v", "--verbose"), action = "store_true",
    default = FALSE,
    help = "Print extra messages on progress")
)

option_parser <- optparse::OptionParser(
  option_list = option_list,
  epilogue = paste0(
    "**NOTE** on the --input-long-format-table-tsv file: 1) the TSV file ",
    "should use double quotes for field values that need escape, e.g. \"NA\"",
    " for string literal \"NA\" and \"\\t\" for tab; ",
    "2) only unquoted NA field values are treated as missing values ",
    "internally; 3) leading and trailing white spaces in field values are ",
    "**NOT** trimmed before parsing."))

parsed_opts <- optparse::parse_args(option_parser)



# Read, annotate, and output ---------------------------------------------------
columns_to_add <- stringr::str_split(parsed_opts$`columns-to-add`, ",")[[1]]
if (identical(columns_to_add, "")) {
  # If no annotation column to add, run the following code with columns_to_add =
  # character(0)
  columns_to_add <- character(0)
}

if (parsed_opts$verbose) {
  message(paste0("Read ", parsed_opts$`input-long-format-table-tsv`, "..."))
}

# Non-default parameter values are used to preserve the TSV content
#
# - quote = "\"": same as default value, but specify it here to note that double
# quotes are used to quote values that need escapes, e.g. "NA", "\t", and "\n"
# - quoted_na = FALSE: "NA" in TSV content will be treated as a string "NA"
# value in the returned tibble
# - na = c("NA"): Only a plain NA in the TSV content will be treated as a
# missing value NA in the returned tibble
# - trim_ws = FALSE: white-space characters are not trimmed before parsing, e.g.
# "\t" will be preserved in the returned tibble
# - .default = readr::col_character(): read all columns as character, in order
# to avoid the types of columns being guessed as logical if too many NAs are
# at the beginning of the file
input_df <- readr::read_tsv(
  parsed_opts$`input-long-format-table-tsv`,
  quote = "\"", quoted_na = FALSE, na = c("NA"), trim_ws = FALSE,
  col_types = readr::cols(.default = readr::col_character()))

if (parsed_opts$verbose) {
  message(paste0("Annotate ", parsed_opts$`input-long-format-table-tsv`, "..."))
}
ann_df <- annotate_long_format_table(
  long_format_table = input_df, columns_to_add = columns_to_add,
  replace_na_with_empty_string = parsed_opts$`replace-na-with-empty-string`)

if (parsed_opts$verbose) {
  message(paste0("Output ", parsed_opts$`output-long-format-table-tsv`, "..."))
}
# quote_escape = "double": same as default value, but specify it here to note
# that double quotes are used to quote values that need escapes, e.g. "NA",
# "\t", and "\n"
readr::write_tsv(
  ann_df, parsed_opts$`output-long-format-table-tsv`, quote_escape = "double")

if (parsed_opts$verbose) {
  message("Done.")
}
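
The root lookup at the top of `annotator-cli.R` can also be mirrored in a shell wrapper that wants to fail early before starting R. The following is a minimal sketch, assuming only the two criteria described in the script comments: a `.git/index` file for a main working tree, or a `.git` file for a linked working tree. The helper name `find_repo_root` is hypothetical and is not part of this pull request.

```bash
#!/bin/bash
# Hypothetical helper (not part of this PR): walk up from a start directory
# until a directory contains either a .git/index file (main working tree) or a
# .git file (linked working tree created by `git worktree add`).
find_repo_root() {
  local dir
  # Resolve the start directory in a subshell so the caller's working
  # directory is left unchanged.
  dir="$(cd "${1:-.}" && pwd -P)" || return 1
  while :; do
    if [[ -f "$dir/.git/index" || -f "$dir/.git" ]]; then
      printf '%s\n' "$dir"
      return 0
    fi
    # Stop at the filesystem root without a match.
    if [[ "$dir" == "/" ]]; then
      return 1
    fi
    dir="$(dirname "$dir")"
  done
}
```

When `git` itself is available, `git rev-parse --show-toplevel` reports the top-level directory of both main and linked working trees, so a helper like this is mainly useful where calling `git` is not an option.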
198 changes: 198 additions & 0 deletions analyses/long-format-table-utils/annotator/download-annotation-data.R
@@ -0,0 +1,198 @@
# Download gene Ensembl ENSG ID to gene full name and protein RefSeq IDs mapping
# file `annotation-data/ensg-gene-full-name-refseq-protein.tsv` from
# https://mygene.info/

# Import functions -------------------------------------------------------------
# Get %>% without loading the whole library
`%>%` <- dplyr::`%>%`



# Define functions -------------------------------------------------------------
# Collapse a list of refseq.protein character vectors from mygene.info query
# results
#
# Args:
# - rp_vec_list: list of refseq.protein character vectors from mygene.info query
# results
#
# Returns a single character value of a comma-separated refseq.protein value or
# NA
collapse_rp_lists <- function(rp_vec_list) {
  # remove list elements that are NULL
  rm_null_rp_vec_list <- purrr::discard(rp_vec_list, is.null)
  # assert non-null list elements are all characters
  lapply(rm_null_rp_vec_list, function(x) {
    if (!is.character(x)) {
      stop(paste0("mygene query returns non-character refseq.protein.\n",
                  "Check query results. Revise download-annotation-data.R ",
                  "to handle non-character values."))
    }
  })

  # combined vector of unique refseq.protein values
  uniq_c_rm_null_rp_vec <- sort(unique(
    purrr::reduce(rm_null_rp_vec_list, c, .init = character(0))))
  # remove NA
  rm_na_uniq_c_rm_null_rp_vec <- purrr::discard(
    uniq_c_rm_null_rp_vec, is.na)
  # keep only NP_### refseq.protein values
  np_rp_vec <- purrr::keep(
    rm_na_uniq_c_rm_null_rp_vec, function(x) stringr::str_detect(x, "^NP_"))

  clp_np_rp_str <- paste(np_rp_vec, collapse = ",")

  if (identical(clp_np_rp_str, "")) {
    # no NP_### value for the gene query
    #
    # dplyr::summarise needs function return value to be character for a
    # character column. NA_character_ is still NA in character class rather than
    # "NA".
    return(NA_character_)
  } else {
    return(clp_np_rp_str)
  }
}



# Collapse a name character vector from mygene.info query results
#
# Args:
# - gn_vec: character vector of name values from mygene.info query results
#
# Returns a single character value of a comma-separated name value or NA
collapse_name_vec <- function(gn_vec) {
  # assert the name vector is a character vector
  if (!is.character(gn_vec)) {
    stop(paste0("mygene query returns non-character name.\n",
                "Check query results. Revise download-annotation-data.R ",
                "to handle non-character values."))
  }
  # remove NAs, de-duplicate, and sort
  uniq_rm_na_gn_vec <- sort(unique(purrr::discard(gn_vec, is.na)))

  clp_gn_str <- paste(uniq_rm_na_gn_vec, collapse = ",")
  if (identical(clp_gn_str, "")) {
    # no non-NA name for the gene query
    #
    # dplyr::summarise needs function return value to be character for a
    # character column. NA_character_ is still NA in character class rather than
    # "NA".
    return(NA_character_)
  } else {
    return(clp_gn_str)
  }
}



# Set up directory paths -------------------------------------------------------
# Detect the ".git" folder -- this will be in the project root directory. Use
# this as the root directory to ensure proper execution, no matter where it is
# called from.
#
# This only works if the working directory is OpenPedCan-analysis or a
# subdirectory of OpenPedCan-analysis
#
# root_dir is the absolute path of OpenPedCan-analysis
#
# Adapted from the oncoprint-landscape module
#
# rprojroot::has_file(".git/index") returns a rprojroot::root_criterion, and
# main git working tree, created by git clone and git init, has the .git/index
# file
#
# rprojroot::has_file(".git") returns a rprojroot::root_criterion, and linked
# git working tree, created by git worktree add, has the .git file
#
# "Root criteria can be combined with the | operator. The result is a
# composite root criterion that requires either of the original criteria to
# match." -- help("root_criterion", "rprojroot") rprojroot_1.3-2
tryCatch(
  {
    root_dir <- rprojroot::find_root(
      rprojroot::has_file(".git/index") | rprojroot::has_file(".git"))
  },
  error = function(err_cond) {
    # adapted from http://adv-r.had.co.nz/Exceptions-Debugging.html
    err_cond$message <- paste0(
      err_cond$message,
      "\nTry re-running this script with working directory as ",
      "OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis.\n")
    stop(err_cond)
  }
)

input_data_dir <- file.path(root_dir, "data")

output_data_dir <- file.path(
  root_dir, "analyses", "long-format-table-utils", "annotator",
  "annotation-data")

if (!dir.exists(output_data_dir)) {
  dir.create(output_data_dir)
}



# Read input data --------------------------------------------------------------
# ensg hugo rmtl mappings
ensg_hugo_rmtl_df <- dplyr::distinct(
  readr::read_tsv(file.path(input_data_dir, "ensg-hugo-rmtl-v1-mapping.tsv"),
                  col_types = readr::cols(.default = readr::col_guess())))

# assert all ensg_ids and gene_symbols are not NA
if (!identical(sum(is.na(ensg_hugo_rmtl_df$ensg_id)), as.integer(0))) {
  stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv ensg_id.\n",
              "Check if PedOT release data are downloaded properly.\n",
              "If data is downloaded properly, submit a GitHub data issue."))
}

if (!identical(sum(is.na(ensg_hugo_rmtl_df$gene_symbol)), as.integer(0))) {
  stop(paste0("Found NA in ensg-hugo-rmtl-v1-mapping.tsv gene_symbol.\n",
              "Check if PedOT release data are downloaded properly.\n",
              "If data is downloaded properly, submit a GitHub data issue."))
}

# assert all ensg_id are unique
if (!identical(length(unique(ensg_hugo_rmtl_df$ensg_id)),
               nrow(ensg_hugo_rmtl_df))) {
  stop(paste0("Found duplicated ensg_id in ensg-hugo-rmtl-v1-mapping.tsv.\n",
              "Check if PedOT release data are downloaded properly.\n",
              "If data is downloaded properly, submit a GitHub data issue."))
}

# Download data from https://mygene.info/ --------------------------------------
message("Retrieve Gene_full_name and Protein_RefSeq_ID from mygene.info...")
ens_gids <- ensg_hugo_rmtl_df$ensg_id

mg_qres_list <- mygene::queryMany(
  ens_gids, scopes = "ensembl.gene", fields = c("refseq", "name"),
  species = "human", returnall = TRUE, return.as = "DataFrame")

found_mg_qres_df <- tibble::as_tibble(
  mg_qres_list$response[, c("query", "notfound", "name", "refseq.protein")]) %>%
  tidyr::replace_na(list(notfound = FALSE)) %>%
  dplyr::filter(!notfound)

# remove rows that have both NA name and NULL refseq.protein
rm_bnn_found_mg_qres_df <- found_mg_qres_df %>%
  dplyr::filter(!(is.na(name) & purrr::map_lgl(refseq.protein, is.null)))

# collapses name and refseq.protein for output
out_rm_bnn_found_mg_qres_df <- rm_bnn_found_mg_qres_df %>%
  dplyr::group_by(query) %>%
  dplyr::summarise(name = collapse_name_vec(name),
                   refseq_protein = collapse_rp_lists(refseq.protein)) %>%
  dplyr::rename(Gene_Ensembl_ID = query, Gene_full_name = name,
                Protein_RefSeq_ID = refseq_protein)



# Output TSV file --------------------------------------------------------------
readr::write_tsv(
  out_rm_bnn_found_mg_qres_df,
  file.path(output_data_dir, "ensg-gene-full-name-refseq-protein.tsv"))

message("Done running download-annotation-data.R")
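
The collapsing rule in `collapse_rp_lists` (de-duplicate, sort, keep only `NP_` accessions, join with commas, and fall back to NA when nothing is left) can be spot-checked outside R. The following is a minimal shell sketch on made-up accessions, not the method used by the script. The helper name `collapse_np_ids` is hypothetical, and it prints the literal text `NA` where the R function returns `NA_character_`.

```bash
#!/bin/bash
# Hypothetical helper (not part of this PR): read one RefSeq protein accession
# per line on stdin and print a sorted, de-duplicated, comma-separated list of
# the accessions that start with NP_.
collapse_np_ids() {
  local joined
  # grep exits non-zero when nothing matches, so tolerate that with `|| true`.
  joined="$({ grep '^NP_' || true; } | LC_ALL=C sort -u | paste -sd, -)"
  # Mirror the NA result of the R function when no NP_ accession is left.
  printf '%s\n' "${joined:-NA}"
}

# Made-up accessions for illustration only.
printf '%s\n' 'XP_333' 'NP_222' 'NP_111' 'NP_222' | collapse_np_ids
# prints NP_111,NP_222
```

Sorting under `LC_ALL=C` keeps the order reproducible across machines; R's `sort()` follows the session locale, so the two can differ for accessions that mix letter case.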
20 changes: 20 additions & 0 deletions analyses/long-format-table-utils/annotator/run-download-annotation-data.sh
@@ -0,0 +1,20 @@
#!/bin/bash
# PediatricOpenTargets 2021
# Yuanchao Zhang
set -e
set -o pipefail

# This script should always run as if it were being called from
# the directory it lives in.
# copied from the run_in_ci.sh file at
# <https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/scripts/>
script_directory="$(perl -e 'use File::Basename;
  use Cwd "abs_path";
  print dirname(abs_path($ARGV[0]));' -- "$0")"
cd "$script_directory" || exit

mkdir -p 'annotation-data'

Rscript --vanilla 'download-annotation-data.R'

echo 'Done running run-download-annotation-data.sh'
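
The Perl one-liner in this script resolves the absolute directory of the script, so that relative paths such as `annotation-data` work from any calling directory. Where Perl is unavailable, a Bash-only variant is a common substitute. The following is a minimal sketch, assuming Bash and a script that is invoked directly rather than through a symlink: unlike `abs_path`, it resolves symlinks in the directory path but not a symlinked script file. The helper name `dir_of` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not part of this PR): print the physical absolute
# directory that contains the given path.
dir_of() {
  # The subshell keeps the caller's working directory unchanged.
  (cd "$(dirname -- "$1")" && pwd -P)
}

# Run the rest of the script from the directory the script lives in.
script_directory="$(dir_of "${BASH_SOURCE[0]}")"
cd "$script_directory" || exit
```

`${BASH_SOURCE[0]}` is used instead of `$0` so that the lookup also works when the file is sourced from another Bash script.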