[Long-format table annotation Part 3] annotator R CLI #59

logstar · 2021-07-20T16:31:19Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Create application programming interface (API) and command line interface (CLI) for handling long-format tables that are generated by analysis modules. API provides analysis module developers with functions that can be imported into their own scripts via R source('path/to/the/function/file.R') or Python import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py. CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with either Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv or python3 path/to/the/script.py arg long.tsv long_edited.tsv.

This module is suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to alleviate the burdens of analysis module developers for adding annotations and keeping track of what annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.

Sub-module name	Implemented function	Available interface(s)
`annotator`	Add gene and `cancer_group` annotations	R API, R CLI

This pull request is the third part of the long-format-table-utils module, which implements the annotator R CLI.

What was your approach?

The long-format-table-utils/annotator/annotator-cli.R file provides an R CLI for using the API to annotate long-format tables.

Check input long-format tables have the following columns:

Column name	Description
`Gene_symbol`	HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5.
`Gene_Ensembl_ID`	Ensembl ENSG IDs without `.#` versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710
`Disease`	The `cancer_group` in the `histologies.tsv`, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma

Add one or more of the following gene annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or specifying the -c option when running annotator-cli.R.

Annotation column name	Non-missing value	Source data
`RMTL`	`Relevant Molecular Target (RMTL version 1)` or `Non-Relevant Molecular Target (RMTL version 1)`	`data/ensg-hugo-rmtl-v1-mapping.tsv`
`Gene_type`	A sorted comma separated list of one or more of the following gene types: `CosmicCensus`, `Kinase`, `Oncogene`, `TranscriptionFactor`, and `TumorSuppressorGene`. Example values: `CosmicCensus`, `CosmicCensus,Kinase`, and `CosmicCensus,Kinase,TumorSuppressorGene`.	`analyses/fusion_filtering/references/genelistreference.txt`
`OncoKB_cancer_gene`	`Y` or `N`	`analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`
`OncoKB_oncogene_TSG`	`Oncogene`, or `TumorSuppressorGene`, or `Oncogene,TumorSuppressorGene`	`analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv`
`Gene_full_name`	A single string of gene full name, e.g. `cytochrome c oxidase subunit III` and `ATP synthase F0 subunit 6`	`analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv`
`Protein_RefSeq_ID`	A sorted comma separated list of one or more protein RefSeq IDs, e.g. `NP_004065.1`, `NP_000053.2,NP_001027466.1`, and `NP_000985.1,NP_001007074.1,NP_001007075.1`	`analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv`

The RMTL information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.

Note: only add Gene_type to gene-level tables, which can be implemented by leaving "Gene_type" out of the columns_to_add parameter of the annotate_long_format_table function in annotator-api.R. The Gene_type information is obtained from ../fusion_filtering/references/genelistreference.txt, and its sources are described at https://github.com/d3b-center/annoFuse#prerequisites-for-cohort-level-analysis.

The OncoKB_cancer_gene and OncoKB_oncogene_TSG information is listed in annotator/annotation-data/oncokb-cancer-gene-list.tsv, which is downloaded from https://www.oncokb.org/cancerGenes. The last update of the table is on 06/16/2021. To update the table, re-download the updated table from https://www.oncokb.org/cancerGenes.

The Gene_full_name and Protein_RefSeq_ID information is downloaded from https://mygene.info/ using the mygene package.

Add the following disease(/cancer_group) annotations:

Add one or more of the following disease(/cancer_group) annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or specifying the -c option when running annotator-cli.R.

Annotation column name	Non-missing value	Source data
`EFO`	A single string of EFO code, e.g. `EFO_1000069`, `EFO_1002008`, and `EFO_1000177`	`data/efo-mondo-map.tsv`
`MONDO`	A single string of MONDO code, e.g. `MONDO_0002787`, `MONDO_0020560`, and `MONDO_0009837`	`data/efo-mondo-map.tsv`

The EFO and MONDO information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.

Use the long-format table annotator CLI in an analysis module with the following steps:

If "Gene_symbol", "Gene_Ensembl_ID", "Disease" are not all present in the column names of the table to be annotated, add new columns or rename existing ones to have all these required columns.
Output the table that needs to be annotated in TSV format.
Change the working directory to be OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows the annotator-cli.R to locate the annotator-api.R.
Run the annotator-cli.R script with Rscript --vanilla path/to/annotator-cli.R and proper options. The Rscript command can be invoked by R system("Rscript --vanilla path/to/annotator-cli.R -h") (if the annotator R API is not preferred) or Python (>= 3.5) import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split()). For more information about R system, https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5) subprocess.run, https://docs.python.org/3/library/subprocess.html#subprocess.run.
Read the annotated table TSV file.
Rename, select, and reorder the columns of the annotated table for output in TSV, or JSON, or JSONL formats.

Following is an example usage in the rna-seq-expression-summary-stats module 01-tpm-summary-stats.R.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"                             
 [3] "cancer_group"                         "cohort"                              
 [5] "tpm_mean"                             "tpm_sd"                              
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"           
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"                           
> 
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_symbol = gene_symbol, Gene_Ensembl_ID = gene_id,
+   Disease = cancer_group)
> 
> readr::write_tsv(
+   renamed_m_tpm_ss_long_tbl,
+   "../../scratch/renamed_m_tpm_ss_long_tbl.tsv")
> 
> system(paste(
+   "Rscript --vanilla ../long-format-table-utils/annotator/annotator-cli.R",
+   "-r -v -c MONDO,RMTL,EFO",
+   "-i ../../scratch/renamed_m_tpm_ss_long_tbl.tsv",
+   "-o ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv"))
Read ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Annotate ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Output ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv...
Done.
> 
> annotated_renamed_m_tpm_ss_long_tbl <- readr::read_tsv(
+   "../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
+   na = character(),
+   col_types = readr::cols(.default = readr::col_guess()))
|==================================================================================================| 100%  222 MB
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_symbol = Gene_symbol, gene_id = Gene_Ensembl_ID,
+   cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, RMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#112

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the code in annotator/annotator-cli.R works as expected?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

No result.

What is your summary of the results?

No result.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[NA. CI is not available yet.] This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

Download `Gene_full_name` and `Protein_RefSeq_ID` from https://mygene.info/ using the mygene package.

Remove rows that have both Gene_full_name and Protein_RefSeq_ID values missing. Write NA for missing values rather than "NA" or "", in order to be consistent with the data release.

Print messages after done running shell scripts.

mygene API may return results in different orders, so sorting the values before output is necessary to reproduce previous results.

Add long-format-table-utils module.

Describe how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv.

Update OncoKB annotation table source to analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv

Update README.md.

For an annotation table, NA or duplicate should not exist in the key column that is used to join_by for adding annotations to input table.

Gene symbol is not added in this function, and gene symbol is required. It is also not necessary to check the gene_symbol column.

The annotate_long_format_table function now only supports tibble, because handling data.frame or other types of table is complex, such as preserving rownames, orders, and column types. Support for other types of table could be added if required at a later point.

add_Gene_type is more informative.

In annotate_long_format_table(), add a add_OncoKB_columns parameter. add_OncoKB_columns: TRUE or FALSE on whether to add OncoKB_cancer_gene and OncoKB_oncogene_TSG columns. Default value is FALSE.

The interface of annotate_long_format_table is changed, so that all annotation columns are added by default, and the columns_to_add parameter can be passed to add only certain columns. The current interface annotate_long_format_table is designed to favor flexibility and readability over efficiency. The funciton can be further refactored to be more efficient, e.g. not check certain tables, process certain tables, or add certain columns if they are not required.

The order of added columns may not be the same as the values of columns_to_add, in annotate_long_format_table.

In annotate_long_format_table, reorder the columns of the returned table to have the same column order as the input table and added columns in the order of columns_to_add parameter.

Renamed to ensg_rmtl_df.

Only left_join an annotation table if it is required.

@NHJohnson

As found by @NHJohnson at <d3b-center#56 (comment)>, some tests failed out side of the Docker image. See comments in annotator/tests/test_annotate_long_format_table.R for more details about the package version compatibility issues.

Add rprojroot::has_file(".git") as an alternative root criterion to handle linked git working trees created by `git worktree add` in rprojroot::find_root.

Previously, `-c ''` calls will fail because `""` is passed to API calls as required columns to add, which is unavailable. Now, `-c ''` calls will pass `character(0)` to API calls, so the output table will not have any additional annotation column.

logstar · 2021-07-22T21:55:53Z

@jharenza @NHJohnson I have made a couple of changes to this PR.

Handled Rscript --vanilla ../annotator-cli.R -c '' -i input/path.tsv -o output/path.tsv properly. See commit ebcc6ac for more details.
Enabled project root finding in git linked working trees. See commit eca90e5 for more details.
Merged changes from the part 2 PR [Long-format table annotation Part 2] annotator R API #56.
Added tests for annotator-cli.R. See commit 52ff63e for more details. All tests passed in the Docker image/container and a recently created R environment.

Add a note on the naming conventions of test files in the tsetthat package 2.1.1.

Remove input files after running CLI tests, if the input files are created by tests as intermediate files, in order to ensure a clean start for the next test to run.

Should fail on annotator CLI calls with unavailable input files or output dirs.

Note that nested functions cannot be imported, and unnamed functions cannot be imported.

Should fail on importing functions that are nestedly defined in other expressions.

Should fail, even if the function is defined in different ways.

Comments and line breaks should not change the behaviors of import_function.

Corrected test helpfer function context. Added test/ prefix to test annotator/tests/test_annotator_cli.R.

Put the imported function in the same environment as the import_function being called, so the imported functions can call other imported functions. Also add a test for such use cases.

In import_function, envir = parent.frame() makes the environment(imported_function) to be the same as the import function being called. Even though the eval documentation says the default envir parameter is parent.frame(), leaving envir = parent.frame() in the eval call will surprisingly make the environment(imported_function) to be the same as the environment of the import_function call. The tests also fail without specifying envir = parent.frame(). Maybe this is caused by the place where parent.frame() is evaluated? If specified, parent.frame() is evaluated in the calling function; if not specified, parent.frame() is evaluated in the eval call environment? Examples executed in terminal R, in order to avoid RStudio customizations. > print(environment()) <environment: R_GlobalEnv> > > foo <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }))) + } > > bar <- foo() <environment: R_GlobalEnv> <environment: 0x5645ead67670> > print(environment(bar)) <environment: 0x5645ead67670> > > baz <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }), + envir = parent.frame())) + } > > qux <- baz() <environment: R_GlobalEnv> <environment: 0x5645ead5fec0> > print(environment(qux)) <environment: R_GlobalEnv>

Previously, if replace_na_with_empty_string=TRUE in annotate_long_format_table, replace_na with empty string is applied to all columns, including columns that have no NA. If a non-character column that has no NA, its data type will be comverted to character, which may break backward compatibility of the JSON/JSONL tables that were generated without using the annotator API. Now, if replace_na_with_empty_string=TRUE in annotate_long_format_table, only replace NA with empty string in columns that have NA in them. This will not change the value types of the columns that have no NA, so the output JSON/JSONL table will be backward compatible with the ones that were generated without using the annotator API. CLI help message is also changed accordingly.

Test that replace_na_with_empty_string = TRUE replaces NA with "". Test that non-character columns without NA are not converted to character columns. Test that replace_na_with_empty_string = FALSE does not replace NA with "".

Test that --replace-na-with-empty-string replaces NA with "". Test that no --replace-na-with-empty-string does not replace NA with "". The --replace-na-with-empty-string does not need to be tested for column type conversions. The CLI write_tsv output table is the same even if a non-character column is converted to a character column before write_tsv, because "Values are only quoted if they contain a comma, quote or newline" (-- help("write_tsv", "readr") 1.3.1).

logstar · 2021-07-25T23:06:57Z

@jharenza @NHJohnson I have made a couple of changes to this PR.

Only replace NA with empty string for columns that have NA in them, if -r/--replace-na-with-empty-string is set in CLI or replace_na_with_empty_string = TRUE in API. See commit 0545a41 for more details.
Added tests for the API replace_na_with_empty_string parameter in commit f87a4c3.
Added tests for the CLI -r/--replace-na-with-empty-string option in commit dd72ce6.

All tests passed in the Docker image/container and a recently created R environment.

NHJohnson

Thanks for making these improvements! All tests pass in Docker environment.

logstar · 2021-07-26T15:01:42Z

@NHJohnson Thank you for reviewing and test-running this PR!

logstar added 30 commits July 15, 2021 16:52

Add long-format-table-utils module

0b316f4

Add oncokb-cancer-gene-list.tsv

75769b5

Update README.md

b750ed5

Download annotation data

18ea05a

Download `Gene_full_name` and `Protein_RefSeq_ID` from https://mygene.info/ using the mygene package.

Clean up ensg-gene-full-name-refseq-protein.tsv

f40e997

Remove rows that have both Gene_full_name and Protein_RefSeq_ID values missing. Write NA for missing values rather than "NA" or "", in order to be consistent with the data release.

Add echo commands in shell scripts

f941405

Print messages after done running shell scripts.

Sort mygene returned character values

96e8885

mygene API may return results in different orders, so sorting the values before output is necessary to reproduce previous results.

Update README.md

80b794e

Add long-format-table-utils module.

Update README.md

9784aae

Describe how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv.

Add annotator API function signature

0a80994

Find OpenPedcan-analysis dir path in annotator API

3adf7de

Check input long_format_table is valid

4e5cdde

Add replace_na_with_empty_string parameter

bfc8470

Update README.md

9817edf

Update OncoKB annotation table source to analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv

Merge branch 'lft-utils-ann-data-download' into lft-utils

e402fb9

Update README.md.

Read annotation data

2202441

Check no NA or duplicate in annotation key columns

b1da9d0

For an annotation table, NA or duplicate should not exist in the key column that is used to join_by for adding annotations to input table.

Rename ensg_hgsb_rmtl_df to ensg_rmtl_df

5686fc1

Gene symbol is not added in this function, and gene symbol is required. It is also not necessary to check the gene_symbol column.

Process annotation data for joining input table

ebda13b

Use double quotes for strings

9d0061f

Annotate input long_format_table

c6b4553

Change is_gene_level_table param to add_Gene_type

e168a0d

add_Gene_type is more informative.

Add add_OncoKB_columns parameter

1eb2da3

In annotate_long_format_table(), add a add_OncoKB_columns parameter. add_OncoKB_columns: TRUE or FALSE on whether to add OncoKB_cancer_gene and OncoKB_oncogene_TSG columns. Default value is FALSE.

Add a note about the order of added columns

92bd0a8

The order of added columns may not be the same as the values of columns_to_add, in annotate_long_format_table.

Order the columns of the annotated table

51cbe37

In annotate_long_format_table, reorder the columns of the returned table to have the same column order as the input table and added columns in the order of columns_to_add parameter.

Correct variable name

492a30d

Renamed to ensg_rmtl_df.

Correct replace_na call with tidyr:: prefix

4983aff

Refactor annotation part

79bc8ed

Only left_join an annotation table if it is required.

logstar added 6 commits July 22, 2021 15:09

Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli

53f27fd

Fix package version compatibility issues for tests

e337abd

As found by @NHJohnson at <d3b-center#56 (comment)>, some tests failed out side of the Docker image. See comments in annotator/tests/test_annotate_long_format_table.R for more details about the package version compatibility issues.

Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli

f9f6d17

Change root criteria in rprojroot::find_root

eca90e5

Add rprojroot::has_file(".git") as an alternative root criterion to handle linked git working trees created by `git worktree add` in rprojroot::find_root.

Properly handle -c '' CLI calls

ebcc6ac

Previously, `-c ''` calls will fail because `""` is passed to API calls as required columns to add, which is unavailable. Now, `-c ''` calls will pass `character(0)` to API calls, so the output table will not have any additional annotation column.

Add tests for annotator-cli.R

52ff63e

logstar removed the work in progress label Jul 22, 2021

logstar requested review from NHJohnson and jharenza July 22, 2021 21:56

logstar added 10 commits July 22, 2021 19:52

Update README.md

a725257

Add a note on the naming conventions of test files in the tsetthat package 2.1.1.

Remove input file after running certain CLI tests

9c4e47f

Remove input files after running CLI tests, if the input files are created by tests as intermediate files, in order to ensure a clean start for the next test to run.

Test CLI calls with unavailable input/output paths

3e0ef2f

Should fail on annotator CLI calls with unavailable input files or output dirs.

Add comments for import_function usage

2455609

Note that nested functions cannot be imported, and unnamed functions cannot be imported.

Add tests for importing nested functions

365fcf9

Should fail on importing functions that are nestedly defined in other expressions.

Add tests on importing functions defined multiple times

0224e3b

Should fail, even if the function is defined in different ways.

Test comments and line breaks for import_function

ea90284

Comments and line breaks should not change the behaviors of import_function.

Change test file context

be21860

Corrected test helpfer function context. Added test/ prefix to test annotator/tests/test_annotator_cli.R.

Put imported function in the importing environment

85aa52f

Put the imported function in the same environment as the import_function being called, so the imported functions can call other imported functions. Also add a test for such use cases.

logstar added the work in progress label Jul 25, 2021

logstar added 3 commits July 25, 2021 18:23

Test annotator API replace_na_with_empty_string

f87a4c3

Test that replace_na_with_empty_string = TRUE replaces NA with "". Test that non-character columns without NA are not converted to character columns. Test that replace_na_with_empty_string = FALSE does not replace NA with "".

logstar removed the work in progress label Jul 25, 2021

NHJohnson approved these changes Jul 26, 2021

View reviewed changes

logstar merged commit 82f93d8 into d3b-center:dev Jul 26, 2021

logstar deleted the lft-utils-ann-r-cli branch July 26, 2021 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Long-format table annotation Part 3] annotator R CLI #59

[Long-format table annotation Part 3] annotator R CLI #59

logstar commented Jul 20, 2021

logstar commented Jul 22, 2021

logstar commented Jul 25, 2021

NHJohnson left a comment

logstar commented Jul 26, 2021

[Long-format table annotation Part 3] annotator R CLI #59

[Long-format table annotation Part 3] annotator R CLI #59

Conversation

logstar commented Jul 20, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

logstar commented Jul 22, 2021

logstar commented Jul 25, 2021

NHJohnson left a comment

Choose a reason for hiding this comment

logstar commented Jul 26, 2021