Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Long-format table annotation Part 3] annotator R CLI #59

Merged
merged 101 commits into from
Jul 26, 2021

Conversation

logstar
Copy link

@logstar logstar commented Jul 20, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Create application programming interface (API) and command line interface (CLI) for handling long-format tables that are generated by analysis modules. API provides analysis module developers with functions that can be imported into their own scripts via R source('path/to/the/function/file.R') or Python import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py. CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with either Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv or python3 path/to/the/script.py arg long.tsv long_edited.tsv.

This module is suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to alleviate the burdens of analysis module developers for adding annotations and keeping track of what annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.

Sub-module name Implemented function Available interface(s)
annotator Add gene and cancer_group annotations R API, R CLI

This pull request is the third part of the long-format-table-utils module, which implements the annotator R CLI.

What was your approach?

The long-format-table-utils/annotator/annotator-cli.R file provides an R CLI for using the API to annotate long-format tables.

Check input long-format tables have the following columns:

Column name Description
Gene_symbol HUGO symbols, e.g. PHLPP1, TM6SF1, and DNAH5.
Gene_Ensembl_ID Ensembl ENSG IDs without .# versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710
Disease The cancer_group in the histologies.tsv, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytoma

Add one or more of the following gene annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or specifying the -c option when running annotator-cli.R.

Annotation column name Non-missing value Source data
RMTL Relevant Molecular Target (RMTL version 1) or Non-Relevant Molecular Target (RMTL version 1) data/ensg-hugo-rmtl-v1-mapping.tsv
Gene_type A sorted comma separated list of one or more of the following gene types: CosmicCensus, Kinase, Oncogene, TranscriptionFactor, and TumorSuppressorGene. Example values: CosmicCensus, CosmicCensus,Kinase, and CosmicCensus,Kinase,TumorSuppressorGene. analyses/fusion_filtering/references/genelistreference.txt
OncoKB_cancer_gene Y or N analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
OncoKB_oncogene_TSG Oncogene, or TumorSuppressorGene, or Oncogene,TumorSuppressorGene analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
Gene_full_name A single string of gene full name, e.g. cytochrome c oxidase subunit III and ATP synthase F0 subunit 6 analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
Protein_RefSeq_ID A sorted comma separated list of one or more protein RefSeq IDs, e.g. NP_004065.1, NP_000053.2,NP_001027466.1, and NP_000985.1,NP_001007074.1,NP_001007075.1 analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv

The RMTL information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.

Note: only add Gene_type to gene-level tables, which can be implemented by leaving "Gene_type" out of the columns_to_add parameter of the annotate_long_format_table function in annotator-api.R. The Gene_type information is obtained from ../fusion_filtering/references/genelistreference.txt, and its sources are described at https://github.com/d3b-center/annoFuse#prerequisites-for-cohort-level-analysis.

The OncoKB_cancer_gene and OncoKB_oncogene_TSG information is listed in annotator/annotation-data/oncokb-cancer-gene-list.tsv, which is downloaded from https://www.oncokb.org/cancerGenes. The last update of the table is on 06/16/2021. To update the table, re-download the updated table from https://www.oncokb.org/cancerGenes.

The Gene_full_name and Protein_RefSeq_ID information is downloaded from https://mygene.info/ using the mygene package.

Add the following disease(/cancer_group) annotations:

Add one or more of the following disease(/cancer_group) annotations, by specifying the columns_to_add parameter in the annotate_long_format_table function, or specifying the -c option when running annotator-cli.R.

Annotation column name Non-missing value Source data
EFO A single string of EFO code, e.g. EFO_1000069, EFO_1002008, and EFO_1000177 data/efo-mondo-map.tsv
MONDO A single string of MONDO code, e.g. MONDO_0002787, MONDO_0020560, and MONDO_0009837 data/efo-mondo-map.tsv

The EFO and MONDO information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.

Use the long-format table annotator CLI in an analysis module with the following steps:

  1. If "Gene_symbol", "Gene_Ensembl_ID", "Disease" are not all present in the column names of the table to be annotated, add new columns or rename existing ones to have all these required columns.
  2. Output the table that needs to be annotated in TSV format.
  3. Change the working directory to be OpenPedCan-analysis or a subdirectory of OpenPedCan-analysis. This allows the annotator-cli.R to locate the annotator-api.R.
  4. Run the annotator-cli.R script with Rscript --vanilla path/to/annotator-cli.R and proper options. The Rscript command can be invoked by R system("Rscript --vanilla path/to/annotator-cli.R -h") (if the annotator R API is not preferred) or Python (>= 3.5) import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split()). For more information about R system, https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5) subprocess.run, https://docs.python.org/3/library/subprocess.html#subprocess.run.
  5. Read the annotated table TSV file.
  6. Rename, select, and reorder the columns of the annotated table for output in TSV, or JSON, or JSONL formats.

Following is an example usage in the rna-seq-expression-summary-stats module 01-tpm-summary-stats.R.

> getwd()
[1] "/home/rstudio/OpenPedCan-analysis/analyses/rna-seq-expression-summary-stats"
> class(m_tpm_ss_long_tbl)
[1] "tbl_df"     "tbl"        "data.frame"
> colnames(m_tpm_ss_long_tbl)
 [1] "gene_symbol"                          "gene_id"                             
 [3] "cancer_group"                         "cohort"                              
 [5] "tpm_mean"                             "tpm_sd"                              
 [7] "tpm_mean_cancer_group_wise_zscore"    "tpm_mean_gene_wise_zscore"           
 [9] "tpm_mean_cancer_group_wise_quantiles" "n_samples"                           
> 
> renamed_m_tpm_ss_long_tbl <- dplyr::rename(
+   m_tpm_ss_long_tbl, Gene_symbol = gene_symbol, Gene_Ensembl_ID = gene_id,
+   Disease = cancer_group)
> 
> readr::write_tsv(
+   renamed_m_tpm_ss_long_tbl,
+   "../../scratch/renamed_m_tpm_ss_long_tbl.tsv")
> 
> system(paste(
+   "Rscript --vanilla ../long-format-table-utils/annotator/annotator-cli.R",
+   "-r -v -c MONDO,RMTL,EFO",
+   "-i ../../scratch/renamed_m_tpm_ss_long_tbl.tsv",
+   "-o ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv"))
Read ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Annotate ../../scratch/renamed_m_tpm_ss_long_tbl.tsv...
Output ../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv...
Done.
> 
> annotated_renamed_m_tpm_ss_long_tbl <- readr::read_tsv(
+   "../../scratch/annotated_renamed_m_tpm_ss_long_tbl.tsv",
+   na = character(),
+   col_types = readr::cols(.default = readr::col_guess()))
|==================================================================================================| 100%  222 MB
> m_tpm_ss_long_tbl <- dplyr::rename(
+   annotated_renamed_m_tpm_ss_long_tbl,
+   gene_symbol = Gene_symbol, gene_id = Gene_Ensembl_ID,
+   cancer_group = Disease)
> m_tpm_ss_long_tbl <- dplyr::select(
+   m_tpm_ss_long_tbl, gene_symbol, RMTL, gene_id,
+   cancer_group, EFO, MONDO, n_samples, cohort,
+   tpm_mean, tpm_sd,
+   tpm_mean_cancer_group_wise_zscore, tpm_mean_gene_wise_zscore,
+   tpm_mean_cancer_group_wise_quantiles)

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#112

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the code in annotator/annotator-cli.R works as expected?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

No result.

What is your summary of the results?

No result.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • [NA. CI is not available yet.] This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Download `Gene_full_name` and `Protein_RefSeq_ID` from
https://mygene.info/ using the mygene package.
Remove rows that have both Gene_full_name and Protein_RefSeq_ID values
missing.

Write NA for missing values rather than "NA" or "", in order to be
consistent with the data release.
Print messages after done running shell scripts.
mygene API may return results in different orders, so sorting the values
before output is necessary to reproduce previous results.
Add long-format-table-utils module.
Describe how to update
annotator/annotation-data/oncokb-cancer-gene-list.tsv.
Update OncoKB annotation table source to
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
For an annotation table, NA or duplicate should not exist in the key
column that is used to join_by for adding annotations to input table.
Gene symbol is not added in this function, and gene symbol is required.

It is also not necessary to check the gene_symbol column.
The annotate_long_format_table function now only supports tibble,
because handling data.frame or other types of table is complex, such as
preserving rownames, orders, and column types.

Support for other types of table could be added if required at a later
point.
add_Gene_type is more informative.
In annotate_long_format_table(), add a add_OncoKB_columns parameter.

add_OncoKB_columns: TRUE or FALSE on whether to add OncoKB_cancer_gene
and OncoKB_oncogene_TSG columns. Default value is FALSE.
The interface of annotate_long_format_table is changed, so that all
annotation columns are added by default, and the columns_to_add
parameter can be passed to add only certain columns.

The current interface annotate_long_format_table is designed to favor
flexibility and readability over efficiency. The funciton can be further
refactored to be more efficient, e.g. not check certain tables, process
certain tables, or add certain columns if they are not required.
The order of added columns may not be the same as the values of
columns_to_add, in annotate_long_format_table.
In annotate_long_format_table, reorder the columns of the returned table
to have the same column order as the input table and added columns in
the order of columns_to_add parameter.
Renamed to ensg_rmtl_df.
Only left_join an annotation table if it is required.
As found by @NHJohnson at
<d3b-center#56 (comment)>,
some tests failed out side of the Docker image.

See comments in annotator/tests/test_annotate_long_format_table.R for
more details about the package version compatibility issues.
Add rprojroot::has_file(".git") as an alternative root criterion to
handle linked git working trees created by `git worktree add` in
rprojroot::find_root.
Previously, `-c ''` calls will fail because `""` is passed to API calls
as required columns to add, which is unavailable.

Now, `-c ''` calls will pass `character(0)` to API calls, so the output
table will not have any additional annotation column.
@logstar
Copy link
Author

logstar commented Jul 22, 2021

@jharenza @NHJohnson I have made a couple of changes to this PR.

  • Handled Rscript --vanilla ../annotator-cli.R -c '' -i input/path.tsv -o output/path.tsv properly. See commit ebcc6ac for more details.
  • Enabled project root finding in git linked working trees. See commit eca90e5 for more details.
  • Merged changes from the part 2 PR [Long-format table annotation Part 2] annotator R API #56.
  • Added tests for annotator-cli.R. See commit 52ff63e for more details. All tests passed in the Docker image/container and a recently created R environment.

Add a note on the naming conventions of test files in the tsetthat
package 2.1.1.
Remove input files after running CLI tests, if the input files are
created by tests as intermediate files, in order to ensure a clean start
for the next test to run.
Should fail on annotator CLI calls with unavailable input files or
output dirs.
Note that nested functions cannot be imported, and unnamed functions
cannot be imported.
Should fail on importing functions that are nestedly defined in other
expressions.
Should fail, even if the function is defined in different ways.
Comments and line breaks should not change the behaviors of
import_function.
Corrected test helpfer function context.

Added test/ prefix to test annotator/tests/test_annotator_cli.R.
Put the imported function in the same environment as the import_function
being called, so the imported functions can call other imported
functions. Also add a test for such use cases.
In import_function, envir = parent.frame() makes the
environment(imported_function) to be the same as the import function
being called.

Even though the eval documentation says the default envir parameter is
parent.frame(), leaving envir = parent.frame() in the eval call will
surprisingly make the environment(imported_function) to be the same as
the environment of the import_function call. The tests also fail without
specifying envir = parent.frame(). Maybe this is caused by the place
where parent.frame() is evaluated? If specified, parent.frame() is
evaluated in the calling function; if not specified, parent.frame() is
evaluated in the eval call environment?

Examples executed in terminal R, in order to avoid RStudio customizations.
> print(environment())
<environment: R_GlobalEnv>
>
> foo <- function() {
+   print(parent.frame())
+   print(environment())
+   return(eval(quote(function() { return(2) })))
+ }
>
> bar <- foo()
<environment: R_GlobalEnv>
<environment: 0x5645ead67670>
> print(environment(bar))
<environment: 0x5645ead67670>
>
> baz <- function() {
+   print(parent.frame())
+   print(environment())
+   return(eval(quote(function() { return(2) }),
+               envir = parent.frame()))
+ }
>
> qux <- baz()
<environment: R_GlobalEnv>
<environment: 0x5645ead5fec0>
> print(environment(qux))
<environment: R_GlobalEnv>
Previously, if replace_na_with_empty_string=TRUE in
annotate_long_format_table, replace_na with empty string is applied to
all columns, including columns that have no NA. If a non-character
column that has no NA, its data type will be comverted to character,
which may break backward compatibility of the JSON/JSONL tables that
were generated without using the annotator API.

Now, if replace_na_with_empty_string=TRUE in annotate_long_format_table,
only replace NA with empty string in columns that have NA in them.  This
will not change the value types of the columns that have no NA, so the
output JSON/JSONL table will be backward compatible with the ones that
were generated without using the annotator API.

CLI help message is also changed accordingly.
Test that replace_na_with_empty_string = TRUE replaces NA with "". Test
that non-character columns without NA are not converted to character
columns.

Test that replace_na_with_empty_string = FALSE does not replace NA with
"".
Test that --replace-na-with-empty-string replaces NA with "".

Test that no --replace-na-with-empty-string does not replace NA with "".

The --replace-na-with-empty-string does not need to be tested for column
type conversions. The CLI write_tsv output table is the same even if a
non-character column is converted to a character column before
write_tsv, because "Values are only quoted if they contain a comma,
quote or newline" (-- help("write_tsv", "readr") 1.3.1).
@logstar
Copy link
Author

logstar commented Jul 25, 2021

@jharenza @NHJohnson I have made a couple of changes to this PR.

  • Only replace NA with empty string for columns that have NA in them, if -r/--replace-na-with-empty-string is set in CLI or replace_na_with_empty_string = TRUE in API. See commit 0545a41 for more details.
  • Added tests for the API replace_na_with_empty_string parameter in commit f87a4c3.
  • Added tests for the CLI -r/--replace-na-with-empty-string option in commit dd72ce6.

All tests passed in the Docker image/container and a recently created R environment.

Copy link

@NHJohnson NHJohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these improvements! All tests pass in Docker environment.

@logstar
Copy link
Author

logstar commented Jul 26, 2021

@NHJohnson Thank you for reviewing and test-running this PR!

@logstar logstar merged commit 82f93d8 into d3b-center:dev Jul 26, 2021
@logstar logstar deleted the lft-utils-ann-r-cli branch July 26, 2021 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants