-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Long-format table annotation Part 3] annotator R CLI #59
Conversation
Download `Gene_full_name` and `Protein_RefSeq_ID` from https://mygene.info/ using the mygene package.
Remove rows that have both Gene_full_name and Protein_RefSeq_ID values missing. Write NA for missing values rather than "NA" or "", in order to be consistent with the data release.
Print messages after done running shell scripts.
mygene API may return results in different orders, so sorting the values before output is necessary to reproduce previous results.
Add long-format-table-utils module.
Describe how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv.
Update OncoKB annotation table source to analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
Update README.md.
For an annotation table, NA or duplicate should not exist in the key column that is used to join_by for adding annotations to input table.
Gene symbol is not added in this function, and gene symbol is required. It is also not necessary to check the gene_symbol column.
The annotate_long_format_table function now only supports tibble, because handling data.frame or other types of table is complex, such as preserving rownames, orders, and column types. Support for other types of table could be added if required at a later point.
add_Gene_type is more informative.
In annotate_long_format_table(), add a add_OncoKB_columns parameter. add_OncoKB_columns: TRUE or FALSE on whether to add OncoKB_cancer_gene and OncoKB_oncogene_TSG columns. Default value is FALSE.
The interface of annotate_long_format_table is changed, so that all annotation columns are added by default, and the columns_to_add parameter can be passed to add only certain columns. The current interface annotate_long_format_table is designed to favor flexibility and readability over efficiency. The funciton can be further refactored to be more efficient, e.g. not check certain tables, process certain tables, or add certain columns if they are not required.
The order of added columns may not be the same as the values of columns_to_add, in annotate_long_format_table.
In annotate_long_format_table, reorder the columns of the returned table to have the same column order as the input table and added columns in the order of columns_to_add parameter.
Renamed to ensg_rmtl_df.
Only left_join an annotation table if it is required.
As found by @NHJohnson at <d3b-center#56 (comment)>, some tests failed out side of the Docker image. See comments in annotator/tests/test_annotate_long_format_table.R for more details about the package version compatibility issues.
Add rprojroot::has_file(".git") as an alternative root criterion to handle linked git working trees created by `git worktree add` in rprojroot::find_root.
Previously, `-c ''` calls will fail because `""` is passed to API calls as required columns to add, which is unavailable. Now, `-c ''` calls will pass `character(0)` to API calls, so the output table will not have any additional annotation column.
@jharenza @NHJohnson I have made a couple of changes to this PR.
|
Add a note on the naming conventions of test files in the tsetthat package 2.1.1.
Remove input files after running CLI tests, if the input files are created by tests as intermediate files, in order to ensure a clean start for the next test to run.
Should fail on annotator CLI calls with unavailable input files or output dirs.
Note that nested functions cannot be imported, and unnamed functions cannot be imported.
Should fail on importing functions that are nestedly defined in other expressions.
Should fail, even if the function is defined in different ways.
Comments and line breaks should not change the behaviors of import_function.
Corrected test helpfer function context. Added test/ prefix to test annotator/tests/test_annotator_cli.R.
Put the imported function in the same environment as the import_function being called, so the imported functions can call other imported functions. Also add a test for such use cases.
In import_function, envir = parent.frame() makes the environment(imported_function) to be the same as the import function being called. Even though the eval documentation says the default envir parameter is parent.frame(), leaving envir = parent.frame() in the eval call will surprisingly make the environment(imported_function) to be the same as the environment of the import_function call. The tests also fail without specifying envir = parent.frame(). Maybe this is caused by the place where parent.frame() is evaluated? If specified, parent.frame() is evaluated in the calling function; if not specified, parent.frame() is evaluated in the eval call environment? Examples executed in terminal R, in order to avoid RStudio customizations. > print(environment()) <environment: R_GlobalEnv> > > foo <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }))) + } > > bar <- foo() <environment: R_GlobalEnv> <environment: 0x5645ead67670> > print(environment(bar)) <environment: 0x5645ead67670> > > baz <- function() { + print(parent.frame()) + print(environment()) + return(eval(quote(function() { return(2) }), + envir = parent.frame())) + } > > qux <- baz() <environment: R_GlobalEnv> <environment: 0x5645ead5fec0> > print(environment(qux)) <environment: R_GlobalEnv>
Previously, if replace_na_with_empty_string=TRUE in annotate_long_format_table, replace_na with empty string is applied to all columns, including columns that have no NA. If a non-character column that has no NA, its data type will be comverted to character, which may break backward compatibility of the JSON/JSONL tables that were generated without using the annotator API. Now, if replace_na_with_empty_string=TRUE in annotate_long_format_table, only replace NA with empty string in columns that have NA in them. This will not change the value types of the columns that have no NA, so the output JSON/JSONL table will be backward compatible with the ones that were generated without using the annotator API. CLI help message is also changed accordingly.
Test that replace_na_with_empty_string = TRUE replaces NA with "". Test that non-character columns without NA are not converted to character columns. Test that replace_na_with_empty_string = FALSE does not replace NA with "".
Test that --replace-na-with-empty-string replaces NA with "". Test that no --replace-na-with-empty-string does not replace NA with "". The --replace-na-with-empty-string does not need to be tested for column type conversions. The CLI write_tsv output table is the same even if a non-character column is converted to a character column before write_tsv, because "Values are only quoted if they contain a comma, quote or newline" (-- help("write_tsv", "readr") 1.3.1).
@jharenza @NHJohnson I have made a couple of changes to this PR.
All tests passed in the Docker image/container and a recently created R environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these improvements! All tests pass in Docker environment.
@NHJohnson Thank you for reviewing and test-running this PR! |
Purpose/implementation Section
What scientific question is your analysis addressing?
Create application programming interface (API) and command line interface (CLI) for handling long-format tables that are generated by analysis modules. API provides analysis module developers with functions that can be imported into their own scripts via R
source('path/to/the/function/file.R')
or Pythonimport os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py
. CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with eitherRscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv
orpython3 path/to/the/script.py arg long.tsv long_edited.tsv
.This module is suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to alleviate the burdens of analysis module developers for adding annotations and keeping track of what annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.
annotator
cancer_group
annotationsThis pull request is the third part of the
long-format-table-utils
module, which implements theannotator
R CLI.What was your approach?
The
long-format-table-utils/annotator/annotator-cli.R
file provides an R CLI for using the API to annotate long-format tables.Check input long-format tables have the following columns:
Gene_symbol
Gene_Ensembl_ID
.#
versions, e.g. ENSG00000039139, ENSG00000111261, and ENSG00000169710Disease
cancer_group
in thehistologies.tsv
, e.g. Adamantinomatous Craniopharyngioma, Atypical Teratoid Rhabdoid Tumor, and Low-grade glioma/astrocytomaAdd one or more of the following gene annotations, by specifying the
columns_to_add
parameter in theannotate_long_format_table
function, or specifying the-c
option when runningannotator-cli.R
.RMTL
Relevant Molecular Target (RMTL version 1)
orNon-Relevant Molecular Target (RMTL version 1)
data/ensg-hugo-rmtl-v1-mapping.tsv
Gene_type
CosmicCensus
,Kinase
,Oncogene
,TranscriptionFactor
, andTumorSuppressorGene
. Example values:CosmicCensus
,CosmicCensus,Kinase
, andCosmicCensus,Kinase,TumorSuppressorGene
.analyses/fusion_filtering/references/genelistreference.txt
OncoKB_cancer_gene
Y
orN
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
OncoKB_oncogene_TSG
Oncogene
, orTumorSuppressorGene
, orOncogene,TumorSuppressorGene
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
Gene_full_name
cytochrome c oxidase subunit III
andATP synthase F0 subunit 6
analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
Protein_RefSeq_ID
NP_004065.1
,NP_000053.2,NP_001027466.1
, andNP_000985.1,NP_001007074.1,NP_001007075.1
analyses/long-format-table-utils/annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
The
RMTL
information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.Note: only add
Gene_type
to gene-level tables, which can be implemented by leaving"Gene_type"
out of thecolumns_to_add
parameter of theannotate_long_format_table
function inannotator-api.R
. TheGene_type
information is obtained from../fusion_filtering/references/genelistreference.txt
, and its sources are described at https://github.com/d3b-center/annoFuse#prerequisites-for-cohort-level-analysis.The
OncoKB_cancer_gene
andOncoKB_oncogene_TSG
information is listed inannotator/annotation-data/oncokb-cancer-gene-list.tsv
, which is downloaded from https://www.oncokb.org/cancerGenes. The last update of the table is on 06/16/2021. To update the table, re-download the updated table from https://www.oncokb.org/cancerGenes.The
Gene_full_name
andProtein_RefSeq_ID
information is downloaded from https://mygene.info/ using the mygene package.Add the following disease(/
cancer_group
) annotations:Add one or more of the following disease(/
cancer_group
) annotations, by specifying thecolumns_to_add
parameter in theannotate_long_format_table
function, or specifying the-c
option when runningannotator-cli.R
.EFO
EFO_1000069
,EFO_1002008
, andEFO_1000177
data/efo-mondo-map.tsv
MONDO
MONDO_0002787
,MONDO_0020560
, andMONDO_0009837
data/efo-mondo-map.tsv
The
EFO
andMONDO
information is obtained from PediatricOpenTargets/OpenPedCan-analysis data release.Use the long-format table annotator CLI in an analysis module with the following steps:
"Gene_symbol", "Gene_Ensembl_ID", "Disease"
are not all present in the column names of the table to be annotated, add new columns or rename existing ones to have all these required columns.OpenPedCan-analysis
or a subdirectory ofOpenPedCan-analysis
. This allows theannotator-cli.R
to locate theannotator-api.R
.annotator-cli.R
script withRscript --vanilla path/to/annotator-cli.R
and proper options. TheRscript
command can be invoked by Rsystem("Rscript --vanilla path/to/annotator-cli.R -h")
(if the annotator R API is not preferred) or Python (>= 3.5)import subprocess; subprocess.run("Rscript --vanilla analyses/long-format-table-utils/annotator/annotator-cli.R -h".split())
. For more information about Rsystem
, https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html. For more information about Python (>= 3.5)subprocess.run
, https://docs.python.org/3/library/subprocess.html#subprocess.run.Following is an example usage in the
rna-seq-expression-summary-stats
module01-tpm-summary-stats.R
.What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#112
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Whether the code in
annotator/annotator-cli.R
works as expected?Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
No result.
What is your summary of the results?
No result.
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.