-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Long-format table annotation Part 1] download gene names and protein RefSeq IDs #55
Conversation
Download `Gene_full_name` and `Protein_RefSeq_ID` from https://mygene.info/ using the mygene package.
Remove rows that have both Gene_full_name and Protein_RefSeq_ID values missing. Write NA for missing values rather than "NA" or "", in order to be consistent with the data release.
Print messages after done running shell scripts.
mygene API may return results in different orders, so sorting the values before output is necessary to reproduce previous results.
Add long-format-table-utils module.
Describe how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv.
Update to this PR: Added a note in README.md on how to update
|
Update OncoKB annotation table source to analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @logstar ! this looks great and runs quickly! I had a few minor updates and one question about the ENSG mapping before approving.
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
# # test cases | ||
# collapse_rp_lists(list(NULL, NULL, NULL)) | ||
# collapse_rp_lists(list()) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NX_1", "NA"), c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NP_3", "NP_2"))) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3", NA), | ||
# c(NA, NA, character(0)), c(NA, character(0)), | ||
# character(0))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this and the next commented "test cases" be removed at a later time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to leave them there, as these lines could be helpful to future updates if mygene.info API changes, and we currently do not have standardized support for unit testing.
We could definitely remove them here, and store them somewhere else. We could also discuss adding a standardized support for unit testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that's fine. looping in @yuankunzhu for thoughts on adding unit testing support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented a lightweight unit testing framework for the annotator
submodule using the testthat
package. The testthat
package is available in the Docker image.
To run all tests, run bash analyses/long-format-table-utils/annotator/run-tests.sh
from any working directory. Following is an example run.
~/OpenPedCan-analysis$ bash analyses/long-format-table-utils/annotator/run-tests.sh
✔ | OK F W S | Context
✔ | 8 | tests/test_collapse_name_vec.R
✔ | 7 | tests/test_collapse_rp_lists.R
✔ | 5 | tests/test_collapse_rp_lists.R
══ Results ══════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 0.2 s
OK: 20
Failed: 0
Warnings: 0
Skipped: 0
Done running run-tests.sh
This is implemented in the 48a131d commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the nifty tests, they all passed for me.
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
…ble-utils.sh README.md is also updated accordingly. This is suggested by @jharenza at d3b-center#55 (comment) , in order to follow the shell script name convention of analysis modules.
Add annotation data versions and data of the last update in the "Update downloaded data that are used in this module" section, as suggested by @jharenza at d3b-center#55 (comment) Combine gene and disease (/cancer_group) annotations into one table. Add additional notes on annotation data versions to the "Implementation of long-format table annotator" section.
Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021. The 07/16/2021 annotator/annotation-data/oncokb-cancer-gene-list.tsv is identical to the previous 06/16/2021 version, even though the website at https://www.oncokb.org/cancerGenes has changed last update from 06/16/2021 to 07/16/2021.
Merge changes in the data downloading PR d3b-center#55 . Rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh . Specify annotation data versions in README.md. Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.
Merge changes in the data downloading PR d3b-center#55 . Rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh . Specify annotation data versions in README.md. Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.
As suggested by @jharenza at <d3b-center#55 (comment)>, test cases should be removed from the source code file.
Run `bash run-tests.sh` to run all tests. In order to import a funciton for testing from an R file without running the whole file, a helper function import_function is defined at tests/helper_import_function.R, and the import_function is also tested in the tests/test_helper_import_function.R file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was able to run run-update-long-format-table-utils.sh
and generate output ensg-gene-full-name-refseq-protein.tsv
. Thanks for helpful comments! I made a note of a few things you might consider but they do not impact code and need not prevent merge.
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
# # test cases | ||
# collapse_rp_lists(list(NULL, NULL, NULL)) | ||
# collapse_rp_lists(list()) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NX_1", "NA"), c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3"))) | ||
# collapse_rp_lists(list(c("NP_3", "NP_2"))) | ||
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3", NA), | ||
# c(NA, NA, character(0)), c(NA, character(0)), | ||
# character(0))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
analyses/long-format-table-utils/annotator/download-annotation-data.R
Outdated
Show resolved
Hide resolved
Thank you for the detailed reviews @jharenza @NHJohnson ! Added WIP label to this PR. I will make a few updates according to @NHJohnson's review. I will also add unit testing descriptions to the README.md. |
Add "Unit testing for long-format table annotator" section to descript how to use the unit testing framework.
Edit.
Merge changes from the data downloading PR <d3b-center#55>
Merge changes from the data downloading PR <d3b-center#55>
Purpose/implementation Section
What scientific question is your analysis addressing?
Create application programming interface (API) and command line interface (CLI) for handling long-format tables that are generated by analysis modules. API provides analysis module developers with functions that can be imported into their own scripts via R
source('path/to/the/function/file.R')
or Pythonimport os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py
. CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with eitherRscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv
orpython3 path/to/the/script.py arg long.tsv long_edited.tsv
.This module is suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to alleviate the burdens of analysis module developers for adding annotations and keeping track of what annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.
annotator
cancer_group
annotationsThis pull request is the first part of the
long-format-table-utils
module, which structures the module and provides scripts for updating the data that need to be downloaded.What was your approach?
long-format-table-utils
is used as the name of the module mainly for two reasons:utils
indicate that this module provides functions and scripts that can be used to handle long-format tables. This could help developer to find the module if needed.long-format-table-annotation-utils
, in order to organize otherutils
that may be developed at a later point, such as uploading large files taht exceed GitHub 100MB size limit to certain external storage.The submodule
annotator
will provide R API and R CLI (both are currently WIPs) for adding gene andcancer_group
annotations as described in README.md.The following scripts are developed to download
Gene_full_name
andProtein_RefSeq_ID
from https://mygene.info/ for theannotator
submodule.annotator/download-annotation-data.R
queriesGene_full_name
andProtein_RefSeq_ID
from https://mygene.info/ using the R mygene package, cleans up query results, and outputsannotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
.annotator/run-download-annotation-data.sh
runsannotator/download-annotation-data.R
, so the user does not have to deal with R working directory issues.update-long-format-table-utils.sh
runsannotator/run-download-annotation-data.sh
and potentially other scripts that will be developed for updating the module.To download data, run
bash update-long-format-table-utils.sh
, which is also noted in the README.md.What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#112
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Whether the module organization is as expected?
Whether the code in
annotator/download-annotation-data.R
works as expected?Is there anything that you want to discuss further?
Do I need to rename
update-long-format-table-utils.sh
torun-update-long-format-table-utils.sh
, in order to be consistent with other modules. I did not putrun-
prefix, because users do not need to run the script everytime they use it in there module, and this script is designed to be used by the maintainer oflong-format-table-utils
.Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes. The R API and R CLI parts are still WIPs, so please disregard the relevant content in the README.md, as they may subject to major changers.
Results
What types of results are included (e.g., table, figure)?
Table.
What is your summary of the results?
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.