Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Long-format table annotation Part 1] download gene names and protein RefSeq IDs #55

Merged
merged 23 commits into from
Jul 21, 2021

Conversation

logstar
Copy link

@logstar logstar commented Jul 16, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Create application programming interface (API) and command line interface (CLI) for handling long-format tables that are generated by analysis modules. API provides analysis module developers with functions that can be imported into their own scripts via R source('path/to/the/function/file.R') or Python import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py. CLI provides analysis module developers with scripts that can be executed in their own run-module shell script with either Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv or python3 path/to/the/script.py arg long.tsv long_edited.tsv.

This module is suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to alleviate the burdens of analysis module developers for adding annotations and keeping track of what annotations need to be added. This module could also potentially handle large file storage issues at a later point, since the file size limit of GitHub is 100MB.

Sub-module name Implemented function Available interface(s)
annotator Add gene and cancer_group annotations R API WIP, R CLI WIP

This pull request is the first part of the long-format-table-utils module, which structures the module and provides scripts for updating the data that need to be downloaded.

What was your approach?

long-format-table-utils is used as the name of the module mainly for two reasons:

  • utils indicate that this module provides functions and scripts that can be used to handle long-format tables. This could help developer to find the module if needed.
  • This module is not named as something like long-format-table-annotation-utils, in order to organize other utils that may be developed at a later point, such as uploading large files taht exceed GitHub 100MB size limit to certain external storage.

The submodule annotator will provide R API and R CLI (both are currently WIPs) for adding gene and cancer_group annotations as described in README.md.

The following scripts are developed to download Gene_full_name and Protein_RefSeq_ID from https://mygene.info/ for the annotator submodule.

  • annotator/download-annotation-data.R queries Gene_full_name and Protein_RefSeq_ID from https://mygene.info/ using the R mygene package, cleans up query results, and outputs annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv.
  • annotator/run-download-annotation-data.sh runs annotator/download-annotation-data.R, so the user does not have to deal with R working directory issues.
  • update-long-format-table-utils.sh runs annotator/run-download-annotation-data.sh and potentially other scripts that will be developed for updating the module.

To download data, run bash update-long-format-table-utils.sh, which is also noted in the README.md.

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#112

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the module organization is as expected?

Whether the code in annotator/download-annotation-data.R works as expected?

Is there anything that you want to discuss further?

Do I need to rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh, in order to be consistent with other modules. I did not put run- prefix, because users do not need to run the script everytime they use it in there module, and this script is designed to be used by the maintainer of long-format-table-utils.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes. The R API and R CLI parts are still WIPs, so please disregard the relevant content in the README.md, as they may subject to major changers.

Results

What types of results are included (e.g., table, figure)?

Table.

What is your summary of the results?

$ wc -l annotator/annotation-data/*tsv
  41193 annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
   1065 annotator/annotation-data/oncokb-cancer-gene-list.tsv

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • [NA. CI is not available yet.] This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Download `Gene_full_name` and `Protein_RefSeq_ID` from
https://mygene.info/ using the mygene package.
Remove rows that have both Gene_full_name and Protein_RefSeq_ID values
missing.

Write NA for missing values rather than "NA" or "", in order to be
consistent with the data release.
Print messages after done running shell scripts.
mygene API may return results in different orders, so sorting the values
before output is necessary to reproduce previous results.
Add long-format-table-utils module.
@logstar logstar requested a review from jharenza July 16, 2021 15:25
Describe how to update
annotator/annotation-data/oncokb-cancer-gene-list.tsv.
@logstar
Copy link
Author

logstar commented Jul 16, 2021

Update to this PR:

Added a note in README.md on how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv in the "Update downloaded data that are used in this module" subsection.

Note: To update annotator/annotation-data/oncokb-cancer-gene-list.tsv (last updated on 06/16/2021), re-download the updated table from https://www.oncokb.org/cancerGenes. The website does not provide any URL for downloading the table, so the maintainer of this module has to manually update the table.

Update OncoKB annotation table source to
analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv
Copy link
Member

@jharenza jharenza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @logstar ! this looks great and runs quickly! I had a few minor updates and one question about the ENSG mapping before approving.

analyses/long-format-table-utils/README.md Outdated Show resolved Hide resolved
Comment on lines 54 to 63
# # test cases
# collapse_rp_lists(list(NULL, NULL, NULL))
# collapse_rp_lists(list())
# collapse_rp_lists(list(c("NP_1", "NP_2"), c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NX_1", "NA"), c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NP_3", "NP_2")))
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3", NA),
# c(NA, NA, character(0)), c(NA, character(0)),
# character(0)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this and the next commented "test cases" be removed at a later time?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to leave them there, as these lines could be helpful to future updates if mygene.info API changes, and we currently do not have standardized support for unit testing.

We could definitely remove them here, and store them somewhere else. We could also discuss adding a standardized support for unit testing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that's fine. looping in @yuankunzhu for thoughts on adding unit testing support.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented a lightweight unit testing framework for the annotator submodule using the testthat package. The testthat package is available in the Docker image.

To run all tests, run bash analyses/long-format-table-utils/annotator/run-tests.sh from any working directory. Following is an example run.

~/OpenPedCan-analysis$ bash analyses/long-format-table-utils/annotator/run-tests.sh
✔ |  OK F W S | Context
✔ |   8       | tests/test_collapse_name_vec.R
✔ |   7       | tests/test_collapse_rp_lists.R
✔ |   5       | tests/test_collapse_rp_lists.R

══ Results ══════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 0.2 s

OK:       20
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

This is implemented in the 48a131d commit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nifty tests, they all passed for me.

logstar and others added 3 commits July 20, 2021 15:01
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
@jharenza jharenza requested a review from NHJohnson July 20, 2021 20:30
…ble-utils.sh

README.md is also updated accordingly.

This is suggested by @jharenza at
d3b-center#55 (comment)
, in order to follow the shell script name convention of analysis
modules.
Add annotation data versions and data of the last update in the "Update
downloaded data that are used in this module" section, as suggested by
@jharenza at
d3b-center#55 (comment)

Combine gene and disease (/cancer_group) annotations into one table.

Add additional notes on annotation data versions to the "Implementation
of long-format table annotator" section.
Change the date of the last update of
annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.

The 07/16/2021 annotator/annotation-data/oncokb-cancer-gene-list.tsv is
identical to the previous 06/16/2021 version, even though the website at
https://www.oncokb.org/cancerGenes has changed last update from
06/16/2021 to 07/16/2021.
logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021
Merge changes in the data downloading PR
d3b-center#55 .

Rename update-long-format-table-utils.sh to
run-update-long-format-table-utils.sh .

Specify annotation data versions in README.md.

Change the date of the last update of
annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.
logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021
Merge changes in the data downloading PR
d3b-center#55 .

Rename update-long-format-table-utils.sh to
run-update-long-format-table-utils.sh .

Specify annotation data versions in README.md.

Change the date of the last update of
annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.
As suggested by @jharenza at
<d3b-center#55 (comment)>,
test cases should be removed from the source code file.
Run `bash run-tests.sh` to run all tests.

In order to import a funciton for testing from an R file without running
the whole file, a helper function import_function is defined at
tests/helper_import_function.R, and the import_function is also tested
in the tests/test_helper_import_function.R file.
Copy link

@NHJohnson NHJohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was able to run run-update-long-format-table-utils.sh and generate output ensg-gene-full-name-refseq-protein.tsv. Thanks for helpful comments! I made a note of a few things you might consider but they do not impact code and need not prevent merge.

Comment on lines 54 to 63
# # test cases
# collapse_rp_lists(list(NULL, NULL, NULL))
# collapse_rp_lists(list())
# collapse_rp_lists(list(c("NP_1", "NP_2"), c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NX_1", "NA"), c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3")))
# collapse_rp_lists(list(c("NP_3", "NP_2")))
# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3", NA),
# c(NA, NA, character(0)), c(NA, character(0)),
# character(0)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.

@logstar
Copy link
Author

logstar commented Jul 21, 2021

Thank you for the detailed reviews @jharenza @NHJohnson !

Added WIP label to this PR. I will make a few updates according to @NHJohnson's review. I will also add unit testing descriptions to the README.md.

Add "Unit testing for long-format table annotator" section to descript
how to use the unit testing framework.
logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021
logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021
@logstar logstar merged commit 3a1dd99 into d3b-center:dev Jul 21, 2021
@logstar logstar deleted the lft-utils-ann-data-download branch July 26, 2021 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants