[Long-format table annotation Part 1] download gene names and protein RefSeq IDs #55

logstar · 2021-07-16T15:25:32Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Create an application programming interface (API) and a command line interface (CLI) for handling long-format tables that are generated by analysis modules. The API provides analysis module developers with functions that can be imported into their own scripts, via `source('path/to/the/function/file.R')` in R or `import os, sys; sys.path.append(os.path.abspath("path/to/the/function/dir")); import function_filename_but_no_dot_py` in Python. The CLI provides analysis module developers with scripts that can be executed in their own run-module shell script, with either `Rscript --vanilla path/to/the/script.R arg long.tsv long_edited.tsv` or `python3 path/to/the/script.py arg long.tsv long_edited.tsv`.
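As a rough illustration of the API/CLI split described above, the following Python sketch exposes one annotation function that can be imported (API) and a thin `main` wrapper that can be called from a shell script (CLI). It is not the module's implementation: the names `annotate_rows` and `main`, the three-argument call, and the convention that the first column of the annotation table is the join key are all assumptions made for this example.

```python
import csv
import sys


def annotate_rows(rows, annotations, key, new_columns):
    """API entry point (hypothetical): add annotation columns to long-format rows.

    rows is a list of dicts, annotations maps a key value to a dict of extra
    columns. Rows without a matching annotation get "NA" in the new columns.
    """
    annotated = []
    for row in rows:
        extra = annotations.get(row.get(key), {})
        merged = dict(row)  # copy, so the caller's rows are not modified
        for column in new_columns:
            merged[column] = extra.get(column, "NA")
        annotated.append(merged)
    return annotated


def main(argv):
    """CLI entry point (hypothetical): script.py annotations.tsv long.tsv long_edited.tsv"""
    annotation_path, in_path, out_path = argv[1:4]
    with open(annotation_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # Assumption: the first annotation column is the join key.
        key, *new_columns = reader.fieldnames
        annotations = {record[key]: record for record in reader}
    with open(in_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = list(reader.fieldnames) + new_columns
        rows = list(reader)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(annotate_rows(rows, annotations, key, new_columns))


# Only run as a CLI when exactly three file arguments are given.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

Keeping the logic in `annotate_rows` and the file handling in `main` means both interfaces share one code path, so a fix to the annotation logic reaches API and CLI users alike.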

This module was suggested by @jharenza and @kgaonkar6 in Slack at https://opentargetspediatrics.slack.com/archives/C021Z53SK98/p1626290031138100?thread_ts=1626287625.133600&cid=C021Z53SK98, in order to reduce the burden on analysis module developers of adding annotations and keeping track of which annotations need to be added. This module could also handle large file storage issues at a later point, since GitHub limits file size to 100MB.

| Sub-module name | Implemented function | Available interface(s) |
|---|---|---|
| `annotator` | Add gene and `cancer_group` annotations | R API WIP, R CLI WIP |

This pull request is the first part of the long-format-table-utils module, which structures the module and provides scripts for updating the data that need to be downloaded.

What was your approach?

long-format-table-utils is used as the name of the module mainly for two reasons:

utils indicates that this module provides functions and scripts that can be used to handle long-format tables. This could help developers find the module when needed.
This module is not named something like long-format-table-annotation-utils, so that it can also hold other utils that may be developed at a later point, such as uploading large files that exceed the GitHub 100MB size limit to external storage.

The submodule annotator will provide an R API and an R CLI (both are currently WIPs) for adding gene and cancer_group annotations, as described in README.md.

The following scripts are developed to download Gene_full_name and Protein_RefSeq_ID from https://mygene.info/ for the annotator submodule.

annotator/download-annotation-data.R queries Gene_full_name and Protein_RefSeq_ID from https://mygene.info/ using the R mygene package, cleans up query results, and outputs annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv.
annotator/run-download-annotation-data.sh runs annotator/download-annotation-data.R, so the user does not have to deal with R working directory issues.
update-long-format-table-utils.sh runs annotator/run-download-annotation-data.sh and potentially other scripts that will be developed for updating the module.

To download data, run bash update-long-format-table-utils.sh, which is also noted in the README.md.
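The cleanup rules stated in this PR's commit messages (drop rows where both Gene_full_name and Protein_RefSeq_ID are missing, write a bare NA for missing values, and sort the returned values so output is reproducible) can be sketched in Python as follows. This is an illustration under assumptions, not a port of download-annotation-data.R: the function names, the `Gene_Ensembl_ID` header, and the comma separator for multiple RefSeq IDs are hypothetical.

```python
import csv
import io

# Values treated as missing in this sketch.
MISSING = {None, "", "NA"}


def clean_records(records):
    """Drop rows where both fields are missing, then sort for stable output.

    records is an iterable of (ensg_id, full_name, refseq_id_list) tuples.
    """
    cleaned = []
    for ensg, full_name, refseq_ids in records:
        name = None if full_name in MISSING else full_name
        # De-duplicate and sort, because query results may arrive in any order.
        ids = sorted({i for i in (refseq_ids or []) if i not in MISSING})
        if name is None and not ids:
            continue  # nothing to annotate for this gene
        cleaned.append((ensg, name, ids))
    return sorted(cleaned, key=lambda record: record[0])


def to_tsv(cleaned):
    """Serialize cleaned records, writing missing values as unquoted NA."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Header names other than the two annotation columns are assumptions.
    writer.writerow(["Gene_Ensembl_ID", "Gene_full_name", "Protein_RefSeq_ID"])
    for ensg, name, ids in cleaned:
        writer.writerow([ensg, name or "NA", ",".join(ids) or "NA"])
    return buffer.getvalue()
```

Sorting both the IDs within a row and the rows themselves is what makes two runs against the same upstream data produce byte-identical files, which keeps version-control diffs limited to real annotation changes.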

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#112

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the module organization is as expected?

Whether the code in annotator/download-annotation-data.R works as expected?

Is there anything that you want to discuss further?

Do I need to rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh, in order to be consistent with other modules? I did not add the run- prefix, because users do not need to run the script every time they use it in their module, and this script is designed to be used by the maintainer of long-format-table-utils.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes. The R API and R CLI parts are still WIPs, so please disregard the relevant content in the README.md, as it may be subject to major changes.

Results

What types of results are included (e.g., table, figure)?

Table.

What is your summary of the results?

$ wc -l annotator/annotation-data/*tsv
  41193 annotator/annotation-data/ensg-gene-full-name-refseq-protein.tsv
   1065 annotator/annotation-data/oncokb-cancer-gene-list.tsv

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
[NA. CI is not available yet.] This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.


logstar · 2021-07-16T15:44:56Z

Update to this PR:

Added a note in README.md on how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv in the "Update downloaded data that are used in this module" subsection.

Note: To update annotator/annotation-data/oncokb-cancer-gene-list.tsv (last updated on 06/16/2021), re-download the updated table from https://www.oncokb.org/cancerGenes. The website does not provide any URL for downloading the table, so the maintainer of this module has to manually update the table.


jharenza

hi @logstar ! this looks great and runs quickly! I had a few minor updates and one question about the ENSG mapping before approving.

analyses/long-format-table-utils/README.md

analyses/long-format-table-utils/annotator/download-annotation-data.R

jharenza · 2021-07-20T18:26:38Z

analyses/long-format-table-utils/annotator/download-annotation-data.R

+# # test cases
+# collapse_rp_lists(list(NULL, NULL, NULL))
+# collapse_rp_lists(list())
+# collapse_rp_lists(list(c("NP_1", "NP_2"), c("NP_1", "NP_3")))
+# collapse_rp_lists(list(c("NX_1", "NA"), c("NP_1", "NP_3")))
+# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3")))
+# collapse_rp_lists(list(c("NP_3", "NP_2")))
+# collapse_rp_lists(list(c("NP_1", "NP_2"), NULL, c("NP_1", "NP_3", NA),
+#                        c(NA, NA, character(0)), c(NA, character(0)),
+#                        character(0)))


Will this and the next commented "test cases" be removed at a later time?

I prefer to leave them there, as these lines could be helpful for future updates if the mygene.info API changes, and we currently do not have standardized support for unit testing.

We could definitely remove them here, and store them somewhere else. We could also discuss adding a standardized support for unit testing.

sure, that's fine. looping in @yuankunzhu for thoughts on adding unit testing support.

Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.

I have implemented a lightweight unit testing framework for the annotator submodule using the testthat package. The testthat package is available in the Docker image.

To run all tests, run bash analyses/long-format-table-utils/annotator/run-tests.sh from any working directory. Following is an example run.

~/OpenPedCan-analysis$ bash analyses/long-format-table-utils/annotator/run-tests.sh
✔ | OK F W S | Context
✔ |  8       | tests/test_collapse_name_vec.R
✔ |  7       | tests/test_collapse_rp_lists.R
✔ |  5       | tests/test_collapse_rp_lists.R

══ Results ══════════════════════════════════════════════════════════════════════════════════════════════════════
Duration: 0.2 s

OK:       20
Failed:   0
Warnings: 0
Skipped:  0
Done running run-tests.sh

This is implemented in the 48a131d commit.

Thanks for the nifty tests, they all passed for me.
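The request above, to pair each test case with its expected return value, can be illustrated with a small table-driven sketch in Python. The function `collapse_id_lists` is a hypothetical stand-in and not a translation of the R `collapse_rp_lists`: its behavior (skip missing entries, de-duplicate, sort, join with commas, return `None` when nothing is left) is an assumption chosen for this example. The case that contains the string "NA" is left out on purpose, because the thread does not state how the R function treats it.

```python
def collapse_id_lists(id_lists, sep=","):
    """Hypothetical stand-in: merge lists of IDs into one sorted, unique string.

    None entries and None elements are skipped. Returns None if no ID remains.
    """
    unique_ids = set()
    for ids in id_lists:
        if ids is None:
            continue
        unique_ids.update(i for i in ids if i is not None)
    return sep.join(sorted(unique_ids)) if unique_ids else None


# Each case pairs an input with the value this stand-in is expected to return.
CASES = [
    ([None, None, None], None),
    ([], None),
    ([["NP_1", "NP_2"], ["NP_1", "NP_3"]], "NP_1,NP_2,NP_3"),
    ([["NP_1", "NP_2"], None, ["NP_1", "NP_3"]], "NP_1,NP_2,NP_3"),
    ([["NP_3", "NP_2"]], "NP_2,NP_3"),
    ([["NP_1", "NP_2"], None, ["NP_1", "NP_3", None], [None, None], []],
     "NP_1,NP_2,NP_3"),
]


def run_cases():
    """Check every case and return how many were verified."""
    for id_lists, expected in CASES:
        actual = collapse_id_lists(id_lists)
        assert actual == expected, (id_lists, expected, actual)
    return len(CASES)
```

Storing inputs and expected values side by side documents the intended behavior in one place and makes adding an edge case a one-line change.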


Add annotation data versions and dates of the last update in the "Update downloaded data that are used in this module" section, as suggested by @jharenza at d3b-center#55 (comment). Combine gene and disease (/cancer_group) annotations into one table. Add additional notes on annotation data versions to the "Implementation of long-format table annotator" section.

Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021. The 07/16/2021 annotator/annotation-data/oncokb-cancer-gene-list.tsv is identical to the previous 06/16/2021 version, even though the website at https://www.oncokb.org/cancerGenes has changed its last-update date from 06/16/2021 to 07/16/2021.

Merge changes in the data downloading PR d3b-center#55 . Rename update-long-format-table-utils.sh to run-update-long-format-table-utils.sh . Specify annotation data versions in README.md. Change the date of the last update of annotator/annotation-data/oncokb-cancer-gene-list.tsv to 07/16/2021.


Run `bash run-tests.sh` to run all tests. In order to import a function for testing from an R file without running the whole file, a helper function import_function is defined at tests/helper_import_function.R, and import_function itself is tested in the tests/test_helper_import_function.R file.
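The idea of importing a single function from a script without running the rest of the file can be sketched in Python with the standard `ast` module. This is only an analog of the technique and says nothing about how the R helper in tests/helper_import_function.R is written. The Python name `import_function` mirrors the R helper's name purely for readability.

```python
import ast


def import_function(path, name):
    """Return the top-level function `name` from the file at `path`.

    Only that function's definition is compiled and executed. Other top-level
    statements in the file (downloads, prints, file writes) are never run.
    """
    with open(path) as handle:
        tree = ast.parse(handle.read(), filename=path)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            # Build a module that contains nothing but the wanted definition.
            module = ast.Module(body=[node], type_ignores=[])
            namespace = {}
            exec(compile(module, filename=path, mode="exec"), namespace)
            return namespace[name]
    raise ValueError(f"function {name!r} not found in {path}")
```

One limitation of this sketch is that the extracted function cannot see imports or globals defined elsewhere in the file, so it suits small, self-contained helpers such as the collapse functions under test.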

NHJohnson

Was able to run run-update-long-format-table-utils.sh and generate output ensg-gene-full-name-refseq-protein.tsv. Thanks for helpful comments! I made a note of a few things you might consider but they do not impact code and need not prevent merge.

analyses/long-format-table-utils/annotator/download-annotation-data.R

NHJohnson · 2021-07-21T17:12:34Z

Would it be difficult to include, along with each test case, the expected return value? This would help understand the intended behavior and would simplify the creation of unit tests if that happens at some point.

analyses/long-format-table-utils/annotator/download-annotation-data.R

logstar · 2021-07-21T19:32:02Z

Thank you for the detailed reviews @jharenza @NHJohnson !

Added WIP label to this PR. I will make a few updates according to @NHJohnson's review. I will also add unit testing descriptions to the README.md.


logstar added 8 commits July 15, 2021 16:52

Add long-format-table-utils module

0b316f4

Add oncokb-cancer-gene-list.tsv

75769b5

Update README.md

b750ed5

Download annotation data

18ea05a

Download `Gene_full_name` and `Protein_RefSeq_ID` from https://mygene.info/ using the mygene package.

Clean up ensg-gene-full-name-refseq-protein.tsv

f40e997

Remove rows that have both Gene_full_name and Protein_RefSeq_ID values missing. Write NA for missing values rather than "NA" or "", in order to be consistent with the data release.

Add echo commands in shell scripts

f941405

Print messages after done running shell scripts.

Sort mygene returned character values

96e8885

mygene API may return results in different orders, so sorting the values before output is necessary to reproduce previous results.

Update README.md

80b794e

Add long-format-table-utils module.

logstar requested a review from jharenza July 16, 2021 15:25

Update README.md

9784aae

Describe how to update annotator/annotation-data/oncokb-cancer-gene-list.tsv.

Update README.md

9817edf

Update OncoKB annotation table source to analyses/long-format-table-utils/annotator/annotation-data/oncokb-cancer-gene-list.tsv

jharenza reviewed Jul 20, 2021


logstar and others added 3 commits July 20, 2021 15:01

Update error message in download-annotation-data.R

909277d

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

Update error message in download-annotation-data.R

80b6af7

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

Update error message in download-annotation-data.R

a1399bc

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>

jharenza requested a review from NHJohnson July 20, 2021 20:30

logstar added 3 commits July 20, 2021 18:03

Rename update-long-format-table-utils.sh to run-update-long-format-ta…

090f080

…ble-utils.sh README.md is also updated accordingly. This is suggested by @jharenza at d3b-center#55 (comment) , in order to follow the shell script name convention of analysis modules.

logstar added 3 commits July 21, 2021 15:09

Replace as.character(NA) with NA_character_

03516c9

Remove test cases in download-annotation-data.R

6f76fae

As suggested by @jharenza at <d3b-center#55 (comment)>, test cases should be removed from the source code file.

NHJohnson approved these changes Jul 21, 2021


logstar added the work in progress label Jul 21, 2021

NHJohnson approved these changes Jul 21, 2021


Fix typos in download-annotation-data.R

0773a1e

Suggested by @NHJohnson at d3b-center#55 (review)

Update README.md

a08cf5f

Add "Unit testing for long-format table annotator" section to describe how to use the unit testing framework.

logstar removed the work in progress label Jul 21, 2021

Update README.md

796a26c

Edit.

logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021

Merge branch 'lft-utils-ann-data-download' into lft-utils-ann-r-api

6d6838c

Merge changes from the data downloading PR <d3b-center#55>

logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 21, 2021

Merge branch 'lft-utils-ann-r-api' into lft-utils-ann-r-cli

5029d4b

Merge changes from the data downloading PR <d3b-center#55>

Fix a typo in helper_import_function.R

29325c7

jharenza approved these changes Jul 21, 2021


logstar merged commit 3a1dd99 into d3b-center:dev Jul 21, 2021

logstar mentioned this pull request Jul 21, 2021

Annotate SNV table with mutation frequencies #45

Merged

5 tasks

logstar deleted the lft-utils-ann-data-download branch July 26, 2021 15:37
