Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run de seq 1 0 #28

Closed
wants to merge 66 commits into from
Closed

Run de seq 1 0 #28

wants to merge 66 commits into from

Conversation

sangeetashukla
Copy link
Collaborator

Purpose/implementation Section

Created a new directory and script for implementing DESeq analysis

What scientific question is your analysis addressing?

Perform DE between each cancer type vs. all GTEX as well as each individual GTEX tissue (subgroup).

What was your approach?

Use DESeq2 package

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#26

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

This script is contains a few lines of code to test if the data is loaded from files correctly, and may be removed later when the histology and gene-counts files are finalised.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes. The results will be saved in a txt file for review.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.
Yes.

@jharenza
Copy link
Member

@yuankunzhu @sickler-alex here is the DeSeq2 module which @sangeetashukla is having problems running locally - can you help try to run this on EC2?

@sangeetashukla
Copy link
Collaborator Author

Hi @jharenza and @afarrel

I attempted to run this script with the v5 histology and counts data on Respublica but it won't let me install the DESeq2 package. I am running this script locally but it has been running for a few hours. From Slack conversation with @jharenza , we may be able to run it quicker on an AWS EC2 instance.
Can you please assign this to someone who may be able to do that?

@jharenza jharenza reopened this Jun 22, 2021
@yuankunzhu
Copy link
Member

yuankunzhu commented Jun 23, 2021

@jharenza do you wanna us to do an code review on this PR, or just test whether the script in this PR would run successfully? take this as an stand-along script, at least we need a "requirement list" for the libraries that's been used there.

also @sangeetashukla, I haven't used Respublica for a while, not sure what's the current policy/process, but I wonder if you could submit a IS request to install needed package or install your own R environment so you have fully packages control there. the latter was actually how I was able to install DESeq2 and other libraries before. before you make install, you just need to specify:

./configure --prefix=/where/you/want/R/to/go

@jharenza
Copy link
Member

@sangeetashukla did you have a chance to try this again on isilon per @yuankunzhu 's recommenations? No review @yuankunzhu - just see if it would run.

@sangeetashukla
Copy link
Collaborator Author

Hi @jharenza and @yuankunzhu
I have opened an IS request, but I am not sure what isilon is. I will reach out to you via slack.

@afarrel
Copy link

afarrel commented Jun 23, 2021

Hi @sangeetashukla I can help you get deseq2 installed on Respublica/isilon. I just did it on my account and it seems to be okay.

@jharenza jharenza removed the request for review from afarrel July 2, 2021 00:20
@jharenza
Copy link
Member

jharenza commented Jul 2, 2021

hi @afarrel and @sangeetashukla!

now that #34 is in, will you rerun with v6 and make the following updates:

  • use the cohort in histologies.tsv instead of combining CBTN and PNOC within the code
  • use the ensg-hugo-rmtl-v1-mapping.tsv file to add ENSG IDs (we removed the .suffix, so you should get a 1:1 mapping now
  • use the efo-mondo-map.tsv file to add EFO and MONDO columns based on the cancer_group

It may also be a good idea to create an option in this module to either run on a small sample of the data such that it will run locally or run on the full data, which can be run in CAVATICA (thoughts, @kgaonkar6 @yuankunzhu @migbro @zhangb1 ?)

@jharenza jharenza requested a review from logstar July 15, 2021 16:40
@jharenza
Copy link
Member

Hi @sangeetashukla ! thanks for committing this code - have you run this in the docker image with the v6 data? I suggested to @afarrel that we create two types of runs for this - one being a small subset of samples which we can test here via GitHub and docker and one being the full dataset which you will run later on CAVATICA.

If it is tested and ready for review, we can remove the Work in progress label.

I also noticed the uberon file commit. I suggested to @afarrel for now to leave this out of the module, as that mapping is not in v6, and we have decided to do annotations of all of the files through a function which @logstar will be creating in d3b-center/ticket-tracker-OPC#112.

cc @kgaonkar6

Copy link

@logstar logstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for developing the differential expression analysis module @sangeetashukla @afarrel !

I wonder if you could provide a more detailed description of this module, for example, in a README.md or a run_DESeq_analysis.sh if applicable, so that other people can understand the module and reproduce the results. Although the run-DESeq-analysis.R has an example run at top, it would also be helpful to note somewhere in the module on how to determine all the input number combinations, and the input numbers are 1-based. I noticed in Slack and meetings that this module needs to be run on a cluster or the cavatica platform, so it is different from other modules on how to reproduce the results, but it would be helpful to note the differences and specific procedures on how the results are generated.

I wonder if you could revise run-DESeq-analysis.R to roughly follow the style guide in this link http://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html, especially to break lines with about > 100 characters into multiple lines. The revision would save a lot of time for anyone to read and revise your code.

Could you also rerun one or a few combinations in the Docker image before the next review? So the reviewer can directly run your script without any edit.

The script runs with a few messages and warnings in the Docker image, after adding stringsAsFactors = FALSE.

$ Rscript --vanilla run-DESeq-analysis.R 1 1
converting counts to integer mode
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
final dispersion estimates
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
fitting model and testing
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
-- replacing outliers and refitting for 2646 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
fitting model and testing
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]

Following are my specific comments.

analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
analyses/DESeq_analysis/run-DESeq-analysis.R Outdated Show resolved Hide resolved
design= ~ Type)


sub.deseqdataset$Type <- factor(sub.deseqdataset$Type, levels=c(GTEX_filtered[J], histology_filtered[I]))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to note somewhere in the module or result table that the log fold change in the result is GTEX / cancer_group.

sangeetashukla and others added 3 commits July 16, 2021 10:30
Co-authored-by: Yuanchao Zhang <logstar@users.noreply.github.com>
Co-authored-by: Yuanchao Zhang <logstar@users.noreply.github.com>
Co-authored-by: Yuanchao Zhang <logstar@users.noreply.github.com>
@sangeetashukla
Copy link
Collaborator Author

Thank you for the updates @sangeetashukla !

I think the package this.path is not available for the rocker/tidyverse:3.6.0 Docker image. None of the following installation methods was able to install this.path. Therefore, I suggest to determine script.dir without using this.path. One option is to add a bash script that always runs run-DESeq-analysis.R at certain directory. Here is the link to an example bash script for that purpose https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/dev/analyses/copy_number_consensus_call/run_consensus_call.sh, and the following lines in the bash script changes the working directory to the directory that contains the bash script.

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/a6a1d0eb009039e3fd2041e5e0bd8a74ea557cc4/analyses/copy_number_consensus_call/run_consensus_call.sh#L6-L11

$ install2.r this.path
Warning message:
package ‘this.path’ is not available (for R version 3.6.0)

$ /rocker-build/install_bioc.r this.path
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Error: (converted from warning) package ‘this.path’ is not available (for R version 3.6.0)
Execution halted

$ R -e "install.packages('this.path', dependencies = TRUE)"

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages('this.path', dependencies = TRUE)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning message:
package ‘this.path’ is not available (for R version 3.6.0)

I was able to run one cancer_group GTEx comparison after removing library(this.path) and define script.dir manually, and the Results/Results_all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous.tsv file was created. @afarrel's comments at #28 (comment) on the sample table also applies to the created TSV file. In addition, the NAs in the table needs to be replaced by empty strings ("", and it will be written as blank in TSV files) before outputting to the JSONL file mentioned in @jharenza's comment at #28 (review). In order to output a JSONL file for all the comparisons, I suggest this module takes the following steps: 1) output JSON files for each individual comparison using jsonlite::write_json; 2) convert JSON files to JSONL files individually using jq --compact-output '.[]' comparison11.json > comparison11.jsonl; 3) concatenate all JSONL files together with cat comparison*.jsonl > all_comparisons.jsonl. For more details abount jq, see comment #27 (comment).

$ head -n 3 Results/Results_all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous.tsv
datasourceId....paste.strsplit.histology_filtered.I...split...._....1...1...    datatypeId.....rna_expression.  cohort....paste.unique.hist.cohort.which.hist.Kids_First_Biospecimen_ID..in..      gene_symbol....rownames.Result. gene_id....ENSG_Hugo.ensg_id.match.rownames.Result...ENSG_Hugo.gene_symbol..       RMTL....ENSG_Hugo.rmtl.match.rownames.Result...ENSG_Hugo.gene_symbol..     EFO....ifelse.length.which.EFO_MONDO.cancer_group....unique.hist.cancer_group.which.hist.Kids_First_Biospecimen_ID..in..   MONDO....ifelse.length.which.EFO_MONDO.cancer_group....unique.hist.cancer_group.which.hist.Kids_First_Biospecimen_ID..in.. comparisonId....gsub..................._...paste.histology_filtered.I...   cancer_group....paste.unlist.strsplit.histology_filtered.I...   cancer_group_Count....Cancer.Hist_Hits     GTEx....GTEX_filtered.J.        GTEx_Count....GTEX_Hits cancer_group_MeanTpm....Histology_MEAN_TPMs     GTEx_MeanTpm....GTEX_MEAN_TPMs     baseMean        log2FoldChange  lfcSE   stat    pvalue  padj
all_vs_GTex     rna_expression  PBTA    5_8S_rRNA       ENSG00000275877 NA      EFO_0005543     MONDO_0016685   all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous  cohorts Low-grade glioma/astrocytoma    301     Adipose - Subcutaneous     663     0.02    0       0.018264014900362       -0.0197893517057001     2.69946937917747  -0.00733083022105993     0.994150896138315       NA
all_vs_GTex     rna_expression  PBTA    5S_rRNA ENSG00000201285 NA      EFO_0005543     MONDO_0016685   all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous  cohorts Low-grade glioma/astrocytoma    301     Adipose - Subcutaneous     663     0.13    0.38    0.798488052227322       0.871421943985364       0.153577613506163       5.67414692864981   1.39381466150402e-08    2.61929832112782e-08

$ wc -l Results/Results_all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous.tsv
36719 Results/Results_all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous.tsv

$ tail -n 3 Results/Results_all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous.tsv
all_vs_GTex     rna_expression  PBTA    ZYXP1   ENSG00000274572 NA      EFO_0005543     MONDO_0016685   all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous  cohorts Low-grade glioma/astrocytoma    301     Adipose - Subcutaneous     663     0       0       0       NA      NA      NA      NA      NA
all_vs_GTex     rna_expression  PBTA    ZZEF1   ENSG00000074755 NA      EFO_0005543     MONDO_0016685   all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous  cohorts Low-grade glioma/astrocytoma    301     Adipose - Subcutaneous     663     7.32    20.25   6439.94481019794        -0.0541250047621615     0.0210363060870467      -2.57293293500277  0.0100840746039162      0.013778095770223
all_vs_GTex     rna_expression  PBTA    ZZZ3    ENSG00000036549 NA      EFO_0005543     MONDO_0016685   all_cohorts_Low-grade_glioma_astrocytoma_v_Adipose_-_Subcutaneous  cohorts Low-grade glioma/astrocytoma    301     Adipose - Subcutaneous     663     7.43    11.26   2483.9127807217 -0.152154044854749      0.021343010254012       -7.12898710368875  1.01110206143255e-12    2.3203249376904e-12
$ time Rscript --vanilla run-DESeq-analysis.R \
>   --hist_file ../../data/histologies.tsv \
>   --counts_file ../../data/gene-counts-rsem-expected_count-collapsed.rds \
>   --tpm_file ../../data/gene-expression-rsem-tpm-collapsed.rds \
>   --ensg_hugo_file ../../data/ensg-hugo-rmtl-mapping.tsv \
>   --efo_mondo_file ../../data/efo-mondo-map.tsv \
>   --HIST_i 1 --GTEX_i 2
converting counts to integer mode
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
estimating size factors
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
final dispersion estimates
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
fitting model and testing
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
-- replacing outliers and refitting for 2150 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]
fitting model and testing
  Note: levels of factors in the design contain characters other than
  letters, numbers, '_' and '.'. It is recommended (but not required) to use
  only letters, numbers, and delimiters '_' or '.', as these are safe characters
  for column names in R. [This is a message, not an warning or error]

real    9m55.489s
user    9m50.210s
sys     0m42.261s

@logstar Now the script does not use this.path, and as you suggested, the bash script provides the path

@afarrel
Copy link

afarrel commented Aug 12, 2021

Hi @sangeetashukla, I looked over the code. It's currently written to loop through every comparison of cancer_group vs GTEx through the nested for loop even though you take the Histology index and the GTEx index. Are you still thinking of running this sequentially or in parallel? If in parallel then the nested for loop is unnecessary.

Also, I made a more robust function to convert a data frame directly to jsonl (without going to json first) taking into considerations @logstar concerns:

   Create_josnl = function(DF){
       DF_jsonl = suppressWarnings(
         apply(DF, MARGIN=1, function(X)
           paste("{",
                 paste(colnames(DF),":",sapply(X, function(Y)
                   ifelse(is.na(as.numeric(Y)),paste("\"",Y,"\"",sep=""),gsub(" ","",Y)) ),
                   collapse = ",",sep=""),"}",sep = "")
           )#apply(DF, MARGIN=1, function(X)
       )#DF_jsonl = suppressWarnings(
       return(DF_jsonl)
   }#Create_josnl = function(DF){

   New_jsonl = Create_josnl(DF)
   write(New_jsonl,"FILE_PATH")

@logstar , if you have time, can you give it a shot and let me know your thoughts.

Then concatenate all the jsonl files into one huge jsonl file as @logstar commented above.

Alternatively, you can follow @logstar suggestions above, if it's no issue with creating the docker image and getting up on Cavatica.
Earlier Comment:

  1. output JSON files for each individual comparison using jsonlite::write_json; 2) convert JSON files to JSONL files individually using jq --compact-output '.[]' comparison11.json > comparison11.jsonl; 3) concatenate all JSONL files together with cat comparison*.jsonl > all_comparisons.jsonl. For more details abount jq, see comment #27 (comment).

@afarrel
Copy link

afarrel commented Aug 13, 2021

@sangeetashukla I looked at the code and ran it on isilon and reviewed the results (ticket #164). For the all_cohorts comparison, the cohort column should also have the value "All Cohorts".

The below line should changed:

 cohort = paste(unique(hist$cohort[which(hist$Kids_First_Biospecimen_ID %in% HIST_sample_type_df_filtered$Case_ID)])
             ,collapse=";",sep=";")

Can you kindly change alter the code so "All Cohorts" appear in the code in the all_cohorts comparisons and the individual cohorts (eg PBTA, GMKF) show up on the for the cohort level comparisons.
Thanks!

@sangeetashukla
Copy link
Collaborator Author

@jharenza @afarrel Please disregard this PR and use this PR
I had some conflicts due to which I was unable to push any more changes to this branch. Therefore I created a new branch, with neat files and new README.md

@jharenza jharenza closed this Sep 13, 2021
@sangeetashukla sangeetashukla deleted the run-DESeq-1-0 branch October 15, 2021 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants