Run de seq 1 0 #28
@yuankunzhu @sickler-alex here is the DESeq2 module which @sangeetashukla is having problems running locally - can you help try to run this on EC2?
I attempted to run this script with the v5 histology and counts data on Respublica, but it won't let me install the DESeq2 package. I am running this script locally, but it has been running for a few hours. From a Slack conversation with @jharenza, we may be able to run it faster on an AWS EC2 instance.
@jharenza do you want us to do a code review on this PR, or just test whether the script in this PR runs successfully? Taking this as a standalone script, at the least we need a "requirements list" for the libraries used there. Also @sangeetashukla, I haven't used Respublica for a while and am not sure what the current policy/process is, but I wonder if you could submit an IS request to install the needed package, or install your own R environment so you have full control over packages there. The latter was actually how I was able to install DESeq2 and other libraries before. before you
@sangeetashukla did you have a chance to try this again on isilon per @yuankunzhu's recommendations? No review @yuankunzhu - just see if it would run.
Hi @jharenza and @yuankunzhu
Hi @sangeetashukla I can help you get DESeq2 installed on Respublica/isilon. I just did it on my account and it seems to be okay.
hi @afarrel and @sangeetashukla! now that #34 is in, will you rerun with v6 and make the following updates:
It may also be a good idea to create an option in this module to either run on a small sample of the data such that it will run locally or run on the full data, which can be run in CAVATICA (thoughts, @kgaonkar6 @yuankunzhu @migbro @zhangb1 ?)
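A minimal sketch of what such a subset/full switch could look like, in base R. The helper name `subset_samples`, the `run_mode` variable, and the toy column names are assumptions for illustration, not part of the module:

```r
# Hypothetical helper: keep at most `n_per_group` samples per group
# so that a quick test run finishes locally.
subset_samples <- function(sample_table, group_col, n_per_group = 5, seed = 1) {
  set.seed(seed)  # make the subset reproducible
  groups <- split(seq_len(nrow(sample_table)), sample_table[[group_col]])
  keep <- unlist(lapply(groups, function(idx) {
    if (length(idx) <= n_per_group) idx else sample(idx, n_per_group)
  }), use.names = FALSE)
  sample_table[sort(keep), , drop = FALSE]
}

# Toy histology-like table; the column names are assumptions.
toy <- data.frame(
  sample_id = paste0("S", 1:12),
  group = rep(c("Tumor_A", "GTEx_Brain"), times = c(8, 4)),
  stringsAsFactors = FALSE
)

# In practice this would come from a command-line argument.
run_mode <- "subset"
samples <- if (run_mode == "subset") {
  subset_samples(toy, "group", n_per_group = 3)
} else {
  toy
}
print(table(samples$group))
```

The counts matrix would then be restricted to the retained sample IDs before building the DESeq2 dataset, so the same script serves both the GitHub/Docker test and the full CAVATICA run.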
Hi @sangeetashukla! Thanks for committing this code - have you run this in the Docker image with the v6 data? I suggested to @afarrel that we create two types of runs for this - one being a small subset of samples which we can test here via GitHub and Docker, and one being the full dataset which you will run later on CAVATICA. If it is tested and ready for review, we can remove the
I also noticed the uberon file commit. I suggested to @afarrel for now to leave this out of the module, as that mapping is not in v6, and we have decided to do annotations of all of the files through a function which @logstar will be creating in d3b-center/ticket-tracker-OPC#112. cc @kgaonkar6
Thank you for developing the differential expression analysis module @sangeetashukla @afarrel !
I wonder if you could provide a more detailed description of this module, for example, in a README.md or a run_DESeq_analysis.sh if applicable, so that other people can understand the module and reproduce the results. Although run-DESeq-analysis.R has an example run at the top, it would also be helpful to note somewhere in the module how to determine all the input number combinations, and that the input numbers are 1-based. I noticed in Slack and meetings that this module needs to be run on a cluster or the CAVATICA platform, so reproducing its results works differently from other modules, but it would be helpful to note the differences and the specific procedures used to generate the results.
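As a rough illustration of how the input number combinations could be documented, the sketch below enumerates every 1-based index pair with `expand.grid`. The group vectors are made up, and the argument order (histology index first, GTEx index second) is an assumption that would need to be checked against the script:

```r
# Hypothetical group vectors; the real script derives these from the
# histology and GTEx data.
histology_groups <- c("Ependymoma", "Medulloblastoma", "Neuroblastoma")
gtex_groups <- c("Brain", "Liver")

# One row per comparison; both indices are 1-based, as in R.
combos <- expand.grid(
  hist_index = seq_along(histology_groups),
  gtex_index = seq_along(gtex_groups)
)
combos$cancer_group <- histology_groups[combos$hist_index]
combos$gtex_group <- gtex_groups[combos$gtex_index]

# One possible invocation per comparison (argument order is assumed).
commands <- sprintf(
  "Rscript --vanilla run-DESeq-analysis.R %d %d",
  combos$hist_index, combos$gtex_index
)
writeLines(commands)
```

A table like `combos` could also be written to the README so readers can look up which index pair corresponds to which comparison.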
I wonder if you could revise run-DESeq-analysis.R to roughly follow the style guide at http://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html, especially by breaking lines longer than about 100 characters into multiple lines. The revision would save a lot of time for anyone who reads or revises your code.
Could you also rerun one or a few combinations in the Docker image before the next review, so that the reviewer can run your script directly without any edits?
The script runs with a few messages and warnings in the Docker image, after adding stringsAsFactors = FALSE.
$ Rscript --vanilla run-DESeq-analysis.R 1 1
converting counts to integer mode
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
estimating size factors
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
final dispersion estimates
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
fitting model and testing
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
-- replacing outliers and refitting for 2646 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
fitting model and testing
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not an warning or error]
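The note repeated throughout the log above comes from group labels that contain characters such as spaces or slashes. One possible way to silence it, sketched here with made-up labels, is to sanitize the labels before building the design and keep a lookup table to restore the original names in the results:

```r
# Toy group labels with spaces, hyphens and slashes (made up for illustration).
type <- c("GTEx Brain", "High-grade glioma/astrocytoma", "GTEx Brain")

# Replace every character outside letters, digits, '_' and '.' with '_'.
safe_type <- gsub("[^A-Za-z0-9_.]", "_", type)

# Keep a lookup so the original labels can be restored in the result table.
label_lookup <- unique(data.frame(
  original = type,
  safe = safe_type,
  stringsAsFactors = FALSE
))
print(label_lookup)
```

Converting the sanitized column to a factor explicitly before constructing the dataset should also avoid the "converting to factors" warning, though this has not been tested against the module.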
Following are my specific comments.
design= ~ Type)
sub.deseqdataset$Type <- factor(sub.deseqdataset$Type, levels=c(GTEX_filtered[J], histology_filtered[I]))
It would be helpful to note somewhere in the module or result table that the log fold change in the result is GTEX / cancer_group.
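The direction of the log fold change follows from which factor level is the reference. In base R, the first entry of `levels` is the reference level, and DESeq2's default results compare the last level against that reference. The toy sketch below only shows how to check and change the reference; which direction applies to this module depends on the level order in the version of the script being run, so it should be confirmed there before documenting it:

```r
# Toy labels, not taken from the real data.
type <- c("Tumor", "GTEx", "Tumor", "GTEx")

# The first entry of `levels` becomes the reference level.
type_factor <- factor(type, levels = c("GTEx", "Tumor"))
print(levels(type_factor)[1])

# relevel() switches the reference without retyping all levels.
type_releveled <- relevel(type_factor, ref = "Tumor")
print(levels(type_releveled))
```

Printing the levels next to the result table, or recording them in the README, would make the numerator and denominator of the reported fold change explicit.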
Co-authored-by: Yuanchao Zhang <logstar@users.noreply.github.com>
@logstar Now the script does not use this.path and, as you suggested, the bash script provides the path.
Hi @sangeetashukla, I looked over the code. It's currently written to loop through every comparison of cancer_group vs GTEx in the nested for loop, even though you take the histology index and the GTEx index as inputs. Are you still thinking of running this sequentially, or in parallel? If in parallel, then the nested for loop is unnecessary. Also, I made a more robust function to convert a data frame directly to jsonl (without going to json first), taking into consideration @logstar's concerns:
@logstar, if you have time, can you give it a shot and let me know your thoughts? Then concatenate all the jsonl files into one huge jsonl file, as @logstar commented above. Alternatively, you can follow @logstar's suggestions above, if there's no issue with creating the Docker image and getting it up on CAVATICA.
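For the concatenation step, a minimal base R sketch is shown below. The helper name `concat_jsonl` and the file names are hypothetical; it relies only on the fact that each JSONL line is an independent record, so the per-comparison files can be appended line by line:

```r
# Hypothetical helper: append every per-comparison JSONL file to one output.
concat_jsonl <- function(input_files, output_file) {
  out <- file(output_file, open = "w")
  on.exit(close(out))
  for (f in input_files) {
    lines <- readLines(f, warn = FALSE)
    lines <- lines[nzchar(trimws(lines))]  # skip blank lines
    writeLines(lines, out)
  }
  invisible(output_file)
}

# Toy demonstration with two temporary files (contents are made up).
tmp_dir <- tempfile("jsonl_demo")
dir.create(tmp_dir)
writeLines('{"gene":"A","log2FoldChange":1.2}',
           file.path(tmp_dir, "res_1_1.jsonl"))
writeLines(c('{"gene":"A","log2FoldChange":-0.4}', ""),
           file.path(tmp_dir, "res_1_2.jsonl"))

inputs <- sort(list.files(tmp_dir, pattern = "\\.jsonl$", full.names = TRUE))
combined <- concat_jsonl(inputs, tempfile(fileext = ".jsonl"))
print(readLines(combined))
```

Writing the combined file outside the input directory avoids picking it up as an input if the step is rerun.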
@sangeetashukla I looked at the code, ran it on isilon, and reviewed the results (ticket #164). For the all_cohorts comparison, the cohort column should also have the value "All Cohorts". The line below should be changed:
Can you kindly alter the code so that "All Cohorts" appears in the cohort column for the all_cohorts comparisons and the individual cohorts (e.g. PBTA, GMKF) show up for the cohort-level comparisons?
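A small sketch of the requested labelling, assuming the script knows which kind of comparison it is running. The helper name `cohort_label`, the `comparison_type` argument, and the toy result tables are assumptions, not the module's actual code:

```r
# Hypothetical helper; `comparison_type` and its values are assumed names.
cohort_label <- function(comparison_type, cohort) {
  if (comparison_type == "all_cohorts") "All Cohorts" else cohort
}

# all_cohorts comparison: every row gets "All Cohorts".
results_all <- data.frame(gene = c("A", "B"), stringsAsFactors = FALSE)
results_all$cohort <- cohort_label("all_cohorts", cohort = NA_character_)

# Cohort-level comparison: every row gets the individual cohort name.
results_pbta <- data.frame(gene = c("A", "B"), stringsAsFactors = FALSE)
results_pbta$cohort <- cohort_label("cohort_level", cohort = "PBTA")

print(results_all)
print(results_pbta)
```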
Purpose/implementation Section
Created a new directory and script for implementing DESeq analysis
What scientific question is your analysis addressing?
Perform DE between each cancer type vs. all GTEX as well as each individual GTEX tissue (subgroup).
What was your approach?
Use DESeq2 package
What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#26
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
This script contains a few lines of code to test whether the data is loaded from files correctly; these may be removed later when the histology and gene-counts files are finalised.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes. The results will be saved in a txt file for review.
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.