Calculate TPM mean/z-score/SD/quantile summary statistics within each cancer group and cohort #27
d3b-center/ticket-tracker-OPC#51 Calculate TPM summary statistics within each cancer group and cohort
Add rna-seq-expression-summary-stats module in the analysis module table. d3b-center/ticket-tracker-OPC#51
We will need the following for integration with the OT platform: please use Ensembl ENSG IDs as an option for row names. OT uses Ensembl and UniProt.
Add a sample metadata table to list the number of samples and sample IDs in each cancer group and cohort. Add gene Ensembl IDs as a column in the summary statistics tables. If one gene symbol matches multiple Ensembl IDs, output a comma-separated list of Ensembl IDs. Combine CBTN and PNOC into one cohort, PBTA, for this analysis, as suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)
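A minimal Python sketch of the sample metadata table described in this commit message, not the module's R code. The input layout (tuples of sample ID, cancer group, cohort), the output column names, and the sample IDs are assumptions made for illustration; only the CBTN/PNOC to PBTA merge comes from the message itself.

```python
from collections import defaultdict

# Cohort merge stated in the commit message: CBTN and PNOC become PBTA.
COHORT_MERGE = {"CBTN": "PBTA", "PNOC": "PBTA"}


def sample_metadata(samples):
    """Group sample IDs by (cancer_group, merged cohort).

    `samples` is an iterable of (sample_id, cancer_group, cohort) tuples,
    which is a hypothetical input layout.
    """
    groups = defaultdict(list)
    for sample_id, cancer_group, cohort in samples:
        merged_cohort = COHORT_MERGE.get(cohort, cohort)
        groups[(cancer_group, merged_cohort)].append(sample_id)
    return [
        {
            "cancer_group": cancer_group,
            "cohort": cohort,
            "n_samples": len(ids),
            "sample_ids": ",".join(sorted(ids)),
        }
        for (cancer_group, cohort), ids in sorted(groups.items())
    ]


# Hypothetical sample IDs, for illustration only.
samples = [
    ("S1", "Meningioma", "CBTN"),
    ("S2", "Meningioma", "PNOC"),
    ("S3", "Neuroblastoma", "GMKF"),
]
metadata_table = sample_metadata(samples)
```

Each row of `metadata_table` carries both the sample count and the sample IDs, so a reader can check which samples fed into each summary statistic.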
Updates to this PR:
The first two columns of the TPM mean/SD/z-score/quantile tables are gene symbols and Ensembl ENSG IDs. If one gene symbol matches multiple Ensembl IDs, the value of the Ensembl ID column is a comma-separated list of all Ensembl IDs, e.g. `ENSG00000206952.3,ENSG00000281910.1`.
Denote the (n_genes, n_cancer_groups/n_cancer_group_cohorts) mean_TPM_vector combined matrix as mean_TPM_matrix. Generate z-scores across all cancer_groups/cancer_group_cohorts as z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix). Call these z-scores gene_wise_mean_tpm_z_scores in the filenames. This is suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)
Gene-wise mean TPM z-score tables:
- results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
- results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv
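The row-wise formula in this commit message can be sketched in Python as follows. This is an illustration rather than the module's R implementation: the matrix is a plain list of rows (one per gene), the example values are made up, and `statistics.stdev` is used because it is the sample standard deviation (n - 1 denominator), the same convention as R's `sd()`.

```python
from statistics import mean, stdev


def gene_wise_z_scores(mean_tpm_matrix):
    """Standardize each gene (row) across its cancer-group columns."""
    z_score_matrix = []
    for row in mean_tpm_matrix:
        row_mean = mean(row)
        row_sd = stdev(row)  # sample SD, n - 1 denominator
        if row_sd == 0:
            # A constant row has no spread; 0 / 0 is NaN in R as well.
            z_score_matrix.append([float("nan")] * len(row))
        else:
            z_score_matrix.append([(x - row_mean) / row_sd for x in row])
    return z_score_matrix


# Hypothetical values: 2 genes x 3 cancer groups.
example_matrix = [
    [1.0, 2.0, 3.0],
    [5.0, 5.0, 5.0],
]
z = gene_wise_z_scores(example_matrix)
```

A gene with identical mean TPM in every group yields NaN rather than 0, which keeps "no variation" distinguishable from "exactly average".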
Updates to this PR:
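Denote the (`n_genes`, `n_cancer_groups`/`n_cancer_group_cohorts`) `mean_TPM_vector` combined matrix as `mean_TPM_matrix`. Generate z-scores across all `cancer_groups`/`cancer_group_cohorts` as `z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix)`.
Gene-wise mean TPM z-score tables:
- results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
- results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv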
In the filenames of mean TPM z-score and quantile tables, move "mean_tpm" before "gene_wise"/"cancer_group_wise".
Renamed:
- results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
- results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
- results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_gene_wise_z_scores.tsv
- results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
- results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
- results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_gene_wise_z_scores.tsv
Updates to this PR:
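In the filenames of mean TPM z-score and quantile tables, moved "mean_tpm" before "gene_wise"/"cancer_group_wise". The renamed files are the six result tables listed in the commit message above.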
per our 8 am meeting, will also want a |
Hi @jharenza . Thank you for the update. I will use |
Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus-polya.tsv` and `independent-specimens.rnaseq.primary-plus-stranded.tsv` in the results of the `independent-samples` analysis module. Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and ranks.
Updates to this PR:
Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus-polya.tsv` and `independent-specimens.rnaseq.primary-plus-stranded.tsv` in the results of the `independent-samples` analysis module. Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.
Note: the independent RNA-seq sample list may be changed to combine the poly-A and stranded lists into one, as suggested by @jharenza at #26 (comment). I will update the results when the updated independent RNA-seq sample list is available.
Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus.tsv` in the results of the `independent-samples` analysis module. `independent-specimens.rnaseq.primary-plus.tsv` has both poly-A and stranded samples.
Updates to this PR:
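Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus.tsv` in the results of the `independent-samples` analysis module. Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.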
Add descriptions about Ensembl IDs.
Need to use analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus.tsv
Use `independent-specimens.rnaseq.primary-plus.tsv` directly from the results of the `independent-samples` analysis module. `independent-specimens.rnaseq.primary-plus.tsv` is available at the `dev` branch after merging <d3b-center#26>. TPM z-score/mean/SD/quantile results are not changed.
Revised comments and refactored some procedures to improve the readability of the code in analyses/rna-seq-expression-summary-stats/01-tpm-summary-stats.R.
Per Slack conversation, @taylordm requests that every line be one gene in one disease, with columns for each metric: TPM, N, z-score, quantile (e.g. ~50K genes x n cancer groups).
Generate long-format tables with each row as a JSON record, as suggested by @jharenza and @taylordm at d3b-center#27 (comment)
Two long tables are generated, each with gene_wise_zscore or group_wise_zscore respectively.
- results/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz
- results/long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz
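One way to picture the wide-to-long conversion, as a Python sketch rather than the module's R code. The nested-dictionary input, the metric names `tpm_mean` and `tpm_sd`, and the record keys `gene_symbol` and `cancer_group` are all hypothetical; the actual column set of the long tables is not reproduced here.

```python
import json


def wide_to_long(wide_tables):
    """Merge several wide tables into one record per (gene, group).

    `wide_tables` maps a metric name to {gene: {group: value}},
    a hypothetical in-memory stand-in for the wide TSV files.
    """
    records = {}
    for metric, table in wide_tables.items():
        for gene, values_by_group in table.items():
            for group, value in values_by_group.items():
                record = records.setdefault(
                    (gene, group),
                    {"gene_symbol": gene, "cancer_group": group},
                )
                record[metric] = value
    # Sort by (gene, group) so the output order is deterministic.
    return [records[key] for key in sorted(records)]


# Made-up values for one gene in two groups.
wide = {
    "tpm_mean": {"GENE_A": {"Group1": 10.0, "Group2": 2.0}},
    "tpm_sd": {"GENE_A": {"Group1": 1.5, "Group2": 0.5}},
}
long_records = wide_to_long(wide)
json_records = [json.dumps(record, sort_keys=True) for record in long_records]
```

Every long record holds all metrics for one gene in one group, which matches the "one gene in one disease per line" request in the thread.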
Update to this PR:
Generate long summary statistic tables for converting to JSON format, with each row as a tab-delimited record of the following columns, as suggested by @jharenza and @taylordm at #27 (comment).
The following long tables are generated using the wide tables.
If the record is generated using all cohorts, the Different from the wide tables, the
|
Hi @jharenza and @taylordm. I am planning to re-generate the tables using the updated independent sample list, which is available in PR #30, merged an hour ago. Which independent sample list in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/independent-samples/results should I use for re-generating the tables? I used the primary-plus independent sample list before, which was the only one available. In the updated independent sample lists, RNA-seq now also has primary, primary-plus, and relapse lists.
you can just use primary for now, thanks! |
Just a few minor points, but I think we can merge this soon. We can also probably merge without d3b-center/ticket-tracker-OPC#89, and submit an updated analysis ticket once that is in.
if [[ -d results ]]; then
  rm -r results
fi
I don't think this is necessary, and could potentially be unwanted, for instance, if we add another step or another analysis in which we add another type of file, rather than replace all files. With new runs, replacement should be sufficient so I think you can remove this.
Thank you for the suggestion. I agree. I will remove this part.
Indeed, if a user manually runs a part of this script from a wrong working directory, it will delete the `results` directory of their current working directory, which might be catastrophic.
I will also remove the same part in the `snv-frequencies` module.
### Methods
Select independent RNA-seq samples using `independent-specimens.rnaseq.primary.tsv` in the results of the `independent-samples` analysis module.
Note to update this after d3b-center/ticket-tracker-OPC#89
Thanks. Will update when the new list is available.
Generate z-scores across all `cancer_groups`/`cancer_group_cohorts` as `z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix)`. Call these z-scores as `mean_tpm_gene_wise_z_scores` in the filenames.
Call aforementioned (`n_genes`, `n_cancer_groups`/`n_cancer_group_cohorts`) tables as wide tables.
It is possible we can remove the wide version of the tables in the future, but we can leave in for now and make an update later if the long tables look to be what is desired.
#### All cohort summary statistics tables
The following wide tables are generated using the methods described above. Rows are genes. Columns are `cancer_group`s, except that the first two columns are gene symbol and gene Ensembl ID. If one gene symbol matches multiple Ensembl IDs, the value of the Ensembl ID column is a comma-separated list of all Ensembl IDs, e.g. `ENSG00000206952.3,ENSG00000281910.1`. In `input/ens_symbol.tsv`, `SNORA50A` is mapped to both `ENSG00000206952.3` and `ENSG00000281910.1`.
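The comma-separated Ensembl ID column can be illustrated with a short Python sketch. It is not the module's R code: the (symbol, Ensembl ID) pair list is an assumed input layout, and `GENE_B` with its ID is a placeholder. The assertion mirrors the concern raised later in the thread that pasting a missing value would silently produce the string `NA`.

```python
def collapse_ensembl_ids(symbol_id_pairs):
    """Map each gene symbol to a comma-separated string of its Ensembl IDs."""
    ids_by_symbol = {}
    for symbol, ensembl_id in symbol_id_pairs:
        # Fail loudly on missing values instead of joining them into the output.
        assert symbol is not None and ensembl_id is not None, "missing value in mapping"
        ids = ids_by_symbol.setdefault(symbol, [])
        if ensembl_id not in ids:  # keep first-seen order, drop repeats
            ids.append(ensembl_id)
    return {symbol: ",".join(ids) for symbol, ids in ids_by_symbol.items()}


pairs = [
    ("SNORA50A", "ENSG00000206952.3"),
    ("SNORA50A", "ENSG00000281910.1"),
    ("GENE_B", "ENSG_PLACEHOLDER.1"),  # placeholder, not a real mapping
]
collapsed = collapse_ensembl_ids(pairs)
```

Here `collapsed["SNORA50A"]` is `ENSG00000206952.3,ENSG00000281910.1`, the example value quoted in the README text.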
I'm sure you will be updating this
Thanks for the note. Will update in the next patch.
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Removed `rm -r results` part in run-rna-seq-expression-summary-stats.sh, as suggested by @jharenza at d3b-center#27 (comment) .
In all result tables, replace AllCohorts with PedOT, as suggested by @jharenza at d3b-center#27 (comment)
Thank you for the review @jharenza . Updates to this PR:
|
Update README.md for recent patches described in d3b-center#27 .
Add a note that the `NA`/`NaN`s in result tables are represented with blank string `''`s.
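For readers consuming the tables, a small Python sketch of the blank-string convention noted in this commit message. The helper names and the example row are hypothetical, and the module itself writes its tables from R.

```python
import csv
import io
import math


def format_cell(value):
    """Render missing values (None or NaN) as a blank string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


def write_tsv(header, rows):
    """Write rows as tab-delimited text, blanking out missing values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([format_cell(value) for value in row])
    return buffer.getvalue()


# Made-up row: the Group2 value is missing.
tsv_text = write_tsv(
    ["gene_symbol", "Group1", "Group2"],
    [["GENE_A", 1.5, float("nan")]],
)
```

When reading such a table back, an empty field should be treated as missing rather than as zero or as an empty label.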
Removed the `rm -r results` part in run-rna-seq-expression-summary-stats.sh, as suggested by @jharenza at d3b-center#27 (comment)
Sorry for the additional update ask - can you change `PedOT` to `all_cohorts` per this Slack thread?
No problem at all. Sorry that I just saw your comment here. I will update accordingly, after getting the current commit done for the gene-level SNV frequency tables. |
In all result tables, change PedOT to all_cohorts, as suggested by @jharenza at d3b-center#27 (review)
Convert results/*json files to JSON Lines format. See run-rna-seq-expression-summary-stats.sh for more details.
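The JSON to JSON Lines conversion can be sketched in Python as follows. This assumes each JSON file holds a single top-level array of records, which the thread does not state explicitly; the thread indicates the module performs the conversion in `run-rna-seq-expression-summary-stats.sh` with `jq`, so this is only an equivalent illustration with a made-up function name.

```python
import json


def json_array_to_jsonl(json_text):
    """Convert a JSON array of records into JSON Lines text."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One compact JSON document per line, each terminated by a newline.
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


jsonl_text = json_array_to_jsonl('[{"a": 1}, {"b": null}]')
```

Each output line is a complete JSON document, so a downstream loader can stream the file record by record instead of parsing one large array.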
Update to this PR: Converted JSON files to JSON Lines (JSONL) format with
If the JSONL files work for OT database and ETL, I will update the Docker file in another PR to add a command to install Rationale that
Regarding the other two JSONL format requirements:
|
Merge the updated independent sample lists.
Use independent-specimens.rnaseq.primary.eachcohort.tsv from the independent-samples module to select independent samples.
Update to this PR: Changed independent sample list to v6 Previously used independent sample list is v5 In v6, the differences between
- Patient and sample IDs
|
Assert no NA before paste(x). paste(c(NA)) returns 'NA'.
Install json processor jq for converting JSON format to JSONL format. Related comment: <#27 (comment)>.
Since the JSONL format was approved by FNL, I think we can merge this and once the annotation module is in, we can update later.
Thank you for the review! |
Purpose/implementation Section
What scientific question is your analysis addressing?
Within each cancer group and cohort, calculate TPM means, standard deviations, z-scores, and ranks.
What was your approach?
For each `cancer_group`, select one of the following two sets of samples:
- Samples from all `cohort`s, e.g. CBTN, GMKF, and PNOC.
- Samples from each individual `cohort`.
If >= 5 samples are selected, generate the following summary statistics:
- TPM means of each gene, denoted as `mean_TPM_vector`.
- TPM standard deviations of each gene.
- z-scores across all genes: `z_score_vector = (mean_TPM_vector - mean(mean_TPM_vector)) / sd(mean_TPM_vector)`.
- Quantile descriptions of mean TPM: `Highest expressed 25%`, `Expression between upper quartile and median`, `Expression between median and lower quartile`, and `Lowest expressed 25%`. If multiple genes have the same mean TPM value, their tied rank is the lowest rank, in order to be conservative in describing their expression levels.
Combine each type of the summary statistics vectors into a table, with rows as genes, and columns as `cancer_group_cohort`.
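The cancer-group-wise z-scores and quantile labels described above can be sketched in Python. This is an illustration under stated assumptions, not the module's R code: ranks are ascending with ties taking the lowest rank (as the description says), but the exact cutoffs that map a rank to one of the four labels (rank / n compared against 0.25, 0.5, and 0.75) are a guess, and the example values are made up.

```python
from bisect import bisect_left
from statistics import mean, stdev

# Assumed cutoffs on rank / n; the module's exact boundaries may differ.
QUANTILE_CUTOFFS = [
    (0.75, "Highest expressed 25%"),
    (0.50, "Expression between upper quartile and median"),
    (0.25, "Expression between median and lower quartile"),
]


def group_wise_z_scores(mean_tpm_vector):
    """Standardize mean TPM across all genes within one group of samples."""
    vector_mean = mean(mean_tpm_vector)
    vector_sd = stdev(mean_tpm_vector)  # sample SD, as in R's sd()
    return [(x - vector_mean) / vector_sd for x in mean_tpm_vector]


def quantile_labels(mean_tpm_vector):
    """Label each gene by quartile, giving tied genes the lowest rank."""
    n = len(mean_tpm_vector)
    sorted_values = sorted(mean_tpm_vector)
    labels = []
    for x in mean_tpm_vector:
        # bisect_left finds the first position of x, i.e. the minimum rank.
        min_rank = bisect_left(sorted_values, x) + 1
        fraction = min_rank / n
        for cutoff, label in QUANTILE_CUTOFFS:
            if fraction > cutoff:
                labels.append(label)
                break
        else:
            labels.append("Lowest expressed 25%")
    return labels


# Hypothetical mean TPM values for four genes in one group.
mean_tpm_vector = [1.0, 2.0, 2.0, 4.0]
z_score_vector = group_wise_z_scores(mean_tpm_vector)
labels = quantile_labels(mean_tpm_vector)
```

Because the two tied genes share the lower rank, both fall into the lower of the two candidate quartile labels, which is the conservative behavior the description asks for.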
.Update 22-Jun-21 (DT):
Please use Ensembl ENSG IDs in row names.
Please provide lists of sample IDs used in each cancer summary statistic.
Please provide the number of tumor samples used in each summarized group, within another list is fine.
What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#51
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The function to generate summary statistics tables for each group of samples:
Update 22-Jun-21 (YZ): The function is updated to add z-scores across all cancer groups.
(link to the code)
Is there anything that you want to discuss further?
Do we need more quantiles? We currently have the following four values: `Highest expressed 25%`, `Expression between upper quartile and median`, `Expression between median and lower quartile`, and `Lowest expressed 25%`.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
Tables.
What is your summary of the results?
All cohort summary statistics tables
The following tables are generated using the methods described above. Rows are genes. Columns are `cancer_group`s, except that the first column is gene symbol.
- results/cancer_group_all_cohort_mean_tpm.tsv.gz
- results/cancer_group_all_cohort_standard_deviation_tpm.tsv.gz
- results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
- results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz
Individual cohort summary statistics tables
The following tables are generated using the methods described above. Rows are genes. Columns are `cancer_group_cohort`s, except that the first column is gene symbol. A `cancer_group_cohort` is a string that concatenates a `cancer_group` and a `cohort` by `___`, e.g. `Meningioma___CBTN`, `Neuroblastoma___GMKF`, and `Diffuse midline glioma___PNOC`.
- results/cancer_group_individual_cohort_mean_tpm.tsv.gz
- results/cancer_group_individual_cohort_standard_deviation_tpm.tsv.gz
- results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
- results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz
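A small Python sketch of building and parsing the `cancer_group_cohort` column names described above. The helper names are hypothetical, and splitting on the last `___` is an assumption that cohort names never contain the separator.

```python
SEPARATOR = "___"


def join_group_cohort(cancer_group, cohort):
    """Build a cancer_group_cohort label such as 'Meningioma___CBTN'."""
    return f"{cancer_group}{SEPARATOR}{cohort}"


def split_group_cohort(label):
    """Split a label back into (cancer_group, cohort) at the last separator."""
    cancer_group, separator, cohort = label.rpartition(SEPARATOR)
    if not separator:
        raise ValueError(f"no {SEPARATOR!r} separator in {label!r}")
    return cancer_group, cohort


label = join_group_cohort("Diffuse midline glioma", "PNOC")
parts = split_group_cohort(label)
```

Spaces inside a cancer group name are preserved, so the label round-trips back to the original pair.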
Reproducibility Checklist
Documentation Checklist
- This analysis module has a `README` and it is up to date.
- This analysis is recorded in the table in `analyses/README.md` and the entry is up to date.