Calculate TPM mean/z-score/SD/quantile summary statistics within each cancer group and cohort #27

logstar · 2021-06-22T20:29:12Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Within each cancer group and cohort, calculate TPM means, standard deviations, z-scores, and ranks.

What was your approach?

For each cancer_group, select one of the following two sets of samples:

Samples from all cohorts, e.g. CBTN, GMKF, and PNOC.
Samples from each individual cohort.

If >= 5 samples are selected, generate the following summary statistics:

TPM means of each gene across all selected samples, and denote this vector as mean_TPM_vector.
TPM standard deviations of each gene across all selected samples.
z-scores of each gene across all genes, computed as z_score_vector = (mean_TPM_vector - mean(mean_TPM_vector)) / sd(mean_TPM_vector).
Ranks of each gene's mean TPM (the mean taken across all selected samples) among all genes, each taking one of the following four values: Highest expressed 25%, Expression between upper quartile and median, Expression between median and lower quartile, and Lowest expressed 25%. If multiple genes have the same mean TPM value, their tied rank is the lowest rank, to be conservative in describing their expression levels.

Combine each type of the summary statistics vectors into a table, with rows as genes, and columns as cancer_group_cohort.
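As a rough illustration of the steps above, here is a minimal base-R sketch for a single group of samples. The toy matrix, the gene and sample names, and the `min_samples` variable are made up for this example; the actual module works on the full TPM table.

```r
# Toy TPM matrix: 4 genes (rows) x 5 samples (columns) from one hypothetical group.
tpm <- matrix(
  c(10, 12, 11,  9, 13,
     0,  1,  0,  2,  2,
    50, 55, 45, 60, 40,
     5,  5,  5,  5,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:5)))

# Only summarize groups with at least 5 selected samples.
min_samples <- 5
stopifnot(ncol(tpm) >= min_samples)

# Per-gene mean and standard deviation across the selected samples.
mean_TPM_vector <- rowMeans(tpm)
sd_TPM_vector <- apply(tpm, 1, sd)

# z-score of each gene's mean TPM relative to all genes in this group.
z_score_vector <-
  (mean_TPM_vector - mean(mean_TPM_vector)) / sd(mean_TPM_vector)

# Rank of each gene's mean TPM among all genes; ties get the lowest rank.
rank_vector <- rank(mean_TPM_vector, ties.method = "min")
```

Repeating this for every cancer_group (or cancer_group_cohort) and binding the resulting vectors column-wise gives the genes-by-groups tables described above.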

Update 22-Jun-21 (DT):

Please use Ensembl ENSG IDs in row names.
Please provide lists of sample IDs used in each cancer summary statistic.
Please provide the number of tumor samples used in each summarized group; a separate list is fine.

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#51

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The function to generate summary statistics tables for each group of samples:

Update 22-Jun-21 (YZ): The function is updated to add z-scores across all cancer groups.

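# This excerpt relies on the pipe, group_by, and summarise_all from dplyr,
# and on column_to_rownames from tibble.
library(dplyr)
library(tibble)

# Generate means, standard deviations, z-scores, and ranks within each group.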
#
# Args:
# - exp_df: (n_genes, n_samples) expression level numeric data frame
# - groups: character vector of length n_samples, which is used for grouping
#   the samples.
#
# Returns a list of (n_genes, n_groups) summary statistics tables.
get_expression_summary_stats <- function(exp_df, groups) {
  # unique groups to check that the computing steps do not modify the groups.
  check_groups <- sort(unique(groups))
  # gene symbols to check that the computing steps do not modify
  # the rownames of the exp_df.
  check_gids <- rownames(exp_df)


  # set check.names = FALSE and check.rows = FALSE to prevent R from
  # changing the rownames or colnames implicitly
  c_exp_df <- data.frame(t(exp_df), check.names = FALSE, check.rows = FALSE)
  c_exp_df$sample_group <- groups


  res_list <- list()




  # computed group means
  print('Compute means...')
  cg_mean_exp_df <- c_exp_df %>%
    group_by(sample_group) %>%
    summarise_all(mean) %>%
    column_to_rownames('sample_group')


  cg_mean_exp_out_df <- data.frame(
    t(cg_mean_exp_df), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_exp_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_exp_out_df)), check_groups))
  res_list$mean_df <- cg_mean_exp_out_df




  # compute group standard deviations
  print('Compute standard deviations...')
  cg_sd_exp_df <- c_exp_df %>%
    group_by(sample_group) %>%
    summarise_all(sd) %>%
    column_to_rownames('sample_group')


  cg_sd_exp_out_df <- data.frame(
    t(cg_sd_exp_df), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_sd_exp_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_sd_exp_out_df)), check_groups))
  res_list$sd_df <- cg_sd_exp_out_df


  # input is a numeric matrix
  # procedure adapted from @kgaonkar6's code at
  # <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/
  #     0a85a711709b5adc1e56a26a397d238cb3ebbb58/analyses/
  #     fusion_filtering/03-Calc-zscore-annotate.R#L115-L121>
  row_wise_zscores <- function(num_mat) {
    row_means <- rowMeans(num_mat)
    row_sds <- apply(num_mat, 1, sd)
    # row-wise z-score
    row_wise_zscore_mat <- sweep(num_mat, 1, row_means, FUN = '-')
    row_wise_zscore_mat <- sweep(row_wise_zscore_mat, 1, row_sds, FUN = '/')


    return(row_wise_zscore_mat)
  }


  # compute group-wise z-scores: within each group, standardize the mean TPMs
  # across all genes
  print('Compute group-wise z-scores...')
  # cg_mean_exp_df is (n_groups, n_genes)
  cg_mean_exp_mat <- as.matrix(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(colnames(cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(rownames(cg_mean_exp_mat)), check_groups))
  # group wise z-scores
  cg_mean_cgw_zscore_mat <- row_wise_zscores(cg_mean_exp_mat)


  cg_mean_cgw_zscore_df <- data.frame(
    t(cg_mean_cgw_zscore_mat), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_cgw_zscore_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_cgw_zscore_df)), check_groups))
  res_list$group_wise_zscore_df <- cg_mean_cgw_zscore_df




  # compute gene-wise z-scores: for each gene, standardize the mean TPMs
  # across all groups
  print('Compute gene-wise z-scores...')
  # cg_mean_exp_df is (n_groups, n_genes)
  # so gr_cg_mean_exp_mat is (n_genes, n_groups)
  # gr = gene rows
  gr_cg_mean_exp_mat <- t(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(gr_cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(gr_cg_mean_exp_mat)), check_groups))
  cg_mean_gene_wise_zscore_mat <- row_wise_zscores(gr_cg_mean_exp_mat)


  cg_mean_gene_wise_zscore_df <- data.frame(
    cg_mean_gene_wise_zscore_mat, check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_gene_wise_zscore_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(
    sort(colnames(cg_mean_gene_wise_zscore_df)), check_groups))
  res_list$gene_wise_zscore_df <- cg_mean_gene_wise_zscore_df




  # compute ranks
  print('Compute quantiles...')
  # cg_mean_exp_df is (n_groups, n_genes)
  cg_mean_exp_mat <- as.matrix(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(colnames(cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(rownames(cg_mean_exp_mat)), check_groups))
  # group wise ranks
  cg_mean_cgw_rank_mat <- apply(cg_mean_exp_mat, 1,
                                function(x) rank(x, ties.method='min'))
  # describes which quartile the genes are in
  cg_mean_cgw_d_mat <- cg_mean_cgw_rank_mat


  p0p25_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.75)
  cg_mean_cgw_d_mat[p0p25_idc] <- 'Highest expressed 25%'


  p25p50_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.5) &
    cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.75)
  # paste0 to make the line shorter
  cg_mean_cgw_d_mat[p25p50_idc] <- paste0(
    'Expression between upper quartile and median')


  p50p75_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.25) &
    cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.5)
  cg_mean_cgw_d_mat[p50p75_idc] <- paste0(
    'Expression between median and lower quartile')


  p75p100_idc <- cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.25)
  cg_mean_cgw_d_mat[p75p100_idc] <- 'Lowest expressed 25%'
  # assert all entries have description values
  stopifnot(identical(
    sort(unique(as.vector(cg_mean_cgw_d_mat))),
    c("Expression between median and lower quartile",
      "Expression between upper quartile and median", 
      "Highest expressed 25%", "Lowest expressed 25%")
  ))


  cg_mean_cgw_quant_out_df <- data.frame(
    cg_mean_cgw_d_mat, check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_cgw_quant_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_cgw_quant_out_df)), check_groups))
  res_list$quant_df <- cg_mean_cgw_quant_out_df




  return(res_list)
}

(link to the code)

Is there anything that you want to discuss further?

Do we need more quantiles? We currently have the following four values: Highest expressed 25%, Expression between upper quartile and median, Expression between median and lower quartile, and Lowest expressed 25%.
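If more quantiles are wanted, the rank thresholds could be generalized to any number of bins. The sketch below is one possible way to do that and is not part of the PR code; `rank_bin` and `n_bins` are hypothetical names. With `n_bins = 4` it reproduces the same rank cut-offs as the function above (rank <= 25% of genes is bin 1, and so on).

```r
# Hypothetical helper: assign each gene to one of n_bins rank-based bins.
# Bin 1 holds the lowest expressed genes, bin n_bins the highest expressed.
rank_bin <- function(mean_tpm, n_bins = 4) {
  n <- length(mean_tpm)
  # Tied genes share the lowest rank, as in the quartile code.
  r <- rank(mean_tpm, ties.method = "min")
  # Multiply before dividing so exact bin boundaries are not hit by
  # floating-point error.
  ceiling(r * n_bins / n)
}

quartile_labels <- c(
  "Lowest expressed 25%",
  "Expression between median and lower quartile",
  "Expression between upper quartile and median",
  "Highest expressed 25%")

example_means <- c(5, 1, 3, 7, 2, 8, 6, 4)
example_bins <- rank_bin(example_means, n_bins = 4)
example_labels <- quartile_labels[example_bins]
```

Calling `rank_bin(x, n_bins = 10)` would give deciles instead; the text labels would then need to be extended accordingly.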

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Tables.

What is your summary of the results?

All cohort summary statistics tables

The following tables are generated using the methods described above. Rows are genes. Columns are cancer_groups, except that the first column is gene symbol.

results/cancer_group_all_cohort_mean_tpm.tsv.gz
results/cancer_group_all_cohort_standard_deviation_tpm.tsv.gz
results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz

Individual cohort summary statistics tables

The following tables are generated using the methods described above. Rows are genes. Columns are cancer_group_cohorts, except that the first column is gene symbol. A cancer_group_cohort is a string that concatenates a cancer_group and a cohort by ___, e.g. Meningioma___CBTN, Neuroblastoma___GMKF, and Diffuse midline glioma___PNOC.
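For readers who need to build or parse these column names, here is a small sketch. The helper names `make_group_cohort` and `split_group_cohort` are hypothetical, and the sketch assumes that neither the cancer_group nor the cohort itself contains `___`.

```r
# Separator used between cancer_group and cohort in the column names.
group_cohort_sep <- "___"

# Hypothetical helper: build cancer_group_cohort strings.
make_group_cohort <- function(cancer_group, cohort) {
  paste(cancer_group, cohort, sep = group_cohort_sep)
}

# Hypothetical helper: split cancer_group_cohort strings back into two columns.
split_group_cohort <- function(group_cohort) {
  parts <- strsplit(group_cohort, group_cohort_sep, fixed = TRUE)
  data.frame(
    cancer_group = vapply(parts, `[`, character(1), 1),
    cohort = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE)
}

keys <- make_group_cohort(
  c("Meningioma", "Diffuse midline glioma"), c("CBTN", "PNOC"))
key_df <- split_group_cohort(keys)
```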

results/cancer_group_individual_cohort_mean_tpm.tsv.gz
results/cancer_group_individual_cohort_standard_deviation_tpm.tsv.gz
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

d3b-center/ticket-tracker-OPC#51 Calculate TPM summary statistics within each cancer group and cohort

Add rna-seq-expression-summary-stats module in the analysis module table. d3b-center/ticket-tracker-OPC#51

taylordm · 2021-06-22T21:06:37Z

We will need the following for integration with OT platform:

Please use Ensembl ENSG IDs as an option for row names. OT uses Ensembl and Uniprot.
Please provide lists of sample IDs used in each cancer summary statistic so someone can ask which samples (provenance) were used to calculate the statistic.
Please provide the number of tumor samples used in each summarized group; a separate list/table/sheet is fine.

@jharenza

Add sample metadata table to list the number of samples and sample IDs in each cancer group and cohort. Add gene Ensembl IDs as a column in the summary statistics tables. If one gene symbol matches to multiple Ensembl IDs, output a comma separated list of Ensembl IDs. Combine CBTN and PNOC into one cohort, PBTA, for this analysis, as suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)

logstar · 2021-06-23T00:57:06Z

Updates to this PR:

Added sample metadata table to list the number of samples and sample IDs in each cancer group and cohort.

results/cancer_group_all_cohort_sample_metadata.tsv and results/cancer_group_individual_cohort_sample_metadata.tsv are sample metadata tables. The columns are 1) cancer_group/cancer_group_cohort, 2) the number of samples in the cancer_group/cancer_group_cohort, and 3) the comma separated list of Kids_First_Biospecimen_IDs of the samples in the cancer_group/cancer_group_cohort.
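A table with this layout could be assembled along the following lines. This is only a sketch: the biospecimen IDs are made up, and the output column names `n_samples` and `Kids_First_Biospecimen_IDs` are assumptions, not necessarily the ones in the result files.

```r
# Made-up sample annotations; the real ones come from the histology file.
samples <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  cancer_group = c("Meningioma", "Meningioma", "Neuroblastoma", "Meningioma"),
  stringsAsFactors = FALSE)

# One character vector of sample IDs per cancer_group.
ids_by_group <- split(samples$Kids_First_Biospecimen_ID, samples$cancer_group)

# Columns: group, number of samples, comma separated sample IDs.
sample_meta <- data.frame(
  cancer_group = names(ids_by_group),
  n_samples = unname(lengths(ids_by_group)),
  Kids_First_Biospecimen_IDs = unname(
    vapply(ids_by_group, paste, character(1), collapse = ",")),
  stringsAsFactors = FALSE)
```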

Added gene Ensembl IDs as a column in the summary statistics tables.

The first two columns of the TPM mean/SD/z-score/quantile tables are gene symbols and Ensembl ENSG IDs. If one gene symbol matches to multiple Ensembl IDs, the value of the Ensembl ID column is a comma separated list of all Ensembl IDs, e.g. ENSG00000206952.3,ENSG00000281910.1. In ens_symbol.tsv, SNORA50A is mapped to both ENSG00000206952.3 and ENSG00000281910.1.

The ens_symbol.tsv table, shared by @kgaonkar6, lists the mapping between gene symbols and Ensembl IDs. The Ensembl IDs are all unique, but certain gene symbols are mapped to multiple Ensembl IDs.
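The collapsing of multiple Ensembl IDs into one comma separated value can be sketched as below. The column names `ensg_id` and `gene_symbol` are assumptions about ens_symbol.tsv, and the `GENE_X` row is a made-up one-to-one mapping added for contrast.

```r
# Small stand-in for ens_symbol.tsv (column names assumed).
ens_symbol <- data.frame(
  ensg_id = c("ENSG00000206952.3", "ENSG00000281910.1", "ENSG00000000001.1"),
  gene_symbol = c("SNORA50A", "SNORA50A", "GENE_X"),
  stringsAsFactors = FALSE)

# For each gene symbol, paste all of its Ensembl IDs into one string.
ensg_by_symbol <- tapply(
  ens_symbol$ensg_id, ens_symbol$gene_symbol, paste, collapse = ",")

snora50a_ids <- ensg_by_symbol[["SNORA50A"]]
```

Here `snora50a_ids` is `"ENSG00000206952.3,ENSG00000281910.1"`, matching the example above.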

Combined CBTN and PNOC into one cohort, PBTA, for this analysis, as suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)

@jharenza

Denote the (n_genes, n_cancer_groups/n_cancer_group_cohorts) mean_TPM_vector combined matrix as mean_TPM_matrix. Generate z-scores across all cancer_groups/cancer_group_cohorts as z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix). Call these z-scores as gene_wise_mean_tpm_z_scores in the filenames. This is suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)

Gene-wise mean TPM z-score tables:

- results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
- results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv

logstar · 2021-06-23T02:23:36Z

Updates to this PR:

Generate z-scores across all cancer_groups/cancer_group_cohorts

Denote the (n_genes, n_cancer_groups/n_cancer_group_cohorts) mean_TPM_vector combined matrix as mean_TPM_matrix.

Generate z-scores across all cancer_groups/cancer_group_cohorts as z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix), where rowSD denotes the per-row standard deviation. These z-scores are labeled gene_wise_mean_tpm_z_scores in the filenames. This was suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)

Gene-wise mean TPM z-score tables:

results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv
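For reference, the gene-wise z-score formula can be tried on a toy matrix as follows. The matrix values and names are made up; the two `sweep` calls mirror the `row_wise_zscores` helper in the PR description. Note that a gene with identical mean TPM in every group has a row SD of 0, so its z-scores come out as NaN.

```r
# Toy mean_TPM_matrix: 2 genes (rows) x 3 cancer groups (columns).
mean_TPM_matrix <- matrix(
  c( 1,  2,  3,
    10, 10, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("groupX", "groupY", "groupZ")))

row_means <- rowMeans(mean_TPM_matrix)
row_sds <- apply(mean_TPM_matrix, 1, sd)

# Subtract each gene's mean, then divide by each gene's standard deviation.
z_score_matrix <- sweep(mean_TPM_matrix, 1, row_means, FUN = "-")
z_score_matrix <- sweep(z_score_matrix, 1, row_sds, FUN = "/")
```

Each row of `z_score_matrix` then has mean 0 and standard deviation 1, so a value tells how a cancer group compares with the other groups for that gene.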

In the filenames of mean TPM z-score and quantile tables, move "mean_tpm" before "gene_wise"/"cancer_group_wise".

Renamed:

results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_gene_wise_z_scores.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_gene_wise_z_scores.tsv

logstar · 2021-06-23T12:53:31Z

Updates to this PR:

Renamed mean TPM z-score and quantile tables

In the filenames of mean TPM z-score and quantile tables, moved mean_tpm before gene_wise/cancer_group_wise.

Renamed:

results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_gene_wise_z_scores.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_gene_wise_z_scores.tsv
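Downstream scripts that still refer to the old filenames could map them to the new ones with a single substitution. This is just an illustration of the renaming rule, not code from the PR; `rename_stat_file` is a hypothetical helper.

```r
# Hypothetical helper: move "mean_tpm" in front of "gene_wise" or
# "cancer_group_wise" in a result filename.
rename_stat_file <- function(path) {
  sub("(cancer_group_wise|gene_wise)_mean_tpm_", "mean_tpm_\\1_", path)
}

old_paths <- c(
  "results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv",
  "results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv")
new_paths <- rename_stat_file(old_paths)
```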

jharenza · 2021-06-23T13:03:05Z

per our 8 am meeting, will also want a cancer_group level analysis here, but it will also require subsetting the samples per the independent-samples module which is being updated in #26

logstar · 2021-06-23T16:27:58Z

> per our 8 am meeting, will also want a cancer_group level analysis here, but it will also require subsetting the samples per the independent-samples module which is being updated in #26

Hi @jharenza . Thank you for the update. I will use independent-specimens.rnaseq.primary-plus-polya.tsv and independent-specimens.rnaseq.primary-plus-stranded.tsv in #26 to subset the samples.

Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus-polya.tsv` and `independent-specimens.rnaseq.primary-plus-stranded.tsv` in the results of the `independent-samples` analysis module. Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and ranks.

logstar · 2021-06-23T18:14:04Z

Updates to this PR:

Subset independent RNA-seq samples for computation

Select independent RNA-seq samples using independent-specimens.rnaseq.primary-plus-polya.tsv and independent-specimens.rnaseq.primary-plus-stranded.tsv in the results of the independent-samples analysis module.

Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.

Note: the independent RNA-seq sample list may be changed to combine poly-A and stranded lists into one, as suggested by @jharenza at #26 (comment). I will update the results when the updated independent RNA-seq sample list is available.
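The subsetting step amounts to keeping only the TPM columns whose sample IDs appear in the independent specimen list. A minimal sketch with made-up IDs is shown below; in the module the IDs would be read from the independent-samples result file, and the exact column holding them is an assumption here.

```r
# Made-up independent sample IDs standing in for the IDs listed in the
# independent-specimens file.
independent_ids <- c("BS_1", "BS_3")

# Toy TPM matrix: genes in rows, samples in columns.
tpm <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("BS_1", "BS_2", "BS_3")))

# Keep only independent samples; drop = FALSE keeps the matrix shape even
# if a single sample remains.
keep <- colnames(tpm) %in% independent_ids
tpm_independent <- tpm[, keep, drop = FALSE]
```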

Select independent RNA-seq samples using `independent-specimens.rnaseq.primary-plus.tsv` in the results of the `independent-samples` analysis module. `independent-specimens.rnaseq.primary-plus.tsv` has both poly-A and stranded samples.

logstar · 2021-06-23T19:22:56Z

Updates to this PR:

Change RNA-seq independent sample list

Select independent RNA-seq samples using independent-specimens.rnaseq.primary-plus.tsv in the results of the independent-samples analysis module.

Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.

Add descriptions about Ensembl IDs.

Need to use analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus.tsv

Use `independent-specimens.rnaseq.primary-plus.tsv` directly from the results of the `independent-samples` analysis module. `independent-specimens.rnaseq.primary-plus.tsv` is available at the `dev` branch after merging <d3b-center#26>. TPM z-score/mean/SD/quantile results are not changed.

logstar · 2021-06-23T20:45:19Z

Updates to this PR:

Use independent-specimens.rnaseq.primary-plus.tsv directly from the results of the independent-samples analysis module. independent-specimens.rnaseq.primary-plus.tsv is available at the dev branch after merging updated independent samples module #26. TPM z-score/mean/SD/quantile results are not changed.

Revised comments and refactored some procedures to improve the readability of the code in analyses/rna-seq-expression-summary-stats/01-tpm-summary-stats.R.

jharenza · 2021-06-24T23:34:21Z

per slack conversation, @taylordm requests that every line be one gene in one disease with columns for each metric: TPM, N, z-score, quantile (eg ~50K genes x n cancer groups)

@jharenza

Generate long-format tables with each row as a JSON record, as suggested by @jharenza and @taylordm at d3b-center#27 (comment)

Two long tables are generated, each has gene_wise_zscore or group_wise_zscore respectively.

- results/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz
- results/long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz

logstar · 2021-06-25T19:19:04Z

Update to this PR:

Output long tables for JSON conversion

Generate long summary statistic tables for converting to JSON format, with each row as a tab-delimited record of the following columns, as suggested by @jharenza and @taylordm at #27 (comment).

gene_symbol
gene_id
cancer_group
cohort
tpm_mean
tpm_sd
tpm_mean_cancer_group_wise_zscore/tpm_mean_gene_wise_zscore
tpm_mean_cancer_group_wise_quantiles
n_samples

The following long tables are generated using the wide tables.

long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz
long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz

If the record is generated using all cohorts, the cohort column takes the value of AllCohorts.

Different from the wide tables, the gene_id column does not contain comma-separated ENSG IDs. If one gene symbol matches to multiple Ensembl IDs, each Ensembl gene ID will become one row in the long table. For example:

Wide table:

# cancer_group_individual_cohort_mean_tpm.tsv
gene_symbol    gene_id    Adamantimomatous craniopharyngioma___PBTA    Atypical meningioma___PBTA    Atypical Teratoid Rhabdoid Tumor___PBTA
CDR1    ENSG00000184258.6,ENSG00000281508.1    111.1765    31.932000000000002    332.2896153846154

Long table:

# long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz
CDR1    ENSG00000184258.6    Adamantimomatous craniopharyngioma    PBTA    111.1765    181.70448082054318    0.11408800038186823    Highest expressed 25%    20
CDR1    ENSG00000281508.1    Adamantimomatous craniopharyngioma    PBTA    111.1765    181.70448082054318    0.11408800038186823    Highest expressed 25%    20
CDR1    ENSG00000184258.6    Atypical meningioma    PBTA    31.932000000000002    50.10366423725914    0.016467672549703015    Highest expressed 25%    5
CDR1    ENSG00000281508.1    Atypical meningioma    PBTA    31.932000000000002    50.10366423725914    0.016467672549703015    Highest expressed 25%    5
CDR1    ENSG00000184258.6    Atypical Teratoid Rhabdoid Tumor    PBTA    332.2896153846154    1075.378786603049    0.37106139758788453    Highest expressed 25%    26
CDR1    ENSG00000281508.1    Atypical Teratoid Rhabdoid Tumor    PBTA    332.2896153846154    1075.378786603049    0.37106139758788453    Highest expressed 25%    26
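The wide-to-long conversion, including the one-row-per-Ensembl-ID expansion and the AllCohorts fallback, can be sketched in base R as follows. This is a simplified illustration that only carries the tpm_mean column and uses two of the columns from the wide example above; it is not the module's actual implementation.

```r
# Two value columns from the wide example above.
wide <- data.frame(
  gene_symbol = "CDR1",
  gene_id = "ENSG00000184258.6,ENSG00000281508.1",
  `Adamantimomatous craniopharyngioma___PBTA` = 111.1765,
  `Atypical meningioma___PBTA` = 31.932,
  check.names = FALSE, stringsAsFactors = FALSE)

value_cols <- setdiff(colnames(wide), c("gene_symbol", "gene_id"))

# One block of rows per cancer_group_cohort column.
long <- do.call(rbind, lapply(value_cols, function(col) {
  key <- strsplit(col, "___", fixed = TRUE)[[1]]
  data.frame(
    gene_symbol = wide$gene_symbol,
    gene_id = wide$gene_id,
    cancer_group = key[1],
    # Columns without a cohort part come from the all-cohort tables.
    cohort = if (length(key) == 2) key[2] else "AllCohorts",
    tpm_mean = wide[[col]],
    stringsAsFactors = FALSE)
}))

# Expand comma separated Ensembl IDs into one row per ID.
ids <- strsplit(long$gene_id, ",", fixed = TRUE)
long <- long[rep(seq_len(nrow(long)), lengths(ids)), ]
long$gene_id <- unlist(ids)
rownames(long) <- NULL
```

The resulting `long` data frame has four rows, two per cancer group, in the same order as the long table excerpt above.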

logstar · 2021-06-28T19:58:57Z

Hi @jharenza and @taylordm. I am planning to re-generate the tables using the updated independent sample list, which is available in PR #30 that was merged an hour ago.

I wonder which independent sample list in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/independent-samples/results I should use to re-generate the tables. I used the primary-plus independent sample list before, which was the only one available. In the updated independent sample lists, RNA-seq now has primary, primary-plus, and relapse lists.

jharenza · 2021-06-28T20:43:15Z

> I wonder which independent sample list in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/independent-samples/results I should use to re-generate the tables. I used the primary-plus independent sample list before, which was the only one available. In the updated independent sample lists, RNA-seq now has primary, primary-plus, and relapse lists.

you can just use primary for now, thanks!

jharenza

Just a few minor points, but I think we can merge this soon. We can also probably merge without d3b-center/ticket-tracker-OPC#89, and submit an updated analysis ticket once that is in.

jharenza · 2021-07-06T23:29:42Z

analyses/rna-seq-expression-summary-stats/run-rna-seq-expression-summary-stats.sh

+if [[ -d results ]]; then
+    rm -r results
+fi


Suggested change

if [[ -d results ]]; then
    rm -r results
fi

I don't think this is necessary, and it could be unwanted, for instance if we add another step or another analysis that adds a new type of file rather than replacing all files. With new runs, replacement should be sufficient, so I think you can remove this.

Thank you for the suggestion. I agree. I will remove this part.

Indeed, if a user manually runs a part of this script from a wrong working directory, it will delete the results directory of their current working directory, which might be catastrophic.

I will also remove the same part in the snv-frequencies module.

jharenza · 2021-07-06T23:34:44Z