Calculate TPM mean/z-score/SD/quantile summary statistics within each cancer group and cohort #27

Merged on Jul 14, 2021 (38 commits)

@logstar logstar commented Jun 22, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Within each cancer group and cohort, calculate TPM means, standard deviations, z-scores, and ranks.

What was your approach?

For each cancer_group, select one of the following two sets of samples:

  • Samples from all cohorts, e.g. CBTN, GMKF, and PNOC.
  • Samples from each individual cohort.

If >= 5 samples are selected, generate the following summary statistics:

  • TPM means of each gene across all selected samples, and denote this vector as mean_TPM_vector.
  • TPM standard deviations of each gene across all selected samples.
  • z-scores of each gene across all genes, computed as z_score_vector = (mean_TPM_vector - mean(mean_TPM_vector)) / sd(mean_TPM_vector).
  • Ranks of each gene's mean TPM across all genes, each described by one of the following four values: Highest expressed 25%, Expression between upper quartile and median, Expression between median and lower quartile, and Lowest expressed 25%. If multiple genes have the same mean TPM value, they all receive the lowest rank among the ties, in order to be conservative in describing their expression levels.

Combine each type of the summary statistics vectors into a table, with rows as genes, and columns as cancer_group_cohort.
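
As a rough illustration of the per-group statistics listed above, the following base R sketch computes them for one toy group of samples. The helper name `summarise_group`, the `min_samples` argument, and the toy TPM values are assumptions made for this example; the PR's actual implementation is `get_expression_summary_stats`.

```r
# Toy TPM matrix: 4 genes x 6 samples (values are made up for illustration).
tpm <- matrix(
  c(1, 2, 3, 4, 5, 6,
    10, 10, 10, 10, 10, 10,
    0, 0, 1, 1, 2, 2,
    100, 50, 75, 25, 60, 80),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:6))
)

# Hypothetical helper: summary statistics for one group of selected samples.
summarise_group <- function(tpm_mat, min_samples = 5) {
  if (ncol(tpm_mat) < min_samples) {
    # Fewer than min_samples samples: no statistics are generated.
    return(NULL)
  }
  # Mean and standard deviation of each gene across the selected samples.
  mean_tpm_vector <- rowMeans(tpm_mat)
  sd_tpm_vector <- apply(tpm_mat, 1, sd)
  # z-score of each gene's mean TPM across all genes in this group.
  z_score_vector <-
    (mean_tpm_vector - mean(mean_tpm_vector)) / sd(mean_tpm_vector)
  data.frame(mean = mean_tpm_vector, sd = sd_tpm_vector,
             z_score = z_score_vector)
}

stats <- summarise_group(tpm)
print(stats)
```

A group with fewer than five samples, e.g. `summarise_group(tpm[, 1:4])`, returns `NULL` in this sketch.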

Update 22-Jun-21 (DT):

Please use Ensembl ENSG IDs in row names.
Please provide lists of sample IDs used in each cancer summary statistic.
Please provide the number of tumor samples used in each summarized group, within another list is fine.

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#51

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The function to generate summary statistics tables for each group of samples:

Update 22-Jun-21 (YZ): The function is updated to add z-scores across all cancer groups.

# Generate means, standard deviations, z-scores, and ranks within each group.
#
# Args:
# - exp_df: (n_genes, n_samples) expression level numeric data frame
# - groups: character vector of length n_samples, which is used for grouping
#   the samples.
#
# Returns a list of (n_genes, n_groups) summary statistics tables.
get_expression_summary_stats <- function(exp_df, groups) {
  # unique groups to check that the computing steps do not modify the groups.
  check_groups <- sort(unique(groups))
  # gene symbols to check that the computing steps do not modify
  # the rownames of the exp_df.
  check_gids <- rownames(exp_df)


  # set check.names = FALSE and check.rows = FALSE to prevent R from
  # changing the rownames or colnames implicitly
  c_exp_df <- data.frame(t(exp_df), check.names = FALSE, check.rows = FALSE)
  c_exp_df$sample_group <- groups


  res_list <- list()




  # computed group means
  print('Compute means...')
  cg_mean_exp_df <- c_exp_df %>%
    group_by(sample_group) %>%
    summarise_all(mean) %>%
    column_to_rownames('sample_group')


  cg_mean_exp_out_df <- data.frame(
    t(cg_mean_exp_df), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_exp_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_exp_out_df)), check_groups))
  res_list$mean_df <- cg_mean_exp_out_df




  # compute group standard deviations
  print('Compute standard deviations...')
  cg_sd_exp_df <- c_exp_df %>%
    group_by(sample_group) %>%
    summarise_all(sd) %>%
    column_to_rownames('sample_group')


  cg_sd_exp_out_df <- data.frame(
    t(cg_sd_exp_df), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_sd_exp_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_sd_exp_out_df)), check_groups))
  res_list$sd_df <- cg_sd_exp_out_df


  # input is a numeric matrix
  # procedure adapted from @kgaonkar6's code at
  # <https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/
  #     0a85a711709b5adc1e56a26a397d238cb3ebbb58/analyses/
  #     fusion_filtering/03-Calc-zscore-annotate.R#L115-L121>
  row_wise_zscores <- function(num_mat) {
    row_means <- rowMeans(num_mat)
    row_sds <- apply(num_mat, 1, sd)
    # row-wise z-score
    row_wise_zscore_mat <- sweep(num_mat, 1, row_means, FUN = '-')
    row_wise_zscore_mat <- sweep(row_wise_zscore_mat, 1, row_sds, FUN = '/')


    return(row_wise_zscore_mat)
  }


  # compute group-wise z-scores (across all genes within each group)
  print('Compute group-wise z-scores...')
  # cg_mean_exp_df is (n_groups, n_genes)
  cg_mean_exp_mat <- as.matrix(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(colnames(cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(rownames(cg_mean_exp_mat)), check_groups))
  # group wise z-scores
  cg_mean_cgw_zscore_mat <- row_wise_zscores(cg_mean_exp_mat)


  cg_mean_cgw_zscore_df <- data.frame(
    t(cg_mean_cgw_zscore_mat), check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_cgw_zscore_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_cgw_zscore_df)), check_groups))
  res_list$group_wise_zscore_df <- cg_mean_cgw_zscore_df




  # compute gene-wise z-scores (across all groups within each gene)
  print('Compute gene-wise z-scores...')
  # cg_mean_exp_df is (n_groups, n_genes)
  # so gr_cg_mean_exp_mat is (n_genes, n_groups)
  # gr = gene rows
  gr_cg_mean_exp_mat <- t(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(gr_cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(gr_cg_mean_exp_mat)), check_groups))
  cg_mean_gene_wise_zscore_mat <- row_wise_zscores(gr_cg_mean_exp_mat)


  cg_mean_gene_wise_zscore_df <- data.frame(
    cg_mean_gene_wise_zscore_mat, check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_gene_wise_zscore_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(
    sort(colnames(cg_mean_gene_wise_zscore_df)), check_groups))
  res_list$gene_wise_zscore_df <- cg_mean_gene_wise_zscore_df




  # compute ranks and quartile descriptions
  print('Compute quantiles...')
  # cg_mean_exp_df is (n_groups, n_genes)
  cg_mean_exp_mat <- as.matrix(cg_mean_exp_df)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(colnames(cg_mean_exp_mat), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(rownames(cg_mean_exp_mat)), check_groups))
  # group wise ranks
  cg_mean_cgw_rank_mat <- apply(cg_mean_exp_mat, 1,
                                function(x) rank(x, ties.method='min'))
  # describes which quartile the genes are in
  cg_mean_cgw_d_mat <- cg_mean_cgw_rank_mat


  p0p25_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.75)
  cg_mean_cgw_d_mat[p0p25_idc] <- 'Highest expressed 25%'


  p25p50_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.5) &
    cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.75)
  # paste0 to make the line shorter
  cg_mean_cgw_d_mat[p25p50_idc] <- paste0(
    'Expression between upper quartile and median')


  p50p75_idc <- cg_mean_cgw_rank_mat > (nrow(cg_mean_cgw_rank_mat) * 0.25) &
    cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.5)
  cg_mean_cgw_d_mat[p50p75_idc] <- paste0(
    'Expression between median and lower quartile')


  p75p100_idc <- cg_mean_cgw_rank_mat <= (nrow(cg_mean_cgw_rank_mat) * 0.25)
  cg_mean_cgw_d_mat[p75p100_idc] <- 'Lowest expressed 25%'
  # assert all entries have description values
  stopifnot(identical(
    sort(unique(as.vector(cg_mean_cgw_d_mat))),
    c("Expression between median and lower quartile",
      "Expression between upper quartile and median", 
      "Highest expressed 25%", "Lowest expressed 25%")
  ))


  cg_mean_cgw_quant_out_df <- data.frame(
    cg_mean_cgw_d_mat, check.names = FALSE, check.rows = FALSE)
  # assert gene symbols are the same as input data frame
  stopifnot(identical(rownames(cg_mean_cgw_quant_out_df), check_gids))
  # assert groups are the same as input data frame
  stopifnot(identical(sort(colnames(cg_mean_cgw_quant_out_df)), check_groups))
  res_list$quant_df <- cg_mean_cgw_quant_out_df




  return(res_list)
}

(link to the code)
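
To illustrate how `ties.method = 'min'` affects the quartile descriptions, here is a minimal base R sketch on a single toy vector of mean TPM values. The helper name `label_quartiles` and the toy values are assumptions for this example; the thresholds follow the same comparisons as the function above.

```r
# Hypothetical helper: describe each gene's mean TPM by the quartile of its
# rank, using the same thresholds as get_expression_summary_stats.
label_quartiles <- function(mean_tpm) {
  n <- length(mean_tpm)
  # Tied values all receive the lowest rank among the ties.
  r <- rank(mean_tpm, ties.method = "min")
  labels <- character(n)
  labels[r <= n * 0.25] <- "Lowest expressed 25%"
  labels[r > n * 0.25 & r <= n * 0.5] <-
    "Expression between median and lower quartile"
  labels[r > n * 0.5 & r <= n * 0.75] <-
    "Expression between upper quartile and median"
  labels[r > n * 0.75] <- "Highest expressed 25%"
  names(labels) <- names(mean_tpm)
  labels
}

# Toy mean TPM values with ties at 0 and at 5.
mean_tpm <- c(g1 = 0, g2 = 0, g3 = 0, g4 = 2, g5 = 5, g6 = 5, g7 = 9, g8 = 20)
ranks <- rank(mean_tpm, ties.method = "min")
quartiles <- label_quartiles(mean_tpm)
print(data.frame(mean_tpm, rank = ranks, quartile = quartiles))
```

In this toy vector the three genes tied at 0 all get rank 1, so all three are described as Lowest expressed 25%, even though three of eight genes is more than a quarter.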

Is there anything that you want to discuss further?

Do we need more quantiles? We currently have the following four values: Highest expressed 25%, Expression between upper quartile and median, Expression between median and lower quartile, and Lowest expressed 25%.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Tables.

What is your summary of the results?

All cohort summary statistics tables

The following tables are generated using the methods described above. Rows are genes. Columns are cancer_groups, except that the first column is gene symbol.

  • results/cancer_group_all_cohort_mean_tpm.tsv.gz
  • results/cancer_group_all_cohort_standard_deviation_tpm.tsv.gz
  • results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
  • results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz

Individual cohort summary statistics tables

The following tables are generated using the methods described above. Rows are genes. Columns are cancer_group_cohorts, except that the first column is gene symbol. A cancer_group_cohort is a string that concatenates a cancer_group and a cohort by ___, e.g. Meningioma___CBTN, Neuroblastoma___GMKF, and Diffuse midline glioma___PNOC.

  • results/cancer_group_individual_cohort_mean_tpm.tsv.gz
  • results/cancer_group_individual_cohort_standard_deviation_tpm.tsv.gz
  • results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv.gz
  • results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv.gz

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

d3b-center/ticket-tracker-OPC#51

Calculate TPM summary statistics within each cancer group and cohort
Add rna-seq-expression-summary-stats module in the analysis module
table.

d3b-center/ticket-tracker-OPC#51

taylordm commented Jun 22, 2021

We will need the following for integration with OT platform:

Please use Ensembl ENSG IDs as an option for row names. OT uses Ensembl and Uniprot.
Please provide lists of sample IDs used in each cancer summary statistic so someone can ask what samples (provenance) were used to calculate the statistic.
Please provide the number of tumor samples used in each summarized group, within another list/table/sheet is fine.

Add sample metadata table to list the number of samples and sample IDs
in each cancer group and cohort.

Add gene Ensembl IDs as a column in the summary statistics tables. If
one gene symbol matches to multiple Ensembl IDs, output a comma
separated list of Ensembl IDs.

Combine CBTN and PNOC into one cohort, PBTA, for this analysis,
as suggested by @jharenza at
d3b-center/ticket-tracker-OPC#51 (comment)

logstar commented Jun 23, 2021

Updates to this PR:

  • Added sample metadata table to list the number of samples and sample IDs in each cancer group and cohort.

results/cancer_group_all_cohort_sample_metadata.tsv and results/cancer_group_individual_cohort_sample_metadata.tsv are sample metadata tables. The columns are 1) cancer_group/cancer_group_cohort, 2) the number of samples in the cancer_group/cancer_group_cohort, and 3) the comma separated list of Kids_First_Biospecimen_IDs of the samples in the cancer_group/cancer_group_cohort.

  • Added gene Ensembl IDs as a column in the summary statistics tables.

The first two columns of the TPM mean/SD/z-score/quantile tables are gene symbols and Ensembl ENSG IDs. If one gene symbol matches to multiple Ensembl IDs, the value of the Ensembl ID column is a comma separated list of all Ensembl IDs, e.g. ENSG00000206952.3,ENSG00000281910.1. In ens_symbol.tsv, SNORA50A is mapped to both ENSG00000206952.3 and ENSG00000281910.1.

The ens_symbol.tsv table, shared by @kgaonkar6, lists the mapping between gene symbols and Ensembl IDs. The Ensembl IDs are all unique, but certain gene symbols are mapped to multiple Ensembl IDs.
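
The collapsing of multiple Ensembl IDs into one comma-separated value could be sketched in base R as follows. The column names `ensg_id` and `gene_symbol`, and the `GENE_EXAMPLE`/`ENSG_EXAMPLE.1` row, are assumptions for illustration and may not match the actual layout of ens_symbol.tsv.

```r
# Toy mapping in the spirit of ens_symbol.tsv (column names are assumed).
ens_symbol <- data.frame(
  ensg_id = c("ENSG00000206952.3", "ENSG00000281910.1", "ENSG_EXAMPLE.1"),
  gene_symbol = c("SNORA50A", "SNORA50A", "GENE_EXAMPLE"),
  stringsAsFactors = FALSE
)

# One row per gene symbol, with all of its Ensembl IDs joined by commas.
collapsed <- aggregate(
  ensg_id ~ gene_symbol, data = ens_symbol,
  FUN = function(ids) paste(sort(unique(ids)), collapse = ",")
)
print(collapsed)
```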

logstar changed the title from "Calculate TPM summary statistics within each cancer group and cohort" to "Calculate TPM mean/z-score/SD/quantile summary statistics within each cancer group and cohort" on Jun 23, 2021
Denote the (n_genes, n_cancer_groups/n_cancer_group_cohorts)
mean_TPM_vector combined matrix as mean_TPM_matrix.

Generate z-scores across all cancer_groups/cancer_group_cohorts as
z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) /
rowSD(mean_TPM_matrix). Name these z-scores
gene_wise_mean_tpm_z_scores in the filenames. This was suggested by
@jharenza at
d3b-center/ticket-tracker-OPC#51 (comment)

Gene-wise mean TPM z-score tables:

- results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
- results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv

logstar commented Jun 23, 2021

Updates to this PR:

  • Generate z-scores across all cancer_groups/cancer_group_cohorts

Denote the (n_genes, n_cancer_groups/n_cancer_group_cohorts) mean_TPM_vector combined matrix as mean_TPM_matrix.

Generate z-scores across all cancer_groups/cancer_group_cohorts as z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix), where rowSD denotes the row-wise standard deviation. These z-scores are named gene_wise_mean_tpm_z_scores in the filenames. This was suggested by @jharenza at d3b-center/ticket-tracker-OPC#51 (comment)

Gene-wise mean TPM z-score tables:

  • results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv
  • results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv
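
The gene-wise z-score formula can be sketched in base R on a toy matrix. The values and group names below are made up; `rowSD` in the formula is taken to mean the row-wise standard deviation, computed here with `apply(..., 1, sd)`.

```r
# Toy mean_TPM_matrix: genes in rows, cancer_groups in columns.
mean_tpm_matrix <- matrix(
  c(1, 2, 3,
    10, 20, 60,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("group1", "group2", "group3"))
)

# Subtract each row's mean, then divide by each row's standard deviation.
row_means <- rowMeans(mean_tpm_matrix)
row_sds <- apply(mean_tpm_matrix, 1, sd)
z_score_matrix <- sweep(mean_tpm_matrix, 1, row_means, FUN = "-")
z_score_matrix <- sweep(z_score_matrix, 1, row_sds, FUN = "/")
print(z_score_matrix)
```

A gene whose mean TPM is identical in every group, like `geneC` here, has a row standard deviation of 0, so its z-scores are `NaN` (0 divided by 0).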

In the filenames of mean TPM z-score and quantile tables, move "mean_tpm"
before "gene_wise"/"cancer_group_wise".

Renamed:

results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_gene_wise_z_scores.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_gene_wise_z_scores.tsv

logstar commented Jun 23, 2021

Updates to this PR:

  • Renamed mean TPM z-score and quantile tables

In the filenames of mean TPM z-score and quantile tables, moved mean_tpm before gene_wise/cancer_group_wise.

Renamed:

  • results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
  • results/cancer_group_all_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
  • results/cancer_group_all_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_all_cohort_mean_tpm_gene_wise_z_scores.tsv
  • results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_quantiles.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_quantiles.tsv
  • results/cancer_group_individual_cohort_cancer_group_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_cancer_group_wise_z_scores.tsv
  • results/cancer_group_individual_cohort_gene_wise_mean_tpm_z_scores.tsv -> results/cancer_group_individual_cohort_mean_tpm_gene_wise_z_scores.tsv

jharenza commented

per our 8 am meeting, will also want a cancer_group level analysis here, but it will also require subsetting the samples per the independent-samples module which is being updated in #26

logstar commented Jun 23, 2021

> per our 8 am meeting, will also want a cancer_group level analysis here, but it will also require subsetting the samples per the independent-samples module which is being updated in #26

Hi @jharenza. Thank you for the update. I will use independent-specimens.rnaseq.primary-plus-polya.tsv and independent-specimens.rnaseq.primary-plus-stranded.tsv in #26 to subset the samples.

Select independent RNA-seq samples using
`independent-specimens.rnaseq.primary-plus-polya.tsv` and
`independent-specimens.rnaseq.primary-plus-stranded.tsv` in the results
of the `independent-samples` analysis module.

Use independent RNA-seq samples to compute TPM means, standard
deviations, z-scores, and ranks.

logstar commented Jun 23, 2021

Updates to this PR:

  • Subset independent RNA-seq samples for computation

Select independent RNA-seq samples using independent-specimens.rnaseq.primary-plus-polya.tsv and independent-specimens.rnaseq.primary-plus-stranded.tsv in the results of the independent-samples analysis module.

Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.

Note: the independent RNA-seq sample list may be changed to combine poly-A and stranded lists into one, as suggested by @jharenza at #26 (comment). I will update the results when the updated independent RNA-seq sample list is available.

Select independent RNA-seq samples using
`independent-specimens.rnaseq.primary-plus.tsv` in the results of the
`independent-samples` analysis module.

`independent-specimens.rnaseq.primary-plus.tsv` has both poly-A and
stranded samples.

logstar commented Jun 23, 2021

Updates to this PR:

  • Change RNA-seq independent sample list

Select independent RNA-seq samples using independent-specimens.rnaseq.primary-plus.tsv in the results of the independent-samples analysis module.

Use independent RNA-seq samples to compute TPM means, standard deviations, z-scores, and quantiles.
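
Subsetting the expression table to the independent samples might look like the following base R sketch. The toy IDs are made up, and the column names `Kids_First_Participant_ID` and `Kids_First_Biospecimen_ID` are assumed for the independent sample list.

```r
# Toy expression table: genes in rows, sample IDs as column names.
exp_df <- data.frame(
  BS_A = c(1, 2), BS_B = c(3, 4), BS_C = c(5, 6),
  row.names = c("gene1", "gene2")
)

# Toy independent sample list (column names are assumed).
independent_df <- data.frame(
  Kids_First_Participant_ID = c("PT_1", "PT_2"),
  Kids_First_Biospecimen_ID = c("BS_A", "BS_C"),
  stringsAsFactors = FALSE
)

# Keep only the columns whose sample IDs are in the independent list.
keep <- colnames(exp_df) %in% independent_df$Kids_First_Biospecimen_ID
independent_exp_df <- exp_df[, keep, drop = FALSE]
print(independent_exp_df)
```

Using `drop = FALSE` keeps the result a data frame even if only one sample is retained.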

Add descriptions about Ensembl IDs.
Need to use
analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus.tsv
Use `independent-specimens.rnaseq.primary-plus.tsv` directly from the
results of the `independent-samples` analysis module.
`independent-specimens.rnaseq.primary-plus.tsv` is available at the
`dev` branch after merging
<d3b-center#26>.

TPM z-score/mean/SD/quantile results are not changed.

logstar commented Jun 23, 2021

Updates to this PR:

  • Use independent-specimens.rnaseq.primary-plus.tsv directly from the results of the independent-samples analysis module. independent-specimens.rnaseq.primary-plus.tsv is available at the dev branch after merging updated independent samples module #26. TPM z-score/mean/SD/quantile results are not changed.

Revised comments and refactored some procedures to improve the
readability of the code in
analyses/rna-seq-expression-summary-stats/01-tpm-summary-stats.R.

jharenza commented

per slack conversation, @taylordm requests that every line be one gene in one disease with columns for each metric: TPM, N, z-score, quantile (eg ~50K genes x n cancer groups)

Generate long-format tables with each row as a JSON record, as suggested
by @jharenza and @taylordm at
d3b-center#27 (comment)

Two long tables are generated, each has gene_wise_zscore or
group_wise_zscore respectively.

- results/long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz
- results/long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz

logstar commented Jun 25, 2021

Update to this PR:

  • Output long tables for JSON conversion

Generate long summary statistic tables for converting to JSON format, with each row as a tab-delimited record of the following columns, as suggested by @jharenza and @taylordm at #27 (comment).

  • gene_symbol
  • gene_id
  • cancer_group
  • cohort
  • tpm_mean
  • tpm_sd
  • tpm_mean_cancer_group_wise_zscore/tpm_mean_gene_wise_zscore
  • tpm_mean_cancer_group_wise_quantiles
  • n_samples

The following long tables are generated using the wide tables.

  • long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz
  • long_n_tpm_mean_sd_quantile_gene_wise_zscore.tsv.gz

If the record is generated using all cohorts, the cohort column takes the value of AllCohorts.

Unlike the wide tables, the gene_id column does not contain comma-separated ENSG IDs. If one gene symbol matches multiple Ensembl IDs, each Ensembl gene ID becomes one row in the long table. For example:

  • Wide table:
# cancer_group_individual_cohort_mean_tpm.tsv
gene_symbol    gene_id    Adamantimomatous craniopharyngioma___PBTA    Atypical meningioma___PBTA    Atypical Teratoid Rhabdoid Tumor___PBTA
CDR1    ENSG00000184258.6,ENSG00000281508.1    111.1765    31.932000000000002    332.2896153846154
  • Long table:
# long_n_tpm_mean_sd_quantile_group_wise_zscore.tsv.gz
CDR1    ENSG00000184258.6    Adamantimomatous craniopharyngioma    PBTA    111.1765    181.70448082054318    0.11408800038186823    Highest expressed 25%    20
CDR1    ENSG00000281508.1    Adamantimomatous craniopharyngioma    PBTA    111.1765    181.70448082054318    0.11408800038186823    Highest expressed 25%    20
CDR1    ENSG00000184258.6    Atypical meningioma    PBTA    31.932000000000002    50.10366423725914    0.016467672549703015    Highest expressed 25%    5
CDR1    ENSG00000281508.1    Atypical meningioma    PBTA    31.932000000000002    50.10366423725914    0.016467672549703015    Highest expressed 25%    5
CDR1    ENSG00000184258.6    Atypical Teratoid Rhabdoid Tumor    PBTA    332.2896153846154    1075.378786603049    0.37106139758788453    Highest expressed 25%    26
CDR1    ENSG00000281508.1    Atypical Teratoid Rhabdoid Tumor    PBTA    332.2896153846154    1075.378786603049    0.37106139758788453    Highest expressed 25%    26
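
The wide-to-long conversion in the example above, including the splitting of comma-separated gene IDs into separate rows, can be sketched in base R. Only the mean TPM column is carried over here, the TPM values are shortened toy numbers, and the `GENE_EXAMPLE` row is hypothetical; this is not the PR's actual conversion code.

```r
# Toy wide table: one row per gene symbol, one column per cancer_group_cohort.
wide <- data.frame(
  gene_symbol = c("CDR1", "GENE_EXAMPLE"),
  gene_id = c("ENSG00000184258.6,ENSG00000281508.1", "ENSG_EXAMPLE.1"),
  `Atypical meningioma___PBTA` = c(31.932, 1.5),
  `Atypical Teratoid Rhabdoid Tumor___PBTA` = c(332.29, 2.5),
  check.names = FALSE, stringsAsFactors = FALSE
)

group_cols <- setdiff(colnames(wide), c("gene_symbol", "gene_id"))

# One block of rows per cancer_group_cohort column, split on "___".
long <- do.call(rbind, lapply(group_cols, function(gc) {
  parts <- strsplit(gc, "___", fixed = TRUE)[[1]]
  data.frame(
    gene_symbol = wide$gene_symbol, gene_id = wide$gene_id,
    cancer_group = parts[1], cohort = parts[2],
    tpm_mean = wide[[gc]], stringsAsFactors = FALSE
  )
}))

# Repeat each row once per Ensembl ID, then assign one ID to each copy.
id_list <- strsplit(long$gene_id, ",", fixed = TRUE)
long <- long[rep(seq_len(nrow(long)), lengths(id_list)), ]
long$gene_id <- unlist(id_list)
rownames(long) <- NULL
print(long)
```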

logstar commented Jun 28, 2021

Hi @jharenza and @taylordm. I am planning to re-generate the tables using the updated independent sample list from PR #30, which was merged an hour ago.

I wonder which independent sample list in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/independent-samples/results I should use for re-generating the tables. I previously used the primary-plus independent sample list, which was the only one available. In the updated independent sample lists, RNA-seq now has primary, primary-plus, and relapse lists.

jharenza commented

> I wonder which independent sample list in https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/independent-samples/results I should use for re-generating the tables. I previously used the primary-plus independent sample list, which was the only one available. In the updated independent sample lists, RNA-seq now has primary, primary-plus, and relapse lists.

you can just use primary for now, thanks!

@jharenza jharenza left a comment

Just a few minor points, but I think we can merge this soon. We can also probably merge without d3b-center/ticket-tracker-OPC#89, and submit an updated analysis ticket once that is in.

Comment on lines 19 to 21
if [[ -d results ]]; then
  rm -r results
fi
Member

I don't think this is necessary, and could potentially be unwanted, for instance, if we add another step or another analysis in which we add another type of file, rather than replace all files. With new runs, replacement should be sufficient so I think you can remove this.

@logstar logstar Jul 7, 2021

Thank you for the suggestion. I agree. I will remove this part.

Indeed, if a user manually runs a part of this script from a wrong working directory, it will delete the results directory of their current working directory, which might be catastrophic.

I will also remove the same part in the snv-frequencies module.


### Methods

Select independent RNA-seq samples using `independent-specimens.rnaseq.primary.tsv` in the results of the `independent-samples` analysis module.
Member

Note to update this after d3b-center/ticket-tracker-OPC#89

@logstar logstar Jul 7, 2021

Thanks. Will update when the new list is available.

analyses/rna-seq-expression-summary-stats/README.md (outdated, resolved)

Generate z-scores across all `cancer_groups`/`cancer_group_cohorts` as `z_score_matrix = (mean_TPM_matrix - rowMeans(mean_TPM_matrix)) / rowSD(mean_TPM_matrix)`. These z-scores are named `mean_tpm_gene_wise_z_scores` in the filenames.

The aforementioned (`n_genes`, `n_cancer_groups`/`n_cancer_group_cohorts`) tables are referred to as wide tables.
Member

It is possible we can remove the wide version of the tables in the future, but we can leave in for now and make an update later if the long tables look to be what is desired.


#### All cohort summary statistics tables

The following wide tables are generated using the methods described above. Rows are genes. Columns are `cancer_group`s, except that the first two columns are gene symbol and gene Ensembl ID. If one gene symbol matches to multiple Ensembl IDs, the value of the Ensembl ID column is a comma-separated list of all Ensembl IDs, e.g. `ENSG00000206952.3,ENSG00000281910.1`. In `input/ens_symbol.tsv`, `SNORA50A` is mapped to both `ENSG00000206952.3` and `ENSG00000281910.1`.
Member

I'm sure you will be updating this

Author

Thanks for the note. Will update in the next patch.

logstar and others added 3 commits July 6, 2021 20:11
Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
Removed `rm -r results` part in run-rna-seq-expression-summary-stats.sh,
as suggested by @jharenza at
d3b-center#27 (comment) .
In all result tables, replace AllCohorts with PedOT, as suggested by
@jharenza at
d3b-center#27 (comment)

logstar commented Jul 7, 2021

Thank you for the review, @jharenza.

Updates to this PR:

  • Replaced AllCohorts with PedOT in all result tables, by modifying the cohort identifier for all-cohorts analysis in 01-tpm-summary-stats.R.
  • Removed the rm -r results part in run-rna-seq-expression-summary-stats.sh.
  • Updated README.md for recent patches described above, like JSON, EFO, MONDO, etc.

Update README.md for recent patches described in
d3b-center#27 .
Add a note that the `NA`/`NaN`s in result tables are represented with
blank string `''`s.
logstar added a commit to logstar/OpenPedCan-analysis that referenced this pull request Jul 7, 2021
Removed the `rm -r results` part in run-rna-seq-expression-summary-stats.sh,
as suggested by @jharenza at
d3b-center#27 (comment)
@logstar logstar mentioned this pull request Jul 7, 2021

@jharenza jharenza left a comment

sorry for the additional update ask - can you change PedOT to all_cohorts per this slack thread?

logstar commented Jul 7, 2021

> sorry for the additional update ask - can you change PedOT to all_cohorts per this slack thread?

No problem at all. Sorry that I just saw your comment here. I will update accordingly, after getting the current commit done for the gene-level SNV frequency tables.

In all result tables, change PedOT to all_cohorts, as suggested by
@jharenza at
d3b-center#27 (review)

logstar commented Jul 7, 2021

Update to this PR:

  • Changed PedOT to all_cohorts in result tables.
  • Updated README.md accordingly.

Convert results/*json files to JSON Lines format. See
run-rna-seq-expression-summary-stats.sh for more details.

logstar commented Jul 8, 2021

Update to this PR:

Converted JSON files to JSON Lines (JSONL) format with jq. JSONL files:

  • results/long_n_tpm_mean_sd_quantile_group_wise_zscore.jsonl.gz
  • results/long_n_tpm_mean_sd_quantile_gene_wise_zscore.jsonl.gz

If the JSONL files work for OT database and ETL, I will update the Docker file in another PR to add a command to install jq.

Rationale for why jq --compact-output '.[]' table.json > table.jsonl works:

  • jsonlite::write_json writes a tibble or data frame as a JSON array of objects, i.e. [{k11:v11,k12:v12,...}, {k21:v21,k22:v22,...}, ...], which is the input for jq.
  • jq parameters are explained as follows. Reference: https://stedolan.github.io/jq/manual/.
    • --compact-output: output each JSON object on a single line, which conforms to the JSONL format requirements.
    • '.[]': this returns all of the elements of an array.
  • With the input and parameters, jq outputs a file with each line as a JSON object, i.e. {k11:v11,k12:v12,...}, which conforms to the JSONL format requirement that "Each Line is a Valid JSON Value".

Regarding the other two JSONL format requirements:

  • UTF-8 Encoding: I think the jq output conforms to this requirement, as the output tables only have ASCII characters. The following is the output of file -i (found at https://stackoverflow.com/a/805474/4638182):
$ file -i ./*jsonl
./long_n_tpm_mean_sd_quantile_gene_wise_zscore.jsonl:  text/plain; charset=us-ascii
./long_n_tpm_mean_sd_quantile_group_wise_zscore.jsonl: text/plain; charset=us-ascii
  • Line Separator is '\n': The jq output in our docker container uses '\n' as line separator.
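
The three requirements can also be checked programmatically. The sketch below is an assumption about how one might do that in Python and is not part of this module: check_jsonl_bytes is a hypothetical helper, and it is deliberately stricter than the JSONL format on line separators because it rejects any carriage return.

```python
import json


def check_jsonl_bytes(data: bytes) -> int:
    """Check the three JSONL requirements and return the record count.

    Raises UnicodeDecodeError or ValueError when a requirement is violated.
    """
    # Requirement: UTF-8 encoding (pure ASCII is a subset of UTF-8).
    text = data.decode("utf-8")
    # Requirement: the line separator is a bare newline. Rejecting every
    # carriage return is a stricter rule chosen for this sketch.
    if "\r" in text:
        raise ValueError("found a carriage return; expected newline separators")
    lines = text.split("\n")
    # A trailing newline leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # Requirement: each line is a valid JSON value.
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err
    return len(lines)


# Hypothetical two-record input.
sample = b'{"gene":"A","tpm_mean":1.5}\n{"gene":"B","tpm_mean":0.0}\n'
print(check_jsonl_bytes(sample))
```

For the gzipped result files, the bytes could first be read with gzip.open(path, "rb").read() and then passed to the helper.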

Merge the updated independent sample lists.
Use independent-specimens.rnaseq.primary.eachcohort.tsv from the
independent-samples module to select independent samples.

logstar commented Jul 8, 2021

Update to this PR:

Changed the independent sample list to the v6 independent-specimens.rnaseq.primary.eachcohort.tsv, which comes from the recently updated independent-samples module.

The previously used independent sample list was the v5 independent-specimens.rnaseq.primary.tsv.

In v6, the differences between independent-specimens.rnaseq.primary.eachcohort.tsv and independent-specimens.rnaseq.primary.tsv are:

  • Number of samples:
$ wc -l independent-specimens.rnaseq.primary.*tsv
 12109 independent-specimens.rnaseq.primary.eachcohort.tsv
 12093 independent-specimens.rnaseq.primary.tsv
  • Patient and sample IDs:

$ sort independent-specimens.rnaseq.primary.eachcohort.tsv > independent-specimens.rnaseq.primary.eachcohort.tsv.sorted
$ sort independent-specimens.rnaseq.primary.tsv > independent-specimens.rnaseq.primary.tsv.sorted
$ diff independent-specimens.rnaseq.primary.tsv.sorted independent-specimens.rnaseq.primary.eachcohort.tsv.sorted
165c165
< 156   TCGA-06-0156-01A-03R-1849-01
---
> 156   TCGA-06-0156-01A-02R-1849-01
365c365
< 211   TCGA-06-0211-01A-01R-1849-01
---
> 211   TCGA-06-0211-01B-01R-1849-01
449c449
< 2656  TCGA-44-2656-01A-02R-0946-07
---
> 2656  TCGA-44-2656-01B-06R-A277-07
454,455c454,455
< 2665  TCGA-44-2665-01B-06R-A277-07
< 2666  TCGA-44-2666-01B-02R-A277-07
---
> 2665  TCGA-44-2665-01A-01R-0946-07
> 2666  TCGA-44-2666-01A-01R-0946-07
458c458
< 2674  TCGA-A6-2674-01A-02R-0821-07
---
> 2674  TCGA-A6-2674-01B-04R-A277-07
461c461
< 2677  TCGA-A6-2677-01B-02R-A277-07
---
> 2677  TCGA-A6-2677-01A-01R-0821-07
468c468
< 2684  TCGA-A6-2684-01A-01R-1410-07
---
> 2684  TCGA-A6-2684-01A-01R-A278-07
752,753c752,753
< 3809  TCGA-A6-3809-01A-01R-A278-07
< 3810  TCGA-A6-3810-01A-01R-1022-07
---
> 3809  TCGA-A6-3809-01A-01R-1022-07
> 3810  TCGA-A6-3810-01B-04R-A277-07
807,808c807,808
< 3917  TCGA-44-3917-01A-01R-A278-07
< 3918  TCGA-44-3918-01A-01R-1107-07
---
> 3917  TCGA-44-3917-01B-02R-A277-07
> 3918  TCGA-44-3918-01B-02R-A277-07
811c811
< 3923  TCGA-B2-3923-01A-02R-A277-07
---
> 3923  TCGA-B2-3923-01A-02R-1325-07
1667c1667
< 5635  TCGA-B2-5635-01B-04R-A277-07
---
> 5635  TCGA-B2-5635-01A-01R-1541-07
1676c1676
< 5656  TCGA-A6-5656-01A-21R-A278-07
---
> 5656  TCGA-A6-5656-01A-21R-1839-07
1678c1678
< 5659  TCGA-A6-5659-01A-01R-A278-07
---
> 5659  TCGA-A6-5659-01A-01R-1653-07
1680c1680
< 5661  TCGA-A6-5661-01A-01R-1653-07
---
> 5661  TCGA-A6-5661-01B-05R-2302-07
2193c2193
< 6650  TCGA-A6-6650-01A-11R-A278-07
---
> 6650  TCGA-A6-6650-01B-02R-A277-07
2254c2254
< 6775  TCGA-44-6775-01A-11R-1858-07
---
> 6775  TCGA-44-6775-01A-11R-A278-07
2259c2259
< 6780  TCGA-A6-6780-01B-04R-A277-07
---
> 6780  TCGA-A6-6780-01A-11R-A278-07
2836c2836
< 7740  TCGA-HC-7740-01A-11R-2118-07
---
> 7740  TCGA-HC-7740-01B-04R-2302-07
3151c3151
< 8258  TCGA-HC-8258-01A-11R-2263-07
---
> 8258  TCGA-HC-8258-01B-05R-2302-07
3718c3718
< A0C8  TCGA-BL-A0C8-01A-11R-A10U-07
---
> A0C8  TCGA-BL-A0C8-01A-11R-A277-07
3720c3720
< A0CA  TCGA-BK-A0CA-01A-21R-A118-07
---
> A0CA  TCGA-BK-A0CA-01A-21R-A277-07
3750c3750
< A0DB  TCGA-A7-A0DB-01A-11R-A00Z-07
---
> A0DB  TCGA-A7-A0DB-01C-02R-A277-07
4231c4231
< A139  TCGA-BK-A139-01A-11R-A118-07
---
> A139  TCGA-BK-A139-01C-08R-A277-07
4234c4234
< A13D  TCGA-A7-A13D-01A-13R-A277-07
---
> A13D  TCGA-A7-A13D-01B-04R-A277-07
4239c4239
< A13I  TCGA-BL-A13I-01B-04R-A277-07
---
> A13I  TCGA-BL-A13I-01A-11R-A13Y-07
4932c4932
< A26E  TCGA-A7-A26E-01A-11R-A277-07
---
> A26E  TCGA-A7-A26E-01A-11R-A169-07
4937,4938c4937,4938
< A26J  TCGA-A7-A26J-01A-11R-A277-07
< A26L  TCGA-BK-A26L-01A-11R-A277-07
---
> A26J  TCGA-A7-A26J-01B-02R-A277-07
> A26L  TCGA-BK-A26L-01C-04R-A277-07
6375c6375
< A4W0  TCGA-DV-A4W0-05A-11R-A266-07
---
> A4W0  TCGA-DV-A4W0-01A-11R-A266-07
6794c6794
< A5NY  TCGA-P7-A5NY-05A-11R-A35K-07
---
> A5NY  TCGA-P7-A5NY-01A-12R-A35K-07
9366c9366
< AAFG  TCGA-2G-AAFG-01A-11R-A430-07
---
> AAFG  TCGA-2G-AAFG-05A-11R-A430-07
9389c9389
< AAGI  TCGA-2G-AAGI-05A-11R-A430-07
---
> AAGI  TCGA-2G-AAGI-01A-11R-A430-07
9413c9413
< AAHP  TCGA-2G-AAHP-01A-12R-A430-07
---
> AAHP  TCGA-2G-AAHP-05A-11R-A430-07
9426c9426
< AAKG  TCGA-2G-AAKG-05A-11R-A430-07
---
> AAKG  TCGA-2G-AAKG-01A-11R-A430-07
9690c9690
< PABLDZ        TARGET-20-PABLDZ-09A-03R
---
> PABLDZ        TARGET-20-PABLDZ-09A-04R_2
9694c9694
< PADYCE        TARGET-52-PADYCE-01A-01R_1
---
> PADYCE        TARGET-52-PADYCE-01A-01R_2
9701c9701
< PAEAFC        TARGET-20-PAEAFC-09A-03R
---
> PAEAFC        TARGET-20-PAEAFC-09A-01R
9712c9712
< PAEFHC        TARGET-20-PAEFHC-09A-01R
---
> PAEFHC        TARGET-20-PAEFHC-09A-02R
9734c9734
< PAJLRA        TARGET-52-PAJLRA-01A-01R_1
---
> PAJLRA        TARGET-52-PAJLRA-01A-01R_2
9833c9833
< PAKPEW        TARGET-52-PAKPEW-01A-01R_2
---
> PAKPEW        TARGET-52-PAKPEW-01A-01R_1
9888c9888
< PALHVV        TARGET-20-PALHVV-09A-01R
---
> PALHVV        TARGET-20-PALHVV-09A-02R
9890c9890
< PALIIN        TARGET-30-PALIIN-01A-01R
---
> PALIIN        TARGET-30-PALIIN-01B-99R
9909c9909
< PALWVJ        TARGET-30-PALWVJ-01A-01R
---
> PALWVJ        TARGET-30-PALWVJ-01B-99R
9924c9924
< PAMNLH        TARGET-30-PAMNLH-01B-99R
---
> PAMNLH        TARGET-30-PAMNLH-01A-01R
9953c9953
< PANDER        TARGET-20-PANDER-09A-01R
---
> PANDER        TARGET-20-PANDER-09A-02R_2
9968c9968
< PANJGR        TARGET-20-PANJGR-09A-03R_2
---
> PANJGR        TARGET-20-PANJGR-09A-01R
9978,9979c9978,9979
< PANKFE        TARGET-30-PANKFE-01A-01R
< PANKFG        TARGET-20-PANKFG-09A-02R_2
---
> PANKFE        TARGET-30-PANKFE-01B-99R
> PANKFG        TARGET-20-PANKFG-09A-01R
9984c9984
< PANKNB        TARGET-20-PANKNB-09A-03R_2
---
> PANKNB        TARGET-20-PANKNB-09A-01R
9986c9986
< PANLIC        TARGET-10-PANLIC-09A-02R
---
> PANLIC        TARGET-10-PANLIC-09A-01R_2
9988c9988
< PANLIZ        TARGET-20-PANLIZ-09A-01R
---
> PANLIZ        TARGET-20-PANLIZ-09A-03R_2
9994c9994
< PANLXM        TARGET-20-PANLXM-09A-03R_2
---
> PANLXM        TARGET-20-PANLXM-09A-01R
9999c9999
< PANNMS        TARGET-30-PANNMS-01B-99R
---
> PANNMS        TARGET-30-PANNMS-01A-01R
10011,10012c10011,10012
< PANSBH        TARGET-20-PANSBH-09A-04R
< PANSBN        TARGET-30-PANSBN-01B-99R
---
> PANSBH        TARGET-20-PANSBH-09A-05R_2
> PANSBN        TARGET-30-PANSBN-01A-01R
10020c10020
< PANSJB        TARGET-20-PANSJB-09A-01R
---
> PANSJB        TARGET-20-PANSJB-09A-03R_2
10026c10026
< PANTPW        TARGET-20-PANTPW-09A-01R
---
> PANTPW        TARGET-20-PANTPW-09A-02R_2
10036c10036
< PANUTB        TARGET-20-PANUTB-09A-01R
---
> PANUTB        TARGET-20-PANUTB-09A-05R_2
10050c10050
< PANWHP        TARGET-20-PANWHP-09A-01R
---
> PANWHP        TARGET-20-PANWHP-09A-02R_2
10069c10069
< PANZLR        TARGET-21-PANZLR-09A-03R
---
> PANZLR        TARGET-21-PANZLR-41A-02R
10087c10087
< PAPBGH        TARGET-30-PAPBGH-01B-99R
---
> PAPBGH        TARGET-30-PAPBGH-01A-01R
10090c10090
< PAPBZI        TARGET-30-PAPBZI-01B-99R
---
> PAPBZI        TARGET-30-PAPBZI-01A-01R
10093c10093
< PAPCUR        TARGET-10-PAPCUR-09A-01R_2
---
> PAPCUR        TARGET-10-PAPCUR-09A-01R_1
10097c10097
< PAPEAV        TARGET-30-PAPEAV-01A-01R
---
> PAPEAV        TARGET-30-PAPEAV-01B-99R
10103c10103
< PAPEWB        TARGET-10-PAPEWB-09A-01R_2
---
> PAPEWB        TARGET-10-PAPEWB-09A-01R_1
10114c10114
< PAPHZT        TARGET-10-PAPHZT-09A-01R_2
---
> PAPHZT        TARGET-10-PAPHZT-09B-01R
10128c10128
< PAPKXS        TARGET-30-PAPKXS-01B-99R
---
> PAPKXS        TARGET-30-PAPKXS-01A-01R
10134c10134
< PAPNNX        TARGET-10-PAPNNX-09A-02R
---
> PAPNNX        TARGET-10-PAPNNX-09A-01R_2
10219c10219
< PARBIU        TARGET-20-PARBIU-09A-03R_2
---
> PARBIU        TARGET-20-PARBIU-09A-02R
10236,10237c10236,10237
< PARDDA        TARGET-20-PARDDA-09A-01R
< PARDDY        TARGET-20-PARDDY-09A-05R_2
---
> PARDDA        TARGET-20-PARDDA-09A-02R
> PARDDY        TARGET-20-PARDDY-09A-04R
10245c10245
< PAREAT        TARGET-15-PAREAT-09B-01R_1
---
> PAREAT        TARGET-15-PAREAT-09B-01R_2
10256c10256
< PARFGK        TARGET-20-PARFGK-09A-03R
---
> PARFGK        TARGET-20-PARFGK-09A-01R
10322c10322
< PARLSL        TARGET-21-PARLSL-41A-02R
---
> PARLSL        TARGET-21-PARLSL-09A-02R
10339,10340c10339,10340
< PARMZF        TARGET-20-PARMZF-09A-01R
< PARNAW        TARGET-21-PARNAW-41A-01R
---
> PARMZF        TARGET-20-PARMZF-09A-02R_2
> PARNAW        TARGET-21-PARNAW-09A-01R
10383c10383
< PARTKH        TARGET-52-PARTKH-01A-01R_1
---
> PARTKH        TARGET-52-PARTKH-01A-01R_2
10408c10408
< PARUTH        TARGET-20-PARUTH-09A-04R_2
---
> PARUTH        TARGET-20-PARUTH-09A-01R
10410c10410
< PARUUB        TARGET-20-PARUUB-09A-02R_2
---
> PARUUB        TARGET-20-PARUUB-09A-01R
10421c10421
< PARVSF        TARGET-20-PARVSF-09A-01R
---
> PARVSF        TARGET-20-PARVSF-09A-02R
10432c10432
< PARWPU        TARGET-15-PARWPU-09B-01R_2
---
> PARWPU        TARGET-15-PARWPU-09B-01R_1
10436c10436
< PARXBT        TARGET-20-PARXBT-09A-02R_2
---
> PARXBT        TARGET-20-PARXBT-09A-01R
10452c10452
< PARYGA        TARGET-20-PARYGA-09A-03R_2
---
> PARYGA        TARGET-20-PARYGA-09A-01R
10461a10462
> PARZCJ        TARGET-30-PARZCJ-01A-01R
10465c10466
< PARZIA        TARGET-21-PARZIA-09A-01R
---
> PARZIA        TARGET-21-PARZIA-41A-02R
10468,10469c10469,10470
< PARZRH        TARGET-52-PARZRH-01A-02R_2
< PARZUU        TARGET-20-PARZUU-09A-01R
---
> PARZRH        TARGET-52-PARZRH-01A-02R_1
> PARZUU        TARGET-20-PARZUU-09A-04R_2
10471c10472
< PARZWH        TARGET-20-PARZWH-09A-03R_2
---
> PARZWH        TARGET-20-PARZWH-09A-01R
10534c10535
< PASFHK        TARGET-21-PASFHK-09A-01R
---
> PASFHK        TARGET-20-PASFHK-09A-01R
10535a10537
> PASFIC        BS_DYHEFXJC
10639c10641
< PASLZE        TARGET-21-PASLZE-41A-02R
---
> PASLZE        TARGET-21-PASLZE-09A-01R
10657c10659
< PASNED        TARGET-52-PASNED-01A-01R_1
---
> PASNED        TARGET-52-PASNED-01A-01R_2
10665c10667
< PASNKZ        TARGET-21-PASNKZ-09A-01R
---
> PASNKZ        TARGET-21-PASNKZ-41A-02R
10677c10679
< PASPGA        TARGET-20-PASPGA-09A-01R
---
> PASPGA        TARGET-20-PASPGA-09A-02R_2
10681c10683
< PASPLU        TARGET-20-PASPLU-09A-01R
---
> PASPLU        TARGET-20-PASPLU-09A-02R_2
10690a10693
> PASREY        BS_6Y5F3QGV
10709c10712
< PASSLT        TARGET-21-PASSLT-41A-02R
---
> PASSLT        TARGET-21-PASSLT-09A-01R
10719a10723
> PASSWW        BS_K9VNW7JD
10742c10746
< PASTZK        TARGET-21-PASTZK-41A-03R
---
> PASTZK        TARGET-21-PASTZK-09A-01R
10783c10787
< PASWAT        TARGET-20-PASWAT-09A-02R
---
> PASWAT        TARGET-20-PASWAT-09A-01R_2
10803a10808
> PASWYR        BS_MPE34NYZ
10813a10819
> PASXHE        TARGET-30-PASXHE-01A-01R
10814a10821
> PASXIE        TARGET-30-PASXIE-01A-01R
10825a10833
> PASXRG        TARGET-30-PASXRG-01A-01R
10826a10835
> PASXRJ        TARGET-30-PASXRJ-01A-01R
10850c10859
< PASYWA        TARGET-21-PASYWA-09A-01R
---
> PASYWA        TARGET-21-PASYWA-41A-01R
10862a10872
> PASZKE        TARGET-30-PASZKE-01A-01R
10874c10884
< PATAIJ        TARGET-21-PATAIJ-09A-01R
---
> PATAIJ        TARGET-20-PATAIJ-09A-01R
10899a10910
> PATBMM        TARGET-30-PATBMM-01A-01R
10913a10925
> PATCFL        TARGET-30-PATCFL-01A-01R
11060c11072
< PATKKJ        TARGET-21-PATKKJ-41A-01R
---
> PATKKJ        TARGET-20-PATKKJ-09A-01R
11066c11078
< PATKWH        TARGET-21-PATKWH-09A-01R
---
> PATKWH        TARGET-21-PATKWH-41A-02R
11340a11353
> PT_2E552BAR   BS_RQSKD7PN
11408c11421
< PT_59D00MBQ   BS_TM9MH0RP
---
> PT_59D00MBQ   BS_ZVWE73JZ
11439c11452
< PT_6MWPJ96F   BS_NPQJ20KX
---
> PT_6MWPJ96F   BS_VR7FR1NE
11460a11474
> PT_7JQ24F35   BS_1TPWFSK6
11493c11507
< PT_8RB7TPS2   BS_HC44ZA0V
---
> PT_8RB7TPS2   BS_3F4FQJMR
11644c11658
< PT_EYWDFKA7   BS_J4GQPZS0
---
> PT_EYWDFKA7   BS_TYKJ8G2G
11668c11682
< PT_FZHGKJ0H   BS_2SHWPB5P
---
> PT_FZHGKJ0H   BS_JBGZ7HHZ
11698c11712
< PT_H45M7M2T   BS_HX0S30NE
---
> PT_H45M7M2T   BS_YB2RXRHT
11736c11750
< PT_JT9HH7M6   BS_HE0WJRW6
---
> PT_JT9HH7M6   BS_HWGWYCY7
11747c11761
< PT_K8ZV7APT   BS_893A91ZM
---
> PT_K8ZV7APT   BS_3Y0QXZQB
11798c11812
< PT_NESAQHB1   BS_QHS3V0GP
---
> PT_NESAQHB1   BS_QT3NZ2YZ
11877a11892
> PT_S4YNE17X   BS_B7QS4DHK
11878a11894
> PT_S4YNE17X   BS_VXEFKW5P
11941,11942c11957,11958
< PT_W17NV5YG   BS_TCGEZJ5F
< PT_W5GP3F6B   BS_KABQQA0T
---
> PT_W17NV5YG   BS_V4W81SFC
> PT_W5GP3F6B   BS_7WM3MNZ0
11946c11962
< PT_WE1CHTWK   BS_JGA9BP3A
---
> PT_WE1CHTWK   BS_1HQ76V6D
11965c11981
< PT_X7HC5YCY   BS_FN07P04C
---
> PT_X7HC5YCY   BS_W4H1D4Y6
12023c12039
< SJMPAL011914  TARGET-15-SJMPAL011914-09B-01R_2
---
> SJMPAL011914  TARGET-15-SJMPAL011914-09A-01R
12027c12043
< SJMPAL012419  TARGET-15-SJMPAL012419-09B-01R_2
---
> SJMPAL012419  TARGET-15-SJMPAL012419-09B-01R_1
12033c12049
< SJMPAL016342  TARGET-15-SJMPAL016342-09B-01R_1
---
> SJMPAL016342  TARGET-15-SJMPAL016342-09A-01R
12043c12059
< SJMPAL040025  TARGET-15-SJMPAL040025-09A-01R
---
> SJMPAL040025  TARGET-15-SJMPAL040025-09B-01R
12050,12054c12066,12070
< SJMPAL040037  TARGET-15-SJMPAL040037-09B-01R_1
< SJMPAL040038  TARGET-15-SJMPAL040038-09A-01R
< SJMPAL040039  TARGET-15-SJMPAL040039-09B-01R_2
< SJMPAL040459  TARGET-15-SJMPAL040459-09B-01R_1
< SJMPAL041117  TARGET-15-SJMPAL041117-09A-01R
---
> SJMPAL040037  TARGET-15-SJMPAL040037-09A-01R
> SJMPAL040038  TARGET-15-SJMPAL040038-09B-01R_2
> SJMPAL040039  TARGET-15-SJMPAL040039-09A-01R
> SJMPAL040459  TARGET-15-SJMPAL040459-09B-01R_2
> SJMPAL041117  TARGET-15-SJMPAL041117-09B-01R
12057c12073
< SJMPAL041120  TARGET-15-SJMPAL041120-09B-01R_2
---
> SJMPAL041120  TARGET-15-SJMPAL041120-09A-01R
12059,12060c12075,12076
< SJMPAL042787  TARGET-15-SJMPAL042787-09A-01R
< SJMPAL042791  TARGET-15-SJMPAL042791-09B-01R_2
---
> SJMPAL042787  TARGET-15-SJMPAL042787-09B-01R_1
> SJMPAL042791  TARGET-15-SJMPAL042791-09A-01R
12062,12063c12078,12079
< SJMPAL042793  TARGET-15-SJMPAL042793-09B-01R_1
< SJMPAL042794  TARGET-15-SJMPAL042794-09A-01R
---
> SJMPAL042793  TARGET-15-SJMPAL042793-09B-01R_2
> SJMPAL042794  TARGET-15-SJMPAL042794-09B-01R_1
12066,12067c12082,12083
< SJMPAL042798  TARGET-15-SJMPAL042798-09B-01R
< SJMPAL042799  TARGET-15-SJMPAL042799-09B-01R
---
> SJMPAL042798  TARGET-15-SJMPAL042798-09A-01R
> SJMPAL042799  TARGET-15-SJMPAL042799-09A-01R
12074,12075c12090,12091
< SJMPAL043511  TARGET-15-SJMPAL043511-09B-01R_1
< SJMPAL043512  TARGET-15-SJMPAL043512-09B-01R_1
---
> SJMPAL043511  TARGET-15-SJMPAL043511-09B-01R_2
> SJMPAL043512  TARGET-15-SJMPAL043512-09B-01R_2
12080c12096
< SJMPAL043771  TARGET-15-SJMPAL043771-09B-01R_2
---
> SJMPAL043771  TARGET-15-SJMPAL043771-09B-01R_1
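
The diff above is long, so a per-patient summary may be easier to review. A minimal Python sketch, assuming each file is tab-separated with the patient ID in the first column and the sample ID in the second, as the diff output suggests. The function names and the toy IDs below are hypothetical and are not taken from the real lists.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Set


def samples_by_patient(tsv_text: str) -> Dict[str, Set[str]]:
    """Map each patient ID (column 1) to its set of sample IDs (column 2)."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank or malformed rows. A header row, if present, is treated
        # like any other row and cancels out when it is the same in both files.
        if len(row) >= 2:
            grouped[row[0]].add(row[1])
    return dict(grouped)


def compare_lists(primary_tsv: str, eachcohort_tsv: str) -> Dict[str, List[str]]:
    """Summarize which patients differ between the two sample lists."""
    primary = samples_by_patient(primary_tsv)
    eachcohort = samples_by_patient(eachcohort_tsv)
    shared = primary.keys() & eachcohort.keys()
    return {
        "only_in_primary": sorted(primary.keys() - eachcohort.keys()),
        "only_in_eachcohort": sorted(eachcohort.keys() - primary.keys()),
        "changed": sorted(p for p in shared if primary[p] != eachcohort[p]),
    }


# Toy inputs with hypothetical IDs.
primary_text = "PT_A\tBS_1\nPT_B\tBS_2\n"
eachcohort_text = "PT_A\tBS_1\nPT_B\tBS_3\nPT_C\tBS_4\n"
summary = compare_lists(primary_text, eachcohort_text)
print(summary)
```

Grouping by patient rather than diffing sorted lines means a patient who has several samples in one list is reported once under "changed" rather than as several separate diff hunks.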

Assert no NA before paste(x). paste(c(NA)) returns 'NA'.
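
The same pitfall exists outside R. As an analogy only (the module itself is written in R, and the names below are hypothetical), a Python sketch of checking for missing values before converting them to text:

```python
import math


def is_missing(value) -> bool:
    """True for None or a float NaN, the usual Python stand-ins for NA."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def paste_checked(values, sep=" "):
    """Join values into one string, but fail loudly on missing entries."""
    values = list(values)
    # Without this check, None would silently become the text "None",
    # just as paste(c(NA)) silently yields "NA" in R.
    if any(is_missing(v) for v in values):
        raise ValueError("missing value found; refusing to convert it to text")
    return sep.join(str(v) for v in values)


# Hypothetical group and cohort names.
label = paste_checked(["group_a", "all_cohorts"], sep="_")
print(label)
```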
logstar added a commit that referenced this pull request Jul 14, 2021
Install json processor jq for converting JSON format to JSONL format. Related comment: <#27 (comment)>.
jharenza left a comment

Since the JSONL format was approved by FNL, I think we can merge this and once the annotation module is in, we can update later.

logstar commented Jul 14, 2021

> Since the JSONL format was approved by FNL, I think we can merge this and once the annotation module is in, we can update later.

Thank you for the review!
