Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

#837 updated: RNA QC compare stranded vs polya matched sample_ids #930

Merged
merged 7 commits into from
Feb 10, 2021
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 56 additions & 0 deletions analyses/tp53_nf1_score/02-qc-rna_expression_score.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ library("broom")
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
data_dir <- file.path(root_dir, "data")
results_dir <- "results"

clinical <- read_tsv(file.path(data_dir,"pbta-histologies.tsv"))
```

### Classifier scores
Expand Down Expand Up @@ -128,3 +130,57 @@ negative correlation estimate -0.139 at p value 0.297
Overall distribution of expression and classifier score shows
negligible correlation so we cannot directly use the expression/classifier
score to infer functionality/phenotype of TP53 inactivation in samples

### Sample_id matched polya and stranded

```{r}

sample_id_multi_library <- clinical %>%
filter(experimental_strategy=="RNA-Seq") %>%
group_by(sample_id) %>%
summarise(counts = length(unique(RNA_library)),
RNA_library = toString(RNA_library)) %>%
filter(counts > 1)

sample_id_multi_library

```

```{r}

sample_id_multi_library_scores_df <- clinical %>%
left_join(score_stranded_df , by=c("Kids_First_Biospecimen_ID"="sample_id")) %>%
left_join(score_polya_df,by=c("Kids_First_Biospecimen_ID"="sample_id"),
suffix = c("_stranded","_polya")) %>%
filter(sample_id %in% sample_id_multi_library$sample_id) %>%
group_by(sample_id,) %>%
# because 7316-85 has multiple stranded I'm taking a mean here
# A tibble: 8 x 2
# Kids_First_Biospecimen_ID tp53_score_stranded
# <chr> <dbl>
# 1 BS_59ZJWJTF 0.299
# 2 BS_QYPHA40N 0.0665
# 3 BS_SB12W1XT 0.229
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, you are better off showing this output in a chunk rather than including it in a comment that could easily get out of date.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the review! That's a great point, I've now added a chunk to display the output

summarise( tp53_score_stranded = mean(tp53_score_stranded[!is.na(tp53_score_stranded)]),
tp53_score_polya = mean(tp53_score_polya[!is.na(tp53_score_polya)])) %>%
reshape2::melt()


```


Plotting the scores
```{r}

ggplot(sample_id_multi_library_scores_df,
aes(y=value,x=as.numeric(factor(sample_id_multi_library_scores_df$variable)))) +
geom_point()+
geom_line(aes(color=sample_id)) +
xlim(c("stranded",
"polya")) +
xlab("RNA_library") + ylab("tp53_score")
```

On average matched polya samples have lower tp53 classifier scores, should we only keep stranded scores when we can?

We could consider utilizing batch correction once completed via [#919](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/919) to rerun the classifier for this dataset. We can then check whether the scores would be more similar for polyA and stranded. Instead of choosing one over the other without some additional investigation, perhaps we take an average. For the most part, while there is a bias, the scores agree and for 7316-161, although the polyA score is just under 0.5, it is still considered a loss via evidence. Once we investigate all of the evidence, we might consider a re-thresholding of the scores for what determines inactivation/oncogenic TP53 (discussed earlier).
Loading