scripts/get-stress_response-genes-transcriptome.Rmd

---
title: "get-stress-genes"
output: html_document
---
Rmd to get list of stress response genes from the transcriptome v 3.1 with `blast` output and uniprot-SP-Go annotation. 

```{r}
library(dplyr)
library(tidyverse)
```

Read in the annotation from the transcriptome v 3.1 blast output:
```{r}
uniprot <- read.delim("../data/uniprotKB-transcrblast3.1-GO.tab", sep = '\t', header = TRUE)
head(uniprot)
```
12931 annotations 

Rename columns:
```{r}
colnames(uniprot) <- c("uniprot_acc_ID", "Entry.name", "Status", "Protein.names", "Gene.names", "Organism", "Length", "Gene.ontology.biological.process", "Gene.ontology.cellular.component", "Gene.ontology.GO", "Gene.ontology.molecular.function", "Gene.ontology.IDs")
head(uniprot)
```

Read in `blast` output from transcriptome v 3.1 - unique Trinity IDs (48551):
```{r}
blast <- read.delim("../analyses/BLAST-to-GOslim/_blast-sep.tab", header = FALSE)
head(blast)
```

Rename first and third column for `join`-ing purposes later on. 
```{r}
colnames(blast) <- c("Trinity_ID", "swissprot", "uniprot_acc_ID", "gene_id", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")
head(blast)
```

`join` the `blast` output with the uniprot-SP-GO by uniprot_acc_ID:
```{r}
blastgo <- left_join(blast, uniprot, by = "uniprot_acc_ID")
head(blastgo)
```
48551 rows with 25 columns

Write out annotated blast go output:
```{r}
#write.table(blastgo, "../analyses/BLAST-to-GOslim/transcriptomev3.1-blast-uniprotGO.tab", sep = '\t', quote = FALSE, row.names = FALSE)
```
Wrote out 02/11/2021. Commented out code. 


Read in the table from the BLAST to GOslim jupyter notebook called "Blastquery-GOslim.tab":
```{r}
blastquery <- read.delim("../analyses/BLAST-to-GOslim/Blastquery-GOslim.tab", sep = '\t', header = FALSE)
head(blastquery)
```
456451 rows.

Add column names:
```{r}
colnames(blastquery) <- c("Trinity_ID", "GO_ID", "GOslim", "biological_process")
head(blastquery)
```

Subset the GOslim terms that have the word "stress" in them. Only the term "stress response" has the term "stress":
```{r}
stress <- blastquery %>%
  filter(str_detect(GOslim, "stress"))
head(stress)
```
10910 Trinity IDs.

`join` the stress list with the annotated `blast` output with uniprot (blastgo):
```{r}
stress_annot <- left_join(stress, blastgo, by = "Trinity_ID")
head(stress_annot)
```
10910 rows. 24 columns.       


write out the annotated stress response table:
```{r}
#write.table(stress_annot, "../analyses/BLAST-to-GOslim/transcriptome-stress-genes.tab", sep = '\t', quote = FALSE, row.names = FALSE)
```
Wrote out 02/11/2021. Commented out code. 

#### find unique Trinity_IDs (tells how many contigs fall under the category "Stress Response": 

```{r}
distinct_stress <- unique(stress_annot$Trinity_ID)
head(distinct_stress)
```
6191 rows.