Role | Name | GitHub | |
---|---|---|---|
Project Lead: | Scott Teresi | Personal GitHub | teresisc@msu.edu |
PI: | Patrick Edger | NA | edgerpat@msu.edu |
Generate gene expression data for a blueberry genome, to be used in a genomics project later.
- Align RNA-Seq data to said genome
- Generate gene expression tables (FPKM, TPM)
- Generate lists of differentially expressed genes (DEGs).
The workflow is recapitulated in the Makefile
.
- Index the genome with STAR
- Trim the reads
- Align the reads with STAR
- Quantify the reads with HTSeq
- Collate the reads into one table of counts
- Run EdgeR to determine differentially expressed genes
First, the genome was indexed using STAR v2.6.1
, the genome FASTA file, and the gene annotation file.
The FASTA file and annotation file were derived from the genome publication.
The commands used to perform this can be found in the src/genome_index_STAR.sb
script.
Illumina adapters were removed from the raw reads using Trimmomatic v0.38
. More details can be found in the src/trim_all.sb
script.
Filtered reads were then aligned to the genome using STAR v2.6.1
and the script associated with this command may be found at src/STAR_map.sb
. Multimapping reads were discarded.
Count files were calculated using HTSeq v0.12.4
.
Individual count files were then collated using the custom Python script at src/count_collate.py
.
This was then used as input to EdgeR v3.30.3
(R v4.0.2)
to determine which genes are differentially expressed in each condition comparison.
An FDR correction using the Benjamini-Hochberg method was utilized. The script associated with this analysis may be found at src/EdgeR/EdgeR_Blueberry.R
.
This project continues with Network Analysis. There I identify orthologs and examine gene network differences between the two blueberry cultivars in this dataset.
Refer to the requirements/
folder. Major packages used are STAR v2.6.1, Trimmomatic v0.38, SAMtools v1.9, R v4.0.2, edgeR_3.30.3, and HTSeq v0.12.4.
This was my first experience working on an RNA-Seq project, and I started this during the first year of my PhD.
I learned a lot about operating on the computing cluster, in particular running array jobs.
If I had the chance to go back and do it again, I would modify the array jobs to read from a manifest file, where each row would be a job.
I think this would be more legible and reproducible than ls
-ing a directory and feeding that to the job with sed
; the manifest file could easily be tracked with Git.
I also think the EdgeR script is a little too hard-coded, and would be a pain to refactor or use for a similar project. If I had the change to re-do the DEG analysis, I would also consider generalizing the EdgeR script to operate on one pair of gene expression columns at a time, reading those inputs from files already saved on disk. Most of the work (and a major source of hard-coding) was the product of just trying to iterate over my gene expression table, get the name for the comparison, and then perform the comparison (which is pretty simple on its own from a code standpoint).