Contact Information:

Role	Name	GitHub	Email
Project Lead:	Scott Teresi	Personal GitHub	teresisc@msu.edu
PI:	Patrick Edger	NA	edgerpat@msu.edu

Purpose:

Generate gene expression data for a blueberry genome, to be used in a genomics project later.

Align RNA-Seq data to said genome
Generate gene expression tables (FPKM, TPM)
Generate lists of differentially expressed genes (DEGs)
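For readers unfamiliar with the two expression units named above, here is a minimal sketch of the standard FPKM and TPM formulas. The gene names, counts, and lengths are made up for illustration, and this is not the code used in this repository; it assumes a non-zero total count and a known length in base pairs for every gene.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    denom = sum(rate.values())
    return {g: r * 1e6 / denom for g, r in rate.items()}


# Hypothetical toy data, one sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values)
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.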

Solution:

The workflow is recapitulated in the Makefile.

Index the genome with STAR
Trim the reads
Align the reads with STAR
Quantify the reads with HTSeq
Collate the reads into one table of counts
Run edgeR to determine differentially expressed genes
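The collation step in the list above can be sketched as follows. This is a hypothetical stand-in, not the repository's own script: the function names, file names, and the `.count` extension are assumptions. It relies only on the htseq-count output layout of one tab-separated `gene_id` and `count` per line, with summary rows whose names start with `__`.

```python
import csv
import tempfile
from pathlib import Path


def read_htseq_counts(path):
    """Return {gene_id: count}, skipping htseq-count's '__' summary rows."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2 or row[0].startswith("__"):
                continue
            counts[row[0]] = int(row[1])
    return counts


def collate(count_files):
    """Merge per-sample files into (sample_names, {gene_id: [counts...]})."""
    samples = [Path(p).stem for p in count_files]
    per_sample = [read_htseq_counts(p) for p in count_files]
    genes = sorted(set().union(*per_sample))
    # Genes missing from a sample's file are filled with zero.
    table = {g: [s.get(g, 0) for s in per_sample] for g in genes}
    return samples, table


# Demonstration with two tiny, made-up count files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "sample_1.count")
    second = Path(tmp, "sample_2.count")
    first.write_text("gene1\t5\ngene2\t0\n__no_feature\t12\n")
    second.write_text("gene1\t7\ngene2\t3\n__no_feature\t9\n")
    samples, table = collate([first, second])

print(samples)
print(table)
```

Each column of the resulting table corresponds to one sample, which is the shape edgeR expects for a count matrix.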

Indexing the genome with STAR

First, STAR v2.6.1 was used to index the genome from the genome FASTA file and the gene annotation file. Both files were taken from the genome publication. The commands for this step can be found in the src/genome_index_STAR.sb script.

Trimming the reads:

Illumina adapters were removed from the raw reads using Trimmomatic v0.38. More details can be found in the src/trim_all.sb script.

Alignment of reads:

The trimmed reads were then aligned to the genome using STAR v2.6.1; the script for this step is src/STAR_map.sb. Multimapping reads were discarded.

Quantification of counts and calculation of differentially expressed genes:

Count files were generated with HTSeq v0.12.4. The individual count files were then collated with the custom Python script at src/count_collate.py. The collated table was used as input to edgeR v3.30.3 (R v4.0.2) to determine which genes are differentially expressed in each condition comparison. P-values were corrected for multiple testing with the Benjamini-Hochberg FDR method. The script for this analysis can be found at src/EdgeR/EdgeR_Blueberry.R.
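To make the FDR correction concrete, here is a minimal Python sketch of the Benjamini-Hochberg adjustment. The analysis itself ran in R, so this is only an illustration of the procedure, not the repository's code; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four genes
adjusted = benjamini_hochberg(pvals)
print(adjusted)
```

A gene is then called differentially expressed when its adjusted p-value falls below the chosen FDR threshold.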

Follow Up Project:

This project continues with Network Analysis. There I identify orthologs and examine gene network differences between the two blueberry cultivars in this dataset.

Version Control:

Refer to the requirements/ folder. Major packages used are STAR v2.6.1, Trimmomatic v0.38, SAMtools v1.9, R v4.0.2, edgeR v3.30.3, and HTSeq v0.12.4.

Context and Future Considerations:

This was my first experience working on an RNA-Seq project, and I started this during the first year of my PhD. I learned a lot about operating on the computing cluster, in particular running array jobs. If I had the chance to go back and do it again, I would modify the array jobs to read from a manifest file, where each row would be a job. I think this would be more legible and reproducible than ls-ing a directory and feeding that to the job with sed; the manifest file could easily be tracked with Git.
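A minimal sketch of the manifest idea described above, assuming a SLURM cluster (which exposes the array index as `SLURM_ARRAY_TASK_ID`) and a tab-separated manifest with a header row. The column names, file paths, and function names are hypothetical.

```python
import csv
import io
import os


def current_task_id(default=1):
    """Read the 1-based array index that SLURM sets for each array task."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def job_for_task(manifest_handle, task_id):
    """Return the manifest row (as a dict) for a 1-based array task ID."""
    rows = list(csv.DictReader(manifest_handle, delimiter="\t"))
    if not 1 <= task_id <= len(rows):
        raise IndexError(f"task {task_id} outside manifest of {len(rows)} rows")
    return rows[task_id - 1]


# Hypothetical manifest; in practice this would be a Git-tracked file on disk.
MANIFEST = "sample\tfastq\nS1\tdata/S1.fastq.gz\nS2\tdata/S2.fastq.gz\n"

# Inside an array job, pass current_task_id() instead of the literal 2.
job = job_for_task(io.StringIO(MANIFEST), 2)
print(job["sample"], job["fastq"])
```

Because each row fully describes one job, rerunning a single failed sample only requires resubmitting that array index.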

I also think the edgeR script is a little too hard-coded, and would be a pain to refactor or reuse for a similar project. If I had the chance to redo the DEG analysis, I would also consider generalizing the edgeR script to operate on one pair of gene expression columns at a time, reading those inputs from files already saved on disk. Most of the work (and a major source of hard-coding) was the product of just trying to iterate over my gene expression table, get the name for the comparison, and then perform the comparison (which is pretty simple on its own from a code standpoint).
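One way to take the iteration and naming out of the analysis script is to generate the list of comparisons up front. The sketch below is an assumption about how that could look, not the existing code; the condition labels and the `_vs_` naming scheme are hypothetical.

```python
from itertools import combinations


def comparison_plan(conditions):
    """Yield (name, condition_a, condition_b) for every unordered pair."""
    for a, b in combinations(conditions, 2):
        yield f"{a}_vs_{b}", a, b


conditions = ["control", "treatment_1", "treatment_2"]  # hypothetical labels
plan = list(comparison_plan(conditions))
for name, a, b in plan:
    print(name, a, b)
```

Each row of the plan could then be handed to a generalized script that performs a single comparison and writes its result under the comparison name.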

sjteresi/Blueberry_RNA_Seq_Expression_Analysis
