Pipeline for RNA-seq scripts used by the Essigmann Lab.
- Create environment from
yml
file:conda env create -f rnaseq_env.yml
- Activate environment:
source activate rnaseq
- Download genome from UCSC Genome Browser:
chromFa.tar.gz
- Unzip FASTA:
tar -xvzf chromFa.tar.gz
- Remove mitochondrial chromosome and other noncanonical chromosomes (
chr#_#########_random
) from directory - Compile chromosomal FASTA files to single file:
cat *.fa > mm10.fa
- If necessary, modify FASTA file to match naming convention for GTF file:
sed -i 's/chr//g' mm10.fa
- Index reference:
hisat2-build -f mm10.fa mm10
- Download GTF from Ensembl:
Mus_musculus.GRCm38.93.gtf.gz
- Unzip GTF:
tar -xzvf Mus_musculus.GRCm38.93.gtf.gz
- (Optional) Rename file:
mv Mus_musculus.GRCm38.93.gtf mm10.gtf
- Format known splice junctions to format used by HISAT2:
hisat2_extract_splice_sites.py mm10.gtf > mm10.gtf.ss
- Trim adapter sequences and ends:
trimmomatic-0.38.jar SE $seq.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:30 LEADING:30 TRAILING:30 MINLEN:25
- Map to whole genome, accounting for known splice sites:
hisat2 --dta -x ref/mm10 --known-splicesite-infile ref/mm10.gtf.ss -U $trimmed.fastq -S $sample.hisat2.sam
- Convert to BAM:
samtools view -bS $sample.hisat2.sam > $sample.hisat2.unsorted.bam
- Sort BAM file:
samtools sort -o $sample.hisat2.bam $sample.hisat2.unsorted.bam
- Estimate abundances for differential expression analysis:
stringtie -e -B -G ref/mm10.gtf -A $sample\_abund.tab -o $directory/$sample/$sample.gtf $sample.hisat2.bam
- Note: This is considered StringTie's "alternate" workflow, relying on a well-annotated reference; it will not search for novel isoforms. Suggested by the StringTie creator here.
- Historical note: Originally the pipeline used StringTie's recommended workflow, but identifying gene names caused trouble as many
gene_id
values were givenMSTRG
assignments. StringTie author made note of it here.
- Download Python script (prepDE.py) provided by StringTie developers
- Run script to extract read count information from StringTie outputs:
python2.7 prepDE.py -l 40 [-i $directory]
- Note: Without the
-i
parameter, this assumes default directory structure created by StringTie, with aballgown
folder in the working directory. In our case, use the-i
parameter to denote the directory where outputs are contained. - Note: The script requires a Python version between 2.7 and 3.
- Note: The
-l
parameter takes in average read length. While this doesn't affect relative transcript levels, it will impact your absolute values! The default parameter for-l
is75
.
- Note: Without the
- R environment must have the following Bioconductor packages installed: DESeq2, org.Mm.eg.db, and biomaRt.
- If not installed, install Bioconductor:
source("https://bioconductor.org/biocLite.R")
- Install packages:
biocLite("DESeq2"); biocLite("org.Mm.eg.db"); biocLite("biomaRt")
- Call each library in the R environment with the
library()
function.
- If not installed, install Bioconductor:
- Read in expression gdata calculated by StringTie:
pheno_data <- read.table("180711Ess_phenodata.csv", header=TRUE, sep=","); count_data <- read.table("gene_count_matrix.csv", header=TRUE, sep=",", row.names=1)
- Retrieve gene IDs and gene names by creating a
biomaRt
instance:ensembl <- useMart("ensembl", dataset="mmusculus_gene_ensembl")
- Create key-value pairs for Ensembl IDs to gene names:
gene_names <- getBM(c("ensembl_gene_id", "external_gene_name"), filters="ensembl_gene_id", values=rownames(count_data), mart=ensembl); colnames(gene_names) <- c("ensembl_id", "gene_name"); gene_set <- setNames(as.list(gene_names$gene_name), as.list(gene_names$ensembl_id))
- Read in all data to a
DESeqDataSet
:dds <- DESeqDataSetFromMatrix(countData=count_data, colData=pheno_data, design~1)
- Filter to remove low-count genes:
keep.rows <- rowSums(counts(dds)) >= 10; dds <- dds[keep.rows,]
- Use in-house R function
dea.group
to query between either sexes or treatment groups. For sex differences, the function requires the DESeq data structure and experimental treatment group to query; TCPOBOP addition changes require the DESeq data structure,the base group (i.e. non-TCPOBOP treatment), and the sex to query.- Conduct differential gene analysis on DMSO-treated mice, by sex:
dds.sex_dmsona <- dea.group(dds, "sex", "DMSO")
- Conduct differential gene analysis on DMSO-treated males, by TCPOBOP addition:
dds.tc_dmsom <- dea.group(dds, "TCPOBOP", "DMSO", "M")
- Conduct differential gene analysis on DMSO-treated mice, by sex: