This R script analyzes microarray data acquired with an Agilent SureScan Microarray Scanner. Internally, the limma package by Gordon K. Smyth et al. [1] is used to read and analyze the data.
This script is used to analyze the raw readouts of two biological experiments. In both experiments, the cell culture media were supplemented with three different pharmaceuticals, resulting in four samples per time point: three treatments and one negative control. The data should show whether treatment with the different pharmaceuticals affects gene expression and, if so, how this influence changes over time. Furthermore, we are interested in the specific genes that changed over time and how these genes relate to each treatment, i.e., whether a particular gene is differentially expressed for only one treatment or for multiple treatments, and whether a gene that is differentially expressed for multiple treatments is consistently up- or down-regulated.
Cells were harvested at different time points, and RNA was isolated using a modified version of the protocol for Qiagen's RNeasy Protect Cell Mini Kit.[2] Further sample preparation and microarray-based gene expression analysis were performed according to Agilent's protocol.[3] The microarray used was a SurePrint G3 Human Gene Expression v3 8x60K (P/N G4851C). Each glass slide carries eight high-definition 60K arrays containing cDNA for 26,803 unique Entrez genes and 30,606 unique lncRNAs, as well as 3000 replicates.[4]
The Agilent SureScan generates a QC report and a .txt file with various data and metadata for each sample. Eight samples could be analyzed per glass slide; the files for these eight samples were stored in a folder named after the serial number of the glass slide (see data/*).
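As a minimal sketch of how such a folder layout could be indexed in R, the helper below collects all .txt exports under a data directory and records the slide folder each file belongs to. The function name list_slide_files and the returned column names are hypothetical and not taken from the script itself; only base R is used.

```r
# Hypothetical helper: index the scanner's .txt exports, one folder per slide.
list_slide_files <- function(data_dir = "data") {
  # Paths are returned relative to data_dir, e.g. "<slide serial>/<sample>.txt"
  files <- list.files(data_dir, pattern = "\\.txt$", recursive = TRUE)
  data.frame(
    slide = dirname(files),            # folder name = slide serial number
    file  = basename(files),           # per-sample export
    path  = file.path(data_dir, files),
    stringsAsFactors = FALSE
  )
}
```

The resulting path column could then be handed to whatever import function is used, while the slide column remains available as a possible batch variable.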
After importing the raw data via limma's read.maimages function [5], the data were annotated with the Ensembl gene IDs. Background correction was done via the backgroundCorrect function from limma using the normexp method.[6] After background correction, the data were filtered; here, the data that could not be annotated with data from Ensembl and whose expression level was not significantly above the background were excluded. The remaining data were normalized with the limma function normalizeBetweenArrays using the quantile method to ensure a similar distribution of expression levels between the different arrays.[7]
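The preprocessing chain described above can be sketched roughly as follows. Because the scanner files are not available here, the example builds a small simulated EListRaw object in place of the output of read.maimages. The ensembl_gene_id column, the use of negative-control probes, the 95th-percentile cutoff, and the "at least 4 arrays" rule are all assumptions made for illustration; the script's actual filter criteria may differ.

```r
library(limma)

set.seed(1)
n_probes <- 500
n_arrays <- 8

# Simulated stand-in for the EListRaw that read.maimages() would return
# for single-channel Agilent files; all values below are made up.
background <- matrix(rnorm(n_probes * n_arrays, mean = 60, sd = 5), n_probes, n_arrays)
noise      <- matrix(rnorm(n_probes * n_arrays, sd = 5), n_probes, n_arrays)
signal     <- matrix(rexp(n_probes * n_arrays, rate = 1 / 400), n_probes, n_arrays)
signal[1:50, ] <- 0  # the first 50 probes act as negative controls

genes <- data.frame(
  ProbeName       = sprintf("probe_%03d", seq_len(n_probes)),
  ControlType     = rep(c(-1, 0), c(50, n_probes - 50)),
  ensembl_gene_id = c(rep(NA, 100), sprintf("ENSG%011d", 1:400)),  # hypothetical annotation
  stringsAsFactors = FALSE
)
raw <- new("EListRaw", list(E = background + signal + noise, Eb = background, genes = genes))

# Step 1: normexp background correction
bc <- backgroundCorrect(raw, method = "normexp", verbose = FALSE)

# Step 2 (assumed rule): keep annotated probes whose corrected intensity exceeds
# the 95th percentile of the negative controls on at least 4 arrays
neg95    <- apply(bc$E[bc$genes$ControlType == -1, ], 2, quantile, probs = 0.95)
above_bg <- rowSums(sweep(bc$E, 2, neg95, ">")) >= 4
keep     <- !is.na(bc$genes$ensembl_gene_id) & above_bg
filtered <- bc[keep, ]

# Step 3: quantile normalization between arrays (returns log2 values in an EList)
norm <- normalizeBetweenArrays(filtered, method = "quantile")
```

After quantile normalization, every array shares the same empirical distribution of log2 intensities, which is what makes the arrays directly comparable in the later model fit.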
A polynomial trend was allowed for the baseline and the individual treatments to examine the effect of each treatment over the treatment period. These polynomial trends were used to create a design matrix, which was fitted to the data using the lmFit function from limma.[9] Professor Gordon K. Smyth of the Walter and Eliza Hall Institute of Medical Research in Melbourne, creator of the limma package, recommended this procedure.[8]
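One way such a design matrix could be set up is sketched below, using base R only. The treatment labels, the four time points, and the polynomial degree of 2 are placeholders chosen for illustration, since the actual values are not stated here. The interaction of treatment with the polynomial terms gives the control group its own baseline trend and each treatment its own deviation from that trend.

```r
# Hypothetical sample sheet: 4 groups x 4 time points (labels and times are made up)
targets <- expand.grid(
  treatment = c("control", "drugA", "drugB", "drugC"),
  time      = c(2, 6, 24, 48)
)
targets$treatment <- relevel(factor(targets$treatment), ref = "control")

# Orthogonal polynomial in time; degree 2 is an assumption
trend <- poly(targets$time, degree = 2)

# Baseline trend (control) plus treatment-specific deviations from it
design <- model.matrix(~ treatment * trend, data = targets)

colnames(design)
```

With this parameterization, the columns containing a treatment name describe how that treatment differs from the control, either in average level or in the shape of its time course.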
The fitted data were then statistically analyzed using empirical Bayes statistics for differential expression. The eBayes function from limma was used for this purpose.[10]
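The fit and the empirical Bayes step might look roughly like the following sketch on simulated log2 expression values. The design, gene names, effect sizes, and the choice to test all coefficients of one treatment jointly with topTable are illustrative assumptions, not a description of the script's actual calls.

```r
library(limma)

set.seed(42)

# Hypothetical design: 4 groups x 4 time points, degree-2 polynomial trend
targets <- expand.grid(
  treatment = c("control", "drugA", "drugB", "drugC"),
  time      = c(2, 6, 24, 48)
)
targets$treatment <- relevel(factor(targets$treatment), ref = "control")
trend  <- poly(targets$time, degree = 2)
design <- model.matrix(~ treatment * trend, data = targets)

# Simulated log2 expression: 200 genes, one of which responds to drugA
expr <- matrix(
  rnorm(200 * nrow(targets), mean = 8, sd = 0.3),
  nrow = 200,
  dimnames = list(sprintf("gene_%d", 1:200), NULL)
)
is_drugA <- targets$treatment == "drugA"
expr["gene_1", is_drugA] <- expr["gene_1", is_drugA] + 2 + 4 * trend[is_drugA, 1]

# Linear model fit followed by empirical Bayes moderation
fit <- eBayes(lmFit(expr, design))

# Joint F-test over all coefficients that involve drugA
drugA_coefs <- grep("drugA", colnames(design), value = TRUE)
tab <- topTable(fit, coef = drugA_coefs, number = Inf)
head(tab, 3)
```

Testing several coefficients at once asks whether the treatment differs from the control at all over the time course, rather than at a single time point.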
The generated data were filtered according to different criteria (logFC, p-value, and p-value with logFC). For all criteria, lists of genes that were differentially expressed in multiple or all treatments were generated; genes that were differentially expressed in only one treatment were also flagged.
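The filtering and overlap logic can be illustrated with a small base R sketch. The result tables, gene names, and cutoffs (an absolute logFC of at least 1 and a p-value below 0.05) are made up for the example; the thresholds used in the actual script are not stated here.

```r
# Hypothetical per-treatment result tables (all numbers are made up)
gene_ids <- c("G1", "G2", "G3", "G4")
results <- list(
  drugA = data.frame(gene = gene_ids, logFC = c(2.1, -1.4, 0.3, 1.2),
                     p = c(0.001, 0.01, 0.2, 0.03), stringsAsFactors = FALSE),
  drugB = data.frame(gene = gene_ids, logFC = c(1.8, 1.6, -0.2, 0.4),
                     p = c(0.002, 0.02, 0.5, 0.04), stringsAsFactors = FALSE),
  drugC = data.frame(gene = gene_ids, logFC = c(1.5, 0.1, 0.2, -0.3),
                     p = c(0.004, 0.6, 0.7, 0.3), stringsAsFactors = FALSE)
)

lfc_cut <- 1     # assumed cutoff on |logFC|
p_cut   <- 0.05  # assumed cutoff on the p-value

# Select genes by logFC only, p-value only, or both combined
select_genes <- function(tab, criterion = c("logFC", "p", "both")) {
  criterion <- match.arg(criterion)
  by_lfc <- abs(tab$logFC) >= lfc_cut
  by_p   <- tab$p < p_cut
  keep <- switch(criterion, logFC = by_lfc, p = by_p, both = by_lfc & by_p)
  tab$gene[keep]
}

hits <- lapply(results, select_genes, criterion = "both")

# Count in how many treatments each gene was selected
counts      <- table(unlist(hits))
in_all      <- names(counts)[counts == length(hits)]
in_multiple <- names(counts)[counts >= 2]
in_one      <- names(counts)[counts == 1]

# For genes selected in several treatments: is the direction always the same?
same_direction <- sapply(in_multiple, function(g) {
  signs <- unlist(lapply(names(hits), function(tr) {
    tab <- results[[tr]]
    if (g %in% hits[[tr]]) sign(tab$logFC[tab$gene == g]) else NULL
  }))
  length(unique(signs)) == 1
})
```

In this toy data, G1 is selected in all three treatments with a consistent direction, G2 is selected in two treatments but with opposite signs, and G4 is specific to a single treatment.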
For more detailed information on the analysis, you may look at the code and refer to the individual functions in the manual of limma [1] or the respective R package.
This project is part of Teresa Hardy's Master's thesis, which was conducted under the supervision of Sam Thilmany at the Federal Institute for Drugs and Medical Devices, Bonn, Germany.
The concept for the data analysis was a joint effort of Teresa Hardy and Sam Thilmany; Sam Thilmany did the programming.