These materials are for the course Analysis of Gene Expression taught at the University of Chemistry and Technology in Prague, and guaranteed by the Department of Informatics and Chemistry in the study programme Bioinformatics (available at the bachelor, master, and PhD levels).
The authors are from the Laboratory of Genomics and Bioinformatics at the Institute of Molecular Genetics of the Czech Academy of Sciences:
- Michal Kolar <kolarmi@img.cas.cz> (guarantor, theoretical lectures)
- Jiri Novotny <jiri.novotny@img.cas.cz> (exercises)
If you have suggestions or run into problems, please create a new issue. We will be happy to answer your questions, integrate new ideas, or resolve any problems 😊
Recordings and materials for the theoretical lectures are stored in the school's MS Teams and are currently available only to course participants.
We expect all participants to have a basic knowledge of base R and the Linux shell (bash). Links to relevant materials can be found in E01 - Intro.
We are using virtual machines (VMs) with images based on Debian 10 that include all the necessary software (R 4.0, RStudio Server, conda, and various tools).
We gratefully thank the Metacentrum Cloud team for their great assistance with the virtual machines ❤️
However, you can install everything yourself to get the same environment as our VMs offer (or one very close to it). In general, we recommend working on a Linux-based system (our tip: Linux Mint).
Just download and unzip this repository. Additional data files for E07 - RNA-seq must be downloaded separately; see the relevant section below.
You need R 4.0+ and Bioconductor 3.12+ installed. We recommend using the RStudio IDE for programming.
A lockfile for renv is included; it captures all packages needed to run the exercises. Moreover, renv ensures that all packages are installed into a local R library, so the installation doesn't pollute the system library.
To start the installation of the required packages:
- Create a new RStudio project in the Exercises/ directory. If you are not using RStudio, just change R's working directory to Exercises/.
- Start R.
- Run renv::init(). This will create a new project-specific library and install packages from renv.lock. If renv is not available, install it first with install.packages("renv").
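For those who prefer the terminal over RStudio, the steps above can be sketched as a small shell function. This is only a sketch under assumptions: the function name `setup_renv` and the CRAN mirror URL are ours, and we have not verified how `renv::init()` behaves in a non-interactive session on every renv version, so the interactive steps above remain the reference.

```shell
# Sketch: run the renv setup described above from a terminal.
# setup_renv is a hypothetical helper name; the CRAN mirror is an assumption.
setup_renv() {
  exercises_dir="${1:-Exercises}"

  if [ ! -d "$exercises_dir" ]; then
    echo "Directory not found: $exercises_dir" >&2
    return 1
  fi
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found; install R 4.0+ first." >&2
    return 1
  fi

  # Run in a subshell so the caller's working directory stays unchanged.
  (
    cd "$exercises_dir" || exit 1
    # Install renv only if it is missing, then initialize the project library.
    Rscript \
      -e 'if (!requireNamespace("renv", quietly = TRUE)) install.packages("renv", repos = "https://cloud.r-project.org")' \
      -e 'renv::init()'
  )
}

# Usage (from the repository root):
#   setup_renv Exercises
```

The function only checks its prerequisites and then delegates to the same R calls as the list above; nothing is installed until you call it.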
Other tools can be installed through your OS package manager or with conda (see E01 - Intro).
The latter is recommended for bioinformatics tools, which are mainly used during the RNA-seq exercises.
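As an illustration of the conda route, the sketch below prints a `conda create` command for the tools named in the RNA-seq exercise. The environment name, the channel order, and the package selection are our assumptions (Bioconda package names as we know them; `gmap` provides GSNAP and `subread` provides featureCounts), and the versions will not necessarily match those on the VMs.

```shell
# Sketch: compose a conda command for the RNA-seq command-line tools.
# The environment name and package list are assumptions, not the VM setup.
AGE_ENV_NAME="age-rnaseq"
AGE_CONDA_PACKAGES="sra-tools fastqc multiqc trimmomatic sortmerna gmap rseqc preseq subread salmon"

conda_create_cmd() {
  # Print the command instead of running it, so it can be reviewed first.
  echo "conda create --yes --name $AGE_ENV_NAME --channel conda-forge --channel bioconda $AGE_CONDA_PACKAGES"
}

# Dry run: show what would be executed.
conda_create_cmd

# To actually create and use the environment:
#   eval "$(conda_create_cmd)"
#   conda activate age-rnaseq
```

Printing the command first makes it easy to trim the package list before anything is installed.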
For educational reasons, those are kept in a private repository and are available upon request.
- Some information about our virtual machines and files.
- sshfs - mount a directory from a remote server
- tmux - terminal multiplexer
- fish - a friendly, interactive shell
- conda - package and virtual environment manager
- Links to beginner base R tutorials and other useful stuff.
- Introduction to RMarkdown (Rmd).
- Reproducible R (project-oriented workflow, consistent paths using here(), namespace conflicts, renv, etc.).
- Installing R packages.
- Debugging R.
- Writing your own functions.
- Vectorized operations, avoiding for loops, parallelization.
- Introduction to tidyverse
- Overview of tidy data and non-standard/tidy evaluation.
- magrittr - pipe operator
- tibble - enhanced data.frame
- dplyr - data manipulation
- tidyr - tools for tidy data
- stringr - consistent wrappers for common string operations
- glue - string interpolation
- purrr - functional programming tools
- ggplot2
  - Basic philosophy and usage.
  - Libraries extending ggplot2.
  - Additional themes.
- Other useful libraries
  - janitor - table summaries
  - plotly - interactive HTML plots
  - heatmaply - interactive HTML heatmaps
  - pheatmap - pretty heatmaps in base R
  - ComplexHeatmap - introduction (Rmd)
  - BiocParallel - parallelized lapply() and others
- The main purpose of this exercise is to practice basic R on a small dataset and to implement a basic set of (mainly visualization) functions that will be used later for microarray and RNA-seq data.
- The implemented functions are located in age_library.R; skeletons are in age_library_empty.R.
- Exercise on Affymetrix microarray analysis.
- Reading in data, technical and biological quality control, normalization, differential expression, reporting.
- Demonstration of the multiple testing problem and its correction methods on fair/skewed coins.
E06 - IGV browser - Michal Kolar
- Files for practising IGV usage.
E07 - RNA-seq - Jiri Novotny
- This exercise uses experimental data from a treatment of human airway smooth muscle cells and is largely based on the great tutorial RNA-Seq workflow: gene-level exploratory analysis and differential expression, from which preprocessed R data are later used (starting from the 03 - exploratory analysis part).
Additional data files must be downloaded beforehand from here.
If you are working on a remote server, you can use wget for downloading: wget https://onco.img.cas.cz/novotnyj/age/AGE2021_data.tar.
Then extract the downloaded archive into the Exercises/ directory, e.g. tar xf AGE2021_data.tar -C /path/to/Exercises.
(These data also include the output of this exercise, which is why they are so large. TODO: also provide only the data needed to begin this exercise: reference FASTAs and GTF, sample FASTQs, etc.)
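The download and extraction can be wrapped in two small shell functions, sketched below. The function names and the `AGE_DATA_URL` variable are hypothetical; the URL itself is the one given above. The extraction uses `tar -xf` without a compression flag so that GNU tar detects the archive format itself, which is an assumption about the tar implementation on your system.

```shell
# Sketch: download the E07 data archive and extract it into Exercises/.
# Function and variable names are hypothetical; the URL is from this README.
AGE_DATA_URL="https://onco.img.cas.cz/novotnyj/age/AGE2021_data.tar"

download_data() {
  # $1: directory where the archive should be saved
  download_dir="${1:-.}"
  mkdir -p "$download_dir"
  # --continue resumes a partial download, useful for a large archive.
  wget --continue --directory-prefix "$download_dir" "$AGE_DATA_URL"
}

extract_data() {
  # $1: path to the archive, $2: target directory (e.g. /path/to/Exercises)
  archive="$1"
  dest="$2"

  if [ ! -f "$archive" ]; then
    echo "Archive not found: $archive" >&2
    return 1
  fi
  mkdir -p "$dest"
  tar -xf "$archive" -C "$dest"
}

# Usage:
#   download_data /tmp/age
#   extract_data /tmp/age/AGE2021_data.tar /path/to/Exercises
```

Nothing is downloaded when the file is sourced; both steps run only when the functions are called.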
- Downloading from SRA (fasterq-dump).
- Technical quality control (FastQC, MultiQC).
- Read trimming (Trimmomatic).
- Downloading reference files (genome, annotation, etc.).
- Filtering out rRNA and tRNA (SortMeRNA).
- Two quantification pipelines:
  - Aligning to genome (GSNAP), quality control of the alignment (RSeQC, preseq) and counting overlaps (featureCounts).
  - Mapping to transcriptome (Salmon).
- Importing count matrix to R (tximport, DESeq2).
- Using DESeqDataSet.
- Running DESeq2.
- Gene annotation.
- Count transformations, TPM calculation.
- PCA, hierarchical clustering, boxplots.
- Using DESeq2 - contrasts, interactions, independent filtering, LFC shrinkage.
- Reporting results: MA plot, volcano plot, boxplots, ReportingTools.
- Gene set databases.
- Data preparation.
- ORA (goseq).
- GSEA by Subramanian (clusterProfiler) + visualization.
- Signaling pathway impact analysis (SPIA).
- Viewing data in KEGG (pathview).
- Online tools.
- Introduction, software overview, and links to tutorials, lists and other readings.