Skip to content
/ CAMPAREE Public

Configurable And Modular Program Allowing RNA Expression Emulation

License

Notifications You must be signed in to change notification settings

itmat/CAMPAREE

Repository files navigation

CAMPAREE

Introduction

CAMPAREE is a RNA expression simulator that is primed using real data to give realistic output. CAMPAREE needs as input a reference genome with transcript annotations as well as fastq files of samples of the species to base the output on. For each sample, CAMPAREE outputs a simulated set of RNA transcripts mimicking expression levels with in the fastq files and accounting for isoform-level expression and allele-specific expression. It also outputs simulated diploid genomes and their corresponding annotations with phased SNP and indel calls in the transcriptome from fastq reads. Additionally the simulation outputs the underlying distributions used for expressing the transcripts.

Documentation

Our documentation is on ReadTheDocs.

Quick Start Guide

This guide will walk you through basic installation and usage of CAMPAREE running a simulation on a simplified dataset consisting of a mouse genome truncated to about 6 million bases and two samples of reads that align there.

Installation

Make sure you have the following installed on your system:

  • git
  • python version 3.10
  • Java 1.8

Pull the git repo for CAMPAREE and BEERS_UTILS the into a convenient location:

git clone https://github.com/itmat/CAMPAREE.git
git clone https://github.com/itmat/BEERS_UTILS.git

Create a Python virtual environment to install required Python libraries to:

cd CAMPAREE
python3 -m venv ./venv_camparee

And activate the environment:

source ./venv_camparee/bin/activate

Install required libraries:

pip install -r requirements.txt
pip install -r ../BEERS_UTILS/requirements.txt

Install CAMPAREE package in your Python environment:

pip install -e .

Next, install the BEERS_UTILS package that CAMPAREE uses:

pip install -e ../BEERS_UTILS

Note that we currently require the use of the -e flag during these installs, otherwise CAMPAREE will not run successfully.

Baby Genome

The "baby genome" is a truncated version of mm10 consisting of segments of length at most 1 million bases chosen from chromosomes 1, 2, 3, X, Y, and MT.

Create an STAR index for alignment to the baby genome:

bin/create_star_index_for_baby_genome.sh

Perform Test Run

We are now ready to run CAMPAREE on a two small sample fastq files aligning to the baby genome. If you have not already done so for installation, activate the python environment:

source ./venv_camparee/bin/activate

The default config file for the baby genome has CAMPAREE run all operations serially on a single machine. To perform the test run with these defaults run:

bin/run_camparee.py -c config/baby.config.yaml -r 1

The argument -r 1 indicates that the run number is 1. If you run this again, you must either remove the output directory test_data/results/run_1/ or specify a new run number.

It is also possible to test deployment to a cluster. For LSF clusters run:

bin/run_camparee.py -c config/baby.config.yaml -r 1 -m lsf

For SGE clusters run:

bin/run_camparee.py -c config/baby.config.yaml -r 1 -m sge

Check Results

When the run completes, output will be created in CAMPAREE/test_data/results/run_1/. The final outputs will be in the text files test_data/results/run_1/CAMPAREE/data/sample1/molecule_file and test_data/results/run_1/CAMPAREE/data/sample2/molecule_file. Each line (after the header line) corresponds to a sequence of a single molecule in a tab-separated format. The default config file outputs 10000 molecules.

Packaged Software

CAMPAREE includes (in the 'third_party_software' directory) binary executables from the following pieces of software. Please cite the listed papers when using CAMPAREE.

  • STAR
    • Version: 2.5.2a
    • Source code: https://github.com/alexdobin/STAR/tree/2.5.2a
    • Citation: Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21. doi: 10.1093/bioinformatics/bts635
  • BEAGLE
    • Version: 5.0 (28Sep18.793)
    • Source code: https://faculty.washington.edu/browning/beagle/beagle.html
    • Citation: Browning SR, Browning BL. Rapid and accurate haplotype phasing and missing data inference for whole genome association studies by use of localized haplotype clustering. Am J Hum Genet. 2007 Nov;81(5):1084-97. doi:10.1086/521987
  • Kallisto
  • Bowtie2

Funding

Work on CAMPAREE is supported by R21-LM012763-01A1: “The Next Generation of RNA-Seq Simulators for Benchmarking Analyses” (PI: Gregory R. Grant).