crispr-DART is a pipeline to process, analyse, and report CRISPR-Cas9-induced genome editing outcomes from high-throughput sequencing of target regions of interest.
crispr-DART was developed as part of the study "Parallel genetics of regulatory sequences using scalable genome editing in vivo", which is now published in Cell Reports: Froehlich, J. & Uyar, B. et al., Cell Reports, 2021.
News coverage of the study: Scaling up genome editing big in tiny worms.
The pipeline accepts single-end or paired-end Illumina reads and long PacBio reads, from both DNA and RNA samples.
The pipeline consists of the following steps:
- Quality control (fastqc/multiqc) and improvement (TrimGalore!) of raw reads
- Mapping the reads to the genome of interest (BBMap)
- Extracting statistics about the detected insertions and deletions (various R libraries including GenomicAlignments and Rsamtools)
- Reporting of the editing outcomes in interactive reports organized into a website (rmarkdown::render_site)
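To make the indel-extraction step more concrete, here is a minimal Python sketch of how insertions and deletions can be read from the CIGAR string of an aligned read. It is only an illustration of the idea: the pipeline itself does this with R libraries, and the function name `indels_from_cigar` and the example CIGAR string are assumptions made for this example.

```python
import re

# One CIGAR operation is a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, ref_start):
    """Return (kind, ref_pos, length) for every insertion/deletion in a CIGAR.

    ref_start is the 1-based leftmost mapping position (the SAM POS field).
    For insertions, ref_pos is the reference base that follows the insertion.
    """
    events = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # Insertions consume read bases only, so ref_pos does not move.
            events.append(("insertion", ref_pos, length))
        elif op == "D":
            # Deletions consume reference bases only.
            events.append(("deletion", ref_pos, length))
            ref_pos += length
        elif op in ("M", "N", "=", "X"):
            # These operations advance along the reference.
            ref_pos += length
        # S, H and P do not consume reference bases.
    return events


# Hypothetical read aligned at position 100.
print(indels_from_cigar("10M2D5M3I4M", 100))
```

Counting such events per position across all reads that overlap a cut site gives the kind of per-site statistics that the reports summarise.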
The HTML reports produced by the pipeline are automatically organised as a website. An example report website can be browsed here.
You can find below some example screenshots from the HTML reports:
- Download the source code:
> git clone https://github.com/BIMSBbioinfo/crispr_DART.git
- Create a guix profile with dependencies
> mkdir -p $HOME/guix-profiles/crispr_dart
> guix package --manifest=guix.scm --profile=$HOME/guix-profiles/crispr_dart
# activate env
> source ~/guix-profiles/crispr_dart/etc/profile
- Test the installation on sample data
> snakemake -s snakefile.py --configfile sample_data/settings.yaml --cores 4 --printshellcmds
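If the test run fails early, one quick check is whether the tools are visible on the PATH after activating the profile. The sketch below assumes the executable names `snakemake`, `fastqc`, `multiqc`, `trim_galore`, `bbmap.sh` and `Rscript`; these names are assumptions based on the tools listed above, not something the pipeline documents.

```python
import shutil

# Assumed executable names for the tools named in the pipeline steps.
ASSUMED_TOOLS = ["snakemake", "fastqc", "multiqc", "trim_galore", "bbmap.sh", "Rscript"]


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All assumed tools were found on PATH.")
```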
The pipeline currently requires four different input files.
- A sample sheet file, which describes the samples, associated fastq files, the sets of sgRNAs used in the sample and the list of regions of interest.
Please see the example sample sheet file under sample_data/sample_sheet.csv.
- A BED file containing the genomic coordinates of all the sgRNAs used in this project.
Please see the example BED file for sgRNA target sites under sample_data/cut_sites.bed
- A comparisons table, which is used for comparing pairs of samples in terms of genome editing outcomes.
Please see the example table under sample_data/comparisons.tsv
- A settings file, which combines all the information from the other input files and additional configurations for resource requirements of tools.
Please see the example file under sample_data/settings.yaml
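As an illustration of the BED input, the following Python sketch reads sgRNA target sites from tab-separated BED text. It relies only on the standard BED layout (chromosome, 0-based start, exclusive end, then optional name, score and strand). The `CutSite` type, the `read_bed` function and the two example rows are made up for this sketch and are not taken from sample_data/cut_sites.bed.

```python
import csv
import io
from typing import NamedTuple


class CutSite(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # 0-based, exclusive
    name: str
    strand: str


def read_bed(handle):
    """Parse tab-separated BED lines into CutSite records."""
    sites = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        if len(row) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        start, end = int(row[1]), int(row[2])
        if start < 0 or end < start:
            raise ValueError(f"line {line_no}: invalid interval {start}-{end}")
        name = row[3] if len(row) > 3 else "."
        strand = row[5] if len(row) > 5 else "."
        sites.append(CutSite(row[0], start, end, name, strand))
    return sites


# Hypothetical example rows (coordinates are invented).
example = "chrI\t1000\t1020\tsg1\t0\t+\nchrII\t500\t520\tsg2\t0\t-\n"
for site in read_bed(io.StringIO(example)):
    print(site)
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `read_bed(open("sample_data/cut_sites.bed"))`.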
The sample_data/fasta folder contains FASTA-format sequence files that are used as the target genome sequence.
The sample_data/reads folder contains sample read files (fastq.gz files from Illumina and PacBio sequenced samples).
Once the settings.yaml file is configured with paths to all the other required files, the pipeline can be run with snakemake. The command below requests 4 cores:
> snakemake -s snakefile.py --configfile /path/to/settings.yaml --cores 4 --printshellcmds
If you would like to do a dry run, meaning that the list of jobs is created but not executed, add the --dryrun flag:
> snakemake -s snakefile.py --configfile /path/to/settings.yaml --cores 4 --dryrun --printshellcmds
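When the pipeline is launched from a script, the same commands can be assembled programmatically. The sketch below is one possible wrapper, not part of crispr-DART: the function names `snakemake_command` and `run_pipeline` are assumptions, and only the snakemake flags shown above are used.

```python
import shlex
import subprocess


def snakemake_command(settings_path, cores=4, dry_run=False, snakefile="snakefile.py"):
    """Build the snakemake argument list for a normal run or a dry run."""
    cmd = [
        "snakemake",
        "-s", snakefile,
        "--configfile", settings_path,
        "--cores", str(cores),
        "--printshellcmds",
    ]
    if dry_run:
        cmd.append("--dryrun")
    return cmd


def run_pipeline(settings_path, cores=4, dry_run=False):
    """Run snakemake and raise CalledProcessError if it exits with an error."""
    cmd = snakemake_command(settings_path, cores=cores, dry_run=dry_run)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)


# Show the dry-run command without executing it.
print(shlex.join(snakemake_command("/path/to/settings.yaml", dry_run=True)))
```

Calling `run_pipeline("/path/to/settings.yaml", dry_run=True)` first and repeating the call without `dry_run` once the job list looks right mirrors the two commands above.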
See the publication on Cell Reports
The software has been developed by Bora Uyar from the Akalin Lab with significant conceptual contributions by Jonathan Froehlich from the N.Rajewsky Lab at the Berlin Institute of Medical Systems Biology of the Max-Delbruck-Center for Molecular Medicine.