Samantha edited this page Nov 20, 2020 · 24 revisions

Welcome to the iCLIP wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the github repository for iCLIP and loading Snakemake.

  1. Clone the github repository to your local filesystem.
# Clone Repository from Github
git clone https://github.com/RBL-NCI/iCLIP.git

# Change your working directory to the iCLIP repo
cd iCLIP/
  2. Load Snakemake into your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1

Preparing Configs and Manifests

This pipeline requires four input files, which must be located in the iCLIP/config directory. These files are:

  1. cluster_config.yml - this file contains the default cluster resource settings for analysis. This file does not require edits unless processing requirements dictate otherwise.

  2. snakemake_config.yaml - this file contains directory paths and user parameters for analysis:

    • source_dir: path to the snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/workflow'
    • out_dir: path to the output directory, where output will be stored; example: '/path/to/output/'
    • multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
    • sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • novoalign_reference: selection of reference database ['hg38', 'mm10']; example: 'mm10'
    • minimum_count: integer value for the minimum number of peaks to count; example: 2
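Taken together, a filled-in snakemake_config.yaml might look like the following sketch (every value is a placeholder drawn from the examples above; replace each with your own paths and settings):

```yaml
# Example snakemake_config.yaml -- all values are placeholders
source_dir: '/path/to/iCLIP/workflow'
out_dir: '/path/to/output/'
multiplex_manifest: '/path/to/multiplex_manifest.tsv'
sample_manifest: '/path/to/sample_manifest.tsv'
fastq_dir: '/path/to/raw/fastq/files'
novoalign_reference: 'mm10'
minimum_count: 2
```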
  3. multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex sample ID

    • file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
    • multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
    An example multiplex_manifest.tsv file:
    
    file_name                 multiplex
    SIM_iCLIP_S1.fastq        SIM_iCLIP_S1
    SIM_iCLIP_S2.fastq        SIM_iCLIP_S2
    
  4. sample_manifest.tsv

    • multiplex: the multiplexID associated with the fastq file; this column need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
    • sample: the final sample name; this column must be unique. example: 'Ro_Clip'
    • barcode: the barcode that identifies the multiplexed sample; this must be unique within each multiplex sample but may repeat across multiplexIDs. example: 'NNNTGGCNN'
    • adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
    • group: groupings for samples; these may or may not be unique values. example: 'CNTRL'
    An example sample_manifest.tsv file:
    
    multiplex       sample           group       barcode     adaptor
    SIM_iCLIP_S1    Ro_Clip          CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S1    Control_Clip     CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Ro_Clip2         CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Control_Clip2    CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
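Because the multiplex column must match between sample_manifest.tsv and multiplex_manifest.tsv, it can be worth checking the two files for agreement before launching a run. A minimal sketch as a shell function (the helper name is hypothetical):

```shell
# check_manifests SAMPLE_MANIFEST MULTIPLEX_MANIFEST
# Prints any multiplex ID that appears in only one of the two manifests;
# empty output means the manifests agree.
check_manifests() {
  # Extract the multiplex columns, drop the header row, and de-duplicate.
  cut -f1 "$1" | tail -n +2 | sort -u > "${TMPDIR:-/tmp}/sample_ids.txt"
  cut -f2 "$2" | tail -n +2 | sort -u > "${TMPDIR:-/tmp}/multiplex_ids.txt"
  # comm -3 suppresses lines common to both files, leaving only mismatches.
  comm -3 "${TMPDIR:-/tmp}/sample_ids.txt" "${TMPDIR:-/tmp}/multiplex_ids.txt"
}
```

For example, `check_manifests config/sample_manifest.tsv config/multiplex_manifest.tsv` should print nothing when the manifests are consistent.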
    

Running Pipeline

  1. Dry-Run
sh run_snakemake.sh dry-run
  2. Execute pipeline
# "cluster" is an assumption here; check run_snakemake.sh for its supported run modes
sh run_snakemake.sh cluster
  3. Unlock directory (partial run)
sh run_snakemake.sh unlock

Expected Outputs

The following directories are created under the output_directory:

  • log: slurm output files
  • 01_renamed: demultiplexed files, renamed to match sampleid
  • 02_adaptor: sampleid files with adaptors removed
  • 03_unzip: unzipped sampleid files, with adaptors removed
  • 04_split: unzipped sampleid files, split into smaller files to increase processing speed
  • 05_sam: split sam files, aligned to reference
  • 06_reads: header, unique and multi-mapped text files
  • 07_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
  • 07_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
  • 08_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
  • 09_bam_merged: merged sorted, indexed unique and multi-mapped bam files
  • 10_dedup_bam: unsorted, sorted, and indexed deduplicated merged bam file
  • 11_dedup_split: unsorted, sorted, and indexed deduplicated bam files, split into unique and multi-mapped files
  • 12_bed: unique and multi-mapped bed files
  • 13_peaks: txt and SAF peak files
  • 14_gff: GTF and GFF3 peak files
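Once a run finishes, the directory list above can be checked mechanically. A minimal sketch (the helper name is hypothetical; pass your out_dir as the argument):

```shell
# check_outputs OUT_DIR
# Prints each expected output directory that is missing under OUT_DIR;
# empty output means the full directory tree was created.
check_outputs() {
  out_dir="$1"
  for d in log 01_renamed 02_adaptor 03_unzip 04_split 05_sam 06_reads \
           07_bam_unique 07_bam_mm 08_bam_merged_splits 09_bam_merged \
           10_dedup_bam 11_dedup_split 12_bed 13_peaks 14_gff; do
    [ -d "$out_dir/$d" ] || echo "missing: $d"
  done
}
```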

Troubleshooting

  • Check your email for a notification that the pipeline failed
  • Review the logs to determine what rule failed (logs are named by Snakemake rule)
cd /path/to/output/dir/log
  • Address the error, unlock the directory (Step 3 in Running Pipeline), and re-execute the pipeline (Step 2 in Running Pipeline)
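Because the logs are named by Snakemake rule, grepping the log directory for errors usually points to the failing rule directly. A minimal sketch (the helper name is hypothetical; pass your log directory as the argument):

```shell
# find_failed_logs LOG_DIR
# Prints the names of log files that mention "error" (case-insensitive),
# which typically identifies the failing Snakemake rule.
find_failed_logs() {
  grep -ril "error" "$1"
}
```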