Samantha edited this page Jan 30, 2021 · 24 revisions

Welcome to the iCLIP wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the GitHub repository for iCLIP and loading Snakemake.

  1. Clone the GitHub repository to your local filesystem.
# Clone the repository from GitHub
git clone https://github.com/RBL-NCI/iCLIP.git

# Change your working directory to the iCLIP repo
cd iCLIP/
  2. Load Snakemake to your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1

Preparing Configs and Manifests

This pipeline requires three config files, which must be located in the /path/to/iCLIP/config directory. These files are:

  1. cluster_config.yml - this file contains the default cluster settings for analysis. It does not require edits unless processing requirements dictate it.
  2. snakemake_config.yaml - this file contains directory paths and user parameters for analysis:
    • source_dir: path to snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/'
    • out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
    • sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
    • fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • container_dir: path to docker containers, and other programs; example '/path/to/container/'
    • mismatch_allowance: number of nt mismatches allowed in barcodes; options [1,2]
    • split_value: integer number of sequences per split fastq file; recommend 3000 for small test files and 2000000 for study files; defaults to 1000000 if no value is given
    • novoalign_reference: selection of reference database ['hg38', 'mm10']
    • splice_aware: whether to run splice_aware part of the pipeline ['y', 'n']
    • splice_bp_length: length of splice index to use [50, 75, 150]
    • minimum_count: integer value, of the minimum number of matches to count as a peak [1,2,3]
    • nt_merge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
    • peak_id: report peaks for unique peaks only or unique and fractional mm ["unique","all"]
    • DE_method: choose DE method ["manorm","none"]
  3. index_config.yaml - this file will contain directory paths for index files that should follow the structure:
    • organism:
      • std: '/path/to/index/'
      • spliceaware:
        • valuebp1: '/path/to/index1/'
        • valuebp2: '/path/to/index2/'
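As a concrete illustration, the two user-edited configs might look like the following. All paths and values here are illustrative examples, not shipped defaults, and the spliceaware keys simply mirror the valuebp placeholders described above:

```yaml
# snakemake_config.yaml (illustrative values only)
source_dir: '/path/to/iCLIP/'
out_dir: '/path/to/output/'
sample_manifest: '/path/to/sample_manifest.tsv'
multiplex_manifest: '/path/to/multiplex_manifest.tsv'
fastq_dir: '/path/to/raw/fastq/files'
container_dir: '/path/to/container/'
mismatch_allowance: 1
split_value: 2000000
novoalign_reference: 'hg38'
splice_aware: 'y'
splice_bp_length: 75
minimum_count: 1
nt_merge: 50
peak_id: 'all'
DE_method: 'manorm'
---
# index_config.yaml (illustrative structure for one organism)
hg38:
  std: '/path/to/index/'
  spliceaware:
    75bp: '/path/to/index1/'
    150bp: '/path/to/index2/'
```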

There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file (#2) above. These files are:

  1. multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex sample ID

    • file_name: the full file name of the multiplexed fastq file, which must be unique; example: 'SIM_iCLIP_S1.fastq'
    • multiplex: the multiplex ID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
    An example multiplex_manifest.tsv file:
    
    file_name                 multiplex
    SIM_iCLIP_S1.fastq        SIM_iCLIP_S1
    SIM_iCLIP_S2.fastq        SIM_iCLIP_S2
    
  2. sample_manifest.tsv

    • multiplex: the multiplex ID associated with the fastq file; this column will not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
    • sample: the final sample name; this column must be unique. example: 'Ro_Clip'
    • barcode: the barcode identifying the multiplexed sample; this must be unique within each multiplex sample but may repeat between multiplex IDs. example: 'NNNTGGCNN'
    • adaptor: the adaptor sequence, to be removed from sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
    • group: groupings for samples, may or may not be unique values. example: 'CNTRL'
    An example sample_manifest.tsv file:
    
    multiplex       sample           group       barcode     adaptor
    SIM_iCLIP_S1    Ro_Clip          CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S1    Control_Clip     CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Ro_Clip2         CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Control_Clip2    CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    
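Because the two manifests cross-reference each other through the multiplex column, a quick consistency check can catch typos before a run. The sketch below is not part of the pipeline; it simply re-implements the two stated constraints (every multiplex ID in the sample manifest must exist in the multiplex manifest, and sample names must be unique) against inlined copies of the example tables above:

```python
# Hypothetical manifest sanity check (not part of the iCLIP pipeline).
import csv
import io

multiplex_tsv = """multiplex\tfile_name
SIM_iCLIP_S1\tSIM_iCLIP_S1.fastq
SIM_iCLIP_S2\tSIM_iCLIP_S2.fastq
"""

sample_tsv = """multiplex\tsample\tgroup\tbarcode\tadaptor
SIM_iCLIP_S1\tRo_Clip\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S1\tControl_Clip\tCNTRL\tNNNCGGANN\tAGATCGGAAGAGCGGTTCAG
"""

def check_manifests(multiplex_text, sample_text):
    """Return (unresolved multiplex IDs, duplicated sample names)."""
    multiplex_ids = {row["multiplex"]
                     for row in csv.DictReader(io.StringIO(multiplex_text),
                                               delimiter="\t")}
    samples = list(csv.DictReader(io.StringIO(sample_text), delimiter="\t"))
    # Every multiplex value in the sample manifest must resolve.
    missing = {s["multiplex"] for s in samples} - multiplex_ids
    # The sample column must be unique.
    names = [s["sample"] for s in samples]
    dupes = {n for n in names if names.count(n) > 1}
    return missing, dupes

missing, dupes = check_manifests(multiplex_tsv, sample_tsv)
```

For the example manifests both sets come back empty; a misspelled multiplex ID would appear in `missing` before Snakemake ever starts.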

Running Pipeline

  1. Dry-Run
sh run_snakemake.sh dry-run
  2. Execute pipeline on the cluster
sh run_snakemake.sh cluster
  3. Execute pipeline locally
sh run_snakemake.sh local
  4. Unlock directory (after a failed partial run)
sh run_snakemake.sh unlock

Expected Outputs

The following directories are created under the out_dir specified in snakemake_config.yaml:

  • log: slurm output files, copies of config and manifest files
  • qc: MultiQC report, QC Troubleshooting report for all samples
  • 01_remove_adaptor: zipped, fastq files with adaptors removed
  • 02_unzip: unzipped fastq files, with adaptors removed
  • 03_split: unzipped fastq files, split into smaller files to increase processing speed
  • if splice_aware:
    • 04_sam_splice: intermediate split sam files, aligned to reference
    • 04_sam_genomic: converted transcriptome coordinates to genomic coordinates sam files, zipped
  • 04_sam: split sam files, aligned to reference
  • 05_reads: header, unique and multi-mapped text files
  • 06_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
  • 06_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
  • 07_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
  • 08_bam_merged: merged sorted, indexed unique and multi-mapped bam files
  • 09_dedup_bam: sorted and indexed deduplicated merged bam file; logs and headers
  • 10_dedup_split: deduplicated bam files split back into unique files; unsorted and indexed
  • 11_bed: unique bed files
  • 12_SAF: unique peak SAF annotation files
  • 13_counts: feature counts for all and unique reads
  • multiplexid: one subdirectory per multiplex ID, containing:
    • 00_qc_pre: fastqc reports for each sample (summarized in /qc/multiqc_report.html)
    • 00_qc_post: barcode statistics for each sample (summarized in /qc/qc_report.html)
    • 01_renamed: demultiplexed files, renamed to match the sample ID
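A small helper like the following (hypothetical, not part of the pipeline) can confirm after a run that the always-created top-level directories listed above exist; the splice_aware and per-multiplex directories are omitted because their presence depends on configuration:

```python
# Hypothetical post-run check for the fixed top-level output directories
# named in the list above (not part of the iCLIP pipeline).
from pathlib import Path

EXPECTED = ["log", "qc", "01_remove_adaptor", "02_unzip", "03_split",
            "04_sam", "05_reads", "06_bam_unique", "06_bam_mm",
            "07_bam_merged_splits", "08_bam_merged", "09_dedup_bam",
            "10_dedup_split", "11_bed", "12_SAF", "13_counts"]

def missing_outputs(out_dir):
    """Return the expected subdirectories that are absent under out_dir."""
    out = Path(out_dir)
    return [d for d in EXPECTED if not (out / d).is_dir()]
```

Running `missing_outputs('/path/to/output/')` after a successful run should return an empty list; any names it reports point at rules worth checking in the logs.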

Troubleshooting

  • Check your email for a notification regarding pipeline failure
  • Review the logs to determine which rule failed (logs are named by Snakemake rule):
cd /path/to/output/dir/log
  • Review /qc/qc_report.html to determine whether poor performance was related to barcode mismatching or alignment
  • Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute pipeline (Step 2 or 3 in Running Pipeline)