Samantha edited this page Nov 20, 2020 · 24 revisions

Welcome to the iCLIP wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the github repository for iCLIP and loading Snakemake.

  1. Clone the github repository to your local filesystem.
# Clone Repository from Github
git clone https://github.com/RBL-NCI/iCLIP.git

# Change your working directory to the iCLIP repo
cd iCLIP/
  2. Load Snakemake into your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1

Preparing Configs and Manifests

This pipeline requires four input files, which must be located in the iCLIP/config directory. These files are:

  1. cluster_config.yml - this file contains the default cluster resource settings for analysis. This file does not require edits unless processing requirements dictate otherwise.

  2. snakemake_config.yaml - this file contains directory paths and user parameters for analysis:

    • source_dir: path to the snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/workflow'
    • out_dir: path to the output directory, where output will be stored; example: '/path/to/output/'
    • multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
    • sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • novoalign_reference: selection of reference database ['hg38', 'mm10']; example: 'mm10'
    • minimum_count: integer value for the minimum number of peaks to count; example: 2
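Taken together, a filled-in snakemake_config.yaml might look like the following sketch (every value is a placeholder drawn from the examples above; replace each with your own paths and settings):

```yaml
# Example snakemake_config.yaml -- all values are placeholders
source_dir: '/path/to/iCLIP/workflow'
out_dir: '/path/to/output/'
multiplex_manifest: '/path/to/multiplex_manifest.tsv'
sample_manifest: '/path/to/sample_manifest.tsv'
fastq_dir: '/path/to/raw/fastq/files'
novoalign_reference: 'mm10'
minimum_count: 2
```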
  3. multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex sample ID

    • file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
    • multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
    An example multiplex_manifest.tsv file:
    
    file_name                 multiplex
    SIM_iCLIP_S1.fastq        SIM_iCLIP_S1
    SIM_iCLIP_S2.fastq        SIM_iCLIP_S2
    
  4. sample_manifest.tsv

    • multiplex: the multiplexID associated with the fastq file; this column need not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
    • sample: the final sample name; this column must be unique. example: 'Ro_Clip'
    • barcode: the barcode that identifies the multiplexed sample; this must be unique within each multiplex sample but may repeat across multiplexIDs. example: 'NNNTGGCNN'
    • adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
    • group: groupings for samples; these may or may not be unique values. example: 'CNTRL'
    An example sample_manifest.tsv file:
    
    multiplex       sample           group       barcode     adaptor
    SIM_iCLIP_S1    Ro_Clip          CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S1    Control_Clip     CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Ro_Clip2         CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Control_Clip2    CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
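Because the multiplex column must match between sample_manifest.tsv and multiplex_manifest.tsv, it can be worth checking the two files for agreement before launching a run. A minimal sketch as a shell function (the helper name is hypothetical):

```shell
# check_manifests SAMPLE_MANIFEST MULTIPLEX_MANIFEST
# Prints any multiplex ID that appears in only one of the two manifests;
# empty output means the manifests agree.
check_manifests() {
  # Extract the multiplex columns, drop the header row, and de-duplicate.
  cut -f1 "$1" | tail -n +2 | sort -u > "${TMPDIR:-/tmp}/sample_ids.txt"
  cut -f2 "$2" | tail -n +2 | sort -u > "${TMPDIR:-/tmp}/multiplex_ids.txt"
  # comm -3 suppresses lines common to both files, leaving only mismatches.
  comm -3 "${TMPDIR:-/tmp}/sample_ids.txt" "${TMPDIR:-/tmp}/multiplex_ids.txt"
}
```

For example, `check_manifests config/sample_manifest.tsv config/multiplex_manifest.tsv` should print nothing when the manifests are consistent.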
    

Running Pipeline

  1. Dry-Run
sh run_snakemake.sh dry-run
  2. Execute pipeline
# "cluster" is an assumption here; check run_snakemake.sh for its supported run modes
sh run_snakemake.sh cluster
  3. Unlock directory (partial run)
sh run_snakemake.sh unlock

Expected Outputs

The following directories are created under the output_directory:

  • log: slurm output files
  • 01_renamed: demultiplexed files, renamed to match sampleid
  • 02_adaptor: sampleid files with adaptors removed
  • 03_unzip: unzipped sampleid files, with adaptors removed
  • 04_split: unzipped sampleid files, split into smaller files to increase processing speed
  • 05_sam: split sam files, aligned to reference
  • 06_reads: header, unique and multi-mapped text files
  • 07_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
  • 07_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
  • 08_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
  • 09_bam_merged: merged sorted, indexed unique and multi-mapped bam files
  • 10_dedup_bam: unsorted, sorted, and indexed deduplicated merged bam file
  • 11_dedup_split: unsorted, sorted, and indexed deduplicated bam files, split into unique and multi-mapped files
  • 12_bed: unique and multi-mapped bed files
  • 13_peaks: txt and SAF peak files
  • 14_gff: GTF and GFF3 peak files
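Once a run finishes, the directory list above can be checked mechanically. A minimal sketch (the helper name is hypothetical; pass your out_dir as the argument):

```shell
# check_outputs OUT_DIR
# Prints each expected output directory that is missing under OUT_DIR;
# empty output means the full directory tree was created.
check_outputs() {
  out_dir="$1"
  for d in log 01_renamed 02_adaptor 03_unzip 04_split 05_sam 06_reads \
           07_bam_unique 07_bam_mm 08_bam_merged_splits 09_bam_merged \
           10_dedup_bam 11_dedup_split 12_bed 13_peaks 14_gff; do
    [ -d "$out_dir/$d" ] || echo "missing: $d"
  done
}
```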

Troubleshooting

  • Check your email for a notification that the pipeline failed
  • Review the logs to determine what rule failed (logs are named by Snakemake rule)
cd /path/to/output/dir/log
  • Address the error, unlock the directory (Step 3 in Running Pipeline), and re-execute the pipeline (Step 2 in Running Pipeline)
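Because the logs are named by Snakemake rule, grepping the log directory for errors usually points to the failing rule directly. A minimal sketch (the helper name is hypothetical; pass your log directory as the argument):

```shell
# find_failed_logs LOG_DIR
# Prints the names of log files that mention "error" (case-insensitive),
# which typically identifies the failing Snakemake rule.
find_failed_logs() {
  grep -ril "error" "$1"
}
```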