Samantha edited this page Jan 30, 2021 · 24 revisions

Welcome to the iCLIP wiki!

Getting Started

You'll need to prepare your local filesystem by cloning the GitHub repository for iCLIP and loading Snakemake.

  1. Clone the GitHub repository to your local filesystem.
# Clone the repository from GitHub
git clone https://github.com/RBL-NCI/iCLIP.git

# Change your working directory to the iCLIP repo
cd iCLIP/
  2. Load Snakemake to your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1

Preparing Configs and Manifests

This pipeline requires three config files, which must be located in the /path/to/iCLIP/config directory. These files are:

  1. cluster_config.yml - this file contains the default cluster settings for analysis. It does not require edits unless processing requirements dictate it.
  2. snakemake_config.yaml - this file contains directory paths and user parameters for analysis:
    • source_dir: path to snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/'
    • out_dir: path to created output directory, where output will be stored; example: '/path/to/output/'
    • sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
    • multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
    • fastq_dir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
    • container_dir: path to docker containers, and other programs; example '/path/to/container/'
    • mismatch_allowance: number of nt mismatches allowed in barcodes; options [1,2]
    • split_value: integer number of sequences per split fastq file; recommend 3000 for small test files and 2000000 for study files; defaults to 1000000 if no value is given
    • novoalign_reference: selection of reference database ['hg38', 'mm10']
    • splice_aware: whether to run splice_aware part of the pipeline ['y', 'n']
    • splice_bp_length: length of splice index to use [50, 75, 150]
    • minimum_count: integer value, of the minimum number of matches to count as a peak [1,2,3]
    • nt_merge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
    • peak_id: report peaks for unique peaks only or unique and fractional mm ["unique","all"]
    • DE_method: choose DE method ["manorm","none"]
  3. index_config.yaml - this file will contain directory paths for index files that should follow the structure:
    • organism:
      • std: '/path/to/index/'
      • spliceaware:
        • valuebp1: '/path/to/index1/'
        • valuebp2: '/path/to/index2/'
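As a concrete illustration, the two user-edited configs might look like the following. All paths and values here are illustrative examples, not shipped defaults, and the spliceaware keys simply mirror the valuebp placeholders described above:

```yaml
# snakemake_config.yaml (illustrative values only)
source_dir: '/path/to/iCLIP/'
out_dir: '/path/to/output/'
sample_manifest: '/path/to/sample_manifest.tsv'
multiplex_manifest: '/path/to/multiplex_manifest.tsv'
fastq_dir: '/path/to/raw/fastq/files'
container_dir: '/path/to/container/'
mismatch_allowance: 1
split_value: 2000000
novoalign_reference: 'hg38'
splice_aware: 'y'
splice_bp_length: 75
minimum_count: 1
nt_merge: 50
peak_id: 'all'
DE_method: 'manorm'
---
# index_config.yaml (illustrative structure for one organism)
hg38:
  std: '/path/to/index/'
  spliceaware:
    75bp: '/path/to/index1/'
    150bp: '/path/to/index2/'
```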

There are two manifest requirements for this pipeline, with paths identified in the snakemake_config.yaml file (#2) above. These files are:

  1. multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex sample ID

    • file_name: the full file name of the multiplexed fastq file, which must be unique; example: 'SIM_iCLIP_S1.fastq'
    • multiplex: the multiplex ID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
    An example multiplex_manifest.tsv file:
    
    file_name                 multiplex
    SIM_iCLIP_S1.fastq        SIM_iCLIP_S1
    SIM_iCLIP_S2.fastq        SIM_iCLIP_S2
    
  2. sample_manifest.tsv

    • multiplex: the multiplex ID associated with the fastq file; this column will not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
    • sample: the final sample name; this column must be unique. example: 'Ro_Clip'
    • barcode: the barcode identifying the multiplexed sample; this must be unique within each multiplex sample but may repeat between multiplex IDs. example: 'NNNTGGCNN'
    • adaptor: the adaptor sequence, to be removed from sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
    • group: groupings for samples, may or may not be unique values. example: 'CNTRL'
    An example sample_manifest.tsv file:
    
    multiplex       sample           group       barcode     adaptor
    SIM_iCLIP_S1    Ro_Clip          CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S1    Control_Clip     CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Ro_Clip2         CLIP        NNNTGGCNN   AGATCGGAAGAGCGGTTCAG
    SIM_iCLIP_S2    Control_Clip2    CNTRL       NNNCGGANN   AGATCGGAAGAGCGGTTCAG
    
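Because the two manifests cross-reference each other through the multiplex column, a quick consistency check can catch typos before a run. The sketch below is not part of the pipeline; it simply re-implements the two stated constraints (every multiplex ID in the sample manifest must exist in the multiplex manifest, and sample names must be unique) against inlined copies of the example tables above:

```python
# Hypothetical manifest sanity check (not part of the iCLIP pipeline).
import csv
import io

multiplex_tsv = """multiplex\tfile_name
SIM_iCLIP_S1\tSIM_iCLIP_S1.fastq
SIM_iCLIP_S2\tSIM_iCLIP_S2.fastq
"""

sample_tsv = """multiplex\tsample\tgroup\tbarcode\tadaptor
SIM_iCLIP_S1\tRo_Clip\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG
SIM_iCLIP_S1\tControl_Clip\tCNTRL\tNNNCGGANN\tAGATCGGAAGAGCGGTTCAG
"""

def check_manifests(multiplex_text, sample_text):
    """Return (unresolved multiplex IDs, duplicated sample names)."""
    multiplex_ids = {row["multiplex"]
                     for row in csv.DictReader(io.StringIO(multiplex_text),
                                               delimiter="\t")}
    samples = list(csv.DictReader(io.StringIO(sample_text), delimiter="\t"))
    # Every multiplex value in the sample manifest must resolve.
    missing = {s["multiplex"] for s in samples} - multiplex_ids
    # The sample column must be unique.
    names = [s["sample"] for s in samples]
    dupes = {n for n in names if names.count(n) > 1}
    return missing, dupes

missing, dupes = check_manifests(multiplex_tsv, sample_tsv)
```

For the example manifests both sets come back empty; a misspelled multiplex ID would appear in `missing` before Snakemake ever starts.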

Running Pipeline

  1. Dry-Run
sh run_snakemake.sh dry-run
  2. Execute pipeline on the cluster
sh run_snakemake.sh cluster
  3. Execute pipeline locally
sh run_snakemake.sh local
  4. Unlock directory (after a failed partial run)
sh run_snakemake.sh unlock

Expected Outputs

The following directories are created under the out_dir specified in snakemake_config.yaml:

  • log: slurm output files, copies of config and manifest files
  • qc: MultiQC report, QC Troubleshooting report for all samples
  • 01_remove_adaptor: zipped, fastq files with adaptors removed
  • 02_unzip: unzipped fastq files, with adaptors removed
  • 03_split: unzipped fastq files, split into smaller files to increase processing speed
  • if splice_aware:
    • 04_sam_splice: intermediate split sam files, aligned to reference
    • 04_sam_genomic: converted transcriptome coordinates to genomic coordinates sam files, zipped
  • 04_sam: split sam files, aligned to reference
  • 05_reads: header, unique and multi-mapped text files
  • 06_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
  • 06_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
  • 07_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
  • 08_bam_merged: merged sorted, indexed unique and multi-mapped bam files
  • 09_dedup_bam: sorted and indexed deduplicated merged bam file; logs and headers
  • 10_dedup_split: deduplicated bam files split back into unique files; unsorted and indexed
  • 11_bed: unique bed files
  • 12_SAF: unique peak SAF annotation files
  • 13_counts: feature counts for all and unique reads
  • multiplexid: one subdirectory per multiplex ID, containing:
    • 00_qc_pre: fastqc reports for each sample (summarized in /qc/multiqc_report.html)
    • 00_qc_post: barcode statistics for each sample (summarized in /qc/qc_report.html)
    • 01_renamed: demultiplexed files, renamed to match the sample ID
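A small helper like the following (hypothetical, not part of the pipeline) can confirm after a run that the always-created top-level directories listed above exist; the splice_aware and per-multiplex directories are omitted because their presence depends on configuration:

```python
# Hypothetical post-run check for the fixed top-level output directories
# named in the list above (not part of the iCLIP pipeline).
from pathlib import Path

EXPECTED = ["log", "qc", "01_remove_adaptor", "02_unzip", "03_split",
            "04_sam", "05_reads", "06_bam_unique", "06_bam_mm",
            "07_bam_merged_splits", "08_bam_merged", "09_dedup_bam",
            "10_dedup_split", "11_bed", "12_SAF", "13_counts"]

def missing_outputs(out_dir):
    """Return the expected subdirectories that are absent under out_dir."""
    out = Path(out_dir)
    return [d for d in EXPECTED if not (out / d).is_dir()]
```

Running `missing_outputs('/path/to/output/')` after a successful run should return an empty list; any names it reports point at rules worth checking in the logs.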

Troubleshooting

  • Check your email for a notification regarding pipeline failure
  • Review the logs to determine which rule failed (logs are named by Snakemake rule):
cd /path/to/output/dir/log
  • Review /qc/qc_report.html to determine whether poor performance was related to barcode mismatching or alignment
  • Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute pipeline (Step 2 or 3 in Running Pipeline)