Home
Samantha edited this page Nov 20, 2020 · 24 revisions
Welcome to the iCLIP wiki!
You'll need to prepare your local filesystem by cloning the iCLIP GitHub repository and loading Snakemake.
- Clone the GitHub repository to your local filesystem.
# Clone Repository from Github
git clone https://github.com/RBL-NCI/iCLIP.git
# Change your working directory to the iCLIP repo
cd iCLIP/
- Load Snakemake into your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1
This pipeline requires four input files, which must be placed in the iCLIP/config directory:
- cluster_config.yml - contains the default cluster resource settings for analysis. This file does not require edits unless processing requirements dictate it.
- snakemake_config.yaml - contains directory paths and user parameters for analysis:
  - source_dir: path to the Snakemake file within the cloned iCLIP repository; example: '/path/to/iCLIP/workflow'
  - out_dir: path to the output directory, where results will be stored; example: '/path/to/output/'
  - multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
  - sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
  - fastq_dir: path to the gzipped, multiplexed fastq files; example: '/path/to/raw/fastq/files'
  - novoalign_reference: selection of reference database ['hg38', 'mm10']; example: 'mm10'
  - minimum_count: integer value giving the minimum peak count threshold; example: 2
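Putting the parameters above together, a filled-in snakemake_config.yaml might look like the following sketch (every path is a placeholder for your own filesystem):

```yaml
# Sketch of snakemake_config.yaml; all paths are placeholders.
source_dir: '/path/to/iCLIP/workflow'                  # workflow dir inside the cloned repo
out_dir: '/path/to/output/'                            # results are written here
multiplex_manifest: '/path/to/multiplex_manifest.tsv'  # maps fastq files to multiplex IDs
sample_manifest: '/path/to/sample_manifest.tsv'        # maps samples to multiplex IDs
fastq_dir: '/path/to/raw/fastq/files'                  # gzipped, multiplexed fastqs
novoalign_reference: 'mm10'                            # one of ['hg38', 'mm10']
minimum_count: 2                                       # minimum peak count threshold
```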
- multiplex_manifest.tsv - this manifest maps each fastq file to its multiplex sample ID
  - file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
  - multiplex: the multiplex ID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'
An example multiplex_manifest.tsv file:

| file_name | multiplex |
|---|---|
| SIM_iCLIP_S1.fastq | SIM_iCLIP_S1 |
| SIM_iCLIP_S2.fastq | SIM_iCLIP_S2 |
- sample_manifest.tsv - this manifest maps each demultiplexed sample to its multiplex ID
  - multiplex: the multiplex ID associated with the fastq file; will not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
  - sample: the final sample name; this column must be unique. example: 'Ro_Clip'
  - barcode: the barcode identifying the multiplexed sample; this must be unique within each multiplex sample name but can repeat between multiplex IDs. example: 'NNNTGGCNN'
  - adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
  - group: grouping for samples; may or may not be unique. example: 'CNTRL'
An example sample_manifest.tsv file:

| multiplex | sample | group | barcode | adaptor |
|---|---|---|---|---|
| SIM_iCLIP_S1 | Ro_Clip | CLIP | NNNTGGCNN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S1 | Control_Clip | CNTRL | NNNCGGANN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S2 | Ro_Clip2 | CLIP | NNNTGGCNN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S2 | Control_Clip2 | CNTRL | NNNCGGANN | AGATCGGAAGAGCGGTTCAG |
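Because the multiplex columns of the two manifests must agree, a quick consistency check before launching can save a failed run. The sketch below is not part of the pipeline; it writes abbreviated versions of the example manifests above, then compares their multiplex columns (point the cut/comm commands at your real manifests instead):

```shell
# Not part of the pipeline: check that every multiplex ID in
# sample_manifest.tsv also appears in multiplex_manifest.tsv.
# These two files reproduce (abbreviated) wiki examples.
printf 'file_name\tmultiplex\nSIM_iCLIP_S1.fastq\tSIM_iCLIP_S1\nSIM_iCLIP_S2.fastq\tSIM_iCLIP_S2\n' > multiplex_manifest.tsv
printf 'multiplex\tsample\tgroup\tbarcode\tadaptor\nSIM_iCLIP_S1\tRo_Clip\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG\nSIM_iCLIP_S2\tRo_Clip2\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG\n' > sample_manifest.tsv

# Unique multiplex IDs from each manifest (tail skips the header line).
cut -f2 multiplex_manifest.tsv | tail -n +2 | sort -u > multiplex_ids.txt
cut -f1 sample_manifest.tsv | tail -n +2 | sort -u > sample_ids.txt

# comm -13 prints IDs that occur only in the sample manifest;
# empty output means the manifests agree.
comm -13 multiplex_ids.txt sample_ids.txt
```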
- Dry-Run
sh run_snakemake.sh dry-run
- Execute the pipeline
# Submit to the cluster (check run_snakemake.sh for the available run modes)
sh run_snakemake.sh cluster
- Unlock directory (partial run)
sh run_snakemake.sh unlock
The following directories are created under the output directory (out_dir):
- log: slurm output files
- 01_renamed: demultiplexed files, renamed to match sampleid
- 02_adaptor: sampleid files with adaptors removed
- 03_unzip: unzipped sampleid files, with adaptors removed
- 04_split: unzipped sampleid files, split into smaller files to increase processing speed
- 05_sam: split sam files, aligned to reference
- 06_reads: header, unique and multi-mapped text files
- 07_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
- 07_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
- 08_bam_merged_splits: merged, sorted, and indexed splits of unique and multi-mapped bam files
- 09_bam_merged: merged, sorted, and indexed unique and multi-mapped bam files
- 10_dedup_bam: unsorted, sorted, and indexed deduplicated merged bam files
- 11_dedup_split: unsorted, sorted, and indexed deduplicated bam files, split into unique and multi-mapped files
- 12_bed: unique and multi-mapped bed files
- 13_peaks: txt and SAF peak files
- 14_gff: GTF and GFF3 peak files
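After a run completes, you can spot-check that each subdirectory listed above was created. This helper is a sketch, not part of the pipeline; the directory names come from the list above and the output path is a placeholder:

```shell
# Sketch: report any expected output subdirectory that is missing.
# Usage: check_outputs /path/to/output   (path is a placeholder)
check_outputs() {
    out_dir=$1
    for d in log 01_renamed 02_adaptor 03_unzip 04_split 05_sam 06_reads \
             07_bam_unique 07_bam_mm 08_bam_merged_splits 09_bam_merged \
             10_dedup_bam 11_dedup_split 12_bed 13_peaks 14_gff; do
        # Print a line for each directory from the list above that is absent;
        # no output means every expected directory exists.
        [ -d "$out_dir/$d" ] || echo "missing: $out_dir/$d"
    done
}
```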
If the pipeline fails:
- Check your email for a message stating that the pipeline failed
- Review the logs to determine which rule failed (logs are named by Snakemake rule)
cd /path/to/output/dir/log
- Address the error, unlock the directory (Step 3 in Running Pipeline), and execute the pipeline again (Step 2 in Running Pipeline)
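Since the logs are named after Snakemake rules, a case-insensitive grep over the log directory is often the fastest way to find the failing rule. A small helper sketch (the log directory path and file contents are assumptions; adjust to your setup):

```shell
# Sketch: list log files whose contents mention an error (case-insensitive).
# Because logs are named by Snakemake rule, the matching file names point
# at the rule that failed.
find_failed_rules() {
    log_dir=$1
    # -r: recurse, -i: ignore case, -l: print matching file names only
    grep -ril "error" "$log_dir"
}
```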