Home
Samantha edited this page Jan 30, 2021 · 24 revisions
Welcome to the iCLIP wiki!
You'll need to prepare your local filesystem by cloning the GitHub repository for iCLIP and loading Snakemake.
- Clone the GitHub repository to your local filesystem.
# Clone the repository from GitHub
git clone https://github.com/RBL-NCI/iCLIP.git
# Change your working directory to the iCLIP repo
cd iCLIP/
- Load Snakemake to your environment.
# Recommend running snakemake>=5.19
module load snakemake/5.24.1
This pipeline requires three config files, which must be located in the /path/to/iCLIP/config directory:
- cluster_config.yml - this file contains the default config settings for analysis. It does not require edits unless processing requirements dictate otherwise.
- snakemake_config.yaml - this file contains directory paths and user parameters for analysis:
  - source_dir: path to the snakemake file, within the cloned iCLIP repository; example: '/path/to/iCLIP/'
  - out_dir: path to the created output directory, where output will be stored; example: '/path/to/output/'
  - sample_manifest: path to the sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
  - multiplex_manifest: path to the multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
  - fastq_dir: path to the gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
  - container_dir: path to docker containers and other programs; example: '/path/to/container/'
  - mismatch_allowance: number of nt mismatches allowed in barcodes; options [1,2]
  - split_value: integer value indicating the number of sequences per file when splitting fastq files; recommend 3000 for small test files and 2000000 for study files; defaults to 1000000 if no value is given
  - novoalign_reference: selection of reference database ['hg38', 'mm10']
  - splice_aware: whether to run the splice-aware part of the pipeline ['y', 'n']
  - splice_bp_length: length of splice index to use [50, 75, 150]
  - minimum_count: integer value of the minimum number of matches to count as a peak [1,2,3]
  - nt_merge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
  - peak_id: report peaks for unique reads only, or for unique and fractional multi-mapped (mm) reads ["unique","all"]
  - DE_method: choose the DE method ["manorm","none"]
- index_config.yaml - this file contains directory paths for index files, and should follow the structure:
  - organism:
    - std: '/path/to/index/'
    - spliceaware:
      - valuebp1: '/path/to/index1/'
      - valuebp2: '/path/to/index2/'
  - organism:
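The allowed values listed for snakemake_config.yaml can be checked before launching a run. The following is a minimal sketch, not part of the iCLIP repository: the names `ALLOWED`, `PATH_KEYS`, and `check_config` are hypothetical, the config is assumed to be already loaded into a plain dictionary, and treating every parameter as required (even `splice_bp_length` when `splice_aware` is 'n') is an assumption.

```python
# Allowed values copied from the snakemake_config.yaml parameter list above.
ALLOWED = {
    "mismatch_allowance": [1, 2],
    "novoalign_reference": ["hg38", "mm10"],
    "splice_aware": ["y", "n"],
    "splice_bp_length": [50, 75, 150],
    "minimum_count": [1, 2, 3],
    "nt_merge": [10, 20, 30, 40, 50, 60],
    "peak_id": ["unique", "all"],
    "DE_method": ["manorm", "none"],
}
PATH_KEYS = [
    "source_dir", "out_dir", "sample_manifest",
    "multiplex_manifest", "fastq_dir", "container_dir",
]
DEFAULT_SPLIT_VALUE = 1000000  # documented default when split_value is not given


def check_config(config):
    """Return (copy of config with defaults filled in, list of problem strings)."""
    problems = []
    cfg = dict(config)
    for key in PATH_KEYS:
        if not cfg.get(key):
            problems.append(f"missing path: {key}")
    # Assumption: every listed parameter is required.
    for key, allowed in ALLOWED.items():
        if key not in cfg:
            problems.append(f"missing parameter: {key}")
        elif cfg[key] not in allowed:
            problems.append(f"{key}={cfg[key]!r} not in {allowed}")
    if cfg.get("split_value") is None:
        cfg["split_value"] = DEFAULT_SPLIT_VALUE
    elif not isinstance(cfg["split_value"], int) or cfg["split_value"] <= 0:
        problems.append("split_value must be a positive integer")
    return cfg, problems


# Example values; the paths are the placeholders used in this page.
example = {
    "source_dir": "/path/to/iCLIP/",
    "out_dir": "/path/to/output/",
    "sample_manifest": "/path/to/sample_manifest.tsv",
    "multiplex_manifest": "/path/to/multiplex_manifest.tsv",
    "fastq_dir": "/path/to/raw/fastq/files",
    "container_dir": "/path/to/container/",
    "mismatch_allowance": 1,
    "novoalign_reference": "hg38",
    "splice_aware": "n",
    "splice_bp_length": 75,
    "minimum_count": 3,
    "nt_merge": 50,
    "peak_id": "all",
    "DE_method": "none",
}
cfg, problems = check_config(example)
for problem in problems:
    print(problem)
```

An empty `problems` list means every value falls within the documented options; `split_value` is filled with the documented default when it is absent.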
This pipeline requires two manifest files, whose paths are set in the snakemake_config.yaml file (#2 above):
- multiplex_manifest.tsv - this manifest maps each multiplexed fastq file to its multiplex ID
  - file_name: the full file name of the multiplexed sample, which must be unique; example: 'SIM_iCLIP_S1.fastq'
  - multiplex: the multiplex ID associated with the fastq file, which must be unique. These names must match the multiplex column of the sample_manifest.tsv file. example: 'SIM_iCLIP_S1'

An example multiplex_manifest.tsv file (columns are tab-separated in the file):

| file_name | multiplex |
| --- | --- |
| SIM_iCLIP_S1.fastq | SIM_iCLIP_S1 |
| SIM_iCLIP_S2.fastq | SIM_iCLIP_S2 |

- sample_manifest.tsv - this manifest maps each multiplex ID to its samples, barcodes, adaptors, and groups
  - multiplex: the multiplex ID associated with the fastq file; values will not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
  - sample: the final sample name; this column must be unique. example: 'Ro_Clip'
  - barcode: the barcode that identifies the multiplexed sample; it must be unique within each multiplex ID but can repeat between multiplex IDs. example: 'NNNTGGCNN'
  - adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
  - group: groupings for samples; values may or may not be unique. example: 'CNTRL'

An example sample_manifest.tsv file (columns are tab-separated in the file):

| multiplex | sample | group | barcode | adaptor |
| --- | --- | --- | --- | --- |
| SIM_iCLIP_S1 | Ro_Clip | CLIP | NNNTGGCNN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S1 | Control_Clip | CNTRL | NNNCGGANN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S2 | Ro_Clip2 | CLIP | NNNTGGCNN | AGATCGGAAGAGCGGTTCAG |
| SIM_iCLIP_S2 | Control_Clip2 | CNTRL | NNNCGGANN | AGATCGGAAGAGCGGTTCAG |

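The uniqueness and matching rules for the two manifests can be checked with a short script before running the pipeline. This is a minimal sketch under stated assumptions, not code from the iCLIP repository: `read_tsv`, `duplicates`, and `check_manifests` are hypothetical helper names, and reading "must match" as requiring every multiplex ID to appear in both manifests is an interpretation. The example data is the content of the two example manifests above.

```python
import csv
import io
from collections import Counter


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def duplicates(values):
    """Return the sorted values that occur more than once."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


def check_manifests(multiplex_rows, sample_rows):
    """Return a list of problem strings; an empty list means the rules hold."""
    problems = []
    # file_name and multiplex must each be unique in the multiplex manifest.
    for column in ("file_name", "multiplex"):
        for value in duplicates(r[column] for r in multiplex_rows):
            problems.append(f"multiplex_manifest: duplicate {column} {value!r}")
    # sample must be unique in the sample manifest.
    for value in duplicates(r["sample"] for r in sample_rows):
        problems.append(f"sample_manifest: duplicate sample {value!r}")
    # barcode must be unique within each multiplex ID.
    pairs = ((r["multiplex"], r["barcode"]) for r in sample_rows)
    for mp, bc in duplicates(pairs):
        problems.append(f"sample_manifest: barcode {bc!r} repeated within {mp!r}")
    # Assumption: multiplex IDs must match in both directions.
    known = {r["multiplex"] for r in multiplex_rows}
    used = {r["multiplex"] for r in sample_rows}
    for mp in sorted(used - known):
        problems.append(f"sample_manifest: multiplex {mp!r} not in multiplex_manifest")
    for mp in sorted(known - used):
        problems.append(f"multiplex_manifest: multiplex {mp!r} has no samples")
    return problems


MULTIPLEX_TSV = (
    "file_name\tmultiplex\n"
    "SIM_iCLIP_S1.fastq\tSIM_iCLIP_S1\n"
    "SIM_iCLIP_S2.fastq\tSIM_iCLIP_S2\n"
)
SAMPLE_TSV = (
    "multiplex\tsample\tgroup\tbarcode\tadaptor\n"
    "SIM_iCLIP_S1\tRo_Clip\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG\n"
    "SIM_iCLIP_S1\tControl_Clip\tCNTRL\tNNNCGGANN\tAGATCGGAAGAGCGGTTCAG\n"
    "SIM_iCLIP_S2\tRo_Clip2\tCLIP\tNNNTGGCNN\tAGATCGGAAGAGCGGTTCAG\n"
    "SIM_iCLIP_S2\tControl_Clip2\tCNTRL\tNNNCGGANN\tAGATCGGAAGAGCGGTTCAG\n"
)

problems = check_manifests(read_tsv(MULTIPLEX_TSV), read_tsv(SAMPLE_TSV))
for problem in problems:
    print(problem)
```

To check real files, the same functions can be applied to the text read from the paths given as `multiplex_manifest` and `sample_manifest` in snakemake_config.yaml.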
- Dry-Run
sh run_snakemake.sh dry-run
- Execute pipeline on the cluster
sh run_snakemake.sh cluster
- Execute pipeline locally
sh run_snakemake.sh local
- Unlock directory (after failed partial run)
sh run_snakemake.sh unlock
The following directories are created under the output directory (out_dir):
- log: slurm output files, copies of config and manifest files
- qc: MultiQC report, QC Troubleshooting report for all samples
- 01_remove_adaptor: zipped, fastq files with adaptors removed
- 02_unzip: unzipped fastq files, with adaptors removed
- 03_split: unzipped fastq files, split into smaller files to increase processing speed
- if splice_aware:
- 04_sam_splice: intermediate split sam files, aligned to reference
- 04_sam_genomic: converted transcriptome coordinates to genomic coordinates sam files, zipped
- 04_sam: split sam files, aligned to reference
- 05_reads: header, unique and multi-mapped text files
- 06_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
- 06_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
- 07_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
- 08_bam_merged: merged sorted, indexed unique and multi-mapped bam files
- 09_dedup_bam: sorted and indexed deduplicated merged bam file; logs and headers
- 10_dedup_split: unsorted and indexed split deduplicated into unique files
- 11_bed: unique bed files
- 12_SAF: unique peak SAF annotation files
- 13_counts: feature counts for all and unique reads
- multiplexid
  - 00_qc_pre: fastqc reports for each sample (summarized in /qc/multiqc_report.html)
  - 00_qc_post: barcode statistics for each sample (summarized in /qc/qc_report.html)
  - 01_renamed: demultiplexed files, renamed to match sampleid
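After a run, the directory list above can be used to confirm that each step left its output directory behind. The sketch below is illustrative only: `COMMON_DIRS`, `SPLICE_DIRS`, and `missing_output_dirs` are hypothetical names, and it assumes that 04_sam_splice and 04_sam_genomic appear only when splice_aware is 'y' while 04_sam is always created. The per-multiplex directories are not checked.

```python
from pathlib import Path

# Directory names copied from the list above.
COMMON_DIRS = [
    "log", "qc", "01_remove_adaptor", "02_unzip", "03_split", "04_sam",
    "05_reads", "06_bam_unique", "06_bam_mm", "07_bam_merged_splits",
    "08_bam_merged", "09_dedup_bam", "10_dedup_split", "11_bed",
    "12_SAF", "13_counts",
]
# Assumption: these two exist only for splice-aware runs.
SPLICE_DIRS = ["04_sam_splice", "04_sam_genomic"]


def missing_output_dirs(out_dir, splice_aware=False):
    """Return the expected directory names that are absent under out_dir."""
    expected = COMMON_DIRS + (SPLICE_DIRS if splice_aware else [])
    root = Path(out_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # '/path/to/output/' is the placeholder out_dir used in this page.
    for name in missing_output_dirs("/path/to/output/", splice_aware=False):
        print(f"missing: {name}")
```

The first missing directory in pipeline order is usually a reasonable place to start looking in the logs.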
- Check your email for a notification of pipeline failure
- Review the logs to determine which rule failed (logs are named by Snakemake rule)
- Review /qc/qc_report.html to determine whether poor performance was related to barcode mismatching or alignment
cd /path/to/output/dir/log
- Address the error, unlock the directory (sh run_snakemake.sh unlock), and re-execute the pipeline (sh run_snakemake.sh cluster or sh run_snakemake.sh local)
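When the log directory holds many files, a quick scan can narrow down which rule to look at first. This is a rough sketch, not a pipeline feature: `find_error_lines` is a hypothetical helper, and it assumes the logs are plain text and that failures show up as lines containing the word "error", which may not hold for every tool the pipeline calls.

```python
from pathlib import Path


def find_error_lines(log_dir, keyword="error"):
    """Map each log file name to its lines containing keyword (case-insensitive)."""
    hits = {}
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has stray bytes.
        lines = path.read_text(errors="replace").splitlines()
        matched = [line for line in lines if keyword.lower() in line.lower()]
        if matched:
            hits[path.name] = matched
    return hits


if __name__ == "__main__":
    # '/path/to/output/dir/log' is the placeholder log path used above.
    for name, lines in find_error_lines("/path/to/output/dir/log").items():
        print(name)
        for line in lines:
            print("   ", line)
```

Because logs are named by Snakemake rule, the file names returned point to the rules worth reviewing.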