Home
Welcome to the iCLIP wiki!
The iCLIP GitHub repository is stored locally and is used for project deployment. Multiple projects can be deployed from this single location simultaneously without conflict.
- Change working directory to the iCLIP repository:

```
cd /data/RBL_NCI/Pipelines/iCLIP/[version number]
```

- Load Snakemake into your environment:

```
# running snakemake>=5.19 is recommended
module load snakemake
```
The run_snakemake.sh wrapper has multiple options:

```
Usage: run_snakemake.sh -p pipeline -o output_dir
  -p options: initialize, checks, dry-run, cluster, local, unlock, git, DAG, report
  -o path to output directory
```
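The flag handling above can be sketched with `getopts` (a minimal illustration of the interface only, not the actual contents of run_snakemake.sh):

```shell
# Sketch of the -p/-o interface; the real run_snakemake.sh may parse differently.
parse_args() {
  local OPTIND opt pipeline="" output_dir=""
  while getopts "p:o:" opt; do
    case $opt in
      p) pipeline="$OPTARG" ;;    # phase: initialize, checks, dry-run, cluster, ...
      o) output_dir="$OPTARG" ;;  # project output directory
      *) echo "Usage: run_snakemake.sh -p pipeline -o output_dir" >&2; return 1 ;;
    esac
  done
  echo "phase=$pipeline outdir=$output_dir"
}
```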
Example invocations:

```
# Initialize pipeline
sh run_snakemake.sh -p initialize -o /path/to/output/dir

# Check manifests
sh run_snakemake.sh -p check -o /path/to/output/dir

# Dry-run
sh run_snakemake.sh -p dry -o /path/to/output/dir

# Execute pipeline on the cluster
sh run_snakemake.sh -p cluster -o /path/to/output/dir

# Execute pipeline locally
sh run_snakemake.sh -p local -o /path/to/output/dir

# Unlock directory (after a failed partial run)
sh run_snakemake.sh -p unlock -o /path/to/output/dir

# GitHub Actions test
sh run_snakemake.sh -p git -o /path/to/output/dir

# Create report
sh run_snakemake.sh -p report -o /path/to/output/dir

# Create DAG of pipeline
sh run_snakemake.sh -p DAG -o /path/to/output/dir
```
Explanation of pre-processing steps:
- initialize (required): This must be performed before any Snakemake run (dry, local, cluster). It copies the config files, manifests, and Snakefiles needed to run the pipeline into the provided output directory.
- checks (optional): An optional step, to be performed before any Snakemake run (dry, local, cluster). It checks for errors in the snakemake_config files and in your input manifests; any errors are printed to the command line or written to a text file in your output directory.
- dry-run (optional): An optional step, to be performed before any Snakemake run (local, cluster). It checks for errors within the pipeline and ensures that you have read/write access to the files needed to run the full pipeline.
Explanation of processing steps:
- local - This will run the pipeline on a local node. NOTE: This should only be performed on an interactive node.
- cluster - This will submit a master job to the cluster, and subsequent sub-jobs as needed to complete the workflow. An email will be sent when the pipeline begins, if there are any errors, and when it completes.
Explanation of other steps:
- unlock - This will unlock the pipeline if an error caused it to stop in the middle of a run.
- git - This is used only for GitHub Actions testing.
- DAG - This will produce a DAG of the workflow and its dependencies, saved to the /output/dir/log directory.
- report - This will produce a report generated from the Snakemake statistics of your pipeline run, saved to the /output/dir/log directory.
There are three required config files for this pipeline, found in the /output/dir/config directory after initialization. These files are:

- cluster_config.yml - this file contains the default cluster configuration settings for analysis. It does not require edits, unless processing requirements dictate it.
- snakemake_config.yaml - this file contains directory paths and user parameters for analysis:
- sourceDir: path to repo; this does not need to be changed if running local version; example: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
- outputDir: path to created output directory, where output will be stored; example: '/path/to/output/'
- sampleManifest: path to sample manifest (see specific details below); example: '/path/to/sample_manifest.tsv'
- multiplexManifest: path to multiplex manifest (see specific details below); example: '/path/to/multiplex_manifest.tsv'
- contrastManifest: path to contrast manifest (see specific details below); example: '/path/to/contrast_manifest.tsv'
- fastqDir: path to gzipped multiplexed fastq files; example: '/path/to/raw/fastq/files'
- reference: selection of reference database ['hg38', 'mm10']
- filterlength: minimum read length to include in analysis [any int >20]
- spliceaware: whether to run splice_aware part of the pipeline ['y', 'n']
- includerRNA: whether to include RefSeq rRNAs in annotations ["Y", "N"]
- splice_bp_length: length of splice index to use [50, 75, 150]
- multiplexflag: whether samples are multiplexed ["Y","N"]
- mismatch: number of bp mismatches allowed in demultiplexing [1,2,3]
- mincount: integer value, of the minimum number of matches to count as a peak [1,2,3]
- ntmerge: minimum distance of nucleotides to merge peaks [10,20,30,40,50,60]
- peakid: report peaks for unique reads only, or for both unique and fractional multi-mapped reads ["unique","all"]
- DEmethod: choose DE method ["manorm","none"]
- splicejunction: whether to include splice junctions in peak calls ["Y","N"]
- condenseexon: whether to collapse exons ["Y","N"]
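Putting the parameters above together, a filled-in snakemake_config.yaml might look like this (illustrative placeholder values only; the copy created in /output/dir/config at initialization is the authoritative template):

```yaml
sourceDir: '/data/RBL_NCI/Pipelines/iCLIP/v2.0'
outputDir: '/path/to/output/'
sampleManifest: '/path/to/output/manifest/sample_manifest.tsv'
multiplexManifest: '/path/to/output/manifest/multiplex_manifest.tsv'
contrastManifest: '/path/to/output/manifest/contrast_manifest.tsv'
fastqDir: '/path/to/raw/fastq/files'
reference: 'hg38'        # ['hg38', 'mm10']
filterlength: 25         # any int >20
spliceaware: 'y'         # ['y', 'n']
includerRNA: 'N'         # ["Y", "N"]
splice_bp_length: 75     # [50, 75, 150]
multiplexflag: 'Y'       # ["Y", "N"]
mismatch: 1              # [1, 2, 3]
mincount: 1              # [1, 2, 3]
ntmerge: 50              # [10, 20, 30, 40, 50, 60]
peakid: 'all'            # ["unique", "all"]
DEmethod: 'manorm'       # ["manorm", "none"]
splicejunction: 'Y'
condenseexon: 'Y'
```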
- index_config.yaml - this file contains directory paths for index files. It does not require edits, unless processing requirements dictate it.
  - organism:
    - std: '/path/to/index/'
    - spliceaware:
      - valuebp1: '/path/to/index1/'
      - valuebp2: '/path/to/index2/'
  - organism:
There are two required manifests and one optional manifest for this pipeline. The paths to these files are defined in the snakemake_config.yaml file. Example files are placed in the /output/dir/manifest directory after initialization. You can edit these example files or create your own. These files are:
- multiplexManifest (required) - this manifest maps each multiplexed fastq file to its multiplex ID
- file_name: the full file name of the multiplexed sample, which must be unique; example: 'test_1.fastq.gz'
- multiplex: the multiplexID associated with the fastq file, which must be unique. These names must match the multiplex column of the sampleManifest. example: 'test_1'

An example multiplex_manifest.tsv file:

| file_name | multiplex |
| --- | --- |
| test_1.fastq.gz | test_1 |
| test_2.fastq.gz | test_2 |
- samplesManifest (required)
- multiplex: the multiplexID associated with the fastq file; this will not be unique. These names must match the multiplex column of the multiplex_manifest.tsv file. example: 'SIM_iCLIP_S1'
- sample: the final sample name; this column must be unique. example: 'Ro_Clip'
- barcode: the barcode used to identify the multiplexed sample; this must be unique within each multiplex sample but can repeat across multiplexIDs. example: 'NNNTGGCNN'
- adaptor: the adaptor sequence to be removed from the sample; this may or may not be unique. example: 'AGATCGGAAGAGCGGTTCAG'
- group: grouping for samples; values may or may not be unique. example: 'CNTRL'
An example sampleManifest file with multiplexing of one sample. Notice that the multiplexID test_1 is repeated, as Ro_Clip and Control_Clip are both found in the same fastq file, whereas test_2 is not multiplexed:

| multiplex | sample | group | barcode | adaptor |
| --- | --- | --- | --- | --- |
| test_1 | Ro_Clip | CLIP | NNNTGGCNN | AGATCGGAAGAGCGGTTCAG |
| test_1 | Control_Clip | CNTRL | NNNCGGANN | AGATCGGAAGAGCGGTTCAG |
| test_2 | Ro_Clip2 | CLIP | NNNCGTANN | AGATCGGAAGAGCGGTTCAG |
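The barcode and mismatch parameters work together during demultiplexing: a read is assigned to a sample when its leading bases match that sample's barcode within the allowed number of mismatches, with N positions treated as wildcards. A minimal sketch of that comparison (an illustration only, not the pipeline's actual demultiplexer):

```shell
# Count mismatches between a barcode pattern and the start of a read.
# N in the pattern matches any base (illustrative; not the pipeline's code).
barcode_mismatches() {
  local pattern="$1" read="$2" i p b mism=0
  for ((i = 0; i < ${#pattern}; i++)); do
    p=${pattern:i:1}
    b=${read:i:1}
    [ "$p" = "N" ] && continue       # wildcard position
    [ "$p" = "$b" ] || mism=$((mism + 1))
  done
  echo "$mism"
}
```

With mismatch set to 1, a read whose prefix differs from 'NNNTGGCNN' at a single non-N position would still be assigned to that sample.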
- contrastManifest (optional; required when DEmethod is "manorm")
- contrast_1: the sample name, identified in the samplesManifest [sample] column, of the sample to compare. example: 'Ro_Clip'
- contrast_2: the sample name, identified in the samplesManifest [sample] column, of the background to remove. example: 'Control_Clip'
An example contrastManifest file:

```
contrast_1,contrast_2
Ro_Clip,Control_Clip
```
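Since every value in the sampleManifest multiplex column must also appear in multiplex_manifest.tsv, a quick cross-check before running the checks step can catch typos early. A hypothetical helper (not part of the pipeline), assuming tab-separated manifests with header rows, multiplex IDs in column 1 of the sample manifest and column 2 of the multiplex manifest:

```shell
# Print sample-manifest multiplex IDs missing from the multiplex manifest;
# empty output means the two manifests agree. (Hypothetical helper.)
check_multiplex_ids() {
  local sample_manifest="$1" multiplex_manifest="$2"
  comm -23 \
    <(tail -n +2 "$sample_manifest"    | cut -f1 | sort -u) \
    <(tail -n +2 "$multiplex_manifest" | cut -f2 | sort -u)
}
```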
The following directories are created under the output_directory:
- 01_remove_adaptor: zipped, fastq files with adaptors removed
- 02_unzip: unzipped fastq files, with adaptors removed
- 03_split: unzipped fastq files, split into smaller files to increase processing speed
- 04_sam:
- 01_alignment: intermediate split sam files, aligned to reference
- 02_cleanup (splice-aware runs only)
- 03_genomic (splice-aware runs only): sam files with transcriptome coordinates converted to genomic coordinates, zipped
- 04_unmapped
- 05_reads: header, unique and multi-mapped text files
- 06_bam_unique: unsorted, sorted, and indexed bam files generated from unique split sam files
- 06_bam_mm: unsorted, sorted, and indexed bam files generated from multi-mapped split sam files
- 07_bam_merged_splits: merged sorted, indexed splits of unique and multi-mapped bam files
- 08_bam_merged: merged sorted, indexed unique and multi-mapped bam files
- 09_dedup: sorted and indexed deduplicated merged bam file; logs and headers
- 10_bed: unique bed files
- 11_SAF: unique peak SAF annotation files
- 12_counts: feature counts for all and unique reads
- 13_annotation:
- 01_project
- 02_peaks
- [sample].html
- [sample].txt
- 14_MAnorm:
- input:
- [multiplexid]
- 00_qc_pre: fastqc reports for each sample (summarized in /qc/multiqc_report.html)
- 00_qc_post: barcode statistics for each sample (summarized in /qc/qc_report.html)
- 01_renamed: demultiplexed files, renamed to match sampleid
- qc:
- 00_qc_post
- 00_qc_screen_species
- multiqc_data
- qc_report.html
- 00_qc_screen_rrna
- manifest_clean.txt
- multiqc_report.html
- split_params.tsv
- log: slurm output files, copies of config and manifest files
- workflow: saved Snakefile(s) used in run(s)
- Check your email for the pipeline-failure notification
- Review the logs to determine which rule failed (logs are named by Snakemake rule):

```
cd /path/to/output/dir/log
```

- Review /qc/qc_report.html to determine whether poor performance was related to barcode mismatching or alignment
- Address the error, unlock the directory (Step 4 in Running Pipeline), and re-execute the pipeline (Step 2 or 3 in Running Pipeline)
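Finding the failing rule can be scripted; a minimal sketch (assumes, per the directory listing above, that slurm and rule logs sit under the run's log directory; log naming and contents vary):

```shell
# List log files under a run's log directory that mention an error
# (case-insensitive). Illustrative helper, not part of the pipeline.
find_failed_logs() {
  local log_dir="$1"
  grep -ril "error" "$log_dir" 2>/dev/null | sort
}
```

Example: `find_failed_logs /path/to/output/dir/log`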